Hi-LASSO: High-performance Python and Apache Spark packages for feature selection with high-dimensional data

High-dimensional LASSO (Hi-LASSO) is a powerful feature selection tool for high-dimensional data. Our previous study showed that Hi-LASSO outperformed other state-of-the-art LASSO methods. However, the substantial cost of bootstrapping and the lack of experiments on a parametric statistical test for feature selection have impeded the practical application of Hi-LASSO. In this paper, we present a Python package and its Spark library, both implemented in parallel for efficient use on real-world problems, which also provide parametric statistical tests for feature selection on high-dimensional data. We demonstrate Hi-LASSO’s superior performance through extensive experiments in practical settings. With these packages, Hi-LASSO can be run efficiently and easily for feature selection. The Hi-LASSO packages are publicly available at https://github.com/datax-lab/Hi-LASSO under the MIT license. They can be installed with pip, and additional documentation is available at https://pypi.org/project/hi-lasso and https://pypi.org/project/Hi-LASSO-spark.

Author summary

We briefly review Hi-LASSO as presented in the literature and compare it with Random LASSO. We then describe Hi-LASSO’s open-source packages in Python and Apache Spark and specify their parameters. The open-source packages improve the efficiency and scalability of Hi-LASSO, so that the time-consuming bootstrapping-based parametric statistical test can be applied in practice to high-dimensional data. We conducted extensive experiments to assess the performance of the packages with the parametric statistical test using simulation data, semi-real datasets, and a TCGA cancer dataset. The Hi-LASSO packages showed outstanding and robust performance in feature selection. The packages are available through PyPI and can be installed with pip.
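To make the idea of a bootstrapping-based parametric test for feature selection more concrete, the following is a minimal, self-contained sketch in plain Python. It is not the Hi-LASSO algorithm and does not use the Hi-LASSO packages' API. It is a generic illustration under stated assumptions: a plain LASSO is fitted by coordinate descent on bootstrap samples restricted to random feature subsets, the number of times each feature receives a non-zero coefficient is counted, and an upper-tail binomial test against the pooled selection rate yields a p-value. The function names, the fixed penalty `lam`, and the choice of null rate are all hypothetical.

```python
import math
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1.

    Assumes no intercept (data roughly centered); X is a list of rows.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the partial residual.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def bootstrap_selection_counts(X, y, lam, n_boot, q, rng):
    """Count, per feature, how often it was drawn and how often it was selected."""
    n, p = len(X), len(X[0])
    drawn, chosen = [0] * p, [0] * p
    for _ in range(n_boot):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of rows
        cols = rng.sample(range(p), q)               # random subset of q features
        Xb = [[X[i][j] for j in cols] for i in rows]
        yb = [y[i] for i in rows]
        for j, b in zip(cols, lasso_cd(Xb, yb, lam)):
            drawn[j] += 1
            if b != 0.0:
                chosen[j] += 1
    return drawn, chosen


def binom_sf(k, n, pi):
    """P(K >= k) for K ~ Binomial(n, pi), clamped to at most 1."""
    tail = sum(math.comb(n, i) * pi ** i * (1 - pi) ** (n - i) for i in range(k, n + 1))
    return min(1.0, tail)


def selection_p_values(drawn, chosen):
    """Binomial test of each feature's selection count against the pooled rate
    (an assumed null; the real packages may define it differently)."""
    pi = sum(chosen) / sum(drawn)
    return [binom_sf(c, d, pi) for c, d in zip(chosen, drawn)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated data (hypothetical): only feature 0 carries signal.
    X = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(40)]
    y = [3.0 * row[0] + rng.gauss(0, 0.5) for row in X]
    drawn, chosen = bootstrap_selection_counts(X, y, lam=0.1, n_boot=30, q=4, rng=rng)
    for j, pv in enumerate(selection_p_values(drawn, chosen)):
        print(f"feature {j}: selected {chosen[j]}/{drawn[j]}, p = {pv:.3f}")
```

Each bootstrap fit in this sketch is independent of the others, which is why this kind of procedure lends itself to parallel execution. Going by the PyPI project name, the Python package would presumably be installed with `pip install hi-lasso`; its actual interface is documented at the links above and is not reproduced here.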

Keywords: lasso; feature selection; dimensional data; high dimensional; selection; python

Journal Title: PLOS ONE
Year Published: 2022
