HSPXY: A hybrid‐correlation and diversity‐distances based data partition method

A representative dataset is crucial for building a robust, generalizable machine learning model, especially when the database is small. Distance‐based set partition methods usually do not consider correlation, so distant yet correlated samples may be assigned incorrectly. An improved sample subset partition method based on joint hybrid correlation and diversity x‐y distances (HSPXY) is proposed within the framework of sample set partitioning based on joint x‐y distances (SPXY). In HSPXY, a hybrid distance combining the cosine angle distance and the Euclidean distance in the variable spaces incorporates sample correlation into the distance‐based partition. For comparison with existing partition methods, partial least squares (PLS) regression models are built on sets partitioned by four methods: random sampling (RS), Kennard‐Stone (KS), SPXY, and HSPXY. When applied to small chemical databases, the HSPXY‐based models achieved smaller root mean square errors and better coefficients of determination than models built with the other tested partition methods, which indicates that the training set is well represented. This suggests the proposed algorithm provides a new option for obtaining a representative calibration set.

Sample subset partition is widely considered in machine learning modeling. An improved sample subset partition method based on a hybrid correlation and diversity x‐y distance (HSPXY) is proposed in the framework of SPXY. Cosine angle distance and Euclidean distance in the variable spaces represent the correlation and diversity of samples, respectively. To explore the effectiveness of HSPXY, PLS models are built on sets partitioned by four methods: RS, KS, SPXY, and HSPXY. The models based on HSPXY gave the best overall results among all regression models, which suggests the proposed algorithm may serve as an alternative to existing data partition methods.
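
The abstract does not give the exact form of the hybrid distance, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that the cosine angle distance is computed as one minus cosine similarity, that each distance matrix is scaled by its maximum (as in SPXY), and that the three terms are summed with equal weight; the selection step is the usual Kennard‐Stone max‐min rule. The function names `hybrid_xy_distance` and `kennard_stone_select` are hypothetical.

```python
import numpy as np

def pairwise_euclidean(A):
    # Euclidean distance between every pair of rows of A.
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def pairwise_cosine_distance(A):
    # Assumption: "cosine angle distance" is taken as 1 - cosine similarity.
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    unit = A / np.where(norms == 0, 1.0, norms)
    D = 1.0 - np.clip(unit @ unit.T, -1.0, 1.0)
    np.fill_diagonal(D, 0.0)
    return D

def hybrid_xy_distance(X, y):
    # Assumption: equal-weight sum of max-normalized distance matrices
    # (Euclidean in x, cosine in x, Euclidean in y).
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)
    parts = [pairwise_euclidean(X), pairwise_cosine_distance(X), pairwise_euclidean(y)]
    total = np.zeros((len(X), len(X)))
    for D in parts:
        m = D.max()
        if m > 0:
            total += D / m
    return total

def kennard_stone_select(D, n_select):
    # Max-min selection on a precomputed distance matrix D.
    if not 2 <= n_select <= len(D):
        raise ValueError("n_select must be between 2 and the number of samples")
    # Start from the two most distant samples.
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(D)) if k not in selected]
    while len(selected) < n_select:
        # For each candidate, distance to its nearest already-selected sample.
        min_dist = D[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_dist))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining
```

Calling `kennard_stone_select(hybrid_xy_distance(X, y), n_select)` returns the indices of the calibration samples and of the held‐out samples. Dropping the cosine term from `parts` gives an SPXY‐style distance under the same normalization assumption, which makes the two partitions easy to compare on the same data.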

Keywords: HSPXY; set partition; partition methods; distance; correlation

Journal Title: Journal of Chemometrics
Year Published: 2019
