Article

An advanced variable selection method based on information gain and Fisher criterion reselection iteration for multivariate calibration

Journal

Publisher

ELSEVIER
DOI: 10.1016/j.chemolab.2023.104796

Keywords

Variable selection; Information gain; Fisher criterion; Multivariate calibration

This paper proposes a variable selection method, IFRI, based on information gain and Fisher's criterion. It improves the interpretability of the selected variables by choosing those that are strongly associated with the property of interest. IFRI does not depend on PLS model parameters, which makes it more interpretable, and its selections are reproducible for the same dataset. Combined with other modeling methods, it can be applied to variable selection for different types of high-dimensional data.
High-dimensional data are difficult to analyze, which makes dimensionality reduction essential; variable selection is widely used for this purpose and to improve model interpretability. This paper proposes a variable selection method based on information gain and Fisher's criterion reselection iteration (IFRI). It uses two classical feature selection functions to select variables that are strongly associated with the property of interest, which improves the interpretability of the selected variables. First, information gain is used to pre-select global feature variables with a strong ability to distinguish between classes. Fisher's criterion is then applied to re-select, from these global variables, the key variables that yield larger inter-class variance and smaller intra-class variance for the samples. Pearson correlation coefficients are then used to eliminate variables with strong linear correlation, and a series of subsets is obtained iteratively. Finally, each subset is evaluated by cross-validation (CV), and the one with the lowest root mean square error of cross-validation (RMSECV) is considered optimal. Combined with partial least squares (PLS), IFRI was applied to the analysis of nicotine and total sugar content in tobacco samples. Comparison with two representative variable selection algorithms, CARS and MC-UVE, showed IFRI to be an effective and applicable method for building high-performance models from a few key variables. Unlike most methods, IFRI does not depend on PLS model parameters during variable selection, which makes it more interpretable. Moreover, the variables selected by IFRI are reproducible for the same dataset. In principle, the method is not limited to spectral data: combined with other modeling methods, it can be applied to variable selection for other types of high-dimensional data. For example, it could be combined with qualitative discriminant models to classify tobacco, or used to select the critical chemical components that affect tobacco quality.
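
The three filtering steps named in the abstract (information gain pre-selection, Fisher criterion re-selection, Pearson correlation elimination) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that class labels are already available for each sample (for a continuous property such as nicotine content, some binning into classes would be needed, and the abstract does not say how this is done), that information gain is computed after equal-width binning of each variable, and that a fixed correlation threshold is used. The function names, the parameters `n_pre`, `n_keep`, `r_max`, `n_bins`, and the toy data are all hypothetical. The iterative subset generation and the RMSECV-based choice with PLS are not shown.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, n_bins=2):
    """Entropy reduction of the labels after equal-width binning of one variable."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant variable
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    conditional = 0.0
    for b in set(bins):
        subset = [y for y, vb in zip(labels, bins) if vb == b]
        conditional += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - conditional

def fisher_score(values, labels):
    """Between-class scatter divided by within-class scatter for one variable."""
    mean_all = sum(values) / len(values)
    between = within = 0.0
    for c in set(labels):
        group = [v for v, y in zip(values, labels) if y == c]
        m = sum(group) / len(group)
        between += len(group) * (m - mean_all) ** 2
        within += sum((v - m) ** 2 for v in group)
    return between / within if within > 0 else float("inf")

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb) if sa > 0 and sb > 0 else 0.0

def select_variables(columns, labels, n_pre, n_keep, r_max=0.95):
    """Return indices of selected variables; columns[j] holds variable j for all samples."""
    # Step 1: pre-select the n_pre variables with the highest information gain.
    ig = {j: information_gain(col, labels) for j, col in enumerate(columns)}
    pre = sorted(ig, key=ig.get, reverse=True)[:n_pre]
    # Step 2: rank the pre-selected variables by Fisher score.
    ranked = sorted(pre, key=lambda j: fisher_score(columns[j], labels), reverse=True)
    # Step 3: walk down the ranking, skipping variables that are strongly
    # linearly correlated with one already kept.
    kept = []
    for j in ranked:
        if all(abs(pearson(columns[j], columns[k])) < r_max for k in kept):
            kept.append(j)
        if len(kept) == n_keep:
            break
    return kept

# Toy data (hypothetical): 8 samples, 2 classes, 4 variables.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0],  # separates the classes well
    [2.0, 2.2, 1.8, 2.0, 6.0, 6.2, 5.8, 6.0],  # exactly 2x the first variable
    [1.0, 3.0, 1.0, 3.0, 1.0, 3.0, 1.0, 3.0],  # unrelated to the classes
    [0.0, 1.0, 0.2, 0.8, 1.2, 2.0, 1.4, 1.8],  # weaker, partly overlapping
]
kept = select_variables(columns, labels, n_pre=3, n_keep=2)
print(kept)
```

In this toy case the uninformative variable is dropped at the information gain step, and only one of the two perfectly correlated variables survives the correlation filter. Because every step is a deterministic function of the data, repeated runs on the same dataset return the same indices, which is consistent with the reproducibility claim in the abstract.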
