Article

A variable selection method based on mutual information and variance inflation factor

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.saa.2021.120652

Keywords

Mutual information; Variance inflation factor; Variable selection; Spectrum

Funding

  1. Priority Academic Program Development of Jiangsu Higher Education Institutions [PAPD-2018-87]

Feature selection plays a vital role in reducing dimensionality in the quantitative analysis of high-dimensional data. This paper proposes a variable selection method called Mutual Information-Variance Inflation Factor (MI-VIF) that combines mutual information (MI) and the variance inflation factor (VIF). By maximizing the correlation between the independent variable and the response variable and minimizing multicollinearity, MI-VIF achieves effective feature selection.
Feature selection plays a vital role in reducing dimensionality in the quantitative analysis of high-dimensional data. Recently, variable selection methods based on mutual information (MI) have attracted increasing attention; they maximize the relevance between each candidate variable and the response while minimizing redundancy among the selected variables. However, multicollinearity is often a serious problem in linear models: it can cause unstable parameter estimates, unreliable models, and weak predictive ability. To address this problem, the variance inflation factor (VIF) is introduced into feature selection, and this paper proposes a variable selection method that combines MI with VIF, called Mutual Information-Variance Inflation Factor (MI-VIF). The MI between each independent variable and the response variable is calculated, and variables with greater MI are selected to maximize the correlation between the independent variables and the response. The VIF among the independent variables is then calculated as a multicollinearity test, and variables that cause multicollinearity in the model are eliminated to minimize collinearity among the independent variables. The proposed method was tested on two high-dimensional spectral datasets. Regression models (PLSR, MLR) were built on features selected by MI-VIF and by MI-based methods (MIFS, MMIFS), and their prediction accuracy was compared. On both datasets, MI-VIF showed good prediction performance. On the tea dataset, the MI-VIF-MLR model achieved an Rp² of 0.8612 and an RMSEP of 0.4096, and the MI-VIF-PLSR model achieved an Rp² of 0.8614 and an RMSEP of 0.4092. On the diesel fuels dataset, the MI-VIF-MLR model achieved an Rp² of 0.9707 and an RMSEP of 0.6568, and the MI-VIF-PLSR model achieved an Rp² of 0.9431 and an RMSEP of 0.9675. In addition, MI-VIF was compared with the successive projections algorithm (SPA), a method for reducing collinearity between variables in near-infrared wavelength selection. MI-VIF also predicted well relative to SPA. These results indicate that MI-VIF is an effective variable selection method. (C) 2021 Published by Elsevier B.V.
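
The abstract describes the procedure only in outline, so the following is a minimal sketch of one way the MI ranking and the VIF test could be combined, not the authors' implementation. Several choices here are assumptions: the histogram MI estimator with 8 bins, the greedy accept-or-reject loop, the VIF cutoff of 10 (a common rule of thumb), and the function names `mutual_information`, `vif`, and `mi_vif_select`, which are hypothetical. The data are synthetic, not the tea or diesel fuels spectra.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram (plug-in) estimate of MI between two 1-D arrays, in nats.
    Assumption: the abstract does not say which MI estimator is used."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def vif(X):
    """VIF of each column: 1 / (1 - R^2) from regressing it on the others.
    Assumes no column is constant."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        ss_tot = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = 1.0 - (resid @ resid) / ss_tot
        out[j] = 1.0 / max(1.0 - r2, 1e-12)  # guard against R^2 == 1
    return out

def mi_vif_select(X, y, vif_threshold=10.0, max_features=None):
    """Greedy sketch: visit variables in order of decreasing MI with y and
    keep one only if the largest VIF of the kept set stays under the cutoff."""
    mi = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
    selected = []
    for j in np.argsort(mi)[::-1]:
        trial = selected + [int(j)]
        if vif(X[:, trial]).max() <= vif_threshold:
            selected = trial
        if max_features is not None and len(selected) >= max_features:
            break
    return selected

# Synthetic demonstration: columns 0 and 1 are near-duplicates,
# column 2 is independently informative, column 3 is pure noise.
rng = np.random.default_rng(0)
n = 2000
a, b, c = rng.normal(size=(3, n))
X = np.column_stack([a, a + 0.01 * rng.normal(size=n), b, c])
y = 2.0 * a + 1.5 * b + 0.1 * rng.normal(size=n)

selected = mi_vif_select(X, y, max_features=2)
print("selected columns:", selected)
```

In this toy setup the two near-duplicate columns both rank highly on MI, but the VIF test rejects whichever one arrives second, so the informative column 2 is picked instead. The selected columns could then be passed to a regression model such as MLR or PLSR, as the abstract describes.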
