Article

A comparative study between PCR, PLSR, and LW-PLS on the predictive performance at different data splitting ratios

Journal

CHEMICAL ENGINEERING COMMUNICATIONS
Volume 209, Issue 11, Pages 1439-1456

Publisher

TAYLOR & FRANCIS INC
DOI: 10.1080/00986445.2021.1957853

Keywords

Data splitting; locally weighted partial least square regression; partial least square regression; prediction; principal component regression; soft sensors

Funding

  1. Curtin University Malaysia


Principal component regression (PCR), partial least squares regression (PLSR), and locally weighted partial least squares (LW-PLS) models were compared for their predictive performance at different data splitting ratios. LW-PLS performed better in most case studies because it copes better with nonlinear data. Optimal splitting ratios were determined by evaluating the root mean squared error, coefficient of determination, and error of approximation (E-a) for five case studies. For the best model in each case study, allocating more than 70% of the data to training gave major improvements in predictive performance relative to the base scenarios, which had the largest E-a values.
Principal component regression (PCR), partial least squares regression (PLSR), and locally weighted partial least squares (LW-PLS) are supervised learning methods in which a labeled dataset is used to train the model. These models are normally trained with split-sample validation, in which a dataset is split into training and testing subsets to develop and evaluate the model. However, few studies have evaluated the prediction performance of PCR, PLSR, and LW-PLS models at different data splitting ratios. To address this research gap, this work investigates the predictive performance of these regression models at different split-sample ratios. It also determines the optimal splitting ratios for PCR, PLSR, and LW-PLS models via a simple data splitting method in which a minimum of 50% of the entire dataset is allocated to training. The optimal split is determined by evaluating the root mean squared error, coefficient of determination, and error of approximation (E-a) for five case studies. Among the three models, LW-PLS performed better in most of the case studies because it copes better with nonlinear data. For the best model in each case study, split-sample ratios above 70% training data gave major improvements in predictive performance compared with the base scenarios, which have the largest E-a values.
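
The split-sample procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a one-variable ordinary least squares line stands in for PCR, PLSR, and LW-PLS, the data are synthetic, and the helper names (fit_line, evaluate_splits) are hypothetical. Only the root mean squared error and coefficient of determination are computed; the error of approximation (E-a) is omitted because its formula is not given here. Training fractions start at 50%, matching the minimum stated in the abstract.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (stand-in model, not PCR/PLSR/LW-PLS)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def evaluate_splits(xs, ys, train_fractions, seed=0):
    """Return {train_fraction: (test RMSE, test R^2)} using one shuffled ordering."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    results = {}
    for frac in train_fractions:
        n_train = int(round(frac * len(order)))
        train, test = order[:n_train], order[n_train:]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [intercept + slope * xs[i] for i in test]
        truth = [ys[i] for i in test]
        results[frac] = (rmse(truth, preds), r_squared(truth, preds))
    return results


# Synthetic data (assumption for illustration): a noisy linear relation.
rng = random.Random(42)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 0.3) for x in xs]

results = evaluate_splits(xs, ys, [0.5, 0.6, 0.7, 0.8, 0.9])
for frac, (err, r2) in results.items():
    print(f"train={frac:.0%}  RMSE={err:.3f}  R2={r2:.3f}")
```

In a real comparison, the stand-in line fit would be replaced by each regression model in turn, and the ratio with the best test metrics would be taken as that model's optimal split.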

