Proceedings Paper

Iterative Random vs. Kennard-Stone Sampling for IR Spectrum-Based Classification Task Using PLS2-DA

Journal

2017 UKM FST POSTGRADUATE COLLOQUIUM
Volume 1940

Publisher

AMER INST PHYSICS
DOI: 10.1063/1.5028031

Keywords

Forensic science; Kennard-Stone sampling; random sampling; PLS-DA; IR spectrum

Funding

  1. Universiti Kebangsaan Malaysia (UKM)
  2. Ministry of Higher Education Malaysia (MOHE) [FRGS/2/2013/ST06/UKM/02/1]

External testing (ET) is preferred over auto-prediction (AP) or k-fold cross-validation for obtaining a more realistic estimate of the predictive ability of a statistical model. With IR spectra, the Kennard-Stone (KS) sampling algorithm is often used to split the data into training and test sets, used respectively for model construction and model testing. Iterative random sampling (IRS), on the other hand, has not been the favored choice, although it is theoretically more likely to produce a reliable estimate. The aim of this preliminary work is to compare the performance of KS and IRS in sampling a representative training set from an attenuated total reflectance Fourier transform infrared spectral dataset (of four varieties of blue gel pen inks) for PLS2-DA modeling. The 'best' performance achievable from the dataset is estimated with AP on the full dataset (AP(F, error)). Both IRS (n = 200) and KS were used to split the dataset in the ratio of 7:3. The classic decision rule (i.e. maximum value-based) is employed for new sample prediction via partial least squares-discriminant analysis (PLS2-DA). The error rate of each model was estimated repeatedly via: (a) AP on the full data (AP(F, error)); (b) AP on the training set (AP(S, error)); and (c) ET on the respective test set (ET(S, error)). A good PLS2-DA model is expected to produce AP(S, error) and ET(S, error) that are similar to AP(F, error). Bearing that in mind, the similarities between (a) AP(S, error) vs. AP(F, error); (b) ET(S, error) vs. AP(F, error); and (c) AP(S, error) vs. ET(S, error) were evaluated using correlation tests (i.e. Pearson's and Spearman's rank tests) on series of PLS2-DA models computed from the KS-set and IRS-set, respectively. Overall, models constructed from the IRS-set exhibit more similarity between the internal and external error rates than those from the respective KS-set, i.e. less risk of overfitting. In conclusion, IRS is more reliable than KS in sampling a representative training set.
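
The two sampling strategies compared in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (kennard_stone_split, iterative_random_splits, max_value_rule), the use of Euclidean distance, the unstratified random permutation, and the fixed seed are all assumptions made for illustration. The abstract only states that KS and IRS (n = 200) split the data 7:3 and that predictions use a maximum value-based rule.

```python
import numpy as np


def kennard_stone_split(X, n_train):
    """Pick n_train rows of X by the Kennard-Stone max-min rule.

    Assumes Euclidean distance on the raw spectra (an illustrative choice).
    Returns (train_idx, test_idx) as integer arrays.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of samples")
    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Seed the training set with the two most distant samples.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_train:
        # Distance from each candidate to its nearest selected sample.
        min_d = dist[np.ix_(remaining, selected)].min(axis=1)
        # Add the candidate that is farthest from the selected set.
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return np.array(selected, dtype=int), np.array(remaining, dtype=int)


def iterative_random_splits(n_samples, n_train, n_iter=200, seed=0):
    """Yield n_iter independent random (train_idx, test_idx) splits.

    Unstratified sampling and the seed value are assumptions.
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_iter):
        perm = rng.permutation(n_samples)
        yield perm[:n_train], perm[n_train:]


def max_value_rule(Y_pred):
    """Assign each sample to the class column with the largest predicted value."""
    return np.argmax(np.asarray(Y_pred), axis=1)
```

With a 7:3 ratio, n_train would be roughly 0.7 times the number of spectra. KS yields a single deterministic split, whereas IRS yields a series of splits, so each of the 200 IRS iterations gives its own pair of internal and external error rates.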
