
SPlit: An Optimal Method for Data Splitting

TECHNOMETRICS (2022)

Journal

TECHNOMETRICS

Volume 64, Issue 2, Pages 166-176

Publisher

TAYLOR & FRANCIS INC

DOI: 10.1080/00401706.2021.1921037

Keywords

Cross-validation; Quasi-Monte Carlo; Testing; Training; Validation

Category

Statistics & Probability

Funding

U.S. National Science Foundation [CBET-1921873]

Summary

In this article, an optimal method named SPlit for splitting a dataset into training and testing sets is proposed, which is based on the support points algorithm and can be applied to both regression and classification problems. The implementation on real datasets shows substantial improvement compared to the commonly used random splitting procedure.

Abstract

In this article, we propose an optimal method referred to as SPlit for splitting a dataset into training and testing sets. SPlit is based on the method of support points (SP), which was initially developed for finding the optimal representative points of a continuous distribution. We adapt SP for subsampling from a dataset using a sequential nearest neighbor algorithm. We also extend SP to deal with categorical variables so that SPlit can be applied to both regression and classification problems. The implementation of SPlit on real datasets shows substantial improvement in the worst-case testing performance for several modeling methods compared to the commonly used random splitting procedure.
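The two steps named in the abstract (finding support points, then mapping them to dataset rows with a sequential nearest neighbor search) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes continuous, standardized features, uses a simple majorize-minimize fixed-point update to reduce the energy distance between the points and the data, and uses a brute-force nearest neighbor search. All function names (`pairwise_dist`, `energy_distance`, `support_points`, `sequential_nearest_neighbor`), the iteration count, and the synthetic data are hypothetical choices made for illustration.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def energy_distance(points, data):
    """Energy distance between the empirical distributions of points and data."""
    return (2.0 * pairwise_dist(points, data).mean()
            - pairwise_dist(points, points).mean()
            - pairwise_dist(data, data).mean())


def support_points(data, n, n_iter=100, seed=0):
    """Approximate n support points of data (assumed update rule, see lead-in)."""
    rng = np.random.default_rng(seed)
    N, p = data.shape
    # Start from a jittered random subsample so no point coincides with a row.
    idx = rng.choice(N, size=n, replace=False)
    x = data[idx] + 1e-3 * rng.standard_normal((n, p))
    for _ in range(n_iter):
        # Attraction weights: inverse distance from each point to each data row.
        w = 1.0 / np.maximum(pairwise_dist(x, data), 1e-12)
        # Repulsion: unit vectors pointing away from the other points.
        d_xx = pairwise_dist(x, x)
        np.fill_diagonal(d_xx, np.inf)  # a point does not repel itself
        diff = x[:, None, :] - x[None, :, :]
        repulse = (diff / np.maximum(d_xx, 1e-12)[:, :, None]).sum(axis=1)
        x = (w @ data + (N / n) * repulse) / w.sum(axis=1, keepdims=True)
    return x


def sequential_nearest_neighbor(points, data):
    """Assign each point, in order, to its nearest not-yet-used data row."""
    available = np.ones(len(data), dtype=bool)
    chosen = []
    for pt in points:
        d = np.linalg.norm(data - pt, axis=1)
        d[~available] = np.inf  # rows already taken cannot be chosen again
        j = int(np.argmin(d))
        chosen.append(j)
        available[j] = False
    return np.array(chosen)


# Usage on synthetic data: hold out 30 of 300 rows as the testing set.
rng = np.random.default_rng(1)
data = rng.standard_normal((300, 2))
data = (data - data.mean(axis=0)) / data.std(axis=0)  # standardize columns

sp = support_points(data, n=30)
test_idx = sequential_nearest_neighbor(sp, data)
train_idx = np.setdiff1d(np.arange(len(data)), test_idx)
```

Because rows are removed from the pool once chosen, the testing set contains no repeated rows, and the remaining rows form the training set. The categorical-variable extension mentioned in the abstract is not covered by this sketch.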
