Article

SlimML: Removing Non-Critical Input Data in Large-Scale Iterative Machine Learning

Journal

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Volume 33, Issue 5, Pages 2223-2236

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TKDE.2019.2951388

Keywords

Training; Data models; Support vector machines; Iterative algorithms; Artificial neural networks; Predictive models; Iterative machine learning; large input datasets; model parameter updating; MapReduce

Funding

  1. National Key Research and Development Plan of China [2018YFB1003701, 2018YFB1003700]
  2. National Natural Science Foundation of China [61872337]
  3. Swiss National Science Foundation (SNF) [NRP75, 407540_167266]

This article challenges the prevalent assumption in machine learning that all data points are equally relevant to model parameter updating, proposing a new SlimML framework that trains models only on critical data points to significantly improve training performance. Experimental results show that SlimML accelerates the model training process by an average of 3.61 times for large datasets, with only a 0.37% accuracy loss.
The core of many large-scale machine learning (ML) applications, such as neural networks (NN), support vector machines (SVM), and convolutional neural networks (CNN), is the training algorithm that iteratively updates model parameters by processing massive datasets. Across the plethora of studies aiming at accelerating ML, such as data parallelization and parameter servers, the prevalent assumption is that all data points are equally relevant to model parameter updating. In this article, we challenge this assumption by proposing a criterion to measure a data point's effect on model parameter updating, and we experimentally demonstrate that the majority of data points are non-critical in the training process. We develop a slim learning framework, termed SlimML, which trains ML models only on the critical data and thus significantly improves training performance. To this end, SlimML efficiently leverages a small number of aggregated data points per iteration to approximate the criticalness of the original input data instances. The proposed approach can be adopted by changing a few lines of code in a standard stochastic gradient descent (SGD) procedure, and we demonstrate experimentally, on NN regression, SVM classification, and CNN training, that for large datasets it accelerates the model training process by an average of 3.61 times while incurring accuracy losses of only 0.37 percent.
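
The abstract notes that the approach amounts to changing a few lines of a standard SGD procedure. Below is a minimal sketch, in Python with NumPy, of what such a criticality-filtered SGD loop might look like for linear least-squares regression. The function name slim_sgd, the parameters keep_ratio and n_agg, and the use of the gradient norm at averaged points as the criticalness score are all illustrative assumptions, not the paper's exact criterion or code.

import numpy as np

def slim_sgd(X, y, lr=0.01, epochs=10, keep_ratio=0.3, n_agg=64, seed=0):
    # Linear least-squares SGD that skips non-critical examples.
    # Criticalness is approximated cheaply: examples are grouped, each
    # group is replaced by its averaged (aggregated) point, the gradient
    # norm at each aggregate is computed, and every member of a group
    # inherits its aggregate's score. Only the top keep_ratio fraction
    # of examples is visited by the SGD pass. (Hypothetical sketch, not
    # the authors' implementation.)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    groups = np.array_split(rng.permutation(n), n_agg)  # assumes n >= n_agg

    for _ in range(epochs):
        # Cheap criticalness pass: only n_agg gradient evaluations.
        scores = np.empty(n)
        for idx in groups:
            xa, ya = X[idx].mean(axis=0), y[idx].mean()
            g = (xa @ w - ya) * xa             # squared-loss gradient at aggregate
            scores[idx] = np.linalg.norm(g)    # members inherit the score
        critical = np.argsort(scores)[-max(1, int(keep_ratio * n)):]

        # Standard SGD update loop, restricted to the critical subset.
        for i in rng.permutation(critical):
            w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

Relative to scoring every example individually, the aggregation pass touches only n_agg points per epoch, which illustrates how using a small number of aggregated points, as the abstract describes, can keep the criticalness estimate cheap.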
