Article

A Survey on Classifying Big Data with Label Noise

Journal

Publisher

ASSOC COMPUTING MACHINERY
DOI: 10.1145/3492546

Keywords

Label noise; data quality; big data; machine learning; classification; deep learning; data streams


This survey provides an extensive review of the literature on treating label noise within big data. It addresses the challenges associated with big data and presents 30 methods for treating class label noise in different big data contexts. The surveyed works include distributed solutions, deep learning techniques, and streaming techniques. The paper identifies common trends and best practices, reviews implementation details, compares empirical results, and provides references to open-source projects. Its emphasis on label noise challenges, solutions, and empirical results as they relate to big data distinguishes it as a unique contribution intended to inspire future research and guide machine learning practitioners.
Class label noise is a critical component of data quality that directly inhibits the predictive performance of machine learning algorithms. While many data-level and algorithm-level methods exist for treating label noise, the challenges associated with big data call for new and improved methods. This survey addresses these concerns by providing an extensive literature review on treating label noise within big data. We begin with an introduction to the class label noise problem and traditional methods for treating label noise. Next, we present 30 methods for treating class label noise in a range of big data contexts, i.e., high-volume, high-variety, and high-velocity problems. The surveyed works include distributed solutions capable of operating on datasets of arbitrary size, deep learning techniques for large-scale datasets with limited clean labels, and streaming techniques for detecting class noise in the presence of concept drift. Common trends and best practices are identified in each of these areas, implementation details are reviewed, empirical results are compared across studies when applicable, and references to 17 open-source projects and programming packages are provided. An emphasis on label noise challenges, solutions, and empirical results as they relate to big data distinguishes this work as a unique contribution that will inspire future research and guide machine learning practitioners.

