Article

Similarity encoding for learning with dirty categorical variables

Journal

MACHINE LEARNING
Volume 107, Issue 8-10, Pages 1477-1494

Publisher

SPRINGER
DOI: 10.1007/s10994-018-5724-2

Keywords

Dirty data; Categorical variables; Statistical learning; String similarity measures

Funding

  1. DirtyData grant [ANR-17-CE23-0018]
  2. Wendelin grant
  3. Agence Nationale de la Recherche (ANR) [ANR-17-CE23-0018]


For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately into feature vectors, e.g., with one-hot encoding. Dirty, non-curated data give rise to categorical variables with very high cardinality but also redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in predictive performance in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high cardinalities, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperform classic encoding approaches.
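The idea of similarity encoding can be sketched in a few lines: each category string is represented by its similarities to a set of reference categories (here, a hand-picked prototype subset, which is also how the abstract suggests reducing dimensionality). The sketch below is an illustrative assumption, not the paper's exact implementation: it uses Jaccard similarity over character 3-grams as one concrete instance of a string similarity measure; the prototype list and function names are hypothetical.

```python
def char_ngrams(s, n=3):
    """Set of character n-grams of s, padded with one space on each side."""
    s = f" {s} "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard_3gram(a, b):
    """Jaccard similarity between the 3-gram sets of two strings (1.0 if equal)."""
    A, B = char_ngrams(a), char_ngrams(b)
    if not A and not B:
        return 1.0
    inter = len(A & B)
    return inter / (len(A) + len(B) - inter)

def similarity_encode(values, prototypes):
    """Encode each string as a vector of similarities to the prototype categories.

    With prototypes = all distinct categories and an exact-match similarity,
    this degenerates to one-hot encoding; string similarity instead lets
    morphologically close dirty categories share feature mass.
    """
    return [[jaccard_3gram(v, p) for p in prototypes] for v in values]

# Hypothetical dirty job-title categories: two spellings of the same entity.
prototypes = ["police officer", "firefighter"]
encoded = similarity_encode(["police officer", "police oficer"], prototypes)
```

Here the misspelled "police oficer" still gets a high similarity in the "police officer" column, whereas one-hot encoding would treat it as an entirely new, unrelated category.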

