Article

A Deep-Learned Embedding Technique for Categorical Features Encoding

Journal

IEEE Access
Volume 9, Pages 114381-114391

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/ACCESS.2021.3104357

Keywords

Encoding; Numerical models; Machine learning; Data models; Training; Biological neural networks; Computational modeling; Data preprocessing; categorical variables; natural language processing

Funding

  1. Institute for Information and Communication Technology Promotion (IITP) - Korea Government [Ministry of Science, ICT and Future Planning (MSIP)] (Development of the technology) [2020-0-00107]
  2. Institute for Information & Communication Technology Planning & Evaluation (IITP), Republic of Korea [2020-0-00107-002]


This paper presents a deep-learned embedding technique for encoding categorical features that achieves a higher F1 score than one-hot encoding while consuming less memory and generating fewer features.
Many machine learning algorithms, and almost all deep learning architectures, cannot process plain text in its raw form: their inputs must be numerical before a classification or regression problem can be solved. Categorical variables must therefore be encoded into numerical values. Categorical features are common and often of high cardinality, and one-hot encoding in such circumstances produces very high-dimensional vector representations, raising memory and computational concerns for machine learning models. This paper proposes a deep-learned embedding technique for encoding categorical features in categorical datasets. The technique is a distributed representation in which each category is mapped to a distinct vector whose components are learned while training a neural network. First, a data vocabulary containing only the categorical data is created; then word tokenization turns each categorical value into a single token. Finally, feature learning maps every categorical value in the vocabulary to a word vector. Three datasets from the University of California, Irvine (UCI) repository are used for training. The experimental results show that, with a long short-term memory (LSTM) model, the proposed technique achieves an F1 score of 89%, compared with 71% for one-hot encoding. Moreover, the deep-learned embedding technique uses less memory and generates fewer features than one-hot encoding.
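
The pipeline the abstract describes (build a vocabulary of categorical values, tokenize each value into a single token, then learn an embedding jointly with the downstream network) can be sketched as follows. This is a minimal illustration assuming the Keras API; the toy rows, labels, vocabulary handling, and layer sizes are hypothetical stand-ins for demonstration, not the authors' exact configuration.

    # Minimal sketch of the described pipeline (assumed Keras API); the rows,
    # labels, and layer sizes are illustrative, not the paper's configuration.
    import numpy as np
    from tensorflow.keras import layers, models

    # Toy categorical rows standing in for columns of a UCI dataset.
    rows = [
        ["male", "owner", "urban"],
        ["female", "renter", "rural"],
        ["female", "owner", "urban"],
    ]
    labels = np.array([1, 0, 1])

    # Step 1: vocabulary containing only categorical values (0 reserved for padding).
    vocab = {tok: i + 1 for i, tok in enumerate(sorted({v for r in rows for v in r}))}

    # Step 2: word tokenization; each categorical value becomes one integer token.
    X = np.array([[vocab[v] for v in r] for r in rows])

    # Step 3: feature learning; the embedding layer maps each token to a dense
    # vector whose components are trained jointly with the LSTM classifier.
    model = models.Sequential([
        layers.Embedding(input_dim=len(vocab) + 1, output_dim=8),
        layers.LSTM(16),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, labels, epochs=5, verbose=0)

    # One learned dense vector per categorical value in the vocabulary.
    embeddings = model.layers[0].get_weights()[0]
    print(embeddings.shape)  # (len(vocab) + 1, 8)

The contrast with one-hot encoding lies in the vector width: a feature with 10,000 distinct categories would require 10,000-dimensional sparse one-hot vectors, whereas the embedding above represents every category with only 8 learned dimensions.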
