Article

Categorical Variable Mapping Considerations in Classification Problems: Protein Application

Journal

MATHEMATICS
Volume 11, Issue 2, Pages -

Publisher

MDPI
DOI: 10.3390/math11020279

Keywords

categorical variables; numerical variables; mappings

The mapping of categorical variables into numerical values is commonly used in machine learning classification problems. In this study, four assumptions about these mappings in protein classification using amino acid information were numerically tested. A proposed eigenvalue-based matrix representation was used for the comparable mapping; tested across 23 different machine learning algorithms, it reached an accuracy of 83.25% while keeping a fixed, predetermined dimension regardless of protein size. An optimization algorithm for selecting an appropriate number of neurons in a neural network applied to the same protein classification problem achieved an accuracy of 85.02%, using a quadratic penalty function to reduce the chances of overfitting.
The mapping of categorical variables into numerical values is common in machine learning classification problems. This type of mapping is frequently performed in a relatively arbitrary manner. We present a series of four assumptions, tested numerically, regarding these mappings in the context of protein classification using amino acid information. These assumptions concern the mapping of categorical variables in protein classification problems without the need for approaches such as natural language processing (NLP). The first three assumptions relate to equivalent mappings, and the fourth involves a comparable mapping using a proposed eigenvalue-based matrix representation of the amino acid chain. The assumptions were tested across a range of 23 different machine learning algorithms. The numerical simulations are consistent with the presented assumptions, such as those concerning translations and permutations. The eigenvalue approach generates classifications that are either not statistically different from the base case or have higher mean values, while also providing advantages such as a fixed, predetermined dimension regardless of the size of the analyzed protein. This approach generated an accuracy of 83.25%. An optimization algorithm is also presented that selects an appropriate number of neurons in an artificial neural network applied to the same protein classification problem, achieving an accuracy of 85.02%. The model includes a quadratic penalty function to decrease the chances of overfitting.
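The abstract does not specify how the eigenvalue-based matrix is built, so the following Python sketch is only an illustration of the general idea, not the authors' construction. It assumes a hypothetical representation: a 20 x 20 matrix counting adjacent residue pairs, symmetrized, whose sorted eigenvalues serve as the feature vector. The names `make_mapping` and `eigen_features` are invented for this example. Under these assumptions, the feature vector always has 20 entries regardless of protein length, and it is unchanged when the categorical mapping is translated (shifted by a constant) or permuted.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def make_mapping(order=AMINO_ACIDS, offset=0):
    """Map each residue letter to an integer code.

    `order` permutes the assignment and `offset` translates it.
    Both parameters are hypothetical, chosen for illustration only.
    """
    return {aa: i + offset for i, aa in enumerate(order)}


def eigen_features(sequence, mapping):
    """Return a fixed-length eigenvalue feature vector for a sequence.

    ASSUMPTION: the matrix is a symmetrized count of adjacent residue
    pairs. The paper's actual matrix representation may differ.
    """
    codes = np.array([mapping[aa] for aa in sequence], dtype=int)
    # Remove any translation so codes index rows/columns 0..n-1.
    ranks = codes - min(mapping.values())
    n = len(mapping)
    counts = np.zeros((n, n))
    # Count each adjacent pair (residue k, residue k+1).
    np.add.at(counts, (ranks[:-1], ranks[1:]), 1)
    symmetric = counts + counts.T
    # Eigenvalues of a symmetric matrix are real; sort descending.
    return np.sort(np.linalg.eigvalsh(symmetric))[::-1]


if __name__ == "__main__":
    base = make_mapping()
    shuffled = make_mapping(order=AMINO_ACIDS[::-1], offset=5)
    seq = "MKTAYIAKQRQISFVKSHFSRQ"
    print(eigen_features(seq, base).shape)
    print(np.allclose(eigen_features(seq, base), eigen_features(seq, shuffled)))
```

Permuting the mapping only reorders the rows and columns of the matrix together, which leaves its eigenvalues unchanged; this is why the sketch is insensitive to the arbitrary choice of integer codes. The feature vector could then be passed to any standard classifier.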
