Article

Local Feature Selection for Large-Scale Data Sets With Limited Labels

Journal

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Volume 35, Issue 7, Pages 7152-7163

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TKDE.2022.3181208

Keywords

Data mining; semi-supervised learning; local feature selection; rough set; related family


Processing large-scale data sets with limited labels has long been a difficult task in data mining. To address this difficulty, two local feature selection algorithms, LARD and LRSD, were proposed based on the dependency degree; they can process partially labeled data sets and greatly improve computational efficiency. However, these algorithms struggle to handle large-scale data sets with millions of samples on a typical personal computer. Although the related-family method is more efficient than the dependency degree, it cannot be applied to partially labeled large-scale data. This paper therefore proposes a local feature selection method based on the related family to accelerate data processing. Experiments show that the proposed algorithm runs 405 times faster than LARD on partially labeled data sets while maintaining high classification accuracy. In addition, the new algorithm can effectively process partially labeled large-scale data sets with 5,000,000 samples or 20,000 features on a typical personal computer.
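For readers unfamiliar with the rough-set machinery the abstract refers to, the sketch below illustrates the classical dependency degree that LARD and LRSD build on: a feature subset B earns credit for every sample whose B-equivalence class (samples indistinguishable on B) falls entirely within one decision class. This is a generic illustration of dependency-degree feature selection, not the paper's related-family algorithm; the function names and the greedy forward-selection loop are this sketch's own assumptions.

```python
from collections import defaultdict

def dependency_degree(samples, labels, feature_subset):
    """Rough-set dependency degree gamma_B(D): the fraction of samples whose
    equivalence class under feature_subset is contained in a single decision
    class (the positive region)."""
    groups = defaultdict(list)
    for row, y in zip(samples, labels):
        key = tuple(row[f] for f in feature_subset)
        groups[key].append(y)
    # A group is "consistent" if all of its samples share one label.
    positive = sum(len(ys) for ys in groups.values() if len(set(ys)) == 1)
    return positive / len(samples)

def select_features(samples, labels, n_features):
    """Greedy forward selection by dependency degree (a generic sketch,
    not the related-family method of the paper)."""
    selected, remaining = [], list(range(n_features))
    best = 0.0
    while remaining:
        gain, f = max((dependency_degree(samples, labels, selected + [f]), f)
                      for f in remaining)
        if gain <= best:  # no feature improves the dependency degree
            break
        selected.append(f)
        remaining.remove(f)
        best = gain
    return selected
```

On a toy discrete data set where feature 0 alone determines the label, `select_features` returns `[0]` and stops, since adding feature 1 cannot raise the dependency degree above 1.0.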

