Article

A Data-Driven Approach for Extracting Representative Information From Large Datasets With Mixed Attributes

Journal

IEEE TRANSACTIONS ON ENGINEERING MANAGEMENT
Volume 69, Issue 5, Pages 1806-1822

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TEM.2019.2934485

Keywords

Data-driven; diversified information; information explosion; representative information extraction; web search

Funding

  1. National Natural Science Foundation of China [71871177, 71471144]
  2. National Key R&D Program of China [2018YFB1703001]

This article proposes a data-driven approach to automatically identify a subset of the original dataset that can cover more themes and content. The approach improves the accuracy of similarity estimation by incorporating external knowledge and attribute interactions, and identifies representative objects using an enhanced density peaks clustering algorithm. Experimental results demonstrate the effectiveness and robustness of the proposed approach.
The rapid growth of information technology and Internet applications has exposed users to an explosion of information. Extracting representative information from this abundance is of great interest to mobile e-commerce applications and web search engines. However, the items extracted by several existing methods, such as top-k, are often quite similar to one another, which makes it difficult to meet users' demand for diversified information. To increase the diversity of representative information, this article proposes a data-driven approach that automatically identifies a subset of the original dataset covering more themes and content. The approach consists of two stages. First, a new unified similarity measure is proposed for handling datasets with categorical and numeric attributes. We inject external knowledge and attribute interactions into the similarity learning process to improve the accuracy of similarity estimation between data objects. Second, we develop an enhanced density peaks clustering algorithm based on shared nearest neighbors to automatically identify representative objects according to the previously estimated similarity. The enhanced density peaks algorithm takes the local structure of the entire data space into consideration, which makes the proposed approach relatively insensitive to variations in dataset density and dimensionality. Theoretical analysis shows that the time complexity of the proposed approach is O(N log N) in the best case. Extensive comparison experiments were conducted on artificial and real-world datasets. The experimental results demonstrate the effectiveness and robustness of the proposed approach.
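The two-stage idea in the abstract (a similarity measure for mixed attributes, then density-peak selection of representatives) can be illustrated with a minimal sketch. This is not the authors' method: it assumes a plain Gower-style distance in place of the learned similarity with external knowledge and attribute interactions, and the basic density peaks formulation in place of the shared-nearest-neighbor enhancement. It also builds the full pairwise distance matrix, so it runs in O(N^2) rather than O(N log N). The function names `gower_distance` and `density_peaks_representatives` and the `cutoff` parameter are hypothetical, chosen only for illustration.

```python
import numpy as np

def gower_distance(num, cat):
    """Pairwise Gower-style distance for mixed data (an assumed stand-in).

    num: (N, p) numeric attributes; cat: (N, q) categorical labels.
    """
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat)
    # Numeric part: absolute difference scaled by each attribute's range.
    span = num.max(axis=0) - num.min(axis=0)
    span[span == 0] = 1.0  # avoid dividing by zero for constant columns
    d_num = np.abs(num[:, None, :] - num[None, :, :]) / span
    # Categorical part: 0 when the labels match, 1 otherwise.
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float)
    total = d_num.sum(axis=2) + d_cat.sum(axis=2)
    return total / (num.shape[1] + cat.shape[1])

def density_peaks_representatives(dist, k, cutoff):
    """Pick k representatives with basic density peaks (no SNN enhancement)."""
    n = dist.shape[0]
    # rho: local density from a Gaussian kernel, excluding the point itself.
    rho = np.exp(-(dist / cutoff) ** 2).sum(axis=1) - 1.0
    # delta: distance to the nearest point of higher density.
    order = np.argsort(-rho, kind="stable")
    delta = np.empty(n)
    delta[order[0]] = dist[order[0]].max()  # convention for the densest point
    for pos in range(1, n):
        i = order[pos]
        delta[i] = dist[i, order[:pos]].min()
    # Representatives are dense AND far from any denser point.
    gamma = rho * delta
    return np.argsort(-gamma, kind="stable")[:k]

# Toy data: two groups that differ in both the numeric and categorical attribute.
num = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
cat = [["a"], ["a"], ["a"], ["b"], ["b"], ["b"]]
dist = gower_distance(num, cat)
reps = density_peaks_representatives(dist, k=2, cutoff=0.05)
print(sorted(reps.tolist()))
```

Ranking by the product of density and separation is what keeps the selected objects from clustering in one dense region, which is the diversity property the abstract contrasts with top-k selection.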
