Article

Restoration of Data Structures Using Machine Learning Techniques

Journal

IEEE ACCESS
Volume 11, Pages 113077-113099

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/ACCESS.2023.3323846

Keywords

Information extraction; messy datasets; machine learning algorithms; schema discovery


Tabular data is a common format for representing real-world information, but importing messy files or files that contain multiple tables remains challenging. This paper proposes the STCExtract algorithm to reconstruct table structures and data. In the evaluation, it restored table structure with an accuracy of 94.4% to 100% and allocated data to columns with an accuracy of 59.7% to 90.2%.
Tabular data is the most common format used to represent real-world information. Almost all programs created for storing or processing data, such as relational database systems, spreadsheets, and statistical analysis software, can import or export tabular data. However, these programs are not robust enough to automatically import messy delimited files or files that contain data from multiple tables. In addition, messy datasets may contain data separated by multiple delimiters, lack column names, and include rows in which columns have been substituted or deleted. This paper proposes the STCExtract algorithm for reconstructing the table structures and data into which the contents of an input file can be arranged. The STCExtract algorithm is designed to be domain-independent and modular with respect to the machine learning algorithms and other parameters it uses. The algorithm works in two phases: the first phase recognizes the original data tables, and the second phase recognizes the columns of those tables. The STCExtract algorithm was evaluated through extensive experiments using multiple real datasets. Multiple messy datasets were generated for the four experiments. Three experiments were conducted to determine the optimal parameters for the STCExtract algorithm, and a fourth was conducted to evaluate the proposed algorithm. The results show that the STCExtract algorithm correctly arranged the structure of the tables with an accuracy of 94.4% to 100%. In the second phase, when the data were allocated to columns, its accuracy ranged from 59.7% to 90.2%.
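
To make the two-phase idea concrete, the following is a minimal sketch and not the STCExtract algorithm itself. It replaces the machine learning step with a simple hand-written heuristic: rows are grouped into tables by a coarse "signature" of their field types, and each row is then split into column values. The delimiter set, the function names, and the sample rows are all assumptions made for illustration; none of them come from the paper.

```python
import re
from collections import defaultdict

# Assumed delimiter set; the abstract does not list which delimiters are handled.
DELIMITERS = re.compile(r"[,;|\t]")


def split_fields(line):
    """Split one raw line on any of the assumed delimiters."""
    return [field.strip() for field in DELIMITERS.split(line)]


def field_kind(value):
    """Coarse type tag used as a feature: 'n' numeric, 't' text, 'e' empty."""
    if value == "":
        return "e"
    try:
        float(value)
        return "n"
    except ValueError:
        return "t"


def signature(line):
    """Structural signature of a row: the sequence of its field kinds."""
    return "".join(field_kind(f) for f in split_fields(line))


def recover_tables(lines):
    """Heuristic stand-in for both phases (not the paper's ML method).

    Phase 1: rows that share a signature are assigned to the same table.
    Phase 2: each row is split into its column values.
    """
    tables = defaultdict(list)
    for line in lines:
        if line.strip():
            tables[signature(line)].append(split_fields(line))
    return dict(tables)


# Hypothetical messy input: two interleaved tables with mixed delimiters.
messy = [
    "1;Alice|34",
    "Paris,France",
    "2,Bob;41",
    "Rome;Italy",
]
tables = recover_tables(messy)
for sig, rows in tables.items():
    print(sig, rows)
```

A fixed signature of this kind cannot cope with rows whose columns were substituted or deleted, which is the sort of case the abstract says the algorithm is designed to handle with learned models.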

