Article

Data set preprocessing and transformation in a database system

Journal

INTELLIGENT DATA ANALYSIS
Volume 15, Issue 4, Pages 613-631

Publisher

IOS PRESS
DOI: 10.3233/IDA-2011-0485

Keywords

Attribute construction; data transformation; preprocessing; summarization

Funding

  1. National Science Foundation [CCF 0937562, IIS 0914861]
  2. Division of Computing and Communication Foundations
  3. Directorate for Computer & Information Science & Engineering [0937562]; funding source: National Science Foundation

A significant amount of data mining analysis is performed outside a database system, which creates many data management issues. This article summarizes our experience and recommendations for computing data set preprocessing and transformation (i.e., data cleaning, record selection, summarization, denormalization, variable creation, and coding) inside a database system. This preparation is the most time-consuming task in data mining projects, yet it is largely ignored in the literature. We present practical issues, common solutions, and lessons learned when preparing and transforming data sets with SQL, based on experience from real-life projects. We then provide specific guidelines for translating programs written in a traditional programming language into SQL statements. Based on successful real-life projects, we compare the time performance of SQL code running inside the database system with that of external data mining programs, and we highlight which steps in data mining projects become faster when processed by the database system. More importantly, we identify advantages and disadvantages from a practical standpoint, based on feedback from data mining users.
