4.8 Article

High-Efficient Fuzzy Querying With HiveQL for Big Data Warehousing

Journal

IEEE TRANSACTIONS ON FUZZY SYSTEMS
Volume 30, Issue 6, Pages 1823-1837

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TFUZZ.2021.3069332

Keywords

Big Data; Warehousing; Data warehouses; Linguistics; Distributed databases; Relational databases; Fuzzy sets; data warehousing; Hadoop; Hive; querying

Funding

  1. Amazon Web Services [02/100/RGJ21/0009, 02/020/RGP19/0184]
  2. Statutory Research funds of Department of Applied Informatics, Silesian University of Technology [02/100/BK_21/0008, BK-221/RAU7/2021]
  3. National Natural Science Foundation of China [61300167, 61976120]
  4. Natural Science Foundation of Jiangsu Province [BK20191445]
  5. Qing Lan Project of Jiangsu Province, China

This article introduces the FuzzyHive library, which extends the Hive framework with fuzzy set techniques for flexible querying, analyzing, and reporting in big data warehouses. Through experiments, it is demonstrated that these extensions are efficient and provide new solutions for fuzzy data processing and querying in large datasets.
Querying and reporting from large volumes of structured, semistructured, and unstructured data often requires some flexibility. Fuzzy sets provide this flexibility by allowing the surrounding world to be categorized in a way that resembles human reasoning. Apache Hive is a data warehousing framework that runs on top of the Hadoop platform for big data processing. Hive allows executing queries and aggregating and analyzing data stored in the Hadoop Distributed File System (HDFS) and other repositories. Hive responds to the current need for efficient big data warehousing, which traditional data warehouses cannot meet because of their rigid nature. This article presents the FuzzyHive library, which extends the Hive framework with fuzzy-set-based techniques for querying, analyzing, and reporting on big data warehouses. We formalize the fuzzy techniques used when operating on Hive-based data warehouses (including fuzzy filtering on dimensional attributes, projection with fuzzy transformation, fuzzy grouping, and fuzzy joining). We also show how we embedded these operations in the Hive query language (HiveQL), which has not been studied so far. Such extensions make big data warehousing more flexible and add to the portfolio of tools used by the community working with fuzzy sets and data analysis. The FuzzyHive library complements the spectrum of available solutions for fuzzy data processing and querying in large datasets. We investigate the performance, effectiveness, and scalability of fuzzy querying in Hive for various data storage formats (text, Avro, and Parquet). Our experiments demonstrate that the proposed extensions introduce more elasticity and are also efficient for big data warehousing, making this the first solution of its kind for this environment.
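
To make the idea of fuzzy filtering on an attribute more concrete, the following is a minimal Python sketch, not the FuzzyHive API and not HiveQL. It assumes a trapezoidal membership function and a membership threshold; the names `Trapezoid`, `fuzzy_filter`, the `age` attribute, and the "middle-aged" linguistic value are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Trapezoid:
    """Hypothetical trapezoidal fuzzy set with support [a, d] and core [b, c]."""
    a: float
    b: float
    c: float
    d: float

    def membership(self, x: float) -> float:
        # Inside the core the element fully belongs to the set.
        if self.b <= x <= self.c:
            return 1.0
        # Outside the support it does not belong at all.
        if x <= self.a or x >= self.d:
            return 0.0
        # Rising edge between a and b.
        if x < self.b:
            return (x - self.a) / (self.b - self.a)
        # Falling edge between c and d.
        return (self.d - x) / (self.d - self.c)


def fuzzy_filter(
    rows: Iterable[Dict[str, float]],
    attribute: str,
    fuzzy_set: Trapezoid,
    threshold: float,
) -> List[Tuple[Dict[str, float], float]]:
    """Keep rows whose membership degree on `attribute` reaches `threshold`.

    Each kept row is returned together with its membership degree.
    """
    result = []
    for row in rows:
        degree = fuzzy_set.membership(row[attribute])
        if degree >= threshold:
            result.append((row, degree))
    return result


# Illustrative data: select people whose age is "middle-aged" to degree >= 0.5.
middle_aged = Trapezoid(a=20, b=30, c=40, d=50)
people = [{"age": 18}, {"age": 25}, {"age": 35}, {"age": 45}, {"age": 60}]
selected = fuzzy_filter(people, "age", middle_aged, threshold=0.5)
```

In this sketch a crisp condition such as "age between 30 and 40" is replaced by a graded one, so rows near the boundary (ages 25 and 45 here) are still returned, each annotated with a membership degree. How FuzzyHive exposes the equivalent operation inside a HiveQL query is described in the article itself.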
