High-Efficient Fuzzy Querying With HiveQL for Big Data Warehousing

IEEE TRANSACTIONS ON FUZZY SYSTEMS (2022)

Journal

IEEE TRANSACTIONS ON FUZZY SYSTEMS

Volume 30, Issue 6, Pages 1823-1837

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/TFUZZ.2021.3069332

Keywords

Big data; warehousing; data warehouses; data warehousing; linguistics; distributed databases; relational databases; fuzzy sets; Hadoop; Hive; querying

Funding

Amazon Web Services [02/100/RGJ21/0009, 02/020/RGP19/0184]
Statutory Research funds of Department of Applied Informatics, Silesian University of Technology [02/100/BK_21/0008, BK-221/RAU7/2021]
National Natural Science Foundation of China [61300167, 61976120]
Natural Science Foundation of Jiangsu Province [BK20191445]
Qing Lan Project of Jiangsu Province, China

Automated Summary

This article introduces the FuzzyHive library, which extends the Hive framework with fuzzy set techniques for flexible querying, analyzing, and reporting in big data warehouses. Experiments demonstrate that these extensions are efficient and provide a new solution for fuzzy data processing and querying in large datasets.

Abstract

Querying and reporting from large volumes of structured, semistructured, and unstructured data often require some flexibility. Fuzzy sets provide this flexibility by allowing the surrounding world to be categorized in a human-mind-like manner. Apache Hive is a data warehousing framework that runs on top of the Hadoop platform for big data processing. Hive allows executing queries and aggregating and analyzing data stored in the Hadoop Distributed File System (HDFS) and other repositories. Hive responds to the current need for efficient big data warehousing, which is impossible with traditional data warehouses due to their rigid nature. This article presents the FuzzyHive library, which extends the Hive framework with fuzzy-set-based techniques for querying, analyzing, and reporting on big data warehouses. We formalize the fuzzy techniques used while operating on Hive-based data warehouses (including fuzzy filtering on dimensional attributes, projection with fuzzy transformation, fuzzy grouping, and fuzzy joining). We also show how we embedded these operations in the Hive query language (HiveQL), which has not been studied so far. Such extensions make big data warehousing more flexible and contribute to the portfolio of tools used by the community of people working with fuzzy sets and data analysis. The FuzzyHive library complements the spectrum of available solutions for fuzzy data processing and querying in large datasets. We investigate the performance, effectiveness, and scalability of fuzzy querying in Hive for various data storage formats (text, Avro, and Parquet). Our experiments demonstrate that the proposed extensions introduce more elasticity and are also efficient for big data warehousing, making this the first solution of its kind for this environment.
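To make the ideas of fuzzy filtering and fuzzy grouping more concrete, the following is a minimal sketch in plain Python. It is not the FuzzyHive API and it is not HiveQL: the trapezoidal membership function, the linguistic terms (low, medium, high), the "temp" column, the sample rows, and the 0.5 threshold are all assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trapezoid:
    """Trapezoidal fuzzy set: rises on (a, b), equals 1 on [b, c], falls on (c, d)."""
    a: float
    b: float
    c: float
    d: float

    def membership(self, x: float) -> float:
        if self.b <= x <= self.c:
            return 1.0
        if self.a < x < self.b:
            return (x - self.a) / (self.b - self.a)
        if self.c < x < self.d:
            return (self.d - x) / (self.d - self.c)
        return 0.0


# Hypothetical linguistic terms for a hypothetical "temp" attribute.
TERMS = {
    "low": Trapezoid(0, 0, 10, 20),
    "medium": Trapezoid(10, 20, 25, 30),
    "high": Trapezoid(25, 30, 50, 50),
}

# Hypothetical rows standing in for records of a warehouse table.
ROWS = [
    {"id": 1, "temp": 5.0},
    {"id": 2, "temp": 16.0},
    {"id": 3, "temp": 22.0},
    {"id": 4, "temp": 28.0},
    {"id": 5, "temp": 40.0},
]


def fuzzy_filter(rows, column, fuzzy_set, threshold):
    """Keep rows whose membership in fuzzy_set is at least threshold."""
    return [r for r in rows if fuzzy_set.membership(r[column]) >= threshold]


def fuzzy_group(rows, column, terms):
    """Count rows per linguistic term, assigning each row to its best-matching term."""
    counts = {}
    for r in rows:
        degrees = {name: fs.membership(r[column]) for name, fs in terms.items()}
        best = max(degrees, key=degrees.get)
        if degrees[best] > 0.0:  # skip rows that match no term at all
            counts[best] = counts.get(best, 0) + 1
    return counts


if __name__ == "__main__":
    # Analogue of a query such as "select rows where temp is medium".
    print(fuzzy_filter(ROWS, "temp", TERMS["medium"], 0.5))
    # Analogue of grouping rows by the linguistic value of temp.
    print(fuzzy_group(ROWS, "temp", TERMS))
```

In this sketch, the row with temp 28.0 belongs to "medium" only to degree 0.4, so the filter with threshold 0.5 excludes it, while grouping assigns it to "high", where its membership degree (0.6) is larger. How FuzzyHive itself expresses such conditions inside HiveQL is described in the article and is not reproduced here.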
