Querying and reporting from large volumes of structured, semistructured, and unstructured data often requires some flexibility. Fuzzy sets provide this flexibility by allowing the surrounding world to be categorized in a flexible, human-mind-like manner. Apache Hive is a data warehousing framework that runs on top of the Hadoop platform for big data processing. Hive allows executing queries and aggregating and analyzing data stored in the Hadoop Distributed File System (HDFS) and other repositories. Hive responds to the current need for efficient big data warehousing, which is impossible with traditional data warehouses due to their rigid nature. This article presents the FuzzyHive library, which extends the Hive framework with fuzzy-set-based techniques for querying, analyzing, and reporting on big data warehouses. We formalize the fuzzy techniques used while operating on Hive-based data warehouses (including fuzzy filtering on dimensional attributes, projection with fuzzy transformation, fuzzy grouping, and fuzzy joining). We also show how we embedded these operations in the Hive query language, which has not been studied so far. Such extensions make big data warehousing more flexible and contribute to the portfolio of tools used by the community of people working with fuzzy sets and data analysis. The FuzzyHive library complements the spectrum of available solutions for fuzzy data processing and querying in large datasets. We investigate the performance, effectiveness, and scalability of fuzzy querying in Hive for various data storage formats (text, Avro, and Parquet). Our experiments demonstrate that the proposed extensions introduce more elasticity and are also efficient for big data warehousing, making this the first solution of its kind for this environment.
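To make the idea of fuzzy filtering on a dimensional attribute concrete, the following is a minimal Python sketch, not the FuzzyHive API. It assumes a trapezoidal membership function, a common choice for fuzzy sets, and a cut-off threshold on the membership degree. The names `trapezoid` and `fuzzy_filter`, the linguistic term "young", and its parameters are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

def trapezoid(x: float, a: float, b: float, c: float, d: float) -> float:
    """Membership degree of x in a trapezoidal fuzzy set.

    The support is [a, d] and the core (degree 1.0) is [b, c],
    with a <= b <= c <= d.
    """
    if b <= x <= c:          # inside the core
        return 1.0
    if x <= a or x >= d:     # outside the support
        return 0.0
    if x < b:                # rising edge between a and b
        return (x - a) / (b - a)
    return (d - x) / (d - c) # falling edge between c and d

def fuzzy_filter(rows: List[Dict], attr: str,
                 params: Tuple[float, float, float, float],
                 threshold: float) -> List[Dict]:
    """Keep rows whose attribute belongs to the fuzzy set with degree >= threshold.

    Each returned row is a copy with the membership degree attached.
    """
    selected = []
    for row in rows:
        degree = trapezoid(row[attr], *params)
        if degree >= threshold:
            selected.append({**row, "degree": degree})
    return selected

# Hypothetical linguistic term: "young" is fully true up to age 25
# and falls linearly to 0 at age 35 (assumed parameters).
YOUNG = (0.0, 0.0, 25.0, 35.0)

customers = [
    {"id": 1, "age": 18},
    {"id": 2, "age": 30},
    {"id": 3, "age": 33},
    {"id": 4, "age": 50},
]

young_customers = fuzzy_filter(customers, "age", YOUNG, threshold=0.5)
```

In a crisp query the condition would be a fixed range such as `age <= 25`. Here a customer aged 30 still qualifies, with a lower degree, which is the kind of flexibility the abstract describes.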
               