Article

Estimating Cardinality for Arbitrarily Large Data Stream With Improved Memory Efficiency

Journal

IEEE/ACM Transactions on Networking
Volume 28, Issue 2, Pages 433-446

Publisher

Institute of Electrical and Electronics Engineers (IEEE)
DOI: 10.1109/TNET.2020.2970860

Keywords

Data streams; cardinality estimation; random hashing

Funding

  1. National Natural Science Foundation of China [61872080]
  2. Jiangsu Provincial Key Laboratory of Computer Networking Technology
  3. National Science Foundation of the United States [STC-1562485, CNS-1719222]
  4. Florida Cybersecurity Center

Abstract

Cardinality estimation is the task of determining the number of distinct elements (the cardinality) in a data stream, under the stringent constraint that the stream can be scanned in only a single pass. This is a fundamental problem with many practical applications, such as traffic monitoring in high-speed networks and query optimization in Internet-scale databases. To solve it, we propose an algorithm named HLL-TailCut, which achieves a standard estimation error of $1.0/\sqrt{m}$ using memory units of only four or three bits each, substantially smaller than the five-bit memory units used by HyperLogLog, the best previously known cardinality estimator. This makes it possible to reduce the memory cost of HyperLogLog by 20%-45%. For example, when the target estimation error is 1.1%, state-of-the-art HyperLogLog needs 5.6 kilobytes of memory, whereas our new algorithm needs only 3 kilobytes to attain the same accuracy. Additionally, our algorithm supports the estimation of very large stream cardinalities, even at the tera and peta scale.
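
For context, the following is a minimal Python sketch of the standard HyperLogLog estimator that HLL-TailCut builds on. It is the five-bit-register baseline the abstract compares against, not the paper's algorithm; the hash function and parameter choices here are illustrative assumptions.

    import hashlib

    def hll_estimate(stream, b=10):
        """Classic HyperLogLog (Flajolet et al., 2007): m = 2^b registers,
        standard error about 1.04/sqrt(m). HLL-TailCut shrinks each register
        from 5 bits to 4 or 3 bits while keeping the error near 1.0/sqrt(m)."""
        m = 1 << b
        registers = [0] * m
        for item in stream:
            # 64-bit hash of the element (truncated SHA-256; any good hash works)
            h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
            j = h & (m - 1)                        # low b bits select a register
            w = h >> b                             # remaining 64-b bits feed the rank
            rank = (64 - b) - w.bit_length() + 1   # position of the leftmost 1-bit
            registers[j] = max(registers[j], rank)
        # Harmonic-mean estimate with the usual bias-correction constant alpha_m
        alpha = 0.7213 / (1 + 1.079 / m)
        return alpha * m * m / sum(2.0 ** -r for r in registers)

    # With b=10 (m=1024 registers) the relative error is about 3%:
    print(round(hll_estimate(range(100_000))))     # prints roughly 100000

The abstract's memory figures are consistent with these error formulas: at a 1.1% error target, HyperLogLog's $1.04/\sqrt{m}$ error requires $m \approx (1.04/0.011)^2 \approx 8{,}900$ five-bit registers, about 5.6 kilobytes, while HLL-TailCut's $1.0/\sqrt{m}$ error with three-bit registers requires $m \approx (1.0/0.011)^2 \approx 8{,}300$ registers, about 3 kilobytes.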
