Journal
PROCEEDINGS OF THE VLDB ENDOWMENT
Volume 5, Issue 3, Pages 265-273
Publisher
ASSOC COMPUTING MACHINERY
DOI: 10.14778/2078331.2078341
Funding
- Australian Research Council
- NICTA Victorian Research Laboratory
- Australian Government
- Digital Economy
- Australian Research Council through the ICT Centre of Excellence program
- Newton Fellowship
Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-size blocks and compress each block with a general-purpose adaptive algorithm such as GZIP. Random access to a specific document then requires decompressing the block that contains it. The choice of block size is critical: it trades compression effectiveness against document retrieval time. In this paper we present a scalable compression method for large document collections that allows fast random access. We build a representative sample of the collection and use it as a dictionary in an LZ77-like encoding of the rest of the collection, relative to the dictionary. We demonstrate on large collections that, using a dictionary as small as 0.1% of the collection size, our algorithm is dramatically faster than previous methods, and in general gives much better compression.
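The dictionary-relative encoding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`build_dictionary`, `encode`, `decode`), the evenly spaced sampling strategy, the naive substring search, and the convention of storing a literal byte as a factor of length 0 are all assumptions made for illustration. A practical system would use an index over the dictionary (for example a suffix array) to find matches and would compress the resulting factors further.

```python
from typing import List, Tuple

Factor = Tuple[int, int]  # (position in dictionary, length); length 0 marks a literal


def build_dictionary(collection: bytes, dict_size: int, sample_len: int) -> bytes:
    """Concatenate evenly spaced, fixed-length samples of the collection.

    Assumption: even spacing is one simple way to obtain a "representative
    sample"; the abstract does not specify the sampling scheme.
    """
    n_samples = max(1, dict_size // sample_len)
    stride = max(1, len(collection) // n_samples)
    samples = (collection[i:i + sample_len] for i in range(0, len(collection), stride))
    return b"".join(samples)[:dict_size]


def encode(doc: bytes, dictionary: bytes) -> List[Factor]:
    """Greedily factorise doc into substrings of the dictionary.

    Each factor is (pos, length), meaning dictionary[pos:pos + length].
    A byte that does not occur in the dictionary is emitted as the
    hypothetical literal factor (byte_value, 0).
    """
    factors: List[Factor] = []
    i = 0
    while i < len(doc):
        pos, length = -1, 0
        # Extend the match one byte at a time while it still occurs in the dictionary.
        # This naive search is quadratic and is used here only for clarity.
        while i + length < len(doc):
            p = dictionary.find(doc[i:i + length + 1])
            if p < 0:
                break
            pos, length = p, length + 1
        if length == 0:
            factors.append((doc[i], 0))  # literal byte
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors


def decode(factors: List[Factor], dictionary: bytes) -> bytes:
    """Rebuild a document from its factors using only the dictionary."""
    out = bytearray()
    for pos, length in factors:
        if length == 0:
            out.append(pos)  # literal byte stored in the position field
        else:
            out += dictionary[pos:pos + length]
    return bytes(out)


if __name__ == "__main__":
    dictionary = b"the quick brown dog "
    docs = [b"the quick brown fox", b"the lazy brown dog"]
    encoded = [encode(d, dictionary) for d in docs]
    # Random access: document 1 is decoded without touching document 0.
    print(decode(encoded[1], dictionary))
```

In this sketch each document is factorised independently against the shared dictionary, so retrieving one document requires only its own factors and the dictionary, rather than the decompression of a surrounding block.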