Journal
WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS
Volume 19, Issue 6, Pages 1077-1101
Publisher
SPRINGER
DOI: 10.1007/s11280-015-0374-9
Keywords
Dynamic block; Wrapper; Block tracking
Funding
- National Basic Research Program of China (973 Program) [2014CB340403]
- Fundamental Research Funds for the Central Universities
- Research Funds of Renmin University of China [14XNLF05, 15XNLF03]
- National Culture Science and Technology Promotion Plan
- National Natural Science Foundation of China [61502501]
- secondary network prototype system development project by Xinhua News Agency
As dynamic web pages change rapidly, there is a growing need to receive instant updates for dynamic blocks on the Web. In this paper, we address the problem of automatically following dynamic blocks in web pages. Given a user-specified block on a web page, we continuously track the content of the block and report updates in real time. Such a service lets users track, for example, the top-ten breaking news on CNN, the prices of iPhones on Amazon, or NBA game scores. We study 3,346 human-labeled blocks from 1,127 pages and analyze the effectiveness of four types of patterns for tracking content blocks: visual area, DOM tree path, inner content, and close context. Because web pages change frequently, we find that the initial patterns generated on the original page can become invalid over time, causing extraction of the correct blocks to fail. Based on these observations, we combine different patterns to improve the accuracy and stability of block extraction. Moreover, we propose an adaptive model that adapts each pattern individually and adjusts the pattern weights for an improved combination. The experimental results show that the proposed models outperform existing approaches, with the adaptive model performing best.
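The idea of combining the four pattern types with adjustable weights can be illustrated with a minimal Python sketch. This is not the paper's model: the abstract gives no formulas, so the weighted-sum scoring, the weight update rule, the `rate` parameter, and every name below (`Candidate`, `combined_score`, `pick_block`, `adapt_weights`) are assumptions made only for illustration. Each candidate block is assumed to already carry a match score in [0, 1] per pattern.

```python
from dataclasses import dataclass

# The four pattern types named in the abstract; the identifiers are illustrative.
PATTERNS = ("visual_area", "dom_path", "inner_content", "close_context")


@dataclass
class Candidate:
    """A candidate block on the current page version (hypothetical structure)."""
    block_id: str
    scores: dict  # pattern name -> match score in [0, 1], assumed precomputed


def combined_score(candidate, weights):
    """Weighted sum of per-pattern scores (an assumed combination rule)."""
    return sum(weights[p] * candidate.scores.get(p, 0.0) for p in PATTERNS)


def pick_block(candidates, weights):
    """Return the candidate with the highest combined score."""
    return max(candidates, key=lambda c: combined_score(c, weights))


def adapt_weights(weights, chosen, rate=0.2):
    """Shift each weight toward that pattern's score on the chosen block,
    then renormalize so the weights sum to 1 (an assumed update rule)."""
    updated = {
        p: (1 - rate) * weights[p] + rate * chosen.scores.get(p, 0.0)
        for p in PATTERNS
    }
    total = sum(updated.values())
    if total == 0:
        return dict(weights)  # nothing matched; keep the old weights
    return {p: v / total for p, v in updated.items()}


# Usage: the DOM path has gone stale for block A, but the other patterns agree.
weights = {p: 0.25 for p in PATTERNS}
candidates = [
    Candidate("A", {"visual_area": 0.9, "dom_path": 0.1,
                    "inner_content": 0.8, "close_context": 0.9}),
    Candidate("B", {"visual_area": 0.2, "dom_path": 1.0,
                    "inner_content": 0.1, "close_context": 0.2}),
]
best = pick_block(candidates, weights)
weights = adapt_weights(weights, best)
print(best.block_id, weights)
```

In this sketch, a pattern that disagrees with the selected block loses weight on later page versions, which is one simple way to keep a single invalidated pattern from dominating the extraction.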