Finding text reuse on the web (2009)
| Venue: | In WSDM ’09: Proceedings of the Second ACM International Conference on Web Search and Data Mining |
| Citations: | 5 - 0 self |
BibTeX
@INPROCEEDINGS{Bendersky09findingtext,
author = {Michael Bendersky and W. Bruce Croft},
title = {Finding text reuse on the web},
booktitle = {WSDM ’09: Proceedings of the Second ACM International Conference on Web Search and Data Mining},
year = {2009},
pages = {262--271},
publisher = {ACM}
}
Abstract
With the overwhelming number of reports on similar events originating from different sources on the web, it is often hard, using existing web search paradigms, to find the original source of “facts”, statements, rumors, and opinions, and to track their development. Several techniques have previously been proposed for detecting such text reuse between different sources; however, these techniques have been tested against relatively small and homogeneous TREC collections. In this work, we test the feasibility of text reuse detection techniques in the setting of web search. In addition to text reuse detection, we develop a novel technique that addresses the unique challenges of finding original sources on the web, such as defining a timeline. We also explore the use of link analysis for identifying reliable and relevant reports. Our experimental results show that the proposed techniques can operate on the scale of the web, are significantly more accurate than standard web search for finding text reuse, and provide a richer representation for tracking the information flow.







