Finding text reuse on the web
In WSDM ’09: Proceedings of the Second ACM International Conference on Web Search and Data Mining, 2009
Cited by 5 (0 self)
With the overwhelming number of reports on similar events originating from different sources on the web, it is often hard, using existing web search paradigms, to find the original source of “facts”, statements, rumors, and opinions, and to track their development. Several techniques have previously been proposed for detecting such text reuse between different sources; however, these techniques have been tested against relatively small and homogeneous TREC collections. In this work, we test the feasibility of text reuse detection techniques in the setting of web search. In addition to text reuse detection, we develop a novel technique that addresses the unique challenges of finding original sources on the web, such as defining a timeline. We also explore the use of link analysis for identifying reliable and relevant reports. Our experimental results show that the proposed techniques can operate on the scale of the web, are significantly more accurate than standard web search for finding text reuse, and provide a richer representation for tracking the information flow.
Quality-Biased Ranking of Web Documents
Cited by 3 (1 self)
Many existing retrieval approaches do not take into account the content quality of the retrieved documents, although link-based measures such as PageRank are commonly used as a form of document prior. In this paper, we present the quality-biased ranking method that promotes documents containing high-quality content, and penalizes low-quality documents. The quality of the document content can be determined by its readability, layout and ease-of-navigation, among other factors. Accordingly, instead of using a single estimate for document quality, we consider multiple content-based features that are directly integrated into a state-of-the-art retrieval method. These content-based features are easy to compute, store and retrieve, even for large web collections. We use several query sets and web collections to empirically evaluate the performance of our quality-biased retrieval method. In each case, our method consistently improves by a large margin the retrieval performance of text-based and link-based retrieval methods that do not take into account the quality of the document content.
Utilizing inter-passage and inter-document similarities for re-ranking search results
2009
Text Classification: A Sequential Reading Approach
We propose to model the text classification process as a sequential decision process. In this process, an agent learns to classify documents into topics while reading the document sentences sequentially, and learns to stop as soon as enough information has been read to decide. The proposed algorithm is based on modeling text classification as a Markov Decision Process and learns by using Reinforcement Learning. Experiments on four different classical mono-label corpora show that the proposed approach performs comparably to classical SVM approaches for large training sets, and better for small training sets. In addition, the model automatically adapts its reading process to the quantity of training information provided.

