Results 1 - 10 of 53
Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization
In SIGIR ’08: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008
"... Multi-document summarization aims to create a compressed summary while retaining the main characteristics of the original set of documents. Many approaches use statistics and machine learning techniques to extract sentences from documents. In this paper, we propose a new multi-document summarization ..."
Abstract
-
Cited by 45 (17 self)
- Add to MetaCart
(Show Context)
Multi-document summarization aims to create a compressed summary while retaining the main characteristics of the original set of documents. Many approaches use statistics and machine learning techniques to extract sentences from documents. In this paper, we propose a new multi-document summarization framework based on sentence-level semantic analysis and symmetric non-negative matrix factorization. We first calculate sentence-sentence similarities using semantic analysis and construct the similarity matrix. Then symmetric matrix factorization, which has been shown to be equivalent to normalized spectral clustering, is used to group sentences into clusters. Finally, the most informative sentences are selected from each group to form the summary. Experimental results on the DUC2005 and DUC2006 data sets demonstrate the improvement of our proposed framework over the existing summarization systems we implemented. We also conduct a further study of the factors that contribute to its high performance.
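A minimal sketch of the clustering-then-selection pipeline this abstract describes, with TF-IDF cosine similarity standing in for the paper's sentence-level semantic analysis; the multiplicative update follows the standard symmetric NMF scheme, and the function names (symnmf, summarize) are illustrative, not the authors' code:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def symnmf(W, k, iters=200, beta=0.5, eps=1e-9):
    # Factor the similarity matrix as W ~ H H^T with H >= 0 via
    # multiplicative updates; argmax over H's columns gives a clustering.
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[0], k))
    for _ in range(iters):
        H *= (1 - beta) + beta * (W @ H) / (H @ (H.T @ H) + eps)
    return H

def summarize(sentences, k=3, per_cluster=1):
    # Assumption: TF-IDF cosine similarity replaces the paper's
    # semantic-analysis-based sentence similarity.
    X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    W = cosine_similarity(X)
    labels = symnmf(W, k).argmax(axis=1)
    picked = []
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        if idx.size == 0:
            continue
        # "Most informative" sentence: highest mean similarity to its cluster.
        scores = W[np.ix_(idx, idx)].mean(axis=1)
        picked.extend(idx[np.argsort(-scores)[:per_cluster]])
    return [sentences[i] for i in sorted(picked)]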
Good abandonment in mobile and PC internet search
In Proc. SIGIR ’09, 2009
"... Query abandonment by search engine users is generally considered to be a negative signal. In this paper, we explore the concept of good abandonment. We define a good abandonment as an abandoned query for which the user’s information need was successfully addressed by the search results page, with no ..."
Abstract
-
Cited by 26 (0 self)
- Add to MetaCart
(Show Context)
Query abandonment by search engine users is generally considered to be a negative signal. In this paper, we explore the concept of good abandonment. We define a good abandonment as an abandoned query for which the user’s information need was successfully addressed by the search results page, with no need to click on a result or refine the query. We present an analysis of abandoned internet search queries across two modalities (PC and mobile) in three locales. The goal is to approximate the prevalence of good abandonment, and to identify types of information needs that may lead to good abandonment, across different locales and modalities. Our study has three key findings: First, queries potentially indicating good abandonment make up a significant portion of all abandoned queries. Second, the good abandonment rate from mobile search is significantly higher than that from PC search, across all locales tested. Third, classified by type of information need, the major classes of good abandonment vary dramatically by both locale and modality. Our findings imply that it is a mistake to uniformly consider query abandonment as a negative signal. Further, there is a potential opportunity for search engines to drive additional good abandonment, especially for mobile search users, by improving search features and result snippets.
Machine Learned Sentence Selection Strategies for Query-Biased Summarization
In SIGIR Learning to Rank Workshop, 2008
"... It has become standard for search engines to augment result lists with document summaries. Each document summary consists of a title, abstract, and a URL. In this work, we focus on the task of selecting relevant sentences for inclusion in the abstract. In particular, we investigate how machine learn ..."
Abstract
-
Cited by 18 (1 self)
- Add to MetaCart
(Show Context)
It has become standard for search engines to augment result lists with document summaries. Each document summary consists of a title, abstract, and a URL. In this work, we focus on the task of selecting relevant sentences for inclusion in the abstract. In particular, we investigate how machine learning-based approaches can effectively be applied to the problem. We analyze and evaluate several learning to rank approaches, such as ranking support vector machines (SVMs), support vector regression (SVR), and gradient boosted decision trees (GBDTs). Our work is the first to evaluate SVR and GBDTs for the sentence selection task. Using standard TREC test collections, we rigorously evaluate various aspects of the sentence selection problem. Our results show that the effectiveness of the machine learning approaches varies across collections with different characteristics. Furthermore, the results show that GBDTs provide a robust and powerful framework for the sentence selection task and significantly outperform SVR and ranking SVMs on several data sets.
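A pointwise sketch of the sentence-selection setup, using scikit-learn's gradient boosted trees in place of the paper's GBDT implementation; the features here (query term overlap, position, length, phrase match) are plausible examples, not the paper's exact feature set:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def sentence_features(query, sentence, position):
    q = set(query.lower().split())
    s = sentence.lower().split()
    overlap = sum(1 for w in s if w in q)
    return [overlap,                                   # query term matches
            overlap / max(len(s), 1),                  # match density
            len(s),                                    # sentence length
            position,                                  # position in document
            float(query.lower() in sentence.lower())]  # exact phrase match

# X: feature rows for (query, sentence) pairs; y: relevance labels,
# e.g. derived from TREC judgments as in the paper's evaluation.
def train(X, y):
    model = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                      learning_rate=0.1)
    return model.fit(np.asarray(X), np.asarray(y))

def select(model, query, sentences, k=2):
    feats = [sentence_features(query, s, i) for i, s in enumerate(sentences)]
    scores = model.predict(np.asarray(feats))
    return [sentences[i] for i in np.argsort(-scores)[:k]]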
A survey of text summarization techniques
In Charu C. Aggarwal and ChengXiang Zhai, editors, Mining Text Data, 2012
"... ..."
On Compressing the Textual Web
"... Nowadays we know how to effectively compress most basic components of any modern search engine, such as, the graphs arising from the Web structure and/or its usage, the posting lists, and the dictionary of terms. But we are not aware of any study which has deeply addressed the issue of compressing t ..."
Abstract
-
Cited by 12 (2 self)
- Add to MetaCart
(Show Context)
Nowadays we know how to effectively compress most basic components of any modern search engine, such as the graphs arising from the Web structure and/or its usage, the posting lists, and the dictionary of terms. But we are not aware of any study that has deeply addressed the issue of compressing the raw Web pages. Many Web applications use simple compression algorithms (e.g., gzip, or word-based Move-to-Front or Huffman coders) and conclude that, even compressed, raw data take more space than inverted lists. In this paper we investigate two typical scenarios of use of data compression for large Web collections. In the first scenario, the compressed pages are stored on disk and we only need to support the fast scanning of large parts of the compressed collection (such as for map-reduce paradigms). In the second scenario, we consider fast access to individual pages of the compressed collection that is distributed among the RAMs of many PCs (such as for search engines and miners). For the first scenario, we provide a thorough experimental comparison among state-of-the-art compressors, thus indicating the pros and cons of the available solutions. For the second scenario, we compare compressed-storage solutions with the new technology of compressed self-indexes [45]. Our results show that Web pages are more compressible than expected and, consequently, that some common beliefs in this area should be reconsidered. Our results are novel for the large spectrum of tested approaches and the size of the datasets, and provide a threefold contribution: a nontrivial baseline for designing new compressed-storage solutions, a guide for software developers faced with Web-page storage, and a natural complement to the recent figures on inverted-list compression achieved by [57, 58].
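A toy illustration of the paper's second scenario (random access to individual pages inside a compressed collection): pages are packed into fixed-size blocks, each block is compressed independently, and fetching a page decompresses only its block. zlib here is just a stand-in for the stronger compressors the paper benchmarks, and the block size is the knob trading compression ratio against access cost:

import zlib

class BlockStore:
    def __init__(self, block_size=16 * 1024):
        self.block_size = block_size
        self.blocks = []   # independently compressed blocks
        self.index = []    # page_id -> (block_no, offset, length)
        self._buf = bytearray()

    def add(self, page: bytes):
        if self._buf and len(self._buf) + len(page) > self.block_size:
            self._flush()
        self.index.append((len(self.blocks), len(self._buf), len(page)))
        self._buf += page

    def _flush(self):
        self.blocks.append(zlib.compress(bytes(self._buf)))
        self._buf = bytearray()

    def finish(self):  # call once after the last add()
        if self._buf:
            self._flush()

    def get(self, page_id: int) -> bytes:
        # Decompress a single block, not the whole collection.
        block_no, offset, length = self.index[page_id]
        return zlib.decompress(self.blocks[block_no])[offset:offset + length]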
Using Clicks as Implicit Judgments: Expectations Versus Observations
"... Abstract. Clickthrough data has been the subject of increasing popularity as an implicit indicator of user feedback. Previous analysis has suggested that user click behaviour is subject to a quality bias—that is, users click at different rank positions when viewing effective search results than when ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
(Show Context)
Clickthrough data has attracted increasing attention as an implicit indicator of user feedback. Previous analysis has suggested that user click behaviour is subject to a quality bias: users click at different rank positions when viewing effective search results than when viewing less effective search results. Based on this observation, it should be possible to use click data to infer the quality of the underlying search system. In this paper we carry out a user study to systematically investigate how click behaviour changes for different levels of search system effectiveness, as measured by information retrieval performance metrics. Our results show that click behaviour does not vary systematically with the quality of search results. However, click behaviour does vary significantly between individual users and between search topics. This suggests that using direct click behaviour (click rank and click frequency) to infer the quality of the underlying search system is problematic. Further analysis of our user click data indicates that the correspondence between clicks in a search result list and subsequent confirmation that the clicked resource is actually relevant is low. Using clicks as an implicit indication of relevance should therefore be done with caution.
Enhanced results for web search
In SIGIR ’11, ACM, 2011
"... “Ten blue links ” have defined web search results for the last fifteen years – snippets of text combined with document titles and URLs. In this paper, we establish the notion of enhanced search results that extend web search results to include multimedia objects such as images and video, intentspeci ..."
Abstract
-
Cited by 9 (1 self)
- Add to MetaCart
(Show Context)
“Ten blue links” have defined web search results for the last fifteen years: snippets of text combined with document titles and URLs. In this paper, we establish the notion of enhanced search results, which extend web search results to include multimedia objects such as images and video, intent-specific key-value pairs, and elements that allow the user to interact with the contents of a web page directly from the search results page. We show that users express a preference for enhanced results both explicitly and when observed in their search behavior. We also demonstrate the effectiveness of enhanced results in helping users to assess the relevance of search results. Lastly, we show that we can efficiently generate enhanced results to cover a significant fraction of search result pages.
Highlighting Diverse Concepts in Documents
2009
"... We show the underpinnings of a method for summarizing documents: it ingests a document and automatically highlights a small set of sentences that are expected to cover the different aspects of the document. The sentences are picked using simple coverage and orthogonality criteria. We describe a nove ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
(Show Context)
We show the underpinnings of a method for summarizing documents: it ingests a document and automatically highlights a small set of sentences that are expected to cover the different aspects of the document. The sentences are picked using simple coverage and orthogonality criteria. We describe a novel combinatorial formulation that captures exactly the document-summarization problem, and we develop simple and efficient algorithms for solving it. We compare our algorithms with many popular document-summarization techniques via a broad set of experiments on real data. The results demonstrate that our algorithms work well in practice and give high-quality summaries.
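The paper's exact combinatorial formulation is not reproduced in the abstract; as a stand-in, here is a simple greedy heuristic built from the same two ingredients, coverage (similarity to the document centroid) and orthogonality (a penalty for similarity to already selected sentences):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def highlight(sentences, k=3, lam=0.7):
    X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # coverage: how well each sentence represents the document as a whole
    coverage = cosine_similarity(X, np.asarray(X.mean(axis=0))).ravel()
    sim = cosine_similarity(X)
    picked = []
    for _ in range(min(k, len(sentences))):
        best, best_score = -1, -np.inf
        for i in range(len(sentences)):
            if i in picked:
                continue
            # orthogonality: penalize overlap with sentences already chosen
            redundancy = max((sim[i, j] for j in picked), default=0.0)
            score = lam * coverage[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        picked.append(best)
    return [sentences[i] for i in sorted(picked)]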
Document compaction for efficient query biased snippet generation
In ECIR 2009: 31st European Conference on Information Retrieval, volume 5478 of LNCS, 2009
"... Abstract. Current web search engines return query-biased snippets for each document they list in a result set. For efficiency, search engines operating on large collections need to cache snippets for common queries, and to cache documents to allow fast generation of snippets for uncached queries. To ..."
Abstract
-
Cited by 8 (2 self)
- Add to MetaCart
(Show Context)
Current web search engines return query-biased snippets for each document they list in a result set. For efficiency, search engines operating on large collections need to cache snippets for common queries, and to cache documents to allow fast generation of snippets for uncached queries. To improve the hit rate on a document cache during snippet generation, we propose and evaluate several schemes for reducing document size, hence increasing the number of documents in the cache. In particular, we argue against further improvements to document compression, and argue for schemes that prune documents based on the a priori likelihood that a sentence will be used as part of a snippet for the given document. Our experiments show that if documents are reduced to less than half their original size, 80% of the snippets generated are identical to those generated from the original documents. Moreover, as the pruned, compressed surrogates are smaller, 3-4 times as many documents can be cached.
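A sketch of the pruning idea under stated assumptions: the a priori "snippet-worthiness" score below (IDF-weighted terms plus a position bonus) is a plausible example rather than the paper's actual scheme, and sentences are kept in document order until a size budget of half the original is spent:

import math

def prune(sentences, doc_freq, n_docs, budget_ratio=0.5):
    # Query-independent score: sentences containing rare (high-IDF) terms
    # and sentences near the start of the document are assumed more likely
    # to appear in some future query-biased snippet.
    def score(sent, pos):
        terms = set(sent.lower().split())
        idf = sum(math.log(n_docs / (1 + doc_freq.get(t, 0))) for t in terms)
        return idf / max(len(terms), 1) + 1.0 / (1 + pos)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: -score(sentences[i], i))
    budget = budget_ratio * sum(len(s) for s in sentences)
    kept, used = set(), 0
    for i in ranked:
        if used + len(sentences[i]) <= budget:
            kept.add(i)
            used += len(sentences[i])
    return [sentences[i] for i in sorted(kept)]  # preserve document order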
To index or not to index: Time-space trade-offs in search engines with positional ranking functions
In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2012
"... ..."
(Show Context)