Fast generation of result snippets in web search
- In Kraaij et al
Cited by 21 (4 self)
The presentation of query biased document snippets as part of results pages presented by search engines has become an expectation of search engine users. In this paper we explore the algorithms and data structures required as part of a search engine to allow efficient generation of query biased snippets. We begin by proposing and analysing a document compression method that reduces snippet generation time by 58% over a baseline using the zlib compression library. These experiments reveal that finding documents on secondary storage dominates the total cost of generating snippets, and so caching documents in RAM is essential for a fast snippet generation process. Using simulation, we examine snippet generation performance for RAM caches of different sizes. Finally we propose and analyse document reordering and compaction, revealing a scheme that increases the number of document cache hits with only a marginal effect on snippet quality. This scheme effectively doubles the number of documents that can fit in a fixed-size cache.
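The cache simulation mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not name a replacement policy, so least-recently-used (LRU) eviction is an assumption here, and the function name `simulate_lru_hits` and the request stream are hypothetical. The sketch only counts cache hits for a stream of document IDs at a given cache capacity.

```python
from collections import OrderedDict


def simulate_lru_hits(requests, capacity):
    """Count cache hits for a stream of document IDs.

    Assumption: the cache holds at most `capacity` documents and evicts
    the least recently used one. The paper's actual policy may differ.
    """
    cache = OrderedDict()
    hits = 0
    for doc_id in requests:
        if doc_id in cache:
            hits += 1
            cache.move_to_end(doc_id)  # mark as most recently used
        else:
            cache[doc_id] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits


if __name__ == "__main__":
    # Hypothetical request stream of document IDs.
    stream = [1, 2, 1, 3, 1, 2, 4, 1]
    for capacity in (2, 4):
        print(capacity, simulate_lru_hits(stream, capacity))
```

Running the same stream at two capacities shows why fitting more documents into a fixed amount of RAM, as the compaction scheme aims to do, can raise the hit count.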
A brief survey of text mining
- LDV Forum - GLDV Journal for Computational Linguistics and Language Technology, 2005
Cited by 17 (0 self)
The enormous amount of information stored in unstructured texts cannot simply be used for further processing by computers, which typically handle text as simple sequences of character strings. Therefore, specific (pre-)processing methods and algorithms are required in order to extract useful patterns. Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text. In this article, we discuss text mining as a young and interdisciplinary field at the intersection of the related areas of information retrieval, machine learning, statistics, computational linguistics and especially data mining. We describe the main analysis tasks: preprocessing, classification, clustering, information extraction and visualization. In addition, we briefly discuss a number of successful applications of text mining.
Blind men and elephants: what do citation summaries tell us about a research article
- Journal of the American Society for Information Science and Technology, 2008
Cited by 14 (7 self)
The old Asian legend about the blind men and the elephant comes to mind when looking at how different authors of scientific papers describe a piece of related prior work. It turns out that different citations to the same paper often focus on different aspects of that paper and that none of them provides a full description of its full set of contributions. In this paper we describe our investigation of this phenomenon. We studied citation summaries in the context of research papers in the biomedical domain. A citation summary is the set of citing sentences for a given article and can be used as a surrogate for the actual article in a variety of scenarios. It contains information that was deemed by peers to be important. Our study shows that citation summaries overlap to some extent with the abstracts of the papers and that they also differ from them in that they focus on different aspects of these papers than the abstracts do. In addition to this, co-cited articles (which are pairs of articles cited by another article) tend to be similar. We show results based on a lexical similarity metric called cohesion to justify our claims.
Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization
- In SIGIR ’08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 2008
Cited by 12 (4 self)
Multi-document summarization aims to create a compressed summary while retaining the main characteristics of the original set of documents. Many approaches use statistics and machine learning techniques to extract sentences from documents. In this paper, we propose a new multi-document summarization framework based on sentence-level semantic analysis and symmetric non-negative matrix factorization. We first calculate sentence-sentence similarities using semantic analysis and construct the similarity matrix. Then symmetric matrix factorization, which has been shown to be equivalent to normalized spectral clustering, is used to group sentences into clusters. Finally, the most informative sentences are selected from each group to form the summary. Experimental results on the DUC2005 and DUC2006 data sets show that the proposed framework improves on the existing summarization systems that were implemented for comparison. A further study of the factors that contribute to this performance is also conducted.
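The clustering step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a commonly used damped multiplicative update for symmetric non-negative matrix factorization (approximating a similarity matrix A by H Hᵀ), and the function names, the toy similarity matrix, and all parameter defaults are hypothetical. Each sentence is assigned to the cluster with the largest entry in its row of H.

```python
import random


def symnmf(A, k, iters=200, beta=0.5, seed=0, eps=1e-9):
    """Approximate a symmetric non-negative matrix A (list of lists)
    by H @ H.T with H >= 0 of shape n x k.

    Assumption: damped multiplicative updates; the paper may use a
    different solver.
    """
    n = len(A)
    rng = random.Random(seed)
    # Strictly positive random initialisation.
    H = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    for _ in range(iters):
        # AH = A @ H, shape n x k
        AH = [[sum(A[i][j] * H[j][c] for j in range(n)) for c in range(k)]
              for i in range(n)]
        # G = H.T @ H, shape k x k
        G = [[sum(H[i][a] * H[i][b] for i in range(n)) for b in range(k)]
             for a in range(k)]
        # HG = H @ G = H @ H.T @ H, shape n x k
        HG = [[sum(H[i][a] * G[a][c] for a in range(k)) for c in range(k)]
              for i in range(n)]
        H = [[H[i][c] * (1 - beta + beta * AH[i][c] / (HG[i][c] + eps))
              for c in range(k)] for i in range(n)]
    return H


def reconstruction_error(A, H):
    """Squared Frobenius norm of A - H @ H.T."""
    n, k = len(A), len(H[0])
    return sum((A[i][j] - sum(H[i][c] * H[j][c] for c in range(k))) ** 2
               for i in range(n) for j in range(n))


def cluster_labels(H):
    """Assign each row (sentence) to the column with the largest weight."""
    return [max(range(len(row)), key=row.__getitem__) for row in H]


if __name__ == "__main__":
    # Toy similarity matrix: sentences 0-1 and 2-3 form two similar pairs.
    A = [[1.0, 0.9, 0.0, 0.0],
         [0.9, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.9],
         [0.0, 0.0, 0.9, 1.0]]
    H = symnmf(A, k=2)
    print(cluster_labels(H), reconstruction_error(A, H))
```

In a full pipeline the similarity matrix would come from the semantic analysis step, and a separate scoring step would pick the most informative sentence from each cluster; both are outside this sketch.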
Generating Impact-Based Summaries for Scientific Literature
Cited by 10 (1 self)
In this paper, we present a study of a novel summarization problem, i.e., summarizing the impact of a scientific publication. Given a paper and its citation context, we study how to extract sentences that can represent the most influential content of the paper. We propose language modeling methods for solving this problem, and study how to incorporate features such as authority and proximity to accurately estimate the impact language model. Experimental results on a SIGIR publication collection show that the proposed methods are effective for generating impact-based summaries.
Automatic Summarization from Multiple Documents
- 2009
Cited by 2 (2 self)
This work reports on research conducted on the domain of multi-document summarization using background knowledge. The research focuses on summary evaluation and the implementation of a set of generic use tools for NLP tasks and especially for automatic summarization. Within this work we formalize the n-gram graph representation and its use in NLP tasks. We present the use of n-gram graphs for the tasks of summary evaluation, content selection, novelty detection and redundancy removal. Furthermore, we present a set of algorithmic constructs and methodologies, based on the notion of n-gram graphs, that aim to support meaning extraction and textual quality quantification.
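As a rough illustration of the n-gram graph idea, the sketch below builds a graph whose nodes are character n-grams and whose weighted edges link n-grams that occur within a fixed window of each other. This is a simplified reading, not the formalization from the work itself: the directed edges, the window rule, the `edge_overlap` comparison, and all names and defaults are assumptions made for the example.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a simplified character n-gram graph.

    Returns a Counter mapping (earlier_ngram, later_ngram) to the number
    of times the two n-grams occur within `window` positions of each other.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        # Link this n-gram to the next `window` n-grams in the text.
        for j in range(i + 1, min(i + window + 1, len(grams))):
            edges[(gram, grams[j])] += 1
    return edges


def edge_overlap(g1, g2):
    """Share of edges common to both graphs, relative to the larger graph.

    A hypothetical, deliberately simple similarity; edge weights are ignored.
    """
    if not g1 or not g2:
        return 0.0
    shared = sum(1 for edge in g1 if edge in g2)
    return shared / max(len(g1), len(g2))


if __name__ == "__main__":
    summary = ngram_graph("the cat sat on the mat")
    reference = ngram_graph("the cat sat on a mat")
    print(edge_overlap(summary, reference))
```

A comparison of this kind between a candidate summary graph and a reference graph is one way such a representation could feed summary evaluation or redundancy checks.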
Classifying Sentence-Based Summaries of Web Documents
Text classification categorizes Web documents in large collections into predefined classes based on their contents. Unfortunately, the classification process can be time-consuming and users are still required to spend a considerable amount of time scanning through the classified Web documents to identify the ones that satisfy their information needs. In solving this problem, we first introduce CorSum, an extractive single-document summarization approach, which is simple and effective in performing the summarization task, since it only relies on word similarity to generate high-quality summaries. Thereafter, we train a Naïve Bayes classifier on CorSum-generated summaries and verify the classification accuracy using the summaries and the speed-up during the process. Experimental results on the DUC-2002 and 20 Newsgroups datasets show that CorSum outperforms other extractive summarization methods, and classification time is significantly reduced using CorSum-generated summaries with comparable accuracy. More importantly, browsing summaries, instead of entire documents, classified to topic-oriented categories facilitates the information searching process on the Web.
unknown title
In a common law system, which is currently prevailing in countries like India, England, and USA, decisions made by judges are important sources of application and interpretation of law. The increasing availability of legal judgments in digital form
Automatic Summarization and Background Knowledge: Past, Present and Vision
This paper presents the automatic summarization problem and specifies a generic process for the automatic construction of summaries. Most recent approaches to summarization are presented as contributions to the various steps of this process. The paper elaborates on the grounds of multi-document automatic summarization and examines the use of background knowledge, indicating related, ongoing and recent efforts. Finally it discusses open problems, proposing further research directions in this field.

