Results 1 - 10
of
41
Advances in Domain Independent Linear Text Segmentation
, 2000
"... This paper describes a method for linear text seg- mc. ntation which is twice as accurate and over seven times as fast as the state-of-the-art (Reynar, 1998). Inter-sentence similarity is replaced by rank in the local context. Boundary locations are discovered by divisive clustering. ..."
Abstract
-
Cited by 100 (1 self)
- Add to MetaCart
This paper describes a method for linear text seg- mc. ntation which is twice as accurate and over seven times as fast as the state-of-the-art (Reynar, 1998). Inter-sentence similarity is replaced by rank in the local context. Boundary locations are discovered by divisive clustering.
Inter-Coder Agreement for Computational Linguistics
- COMPUTATIONAL LINGUISTICS
, 2008
"... This article is a survey of methods for measuring agreement among corpus annotators. It exposes the mathematics and underlying assumptions of agreement coefficients, covering Krippendorff’s alpha as well as Scott’s pi and Cohen’s kappa; discusses the use of coefficients in several annotation tasks; ..."
Abstract
-
Cited by 54 (1 self)
- Add to MetaCart
This article is a survey of methods for measuring agreement among corpus annotators. It exposes the mathematics and underlying assumptions of agreement coefficients, covering Krippendorff’s alpha as well as Scott’s pi and Cohen’s kappa; discusses the use of coefficients in several annotation tasks; and argues that weighted, alpha-like coefficients, traditionally less used than kappa-like measures in Computational Linguistics, may be more appropriate for many corpus annotation tasks – but that their use makes the interpretation of the value of the coefficient even harder.
Latent Semantic Analysis for Text Segmentation
- In Proceedings of EMNLP
, 2001
"... This paper describes a method for linear text segmentation that is more accurate or at least as accurate as state-of-the-art methods (Utiyama and Isahara, 2001 ..."
Abstract
-
Cited by 44 (1 self)
- Add to MetaCart
This paper describes a method for linear text segmentation that is more accurate or at least as accurate as state-of-the-art methods (Utiyama and Isahara, 2001
Enhanced Topic Distillation using Text, Markup Tags, and Hyperlinks
- In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval , ACM
, 2001
"... Topic distillation is the analysis of hyperlink graph structure to identify mutually reinforcing authorities (popular pages) and hubs (comprehensive lists of links to authorities). Topic distillation is becoming common in Web search engines, but the best-known algorithms model the Web graph at a coa ..."
Abstract
-
Cited by 41 (0 self)
- Add to MetaCart
Topic distillation is the analysis of hyperlink graph structure to identify mutually reinforcing authorities (popular pages) and hubs (comprehensive lists of links to authorities). Topic distillation is becoming common in Web search engines, but the best-known algorithms model the Web graph at a coarse grain, with whole pages as single nodes. Such models may lose vital details in the markup tag structure of the pages, and thus lead to a tightly linked irrelevant subgraph winning over a relatively sparse relevant subgraph, a phenomenon called topic drift or contamination. The problem gets especially severe in the face of increasingly complex pages with navigation panels and advertisement links. We present an enhanced topic distillation algorithm which analyzes text, the markup tag trees that constitute HTML pages, and hyperlinks between pages. It thereby identifies subtrees which have high text- and hyperlink-based coherence w.r.t. the query. These subtrees get preferential treatment in the mutual reinforcement process. Using over 50 queries, 28 from earlier topic distillation work, we analyzed over 700 000 pages and obtained quantitative and anecdotal evidence that the new algorithm reduces topic drift. Topic areas: Citation and Link Analysis, Machine Learning for IR, Web IR. 1
Coreference for NLP Applications
- In Proceedings ACL
, 1997
"... This paper presents several techniques for performing automatic coreference annotation and perfor- mance results for each of them. To demonstrate that they can be applied to real-world data, we have built a simple question-answering system which uses the techniques. ..."
Abstract
-
Cited by 36 (0 self)
- Add to MetaCart
This paper presents several techniques for performing automatic coreference annotation and perfor- mance results for each of them. To demonstrate that they can be applied to real-world data, we have built a simple question-answering system which uses the techniques.
Minimum cut model for spoken lecture segmentation
- In Proceedings of the Annual Meeting of the Association for Computational Linguistics (COLING-ACL 2006
, 2006
"... We consider the task of unsupervised lecture segmentation. We formalize segmentation as a graph-partitioning task that optimizes the normalized cut criterion. Our approach moves beyond localized comparisons and takes into account longrange cohesion dependencies. Our results demonstrate that global a ..."
Abstract
-
Cited by 35 (7 self)
- Add to MetaCart
We consider the task of unsupervised lecture segmentation. We formalize segmentation as a graph-partitioning task that optimizes the normalized cut criterion. Our approach moves beyond localized comparisons and takes into account longrange cohesion dependencies. Our results demonstrate that global analysis improves the segmentation accuracy and is robust in the presence of speech recognition errors. 1
Twenty-One at TREC-8: using Language Technology for Information Retrieval
, 1999
"... This paper describes the ocial runs of the Twenty-One group for TREC-8. The Twenty-One group participated in the Ad-hoc, CLIR, Adaptive Filtering and SDR tracks. The main focus of our experiments is the development and evaluation of retrieval methods that are motivated by natural language processi ..."
Abstract
-
Cited by 32 (12 self)
- Add to MetaCart
This paper describes the ocial runs of the Twenty-One group for TREC-8. The Twenty-One group participated in the Ad-hoc, CLIR, Adaptive Filtering and SDR tracks. The main focus of our experiments is the development and evaluation of retrieval methods that are motivated by natural language processing techniques. The following new techniques are introduced in this paper. In the Ad-Hoc and CLIR tasks we experimented with automatic sense disambiguation followed by query expansion or translation. We used a combination of thesaurial and corpus information for the disambiguation process. We continued research on CLIR techniques which exploit the target corpus for an implicit disambiguation, by importing the translation probabilities into the probabilistic term-weighting framework. In ltering we extended the the use of language models for document ranking with a relevance feedback algorithm for query term reweighting. 1 Introduction Twenty-One 1 is a project funded by the EU Tele...
Statistical Models for Topic Segmentation
, 1999
"... Most documents are about more than one subject, but many NLP and IR techniques implicitly assume documents have just one topic. We describe new clues that mark shifts to new topics, novel algorithms for identifying topic boundaries and the uses of such boundaries once identified. We report topic seg ..."
Abstract
-
Cited by 31 (0 self)
- Add to MetaCart
Most documents are about more than one subject, but many NLP and IR techniques implicitly assume documents have just one topic. We describe new clues that mark shifts to new topics, novel algorithms for identifying topic boundaries and the uses of such boundaries once identified. We report topic segmentation performance on several corpora as well as improvement on an IR task that benefits from good segmentation.
A Statistical Model for Domain-Independent Text Segmentation
- In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics
, 2001
"... We propose a statistical method that finds the maximum-probability segmentation of a given text. This method does not require training data because it estimates probabilities from the given text. Therefore, it can be applied to any text in any domain. An experiment showed that the method is m ..."
Abstract
-
Cited by 28 (0 self)
- Add to MetaCart
We propose a statistical method that finds the maximum-probability segmentation of a given text. This method does not require training data because it estimates probabilities from the given text. Therefore, it can be applied to any text in any domain. An experiment showed that the method is more accurate than or at least as accurate as a state-of-the-art text segmentation system.
Discourse Segmentation in Aid of Document Summarization
- In Proceedings of Hawaii Int. Conf. on System Sciences (HICSS-33), Minitrack on Digital Documents Understanding, IEEE
, 2000
"... This paper describes work to enhance a sentencebased summarizer with notions of salience, dynamicallyadjustable summary size, discourse segmentation, and awareness of topic shifts. Our experiments study strategies to diversify the application of a baseline summarizer, by making it aware of finer-gra ..."
Abstract
-
Cited by 26 (0 self)
- Add to MetaCart
This paper describes work to enhance a sentencebased summarizer with notions of salience, dynamicallyadjustable summary size, discourse segmentation, and awareness of topic shifts. Our experiments study strategies to diversify the application of a baseline summarizer, by making it aware of finer-grained ‘aboutness’, capable of discerning changes of topic, and sensitive to longer-thanusual documents. Evaluated against the corpus used in the development of the baseline summarizer, summaries derived either by means of segmentation analysis alone, or by a mix of strategies for combining salience calculation and topic shift detection, are shown to be of comparable, and under certain conditions even better, quality. We describe the summarization and segmentation procedures, outline a number of strategies for mixing the two, evaluate the overall impact of discourse segmentation, and suggest an interface design capable of using the notion of topic shifts to contextualize a summary and facilitate the mediation between it and the full document source. 1.

