Inter-Coder Agreement for Computational Linguistics
- COMPUTATIONAL LINGUISTICS
, 2008
Cited by 54 (1 self)
This article is a survey of methods for measuring agreement among corpus annotators. It exposes the mathematics and underlying assumptions of agreement coefficients, covering Krippendorff’s alpha as well as Scott’s pi and Cohen’s kappa; discusses the use of coefficients in several annotation tasks; and argues that weighted, alpha-like coefficients, traditionally less used than kappa-like measures in Computational Linguistics, may be more appropriate for many corpus annotation tasks – but that their use makes the interpretation of the value of the coefficient even harder.
The Fifth PASCAL Recognizing Textual Entailment Challenge
- In Proc. Text Analysis Conference (TAC’09)
, 2009
Cited by 48 (9 self)
This paper presents the Fifth Recognizing Textual Entailment Challenge (RTE-5). Following the positive experience of the last campaign, RTE-5 has been proposed for the second time as a track at the Text Analysis Conference (TAC). The structure of the RTE-5 Main Task remained unchanged, offering both the traditional two-way task and the three-way task introduced in the previous campaign. Moreover, a pilot Search Task was set up, consisting of finding all the sentences in a set of documents that entail a given hypothesis. Twenty-one teams participated in the campaign, 20 of them in the Main Task (for a total of 54 runs) and 8 in the Pilot Task (for a total of 20 runs). Another important innovation introduced in this campaign was mandatory ablation tests that participants had to perform for all major knowledge resources employed by their systems.
The pyramid method: incorporating human content selection variation in summarization evaluation
- ACM Transactions on Speech and Language Processing
, 2007
Cited by 35 (3 self)
Human variation in content selection in summarization has given rise to some fundamental research questions: How can one incorporate the observed variation in suitable evaluation measures? How can such measures reflect the fact that summaries conveying different content can be equally good and informative? In this paper we address these very questions by proposing a method for analysis of multiple human abstracts into semantic content units. Such analysis allows us not only to quantify human variation in content selection, but also to assign empirical importance weight to different content units. It serves as the basis for an evaluation method, the Pyramid Method, that incorporates the observed variation and is predictive of different equally informative summaries. We discuss the reliability of content unit annotation, the properties of Pyramid scores, and their correlation with other evaluation methods.
Automated Summarization Evaluation with Basic Elements
- In Proceedings of the Fifth Conference on Language Resources and Evaluation (LREC)
, 2006
Cited by 32 (3 self)
As part of evaluating a summary automatically, it is usual to determine how much of the contents of one or more human-produced ‘ideal’ summaries it contains. Past automated methods such as ROUGE compare using fixed word n-grams, which are not ideal for a variety of reasons. In this paper we describe a framework in which summary evaluation measures can be instantiated and compared, and we implement a specific evaluation method using very small units of content, called Basic Elements, that address some of the shortcomings of n-grams. This method is tested on DUC 2003, 2004, and 2005 systems
Evaluating DUC 2005 using Basic Elements
- Proceedings of DUC-2005
, 2005
Cited by 27 (3 self)
In this paper we introduce Basic Elements, a new way of automating the evaluation of text summaries. We show that this method correlates better with human judgments than any other automated procedure to date, and overcomes the subjectivity/variability problems of manual methods that require humans to preprocess summaries to be evaluated. This is demonstrated on DUC 2005 peer systems and
Scientific Paper Summarization Using Citation Summary Networks
Cited by 26 (9 self)
Quickly moving to a new area of research is painful for researchers due to the vast amount of scientific literature in each field of study. One possible way to overcome this problem is to summarize a scientific topic. In this paper, we propose a model of summarizing a single article, which can be further used to summarize an entire topic. Our model is based on analyzing others’ viewpoints of the target article’s contributions and the study of its citation summary network using a clustering approach.
A Skip-Chain Conditional Random Field for Ranking Meeting Utterances by Importance
- Association for Computational Linguistics
, 2006
Cited by 23 (0 self)
We describe a probabilistic approach to content selection for meeting summarization. We use skip-chain Conditional Random Fields (CRF) to model non-local pragmatic dependencies between paired utterances such as QUESTION-ANSWER that typically appear together in summaries, and show that these models outperform linear-chain CRFs and Bayesian models in the task. We also discuss different approaches for ranking all utterances in a sequence using CRFs. Our best performing system achieves 91.3% of human performance when evaluated with the Pyramid evaluation metric, which represents a 3.9% absolute increase compared to our most competitive non-sequential classifier.
Incorporating speaker and discourse features into speech summarization
- In: Proc. of the HLT-NAACL 2006
, 2006
Cited by 21 (10 self)
We have explored the usefulness of incorporating speech and discourse features in an automatic speech summarization system applied to meeting recordings from the ICSI Meetings corpus. By analyzing speaker activity, turn-taking and discourse cues, we hypothesize that such a system can outperform solely text-based methods inherited from the field of text summarization. The summarization methods are described, two evaluation methods are applied and compared, and the results clearly show that utilizing such features is advantageous and efficient. Even simple methods relying on discourse cues and speaker activity can outperform text summarization approaches.
A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization
- In Proc. of SIGIR
, 2006
Cited by 18 (3 self)
The usual approach for automatic summarization is sentence extraction, where key sentences from the input documents are selected based on a suite of features. While word frequency often is used as a feature in summarization, its impact on system performance has not been isolated. In this paper, we study the contribution to summarization of three factors related to frequency: content word frequency, composition functions for estimating sentence importance from word frequency, and adjustment of frequency weights based on context. We carry out our analysis using datasets from the Document Understanding Conferences, studying not only the impact of these features on automatic summarizers, but also their role in human summarization. Our research shows that a frequency based summarizer can achieve performance comparable to that of state-of-the-art systems, but only with a good composition function; context sensitivity improves performance and significantly reduces repetition.
A Methodology for Extrinsic Evaluation of Text Summarization: Does ROUGE Correlate?
- in Proceedings of the ACL 2005, Ann Arbor, 2005
, 2005
Cited by 17 (2 self)
This paper demonstrates the usefulness of summaries in an extrinsic task of relevance judgment based on a new method for measuring agreement, Relevance-Prediction, which compares subjects' judgments on summaries with their own judgments on full text documents. We demonstrate that, because this measure is more reliable than previous gold-standard measures, we are able to make stronger statistical statements about the benefits of summarization. We found positive correlations between ROUGE scores and two different summary types, where only weak or negative correlations were found using other agreement measures. However, we show that ROUGE may be sensitive to the choice of summarization style. We discuss the importance of these results and the implications for future summarization evaluations.

