Comparing Automatic and Human Evaluation of NLG Systems
In Proc. EACL’06, 2006
Cited by 32 (12 self)
We consider the evaluation problem in Natural Language Generation (NLG) and present results for evaluating several NLG systems with similar functionality, including a knowledge-based generator and several statistical systems. We compare evaluation results for these systems by human domain experts, human non-experts, and several automatic evaluation metrics, including NIST, BLEU, and ROUGE. We find that NIST scores correlate best (> 0.8) with human judgments, but that all automatic metrics we examined are biased in favour of generators that select on the basis of frequency alone. We conclude that automatic evaluation of NLG systems has considerable potential, in particular where high-quality reference texts and only a small number of human evaluators are available. However, in general it is probably best for automatic evaluations to be supported by human-based evaluations, or at least by studies that demonstrate that a particular metric correlates well with human judgments in a given domain.
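The kind of metric-versus-human correlation reported in this abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not say which correlation coefficient was used, so Pearson's r is an assumption here, and the scores below are invented purely for demonstration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances are left unnormalised; the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)

# Hypothetical per-system scores (one entry per generator), not from the paper.
metric_scores = [0.31, 0.45, 0.52, 0.60, 0.72]   # e.g. an automatic metric
human_ratings = [2.1, 3.0, 3.2, 3.9, 4.4]        # e.g. mean human rating

r = pearson(metric_scores, human_ratings)
print(f"correlation between metric and human ratings: {r:.3f}")
```

A high coefficient on such data only shows agreement at the system level for the systems tested, which is why the abstract recommends validating a metric per domain.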
Using n-grams to understand the nature of summaries
In Proceedings of HLT/NAACL’04, 2004
Cited by 12 (1 self)
Although single-document summarization is a well-studied task, the nature of multi-document summarization is only beginning to be studied in detail. While close attention has been paid to what technologies are necessary when moving from single to multi-document summarization, the properties of human-written multi-document summaries have not been quantified. In this paper, we empirically characterize human-written summaries provided in a widely used summarization corpus by attempting to answer the questions: Can multi-document summaries that are written by humans be characterized as extractive or generative? Are multi-document summaries less extractive than single-document summaries? Our results suggest that extraction-based techniques which have been successful for single-document summarization may not be sufficient when summarizing multiple documents.
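One simple way to quantify how extractive a summary is, in the spirit of the n-gram analysis named in the title, is to measure the fraction of summary n-grams that also occur in the source documents. The sketch below is an assumption about what such a measure could look like, not the paper's exact method; the function names and the whitespace tokenisation are hypothetical simplifications.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_coverage(summary, sources, n):
    """Fraction of the summary's n-grams that appear in any source document.

    Values near 1.0 suggest an extractive summary; lower values suggest
    that the summary contains more newly generated wording.
    """
    summary_grams = ngrams(summary.lower().split(), n)
    if not summary_grams:
        return 0.0
    source_grams = set()
    for doc in sources:
        source_grams.update(ngrams(doc.lower().split(), n))
    hits = sum(1 for gram in summary_grams if gram in source_grams)
    return hits / len(summary_grams)

# Toy data, invented for illustration.
sources = ["the cat sat on the mat", "a dog barked loudly"]
summary = "the cat sat loudly"

unigram_cov = ngram_coverage(summary, sources, 1)
bigram_cov = ngram_coverage(summary, sources, 2)
print(unigram_cov, bigram_cov)
```

Coverage usually drops as n grows, so comparing the drop for single-document and multi-document summaries gives a rough picture of how much each relies on copied passages.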
Intrinsic vs. extrinsic evaluation measures for referring expression generation
In Proc. 46th Annual Meeting of the Association for Computational Linguistics (ACL-08), 2008
Cited by 9 (4 self)
In this paper we present research in which we apply (i) the kind of intrinsic evaluation metrics that are characteristic of current comparative HLT evaluation, and (ii) extrinsic, human task-performance evaluations more in keeping with NLG traditions, to 15 systems implementing a language generation task. We analyse the evaluation results and find that there are no significant correlations between intrinsic and extrinsic evaluation measures for this task.
Sentiment summarization: Evaluating and learning user preferences
In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), 2009
Cited by 8 (2 self)
We present the results of a large-scale, end-to-end human evaluation of various sentiment summarization models. The evaluation shows that users have a strong preference for summarizers that model sentiment over non-sentiment baselines, but have no broad overall preference between any of the sentiment-based models. However, an analysis of the human judgments suggests that there are identifiable situations where one summarizer is generally preferred over the others. We exploit this fact to build a new summarizer by training a ranking SVM model over the set of human preference judgments that were collected during the evaluation, which results in a 30% relative reduction in error over the previous best summarizer.
Generation of Reference Summaries
In Proceedings of 2nd Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, 2005
Cited by 3 (2 self)
We have constructed an integrated web-based system for collecting extract-based corpora and for evaluating summaries and summarization systems. During evaluation and examination of the collected and generated data we found that, when agreement among the informants is low, the corpus unduly favors summarization systems that use sentence position as a central weighting feature. The problem is discussed and a possible solution is outlined.
1. Background
When developing text summarizers and other information extraction tools, it is extremely difficult to assess their performance. One reason is that evaluation is time-consuming and requires considerable manual effort. Whenever the architecture of the summarizer changes, the evaluation process has to be carried out again. It would therefore be useful to have a tool that can assess the output of a text summarizer directly, repeatedly and automatically. For this reason we have constructed the KTH extract tool to create an extract corpus that can be used to evaluate text summarizers. Creating the extract corpus requires a large group of human informants, but once the corpus is in place it can be reused with little effort. Another advantage is that an extract corpus can be created in any language and used to evaluate any language-dependent text summarizer, as long as one is sure about the quality of the corpus. Using the extract corpus to evaluate a summarizer requires careful preparation of the corpus. It is also important to discuss in what sense the extract corpus can correspond to the output of the summarizer. The specific targets for our evaluation are the SweSum text summarizer for Swedish news text and the DanSum text summarizer for Danish news text. SweSum is a text summarizer mainly developed to summarize Swedish news text (Dalianis 2000). SweSum works at the sentence level: it extracts sentences, judges the relevance of each sentence, and then creates a shorter text (a non-redundant extract) containing the highest-ranking sentences from the original text. SweSum has been ported to English, Spanish, French, Danish, Norwegian, German and Farsi so far. SweSum is freely available online at
Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
Cited by 3 (0 self)
To date, few attempts have been made to develop and validate methods for automatic evaluation of linguistic quality in text summarization. We present the first systematic assessment of several diverse classes of metrics designed to capture various aspects of well-written text. We train and test linguistic quality models on consecutive years of NIST evaluation data in order to show the generality of results. For grammaticality, the best results come from a set of syntactic features. Focus, coherence and referential clarity are best evaluated by a class of features measuring local coherence on the basis of cosine similarity between sentences, coreference information, and summarization-specific features. Our best results are 90% accuracy for pairwise comparisons of competing systems over a test set of several inputs and 70% for ranking summaries of a specific input.
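The cosine-similarity signal mentioned in this abstract can be illustrated with a small sketch that scores each pair of adjacent sentences. This assumes plain bag-of-words term counts and whitespace tokenisation, which is a simplification chosen for the example; the abstract does not specify the representation the authors used, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine(bag_a, bag_b):
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(count * bag_b[word] for word, count in bag_a.items() if word in bag_b)
    norm_a = sqrt(sum(c * c for c in bag_a.values()))
    norm_b = sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence shares nothing with its neighbour
    return dot / (norm_a * norm_b)

def adjacent_similarities(sentences):
    """Cosine similarity for each pair of consecutive sentences."""
    bags = [Counter(s.lower().split()) for s in sentences]
    return [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]

# Toy summary, invented for illustration.
summary = [
    "the storm hit the coast",
    "the storm hit the coast",
    "officials opened shelters",
]
scores = adjacent_similarities(summary)
print(scores)
```

Statistics over these scores (for example the mean or minimum) could then serve as features in a learned quality model; on their own they capture only lexical continuity, not coreference or syntax.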
Looking for a Few Good Metrics: Automatic Summarization Evaluation — How Many Samples Are Enough?
Kernel-based Approach for Automatic Evaluation of Natural Language Generation Technologies: Application to Automatic Summarization
In order to promote the study of automatic summarization and translation, we need an accurate automatic evaluation method that is close to human evaluation. In this paper, we present an evaluation method that is based on convolution kernels that measure the similarities between texts considering their substructures. We conducted an experiment using automatic summarization evaluation data developed for Text Summarization Challenge 3 (TSC-3). A comparison with conventional techniques shows that our method correlates more closely with human evaluations and is more robust.
Leveraging Structural Relations for Fluent Compressions at Multiple Compression Rates
Prior approaches to sentence compression have taken low-level syntactic constraints into account in order to maintain grammaticality. We propose and successfully evaluate a more comprehensive, generalizable feature set that takes syntactic and structural relationships into account in order to sustain variable compression rates while making compressed sentences more coherent, grammatical and readable.
WHUSUM: Wuhan University at the Update Summarization Task of TAC 2009
This paper describes the system WHUSUM we developed to participate in the update summarization task of TAC 2009. Given a topic and corresponding topic statement, this year's task is to write 2 summaries (one for Document Set A and one for Document Set B) that meet the information need expressed in the topic statement. In order to generate a topic-oriented summary for Set A, we present a co-training based strategy to select the topic-relevant sentences from two abundant views and adopt a graph-based ranking algorithm (i.e. GRASSHOPPER) to achieve both information richness and content diversity in the generated summary. Furthermore, to capture the novel information in Set B and remove possibly redundant information from the historical Document Set A, we propose two approaches to encourage novelty. One is to incorporate the similarity between sentences in the historical set and the current set into the prior ranking of GRASSHOPPER. The other is to rank sentences for Document Set B directly first, and then to adjust their ranking scores based on a content comparison between the relevant sentence sets in A and B. The official evaluation results show that our system achieves competitive performance in the general topic-oriented summarization task and ranks in the middle among the 52 submitted systems in the update summarization task, which demonstrates that there is still considerable room to improve the system's novelty detection mechanism.

