Unsupervised Topic Segmentation Based on Word Cooccurrence and Multi-Word Units for Text Summarization
| Venue: | In Proceedings of the ELECTRA Workshop associated to 28th Annual International ACM SIGIR Conference |
| Citations: | 1 - 1 self |
BibTeX
@INPROCEEDINGS{Dias_unsupervisedtopic,
author = {Gaël Dias},
title = {Unsupervised Topic Segmentation Based on Word Cooccurrence and Multi-Word Units for Text Summarization},
booktitle = {In Proceedings of the ELECTRA Workshop associated to 28th Annual International ACM SIGIR Conference},
year = {},
pages = {41--48}
}
Abstract
Topic Segmentation is the task of breaking documents into topically coherent multi-paragraph subparts. In particular, it is extensively used in Passage Retrieval and Text Summarization to provide more coherent results by taking the raw document structure into account. However, most methodologies either are based on lexical repetition, which shows evident reliability problems, or rely on harvesting linguistic resources that are usually available only for dominant languages and therefore do not apply to less favored and emerging languages. Moreover, most systems have been evaluated on Choi’s data set [1], which is biased in favor of systems that rely mostly on lexical repetition. As a consequence, these systems have not been tested in real-world environments, and in practice they may yield worse results than those reported in the literature. To tackle these drawbacks, we present an innovative Topic Segmentation system that uses a new informative similarity measure based on word co-occurrences, and we evaluate it on a set of web documents in which Multiword Units have previously been identified.







