TextTiling: Segmenting text into multi-paragraph subtopic passages
- Computational Linguistics, 1997
Cited by 275 (1 self)
TextTiling is a technique for subdividing texts into multi-paragraph units that represent passages, or subtopics. The discourse cues for identifying major subtopic shifts are patterns of lexical co-occurrence and distribution. The algorithm is fully implemented and is shown to produce segmentation that corresponds well to human judgments of the subtopic boundaries of 12 texts. Multi-paragraph subtopic segmentation should be useful for many text analysis tasks, including information retrieval and summarization.
A Critique and Improvement of an Evaluation Metric for Text Segmentation
- Computational Linguistics, 2002
Cut as a Querying Unit for WWW, Netnews, and E-mail
- In Proc. of ACM Hypertext, 1998
Cited by 15 (4 self)
In this paper, we propose a query framework for hypertext data in general, and for WWW pages, Netnews articles, and e-mails in particular. In existing query tools for hypertext data, such as WWW search engines or intelligent news/mail readers, the data units in a query are typically individual nodes. In actual hypertext data, however, one topic is often described over a series of connected nodes, and therefore the logical data unit should be such a series of nodes corresponding to one topic. This discrepancy between the data unit in a query and the logical data unit hinders efficient information discovery from hypertext data. To solve this problem, in our framework, we divide hypertexts into connected subgraphs corresponding to individual topics, and we use those subgraphs as the data units in queries.
A Corpus-Based Approach to Text Partition
- Proceedings of International Conference of Recent Advances on Natural Language Processing, Tzigov Chark, 1995
Cited by 3 (3 self)
A text partition model is proposed to determine the boundaries of discourse structures. It is based on the association of noun-noun relations and noun-verb relations defined on the discourse level and the sentence level, respectively. Three factors are considered: 1) repetition of words, 2) importance of words, and 3) collocational semantics. A window is moved from the first sentence to the last one, and the association norm for the sentences in the current window is calculated. Finally, the peaks in the graph of sentence position vs. association norm form the potential discourse boundaries. Ten texts randomly selected from the LOB corpus are used as the test texts. The experimental results are compared with the readers' judgments and the real boundaries in the test texts. The applications of the results to sentence alignment, topic identification, topic shift and topic abstraction are discussed.
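The window-and-peaks procedure in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' model: the score below is a stand-in that counts only word repetition (factor 1), ignoring word importance and collocational semantics, and the names `window_scores`, `local_peaks`, and the `size` parameter are hypothetical.

```python
from collections import Counter


def window_scores(sentences, size=3):
    """Score each window of `size` consecutive sentences by word repetition.

    Stand-in for the abstract's "association norm" (assumption): a word that
    occurs in k sentences of the window contributes k - 1 to the score.
    """
    tokenized = [s.lower().split() for s in sentences]
    scores = []
    for start in range(len(tokenized) - size + 1):
        window = tokenized[start:start + size]
        # Count each word once per sentence, then sum the repeats.
        counts = Counter(w for sent in window for w in set(sent))
        scores.append(sum(k - 1 for k in counts.values()))
    return scores


def local_peaks(scores):
    """Return positions whose score is strictly greater than both neighbours."""
    return [i for i in range(1, len(scores) - 1)
            if scores[i - 1] < scores[i] > scores[i + 1]]
```

Calling `local_peaks(window_scores(sentences))` yields candidate window positions; how a window position maps to a sentence boundary is left open here, since the abstract does not specify it.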
Exploratory Analysis of Concept and Document Spaces with Connectionist Networks
- Artificial Intelligence and Law, 1999
Cited by 2 (0 self)
Exploratory analysis is an area of increasing interest in the computational linguistics arena. Pragmatically speaking, exploratory analysis may be paraphrased as natural language processing by means of analyzing large corpora of text. Appropriate means for the analysis are statistics, on the one hand, and artificial neural networks, on the other hand. As a challenging application area for exploratory analysis of text corpora we may certainly identify text databases, be they information retrieval or information filtering systems. In this paper we present recent findings of exploratory analysis based on both statistical and neural models applied to legal text corpora. Concerning the artificial neural networks, we rely on a model adhering to the unsupervised learning paradigm. This choice appears natural when taking into account the specific properties of large text corpora where one is faced with the fact that input-output-mappings as required by supervised learning models ca...
A Scene-based Model of Word Prediction
- In citeseer.ist.psu.edu/97151.html, 2004
Cited by 2 (0 self)
This paper proposes a semantico-statistical model of word prediction which uses local context of each scene in a text. Scenes are text segments, each of which displays local context. The occurrence of a word inside a scene is predicted by its local context. On the other hand, a text (defined as a sequence of scenes) is less constrained by word occurrence, so word prediction becomes more difficult. The proposed model reads a given text word by word and predicts the succeeding word at each step according to the local context of the current scene. The prediction procedure consists of (1) detecting scene boundaries, (2) extracting local context from the current scene, and (3) predicting the succeeding word according to the extracted context. An experiment on a short story showed that the proposed method gives (a) a lower prediction error than do other models and (b) scene boundaries correlated with those identified by human subjects. 1 Introduction Recent studies in corpus linguistics h...
Text Segmentation into Paragraphs Based on Local Text Cohesion
Cited by 1 (1 self)
The problem of automatic text segmentation is subcategorized into two different problems: thematic segmentation into rather large topically self-contained sections and splitting into paragraphs, i.e., lower-level lexico-grammatical segmentation. In this paper we consider the latter problem. We propose a method of reasonably splitting text into paragraphs based on a text cohesion measure. Specifically, we propose a method of quantitative evaluation of text cohesion based on a large linguistic resource – a collocation network. At each step, our algorithm compares word occurrences in a text against a large database of collocations and semantic links between words in the given natural language. The procedure consists of evaluating the cohesion function, smoothing and normalizing it, and comparing it with a specially constructed threshold.
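The smoothing, normalization, and thresholding steps named in the abstract above might look roughly like the following Python sketch. Several things here are assumptions rather than the paper's method: the cohesion values are taken as given (the collocation network is not modelled), smoothing is a plain moving average, normalization is min-max scaling, the threshold is a fixed constant instead of the "specially constructed" one, and low cohesion is assumed to signal a break. The function names and the `radius` and `threshold` parameters are hypothetical.

```python
def smooth(values, radius=1):
    """Moving average over a window of up to 2 * radius + 1 values."""
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - radius), min(len(values), i + radius + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def normalize(values):
    """Min-max scale to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def paragraph_breaks(cohesion, threshold=0.3, radius=1):
    """Return positions whose smoothed, normalized cohesion is below threshold.

    Assumption: `cohesion[i]` is a precomputed cohesion value at candidate
    break position i, and low cohesion suggests a paragraph break.
    """
    scaled = normalize(smooth(cohesion, radius))
    return [i for i, v in enumerate(scaled) if v < threshold]
```

With `radius=0` the smoothing step is a no-op, which makes it easy to see the effect of the threshold alone before widening the window.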
A new hybrid summarizer based on Vector Space model, Statistical Physics and Linguistics
In this article we present a hybrid approach for automatic summarization of Spanish medical texts. There are many systems for automatic summarization that use statistics or linguistics, but only a few that combine both techniques. Our idea is that, to reach a good summary, we need to use the linguistic aspects of texts, but we should also benefit from the advantages of statistical techniques. We have integrated the Cortex (Vector Space Model) and Enertex (statistical physics) systems, coupled with the Yate term extractor, and the Disicosum system (linguistics). We have compared these systems and afterwards integrated them in a hybrid approach. Finally, we have applied this hybrid system to a corpus of medical articles and evaluated its performance, obtaining good results.
Tracking Morphological and Semantic Co-occurrences in Spontaneous Dialogues
e seen as aspects of topic tracking. The classical mechanism for lexical prediction is the use of N-gram statistics for the surface forms of the relevant lexical items. For the purposes of speech recognition and disambiguation in spontaneous language, however, this technique is unsatisfactory in two respects. First, the range of predictions is too short, as predictions are usually made over a distance of no more than five words [Church, 1990]. To support bottom-up recognition and analysis of noisy material containing gaps and fragments, longer-range predictions are needed as well. Long-range pre-
Tracking Morphological and Semantic Co-occurrences in Spontaneous Dialogues
Mark Seligman, Université Joseph Fourier, GETA, CLIPS, IMAG-campus, BP 53, 385, rue de la Bibliothèque, 38041 Grenoble Cedex 9, France, seligman@cerfnet.com
Jan Alexandersson, German Research Institute of Computer Science, DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Ger

