Retrieving Collocations from Text: Xtract
- Computational Linguistics
, 1993
Cited by 229 (1 self)
Natural languages are full of collocations, recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages. Recent work in lexicography indicates that collocations are pervasive in English; apparently, they are common in all types of writing, including both technical and nontechnical genres. Several approaches have been proposed to retrieve various types of collocations from the analysis of large samples of textual data. These techniques automatically produce large numbers of collocations along with statistical figures intended to reflect the relevance of the associations. However, none of these techniques provides functional information along with the collocation. Also, the results produced often contained improper word associations reflecting some spurious aspect of the training corpus that did not stand for true collocations. In this paper, we describe a set of techniques based on statistical methods for retrieving and identifying collocations from large textual corpora. These techniques produce a wide range of collocations and are based on some original filtering methods that allow the production of richer and higher-precision output. These techniques have been implemented and resulted in a lexicographic tool, Xtract. The techniques are described and some results are presented on a 10 million-word corpus of stock market news reports. A lexicographic evaluation of Xtract as a collocation retrieval tool has been made, and the estimated precision of Xtract is 80%.
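As a rough illustration of what "co-occur more often than expected by chance" can mean in practice, the sketch below scores adjacent word pairs by pointwise mutual information (PMI). This is not Xtract's method, which the abstract does not spell out; PMI is swapped in here as a simple, well-known stand-in. The function name `bigram_pmi`, the `min_count` threshold, and the toy sentence are all hypothetical.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI compares the observed probability of a pair with the probability
    expected if the two words occurred independently. Positive values mean
    the pair occurs more often than chance would predict.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Toy input (an assumption for illustration, not data from the paper).
tokens = ("the stock market fell as the stock market index dropped "
          "and the bond market rose").split()
scores = bigram_pmi(tokens)
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 2))
```

On a real corpus a raw PMI ranking tends to favour rare pairs, which is one reason filtering steps such as those the abstract mentions are needed.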
Part-of-Speech Tagging and Partial Parsing
- Corpus-Based Methods in Language and Speech
, 1996
Cited by 85 (0 self)
m we can carve off next. `Partial parsing' is a cover term for a range of different techniques for recovering some but not all of the information contained in a traditional syntactic analysis. Partial parsing techniques, like tagging techniques, aim for reliability and robustness in the face of the vagaries of natural text, by sacrificing completeness of analysis and accepting a low but non-zero error rate. 1 Tagging The earliest taggers [35, 51] had large sets of hand-constructed rules for assigning tags on the basis of words' character patterns and on the basis of the tags assigned to preceding or following words, but they had only small lexica, primarily for exceptions to the rules. TAGGIT [35] was used to generate an initial tagging of the Brown corpus, which was then hand-edited. (Thus it provided the data that has since been used to train other taggers [20].) The tagger described by Garside [56, 34], CLAWS, was a probabilistic version of TAGGIT, and the DeRose tagger improved on
Extracting Nested Collocations
, 1996
Cited by 22 (1 self)
This paper provides an approach to the semi-automatic extraction of collocations from corpora using statistics. The growing availability of large textual corpora, and the increasing number of applications of collocation extraction, has given rise to various approaches on the topic. In this paper, we address the problem of nested collocations; that is, those being part of longer collocations. Most approaches till now treated substrings of collocations as collocations only if they appeared frequently enough by themselves in the corpus. These techniques left a lot of collocations unextracted. In this paper, we propose an algorithm for a semi-automatic extraction of nested uninterrupted and interrupted collocations, paying particular attention to nested collocations.
Storing Text Retrieval Systems on CD-ROM: Compression and Encryption Considerations
- ACM Transactions on Information Systems
, 1989
Cited by 21 (3 self)
The emergence of the CD-ROM as a storage medium for full-text databases raises the question of the maximum size database that can be contained by this medium. As an example, the problem of storing the Trésor de la Langue Française on a CD-ROM is examined in this paper. The text alone of this database is 700 MB long, more than a CD-ROM can hold. But in addition the dictionary and concordance needed to access this data must be stored. A further constraint is that some of the material is copyrighted, and it is desirable that such material be difficult to decode except through software provided by the system. Pertinent approaches to compression of the various files are reviewed and the compression of the text is related to the problem of data encryption: specifically, it is shown that, under simple models of text generation, Huffman encoding produces a bit-string indistinguishable from a representation of coin flips. Categories and Subject Descriptors: E.3 E.4 H.3.2 J.5 General terms: ...
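For readers unfamiliar with the Huffman encoding mentioned above, here is a minimal sketch of how such a prefix code can be built from character frequencies. It is a generic textbook construction, not the paper's implementation, and it does not attempt to demonstrate the coin-flip result. The names `huffman_code` and `encode` and the sample string are hypothetical.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(text, code):
    """Concatenate the code words for each symbol of the text."""
    return "".join(code[ch] for ch in text)

sample = "abracadabra"  # toy input, assumed for illustration
code = huffman_code(sample)
bits = encode(sample, code)
print(code)
print(len(bits), "bits instead of", 8 * len(sample))
```

Frequent symbols receive short code words and rare ones long code words, which is what makes the encoded bit string shorter than a fixed-width representation.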
From N-Grams To Collocations: An Evaluation Of XTRACT
, 1991
Cited by 14 (1 self)
In previous papers we presented methods for retrieving collocations from large samples of texts. We described a tool, Xtract, that implements these methods and is able to retrieve a wide range of collocations in a two-stage process. These methods, as well as other related methods, however, have some limitations.
Interactive Retrieval using IRIS: TREC-6 Experiments
- In
, 1998
"... this paper as we address, among other things: ..."
Extraction of multi-word collocations using syntactic bigram composition
- In Proceedings of the International Conference RANLP’03
, 2003
Cited by 10 (2 self)
This paper presents a method for extracting multi-word collocations (MWCs) from text corpora, which is based on the previous extraction of syntactically bound collocation bigrams. We describe an iterative word linking procedure which relies on a syntactic criterion and aims at building up arbitrarily long expressions that represent multi-word collocation candidates. We propose several measures to rank candidates according to the collocational strength, and we present the results of a trigram extraction experiment. The methodology used is particularly well-suited for the identification of those collocations whose terms are arbitrarily distant, due to syntactic processes (passivization, relativization, dislocation, topicalization).
A Stochastic Model Of Intonation For Text-To-Speech Synthesis
- Proceedings Eurospeech '97 (Rhodes
, 1998
Cited by 5 (2 self)
This paper presents a stochastic model of intonation contours for use in text-to-speech synthesis. The model has two modules, a linguistic module that generates abstract prosodic labels from text, and a phonetic module that generates an F0 curve from the abstract prosodic labels. This model differs from previous work in the abstract prosodic labels used, which can be automatically derived from the training corpus. This feature makes it possible to use large corpora or several corpora of different speech styles, in addition to making it easy to adapt to new languages. The present paper focuses on the linguistic module, which does not require full syntactic analysis of the text but simply relies on part-of-speech tagging. The results were validated on French by means of a perception test. Listeners did not perceive a signif...
(Footnote: This paper is based on a communication presented at Eurospeech'97 (Véronis et al. 1997) and has been recommended by the Editorial Board of Speech Communication.)
Bilingual Collocation Extraction Based on Syntactic and Statistical Analyses, ms
- Wu, CJ and J.S. Chang. Alignment of Collocation via Syntactic and Statistical Analyses. Proceedings of the fifteenth Research on Computational Linguistics Conference, ROCLING XV
, 2003
Cited by 4 (2 self)
In this paper, we describe an algorithm that employs syntactic and statistical analysis to extract bilingual collocations from a parallel corpus. Collocations are pervasive in all types of writing and can be found in phrases, chunks, proper names, idioms, and terminology. Therefore, automatic extraction of monolingual and bilingual collocations is important for many applications, including natural language generation, word sense disambiguation, machine translation, lexicography, and cross-language information retrieval. Collocations can be classified as lexical or grammatical collocations. Lexical collocations exist between content words, while a grammatical collocation exists between a content word and function words or a syntactic structure. In addition, bilingual collocations can be rigid or flexible in both languages. In a rigid collocation, the words must appear next to each other; otherwise the collocation is flexible (elastic). We focus in this paper on extracting rigid lexical bilingual collocations. In our method, the preferred syntactic patterns are obtained from idioms and collocations in a machine-readable dictionary. Collocations matching the patterns are extracted from aligned sentences in a parallel corpus. We use a new alignment method based on punctuation statistics for sentence alignment. The punctuation-based approach is found to outperform the length-based approach with precision rates approaching 98%. The obtained collocations are subsequently matched up based on cross-linguistic statistical association. Statistical association between the whole collocations as well as words in collocations is used to link a collocation with its counterpart collocation in the other language. We implemented the proposed method on a very large Chinese-English parallel corpus and obtained satisfactory results.
NLP-assisted exploration of texts
- In Proceedings RIAO'2000 Content-Based Multimedia Information Access Paris
, 2000
Cited by 3 (0 self)
Retrieving information does not end with the identification of the top n relevant files. The identified text should then be presented to the user in a form that facilitates accessing the relevant information pieces. This work investigates the possibility of using lexical structures latent in the text to provide the reader with rich visual representations such as table-of-contents and topic index. The paper describes the approach for topic identification, reconstructing the hierarchical structure and the generation of sections' headings, as well as the back-end visualization system. 1 Introduction The cognitive process of reading expository text attempts to reconstruct the same framework of ideas the author had in mind when the text was originally written. The process is an exploratory one. The reader uses various devices as aids in this reconstruction, such as maps for orientation, or catalogs for objects being found in the process. If such devices are missing the process takes...

