A Corpus-based Approach to Automatic Compound Extraction
- Mexico State University
, 1994
Cited by 14 (3 self)
An automatic compound retrieval method is proposed to extract compounds within a text message. It uses n-gram mutual information, relative frequency count and parts of speech as the features for compound extraction. The problem is modeled as a two-class classification problem based on the distributional characteristics of n-gram tokens in the compound and the non-compound clusters. The recall and precision using the proposed approach are 96.2% and 48.2% for bigram compounds and 96.6% and 39.6% for trigram compounds for a testing corpus of 49,314 words. A significant cutdown in processing time has been observed.
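The mutual-information feature and the recall/precision figures above can be illustrated with a minimal sketch. It assumes plain pointwise mutual information over adjacent word pairs and a simple set-based evaluation; the function names `bigram_pmi` and `precision_recall` are hypothetical, and the paper's relative-frequency and part-of-speech features and its two-class classifier are not reproduced here.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi            # relative frequency of the pair
        p_x = unigrams[w1] / n_uni     # relative frequency of each word
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def precision_recall(predicted, gold):
    """Precision and recall of predicted compounds against a gold set."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Toy usage: candidate pairs ranked by PMI, highest first.
tokens = "new york is big and new york is old".split()
ranked = sorted(bigram_pmi(tokens).items(), key=lambda kv: kv[1], reverse=True)
```

A high recall paired with a lower precision, as reported in the abstract, corresponds to a candidate set that contains nearly every gold compound along with many non-compounds.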
Hierarchical Clustering of Words and Application to NLP Tasks
Cited by 9 (0 self)
This paper describes a data-driven method for hierarchical clustering of words and clustering of multiword compounds. A large vocabulary of English words (70,000 words) is clustered bottom-up, with respect to corpora ranging in size from 5 million to 50 million words, using mutual information as an objective function. The resulting hierarchical clusters of words are then naturally transformed to a bit-string representation of (i.e. word bits for) all the words in the vocabulary. Evaluation of the word bits is carried out through the measurement of the error rate of the ATR Decision-Tree Part-Of-Speech Tagger. The same clustering technique is then applied to the classification of multiword compounds. In order to avoid the explosion of the number of compounds to be handled, compounds in a small subclass are bundled and treated as a single compound. Another merit of this approach is that we can avoid the data sparseness problem which is ubiquitous in corpus statistics. The quality of one of the obtained compound classes is examined and compared to a conventional approach.
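The step from a cluster hierarchy to word bits can be sketched as follows. This is only an illustration under stated assumptions: the hierarchy is taken to be a binary tree given as nested pairs, each word's bit string is its path from the root (0 for left, 1 for right), and the toy tree and the name `word_bits` are hypothetical. The paper's mutual-information clustering itself, and its exact bit-string format, are not reproduced.

```python
def word_bits(tree, prefix=""):
    """Map each leaf word to the bit string of its path from the root.

    A tree is either a word (str) or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        return {tree: prefix}
    left, right = tree
    bits = word_bits(left, prefix + "0")
    bits.update(word_bits(right, prefix + "1"))
    return bits


# Hypothetical toy hierarchy; a real one would come from bottom-up clustering.
toy_tree = ((("monday", "tuesday"), "week"), ("run", "walk"))
bits = word_bits(toy_tree)
```

Words that were merged early in the clustering share a long bit prefix, so a downstream model such as a decision-tree tagger can ask questions about individual bits rather than about individual words.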
Integration Of Visual Inter-word Constraints And Linguistic Knowledge In Degraded Text Recognition
- in Proceedings of 32nd Annual Meeting of Association for Computational Linguistics
, 1994
Cited by 5 (4 self)
Degraded text recognition is a difficult task. Given a noisy text image, a word recognizer can be applied to generate several candidates for each word image. High-level knowledge sources can then be used to select a decision from the candidate set for each word image. In this paper, we propose that visual inter-word constraints can be used to facilitate candidate selection. Visual inter-word constraints provide a way to link word images inside the text page, and to interpret them systematically.
Where the Tagger Falters
- In Proceedings of the Fourth Conference on Theoretical and Methodological Issues in Machine Translation
, 1992
Cited by 3 (0 self)
Statistical n-gram taggers like that of [Church 1988] or [Foster 1991] assign a part-of-speech label to each word in a text on the basis of probability estimates that are automatically derived from a large, already tagged training corpus. This paper examines the grammatical constructions which cause such taggers to falter most frequently. As one would expect, certain of these errors are due to linguistic dependencies that extend beyond the limited scope of statistical taggers, while others can be seen to derive from the composition of the tag set; many can only be corrected through a full syntactic or semantic analysis of the sentence. The paper goes on to consider two very different approaches to the problem of automatically detecting tagging errors. The first uses statistical information that is already at the tagger's disposal; the second attempts to isolate error-prone contexts by formulating linguistic diagnostics in terms of regular expressions over tag sequences. In a small exper...
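The second approach, diagnostics written as regular expressions over tag sequences, might look roughly like the sketch below. The pattern is a made-up example (a determiner directly followed by a finite verb), the Penn Treebank-style tag names are an assumption rather than the paper's tag set, and `flag_suspect_contexts` is a hypothetical helper name.

```python
import re

# Hypothetical diagnostic: a determiner tag immediately followed by a
# finite-verb tag, which often signals a noun mis-tagged as a verb.
SUSPECT = re.compile(r"\bDT (VBZ|VBP|VBD)\b")


def flag_suspect_contexts(tagged):
    """Return token indices where the diagnostic pattern starts.

    tagged is a list of (word, tag) pairs.
    """
    tag_string = " ".join(tag for _, tag in tagged)
    hits = []
    for match in SUSPECT.finditer(tag_string):
        # Tags are space-separated, so counting the spaces before the
        # match converts a character offset into a token index.
        hits.append(tag_string[:match.start()].count(" "))
    return hits


sentence = [("he", "PRP"), ("saw", "VBD"), ("the", "DT"), ("saw", "VBZ")]
suspects = flag_suspect_contexts(sentence)
```

Each flagged index marks a context worth a second look, for example by a human corrector or a more expensive analysis, rather than a confirmed tagging error.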
Key Technologies for Multilingual Information Processing on WWW
- In Proceedings of the Fourth International Symposium on Standardization of Multilingual Information Technology (MLIT-4)
, 1999
Cited by 2 (1 self)
This paper discusses key technologies required to realize a document database which is the multilingual collection of documents typically seen on WWW, and to realize a system which supports easy access to such multilingual information. Specifically, we focus on such techniques as 1) cross-language information retrieval (CLIR), which supports conversion of cultural factors such as units, era names and color names, and 2) an algorithm for automatic identification of the language and coding system of documents. The goal of our research is to develop a system which supports end-user access to multilingual information by integrating these techniques.
1 Introduction
With the growth of the Internet and WWW in recent years, documents written in various languages are being provided. Although 80% of current Web pages are written in English, it is estimated that over half of Web documents will be non-English in 2003 [1]. Therefore, WWW can be regarded as a huge document database which contains...
POINTER Project Final Report
Cited by 2 (0 self)
This document is the Final Report of the POINTER Project (for more information on the goals and history of this project, see II: "POINTER Goals and Methodology"). The Final Report has been collated by the POINTER Workpackage 7 team (Khurshid Ahmad, Robin Bonthrone, Gert Engel, Aggeliki Fotopoulou, Deborah Fry, Christian Galinski, John Humbley, Norbert Kalfon, Margaret Rogers, Corentin Roulin, Katharina Schmalenbach and Eberhard Tanke) under the co-ordination of Deborah Fry. The following material has been used in this task:
The Mega-Word Tagged-Corpus Project
- In Proceedings of the Fifth International Conference on Theoretical and Methodological Issues in Machine Translation
, 1993
Cited by 1 (0 self)
Large corpora with part-of-speech tagging play a very important role in recent statistics-based and example-based natural language processing systems. However, no such corpora have become widely available for Japanese so far. Because the Japanese language has no explicit word boundaries, it is impossible even to count words without a corpus that has at least word segmentations. This paper describes our attempts to develop, in a semi-mechanized way, a tagged corpus with over one million words taken from Japanese newspaper articles. After dividing the original text into many chunks, we analyze the first chunk by using a Japanese morphological analyzer, and correct the output manually; then, using that result, we improve the morphological analyzer and go on to the next chunk. Thus, the quality of the morphological analyzer increases at each iteration, decreasing the effort required for manual editing of the following chunks. Our experience in the first iteration of this 'boot-strapping' process has been encouraging.
Type-based and Token-based Learning of Kanji Morphemes
Cited by 1 (1 self)
In this paper we discuss the performance of kanji morpheme extraction and kanji sequence decomposition, both based on the same bigram statistics, focusing on the effect of type-based and token-based training. The experiment shows that type-based training gives consistently better performance, which has both practical and theoretical importance.
Automatic Extraction of Frozen Patterns from Corpora Using Cost Criteria
In this paper, we propose a method for automatically extracting frozen patterns from corpora by introducing cost criteria. By considering frozen patterns as one unit, cost criteria make it possible to measure quantitatively the extent to which processing is reduced. The proposed method is evaluated through experiments using a Japanese corpus. We also show that morphological-level errors are greatly reduced by incorporating frozen patterns into a morphological analysis module.
1 Introduction
In natural language processing, including machine translation, idiomatic expressions (idioms) and frozen patterns play an important role. An idiomatic expression is a sequence of several words that, once joined, carries as a whole a meaning that does not follow from the meanings of the individual words. For this reason, such expressions need to be processed as a single unit. Frozen patterns are expressions that are used frequently in text but, unlike idioms, their overall meaning can basically be grasped through word-by-word processing. From the standpoint of processing efficiency, however, it is desirable to treat frozen patterns as single units as well. How to collect these idioms and frozen patterns is an important problem in natural language processing. In addition, recently, example-based natural language processing (Example-based NLP)...
Extracting Case Relations from Corpora
, 1997
Description of a system for the automatic acquisition of verbal case frames from corpora. The key target is to acquire domain-specific relations rather than the standard relationships found in general dictionaries. Results of experiments on Ecran (and other) corpora are reported.
ECRAN LE 2110, Deliverable 2.4 (Restricted): D-2.4 Extracting Case Relations from Corpora
Authors: Roberto Basili, Maria-Teresa Pazienza, Paola Velardi (ANC and TV), Roberta Catizone, Robin Collier, Mark Stevenson, Yorick Wilks (SHE), Olivier Ansaldi, Alpha Luk, Barbara Vauthey (FRI), Jean-Michel Grandchamp (THO)
1 Introduction
In this document we describe a system for the automatic acquisition of verbal case frames from corpora. The lexicon is acknowledged as one of the major components of NLP and MT systems. It is broadly agreed that the most successful implementations of NLP-based systems so far have been those based on lex...

