Information extraction: Identifying protein names from biological papers
- In Proceedings of the Pacific Symposium on Biocomputing '98 (PSB'98)
, 1998
Cited by 195 (5 self)
To solve the mystery of the life phenomenon, we must clarify when genes are expressed and how their products interact with each other. But since the amount of continuously updated knowledge on these interactions is massive and is only available in the form of published articles, an intelligent information extraction (IE) system is needed. To extract these information directly from articles, the system must firstly identify the material names. However, medical and biological documents often include proper nouns newly made by the authors, and conventional methods based on domain specific dictionaries cannot detect such unknown words or coinages. In this study, we propose a new method of extracting material names, PROPER, using surface clue on character strings. It extracts material names in the sentence with 94.70 % precision and 98.84 % recall, regardless of whether it is already known or newly defined.
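The precision and recall quoted in this abstract can be combined into a single balanced score when comparing against entries that report an F-measure. The sketch below assumes the standard weighted harmonic-mean definition (F-beta, with beta = 1 by default); the abstract itself reports only precision and recall, so the function name `f_measure` and the combined score are illustrative and are not figures from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta).

    With beta = 1 this is the balanced F1 score. Both inputs are
    fractions in [0, 1], not percentages.
    """
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    denominator = beta ** 2 * precision + recall
    if denominator == 0.0:
        # Both scores are zero: define the F-measure as zero.
        return 0.0
    return (1.0 + beta ** 2) * precision * recall / denominator


# Precision and recall as quoted in the abstract (94.70 % and 98.84 %).
proper_f1 = f_measure(0.9470, 0.9884)
print(f"Balanced F-measure: {proper_f1:.4f}")
```

Because the harmonic mean is pulled toward the lower of the two values, the combined score lies between the reported precision and recall, closer to the precision.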
An Unsupervised Iterative Method for Chinese New Lexicon Extraction
- International Journal of Computational Linguistics & Chinese Language Processing
, 1997
Cited by 32 (3 self)
An unsupervised iterative approach for extracting a new lexicon (or unknown words) from a Chinese text corpus is proposed in this paper. Instead of using a non-iterative segmentation-merging-filtering-and-disambiguation approach, the proposed method iteratively integrates the contextual constraints (among word candidates) and a joint character association metric to progressively improve the segmentation results of the input corpus (and thus the new word list). An augmented dictionary, which includes potential unknown words (in addition to known words), is used to segment the input corpus, unlike traditional approaches which use only known words for segmentation. In the segmentation process, the augmented dictionary is used to impose contextual constraints over known words and potential unknown words within input sentences; an unsupervised Viterbi Training process is then applied to ensure that the selected potential unknown words (and known words) maximize the likelihood of the input ...
Hierarchical Clustering of Words and Application to NLP Tasks
Cited by 9 (0 self)
This paper describes a data-driven method for hierarchical clustering of words and clustering of multiword compounds. A large vocabulary of English words (70,000 words) is clustered bottom-up, with respect to corpora ranging in size from 5 million to 50 million words, using mutual information as an objective function. The resulting hierarchical clusters of words are then naturally transformed to a bit-string representation of (i.e. word bits for) all the words in the vocabulary. Evaluation of the word bits is carried out through the measurement of the error rate of the ATR Decision-Tree Part-Of-Speech Tagger. The same clustering technique is then applied to the classification of multiword compounds. In order to avoid the explosion of the number of compounds to be handled, compounds in a small subclass are bundled and treated as a single compound. Another merit of this approach is that we can avoid the data sparseness problem which is ubiquitous in corpus statistics. The quality of one of the obtained compound classes is examined and compared to a conventional approach.
A Multivariate Gaussian Mixture Model for Automatic Compound Word Extraction
- Department of Electrical Engineering, National Tsing-Hua University
, 1997
Cited by 2 (1 self)
An improved statistical model is proposed in this paper for extracting compound words from a text corpus. Traditional terminology extraction methods rely heavily on simple filtering-and-thresholding methods, which are unable to minimize the error counts objectively. Therefore, a method for minimizing the error counts is very desirable. In this paper, an improved statistical model is developed to integrate parts of speech information as well as other frequently used word association metrics to jointly optimize the extraction tasks. The features are modelled with a multivariate Gaussian mixture for handling the inter-feature correlations properly. With a training (resp. testing) corpus of 20715 (resp. 2301) sentences, the weighted precision & recall (WPR) can achieve about 84% for bigram compounds, and 86% for trigram compounds. The F-measure performances are about 82% for bigrams and 84% for trigrams. 1. Compound Word Extraction Problems 1.1 Motivation Compound words are very common ...
Bigram Statistics Revisited: A Comparative Examination of Some Statistical Measures in Morphological Analysis of Japanese Kanji Sequences
, 1996
Cited by 2 (2 self)
this paper, i.e. X² (Hoel, 1971; Fienberg, 1977; Reynolds, 1977), 2 likelihood ratio test (Hoel, 1971; Fienberg, 1977; Reynolds, 1977; Dunning, 1993), Yule's coefficient of colligation Y (Yule, 1944; Reynolds, 1977; Delcourt 1992; 1994), and mutual information (Fano, 1961; Church, Gale, Hanks and Hindle, 1990; Church and Hanks, 1990)
Corpus-Based Learning of Compound Noun Indexing
- Proc. of the ACL'2000 workshop on Recent Advances in Natural Language Processing and Information Retrieval, Hong Kong
, 2000
Cited by 2 (0 self)
We present a corpus-based learning method that can index diverse types of compound nouns using rules automatically extracted from a large tagged corpus. We develop an efficient way of extracting the compound noun indexing rules automatically and perform extensive experiments to evaluate our indexing rules. The automatic learning method shows about the same performance compared with the manual linguistic approach but is more portable and requires no human efforts.
Authors: Byung-Kwan Kwak, Jee-Hyub Kim, and Geunbae Lee, NLP Lab., Dept. of CSE, Pohang University of Science & Technology (POSTECH), San 31, Hyoja-Dong, Pohang, 790-784, Korea, {nerguri,gblee}@postech.ac.kr; Jung Yun Seo, NLP Lab., Dept. of Computer Science, Sogang University, Sinsu-dong 1, Mapo-gu, Seoul, Korea, seojy@ccs.sogang.ac.kr
Submitted to: IR&NLP workshop, July 12, 2000
Keywords: compound noun, indexing, corpus-based learning, automatic rule extraction, filtering
Word count: 3,036
A Corpus-Based Learning Method of Compound Noun Indexing Rules for Korean
Cited by 1 (1 self)
In Korean information retrieval, compound nouns play an important role in improving precision in search experiments. There are two major approaches to compound noun indexing in Korean: statistical and linguistic. Each method, however, has its own shortcomings, such as limitations when indexing diverse types of compound nouns, over-generation of compound nouns, and data sparseness in training. In this paper, we propose a corpus-based learning method, which can index diverse types of compound nouns using rules automatically extracted from a large corpus. The automatic learning method is more portable and requires less human effort, although it exhibits a performance level similar to the manual-linguistic approach. We also present a new filtering method to solve the problems of compound noun over-generation and data sparseness. Keywords: corpus-based learning, compound noun indexing, filtering, information retrieval, search performance evaluation
Using Co-occurrence Statistics as an Information Source for Partial Parsing of Chinese
, 2000
Cited by 1 (1 self)
Our partial parser for Chinese uses a learned classifier to guide a bottom-up parsing process. We describe improvements in performance obtained by expanding the information available to the classifier, from POS sequences only, to include measures of word association derived from co-occurrence statistics. We compare performance using different measures of association, and find that Yule's coefficient of colligation Y gives somewhat better results over other measures.
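Two of the association measures named in this listing, Yule's coefficient of colligation Y and (pointwise) mutual information, can both be computed from a 2x2 contingency table of co-occurrence counts. The sketch below uses the standard textbook definitions; the function names, the table layout (a, b, c, d), and the example counts are assumptions made for illustration and are not taken from the paper, whose exact formulation may differ.

```python
import math


def contingency(n_xy: int, n_x: int, n_y: int, n: int) -> tuple[int, int, int, int]:
    """Build a 2x2 table (a, b, c, d) from co-occurrence counts.

    a = x and y together, b = x without y,
    c = y without x,      d = neither x nor y.
    """
    a = n_xy
    b = n_x - n_xy
    c = n_y - n_xy
    d = n - n_x - n_y + n_xy
    if min(a, b, c, d) < 0:
        raise ValueError("inconsistent counts")
    return a, b, c, d


def yule_y(a: int, b: int, c: int, d: int) -> float:
    """Yule's Y: (sqrt(ad) - sqrt(bc)) / (sqrt(ad) + sqrt(bc)), in [-1, 1]."""
    root_ad = math.sqrt(a * d)
    root_bc = math.sqrt(b * c)
    if root_ad + root_bc == 0.0:
        return 0.0  # degenerate table: treat as no association
    return (root_ad - root_bc) / (root_ad + root_bc)


def pmi(a: int, b: int, c: int, d: int) -> float:
    """Pointwise mutual information in bits: log2(P(x, y) / (P(x) P(y)))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2(a * n / ((a + b) * (a + c)))


# Hypothetical counts: x occurs 50 times, y 60 times, together 30 times,
# out of 1000 observed positions.
table = contingency(n_xy=30, n_x=50, n_y=60, n=1000)
print(f"Y = {yule_y(*table):.3f}, PMI = {pmi(*table):.3f} bits")
```

Both measures are zero when the two words are statistically independent. Y is bounded in [-1, 1], while mutual information is unbounded and tends to grow large for rare pairs, which is one reason comparisons between the measures can come out differently on sparse data.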
Corpus-Based Learning of Compound Noun Indexing
, 2000
In this paper, we present a corpus-based learning method that can index diverse types of compound nouns using rules automatically extracted from a large tagged corpus.

