Semantic similarity based on corpus statistics and lexical taxonomy
Proc of 10th International Conference on Research in Computational Linguistics, ROCLING’97, 1997
Cited by 395 (0 self)
This paper presents a new approach for measuring semantic similarity/distance between words and concepts. It combines a lexical taxonomy structure with corpus statistical information so that the semantic distance between nodes in the semantic space constructed by the taxonomy can be better quantified with the computational evidence derived from a distributional analysis of corpus data. Specifically, the proposed measure is a combined approach that inherits the edge-based approach of the edge counting scheme, which is then enhanced by the node-based approach of the information content calculation. When tested on a common data set of word pair similarity ratings, the proposed approach outperforms other computational models. It gives the highest correlation value (r = 0.828) with a benchmark based on human similarity judgements, whereas an upper bound (r = 0.885) is observed when human subjects replicate the same task.
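The combination of taxonomy structure and information content described in this abstract is usually cited as the Jiang-Conrath distance. The sketch below assumes its commonly quoted simplified form, dist(c1, c2) = IC(c1) + IC(c2) - 2 * IC(lcs(c1, c2)), where IC(c) = -log p(c) and lcs is the lowest common subsumer in the taxonomy. The abstract itself does not state the formula, and the toy taxonomy, counts, and function names are invented for illustration.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and corpus counts; not from the paper.
PARENT = {"dog": "mammal", "cat": "mammal", "mammal": "animal",
          "bird": "animal", "animal": None}
OWN_COUNT = {"dog": 30, "cat": 20, "mammal": 10, "bird": 30, "animal": 10}


def ancestors(concept):
    """Return the path from a concept up to the root, concept first."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def subtree_count(concept):
    """Count occurrences of a concept plus everything it subsumes."""
    return sum(n for c, n in OWN_COUNT.items() if concept in ancestors(c))


TOTAL = subtree_count("animal")  # the root subsumes every concept


def information_content(concept):
    """IC(c) = -log p(c), with p(c) estimated from subtree counts."""
    return -math.log(subtree_count(concept) / TOTAL)


def lowest_common_subsumer(c1, c2):
    """First ancestor of c1 (walking upward) that also subsumes c2."""
    up2 = set(ancestors(c2))
    return next(a for a in ancestors(c1) if a in up2)


def jc_distance(c1, c2):
    """Simplified Jiang-Conrath distance: IC(c1) + IC(c2) - 2 * IC(lcs)."""
    lcs = lowest_common_subsumer(c1, c2)
    return (information_content(c1) + information_content(c2)
            - 2 * information_content(lcs))
```

With these counts, `jc_distance("dog", "cat")` is smaller than `jc_distance("dog", "bird")`, because "dog" and "cat" share the more specific subsumer "mammal", while "dog" and "bird" meet only at the root, whose information content is zero.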
Fast Statistical Parsing of Noun Phrases for Document Indexing
1997
Cited by 31 (7 self)
Information Retrieval (IR) is an important application area of Natural Language Processing (NLP) where one encounters the genuine challenge of processing large quantities of unrestricted natural language text. While much effort has been made to apply NLP techniques to IR, very few NLP techniques have been evaluated on a document collection larger than several megabytes. Many NLP techniques are simply not efficient enough, and not robust enough, to handle a large amount of text. This paper proposes a new probabilistic model for noun phrase parsing, and reports on the application of such a parsing technique to enhance document indexing. The effectiveness of using syntactic phrases provided by the parser to supplement single words for indexing is evaluated with a 250-megabyte document collection. The experimental results show that supplementing single words with syntactic phrases for indexing consistently and significantly improves retrieval performance.
Natural Language Information Retrieval: TREC-3 Report
In Proceedings of the Fifth Text REtrieval Conference (TREC-5
Cited by 26 (1 self)
In this paper we report on recent developments in NYU's natural language information retrieval system, especially as related to the 3rd Text Retrieval Conference (TREC-3). The main characteristic of this system is the use of advanced natural language processing to enhance the effectiveness of term-based document retrieval. The system is designed around a traditional statistical backbone consisting of the indexer module, which builds inverted index files from pre-processed documents, and a retrieval engine, which searches and ranks the documents in response to user queries. Natural language processing is used to (1) preprocess the documents in order to extract content-carrying terms, (2) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (3) process users' natural language requests into effective search queries. For the present TREC-3 effort, a total of 3.3 GBytes of text articles has been processed (Tipster disks 1 through 3), i...
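The statistical backbone described in this abstract (an indexer that builds inverted index files, plus an engine that ranks documents against a query) can be sketched as follows. This is a generic illustration, not NYU's system: the tf-idf weighting, the whitespace tokenizer, the sample documents, and all function names are assumptions.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term frequency} (an inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Rank documents by summed tf-idf over the query terms (assumed weighting)."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical three-document collection.
DOCS = {
    "d1": "stock prices fell on wall street",
    "d2": "the street was closed for repairs",
    "d3": "stock markets and stock prices rallied",
}
INDEX = build_index(DOCS)
```

Calling `search(INDEX, len(DOCS), "stock prices")` ranks `d3` ahead of `d1`, since `d3` contains "stock" twice. In a system like the one described, the natural language processing steps would decide which terms and phrases enter the index, while the ranking engine stays purely statistical.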
Empirical Observation of Term Variations and Principles for their Description
2000
Cited by 23 (0 self)
Contents
1 Introduction
  1.1 Do terms vary?
  1.2 A Symbolic Framework for the Study of Terminological Variation
2 The Most Common Types of English Two-word Terms
  2.1 Adjective Noun (A N)
  2.2 Noun Noun (N2 N1)
  2.3 Noun Preposition Noun (N1 P N2)
3 Observing and Representing Term Variants
  3.1 An Observation of Term Variants
  3.2 A Two-level Lexico-syntactic Description of Terms
  3.3 Two Families of Grammatical Rules
NLP for Term Variant Extraction: Synergy between Morphology, Lexicon, and Syntax
1999
Cited by 22 (1 self)
We present a natural language processing (NLP) approach to automatic indexing over controlled vocabulary which accounts for term variation. The approach combines a part-of-speech tagger, a generator of morphologically related forms, and a shallow transformational parser. The system is applied to the French language; it is trained on newspaper articles and tested on scientific literature. The precision of indexing on terms and variants is 97.2%, only slightly lower than that of indexing without accounting for term variation (99.7%). The recall of indexing on terms and variants (93.4%) is much higher than the recall of indexing on term occurrences only (72.4%). Conflation of term variants increases indexing coverage up to 30%. The system is a convincing example of the potential synergy between full-fledged morphological analysis and local syntactic analysis. Many details are provided on the implementation of the system. Illustrative examples of syntactic transformations for the French language are given together with the theoretical and empirical methods for their formulation.
CHRISTIAN JACQUEMIN AND EVELYNE TZOUKERMANN
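The precision and recall figures in this abstract compare the index entries produced by the system with a reference set. A minimal sketch of the two measures follows; the term sets are hypothetical English examples, not the paper's French data, and the function name is an assumption.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for two sets of index entries."""
    hits = len(predicted & relevant)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical reference set: every occurrence that should be indexed.
reference = {"blood cell", "cell of blood", "blood cells", "cell wall"}

# Matching exact term occurrences only: nothing wrong, but variants are missed.
terms_only = {"blood cell", "cell wall"}

# Also matching variants: more coverage, at the cost of one spurious entry.
with_variants = {"blood cell", "cell of blood", "blood cells",
                 "cell wall", "cell phone"}
```

Here `precision_recall(terms_only, reference)` gives full precision but only half the recall, while `precision_recall(with_variants, reference)` trades a little precision for full recall, which is the same direction of trade-off the abstract reports.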
Degraded Text Recognition Using Visual And Linguistic Context
1995
Cited by 20 (2 self)
Recognition of degraded text is a challenging problem. To improve the performance of an OCR system on degraded images of text, postprocessing techniques are critical. The objective of postprocessing is to correct errors or to resolve ambiguities in OCR results by using contextual information. Depending on the extent of context used, there are different levels of postprocessing. In current commercial OCR systems, word-level postprocessing methods, such as dictionary-lookup, have been applied successfully. However, many OCR errors cannot be corrected by word-level postprocessing. To overcome this limitation, passage-level postprocessing, in which global contextual information is utilized, is necessary. In most current studies on passage-level postprocessing, linguistic context is the major resource to be exploited. This thesis addresses problems in degraded text recognition and discusses potential solutions through passage-level postprocessing. The objective is to develop a postprocessin...
Exporting phrases: A statistical analysis of topical language
Second Symposium on Document Analysis and Information Retrieval, 1993
Cited by 17 (2 self)
This paper describes preliminary experiments documenting significant variations in word usage patterns within topical sublanguages. As some phrases have very different collocational patterns than their constituent words, we look beyond occurrences of individual words, to consider word phrases. The mutual information statistic is used to measure the information content of phrases beyond that of their constituent words. We find that specialized topic areas give rise to phrases with very descriptive constituents which are then "exported" into general vocabulary. These phrases are also much more informative as word pairs outside the topic area than within it. Further, we find evidence of an intriguing "self-similar" regularity in this exporting relation across different hierarchical levels of topical areas.

1 Introduction

The assumption is often made in information retrieval (IR) and corpus-based linguistics that the documents of interest are part of a single, homogeneous and unstructured collection. If the text corpus is of small or moderate size, topically well-focused and generated by one author or a small group of authors sharing a common vocabulary, this can be a reasonable and useful simplification. But as machine-readable corpora increase in size, it becomes more and more likely that significant variations in word usage patterns will be observed within restricted subsets of the collections.
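The mutual information statistic mentioned in the abstract can be illustrated with pointwise mutual information for adjacent word pairs. The sketch assumes the standard definition I(x, y) = log2(P(x, y) / (P(x) P(y))) with simple relative-frequency estimates; the paper's exact estimator is not given in this listing, and the toy text and function name are hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Pointwise mutual information (in bits) for each adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    pmi = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return pmi


# Hypothetical toy text.
TOKENS = "wall street fell while main street and wall street rose".split()
PMI = bigram_pmi(TOKENS)
```

A positive value means the pair occurs together more often than its constituent words would predict independently. Comparing such scores for the same phrase inside and outside a topic area is one way to approach the "exporting" effect the paper describes, although counts as small as these are far too sparse to be meaningful.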
Robust Text Processing In Automated Information Retrieval
1994
Cited by 16 (3 self)
This paper outlines a prototype text retrieval system which uses relatively advanced natural language processing techniques in order to enhance the effectiveness of statistical document retrieval. The backbone of our system is a traditional retrieval engine which builds inverted index files from pre-processed documents, and then searches and ranks the documents in response to user queries. Natural language processing is used to (1) preprocess the documents in order to extract content-carrying terms, (2) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (3) process users' natural language requests into effective search queries. The basic assumption of this design is that a term-based representation of content is in principle sufficient to build an effective, if not optimal, search query out of any user's request. This has been confirmed by an experiment that compared the effectiveness of queries prepared by expert users with those derived automatically from an initial narrative information request. In this paper we show that large-scale natural language processing (hundreds of millions of words and more) is not only required for better retrieval, but is also doable, given appropriate resources. We report on selected preliminary results of experiments with a 500 MByte database of Wall Street Journal articles, as well as some earlier results with a smaller document collection.
Recent Developments In Natural Language Text Retrieval
Proceedings of the Second Text REtrieval Conference (TREC-2), NIST Special Publication 500-215, 1994
Cited by 11 (5 self)
This paper reports on some recent developments in our natural language text retrieval system. The system uses advanced natural language processing techniques to enhance the effectiveness of term-based document retrieval. The backbone of our system is a traditional statistical engine which builds inverted index files from pre-processed documents, and then searches and ranks the documents in response to user queries. Natural language processing is used to (1) preprocess the documents in order to extract content-carrying terms, (2) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (3) process users' natural language requests into effective search queries. For the present TREC-2 effort, a total of 550 MBytes of Wall Street Journal articles (ad-hoc queries database) and 300 MBytes of San Jose Mercury articles (routing data) have been processed. In terms of text quantity this represents approximately 130 million words of English. Unlike ...
What Is The Tree That We See Through The Window: A Linguistic Approach To Windowing And Term Variation
Cited by 9 (4 self)
Windowing techniques play a key role in information retrieval. Previous work has suggested that the quality of access to information relies heavily on the characteristics of the windows. This study provides a linguistic approach to text windowing through an extraction of term variants with the help of a partial parser. The syntactic grounding of the method ensures that words observed within restricted spans are lexically related and that spurious word co-occurrences are ruled out with a good level of confidence. The system is computationally tractable on large corpora and large lists of terms. Illustrative examples of term variations from a large medical corpus are given. An experimental evaluation of the method shows that only a small proportion of co-occurring words are lexically related and motivates the call for natural language parsing techniques in text windowing.

1. INTRODUCTION

The notion of text window -- a span of contiguous words within a document -- is crucial for severa...

