Results 1 - 10 of 61
Frequency Estimates for Statistical Word Similarity Measures, 2003
"... Statistical measures of word similarity have application in many areas of natural language processing, such as language modeling and information retrieval. We report a comparative study of two methods for estimating word cooccurrence frequencies required by word similarity measures. Our frequency es ..."
Abstract
-
Cited by 106 (3 self)
- Add to MetaCart
(Show Context)
Statistical measures of word similarity have application in many areas of natural language processing, such as language modeling and information retrieval. We report a comparative study of two methods for estimating word co-occurrence frequencies required by word similarity measures. Our frequency estimates are generated from a terabyte-sized corpus of Web data, and we study the impact of corpus size on the effectiveness of the measures. We base the evaluation on one TOEFL question set and two practice question sets, each consisting of a number of multiple-choice questions seeking the best synonym for a given target word.
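As a rough illustration of the kind of measure such a study compares, the sketch below scores multiple-choice synonym candidates by pointwise mutual information over pre-collected counts; the toy frequencies and the plain PMI scorer are assumptions for illustration, not the paper's estimators.

```python
# Hypothetical sketch of a PMI-style synonym test, assuming unigram and
# co-occurrence counts have already been collected from a large corpus.
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information from raw frequencies."""
    if count_xy == 0:
        return float("-inf")
    return math.log((count_xy * total) / (count_x * count_y))

def best_synonym(target, choices, unigram, cooc, total):
    """Pick the multiple-choice option with the highest PMI to the target."""
    return max(choices,
               key=lambda c: pmi(cooc.get((target, c), 0),
                                 unigram.get(target, 1),
                                 unigram.get(c, 1),
                                 total))

# Toy counts standing in for terabyte-scale Web frequencies.
unigram = {"enormous": 900, "huge": 1200, "tiny": 800, "soft": 700}
cooc = {("enormous", "huge"): 50, ("enormous", "tiny"): 3,
        ("enormous", "soft"): 1}
print(best_synonym("enormous", ["huge", "tiny", "soft"], unigram, cooc, 1_000_000))
```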
Name disambiguation in author citations using a K-way spectral clustering method
- INTERNATIONAL CONFERENCE ON DIGITAL LIBRARIES, 2005
"... An author may have multiple names and multiple authors may share the same name simply due to name abbreviations, identical names, or name misspellings in publications or bibliographies 1. This can produce name ambiguity which can affect the performance of document retrieval, web search, and database ..."
Abstract
-
Cited by 72 (7 self)
- Add to MetaCart
An author may have multiple names and multiple authors may share the same name simply due to name abbreviations, identical names, or name misspellings in publications or bibliographies. This can produce name ambiguity which can affect the performance of document retrieval, web search, and database integration, and may cause improper attribution of credit. Proposed here is an unsupervised learning approach using K-way spectral clustering that disambiguates authors in citations. The approach utilizes three types of citation attributes: co-author names, paper titles, and publication venue titles. The approach is illustrated with 16 name datasets with citations collected from the DBLP database bibliography and author home pages and shows that name disambiguation can be achieved using these citation attributes.
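A minimal sketch of K-way spectral clustering over citation records follows; the equal-weight Jaccard affinity over the three attribute types and the Ng-Jordan-Weiss-style embedding are illustrative assumptions, not necessarily the paper's exact formulation.

```python
# Minimal K-way spectral clustering sketch over toy citation records.
import numpy as np
from scipy.cluster.vq import kmeans2

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def spectral_kway(S, k):
    """Cluster rows of affinity matrix S into k groups (Ng/Jordan/Weiss style)."""
    d = S.sum(axis=1)
    d[d == 0] = 1e-12
    L = np.diag(d ** -0.5) @ S @ np.diag(d ** -0.5)   # normalized affinity
    _, vecs = np.linalg.eigh(L)
    X = vecs[:, -k:]                                   # top-k eigenvectors
    X /= np.linalg.norm(X, axis=1, keepdims=True)      # row-normalize
    _, labels = kmeans2(X, k, minit="++")
    return labels

# Each citation: sets of co-author names, title words, and venue words.
citations = [
    ({"j smith", "a lee"}, {"spectral", "clustering"}, {"kdd"}),
    ({"j smith", "a lee"}, {"graph", "clustering"},    {"kdd"}),
    ({"j smith", "b wu"},  {"protein", "folding"},     {"ismb"}),
    ({"j smith", "b wu"},  {"gene", "expression"},     {"ismb"}),
]
n = len(citations)
S = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # Equal-weight combination of the three attribute similarities.
        S[i, j] = sum(jaccard(citations[i][t], citations[j][t]) for t in range(3)) / 3
print(spectral_kway(S, k=2))  # citations 0,1 vs 2,3 should separate
```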
DisCo: Distributed co-clustering with map-reduce
- ICDM, 2008
"... Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from pre-processing the raw data ..."
Abstract
-
Cited by 53 (1 self)
- Add to MetaCart
(Show Context)
Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from pre-processing the raw data to estimating the final models. As data become more abundant, scalable and easy-to-use tools for distributed processing are also emerging. Among those, Map-Reduce has been widely embraced by both academia and industry. In database terms, Map-Reduce is a simple yet powerful execution engine, which can be complemented with other data storage and management components, as necessary. In this paper we describe our experiences and findings in applying Map-Reduce, from raw data to final models, on an important mining task. In particular, we focus on co-clustering, which has been studied in many applications such as text mining, collaborative filtering, bio-informatics, and graph mining. We propose the Distributed Co-clustering (DisCo) framework, which introduces practical approaches for distributed data pre-processing and co-clustering. We develop DisCo using Hadoop, an open source Map-Reduce implementation. We show that DisCo can scale well and efficiently process and analyze extremely large datasets (up to several hundred gigabytes) on commodity hardware.
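The sketch below simulates the shape of one Map-Reduce pass from a co-clustering iteration: with row and column group assignments held fixed, a mapper emits values keyed by (row-group, column-group) and a reducer sums them into the block counts the next assignment step would consume. The mapper/reducer split mirrors Hadoop streaming conventions; DisCo's actual job structure is more involved.

```python
# Toy in-process simulation of a single Map-Reduce aggregation pass.
from collections import defaultdict

def mapper(record, row_group, col_group):
    """record is one sparse matrix entry (row, col, value)."""
    r, c, v = record
    yield (row_group[r], col_group[c]), v

def reducer(key, values):
    yield key, sum(values)

def run_mapreduce(records, row_group, col_group):
    shuffled = defaultdict(list)
    for rec in records:                       # map phase
        for k, v in mapper(rec, row_group, col_group):
            shuffled[k].append(v)
    out = {}
    for k, vs in shuffled.items():            # reduce phase
        for key, total in reducer(k, vs):
            out[key] = total
    return out

records = [(0, 0, 5), (0, 1, 1), (1, 0, 4), (2, 2, 7), (3, 3, 2)]
row_group = {0: 0, 1: 0, 2: 1, 3: 1}
col_group = {0: 0, 1: 0, 2: 1, 3: 1}
print(run_mapreduce(records, row_group, col_group))  # {(0, 0): 10, (1, 1): 9}
```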
Extracting Synonymous Gene and Protein Terms From Biological Literature, 2003
"... Motivation: Genes and proteins are often associated with multiple names. More names are added as new functional or structural information is discovered. Because authors can use any one of the known names for a gene or protein, information retrieval and extraction would benefit from identifying the g ..."
Abstract
-
Cited by 49 (1 self)
- Add to MetaCart
Motivation: Genes and proteins are often associated with multiple names. More names are added as new functional or structural information is discovered. Because authors can use any one of the known names for a gene or protein, information retrieval and extraction would benefit from identifying the gene and protein terms that are synonyms of the same substance.
Word Sense Disambiguation in Information Retrieval Revisited
- ACM SIGIR, 2003
"... Word sense ambiguity is recognized as having a detrimental effect on the precision of information retrieval systems in general and web search systems in particular, due to the sparse nature of the queries involved. Despite continued research into the application of automated word sense disambiguatio ..."
Abstract
-
Cited by 44 (0 self)
- Add to MetaCart
Word sense ambiguity is recognized as having a detrimental effect on the precision of information retrieval systems in general and web search systems in particular, due to the sparse nature of the queries involved. Despite continued research into the application of automated word sense disambiguation, the question remains whether disambiguation that is less than 90% accurate can lead to improvements in retrieval effectiveness. In this study we explore the development and subsequent evaluation of a statistical word sense disambiguation system which demonstrates increased precision from a sense-based vector space retrieval model over traditional TF*IDF techniques.
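The toy contrast below shows why sense-based vectors can help: sense-tagged tokens ("bank#1" vs "bank#2") stop the two meanings of "bank" from matching spuriously. The hand-assigned sense tags and raw term vectors are stand-ins for the paper's statistical disambiguator and weighted retrieval model.

```python
# Cosine match of term vectors vs sense-tagged vectors.
import math
from collections import Counter

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

query_terms = Counter("bank loan rates".split())
doc_terms   = Counter("river bank erosion".split())      # spurious match on "bank"
print(round(cosine(query_terms, doc_terms), 3))           # nonzero

query_sense = Counter("bank#1 loan rates".split())
doc_sense   = Counter("river bank#2 erosion".split())     # senses no longer collide
print(round(cosine(query_sense, doc_sense), 3))           # 0.0
```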
Unsupervised methods for determining object and relation synonyms on the web
- Journal of Artificial Intelligence Research, 2009
"... The task of identifying synonymous relations and objects, or synonym resolution, is critical for high-quality information extraction. This paper investigates synonym resolution in the context of unsupervised information extraction, where neither hand-tagged training examples nor domain knowledge is ..."
Abstract
-
Cited by 41 (3 self)
- Add to MetaCart
(Show Context)
The task of identifying synonymous relations and objects, or synonym resolution, is critical for high-quality information extraction. This paper investigates synonym resolution in the context of unsupervised information extraction, where neither hand-tagged training examples nor domain knowledge is available. The paper presents a scalable, fully implemented system that runs in O(KN log N) time in the number of extractions, N, and the maximum number of synonyms per word, K. The system, called Resolver, introduces a probabilistic relational model for predicting whether two strings are co-referential based on the similarity of the assertions containing them. On a set of two million assertions extracted from the Web, Resolver resolves objects with 78% precision and 68% recall, and resolves relations with 90% precision and 35% recall. Several variations of Resolver's probabilistic model are explored, and experiments demonstrate that under appropriate conditions these variations can improve F1 by 5%. An extension to the basic Resolver system allows it to handle polysemous names with 97% precision and 95% recall on a data set from the TREC corpus.
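The core intuition can be sketched as follows: two object strings are candidates for merging when the assertions they appear in share many (relation, argument) properties. The plain Jaccard overlap below stands in for Resolver's probabilistic relational model, and the assertions are invented examples.

```python
# Score string pairs by overlap of the properties of their assertions.
from collections import defaultdict

assertions = [
    ("Mars", "is_planet_of", "Sun"),
    ("Red Planet", "is_planet_of", "Sun"),
    ("Mars", "has_moon", "Phobos"),
    ("Red Planet", "has_moon", "Phobos"),
    ("Venus", "is_planet_of", "Sun"),
]

# Each object string -> set of (relation, other-argument) properties.
props = defaultdict(set)
for s, r, o in assertions:
    props[s].add((r, o))
    props[o].add((r + "^-1", s))

def overlap_score(a, b):
    """Jaccard overlap of the two strings' extracted properties."""
    pa, pb = props[a], props[b]
    return len(pa & pb) / len(pa | pb)

print(overlap_score("Mars", "Red Planet"))  # high: identical properties
print(overlap_score("Mars", "Venus"))       # lower: one shared property
```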
Measures and Applications of Lexical Distributional Similarity, 2003
"... This thesis is concerned with the measurement and application of lexical distributional similarity. Two words are said to be distributionally similar if they appear in similar contexts. This loose definition, however, has led to many measures being proposed or adopted from fields such as geometry, s ..."
Abstract
-
Cited by 35 (0 self)
- Add to MetaCart
(Show Context)
This thesis is concerned with the measurement and application of lexical distributional similarity. Two words are said to be distributionally similar if they appear in similar contexts. This loose definition, however, has led to many measures being proposed or adopted from fields such as geometry, statistics, Information Retrieval (IR) and Information Theory. Our aim is to investigate the properties which make a good measure of lexical distributional similarity. We start by introducing the concept of lexical distributional similarity. We discuss potential applications, which can be roughly divided into distributional or language modelling applications and semantic applications, and methods of evaluation (Chapter 2). We look at existing measures of distributional similarity and carry out an empirical comparison of fifteen of these measures, paying particular attention to the effects of word frequency (Chapter 3). We propose a new general framework for distributional similarity based on the concept of lexical substitutability, which we measure using the IR concepts of precision and recall. This framework allows us to investigate the key factors in similarity: asymmetry, the relative influence of different contexts, and the extent to which words share a context (Chapter 4). Finally, we consider the application of distributional similarity in language modelling (Chapter 5) and as a predictor of semantic similarity using human judgements of similarity and a spelling correction task (Chapter 6).
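A minimal rendering of the substitutability-style precision and recall described above, over toy verb-object features; the feature weights and the additive overlap are illustrative assumptions, not the thesis's exact definitions.

```python
# Precision/recall over weighted co-occurrence features of two words.
def dist_precision(feats_a, feats_b):
    """Share of a's feature mass that also appears among b's features."""
    shared = sum(w for f, w in feats_a.items() if f in feats_b)
    total = sum(feats_a.values())
    return shared / total if total else 0.0

def dist_recall(feats_a, feats_b):
    return dist_precision(feats_b, feats_a)

# Verb-object style features with pseudo counts.
dog = {"obj:walk": 5, "obj:feed": 3, "obj:groom": 2}
cat = {"obj:feed": 4, "obj:groom": 3, "obj:stroke": 2}
p, r = dist_precision(dog, cat), dist_recall(dog, cat)
print(p, r, 2 * p * r / (p + r))   # p != r: the measure is asymmetric
```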
Information Retrieval Based on Context Distance and Morphology
- SIGIR '99, 1999
"... We present an approach to information retrieval based on context distance and morphology. Context distance is a measure we use to assess the closeness of word meanings. This context distance model measures semantic distances between words using the local contexts of words within a single document as ..."
Abstract
-
Cited by 22 (1 self)
- Add to MetaCart
We present an approach to information retrieval based on context distance and morphology. Context distance is a measure we use to assess the closeness of word meanings. This context distance model measures semantic distances between words using the local contexts of words within a single document as well as the lexical co-occurrence information in the set of documents to be retrieved. We also propose to integrate the context distance model with morphological analysis in determining word similarity so that the two can enhance each other. Using the standard vector-space model, we evaluated the proposed method on a subset of the TREC-4 corpus (AP88 and AP90 collections, 158,240 documents, 49 queries). Results show that this method improves the 11-point average precision by 8.6%.
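One toy rendering of a context-distance idea appears below: each word is represented by counts of the words in a small window around its occurrences, and distance is one minus a normalized overlap of those context vectors. The window size and the overlap measure are illustrative choices, not the paper's definitions.

```python
# Context vectors from local windows, and a simple distance over them.
from collections import Counter, defaultdict

def context_vectors(tokens, window=2):
    ctx = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        ctx[w].update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return ctx

def context_distance(a, b, ctx):
    ca, cb = ctx[a], ctx[b]
    shared = sum(min(ca[t], cb[t]) for t in set(ca) & set(cb))
    total = sum(ca.values()) + sum(cb.values())
    return 1.0 - 2.0 * shared / total if total else 1.0   # 0 = same contexts

tokens = "the doctor examined the patient the physician examined the patient".split()
ctx = context_vectors(tokens)
print(context_distance("doctor", "physician", ctx))  # small: shared contexts
print(context_distance("doctor", "patient", ctx))    # larger
```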
Graph-based word clustering using web search engine
- IN PROC. OF EMNLP, 2006
"... Word clustering is important for automatic thesaurus construction, text classification, and word sense disambiguation. Recently, several studies have reported using the web as a corpus. This paper proposes an unsupervised algorithm for word clustering based on a word similarity measure by web counts ..."
Abstract
-
Cited by 21 (2 self)
- Add to MetaCart
Word clustering is important for automatic thesaurus construction, text classification, and word sense disambiguation. Recently, several studies have reported using the web as a corpus. This paper proposes an unsupervised algorithm for word clustering based on a word similarity measure derived from web counts. Each pair of words is submitted as a query to a search engine, and the resulting counts form a co-occurrence matrix. By calculating the similarity of words, a word co-occurrence graph is obtained. A graph clustering algorithm called Newman clustering is applied to identify word clusters efficiently. Evaluations are made on two sets of word groups derived from a web directory and WordNet.
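The pipeline can be sketched with canned counts standing in for live search engine queries; networkx's Clauset-Newman-Moore greedy modularity routine stands in for the Newman clustering step, and the hit counts, similarity, and pruning threshold are invented for illustration.

```python
# Web counts -> pairwise similarity -> graph -> modularity-based clusters.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

hits = {"apple": 1000, "banana": 800, "cat": 900, "dog": 950}
pair_hits = {("apple", "banana"): 400, ("cat", "dog"): 500,
             ("apple", "cat"): 20, ("banana", "dog"): 15}

def similarity(a, b):
    """Jaccard-style overlap of page counts (one of several web-count measures)."""
    co = pair_hits.get((a, b), pair_hits.get((b, a), 0))
    return co / (hits[a] + hits[b] - co)

G = nx.Graph()
words = list(hits)
for i, a in enumerate(words):
    for b in words[i + 1:]:
        s = similarity(a, b)
        if s > 0.01:                      # prune weak edges
            G.add_edge(a, b, weight=s)

for community in greedy_modularity_communities(G, weight="weight"):
    print(sorted(community))              # {apple, banana} and {cat, dog}
```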
Clustering Syntactic Positions with Similar Semantic Requirements
- Computational Linguistics, 2005
"... This article describes an unsupervised strategy to acquire syntactico-semantic requirements of nouns, verbs, and adjectives from partially parsed text corpora. The linguistic notion of requirement underlying this strategy is based on two specific assumptions. First, it is assumed that two words in a ..."
Abstract
-
Cited by 18 (3 self)
- Add to MetaCart
(Show Context)
This article describes an unsupervised strategy to acquire syntactico-semantic requirements of nouns, verbs, and adjectives from partially parsed text corpora. The linguistic notion of requirement underlying this strategy is based on two specific assumptions. First, it is assumed that two words in a dependency are mutually required; this phenomenon is here called corequirement. Second, it is also claimed that the set of words occurring in similar positions defines extensionally the requirements associated with these positions. The main aim of the learning strategy presented in this article is to identify clusters of similar positions by identifying the words that define their requirements extensionally. This strategy allows us to learn the syntactic and semantic requirements of words in different positions. This information is used to solve attachment ambiguities, and results on this task are evaluated at the end of the article. Extensive experimentation was performed on Portuguese text corpora.
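The extensional view of requirements can be illustrated as follows: each (head, slot) position is represented by the set of words observed in it, and positions are grouped greedily by filler-set overlap. The hand-made dependencies and the single-link threshold are assumptions for illustration, not the article's clustering procedure.

```python
# Group syntactic positions whose filler word sets overlap.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

positions = {
    ("drink", "object"): {"water", "beer", "coffee", "juice"},
    ("sip",   "object"): {"coffee", "water", "wine"},
    ("drive", "object"): {"car", "truck", "bus"},
    ("park",  "object"): {"car", "truck"},
}

clusters = []
for pos, fillers in positions.items():
    for cluster in clusters:
        # Greedy single-link grouping on filler-set overlap.
        if any(jaccard(fillers, positions[p]) > 0.3 for p in cluster):
            cluster.append(pos)
            break
    else:
        clusters.append([pos])
print(clusters)  # drink/sip group apart from drive/park
```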