Results 1 - 10
of
14
Distributional measures as proxies for semantic relatedness
- In submission
, 2005
"... Abstract. The automatic ranking of word pairs as per their semantic relatedness and ability to mimic human notions of semantic relatedness has widespread applications. Measures that rely on raw data (distributional measures) and those that use knowledge-rich ontologies both exist. Although extensive ..."
Abstract
-
Cited by 16 (2 self)
- Add to MetaCart
Abstract. The automatic ranking of word pairs as per their semantic relatedness and ability to mimic human notions of semantic relatedness has widespread applications. Measures that rely on raw data (distributional measures) and those that use knowledge-rich ontologies both exist. Although extensive studies have been performed to compare ontological measures with human judgment, the distributional measures have primarily been evaluated by indirect means. This paper is a detailed study of some of the major distributional measures; it lists their respective merits and limitations. New measures that overcome these drawbacks, that are more in line with the human notions of semantic relatedness, are suggested. The paper concludes with an exhaustive comparison of the distributional and ontology-based measures. Along the way, significant research problems are identified. Work on these problems may lead to a better understanding of how semantic relatedness is to be measured.
Measures and Applications of Lexical Distributional Similarity
, 2003
"... This thesis is concerned with the measurement and application of lexical distributional similarity. Two words are said to be distributionally similar if they appear in similar contexts. This loose definition, however, has led to many measures being proposed or adopted from fields such as geometry, s ..."
Abstract
-
Cited by 14 (0 self)
- Add to MetaCart
This thesis is concerned with the measurement and application of lexical distributional similarity. Two words are said to be distributionally similar if they appear in similar contexts. This loose definition, however, has led to many measures being proposed or adopted from fields such as geometry, statistics, Information Retrieval (IR) and Information Theory. Our aim is to investigate the properties which make a good measure of lexical distributional similarity. We start by introducing the concept of lexical distributional similarity. We discuss potential applications, which can be roughly divided into distributional or language modelling applications and semantic applications, and methods of evaluation (Chapter 2). We look at existing measures of distributional similarity and carry out an empirical comparison of fifteen of these measures, paying particular attention to the effects of word frequency (Chapter 3). We propose a new general framework for distributional similarity based on the context of lexical substitutability, which me measure using the IR concepts of precision and recall. This framework allows us to investigate the key factors in similarity of asymmetry, the relative influence of different contexts and the extent to which words share a context (Chapter 4). Finally, we consider the application of distributional similarity in language modelling (Chapter 5) and as a predictor of semantic similarity using human judgements of similarity and a spelling correction task (Chapter 6).
Topic Detection and Tracking with Spatio-Temporal Evidence
- In Proceedings of 25th European Conference on Information Retrieval Research (ECIR 2003
, 2003
"... Topic Detection and Tracking is an event-based information organization task where online news streams are monitored in order to spot new unreported events and link documents with previously detected events. The detection has proven to perform rather poorly with traditional information retrieval app ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
Topic Detection and Tracking is an event-based information organization task where online news streams are monitored in order to spot new unreported events and link documents with previously detected events. The detection has proven to perform rather poorly with traditional information retrieval approaches. We present an approach that formalizes temporal expressions and augments spatial terms with ontological information and uses this data in the detection. In addition, instead using a single term vector as a document representation, we split the terms into four semantic classes and process and weigh the classes separately. The approach is motivated by experiments.
Improving Verb Clustering with Automatically Acquired Selectional Preferences
"... In previous research in automatic verb classification, syntactic features have proved the most useful features, although manual classifications rely heavily on semantic features. We show, in contrast with previous work, that considerable additional improvement can be obtained by using semantic featu ..."
Abstract
-
Cited by 7 (6 self)
- Add to MetaCart
In previous research in automatic verb classification, syntactic features have proved the most useful features, although manual classifications rely heavily on semantic features. We show, in contrast with previous work, that considerable additional improvement can be obtained by using semantic features in automatic classification: verb selectional preferences acquired from corpus data using a fully unsupervised method. We report these promising results using a new framework for verb clustering which incorporates a recent subcategorization acquisition system, rich syntactic-semantic feature sets, and a variation of spectral clustering which performs particularly well in high dimensional feature space. 1
Directed Treebank Refinement for PCFG parsing
- Proceedings of The Second Workshop on Treebanks and Linguistic Theories (TLT-2003), Vaxjö
, 2003
"... The linguistic annotation of a treebank conforms to an annotation scheme, which has to serve several purposes. It should be easy to understand and follow by human annotators, and it should naturally express the syntactic structure assumed for the sentences under consideration. Treebanks may also ser ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
The linguistic annotation of a treebank conforms to an annotation scheme, which has to serve several purposes. It should be easy to understand and follow by human annotators, and it should naturally express the syntactic structure assumed for the sentences under consideration. Treebanks may also serve as a resource for Natural
Extracting key phrases to disambiguate personal names on the web
- In CICLing
, 2006
"... Assume that you are looking for information about a particular person. A search engine returns many pages for that person’s name. Some of these pages may be on other people with the same name. One method to reduce the ambiguity in the query and filter out the irrelevant pages, is by adding a phrase ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
Assume that you are looking for information about a particular person. A search engine returns many pages for that person’s name. Some of these pages may be on other people with the same name. One method to reduce the ambiguity in the query and filter out the irrelevant pages, is by adding a phrase that uniquely identifies the person we are interested in from his/her namesakes. We propose an unsupervised algorithm that extracts such phrases from the Web. We represent each document by a term-entity model and cluster the documents using a contextual similarity metric. We evaluate the algorithm on a dataset of ambiguous names. Our method outperforms baselines, achieving over 80 % accuracy and significantly reduces the ambiguity in a web search task. 1
Disambiguating personal names on the web using automatically extracted key phrases
- In Proc. ECAI 2006
, 2006
"... Abstract. When you search for information regarding a particular person on the web, a search engine returns many pages. Some of these pages may be for people with the same name. How can we disambiguate these different people with the same name? This paper presents an unsupervised algorithm which pro ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Abstract. When you search for information regarding a particular person on the web, a search engine returns many pages. Some of these pages may be for people with the same name. How can we disambiguate these different people with the same name? This paper presents an unsupervised algorithm which produces unique phrases to disambiguate different people with the same name (i.e. namesakes). Our algorithm takes in a personal name and outputs multiple sets of phrases which uniquely identify the different namesakes on the web. These phrases could then be added to the query to narrow down the search to a specific namesake. We evaluated the algorithm on a collection of documents retreived from the Web. Experimental results show a significant improvement over the existing methods proposed for this task. 1
S.: Unsupervised Feature Selection for Text Data
- In Proc of the 8th European Conf on CBR
, 2006
"... nw¡rml¡sm¢ Abstract. Feature selection for unsupervised tasks is particularly challenging, especially when dealing with text data. The increase in online documents and email communication creates a need for tools that can operate without the supervision of the user. In this paper we look at novel fe ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
nw¡rml¡sm¢ Abstract. Feature selection for unsupervised tasks is particularly challenging, especially when dealing with text data. The increase in online documents and email communication creates a need for tools that can operate without the supervision of the user. In this paper we look at novel feature selection techniques that address this need. A distributional similarity measure from information theory is applied to measure feature utility. This utility informs the search for both representative and diverse features in two complementary ways: CLUSTER divides the entire feature space, before then selecting one feature to represent each cluster; and GREEDY increments the feature subset size by a greedily selected feature. In particular we found that GREEDY’s local search is suited to learning smaller feature subset sizes while CLUSTER is able to improve the global quality of larger feature sets. Experiments with four email data sets show significant improvement in retrieval accuracy with nearest neighbour based search methods compared to an existing frequency-based method. Importantly both GREEDY and CLUSTER make significant progress towards the upper bound performance set by a standard supervised feature selection method. 1
Online Distribution Estimation for Streaming Data: Framework and Applications
"... Abstract. In the last few years, we have been witnessing an evergrowing need for continuous observation and monitoring applications. This need is driven by recent technological advances that have made streaming applications possible, and by the fact that analysts in various domains have realized the ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
Abstract. In the last few years, we have been witnessing an evergrowing need for continuous observation and monitoring applications. This need is driven by recent technological advances that have made streaming applications possible, and by the fact that analysts in various domains have realized the value that such applications can provide. In this paper, we propose a general framework for computing efficiently an approximation of multi-dimensional distributions of streaming data. This framework enables the development of a wide variety of complex streaming applications. In addition, we demonstrate how our framework can operate in a distributed fashion, thus, making better use of the available resources. We motivate our techniques using two concrete problems, both in the challening context of resource-constrained sensor networks. The first problem is outlier detection, while the second is detection and tracking of homogeneous regions. Experiments with synthetic and real data show that our method is efficient and accurate, and compares favorably to other proposed techniques for both the problems that we studied. 1
Information Filtering, Novelty Detection, and Named-Page Finding Kevyn Collins-Thompson, Paul Ogilvie, Yi Zhang, and Jamie Callan
- In Proceedings of the 11th Text Retrieval Conference
, 2002
"... This paper describes our approaches, experiments, and results. As the approach for each task is quite different, the paper contains a section for each of the tasks. The following section describes our experiments in adaptive filtering, Section 3 describes named-page finding, and section 4 discusses ..."
Abstract
- Add to MetaCart
This paper describes our approaches, experiments, and results. As the approach for each task is quite different, the paper contains a section for each of the tasks. The following section describes our experiments in adaptive filtering, Section 3 describes named-page finding, and section 4 discusses the Novelty track

