A Cross-Collection Mixture Model for Comparative Text Mining
- In Proceedings of KDD ’04
, 2004
Cited by 47 (11 self)
... problem, which we refer to as Comparative Text Mining (CTM). Given a set of comparable text collections, the task of comparative text mining is to discover any latent common themes across all collections as well as summarize the similarity and differences of these collections along each common theme. This general problem subsumes many interesting applications, including business intelligence and opinion summarization. We propose a generative probabilistic mixture model for comparative text mining. The model simultaneously performs cross-collection clustering and within-collection clustering, and can be applied to an arbitrary set of comparable text collections. The model can be estimated efficiently using the Expectation-Maximization (EM) algorithm. We evaluate the model on two different text data sets (i.e., a news article data set and a laptop review data set), and compare it with a baseline clustering method also based on a mixture model. Experiment results show that the model is quite effective in discovering the latent common themes across collections and performs significantly better than our baseline mixture model.
Similarity of semantic relations
- Computational Linguistics
, 2006
Cited by 41 (2 self)
There are at least two kinds of similarity. Relational similarity is correspondence between relations, in contrast with attributional similarity, which is correspondence between attributes. When two words have a high degree of attributional similarity, we call them synonyms. When two pairs of words have a high degree of relational similarity, we say that their relations are analogous. For example, the word pair mason:stone is analogous to the pair carpenter:wood. This paper introduces Latent Relational Analysis (LRA), a method for measuring relational similarity. LRA has potential applications in many areas, including information extraction, word sense disambiguation, and information retrieval. Recently the Vector Space Model (VSM) of information retrieval has been adapted to measuring relational similarity, achieving a score of 47% on a collection of 374 college-level multiple-choice word analogy questions. In the VSM approach, the relation between a pair of words is characterized by a vector of frequencies of predefined patterns in a large corpus. LRA extends the VSM approach in three ways: (1) the patterns are derived automatically from the corpus, (2) the Singular Value Decomposition (SVD) is used to smooth the frequency data, and (3) automatically generated synonyms are used to explore variations of the word pairs. LRA achieves 56% on the 374 analogy questions, statistically equivalent to the average human score of 57%. On the related problem of classifying semantic relations, LRA achieves similar gains over the VSM.
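The SVD smoothing step mentioned in this abstract can be illustrated with a rank-k reconstruction of a pair-by-pattern frequency matrix. This is only a minimal sketch under stated assumptions: the function name `smooth_with_svd`, the toy matrix `X`, and the choice k = 2 are hypothetical, and the abstract does not specify how LRA weights or truncates the matrix.

```python
import numpy as np

def smooth_with_svd(X, k):
    """Return the best rank-k approximation of X (truncated SVD)."""
    # Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(k, len(s))  # cannot keep more components than exist
    # Keep only the k largest singular values and their vectors.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Hypothetical toy data: rows are word pairs, columns are pattern counts.
X = np.array([
    [12.0, 0.0, 5.0, 1.0, 0.0],
    [10.0, 1.0, 6.0, 0.0, 0.0],
    [0.0, 9.0, 0.0, 4.0, 3.0],
    [1.0, 8.0, 0.0, 5.0, 2.0],
])

X_smooth = smooth_with_svd(X, k=2)
print(X_smooth.round(2))
```

Rows of `X_smooth` can then be compared in place of the raw frequency rows; discarding the small singular values is what makes the representation less sensitive to sparse counts.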
Corpus-based learning of analogies and semantic relations
- Machine Learning
, 2005
Cited by 28 (8 self)
We present an algorithm for learning from unlabeled text, based on the Vector Space Model (VSM) of information retrieval, that can solve verbal analogy questions of the kind found in the SAT college entrance exam. A verbal analogy has the form A:B::C:D, meaning “A is to B as C is to D”; for example, mason:stone::carpenter:wood. SAT analogy questions provide a word pair, A:B, and the problem is to select the most analogous word pair, C:D, from a set of five choices. The VSM algorithm correctly answers 47% of a collection of 374 college-level analogy questions (random guessing would yield 20% correct; the average college-bound senior high school student answers about 57% correctly). We motivate this research by applying it to a difficult problem in natural language processing, determining semantic relations in noun-modifier pairs. The problem is to classify a noun-modifier pair, such as “laser printer”, according to the semantic relation between the noun (printer) and the modifier (laser). We use a supervised nearest-neighbour algorithm that assigns a class to a given noun-modifier pair by finding the most analogous noun-modifier pair in the training data. With 30 classes of semantic relations, on a collection of 600 labeled noun-modifier pairs, the learning algorithm attains an F value of 26.5% (random guessing: 3.3%). With 5 classes of semantic relations, the F value is 43.2% (random: 20%). The performance is state-of-the-art for both verbal analogies and noun-modifier relations.
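The selection step described in this abstract (pick the choice pair most analogous to the stem pair) can be sketched as a cosine comparison between pattern-frequency vectors. This is a minimal illustration, not the authors' implementation: the use of cosine as the similarity measure, the helper names `cosine` and `best_analogy`, and all vector values are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_analogy(stem_vec, choice_vecs):
    """Return the index of the choice vector most similar to the stem vector."""
    scores = [cosine(stem_vec, c) for c in choice_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical pattern-frequency vectors (made-up counts, four patterns).
stem = [12, 0, 5, 1]      # stands in for mason:stone
choices = [
    [0, 9, 0, 4],
    [10, 1, 6, 0],        # stands in for carpenter:wood
    [1, 1, 1, 1],
    [0, 0, 0, 7],
    [3, 8, 0, 0],
]

print(best_analogy(stem, choices))  # prints 1
```

The same `best_analogy` helper would also serve as the core of a nearest-neighbour classifier: treat labeled training pairs as the choices and return the label of the winning index.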
Event Detection by Eigenvector Decomposition Using Object and Frame
, 2004
Cited by 22 (1 self)
We develop an event detection framework that has two significant advantages over past work. First, we introduce an extended set of time-wise and object-wise statistical features including not only the trajectories but also histograms and HMMs of speed, orientation, location, size, and aspect ratio. The proposed features are more expressive and enable detection of events that cannot be detected with trajectory-based features reported so far. Second, we introduce a spectral clustering method that can estimate the optimal number of clusters automatically. This novel clustering technique is not adversely affected by high dimensionality. Unlike the conventional approaches that fit predefined models to events, we determine unusual events by analyzing the conformity scores. We compute affinity matrices and apply eigenvalue decomposition to find clusters to obtain the usual events. We prove that the number of clusters governs the number of eigenvectors used to span the feature similarity space. We also improve the feature selection process.
Measuring the Similarity between Implicit Semantic Relations from the Web
- WWW 2009 MADRID! TRACK: SEMANTIC/DATA WEB / SESSION: MINING FOR SEMANTICS
, 2009
Cited by 7 (6 self)
Measuring the similarity between semantic relations that hold among entities is an important and necessary step in various Web related tasks such as relation extraction, information retrieval and analogy detection. For example, consider the case in which a person knows a pair of entities (e.g. Google, YouTube), between which a particular relation holds (e.g. acquisition). The person is interested in retrieving other such pairs with similar relations (e.g. Microsoft, Powerset). Existing keyword-based search engines cannot be applied directly in this case because, in keyword-based search, the goal is to retrieve documents that are relevant to the words used in a query – not necessarily to the relations implied by a pair of words. We propose a relational similarity measure, using a Web search engine, to compute the similarity between semantic relations implied by two pairs of words. Our method has three components: representing
Analogical reasoning with relational Bayesian sets
- 11th International Conference on Artificial Intelligence and Statistics, AISTATS
, 2007
Cited by 5 (4 self)
Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. There are many ways in which objects can be related, making automated analogical reasoning very challenging. Here we develop an approach which, given a set of pairs of related objects S = {A_1:B_1, A_2:B_2, ..., A_N:B_N}, measures how well other pairs A:B fit in with the set S. This addresses the question: is the relation between objects A and B analogous to those relations found in S? We recast this classical problem as a problem of Bayesian analysis of relational data. This problem is nontrivial because direct similarity between objects is not a good way of measuring analogies. For instance, the analogy between an electron around the nucleus of an atom and a planet around the Sun is hardly justified by isolated, non-relational, comparisons of an electron to a planet, and a nucleus to the Sun. We develop a generative model for predicting the existence of relationships and extend the framework of Ghahramani and Heller (2005) to provide a Bayesian measure for how analogous a relation is to other relations. This sheds new light on an old problem, which we motivate and illustrate through practical applications in exploratory data analysis.
Cross-dataset Clustering: Revealing Corresponding Themes Across Multiple Corpora
- Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002)
, 2002
Cited by 4 (2 self)
We present a method for identifying corresponding themes across several corpora that are focused on related, but distinct, domains. This task is approached through simultaneous clustering of keyword sets extracted from the analyzed corpora. Our algorithm extends the information bottleneck soft clustering method for a suitable setting consisting of several datasets. Experimentation with topical corpora reveals similar aspects of three distinct religions. The evaluation is by way of comparison to clusters constructed manually by an expert.
The Latent Relation Mapping Engine: Algorithm and Experiments
, 2008
Cited by 4 (1 self)
Many AI researchers and cognitive scientists have argued that analogy is the core of cognition. The most influential work on computational modeling of analogy-making is Structure Mapping Theory (SMT) and its implementation in the Structure Mapping Engine (SME). A limitation of SME is the requirement for complex hand-coded representations. We introduce the Latent Relation Mapping Engine (LRME), which combines ideas from SME and Latent Relational Analysis (LRA) in order to remove the requirement for hand-coded representations. LRME builds analogical mappings between lists of words, using a large corpus of raw text to automatically discover the semantic relations among the words. We evaluate LRME on a set of twenty analogical mapping problems, ten based on scientific analogies and ten based on common metaphors. LRME achieves human-level performance on the twenty problems. We compare LRME with a variety of alternative approaches and find that they are not able to reach the same level of performance.
Cross-component Clustering for Template Induction
, 2002
Cited by 2 (1 self)
We suggest an unsupervised approach to template induction for information extraction, through detecting sub-topics and themes that cut across the documents of a topical corpus. We introduce a new method, cross-component clustering, that simultaneously clusters the components forming our setting, each of which consists of the words of a single article. Our algorithm is derived from the Information Bottleneck clustering algorithm. The resulting clusters are found to be in systematic correspondence with sets of terms that are used in filling the slots of the MUC3/4 ready-made template, which was used for evaluation.
Correspondence Clustering: An Approach to Cluster Multiple Related Spatial Datasets
Cited by 1 (0 self)
Domain experts are frequently interested in analyzing multiple related spatial datasets. This capability is important for change analysis and contrast mining. In this paper, a novel clustering approach called correspondence clustering is introduced that clusters two or more spatial datasets by maximizing cluster interestingness and correspondence between clusters derived from different datasets. A representative-based correspondence clustering framework and clustering algorithms are introduced. In addition, the paper proposes a novel cluster similarity assessment measure that relies on reclustering techniques and co-occurrence matrices. We conducted experiments in which two earthquake datasets had to be clustered by maximizing cluster interestingness and agreement between the spatial clusters obtained. The results show that correspondence clustering can reduce the variance inherent to representative-based clustering algorithms, which is important for reducing the likelihood of false positives in change analysis. Moreover, high agreements could be obtained by only slightly lowering cluster quality.

