Coupled Semi-Supervised Learning for Information Extraction
Cited by 50 (4 self)
We consider the problem of semi-supervised learning to extract categories (e.g., academic fields, athletes) and relations (e.g., PlaysSport(athlete, sport)) from web pages, starting with a handful of labeled training examples of each category or relation, plus hundreds of millions of unlabeled web documents. Semi-supervised training using only a few labeled examples is typically unreliable because the learning task is underconstrained. This paper pursues the thesis that much greater accuracy can be achieved by further constraining the learning task, by coupling the semi-supervised training of many extractors for different categories and relations. We characterize several ways in which the training of category and relation extractors can be coupled, and present experimental results demonstrating significantly improved accuracy as a result.
Entity extraction via ensemble semantics
In Proc. of EMNLP, 2009
Cited by 11 (1 self)
Combining information extraction systems yields significantly higher quality resources than each system in isolation. In this paper, we generalize such a mixing of sources and features in a framework called Ensemble Semantics. We show very large gains in entity extraction by combining state-of-the-art distributional and pattern-based systems with a large set of features from a web crawl, query logs, and Wikipedia. Experimental results on a web-scale extraction of actors, athletes and musicians show significantly higher mean average precision scores (29% gain) compared with the current state of the art.
Learning 5000 relational extractors
In ACL, 2010
Cited by 9 (1 self)
Many researchers are trying to use information extraction (IE) to create large-scale knowledge bases from natural language text on the Web. However, the primary approach (supervised learning of relation-specific extractors) requires manually-labeled training data for each relation and doesn’t scale to the thousands of relations encoded in Web text. This paper presents LUCHS, a self-supervised, relation-specific IE system which learns 5025 relations (more than an order of magnitude greater than any previous approach) with an average F1 score of 61%. Crucial to LUCHS’s performance is an automated system for dynamic lexicon learning, which allows it to learn accurately from heuristically-generated training data, which is often noisy and sparse.
Experiments in Graph-based Semi-Supervised Learning Methods for Class-Instance Acquisition
Cited by 7 (0 self)
Graph-based semi-supervised learning (SSL) algorithms have been successfully used to extract class-instance pairs from large unstructured and structured text collections. However, a careful comparison of different graph-based SSL algorithms on that task has been lacking. We compare three graph-based SSL algorithms for class-instance acquisition on a variety of graphs constructed from different domains. We find that the recently proposed MAD algorithm is the most effective. We also show that class-instance extraction can be significantly improved by adding semantic information in the form of instance-attribute edges derived from an independently developed knowledge base. All of our code and data will be made publicly available to encourage reproducible research in this area.
Not All Seeds Are Equal: Measuring the Quality of Text Mining Seeds
Cited by 4 (0 self)
Open-class semantic lexicon induction is of great interest for current knowledge harvesting algorithms. We propose a general framework that uses patterns in bootstrapping fashion to learn open-class semantic lexicons for different kinds of relations. These patterns require seeds. To estimate the goodness (the potential yield) of new seeds, we introduce a regression model that considers the connectivity behavior of the seed during bootstrapping. The generalized regression model is evaluated on six different kinds of relations with over 10000 different seeds for English and Spanish patterns. Our approach reaches robust performance
Power Iteration Clustering
Cited by 4 (3 self)
We show that the power iteration, typically used to approximate the dominant eigenvector of a matrix, can be applied to a normalized affinity matrix to create a one-dimensional embedding of the underlying data. This embedding is then used, as in spectral clustering, to cluster the data via k-means. We demonstrate this method’s effectiveness and scalability on several synthetic and real datasets, and conclude that to find a meaningful low-dimensional embedding for clustering, it is not necessary to find any eigenvectors; we just need a linear combination of the top eigenvectors.
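The pipeline this abstract describes (normalize an affinity matrix, run a few power-iteration steps, cluster the resulting one-dimensional embedding with k-means) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Gaussian affinity, the degree-based start vector, the fixed iteration count, and all function names are assumptions made for the example.

```python
import math


def gaussian_affinity(points, sigma=1.0):
    """Pairwise Gaussian affinities with a zero diagonal (an assumed similarity choice)."""
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                A[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return A


def pic_embedding(A, num_iters=10):
    """Run a small, fixed number of power-iteration steps on the row-normalized matrix."""
    n = len(A)
    degrees = [sum(row) for row in A]
    W = [[A[i][j] / degrees[i] for j in range(n)] for i in range(n)]
    total = sum(degrees)
    v = [d / total for d in degrees]  # degree-based start vector (assumption)
    for _ in range(num_iters):
        v = [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in v)  # L1 normalization keeps the values in scale
        v = [x / norm for x in v]
    return v


def kmeans_1d(values, k, num_iters=20):
    """Plain k-means on scalars, with centers initialized evenly between min and max."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * c / max(k - 1, 1) for c in range(k)]
    labels = [0] * len(values)
    for _ in range(num_iters):
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in values]
        for c in range(k):
            members = [x for x, lab in zip(values, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = sum(members) / len(members)
    return labels


def power_iteration_clustering(points, k, sigma=1.0, num_iters=10):
    """Affinity matrix -> truncated power iteration -> k-means on the 1-D embedding."""
    return kmeans_1d(pic_embedding(gaussian_affinity(points, sigma), num_iters), k)
```

The iteration is deliberately stopped early. Run to convergence, the vector approaches the dominant eigenvector of the row-normalized matrix, which is constant and carries no cluster information; the useful signal is in the intermediate iterates, where points in the same cluster have already settled on nearly equal values.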
Semi-Supervised Classification of Network Data Using Very Few Labels
2009
Cited by 3 (2 self)
The goal of semi-supervised learning methods is to reduce the amount of labeled training data required by learning from both labeled and unlabeled instances. We make contributions towards this goal along several dimensions. Macskassy and Provost [13] proposed the weighted-vote relational neighbor classifier (wvRN) as a simple yet solid baseline for semi-supervised learning on network data. It is shown to be essentially the same as the Gaussian-field classifier proposed by Zhu et al. [22] and proves to be very effective on many benchmark network datasets. First, we describe another simple and intuitive semi-supervised learning method based on random graph walk that outperforms wvRN by a large margin on several benchmark datasets when very few labels are available. Second, we show that using authoritative instances as training seeds (instances that arguably cost much less to label) dramatically reduces the amount of labeled data required to achieve the same classification accuracy. For some existing state-of-the-art semi-supervised learning methods the labeled data needed is reduced by a factor of 50. Third, we offer insights as to why learning methods based on random graph walk are able to more fully exploit the unlabeled data than previous methods. Based on the above observations, we strongly recommend the proposed method as a strong baseline for future research on semi-supervised classification of network data.
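For readers unfamiliar with the wvRN baseline named in this abstract, the sketch below shows the commonly described form of the idea for a two-class problem: each unlabeled node's class score is repeatedly replaced by the weighted average of its neighbors' scores, while seed nodes stay clamped to their labels. It is an assumption-laden illustration (the function name, the dictionary-based graph format, the prior-based initialization, and the fixed iteration count are all hypothetical) and is not the random-walk method the paper proposes.

```python
def wvrn(adj, seeds, num_iters=50):
    """Weighted-vote relational neighbor sketch for two classes.

    adj:   dict mapping node -> dict of neighbor -> edge weight
    seeds: dict mapping labeled node -> 0 or 1
    Returns a dict mapping node -> estimated probability of class 1.
    """
    # Unlabeled nodes start at the seed class prior (an assumption of this sketch).
    prior = sum(seeds.values()) / len(seeds)
    p = {u: float(seeds[u]) if u in seeds else prior for u in adj}
    for _ in range(num_iters):
        new_p = {}
        for u in adj:
            if u in seeds:
                new_p[u] = float(seeds[u])  # labeled nodes stay clamped
                continue
            total = sum(adj[u].values())
            if total:
                # Weighted vote: average the neighbors' current scores.
                new_p[u] = sum(w * p[v] for v, w in adj[u].items()) / total
            else:
                new_p[u] = p[u]  # isolated node keeps its score
        p = new_p
    return p
```

On a four-node chain `a - b - c - d` with `a` labeled 1 and `d` labeled 0, the scores of `b` and `c` settle near 2/3 and 1/3. This is the harmonic solution, which is consistent with the abstract's remark that wvRN is essentially the same as the Gaussian-field classifier.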
Semi-supervised learning of semantic classes for query . . .
2009
Cited by 3 (1 self)
Understanding intents from search queries can improve a user’s search experience and boost a site’s advertising profits. Query tagging via statistical sequential labeling models has been shown to perform well, but annotating the training set for supervised learning requires substantial human effort. Domain-specific knowledge, such as semantic class lexicons, reduces the amount of needed manual annotations, but much human effort is still required to maintain these as search topics evolve over time. This paper investigates semi-supervised learning algorithms that leverage structured data (HTML lists) from the Web to automatically generate semantic-class lexicons, which are used to improve query tagging performance – even with far less training data. We focus our study on understanding
A Very Fast Method for Clustering Big Text Datasets
Cited by 2 (2 self)
Large-scale text datasets have long eluded a family of particularly elegant and effective clustering methods that exploit the power of pair-wise similarities between data points, because operating on a similarity matrix is prohibitively costly in both time and space: the state of the art is at best quadratic in time and in space. We present an extremely fast and simple method that also uses the power of all pair-wise similarities between data points, and show through experiments that it does as well as previous methods in clustering accuracy, and that it does so in linear time and space, without sampling data points or sparsifying the similarity matrix.
Factrank: Random walks on a web of facts
In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), 2010
Cited by 2 (0 self)
Fact collections are mostly built using semi-supervised relation extraction techniques and wisdom of the crowds methods, rendering them inherently noisy. In this paper, we propose to validate the resulting facts by leveraging global constraints inherent in large fact collections, observing that correct facts will tend to match their arguments with other facts more often than with incorrect ones. We model this intuition as a graph-ranking problem over a fact graph and explore novel random walk algorithms. We present an empirical study, over a large set of facts extracted from a 500 million document web crawl, validating the model and showing that it improves fact quality over state-of-the-art methods.

