Automatically Refining the Wikipedia Infobox Ontology
2008
Cited by 43 (7 self)
The combined efforts of human volunteers have recently extracted numerous facts from Wikipedia, storing them as machine-harvestable object-attribute-value triples in Wikipedia infoboxes. Machine learning systems, such as Kylin, use these infoboxes as training data, accurately extracting even more semantic knowledge from natural language text. But in order to realize the full power of this information, it must be situated in a cleanly-structured ontology. This paper introduces KOG, an autonomous system for refining Wikipedia’s infobox-class ontology towards this end. We cast the problem of ontology refinement as a machine learning problem and solve it using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. We present experiments demonstrating the superiority of the joint-inference approach and evaluating other aspects of our system. Using these techniques, we build a rich ontology, integrating Wikipedia’s infobox-class schemata with WordNet. We demonstrate how the resulting ontology may be used to enhance Wikipedia with improved query processing and other features.
Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
Cited by 17 (5 self)
Supervised sequence-labeling systems in natural language processing often suffer from data sparsity because they use word types as features in their prediction tasks. Consequently, they have difficulty estimating parameters for types which appear in the test set, but seldom (or never) appear in the training set. We demonstrate that distributional representations of word types, trained on unannotated text, can be used to improve performance on rare words. We incorporate aspects of these representations into the feature space of our sequence-labeling systems. In an experiment on a standard chunking dataset, our best technique improves a chunker from 0.76 F1 to 0.86 F1 on chunks beginning with rare words. On the same dataset, it improves our part-of-speech tagger from 74% to 80% accuracy on rare words. Furthermore, our system improves significantly over a baseline system when applied to text from a different domain, and it reduces the sample complexity of sequence labeling.
Unsupervised methods for determining object and relation synonyms on the web
Journal of Artificial Intelligence Research, 2009
Cited by 13 (2 self)
The task of identifying synonymous relations and objects, or synonym resolution, is critical for high-quality information extraction. This paper investigates synonym resolution in the context of unsupervised information extraction, where neither hand-tagged training examples nor domain knowledge is available. The paper presents a scalable, fully implemented system that runs in O(KN log N) time in the number of extractions, N, and the maximum number of synonyms per word, K. The system, called Resolver, introduces a probabilistic relational model for predicting whether two strings are co-referential based on the similarity of the assertions containing them. On a set of two million assertions extracted from the Web, Resolver resolves objects with 78% precision and 68% recall, and resolves relations with 90% precision and 35% recall. Several variations of Resolver’s probabilistic model are explored, and experiments demonstrate that under appropriate conditions these variations can improve F1 by 5%. An extension to the basic Resolver system allows it to handle polysemous names with 97% precision and 95% recall on a data set from the TREC corpus.
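For reference, the precision and recall figures quoted for Resolver can be combined into F1 (the harmonic mean of the two), the metric the abstract uses when it reports improvements. The following is a minimal sketch in Python; the function name `f1_score` and the `reported` dictionary are illustrative and do not come from the paper, and the F1 values it prints are derived here from the quoted figures, not stated by the authors.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs quoted in the abstract (assumed to be on a 0-1 scale).
reported = {
    "objects": (0.78, 0.68),
    "relations": (0.90, 0.35),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
# objects: F1 = 0.727
# relations: F1 = 0.504
```

Because F1 is a harmonic mean, the low recall on relations pulls their F1 well below the 90% precision figure.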
What is this, anyway: Automatic hypernym discovery
In Proceedings of AAAI-09 Spring Symposium on Learning, 2009
Cited by 10 (0 self)
Can a system that “learns from reading” figure out on its own the semantic classes of arbitrary noun phrases? This is essential for text understanding, given the limited coverage of proper nouns in lexical resources such as WordNet. Previous methods that use lexical patterns to discover hypernyms suffer from limited precision and recall. We present methods based on lexical patterns that find hypernyms of arbitrary noun phrases with high precision. This more than doubles the recall of proper noun hypernyms provided by WordNet at a modest cost to precision. We also present a novel method using a Hidden Markov Model (HMM) to extend recall further.
Information Extraction from the Web: Techniques and Applications
2007
Cited by 6 (1 self)
Web Information Extraction (WIE) systems have recently been able to extract massive quantities of relational data from online text. This has opened the possibility of achieving an elusive goal in Artificial Intelligence (AI): broad-coverage domain knowledge. AI systems depend to a great extent on having knowledge about the domains in which they operate, and such knowledge is typically expensive to enter into the system. Furthermore, the knowledge must be entered for every different domain in which an application is to operate. The Web contains knowledge about all kinds of different domains, but in a format that is not readily usable by AI systems. WIE promises to bridge the gap between the Web and AI.
Natural Language Processing is an example of an area in AI in which knowledge can make a dramatic difference in the performance of an application. Understanding or interpreting language depends on the ability to understand the words used in a domain. The meanings, usages, and syntactic properties of words, and the relative frequency with which certain words are used, are necessary pieces of information for effective language processing, and much of this information can be extracted from text. In one case study, this thesis examines methods for using extracted information in improving a particular kind of language processing tool, a parser.
Before information extraction can become broadly useful, however, more research must be done to improve the quality of the extracted information. A number of factors affect the quality, including correctness, importance or relevance, and the sophistication of meaning representation. The second case study in this thesis investigates a method for resolving synonyms in extracted information. This technique changes the meaning representation of extractions from one that relates words or names to one that relates entities to one another.
Intelligence in Wikipedia
Cited by 3 (0 self)
The Intelligence in Wikipedia project at the University of Washington is combining self-supervised information extraction (IE) techniques with a mixed initiative interface designed to encourage communal content creation (CCC). Since IE and CCC are each powerful ways to produce large amounts of structured information, they have been studied extensively — but only in isolation. By combining the two methods in a virtuous feedback cycle, we aim for substantial synergy. While previous papers have described the details of individual aspects of our endeavor [25, 26, 24, 13], this report provides an overview of the project’s progress and vision.
Analysis of a Probabilistic Model of Redundancy in Unsupervised Information Extraction
2010
Cited by 2 (0 self)
Unsupervised Information Extraction (UIE) is the task of extracting knowledge from text without the use of hand-labeled training examples. Because UIE systems do not require human intervention, they can recursively discover new relations, attributes, and instances in a scalable manner. When applied to massive corpora such as the Web, UIE systems present an approach to a primary challenge in artificial intelligence: the automatic accumulation of massive bodies of knowledge. A fundamental problem for a UIE system is assessing the probability that its extracted information is correct. In massive corpora such as the Web, the same extraction is found repeatedly in different documents. How does this redundancy impact the probability of correctness? We present a combinatorial “balls-and-urns” model, called Urns, that computes the impact of sample size, redundancy, and corroboration from multiple distinct extraction rules on the probability that an extraction is correct. We describe methods for estimating Urns’s parameters in practice and demonstrate experimentally that for UIE the model’s log likelihoods are 15 times better, on average, than those obtained by methods used in previous work. We illustrate the generality of the redundancy model by detailing multiple applications beyond UIE in which Urns has been effective. We also provide a theoretical foundation for Urns’s performance, including a theorem showing that PAC Learnability in Urns is guaranteed without hand-labeled data, under certain assumptions.
Improved extraction assessment through better language models
In Human Language Technologies: Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT), 2010
Cited by 1 (1 self)
A variety of information extraction techniques rely on the fact that instances of the same relation are “distributionally similar,” in that they tend to appear in similar textual contexts. We demonstrate that extraction accuracy depends heavily on the accuracy of the language model utilized to estimate distributional similarity. An unsupervised model selection technique based on this observation is shown to reduce extraction and type-checking error by 26% over previous results, in experiments with Hidden Markov Models. The results suggest that optimizing statistical language models over unlabeled data is a promising direction for improving weakly supervised and unsupervised information extraction.
Relation Validation via Textual Entailment
This paper addresses a subtask of relation extraction, namely Relation Validation. Relation validation can be described as follows: given an instance of a relation and a relevant text fragment, the system is asked to decide whether this instance is true or not. Instead of following the common approaches of using statistical or context features directly, we propose a method based on textual entailment (called ReVaS). We set up two different experiments to test our system: one is based on an annotated data set; the other is based on real web data via the integration of ReVaS with an existing IE system. For the latter case, we examine in detail two aspects of the validation process, i.e., directionality and strictness. The results suggest that textual entailment is a feasible approach to the relation validation task.

