Open information extraction from the web
- IN IJCAI
, 2007
Cited by 172 (33 self)
Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations. This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries. We report on experiments over a 9,000,000 Web page corpus that compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KNOWITALL to perform extraction for a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TEXTRUNNER’s 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.
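The abstract describes tuples that carry a probability and are indexed for user queries. A minimal sketch of that idea follows, assuming a simple in-memory inverted index; the `Extraction` and `TupleIndex` names, the scoring values, and the sample tuples are hypothetical and do not reflect how TEXTRUNNER is actually implemented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    """One relational tuple with an assigned probability (hypothetical layout)."""
    arg1: str
    relation: str
    arg2: str
    probability: float


class TupleIndex:
    """Toy inverted index: every lowercased word points to the tuples containing it."""

    def __init__(self):
        self._by_term = defaultdict(set)

    def add(self, extraction):
        # Index the words of all three fields so any of them can be queried.
        for field in (extraction.arg1, extraction.relation, extraction.arg2):
            for term in field.lower().split():
                self._by_term[term].add(extraction)

    def search(self, term, min_probability=0.0):
        # Keep tuples at or above the threshold, most probable first.
        hits = [
            e for e in self._by_term.get(term.lower(), ())
            if e.probability >= min_probability
        ]
        return sorted(hits, key=lambda e: e.probability, reverse=True)


# Made-up tuples and probabilities, for illustration only.
index = TupleIndex()
index.add(Extraction("Edison", "invented", "the light bulb", 0.93))
index.add(Extraction("Bell", "invented", "the telephone", 0.88))
index.add(Extraction("the telephone", "was invented in", "Boston", 0.41))

confident = index.search("invented", min_probability=0.5)
```

Here `confident` holds only the two tuples whose probability is at least 0.5, ordered from most to least probable. A Web-scale system would need a disk-backed, distributed index rather than a dictionary of sets.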
Information Extraction from the Web: Techniques and Applications
, 2007
Cited by 6 (1 self)
Web Information Extraction (WIE) systems have recently been able to extract massive quantities of relational data from online text. This has opened the possibility of achieving an elusive goal in Artificial Intelligence (AI): broad-coverage domain knowledge. AI systems depend to a great extent on having knowledge about the domains in which they operate, and such knowledge is typically expensive to enter into the system. Furthermore, the knowledge must be entered for every different domain in which an application is to operate. The Web contains knowledge about all kinds of different domains, but in a format that is not readily usable by AI systems. WIE promises to bridge the gap between the Web and AI.
Natural Language Processing is an example of an area in AI in which knowledge can make a dramatic difference in the performance of an application. Understanding or interpreting language depends on the ability to understand the words used in a domain. The meanings, usages, and syntactic properties of words, and the relative frequency with which certain words are used, are necessary pieces of information for effective language processing, and much of this information can be extracted from text. In one case study, this thesis examines methods for using extracted information in improving a particular kind of language processing tool, a parser.
Before information extraction can become broadly useful, however, more research must be done to improve the quality of the extracted information. A number of factors affect the quality, including correctness, importance or relevance, and the sophistication of meaning representation. The second case study in this thesis investigates a method for resolving synonyms in extracted information. This technique changes the meaning representation of extractions from one that relates words or names to one that relates entities to one another.
Identifying Relations for Open Information Extraction
, 2011
Cited by 4 (2 self)
Open Information Extraction (IE) is the task of extracting assertions from massive corpora without requiring a pre-specified vocabulary. This paper shows that the output of state-of-the-art Open IE systems is rife with uninformative and incoherent extractions. To overcome these problems, we introduce two simple syntactic and lexical constraints on binary relations expressed by verbs. We implemented the constraints in the REVERB Open IE system, which more than doubles the area under the precision-recall curve relative to previous extractors such as TEXTRUNNER and WOE^pos. More than 30% of REVERB's extractions are at precision 0.8 or higher, compared to virtually none for earlier systems. The paper concludes with a detailed analysis of REVERB's errors, suggesting directions for future work.
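The abstract does not spell out the constraints, so the following is only a sketch of what a syntactic constraint on verb-based relation phrases could look like. It assumes a simplified part-of-speech pattern (a verb, optionally followed by intervening words and a closing preposition) over hand-tagged tokens; the tag set, `TAG_CODE`, `RELATION_PATTERN`, and `longest_relation_phrase` are all hypothetical, and the lexical constraint is not modeled at all.

```python
import re

# Map each coarse tag to one letter so a tag sequence becomes a plain string.
# The tag inventory here is an assumption made for this sketch.
TAG_CODE = {
    "VERB": "V",
    "PREP": "P",
    "NOUN": "W",
    "ADJ": "W",
    "ADV": "W",
    "PRON": "W",
    "DET": "W",
}

# Assumed simplified constraint: a verb alone, or a verb followed by
# any number of intervening words and a closing preposition.
RELATION_PATTERN = re.compile(r"V(W*P)?")


def longest_relation_phrase(tagged):
    """Return the longest token span whose tags satisfy the pattern, or None."""
    codes = "".join(TAG_CODE.get(tag, "x") for _, tag in tagged)
    best = None
    for match in RELATION_PATTERN.finditer(codes):
        if best is None or match.end() - match.start() > best[1] - best[0]:
            best = (match.start(), match.end())
    if best is None:
        return None
    return " ".join(word for word, _ in tagged[best[0]:best[1]])


# Hand-tagged example sentence; a real system would use a POS tagger.
sentence = [
    ("Faust", "NOUN"), ("made", "VERB"), ("a", "DET"), ("deal", "NOUN"),
    ("with", "PREP"), ("the", "DET"), ("devil", "NOUN"),
]
phrase = longest_relation_phrase(sentence)
```

With this input, the pattern selects "made a deal with" as the relation phrase rather than the bare verb "made", which illustrates how a syntactic pattern can avoid truncated, uninformative relations.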
Adapting Open Information Extraction to Domain-Specific Relations
, 2010
Cited by 3 (3 self)
Information extraction (IE) can identify a set of relations from free text to support question answering (QA). Until recently, IE systems were domain specific and needed a combination of manual engineering and supervised learning to adapt to each target domain. A new paradigm, Open IE, operates on large text corpora without any manual tagging of relations, and indeed without any prespecified relations. Due to its open-domain and open-relation nature, Open IE is purely textual and is unable to relate the surface forms to an ontology, if known in advance. We explore the steps needed to adapt Open IE to a domain-specific ontology and demonstrate our approach of mapping domain-independent tuples to an ontology using domains from the DARPA Machine Reading Project. Our system achieves precision over 0.90 from as few as eight training examples for an NFL-scoring domain.
Learning and Evaluating the Content and Structure of a Term Taxonomy
- IN AAAI-09 SPRING SYMPOSIUM ON LEARNING BY READING AND LEARNING TO READ
, 2009
Cited by 2 (1 self)
In this paper, we describe a weakly supervised bootstrapping algorithm that reads Web texts and learns taxonomy terms. The bootstrapping algorithm starts with two seed words (a seed hypernym (Root concept) and a seed hyponym) that are inserted into a doubly anchored hyponym pattern. In alternating rounds, the algorithm learns new hyponym terms and new hypernym terms that are subordinate to the Root concept. We conducted an extensive evaluation with human annotators to evaluate the learned hyponym and hypernym terms for two categories: animals and people.
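As a rough illustration of inserting two seed words into a doubly anchored hyponym pattern, the sketch below assumes the pattern has the form "<hypernym> such as <hyponym> and X", where X is the open slot for a new hyponym. The function names, the regular expression, and the sample sentences are hypothetical, and only a single hyponym-learning round is shown, not the alternating bootstrapping loop.

```python
import re


def doubly_anchored_pattern(hypernym, hyponym):
    """Build a regex for '<hypernym> such as <hyponym> and X' (assumed pattern form)."""
    return re.compile(
        rf"\b{re.escape(hypernym)} such as {re.escape(hyponym)} and (\w+)",
        re.IGNORECASE,
    )


def harvest_hyponyms(texts, hypernym, hyponym):
    """Collect the words that fill the open slot of the pattern in any text."""
    pattern = doubly_anchored_pattern(hypernym, hyponym)
    found = set()
    for text in texts:
        for match in pattern.finditer(text):
            found.add(match.group(1).lower())
    return found


# Made-up sentences standing in for Web text.
sample = [
    "Zoos keep animals such as lions and tigers in large enclosures.",
    "Farm animals such as lions and cows rarely share a pen.",
    "Fruits such as apples and pears are sold here.",
]
candidates = harvest_hyponyms(sample, "animals", "lions")
```

Anchoring on both the hypernym and a known hyponym is what keeps the third sentence from contributing: it matches neither seed, so "pears" is never proposed as an animal.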
Interactive Entity and Relation Based Document Retrieval
A common text analysis task is to identify entities and relationships within a discourse. Text-based information retrieval is limited in this capacity, as searching for entities or relationships generally results in biographical or definitional information, respectively. To discover relationships between entities in a corpus, a skilled information analyst must either read many documents or be able to formulate arcane queries to accommodate the text-based search engine. For example, let us suppose that the information need is marital information for potential U.S. presidential candidates. Ideally, the search interface for this information need should be [U.S. Presidential Candidates] [Marriage] [?x], where ?x is an unbounded variable and the results would be a list of candidates and their spouses. In this case, the information need was between a particular word class and a specific relationship. Another example would be between a particular entity and a specific relationship, such as [Newt Gingrich] [Marriage] [?x], which would possibly return the set of sentences containing Jackie Battley, Marianne Ginther, or Callista Bisek, depending on the corpus. We may also be interested in searching for relationships between a set of entities, such as the query [Bill Clinton] [?x] [U.S. Presidential Candidates], which may return text corresponding to his endorsement of Al Gore during the 2000 presidential election or that he is married to Hillary Clinton, once again depending on the corpus. The goal of this project is to design an information retrieval system capable of enabling searches
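The bracketed queries above can be read as patterns over (entity, relationship, entity) triples in which a slot such as ?x is a variable. The sketch below shows one way such matching could work, assuming triples have already been extracted from a corpus; `match_query` and the triple store are hypothetical, the stored triples are illustrative stand-ins built from the names in the abstract, and word classes such as [U.S. Presidential Candidates] are not handled.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def match_query(query: Triple, triples: Iterable[Triple]) -> List[Dict[str, str]]:
    """Return one variable-binding dict per stored triple that matches the query.

    Slots starting with '?' are variables; every other slot must match exactly.
    """
    results = []
    for triple in triples:
        bindings: Dict[str, str] = {}
        matched = True
        for slot, value in zip(query, triple):
            if slot.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if bindings.setdefault(slot, value) != value:
                    matched = False
                    break
            elif slot != value:
                matched = False
                break
        if matched:
            results.append(bindings)
    return results


# Illustrative triple store; what a real system returns depends on its corpus.
store = [
    ("Newt Gingrich", "Marriage", "Jackie Battley"),
    ("Newt Gingrich", "Marriage", "Callista Bisek"),
    ("Bill Clinton", "Marriage", "Hillary Clinton"),
    ("Bill Clinton", "Endorsement", "Al Gore"),
]

spouses = match_query(("Newt Gingrich", "Marriage", "?x"), store)
relations = match_query(("Bill Clinton", "?x", "Al Gore"), store)
```

Supporting a word class in a slot would additionally require a membership test (for example, a set of known candidates) in place of the exact string comparison.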
Open Information Extraction: the Second Generation
How do we scale information extraction to the massive size and unprecedented heterogeneity of the Web corpus? Beginning in 2003, our KnowItAll project has sought to extract high-quality knowledge from the Web. In 2007, we introduced the Open Information Extraction (Open IE) paradigm, which eschews hand-labeled training examples, and avoids domain-specific verbs and nouns, to develop unlexicalized, domain-independent extractors that scale to the Web corpus. Open IE systems have extracted billions of assertions as the basis for both commonsense knowledge and novel question-answering systems. This paper describes the second generation of Open IE systems, which rely on a novel model of how relations and their arguments are expressed in English sentences to double precision/recall compared with previous systems such as TEXTRUNNER and WOE.
Leveraging Knowledge Bases in Web Text Processing
, 2012
The Web contains more text than any other source in human history, and continues to expand rapidly. Computer algorithms to process and extract knowledge from Web text have the potential not only to improve Web search, but also to collect a sizable fraction of human knowledge and use it to enable smarter artificial intelligence. To scale to the size and diversity of the Web, many Web text processing algorithms use domain-independent statistical approaches, rather than limiting their processing to any fixed ontologies or sets of domains. While traditional knowledge bases (KBs) had limited coverage of general knowledge, the last few years have seen the rapid rise of new KBs like Freebase and Wikipedia that now cover millions of general interest topics. While these KBs still do not cover the full diversity of the Web, this thesis demonstrates that they are now close enough that there are ways to effectively leverage them in domain-independent Web text processing. It presents and empirically verifies how these KBs can be used to filter uninteresting Web extractions, enhance understanding and usability of both extracted relations and extracted entities, and even power new functionality for Web search. The effective integration of KBs with

