Open information extraction from the web
- In IJCAI, 2007
Cited by 172 (33 self)
Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations. This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries. We report on experiments over a 9,000,000 Web page corpus that compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KNOWITALL to perform extraction for a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TEXTRUNNER’s 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.
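The abstract above describes tuples that are assigned a probability and indexed for user queries. A minimal sketch of that idea follows, assuming a simple token-level inverted index. The names `Extraction` and `TupleIndex` are hypothetical, the sample tuples and their probabilities are invented for illustration, and none of this is taken from the TEXTRUNNER implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    """One relational tuple with a confidence score (hypothetical schema)."""
    arg1: str
    relation: str
    arg2: str
    probability: float


class TupleIndex:
    """Toy inverted index: each lowercase token maps to the tuples containing it."""

    def __init__(self):
        self._by_token = defaultdict(set)

    def add(self, extraction):
        # Index every token of both arguments and of the relation phrase.
        for field in (extraction.arg1, extraction.relation, extraction.arg2):
            for token in field.lower().split():
                self._by_token[token].add(extraction)

    def query(self, text, min_probability=0.0):
        # Return tuples containing all query tokens, most probable first.
        tokens = text.lower().split()
        if not tokens:
            return []
        hits = set.intersection(*(self._by_token.get(t, set()) for t in tokens))
        kept = [e for e in hits if e.probability >= min_probability]
        return sorted(kept, key=lambda e: (-e.probability, e.arg1, e.relation, e.arg2))


# Illustrative tuples only; the probabilities are made up.
index = TupleIndex()
index.add(Extraction("Edison", "invented", "the phonograph", 0.9))
index.add(Extraction("Edison", "was born in", "Ohio", 0.6))
for hit in index.query("edison", min_probability=0.5):
    print(hit)
```

Filtering on `min_probability` at query time is one possible design choice; a real system could equally prune low-confidence tuples before indexing.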
Coupled Semi-Supervised Learning for Information Extraction
Cited by 50 (4 self)
We consider the problem of semi-supervised learning to extract categories (e.g., academic fields, athletes) and relations (e.g., PlaysSport(athlete, sport)) from web pages, starting with a handful of labeled training examples of each category or relation, plus hundreds of millions of unlabeled web documents. Semi-supervised training using only a few labeled examples is typically unreliable because the learning task is underconstrained. This paper pursues the thesis that much greater accuracy can be achieved by further constraining the learning task, by coupling the semi-supervised training of many extractors for different categories and relations. We characterize several ways in which the training of category and relation extractors can be coupled, and present experimental results demonstrating significantly improved accuracy as a result.
Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning - knowledge acquisition
Sparse information extraction: Unsupervised language models to the rescue
- In Proc. of ACL, 2007
Cited by 19 (6 self)
Even in a massive corpus such as the Web, a substantial fraction of extractions appear infrequently. This paper shows how to assess the correctness of sparse extractions by utilizing unsupervised language models. The REALM system, which combines HMM-based and n-gram-based language models, ranks candidate extractions by the likelihood that they are correct. Our experiments show that REALM reduces extraction error by 39%, on average, when compared with previous work. Because REALM pre-computes language models based on its corpus and does not require any hand-tagged seeds, it is far more scalable than approaches that learn models for each individual relation from hand-tagged data. Thus, REALM is ideally suited for open information extraction where the relations of interest are not specified in advance and their number is potentially vast.
Detecting and summarizing action items in multi-party dialogue
- In Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue, 2007
Cited by 18 (8 self)
This paper addresses the problem of identifying action items discussed in open-domain conversational speech, and does so in two stages: firstly, detecting the subdialogues in which action items are proposed, discussed and committed to; and secondly, extracting the phrases that accurately capture or summarize the tasks they involve. While the detection problem is hard, we show that we can improve accuracy by taking account of dialogue structure. We then describe a semantic parser that identifies potential summarizing phrases, and show that for some task properties these can be more informative than plain utterance transcriptions.
Web-Scale Distributional Similarity and Entity Set Expansion
Cited by 13 (0 self)
Computing the pairwise semantic similarity between all words on the Web is a computationally challenging task. Parallelization and optimizations are necessary. We propose a highly scalable implementation based on distributional similarity, implemented in the MapReduce framework and deployed over a 200 billion word crawl of the Web. The pairwise similarity between 500 million terms is computed in 50 hours using 200 quad-core nodes. We apply the learned similarity matrix to the task of automatic set expansion and present a large empirical study to quantify the effect on expansion performance of corpus size, corpus quality, seed composition and seed size. We make public an experimental testbed for set expansion analysis that includes a large collection of diverse entity sets extracted from Wikipedia.
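As a rough, single-machine illustration of distributional similarity and set expansion as described in the abstract above, the sketch below builds bag-of-context vectors, compares them by cosine similarity, and ranks candidate terms by their mean similarity to the seeds. The window size, the raw-count weighting, the averaging rule, and the toy corpus are all assumptions made for this example; the paper's MapReduce implementation and its actual feature weighting are not reproduced here.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each term to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[term][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_set(seeds, vectors, top_k=3):
    """Rank non-seed terms by mean cosine similarity to the seed terms."""
    known = [s for s in seeds if s in vectors]
    scores = {}
    for term, vec in vectors.items():
        if term in seeds or not known:
            continue
        scores[term] = sum(cosine(vec, vectors[s]) for s in known) / len(known)
    # Break ties alphabetically so the ranking is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Toy corpus invented for illustration.
TOY_CORPUS = [
    "i drank hot coffee this morning",
    "i drank hot tea this morning",
    "i drank hot cocoa this morning",
    "she parked the car outside",
]

vectors = context_vectors(TOY_CORPUS)
print(expand_set({"coffee", "tea"}, vectors, top_k=1))
```

On this toy corpus "cocoa" shares exactly the same contexts as the seeds, so it ranks first. The all-pairs comparison here is quadratic in the vocabulary, which is the cost the abstract says requires parallelization at Web scale.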
Information Extraction from the Web: Techniques and Applications
- 2007
Cited by 6 (1 self)
Web Information Extraction (WIE) systems have recently been able to extract massive quantities of relational data from online text. This has opened the possibility of achieving an elusive goal in Artificial Intelligence (AI): broad-coverage domain knowledge. AI systems depend to a great extent on having knowledge about the domains in which they operate, and such knowledge is typically expensive to enter into the system. Furthermore, the knowledge must be entered for every different domain in which an application is to operate. The Web contains knowledge about all kinds of different domains, but in a format that is not readily usable by AI systems. WIE promises to bridge the gap between the Web and AI.
Natural Language Processing is an example of an area in AI in which knowledge can make a dramatic difference in the performance of an application. Understanding or interpreting language depends on the ability to understand the words used in a domain. The meanings, usages, and syntactic properties of words, and the relative frequency with which certain words are used, are necessary pieces of information for effective language processing, and much of this information can be extracted from text. In one case study, this thesis examines methods for using extracted information in improving a particular kind of language processing tool, a parser.
Before information extraction can become broadly useful, however, more research must be done to improve the quality of the extracted information. A number of factors affect the quality, including correctness, importance or relevance, and the sophistication of meaning representation. The second case study in this thesis investigates a method for resolving synonyms in extracted information. This technique changes the meaning representation of extractions from one that relates words or names to one that relates entities to one another.
Named Entity Recognition in Tweets: An Experimental Study
- 2011
Cited by 6 (3 self)
People tweet more than 100 million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-NER system doubles F1 score compared with the Stanford NER system. T-NER leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms co-training, increasing F1 by 25% over ten common entity types. Our NLP tools are available at:
Domain-Independent Entity Extraction from Web Search Query Logs
Cited by 4 (0 self)
Query logs of a Web search engine have been increasingly used as a vital source for data mining. This paper presents a study on large-scale domain-independent entity extraction from search query logs. We present a completely unsupervised method to extract entities by applying pattern-based heuristics and statistical measures. We compare against existing techniques that use Web documents as well as search logs, and show that we improve over the state of the art. We also provide an in-depth qualitative analysis outlining differences and commonalities between these methods.
Probase: A Probabilistic Taxonomy for Text Understanding
Cited by 2 (2 self)
Knowledge is indispensable to understanding. The ongoing information explosion highlights the need to enable machines to better understand electronic text in human language. Much work has been devoted to creating universal ontologies or taxonomies for this purpose. However, none of the existing ontologies has the needed depth and breadth for “universal understanding”. In this paper, we present a universal, probabilistic taxonomy that is more comprehensive than any existing one. It contains 2.7 million concepts harnessed automatically from a corpus of 1.68 billion web pages. Unlike traditional taxonomies that treat knowledge as black and white, it uses probabilities to model the inconsistent, ambiguous and uncertain information it contains. We present details of how the taxonomy is constructed, its probabilistic modeling, and its potential applications in text understanding.

