Coupled Semi-Supervised Learning for Information Extraction
Cited by 50 (4 self)
We consider the problem of semi-supervised learning to extract categories (e.g., academic fields, athletes) and relations (e.g., PlaysSport(athlete, sport)) from web pages, starting with a handful of labeled training examples of each category or relation, plus hundreds of millions of unlabeled web documents. Semi-supervised training using only a few labeled examples is typically unreliable because the learning task is underconstrained. This paper pursues the thesis that much greater accuracy can be achieved by further constraining the learning task, by coupling the semi-supervised training of many extractors for different categories and relations. We characterize several ways in which the training of category and relation extractors can be coupled, and present experimental results demonstrating significantly improved accuracy as a result.
Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning - knowledge acquisition;
Toward an architecture for never-ending language learning
- In AAAI, 2010
Cited by 36 (5 self)
We consider here the problem of building a never-ending language learner; that is, an intelligent computer agent that runs forever and that each day must (1) extract, or read, information from the web to populate a growing structured knowledge base, and (2) learn to perform this task better than on the previous day. In particular, we propose an approach and a set of design principles for such an agent, describe a partial implementation of such a system that has already learned to extract a knowledge base containing over 242,000 beliefs with an estimated precision of 74% after running for 67 days, and discuss lessons learned from this preliminary attempt to build a never-ending learning agent.
Experiments in Graph-based Semi-Supervised Learning Methods for Class-Instance Acquisition
Cited by 7 (0 self)
Graph-based semi-supervised learning (SSL) algorithms have been successfully used to extract class-instance pairs from large unstructured and structured text collections. However, a careful comparison of different graph-based SSL algorithms on that task has been lacking. We compare three graph-based SSL algorithms for class-instance acquisition on a variety of graphs constructed from different domains. We find that the recently proposed MAD algorithm is the most effective. We also show that class-instance extraction can be significantly improved by adding semantic information in the form of instance-attribute edges derived from an independently developed knowledge base. All of our code and data will be made publicly available to encourage reproducible research in this area.
Toward an Architecture for Never-Ending Language Learning
- In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10)
Materializing Multi-Relational Databases from the Web using Taxonomic Queries
Recently, much attention has been given to extracting tables from Web data. In this problem, the column definitions and tuples (such as what “company” is headquartered in what “city”) are extracted from Web text, structured Web data such as lists, or results of querying the deep Web, creating the table of interest. In this paper, we examine the problem of extracting and discovering multiple tables in a given domain, generating a truly multi-relational database as output. Beyond discovering the relations that define single tables, our approach discovers and leverages “within column” set membership relations, and discovers relations across the extracted tables (e.g., joins). By leveraging within-column relations our method can extract table instances that are ambiguous or rare, and by discovering joins, our method
Sequence Clustering and Labeling for Unsupervised Query Intent Discovery
One popular form of semantic search observed in several modern search engines is to recognize query patterns that trigger instant answers or domain-specific search, producing semantically enriched search results. This often requires understanding the query intent in addition to the meaning of the query terms in order to access structured data sources. A major challenge in intent understanding is to construct a domain-dependent schema and to annotate search queries based on such a schema, a process that to date has required much manual annotation effort. We present an unsupervised method for clustering queries with similar intent and for producing a pattern consisting of a sequence of semantic concepts and/or lexical items for each intent. Furthermore, we leverage the discovered intent patterns to automatically annotate a large number of queries beyond those used in clustering. We evaluated our method on 10 selected domains, discovering over 1400 intent patterns and automatically annotating 125K (and potentially many more) queries. We found that over 90% of patterns and 80% of instance annotations tested are judged to be correct by a majority of annotators.
A Proposal for the Evaluation of Adaptive Content Retrieval, Modification and Delivery
A key advantage of Adaptive Hypermedia Systems (AHS) is their ability to re-sequence and reintegrate content to satisfy a particular user’s need, context or requirements. However, this requires large volumes of content, with appropriate granularities and suitable meta-data descriptions, representing a major impediment to the mainstream adoption of Adaptive Hypermedia. Open corpus content is now widely available on the web; however, traditional information retrieval (IR) approaches are an inadequate means of incorporating these external content resources within AHS. This is due to the “one size fits all” content delivery paradigm offered by traditional IR. Slicing technology addresses these limitations by providing adaptive retrieval of open corpus resources, tailored to suit AHS-specific content requirements. This is achieved through the on-demand provision of tailored content called slices. This paper introduces slicing systems and details the objectives and challenges involved in the evaluation of such systems. A framework for the evaluation of slicing systems is presented along with a proposed experimental implementation.
Open Entity Extraction from Web Search Query Logs
In this paper we propose a completely unsupervised method for open-domain entity extraction and clustering over query logs. The underlying hypothesis is that classes defined by mining search user activity may significantly differ from those typically considered over web documents, in that they better model the user space, i.e. users’ perception and interests. We show that our method outperforms state of the art (semi-)supervised systems based either on web documents or on query logs (16% gain on the clustering task). We also report evidence that our method successfully supports
Ensemble Semantics for Large-scale Unsupervised Relation Extraction
Discovering significant types of relations from the web is challenging because of its open nature. Unsupervised algorithms are developed to extract relations from a corpus without knowing the relations in advance, but most of them rely on tagging arguments of predefined types. Recently, a new algorithm was proposed to jointly extract relations and their argument semantic classes, taking a set of relation instances extracted by an open IE algorithm as input. However, it cannot handle polysemy of relation phrases and fails to group many similar (“synonymous”) relation instances because of the sparseness of features. In this paper, we present a novel unsupervised algorithm that provides a more general treatment of the polysemy and synonymy problems. The algorithm incorporates various knowledge sources which we will show to be very effective for unsupervised extraction. Moreover, it explicitly disambiguates polysemous relation phrases and groups synonymous ones. While maintaining approximately the same precision, the algorithm achieves significant improvement on recall compared to the previous method. It is also very efficient. Experiments on a real-world dataset show that it can handle 14.7 million relation instances and extract a very large set of relations from the web.

