Optimizing SQL Queries over Text Databases
Cited by 11 (5 self)
Text documents often embed data that is structured in nature, and we can expose this structured data using information extraction technology. By processing a text database with information extraction systems, we can materialize a variety of structured “relations,” over which we can then issue regular SQL queries. A key challenge in processing SQL queries in this text-based scenario is efficiency: information extraction is time-consuming, so query processing strategies should minimize the number of documents that they process. Another key challenge is result quality: in the traditional relational world, all correct execution strategies for a SQL query produce the same (correct) result; in contrast, a SQL query execution over a text database might produce answers that are not fully accurate or complete, for a number of reasons. To address these challenges, we study a family of select-project-join SQL queries over text databases, and characterize query processing strategies on their efficiency and, critically, on their result quality as well. We optimize the execution of SQL queries over text databases in a principled, cost-based manner, incorporating this tradeoff between efficiency and result quality in a user-specific fashion. Our large-scale experiments, over real data sets and multiple information extraction systems, show that our SQL query processing approach consistently picks appropriate execution strategies for the desired balance between efficiency and result quality.
Ontology Learning from Text: An Overview
In Buitelaar, P., Cimiano, P., Magnini, B. (Eds.), Ontology Learning from Text: Methods, Applications and Evaluation, 2005
Retrieving Answers from Frequently Asked Questions Pages on the Web
In CIKM ’05: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, 2005
Cited by 9 (2 self)
We address the task of answering natural language questions by using the large number of Frequently Asked Questions (FAQ) pages available on the web. The task involves three steps: (1) fetching FAQ pages from the web; (2) automatic extraction of question/answer (Q/A) pairs from the collected pages; and (3) answering users’ questions by retrieving appropriate Q/A pairs. We discuss our solutions for each of the three tasks, and give detailed evaluation results on a collected corpus of about 3.6Gb of text data (293K pages, 2.8M Q/A pairs), with real users’ questions sampled from a web search engine log. Specifically, we propose simple but effective methods for Q/A extraction and investigate task-specific retrieval models for answering questions. Our best model finds answers for 36% of the test questions in the top 20 results. Our overall conclusion is that FAQ pages on the web provide an excellent resource for addressing real users’ information needs in a highly focused manner.
Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization
Cited by 9 (2 self)
Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been efforts to augment these basic features with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words used in simple supervised learning. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses two significant problems in natural language processing—synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project, the largest existing Web directory built by over 70,000 human editors. Experimental results over a range of datasets confirm improved performance compared to the bag of words document representation.
Soft Pattern Matching Models for Definitional Question Answering
Cited by 6 (0 self)
We explore probabilistic lexico-syntactic pattern matching, also known as soft pattern matching, in a definitional question answering system. Most current systems use regular expression-based hard matching patterns to identify definition sentences. Such rigid surface matching often fares poorly when faced with language variations. We propose two soft matching models to address this problem: one based on bigrams and the other on the Profile Hidden Markov Model (PHMM). Both models provide a theoretically sound method to model pattern matching as a probabilistic process that generates token sequences. We demonstrate the effectiveness of the models on definition sentence retrieval for definitional question answering. We show that both models significantly outperform the state-of-the-art manually constructed hard matching patterns on recent TREC data. A critical difference between the two models is that the PHMM has a more complex topology. We experimentally show that the PHMM can handle language variations more effectively but requires more training data to converge. While we evaluate soft pattern models only on definitional question answering, we believe that both models are generic and can be extended to other areas where lexico-syntactic pattern matching can be applied.
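The bigram idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' model: the class name `BigramSoftPattern`, the add-alpha smoothing, and the length-normalized score are all assumptions chosen for illustration. It only shows how a pattern learned from example token sequences can assign a graded score to an unseen variation instead of rejecting it outright, as a hard regular-expression pattern would.

```python
import math
from collections import Counter


class BigramSoftPattern:
    """Hypothetical bigram model over token sequences (illustrative only)."""

    def __init__(self, training_sequences, alpha=1.0):
        self.alpha = alpha              # add-alpha smoothing constant (assumed)
        self.context_counts = Counter() # how often each token appears as a left context
        self.bigram_counts = Counter()  # how often each (previous, current) pair appears
        self.vocab = {"<s>", "</s>"}
        for seq in training_sequences:
            padded = ["<s>"] + list(seq) + ["</s>"]
            self.vocab.update(padded)
            for prev, cur in zip(padded, padded[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def score(self, seq):
        """Average smoothed log-probability per transition; higher is a better match."""
        padded = ["<s>"] + list(seq) + ["</s>"]
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen tokens
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            num = self.bigram_counts[(prev, cur)] + self.alpha
            den = self.context_counts[prev] + self.alpha * vocab_size
            total += math.log(num / den)
        return total / (len(padded) - 1)


# Toy definition patterns; "<TERM>" and "NN" are placeholder tokens.
training = [
    ["<TERM>", "is", "a", "NN"],
    ["<TERM>", ",", "a", "NN"],
    ["<TERM>", "is", "the", "NN"],
]
model = BigramSoftPattern(training)
seen = model.score(["<TERM>", "is", "a", "NN"])
variation = model.score(["<TERM>", "was", "a", "NN"])
scrambled = model.score(["NN", "a", "is", "<TERM>"])
```

In this toy setup the unseen variation with "was" still receives a finite score that falls between the exact training pattern and the scrambled sequence, which is the behavior soft matching is meant to provide.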
Extracting Instances of Relations from Web Documents using Redundancy
In Proceedings of the Third European Semantic Web Conference, 2006
Cited by 6 (1 self)
In this document we describe our approach to a specific subtask of ontology population, the extraction of instances of relations. We present a generic approach with which we are able to extract information from documents on the Web. The method exploits redundancy of information to compensate for loss of precision caused by the use of domain-independent extraction methods. In this paper, we present the general approach and describe our implementation for a specific relation instance extraction task in the art domain. For this task, we describe experiments, discuss evaluation measures and present the results.
Halevy: Web-scale extraction of structured data
SIGMOD Record, 2008
Cited by 5 (0 self)
A long-standing goal of Web research has been to construct a unified Web knowledge base. Information extraction techniques have shown good results on Web inputs, but even most domain-independent ones are not appropriate for Web-scale operation. In this paper we describe three recent extraction systems that can be operated on the entire Web (two of which come from Google Research). The TextRunner system focuses on raw natural language text, the WebTables system focuses on HTML-embedded tables, and the deep-web surfacing system focuses on “hidden” databases. The domain, expressiveness, and accuracy of extracted data can depend strongly on its source extractor; we describe differences in the characteristics of data produced by the three extractors. Finally, we discuss a series of unique data applications (some of which have already been prototyped) that are enabled by aggregating extracted Web information.
Ontologies on Demand? A Description of the State-of-the-Art, Applications, Challenges and Trends for Ontology Learning from Text
2006
Cited by 5 (1 self)
Ontologies are nowadays used for many applications requiring data, services and resources in general to be interoperable and machine understandable. Such applications are for example web service discovery and composition, information integration across databases, intelligent search, etc. The general idea is that data and services are semantically described with respect to ontologies, which are formal specifications of a domain of interest, and can thus be shared and reused in a way such that the shared meaning specified by the ontology remains formally the same across different parties and applications. As the cost of creating ontologies is relatively high, different proposals have emerged for learning ontologies from structured and unstructured resources. In this article we examine the maturity of techniques for ontology learning from textual resources, addressing the question whether the state-of-the-art is mature enough to produce ontologies ‘on demand’.
TOB: Timely Ontologies for Business Relations
Cited by 5 (3 self)
In this paper we present a suite of methods for extracting temporal relations from semi-structured and textual Web sources. We particularly address the needs of building and maintaining business ontologies, where the time aspects of relations between companies, between companies and products, and between companies and customers are important. For example, the date on which a company acquired another company or when a new CEO took over is crucial information for business-intelligence applications. Our methods are geared for extracting business relations and their time information from three kinds of sources: Wikipedia infoboxes, Reuters news feeds, and news pages provided by Google. All techniques are integrated into the TOB framework for timely business ontologies. Our experiments show that we can achieve fairly high precision for the extracted information.
Searching for commonsense
2006
Cited by 5 (0 self)
Acquiring and representing the large body of “common sense” knowledge underlying ordinary human reasoning and communication is a long-standing problem in the field of artificial intelligence. This thesis will address the question of whether a significant quantity of this knowledge may be acquired by mining natural language content on the Web. Specifically, this thesis emphasizes the representation of knowledge in the form of binary semantic relationships, such as cause, effect, intent, and time, among natural language phrases.
The central hypothesis is that seed knowledge collected from volunteers enables automated acquisition of this knowledge from a large, unannotated, general corpus like the Web. A text mining system, ConceptMiner, was developed to evaluate this hypothesis. ConceptMiner leverages web search engines, Information Extraction techniques and the ConceptNet toolkit to analyze Web content for textual evidence indicating common sense relationships.
Experiments are reported for three semantic relation classes: desire, effect, and capability. A Pointwise Mutual Information measure computed from Web hit counts is demonstrated to filter general common sense from instance knowledge true only in specific circumstances. A semantic distance metric is introduced which significantly reduces negative instances from the extracted hypotheses.
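As a rough illustration of the Pointwise Mutual Information measure mentioned above, the sketch below computes the standard PMI formula, log2(P(x, y) / (P(x) P(y))), from hit counts. It is not taken from the thesis: the function names `pmi_from_hits` and `filter_by_pmi`, the `total_pages` estimate of the index size, the dictionary layout of a candidate, and the threshold are all assumptions made for the example.

```python
import math


def pmi_from_hits(hits_xy, hits_x, hits_y, total_pages):
    """Standard PMI, log2(P(x, y) / (P(x) * P(y))), estimated from hit counts.

    total_pages is an assumed estimate of the number of indexed pages.
    Returns -inf when any count is zero, since PMI is undefined there.
    """
    if min(hits_xy, hits_x, hits_y) <= 0:
        return float("-inf")
    p_xy = hits_xy / total_pages
    p_x = hits_x / total_pages
    p_y = hits_y / total_pages
    return math.log2(p_xy / (p_x * p_y))


def filter_by_pmi(candidates, total_pages, threshold):
    """Keep candidate phrase pairs whose PMI reaches a (hypothetical) threshold."""
    return [
        c for c in candidates
        if pmi_from_hits(c["hits_xy"], c["hits_x"], c["hits_y"], total_pages) >= threshold
    ]


# Made-up counts purely for illustration.
candidates = [
    {"pair": ("rain", "wet ground"), "hits_xy": 100, "hits_x": 1000, "hits_y": 1000},
    {"pair": ("rain", "tax form"), "hits_xy": 1, "hits_x": 1000, "hits_y": 1000},
]
kept = filter_by_pmi(candidates, total_pages=1_000_000, threshold=3.0)
```

With these made-up counts the first pair co-occurs about 100 times more often than chance would predict and is kept, while the second co-occurs at roughly chance level and is filtered out.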
The results confirm that significant relational common sense knowledge exists on the Web and provide evidence that the algorithms employed by ConceptMiner can extract this knowledge with a precision approaching that provided by human subjects.

