Results 1 - 10 of 15
Expressive and Flexible Access to Web-Extracted Data: A Keyword-based Structured Query Language
- In SIGMOD ’10: Proceedings of International Conference on Management of Data
, 2010
"... Automated extraction of structured data from Web sources often leads to large heterogeneous knowledge bases (KB), with data and schema items numbering in the hundreds of thousands or millions. Formulating information needs with conventional structured query languages is difficult due to the sheer si ..."
Abstract
-
Cited by 28 (1 self)
- Add to MetaCart
(Show Context)
Automated extraction of structured data from Web sources often leads to large heterogeneous knowledge bases (KB), with data and schema items numbering in the hundreds of thousands or millions. Formulating information needs with conventional structured query languages is difficult due to the sheer size of schema information available to the user. We address this challenge by proposing a new query language that blends keyword search with structured query processing over large information graphs with rich semantics.
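The blend of keyword search and structured querying that this abstract describes can be illustrated with a minimal sketch: keywords are matched against schema items of an information graph (here, relation names over a tiny triple store), and the match is then evaluated as a structured lookup. All entity and relation names below are invented for the example and are not from the paper.

```python
# Hypothetical keyword-to-schema matching over a tiny triple store; all
# entity and relation names are made up for illustration.
from difflib import SequenceMatcher

TRIPLES = [
    ("Einstein", "bornIn", "Ulm"),
    ("Einstein", "wonAward", "Nobel Prize in Physics"),
    ("Ulm", "locatedIn", "Germany"),
]

def best_relation(keyword, triples):
    """Pick the stored relation name closest to a user keyword."""
    relations = {p for _, p, _ in triples}
    return max(relations,
               key=lambda r: SequenceMatcher(None, keyword.lower(), r.lower()).ratio())

def keyword_query(subject_hint, relation_keyword, triples=TRIPLES):
    """Blend keyword matching (relation) with a structured lookup (subject)."""
    relation = best_relation(relation_keyword, triples)
    return [(s, p, o) for s, p, o in triples
            if p == relation and subject_hint.lower() in s.lower()]

print(keyword_query("einstein", "born"))   # [('Einstein', 'bornIn', 'Ulm')]
```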
From Information to Knowledge: Harvesting Entities and Relationships from Web Sources
"... There are major trends to advance the functionality of search engines to a more expressive semantic level. This is enabled by the advent of knowledge-sharing communities such as Wikipedia and the progress in automatically extracting entities and relationships from semistructured as well as natural-l ..."
Abstract
-
Cited by 26 (7 self)
- Add to MetaCart
(Show Context)
There are major trends to advance the functionality of search engines to a more expressive semantic level. This is enabled by the advent of knowledge-sharing communities such as Wikipedia and the progress in automatically extracting entities and relationships from semistructured as well as natural-language Web sources. Recent endeavors of this kind include DBpedia, EntityCube, KnowItAll, ReadTheWeb, and our own YAGO-NAGA project, among others. The goal is to automatically construct and maintain a comprehensive knowledge base of facts about named entities, their semantic classes, and their mutual relations, as well as temporal contexts, with high precision and high recall. This tutorial discusses state-of-the-art methods, research opportunities, and open challenges along this avenue of knowledge harvesting.
Optimizing SQL Queries over Text Databases
"... Abstract — Text documents often embed data that is structured in nature, and we can expose this structured data using information extraction technology. By processing a text database with information extraction systems, we can materialize a variety of structured “relations, ” over which we can then ..."
Abstract
-
Cited by 21 (8 self)
- Add to MetaCart
(Show Context)
Text documents often embed data that is structured in nature, and we can expose this structured data using information extraction technology. By processing a text database with information extraction systems, we can materialize a variety of structured “relations,” over which we can then issue regular SQL queries. A key challenge in processing SQL queries in this text-based scenario is efficiency: information extraction is time-consuming, so query processing strategies should minimize the number of documents that they process. Another key challenge is result quality: in the traditional relational world, all correct execution strategies for a SQL query produce the same (correct) result; in contrast, a SQL query execution over a text database might produce answers that are not fully accurate or complete, for a number of reasons. To address these challenges, we study a family of select-project-join SQL queries over text databases, and characterize query processing strategies on their efficiency and, critically, on their result quality as well. We optimize the execution of SQL queries over text databases in a principled, cost-based manner, incorporating this tradeoff between efficiency and result quality in a user-specific fashion. Our large-scale experiments, over real data sets and multiple information extraction systems, show that our SQL query processing approach consistently picks appropriate execution strategies for the desired balance between efficiency and result quality.
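A rough illustration of the efficiency/quality tradeoff the abstract describes: if the optimizer has time and quality estimates for each candidate execution strategy, it can pick the cheapest strategy that satisfies a user-specified quality requirement. The strategy names and numbers below are placeholders, not the paper's cost model.

```python
# Hypothetical strategies with (estimated seconds, estimated quality of the
# extracted relation); the values are invented for the sketch.
STRATEGIES = {
    "scan_all_documents":    (3600.0, 0.95),
    "keyword_filtered_scan": ( 600.0, 0.80),
    "query_based_sample":    (  60.0, 0.45),
}

def pick_strategy(min_quality, strategies=STRATEGIES):
    """Cheapest strategy whose estimated quality meets the user's threshold."""
    feasible = [(t, q, name) for name, (t, q) in strategies.items() if q >= min_quality]
    if not feasible:
        raise ValueError("no strategy satisfies the requested quality")
    return min(feasible)[2]          # smallest estimated time wins

print(pick_strategy(min_quality=0.75))   # -> 'keyword_filtered_scan'
```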
A quality-aware optimizer for information extraction
- ACM Transactions on Database Systems
"... A large amount of structured information is buried in unstructured text. Information extraction systems can extract structured relations from the documents and enable sophisticated, SQL-like queries over unstructured text. Information extraction systems are not perfect and their output has imperfect ..."
Abstract
-
Cited by 12 (6 self)
- Add to MetaCart
A large amount of structured information is buried in unstructured text. Information extraction systems can extract structured relations from the documents and enable sophisticated, SQL-like queries over unstructured text. Information extraction systems are not perfect and their output has imperfect precision and recall (i.e., contains spurious tuples and misses good tuples). Typically, an extraction system has a set of parameters that can be used as “knobs” to tune the system to be either precision- or recall-oriented. Furthermore, the choice of documents processed by the extraction system also affects the quality of the extracted relation. So far, estimating the output quality of an information extraction task has been an ad hoc procedure, based mainly on heuristics. In this article, we show how to use Receiver Operating Characteristic (ROC) curves to estimate the extraction quality in a statistically robust way and how to use ROC analysis to select the extraction parameters in a principled manner. Furthermore, we present analytic models that reveal how different document retrieval strategies affect the quality of the extracted relation. Finally, we present our maximum likelihood approach for estimating, on the fly, the parameters required by our analytic models to predict the runtime and the output quality of each execution plan. Our experimental evaluation demonstrates that our optimization approach accurately predicts the output quality and selects the fastest execution plan that satisfies the output quality restrictions.
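One way to picture ROC-based knob selection is to convert each knob setting's ROC point into an expected quality figure for the candidate tuple population and pick the best setting. The sketch below does this with expected F1; the knob names, ROC points, and population sizes are illustrative assumptions, not values or formulas from the article.

```python
# Illustrative knob settings with (false-positive rate over "bad" candidate
# tuples, true-positive rate over "good" candidate tuples); numbers invented.
KNOBS = {
    "precision_oriented": (0.02, 0.55),
    "balanced":           (0.10, 0.80),
    "recall_oriented":    (0.30, 0.95),
}

def expected_f1(fpr, tpr, n_good, n_bad):
    """Turn an ROC point into an expected F1 for the candidate population."""
    tp, fp = tpr * n_good, fpr * n_bad
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tpr
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def choose_knob(n_good=1000, n_bad=9000, knobs=KNOBS):
    """Pick the knob setting with the best expected F1."""
    return max(knobs, key=lambda k: expected_f1(*knobs[k], n_good, n_bad))

print(choose_knob())
```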
Join optimization of information extraction output: Quality matters
, 2008
"... Abstract — Information extraction (IE) systems are trained to extract specific relations from text databases. Real-world applications often require that the output of multiple IE systems be joined to produce the data of interest. To optimize the execution of a join of multiple extracted relations, i ..."
Abstract
-
Cited by 9 (3 self)
- Add to MetaCart
(Show Context)
Information extraction (IE) systems are trained to extract specific relations from text databases. Real-world applications often require that the output of multiple IE systems be joined to produce the data of interest. To optimize the execution of a join of multiple extracted relations, it is not sufficient to consider only execution time. In fact, the quality of the join output is of critical importance: unlike in the relational world, different join execution plans can produce join results of widely different quality whenever IE systems are involved. In this paper, we develop a principled approach to understand, estimate, and incorporate output quality into the join optimization process over extracted relations. We argue that the output quality is affected by (a) the configuration of the IE systems used to process documents, (b) the document retrieval strategies used to retrieve documents, and (c) the actual join algorithm used. Our analysis considers several alternatives for these factors, and predicts the output quality (and, of course, the execution time) of the alternate execution plans. We establish the accuracy of our analytical models, as well as study the effectiveness of a quality-aware join optimizer, with a large-scale experimental evaluation over real-world text collections and state-of-the-art IE systems.
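A toy rendering of the plan space the abstract outlines: each join plan is a combination of IE configuration, document retrieval strategy, and join algorithm, and the optimizer picks the fastest combination whose predicted output quality meets a constraint. All per-choice time and quality factors below are invented for the sketch.

```python
# Enumerate hypothetical plans along the three factors the abstract names;
# the (time factor, quality factor) pairs are placeholders, not estimates
# from the paper.
from itertools import product

IE_CONFIGS = {"precision": (1.0, 0.90), "recall": (1.5, 0.97)}
RETRIEVAL  = {"scan": (10.0, 1.00), "query_driven": (2.0, 0.85)}
JOIN_ALGOS = {"independent": (1.0, 0.80), "outer_join_probe": (1.8, 0.95)}

def enumerate_plans():
    for (c, (ct, cq)), (r, (rt, rq)), (j, (jt, jq)) in product(
            IE_CONFIGS.items(), RETRIEVAL.items(), JOIN_ALGOS.items()):
        yield {"plan": (c, r, j), "time": ct * rt * jt, "quality": cq * rq * jq}

def best_plan(min_quality=0.75):
    """Fastest plan whose predicted join-output quality meets the constraint."""
    feasible = [p for p in enumerate_plans() if p["quality"] >= min_quality]
    return min(feasible, key=lambda p: p["time"]) if feasible else None

print(best_plan())
```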
Exploring a few good tuples from text databases
- In ICDE
, 2009
"... Abstract — Information extraction from text databases is a useful paradigm to populate relational tables and unlock the considerable value hidden in plain-text documents. However, information extraction can be expensive, due to various complex text processing steps necessary in uncovering the hidden ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
(Show Context)
Information extraction from text databases is a useful paradigm to populate relational tables and unlock the considerable value hidden in plain-text documents. However, information extraction can be expensive, due to the various complex text processing steps necessary to uncover the hidden data. There are a large number of text databases available, and not every text database is necessarily relevant to every relation. Hence, it is important to be able to quickly explore the utility of running an extractor for a specific relation over a given text database before carrying out the expensive extraction task. In this paper, we present a novel exploration methodology for finding a few good tuples for a relation that can be extracted from a database, which allows for judging the relevance of the database for the relation. Specifically, we propose the notion of a good(k, ℓ) query as one that can return any k tuples for a relation among the top-ℓ fraction of tuples ranked by their aggregated confidence scores, as provided by the extractor; if these tuples have high scores, the database can be deemed relevant to the relation. We formalize the access model for information extraction, and investigate efficient query processing algorithms for good(k, ℓ) queries, which do not rely on any prior knowledge about the extraction task or the database. We demonstrate the viability of our algorithms using a detailed experimental study with real text databases.
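A naive reading of the good(k, ℓ) idea as a stopping rule, not the paper's algorithms: estimate the confidence threshold of the top-ℓ fraction from a small document sample, then extract until k tuples clear that threshold. The extractor interface and the sampling scheme are assumptions of the sketch.

```python
import random

def estimate_threshold(sample_scores, top_fraction):
    """Confidence score a tuple must reach to sit in the top-l fraction."""
    ordered = sorted(sample_scores, reverse=True)
    if not ordered:
        return 0.0
    cutoff = max(0, int(len(ordered) * top_fraction) - 1)
    return ordered[cutoff]

def good_k_l(extract, documents, k, top_fraction, sample_size=50):
    """Return up to k tuples believed to lie in the top-l fraction;
    extract(doc) is assumed to yield (tuple, confidence) pairs."""
    sample = random.sample(documents, min(sample_size, len(documents)))
    threshold = estimate_threshold(
        [score for doc in sample for _, score in extract(doc)], top_fraction)
    good = []
    for doc in documents:
        good.extend(t for t, score in extract(doc) if score >= threshold)
        if len(good) >= k:
            return good[:k]
    return good
```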
Building Query Optimizers for Information Extraction: The SQoUT Project
"... Text documents often embed data that is structured in nature. This structured data is increasingly exposed using information extraction systems, which generate structured relations from documents, introducing an opportunity to process expressive, structured queries over text databases. This paper di ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
(Show Context)
Text documents often embed data that is structured in nature. This structured data is increasingly exposed using information extraction systems, which generate structured relations from documents, introducing an opportunity to process expressive, structured queries over text databases. This paper discusses our SQoUT project, which focuses on processing structured queries over relations extracted from text databases. We show how, in our extraction-based scenario, query processing can be decomposed into a sequence of basic steps: retrieving relevant text documents, extracting relations from the documents, and joining extracted relations for queries involving multiple relations. Each of these steps presents different alternatives, and together they form a rich space of possible query execution strategies. We identify execution efficiency and output quality as the two critical properties of a query execution, and argue that an optimization approach needs to consider both properties. To this end, we take into account the user-specified requirements for execution efficiency and output quality, and choose an execution strategy for each query based on a principled, cost-based comparison of the alternative execution strategies.
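The three-step decomposition described above (retrieve, extract, join) can be sketched as a plain pipeline; the keyword filter, regular-expression extractor, and join key below are placeholders rather than SQoUT internals.

```python
import re

def retrieve(documents, keyword):
    """Step 1: keep only documents likely to mention the target relation."""
    return [doc for doc in documents if keyword in doc.lower()]

def extract(documents, pattern):
    """Step 2: turn matching text spans into tuples via a regex with groups."""
    return [match.groups() for doc in documents for match in re.finditer(pattern, doc)]

def join(relation_a, relation_b):
    """Step 3: join two extracted relations on their first attribute."""
    index = {}
    for key, value in relation_b:
        index.setdefault(key, []).append(value)
    return [(key, a, b) for key, a in relation_a for b in index.get(key, [])]
```

Each step admits alternatives (different retrieval strategies, extractor settings, join algorithms), which is what produces the plan space the paragraph refers to.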
Crawling Deep Web Using a New Set Covering Algorithm
"... Abstract. Crawling the deep web often requires the selection of an appropriate set of queries so that they can cover most of the documents in the data source with low cost. This can be modeled as a set covering problem which has been extensively studied. The conventional set covering algorithms, how ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
(Show Context)
Crawling the deep web often requires the selection of an appropriate set of queries so that they cover most of the documents in the data source at low cost. This can be modeled as a set covering problem, which has been extensively studied. Conventional set covering algorithms, however, do not work well when applied to deep web crawling due to various special features of this application domain. Typically, most set covering algorithms assume a uniform distribution of the elements being covered, while for deep web crawling neither the sizes of documents nor the document frequencies of the queries are distributed uniformly; instead, they follow power-law distributions. Hence, we have developed a new set covering algorithm that targets deep web crawling. Compared to our previous deep web crawling method, which uses a straightforward greedy set covering algorithm, it introduces weights into the greedy strategy. Our experiments, carried out on a variety of corpora, show that this new method consistently outperforms its unweighted version.
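The weighted greedy strategy mentioned above can be sketched as follows: at each step, pick the query with the largest ratio of newly covered documents to query cost. The query-to-document mapping and the cost model are illustrative assumptions, not the paper's exact weighting.

```python
def weighted_greedy_cover(query_hits, query_cost, universe):
    """query_hits: query -> set of document ids it returns;
    query_cost: query -> cost of issuing the query;
    universe:   set of all document ids to cover."""
    covered, plan = set(), []
    remaining = dict(query_hits)
    while covered != universe and remaining:
        best = max(remaining, key=lambda q: len(remaining[q] - covered) / query_cost[q])
        if not remaining[best] - covered:
            break                               # no query adds new documents
        plan.append(best)
        covered |= remaining.pop(best)
    return plan, covered

# Toy example: q2 wins first on gain-per-cost, then q3, then q1.
hits = {"q1": {1, 2, 3}, "q2": {3, 4}, "q3": {4, 5, 6, 7}}
cost = {"q1": 3, "q2": 1, "q3": 4}
print(weighted_greedy_cover(hits, cost, {1, 2, 3, 4, 5, 6, 7}))
```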
Rank-Aware Crawling of Hidden Web sites
"... An ever-increasing amount of valuable information on the Web today is stored inside online databases and is accessible only after the users issue a query through a search interface. Such information is collectively called the“Hidden Web”and is mostly inaccessible by traditional search engine crawler ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
An ever-increasing amount of valuable information on the Web today is stored inside online databases and is accessible only after users issue a query through a search interface. Such information is collectively called the “Hidden Web” and is mostly inaccessible to traditional search engine crawlers that scout the Web by following links. Since the only way to access Hidden Web pages is through the submission of queries to Hidden Web sites, previous work [14, 18] has focused on how to automatically generate queries in order to incrementally retrieve and cover a Hidden Web site in depth, as much as possible. For certain applications, however, it is not necessary to have crawled a Hidden Web site in depth. For example, a metasearcher or a content aggregator will utilize only the top portion of the ranked result lists coming from querying a Hidden Web site instead of its full content. Hence, if we can crawl a Hidden Web site in breadth, i.e., download just the top results for all potential queries, we can enable such applications without the need to allocate resources for fully crawling a potentially huge Hidden Web site. In this paper we present algorithms for crawling a Hidden Web site by taking the ranking of the results into account. Since we do not know in advance all potential queries that may be directed to the Web site, we study how to approximate the site’s ranking function so that we can compute the top results based on the data collected so far. We provide a framework for performing ranking-aware Hidden Web crawling and we show experimental results on a real Web site demonstrating the performance of our methods.
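A toy version of the last idea, approximating a site's ranking function from result lists already observed: give each document a score based on the rank positions it has occupied so far, then predict the top results of a new query from the local copy. The reciprocal-rank scoring is an assumption of this sketch, not the paper's estimator.

```python
from collections import defaultdict

def fit_document_scores(observed_results):
    """observed_results: query -> ranked list of document ids (best first)."""
    score_sum, appearances = defaultdict(float), defaultdict(int)
    for ranking in observed_results.values():
        for position, doc in enumerate(ranking, start=1):
            score_sum[doc] += 1.0 / position        # reciprocal-rank evidence
            appearances[doc] += 1
    return {doc: score_sum[doc] / appearances[doc] for doc in score_sum}

def predicted_top_k(matching_docs, doc_scores, k=10):
    """Rank locally known matches for a new query by the learned scores."""
    return sorted(matching_docs, key=lambda d: doc_scores.get(d, 0.0), reverse=True)[:k]
```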