A relational approach to incrementally extracting and querying structure in unstructured data (0)

by E Chu, A Baid, T Chen, A Doan, J Naughton

Results 1 - 9 of 9

On the Provenance of Non-Answers to Queries over Extracted Data

by Jiansheng Huang, Ting Chen, Anhai Doan, Jeffrey F. Naughton
Cited by 19 (1 self)
In information extraction, uncertainty is ubiquitous. For this reason, it is useful to provide users querying extracted data with explanations for the answers they receive. Providing the provenance for tuples in a query result partially addresses this problem, in that provenance can explain why a tuple is in the result of a query. However, in some cases explaining why a tuple is not in the result may be just as helpful. In this work we focus on providing provenance-style explanations for non-answers and develop a mechanism for providing this new type of provenance. Our experience with an information extraction prototype suggests that our approach can provide effective provenance information that can help a user resolve their doubts over non-answers to a query.

Optimizing SQL Queries over Text Databases

by Alpa Jain, Anhai Doan, Luis Gravano
Cited by 11 (5 self)
Text documents often embed data that is structured in nature, and we can expose this structured data using information extraction technology. By processing a text database with information extraction systems, we can materialize a variety of structured “relations,” over which we can then issue regular SQL queries. A key challenge in processing SQL queries in this text-based scenario is efficiency: information extraction is time-consuming, so query processing strategies should minimize the number of documents that they process. Another key challenge is result quality: in the traditional relational world, all correct execution strategies for a SQL query produce the same (correct) result; in contrast, a SQL query execution over a text database might produce answers that are not fully accurate or complete, for a number of reasons. To address these challenges, we study a family of select-project-join SQL queries over text databases, and characterize query processing strategies on their efficiency and, critically, on their result quality as well. We optimize the execution of SQL queries over text databases in a principled, cost-based manner, incorporating this tradeoff between efficiency and result quality in a user-specific fashion. Our large-scale experiments, over real data sets and multiple information extraction systems, show that our SQL query processing approach consistently picks appropriate execution strategies for the desired balance between efficiency and result quality.

From Information to Knowledge: Harvesting Entities and Relationships from Web Sources

by Gerhard Weikum, Martin Theobald
Cited by 7 (4 self)
There are major trends to advance the functionality of search engines to a more expressive semantic level. This is enabled by the advent of knowledge-sharing communities such as Wikipedia and the progress in automatically extracting entities and relationships from semistructured as well as natural-language Web sources. Recent endeavors of this kind include DBpedia, EntityCube, KnowItAll, ReadTheWeb, and our own YAGO-NAGA project (and others). The goal is to automatically construct and maintain a comprehensive knowledge base of facts about named entities, their semantic classes, and their mutual relations as well as temporal contexts, with high precision and high recall. This tutorial discusses state-of-the-art methods, research opportunities, and open challenges along this avenue of knowledge harvesting.

FOCIH: Form-based ontology creation and information harvesting

by Cui Tao, David W. Embley, Stephen W. Liddle, 2009
Cited by 5 (4 self)
Creating an ontology and populating it with data are both labor-intensive tasks requiring a high degree of expertise. Thus, scaling ontology creation and population to the size of the web in an effort to create a web of data (which some see as Web 3.0) is prohibitive. Can we find ways to streamline these tasks and lower the barrier enough to enable Web 3.0? Toward this end we offer a form-based approach to ontology creation that provides a way to create Web 3.0 ontologies without the need for specialized training. We also offer a way to semi-automatically harvest data from the current web of pages for a Web 3.0 ontology. In addition to harvesting information with respect to an ontology, the approach also annotates web pages and links facts in web pages to ontological concepts, resulting in a web of data superimposed over the web of pages. Experience with our prototype system shows that mappings between conceptual-model-based ontologies and forms are sufficient for creating the kind of ontologies needed for Web 3.0, and experiments with our prototype system show that automatic harvesting, automatic annotation, and automatic superimposition of a web of data over a web of pages work well.

Information Extraction Challenges in Managing Unstructured Data

by Anhai Doan, Jeffrey F. Naughton, Raghu Ramakrishnan, Akanksha Baid, Xiaoyong Chai, Fei Chen, Ting Chen, Eric Chu, Pedro Derose
Cited by 4 (0 self)
Over the past few years, we have been trying to build an end-to-end system at Wisconsin to manage unstructured data, using extraction, integration, and user interaction. This paper describes the key information extraction (IE) challenges that we have run into, and sketches our solutions. We discuss in particular developing a declarative IE language, optimizing for this language, generating IE provenance, incorporating user feedback into the IE process, developing a novel wiki-based user interface for feedback, best-effort IE, pushing IE into RDBMSs, and more. Our work suggests that IE in managing unstructured data can open up many interesting research challenges, and that these challenges can greatly benefit from the wealth of work on managing structured data that has been carried out by the database community.

A First Tutorial on Dataspaces

by Michael Franklin et al., 2008
Cited by 1 (0 self)
Dataspace systems offer services on data without requiring upfront semantic integration. In sharp contrast with existing information-integration systems, dataspaces systems offer best-effort answers even before semantic mappings are

Optimizing Complex Extraction Programs over Evolving Text

by Fei Chen, Byron J. Gao, Anhai Doan, Jun Yang, Raghu Ramakrishnan
Most information extraction (IE) approaches have considered only static text corpora, over which we apply IE only once. Many real-world text corpora, however, are dynamic. They evolve over time, and so to keep extracted information up to date we often must apply IE repeatedly, to consecutive corpus snapshots. Applying IE from scratch to each snapshot can take a lot of time. To avoid doing this, we have recently developed Cyclex, a system that recycles previous IE results to speed up IE over subsequent corpus snapshots. Cyclex clearly demonstrated the promise of the recycling idea. The work itself, however, is limited in that it considers only IE programs that contain a single IE “blackbox.” In practice, many IE programs are far more complex, containing multiple IE blackboxes connected in a compositional “workflow.” In this paper, we present Delex, a system that removes the above limitation. First we identify many difficult challenges raised by Delex, including modeling complex IE programs for recycling purposes, implementing the recycling process efficiently, and searching for an optimal execution plan in a vast plan space with different recycling alternatives. Next we describe our solutions to these challenges. Finally, we describe extensive experiments with both rule-based and learning-based IE programs over two real-world data sets, which demonstrate the utility of our approach.
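
The recycling idea in this abstract can be illustrated with a deliberately simplified sketch. It is not the Cyclex or Delex algorithm: it assumes a single extractor treated as a black box and reuses results only when a page's text is byte-for-byte unchanged between snapshots. The names `extract_snapshot`, `pages`, and `cache` are hypothetical and chosen only for this illustration.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# An extractor is treated as an opaque function from page text to mentions.
Extractor = Callable[[str], List[str]]


def extract_snapshot(
    pages: Dict[str, str],
    extractor: Extractor,
    cache: Dict[str, List[str]],
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]], int]:
    """Run IE over one corpus snapshot, reusing results for unchanged text.

    `cache` maps a content hash to the mentions extracted from that content
    in the previous snapshot. Returns (results per page id, cache for the
    next snapshot, number of pages whose results were recycled).
    """
    results: Dict[str, List[str]] = {}
    next_cache: Dict[str, List[str]] = {}
    reused = 0
    for page_id, text in pages.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in cache:
            # Text is identical to a page seen in the previous snapshot:
            # recycle the old result instead of calling the extractor.
            mentions = cache[digest]
            reused += 1
        else:
            mentions = extractor(text)
        next_cache[digest] = mentions
        results[page_id] = mentions
    return results, next_cache, reused
```

Under these assumptions, a page that changes by even one character is re-extracted in full. The abstract describes a much larger design space (multiple blackboxes in a workflow and a search over recycling plans) that this sketch does not attempt to cover.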

Entity-Relationship Queries over Wikipedia

by Xiaonan Li, Chengkai Li, Cong Yu
Wikipedia is the largest user-generated knowledge base. We propose a structured query mechanism, entity-relationship query, for searching entities in the Wikipedia corpus by their properties and inter-relationships. An entity-relationship query consists of an arbitrary number of predicates on desired entities. The semantics of each predicate is specified with keywords. Entity-relationship query searches entities directly over text rather than pre-extracted structured data stores. This characteristic brings two benefits: (1) query semantics can be intuitively expressed by keywords; (2) it avoids the information loss that happens during extraction. We present a ranking framework for general entity-relationship queries and a position-based Bounded Cumulative Model for accurate ranking of query answers. Experiments on INEX benchmark queries and our own crafted queries show the effectiveness and accuracy of our ranking method.

FactCrawl: A Fact Retrieval Framework for Full-Text Indices

by Christoph Boden, Alexander Löser, Christoph Nagel, Stephan Pieper
We present FactCrawl, a framework for retrieving structured, factual information leveraging the full-text index of a search engine. The framework applies an approximation algorithm to solve the problem of retrieving all facts in a document collection using a minimal set of keywords while minimizing cost. The search engine is queried with automatically generated keywords, the results are re-ranked according to our fact score, and documents are forwarded to a fact extractor. Keywords are determined using structural, syntactic, lexical, and semantic information from sample documents. We estimate the fact score of a document by combining the observations of keywords in the document. We report results of an experimental evaluation over 20 fact extractors on a Reuters NIST corpus with 731,752 pages. Our experiments demonstrate that FactCrawl more than doubles recall in an online query scenario and nearly halves processing costs in an archive scenario, compared to existing approaches.