• Documents
  • Authors
  • Tables
  • Log in
  • Sign up
  • MetaCart
  • DMCA
  • Donate

CiteSeerX logo

Advanced Search Include Citations
Advanced Search Include Citations

DMCA

ABSTRACT To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks

Cached

  • Download as a PDF

Download Links

  • [www.cs.columbia.edu]
  • [www1.cs.columbia.edu]
  • [www.mathcs.emory.edu]
  • [www.stern.nyu.edu]
  • [www.cs.columbia.edu]
  • [www.mathcs.emory.edu]
  • [www.ipeirotis.com]

  • Save to List
  • Add to Collection
  • Correct Errors
  • Monitor Changes
by Panagiotis G. Ipeirotis , Pranay Jain
  • Summary
  • Citations
  • Active Bibliography
  • Co-citation
  • Clustered Documents
  • Version History

BibTeX

@MISC{Ipeirotis_abstractto,
    author = {Panagiotis G. Ipeirotis and Pranay Jain},
    title = {ABSTRACT To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks},
    year = {}
}

Share

Facebook Twitter Reddit Bibsonomy

OpenURL

 

Abstract

Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or “crawl, ” the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output “completeness” (e.g., in terms of recall). Nevertheless, this choice is typically ad-hoc and based on heuristics or plain intuition. In this paper, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model. Overall, our approach helps predict the most appropriate execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets. 1.

Keyphrases

text-centric task    query optimizer    execution plan    text database    cost model    execution time    output completeness    specific topic    crawl-based plan    many important application    fundamental task-specific property    notable example    associated parameter    cost-based way    present efficient technique    unstructured text    textual data    important text-centric task    significant efficiency    appropriate execution plan    plain intuition    large-scale experimental evaluation    rigorous cost model    fundamental building block    query-based execution plan    information extraction application    multiple real-life data set    output completeness benefit    substantial impact    random-graph theory    search engine index    task-specific way    general paradigm   

Powered by: Apache Solr
  • About CiteSeerX
  • Submit and Index Documents
  • Privacy Policy
  • Help
  • Data
  • Source
  • Contact Us

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University