Google’s Deep-Web Crawl (2008)

by Jayant Madhavan, David Ko, Łucja Kot, Vignesh Ganapathy, Alex Rasmussen, Alon Halevy
Citations: 26 (3 self)

BibTeX

@MISC{Madhavan08google’sdeep-web,
    author = {Jayant Madhavan and David Ko and Łucja Kot and Vignesh Ganapathy and Alex Rasmussen and Alon Halevy},
    title = {Google’s Deep-Web Crawl},
    year = {2008}
}

Abstract

The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. The results of our surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content. Surfacing the Deep Web poses several challenges. First, our goal is to index the content behind many millions of HTML forms that span many languages and hundreds of domains. This necessitates an approach that is completely automatic, highly scalable, and very efficient. Second, a large number of forms have text inputs and require valid input values to be submitted. We present an algorithm for selecting input values for text search inputs that accept keywords and an algorithm for identifying inputs that accept only values of a specific type. Third, HTML forms often have more than one input, and hence a naive strategy of enumerating the entire Cartesian product of all possible inputs can result in a very large number of URLs being generated. We present an algorithm that efficiently navigates the search space of possible input combinations to identify only those that generate URLs suitable for inclusion in our web search index. We present an extensive experimental evaluation validating the effectiveness of our algorithms.
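
The third challenge in the abstract, the size of the Cartesian product of form inputs, can be illustrated with a small sketch. This is not the paper's algorithm: it only shows the naive enumeration that the abstract warns against, for a GET form. The form fields, their values, the example.com URL, and the function names are all hypothetical and chosen for illustration.

```python
from itertools import product
from math import prod
from urllib.parse import urlencode


def count_submissions(form_inputs):
    """Number of URLs a naive full enumeration would generate."""
    return prod(len(values) for values in form_inputs.values())


def enumerate_submissions(action_url, form_inputs):
    """Yield one GET URL per combination of input values (naive strategy)."""
    names = list(form_inputs)
    for combo in product(*(form_inputs[name] for name in names)):
        yield action_url + "?" + urlencode(dict(zip(names, combo)))


# Hypothetical used-car search form with three select inputs.
form = {
    "make": ["honda", "toyota", "ford"],
    "state": ["CA", "WA"],
    "max_price": ["5000", "10000"],
}

urls = list(enumerate_submissions("http://example.com/search", form))
print(count_submissions(form))  # 3 * 2 * 2 combinations
print(urls[0])
```

The count is the product of the per-input value counts, so a form with, say, five inputs of fifty values each would already yield hundreds of millions of URLs. That growth is why the abstract calls for selecting only a subset of input combinations rather than enumerating them all.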
