Learning Page-Independent Heuristics for Extracting Data from Web Pages (1999) [32 citations — 4 self]

by William W. Cohen, Wei Fan
In AAAI Spring Symposium on Intelligent Agents in Cyberspace

Abstract:

One bottleneck in implementing a system that intelligently queries the Web is developing "wrappers": programs that extract data from Web pages. Here we describe a method for learning general, page-independent heuristics for extracting data from HTML documents. The input to our learning system is a set of working wrapper programs, paired with HTML pages they correctly wrap. The output is a general procedure for extracting data that works for many formats and many pages. In experiments with a collection of 84 constrained but realistic extraction problems, we demonstrate that 30% of the problems can be handled perfectly by learned extraction heuristics, and around 50% can be handled acceptably. We also demonstrate that learned page-independent extraction heuristics can substantially improve the performance of methods for learning page-specific wrappers.

Keywords: information integration, machine learning, extraction.

1 Introduction

A number of recent systems operate by taking informatio...

Citations

3363 C4.5: Programs for Machine Learning – Quinlan - 1992
654 Fast effective rule induction – Cohen - 1995
595 Querying Heterogeneous Information Sources Using Source Descriptions – Levy, Rajaraman, et al. - 1996
408 Wrapper Induction for Information Extraction – Kushmerick, Weld, et al. - 1997
323 The TSIMMIS approach to mediation: Data models and languages – Garcia-Molina, Papakonstantinou, et al. - 1997
184 Infomaster: An information integration system – Genesereth, Keller, et al. - 1997
175 Integration of heterogeneous databases without common domains using queries based on textual similarity – Cohen - 1998
165 Wrapper Generation for Semi-structured Internet Sources – Ashish, Knoblock - 1997
156 Extracting semistructured information from the Web – Hammer, Garcia-Molina, et al. - 1997
107 Modeling web sources for information integration – Knoblock, Minton, et al. - 1998
54 A web-based information system that reasons with structured collections of text – Cohen - 1998
47 Wrapper induction for semi-structured, web-based information sources – Muslea, Minton, et al. - 1998
38 The Araneus web-base management system – Mecca, Atzeni, et al. - 1998
32 Initial Results on Wrapping Semistructured Web Pages with Finite-State Transducers and Contextual Rules – Hsu - 1998
31 Recognizing structure in web pages using similarity queries – Cohen
19 Customizing Information Capture and Access – Rus, Subramanian - 1995
19 The distributed information search component (DISCO) and the world wide web – Tomasic, Amouroux, et al. - 1997
12 Learning with set-valued features – Cohen - 1996
7 User-oriented smart-cache for the web: what you seek is what you get – Lacroix, Sahuguet, et al. - 1998