Abstract:
One bottleneck in implementing a system that intelligently queries the Web is developing "wrappers" (programs that extract data from Web pages). Here we describe a method for learning general, page-independent heuristics for extracting data from HTML documents. The input to our learning system is a set of working wrapper programs, paired with HTML pages they correctly wrap. The output is a general procedure for extracting data that works for many formats and many pages. In experiments with a collection of 84 constrained but realistic extraction problems, we demonstrate that 30% of the problems can be handled perfectly by learned extraction heuristics, and around 50% can be handled acceptably. We also demonstrate that learned page-independent extraction heuristics can substantially improve the performance of methods for learning page-specific wrappers.

Keywords: information integration, machine learning, extraction.

1 Introduction
A number of recent systems operate by taking informatio...
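
To make the abstract's distinction between a page-specific wrapper and a page-independent extraction heuristic concrete, here is a minimal Python sketch. It is not the paper's method: nothing is learned here, and the heuristic ("return the text under the most frequently repeated tag path") is a hand-coded stand-in chosen only for illustration. All names (`TextByPath`, `text_by_path`, `page_specific_wrapper`, `page_independent_heuristic`) are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TextByPath(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    # Tags that never have a closing tag, so they are not pushed on the stack.
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def text_by_path(html):
    parser = TextByPath()
    parser.feed(html)
    parser.close()
    return parser.items


def page_specific_wrapper(html):
    """Hand-written wrapper: only works for pages laid out as ul/li/a."""
    return [t for path, t in text_by_path(html) if path.endswith("ul/li/a")]


def page_independent_heuristic(html):
    """Illustrative stand-in (assumption, not learned): return the text
    found under the most frequently repeated tag path on the page."""
    items = text_by_path(html)
    if not items:
        return []
    best_path, _ = Counter(path for path, _ in items).most_common(1)[0]
    return [t for path, t in items if path == best_path]
```

On a page that lists names as links inside `ul/li`, both functions return the same items. On a page that presents the same kind of list as table rows, the page-specific wrapper returns nothing, while the heuristic still finds the repeated structure. The abstract's system differs in that it derives its extraction procedure from examples of working wrappers rather than from a fixed rule like this one.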