The Internet presents numerous sources of useful information, such as telephone directories, product catalogs, stock quotes, and weather forecasts. Recently, many systems have been built that automatically gather and manipulate such information on a user's behalf. However, these resources are usually formatted for use by people (e.g., the relevant content is embedded in HTML pages), so extracting their content is difficult. Wrappers are often used for this purpose. A wrapper is a procedure for extracting a particular resource's content. Unfortunately, hand-coding wrappers is tedious. We introduce wrapper induction, a technique for automatically constructing wrappers. Our techniques can be described in terms of three main contributions. First, we pose the problem of wrapper construction as one of inductive learn...
1328 | A theory of the learnable – Valiant - 1984
525 | Learnability and the Vapnik-Chervonenkis dimension – Blumer, Ehrenfeucht, et al. - 1989
422 | An Introduction to Computational Learning Theory – Kearns, Vazirani - 1994
408 | The TSIMMIS project: Integration of heterogeneous information sources – Chawathe, Garcia-Molina, et al.
365 | Learning Regular Sets from Queries and Counterexamples – Angluin - 1987
272 | A Scalable Comparison-Shopping Agent for the World Wide Web – Doorenbos, Etzioni, et al. - 1997
265 | A survey of inductive inference: Theory and methods – Smith - 1983
261 | A softbot-based interface to the Internet – Etzioni, Weld - 1994
200 | Introduction to Software Agents – Bradshaw - 1997
184 | Middle-agents for the internet – Decker, Sycara, et al. - 1997
179 | Learning from noisy examples – Angluin, Laird - 1988
170 | Query caching and optimization in distributed mediator systems – Adali, Candan, et al. - 1996
150 | Resource integration using a large knowledge base in Carnot – Huhns, Collet, et al. - 1991
140 | The Information Manifold – Kirk, Levy, et al. - 1995
122 | Inference of reversible languages – Angluin - 1982
106 | Semi-automatic wrapper generation for Internet information sources – Ashish, Knoblock
103 | A comparative review of selected methods for learning from examples – Dietterich - 1983
91 | Migrating Legacy Systems – Brodie, Stonebraker - 1995
90 | Towards heterogeneous multimedia information systems: The Garlic approach – Carey, Haas, et al. - 1995
86 | KQML–A language and protocol for knowledge and information exchange – Finin, Fritzson - 1994
68 | Computational learning theory: Survey and selected bibliography – Angluin - 1992
48 | The world wide web: Quagmire or gold mine? – Etzioni - 1996
39 | Intelligence without robots (a reply to Brooks) – Etzioni - 1993
38 | Learnability by fixed distributions – Benedek, Itai - 1988
30 | Building softbots for UNIX (preliminary report) – Etzioni, Lesh, et al. - 1993
29 | The Constraint-Based Knowledge Broker Model: Semantics, Implementation and Analysis – Andreoli, Borghoff, et al. - 1996
27 | Learning to query the web – Cohen, Singer - 1996
26 | Getty’s synoname and its cousins: A survey of applications of personal name-matching algorithms – Borgman, Siegfried - 1992
21 | Using natural language processing for identifying and interpreting tables in texts – Douglas, Hurst, et al. - 1995
19 | Towards Sophisticated Wrapping of Web-based Information Repositories – Chidlovskii, Borghoff, et al. - 1997
16 | On learning from noisy and incomplete examples – Decatur, Gennaro - 1995
12 | Investigating the distributional assumptions of the PAC learning model – Bartlett, Williamson - 1991
12 | On the synthesis of finite state machines from samples of their behavior – Biermann, Feldman - 1972
12 | Moving up the information food chain: softbots as information carnivores – Etzioni - 1996
7 | Quantifying inductive bias – Haussler - 1988
6 | 6th Message Understanding Conference – Proc - 1995
5 | Database Language SQL – ANSI - 1986
5 | Wrapper generation for semi-structured information sources – Ashish, Knoblock - 1997
4 | SIMS: Single interface to multiple sources – Arens, Knoblock, et al. - 1996
2 | Wrapper Construction for Information Extraction – Kushmerick - 1997
2 | Data Reverse Engineering: Slaying the Legacy Dragon – Aiken - 1995
2 | 6th Message Understanding Conf – Proc - 1995
2 | Scalable Internet discovery: Research problems and approaches – Bowman, Danzig, et al. - 1994
1 | Layout and language: Lists and tables in technical documents – Douglas, Hurst - 1996