Wrapper Induction for Information Extraction, 1997
Cited by 460 (30 self)
The Internet presents numerous sources of useful information---telephone directories, product catalogs, stock quotes, weather forecasts, etc. Recently, many systems have been built that automatically gather and manipulate such information on a user's behalf. However, these resources are usually formatted for use by people (e.g., the relevant content is embedded in HTML pages), so extracting their content is difficult. Wrappers are often used for this purpose. A wrapper is a procedure for extracting a particular resource's content. Unfortunately, hand-coding wrappers is tedious. We introduce wrapper induction, a technique for automatically constructing wrappers. Our techniques can be described in terms of three main contributions. First, we pose the problem of wrapper construction as one of inductive learn...
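To make the notion of a wrapper concrete, here is a minimal sketch of a hand-coded one, assuming each attribute on a page is marked by a fixed pair of left and right delimiter strings. The function name `extract_rows`, the delimiter pairs, and the sample page are illustrative assumptions, not taken from the abstract. It shows the kind of per-resource procedure that is tedious to write by hand.

```python
def extract_rows(page, delimiters):
    """Extract tuples from `page`, one attribute per (left, right) delimiter pair.

    Hypothetical helper for illustration: scans the page left to right and
    stops as soon as any delimiter can no longer be found.
    """
    rows = []
    pos = 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows  # no further complete row on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows  # unterminated attribute: drop the partial row
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


# Illustrative page: a country name in bold, a dialing code in italics.
page = "<html><b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br></html>"
delimiters = [("<b>", "</b>"), ("<i>", "</i>")]
rows = extract_rows(page, delimiters)
print(rows)
```

In this sketch the delimiter pairs are the only site-specific part, so a learning procedure would need to find suitable pairs from example pages rather than have a person write them.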
Learning to remove Internet advertisements, 1999
Cited by 42 (0 self)
AdEater is a fully implemented browsing assistant that automatically removes advertisement images from Internet pages. Unlike related systems that rely on hand-crafted rules, AdEater takes an inductive learning approach, automatically generating rules from training examples. Our experiments demonstrate that our approach is practical: the off-line training phase takes less than six minutes; on-line classification takes about 70 msec; and classification accuracy exceeds 97% given a modest set of training data.

1 Introduction

Many Internet sites draw income from third-party advertisements, usually in the form of images sprinkled throughout the site's pages. If judged to be interesting or relevant, users can click on these so-called "banner advertisements", jumping to the advertiser's own site. Some users prefer not to view such advertisements. Images tend to dominate a page's total download time, so users connecting through slow links find that advertisements substantially impede their b...
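
The idea of generating rules from training examples, rather than writing them by hand, can be illustrated with a deliberately crude sketch. It is not AdEater's learning algorithm: the feature choice (tokens in an image URL), the function names, and the example URLs are all assumptions made for illustration. The learner keeps every token that appears in some ad example and in no non-ad example, and flags an image as an ad if its URL contains any such token.

```python
import re


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return {t for t in re.split(r"[^a-z0-9]+", url.lower()) if t}


def learn_ad_tokens(examples):
    """Learn rule tokens from (url, is_ad) training pairs.

    Hypothetical learner: returns tokens seen only in ad URLs.
    """
    ad_tokens, other_tokens = set(), set()
    for url, label in examples:
        (ad_tokens if label else other_tokens).update(url_tokens(url))
    return ad_tokens - other_tokens


def looks_like_ad(url, rules):
    """Classify a URL as an ad if it contains any learned rule token."""
    return bool(url_tokens(url) & rules)


# Illustrative training data (made up for this sketch).
training = [
    ("http://example.com/ads/banner1.gif", True),
    ("http://example.com/images/logo.gif", False),
    ("http://example.org/ads/promo.gif", True),
    ("http://example.org/photos/cat.gif", False),
]
rules = learn_ad_tokens(training)
print(sorted(rules))
print(looks_like_ad("http://example.net/ads/new.gif", rules))
print(looks_like_ad("http://example.net/images/new.gif", rules))
```

A rule set like this overfits easily (a token such as `banner1` is kept after a single sighting), which is one reason a real system would use a proper learning algorithm and a larger training set.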

