Automatic Web News Extraction Using Tree Edit Distance (2004)

by Davi de Castro Reis, Alberto H. F. Laender, Paulo B. Golgher, Altigran S. da Silva
in Proceedings of the World Wide Web Conference (WWW04)

Abstract:

The Web is the largest data repository ever available in the history of humankind. Major efforts have been made to provide efficient access to relevant information within this huge repository of data. Although several techniques have been developed for the problem of Web data extraction, their use is still not widespread, mostly because of the need for extensive human intervention and the low quality of the extraction results. In this paper, we present a domain-oriented approach to Web data extraction and discuss its application to automatically extracting news from Web sites. Our approach is based on a highly efficient tree structure analysis that produces very effective results. We have tested our approach with several important Brazilian on-line news sites and achieved very precise results, correctly extracting 87.71% of the news in a set of 4088 pages distributed among 35 different sites.
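
The abstract does not spell out the tree structure analysis, so the following is only an illustrative sketch of the general idea named in the title: measuring how different two page trees are by the cheapest sequence of node insertions, deletions, and relabelings. It uses a simplified top-down variant in which inserting or deleting a child costs the size of its whole subtree. The `Node` class, the unit costs, and the function names are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical minimal stand-in for an HTML/DOM element."""
    label: str
    children: tuple = ()


def size(tree: Node) -> int:
    """Number of nodes in the subtree rooted at `tree`."""
    return 1 + sum(size(child) for child in tree.children)


def top_down_distance(a: Node, b: Node) -> int:
    """Simplified top-down tree edit distance with unit costs.

    Roots are always mapped to each other (relabel cost 1 if they differ).
    Child sequences are aligned with a string-edit-style dynamic program
    in which inserting or deleting a child costs its full subtree size.
    """
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)

    # d[i][j] = cost of turning the first i children of a
    # into the first j children of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),      # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),      # insert subtree
                d[i - 1][j - 1]
                + top_down_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]


# Two toy page skeletons that differ by one extra paragraph.
page_a = Node("html", (Node("body", (Node("h1"), Node("p"))),))
page_b = Node("html", (Node("body", (Node("h1"), Node("p"), Node("p"))),))
distance = top_down_distance(page_a, page_b)
```

Under these assumptions, pages generated from the same template yield small distances, so a distance like this could serve as a similarity measure for grouping structurally similar pages before extraction.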
