Record-Boundary Discovery In Web Documents (1998)

by Yuan Jiang
Citations: 127 (20 self)

BibTeX

@MISC{Jiang98record-boundarydiscovery,
    author = {Yuan Jiang},
    title = {Record-Boundary Discovery In Web Documents},
    year = {1998}
}

Abstract

Extracting information from unstructured or semistructured Web documents often requires recognizing and delimiting records. (By "record" we mean a group of information relevant to some entity.) Unless documents that contain multiple records are first chunked according to record boundaries, extraction of record information is unlikely to succeed. In this thesis we describe a heuristic approach to discovering record boundaries in Web documents. In our approach, we capture the structure of a document as a tree of nested HTML tags, locate the subtree containing the records of interest, identify candidate separator tags within the subtree using five independent heuristics, and select a consensus separator tag based on a combined heuristic. Our approach is fast (runs in linear time for practical cases within the context of the larger data-extraction problem) and accurate (100% in the...
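
The pipeline outlined in the abstract (tag tree, record subtree, candidate separator tags, consensus) can be sketched in Python as follows. This is a minimal illustration, not the thesis's implementation: the abstract does not name its five heuristics or its combination rule, so the sketch assumes a highest-fan-out rule for locating the record subtree, substitutes two stand-in heuristics (tag frequency and regularity of the text between occurrences), and combines them with a simple rank sum. All class and function names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from statistics import pstdev

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # tags with no end tag
TEXT = "#text"  # pseudo-tag used for text nodes


class Node:
    def __init__(self, tag, parent=None, text_len=0):
        self.tag, self.parent, self.text_len = tag, parent, text_len
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tree of nested HTML tags; text becomes '#text' leaf nodes."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open ancestor with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        length = len(data.strip())
        if length:
            self.current.children.append(Node(TEXT, self.current, length))


def build_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def total_text(node):
    return node.text_len + sum(total_text(c) for c in node.children)


def highest_fanout_subtree(root):
    # Assumption: the records of interest sit under the node with most children.
    return max(iter_nodes(root), key=lambda n: len(n.children))


def rank_by_count(subtree):
    # Stand-in heuristic 1: more frequent child tags rank higher.
    counts = Counter(c.tag for c in subtree.children if c.tag != TEXT)
    return [tag for tag, _ in counts.most_common()]


def rank_by_regular_spacing(subtree):
    # Stand-in heuristic 2: tags separated by similar amounts of text rank higher.
    spreads = {}
    for tag in dict.fromkeys(c.tag for c in subtree.children if c.tag != TEXT):
        sizes, running = [], 0
        for child in subtree.children:
            if child.tag == tag:
                sizes.append(running)
                running = 0
            else:
                running += total_text(child)
        sizes = sizes[1:]  # drop the text that precedes the first occurrence
        if len(sizes) >= 2:
            spreads[tag] = pstdev(sizes)
    return sorted(spreads, key=spreads.get)


def consensus_separator(html):
    # Assumed combination rule: sum of rank-based points across heuristics.
    subtree = highest_fanout_subtree(build_tree(html))
    scores = Counter()
    for ranking in (rank_by_count(subtree), rank_by_regular_spacing(subtree)):
        for position, tag in enumerate(ranking):
            scores[tag] += len(ranking) - position
    return scores.most_common(1)[0][0] if scores else None
```

On a page whose records are separated by `<hr>` tags inside a single table cell, `consensus_separator` would be expected to pick `hr`. The builder does not repair unclosed non-void tags such as `<p>`, so real-world pages would need a more forgiving parser.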

Keyphrases

web document, record-boundary discovery, record boundary, semistructured web document, multiple record, data-extraction problem, record information, candidate separator tag, nested html tag, practical case, independent heuristic, information relevant, heuristic approach, consensus separator tag
