Automatic wrapper induction from hidden-web sources with domain knowledge
- In WIDM
Cited by 9 (3 self)
We present an original approach to the automatic induction of wrappers for sources of the hidden Web that does not need any human supervision. Our approach only needs domain knowledge expressed as a set of concept names and concept instances. There are two parts in extracting valuable data from hidden-Web sources: understanding the structure of a given HTML form and relating its fields to concepts of the domain, and understanding how resulting records are represented in an HTML result page. For the former problem, we use a combination of heuristics and of probing with domain instances; for the latter, we use a supervised machine learning technique adapted to tree-like information, applied to an automatic, imperfect, and imprecise annotation obtained using the domain knowledge. We show experiments that demonstrate the validity and potential of the approach.
Learning Extractors from Unlabeled Text using Relevant Databases
, 2007
Cited by 9 (1 self)
Supervised machine learning algorithms for information extraction generally require large amounts of training data. In many cases where labeling training data is burdensome, there may, however, already exist an incomplete database relevant to the task at hand. Records from this database can be used to label text strings that express the same information. For tasks where text strings do not follow the same format or layout, and additionally may contain extra information, labeling the strings completely may be problematic. This paper presents a method for training extractors which fill in missing labels of a text sequence that is partially labeled using simple high-precision heuristics. Furthermore, we improve the algorithm by utilizing labeled fields from the database. In experiments with BibTeX records and research paper citation strings, we show a significant improvement in extraction accuracy over a baseline that only relies on the database for training data.
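The idea of using database records to partially label text strings can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names (`tokenize`, `partial_labels`), the exact-match rule, and the sample record are assumptions made for illustration. A field value labels tokens only when it occurs exactly once in the string, which stands in for a "simple high-precision heuristic"; tokens left as `None` are the missing labels a trained extractor would have to fill in.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def partial_labels(citation, record):
    """Label citation tokens with database field names where a field value
    matches unambiguously; leave every other token unlabeled (None)."""
    tokens = tokenize(citation)
    labels = [None] * len(tokens)
    for field, value in record.items():
        needle = tokenize(value)
        if not needle:
            continue
        width = len(needle)
        # All positions where the field value occurs as a contiguous run.
        starts = [
            i for i in range(len(tokens) - width + 1)
            if tokens[i:i + width] == needle
        ]
        # High-precision rule (assumed): accept only a single, unambiguous
        # match that does not overlap tokens labeled by an earlier field.
        if len(starts) == 1:
            start = starts[0]
            if all(label is None for label in labels[start:start + width]):
                for j in range(start, start + width):
                    labels[j] = field
    return list(zip(tokens, labels))


# Hypothetical citation string and (incomplete) database record.
citation = "Smith, J. (2007). Learning extractors from text. In Proc. of AAAI."
record = {
    "title": "Learning Extractors from Text",
    "year": "2007",
    "journal": "Machine Learning",  # not present in the string: no labels
}
labeled = partial_labels(citation, record)
```

Here the author and venue tokens stay unlabeled because the hypothetical record has no matching fields, which mirrors the partially labeled sequences the abstract describes.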
Integrating hidden Markov models into semantic web annotation platforms
Cited by 2 (0 self)
Information Extraction Approaches Used in Semantic Annotation Platforms; Hidden Markov Models; Hidden Markov Models and Semantic Annotation
Reference metadata extraction using a hierarchical knowledge representation
, 2006
Cited by 1 (0 self)
FireCite: Lightweight real-time reference string extraction from webpages
We present FireCite, a Mozilla Firefox browser extension that helps scholars assess and manage scholarly references on the web by automatically detecting and parsing such reference strings in real-time. FireCite has two main components: 1) a reference string recognizer that has a high recall of 96%, and 2) a reference string parser that can process HTML web pages with an overall F1 of .878 and plaintext reference strings with an overall F1 of .97. In our preliminary evaluation, we presented our FireCite prototype to four academics in separate unstructured interviews. Their positive feedback gives evidence of the desirability of FireCite’s citation management capabilities.
A Simple Extraction Procedure for Bibliographical Author Field
, 902
A procedure for bibliographic author metadata extraction from scholarly texts is presented. The author segments are identified based on capitalization and line break patterns. Two main author layout templates, which can retrieve author names from a varied set of title pages, are provided. Additionally, several disambiguating rules are described.

