In this paper we present a method for automatically segmenting unformatted text records into structured elements. Many useful data sources today are human-generated as continuous text, whereas convenient use requires the data to be organized as structured records. A prime motivation is the warehouse address cleaning problem: transforming dirty addresses, stored in large corporate databases as a single text field, into subfields such as "City" and "Street". Existing tools rely on hand-tuned, domain-specific rule-based systems.