Automatic Segmentation of Text Into Structured Records (2001) [40 citations — 0 self]

by Vinayak Borkar, Kaustubh Deshmukh, Sunita Sarawagi

Abstract:

In this paper we present a method for automatically segmenting unformatted text records into structured elements. Several useful data sources today are human-generated as continuous text whereas convenient usage requires the data to be organized as structured records. A prime motivation is the warehouse address cleaning problem of transforming dirty addresses stored in large corporate databases as a single text field into subfields like "City" and "Street". Existing tools rely on hand-tuned, domain-specific rule-based systems.

Citations

2372 A tutorial on hidden Markov Models and selected applications in speech recognition – Rabiner - 1989
408 Wrapper induction for information extraction – Kushmerick, Doorenbos, et al. - 1997
259 Maximum entropy Markov models for information extraction and segmentation – McCallum, Freitag, et al. - 2000
249 Learning Information Extraction Rules for Semi-structured and Free Text – Soderland - 1999
231 Relational learning of pattern-match rules for information extraction – Califf, Mooney - 1999
211 Digital Libraries and Autonomous Citation Indexing – Lawrence, Giles, et al. - 1999
206 Nymble: a high-performance Learning Name-finder – Bikel, Miller, et al. - 1997
202 The merge/purge problem for large databases – Hernandez, Stolfo - 1995
156 Extracting semistructured information from the Web – Hammer, Garcia-Molina, et al. - 1997
120 A Hierarchical Approach to Wrapper Induction – Muslea, Minton, et al. - 1999
115 Learning hidden Markov model structure for information extraction – Seymore, McCallum, et al. - 1999
113 XWRAP: An XML-enabled wrapper construction system for web information sources – Liu, Pu, et al. - 2000
102 Generating Finite-State Transducers for Semi-Structured Data Extraction from – Hsu, Dung - 1998
98 The Field Matching Problem: Algorithms and Applications – Monge, Elkan - 1996
83 Record-boundary discovery in Web documents – Embley, Jiang, et al. - 1999
72 Extraction Patterns for Information Extraction Tasks: A Survey – Muslea - 1999
69 Information extraction with HMM structures learned by stochastic optimization – Freitag - 2000
59 Learning information extraction patterns from examples – Huffman - 1995
53 Building LightWeight Wrappers for Legacy Web Data-Sources Using W4F – Sahuguet, Azavant - 1999
47 Philosophical Essays on Probabilities – Laplace - 1995
24 Information extraction using HMMs and shrinkage – Freitag, McCallum - 1999
20 Araneus in the Era of XML – Mecca, Merialdo, et al. - 1999
12 Dealing with Dirty Data – Kimball - 1996
6 NoDoSE: A tool for semi-automatically extracting structured and semistructured data from text documents – Adelberg - 1998
6 A survey of semi-automatic extraction and transformation. http://www-db.stanford.edu/ crespo/publications – Crespo, Jannink, et al. - 2002
6 Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language – Kupiec - 1992
4 Theaterloc: Using information integration technology to rapidly build virtual applications – Barish, Chen, et al. - 2000