Automatic Segmentation of Text Into Structured Records
, 2001
Cited by 52 (0 self)
In this paper we present a method for automatically segmenting unformatted text records into structured elements. Several useful data sources today are human-generated as continuous text whereas convenient usage requires the data to be organized as structured records. A prime motivation is the warehouse address cleaning problem of transforming dirty addresses stored in large corporate databases as a single text field into subfields like "City" and "Street". Existing tools rely on hand-tuned, domain-specific rule-based systems.
Automatically extracting structure from free text addresses
- Bulletin of the Technical Committee on Data Engineering
, 2000
Cited by 12 (0 self)
In this paper we present a novel way to automatically elementize postal addresses seen as a plain text string into atomic structured elements like "City" and "Street name". This is an essential step in all warehouse data cleaning activities. In spite of the practical importance of the problem and the technical challenges it offers, research effort on the topic has been limited. Existing commercial approaches are based on hand-tuned, rule-based approaches that are brittle and require extensive manual effort when moved to a different postal system. We present a Hidden Markov Model based approach that can work with just about any address domain when seeded with a small training data set. Experiments on real-life datasets yield accuracy of 89% on a heterogeneous nationwide database of Indian postal addresses and 99.6% on US addresses that tend to be more templatized.
Information extraction and automatic markup for XML documents
- In Blanken et al
, 2003
Cited by 8 (0 self)
As XML is going to become the standard document format, there is still the legacy problem of large amounts of text (written in the past as well as today) that are not available in this format. In order to exploit the benefits of XML, these legacy texts must be converted into XML. In this chapter, we discuss the
Searching with Numbers
- Proceedings of WWW
, 2002
Cited by 7 (0 self)
A large fraction of the useful web comprises specification documents that largely consist of ⟨attribute name, numeric value⟩ pairs embedded in text. Examples include product information, classified advertisements, resumes, etc. The approach taken in the past to search these documents by first establishing correspondences between values and their names has achieved limited success because of the difficulty of extracting this information from free text. We propose a new approach that does not require this correspondence to be accurately established. Provided the data has "low reflectivity", we can do effective search even if the values in the data have not been assigned attribute names and the user has omitted attribute names in the query. We give algorithms and indexing structures for implementing the search. We also show how hints (i.e., imprecise, partial correspondences) from automatic data extraction techniques can be incorporated into our approach for better accuracy on high reflectivity datasets. Finally, we validate our approach by showing that we get high precision in our answers on real datasets from a variety of domains.
XML Information Retrieval and Information Extraction
- Text Mining. Theoretical Aspects and Applications
, 2003
Cited by 5 (0 self)
We present a new query language for information retrieval in XML documents and discuss its combination with information extraction methods. XIRQL is an XML query language which implements IR-related features such as weighting and ranking, relevance-oriented search, datatypes with vague predicates, and structural relativism. For information extracted from texts, XIRQL can rank records based on uncertainty weights, and single conditions may be evaluated using vague predicates for fact retrieval. When IE is used for automatic XML markup of plain texts, XIRQL is able to consider uncertainty weights resulting from this process, and the markup leads to increased precision of text searches.
Text Structure Recognition using a Region Algebra
, 2001
Cited by 1 (0 self)
We consider the problem of incrementally developing a parser for text structure. This means building the parser specification a piece at a time while simultaneously developing our understanding of the text. We argue that existing solutions...

