Ad Hoc Data and the Token Ambiguity Problem

by Qian Xi, Kathleen Fisher, David Walker, Kenny Q. Zhu
Citations: 4 (3 self)
BibTeX

@MISC{Xi_adhoc,
    author = {Qian Xi and Kathleen Fisher and David Walker and Kenny Q. Zhu},
    title = {Ad Hoc Data and the Token Ambiguity Problem},
    year = {}
}

Abstract

PADS is a declarative language used to describe the syntax and semantic properties of ad hoc data sources such as financial transactions, server logs, and scientific data sets. The PADS compiler reads these descriptions and generates a suite of useful data processing tools such as format translators, parsers, printers, and even a query engine, all customized to the ad hoc data format in question. Recently, however, to further improve the productivity of programmers who manage ad hoc data sources, we have turned to using PADS as an intermediate language in a system that first infers a PADS description directly from example data and then passes that description to the original compiler for tool generation. A key subproblem in the inference engine is the token ambiguity problem: determining which substrings in the example data correspond to complex tokens such as dates, URLs, or comments. To solve the token ambiguity problem, the paper studies the relative effectiveness of three different statistical models for tokenizing ad hoc data. It also shows how to incorporate these models into a general and effective format inference algorithm. In addition to using a declarative language (PADS) as a key intermediate form, we have implemented the system as a whole in ML.
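
The token ambiguity problem described in the abstract can be illustrated with a small sketch. The token names and regular expressions below are hypothetical and are not taken from PADS or from the paper; the code only enumerates the competing tokenizations of a string and does not implement any of the paper's statistical models, which would be needed to choose among them.

```python
import re
from typing import List, Tuple

# Hypothetical token definitions, chosen only for illustration.
# Several of them can match at the same position, which is the
# source of the ambiguity.
TOKEN_PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "int": re.compile(r"\d+"),
    "punct": re.compile(r"[-/:.]"),
    "word": re.compile(r"[A-Za-z]+"),
    "white": re.compile(r" +"),
}

Token = Tuple[str, str]  # (token name, matched text)


def tokenizations(text: str, start: int = 0) -> List[List[Token]]:
    """Return every way to split text[start:] into known tokens.

    Each pattern is tried at the current position; every pattern that
    matches yields one branch, so overlapping definitions produce
    several candidate token sequences for the same input.
    """
    if start == len(text):
        return [[]]  # exactly one way to tokenize the empty suffix
    results: List[List[Token]] = []
    for name, pattern in TOKEN_PATTERNS.items():
        match = pattern.match(text, start)
        if match and match.end() > start:
            for rest in tokenizations(text, match.end()):
                results.append([(name, match.group())] + rest)
    return results


if __name__ == "__main__":
    for candidate in tokenizations("2010-05-01"):
        print(candidate)
```

With these definitions, the string `2010-05-01` has two readings: a single `date` token, or the sequence `int punct int punct int`. An inference engine has to decide which reading is intended before it can infer a format description, and the number of candidates grows quickly as more complex token types are added.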
