Wrapper Induction for Information Extraction
1997
Cited by 460 (30 self)
The Internet presents numerous sources of useful information---telephone directories, product catalogs, stock quotes, weather forecasts, etc. Recently, many systems have been built that automatically gather and manipulate such information on a user's behalf. However, these resources are usually formatted for use by people (e.g., the relevant content is embedded in HTML pages), so extracting their content is difficult. Wrappers are often used for this purpose. A wrapper is a procedure for extracting a particular resource's content. Unfortunately, hand-coding wrappers is tedious. We introduce wrapper induction, a technique for automatically constructing wrappers. Our techniques can be described in terms of three main contributions. First, we pose the problem of wrapper construction as one of inductive learn...
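As a rough illustration of what "a procedure for extracting a particular resource's content" can look like, here is a minimal sketch of a delimiter-based wrapper. It only shows wrapper execution, not induction, and the function name `run_wrapper`, the one-delimiter-pair-per-field design, and the sample page are assumptions made for this example rather than anything specified in the abstract.

```python
from typing import List, Tuple


def run_wrapper(page: str, delimiters: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Extract rows from `page` using one (left, right) delimiter pair per field.

    Hypothetical helper: scans left to right, and for each field takes the
    text between the next left delimiter and the following right delimiter.
    """
    rows: List[Tuple[str, ...]] = []
    pos = 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows  # no further field: drop any partial row and stop
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end  # continue scanning from the end of the extracted value
        rows.append(tuple(row))


# Sample page invented for illustration.
page = "<html><b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br></html>"
print(run_wrapper(page, [("<b>", "</b>"), ("<i>", "</i>")]))
# → [('Congo', '242'), ('Egypt', '20')]
```

Hand-writing the delimiter list is the tedious step; an induction procedure would instead try to choose such delimiters from labeled example pages.
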
Wrapper Induction: Efficiency and Expressiveness
Artificial Intelligence, 2000
Cited by 191 (12 self)
The Internet presents numerous sources of useful information---telephone directories, product catalogs, stock quotes, event listings, etc. Recently, many systems have been built that automatically gather and manipulate such information on a user's behalf. However, these resources are usually formatted for use by people (e.g., the relevant content is embedded in HTML pages), so extracting their content is difficult. Most systems use customized wrapper procedures to perform this extraction task. Unfortunately, writing wrappers is tedious and error-prone. As an alternative, we advocate wrapper induction, a technique for automatically constructing wrappers. In this article, we describe six wrapper classes, and use a combination of empirical and analytical techniques to evaluate the computational tradeoffs among them. We first consider expressiveness: how well the classes can handle actual Internet resources, and the extent to which wrappers in one class can mimic those in another. We then...
Discovering Models of Software Processes from Event-Based Data
ACM Transactions on Software Engineering and Methodology, 1998
Cited by 187 (7 self)
In this article we describe a Markov method that we developed specifically for process discovery, as well as two additional methods that we adopted from other domains and augmented for our purposes. The three methods range from the purely algorithmic to the purely statistical. We compare the methods and discuss their application in an industrial case study.
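The abstract does not spell out the Markov method, so the following is only a sketch of the general idea under stated assumptions: estimate first-order transition probabilities between event types from an event stream and keep the transitions above a cutoff as edges of a candidate process model. The function name `transition_model`, the `threshold` parameter, and the sample event names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def transition_model(events: List[str], threshold: float = 0.5) -> Dict[str, Dict[str, float]]:
    """Return {event: {next_event: probability}} for transitions at or above `threshold`.

    A first-order simplification chosen for illustration; it is not
    claimed to match the method described in the article.
    """
    pair_counts = Counter(zip(events, events[1:]))  # counts of (a, b) adjacent pairs
    out_totals = Counter(events[:-1])               # how often each event has a successor
    model: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        p = n / out_totals[a]
        if p >= threshold:
            model[a][b] = p
    return dict(model)


# Event stream invented for illustration.
log = ["edit", "compile", "test", "edit", "compile", "test", "edit", "review"]
for source, targets in transition_model(log).items():
    print(source, "->", targets)
```

In this toy log, "edit" is followed by "review" only one time in three, so that transition falls below the 0.5 cutoff and is treated as noise rather than as part of the model.
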
Computational Limitations on Learning from Examples
Journal of the ACM, 1988
Cited by 182 (10 self)
Abstract. The computational complexity of learning Boolean concepts from examples is investigated. It is shown for various classes of concept representations that these cannot be learned feasibly in a distribution-free sense unless R = NP. These classes include (a) disjunctions of two monomials, (b) Boolean threshold functions, and (c) Boolean formulas in which each variable occurs at most once. Relationships between learning of heuristics and finding approximate solutions to NP-hard optimization problems are given. Categories and Subject Descriptors: F.1.1 [Computation by Abstract Devices]: Models of Computation - relations among models; F.1.2 [Computation by Abstract Devices]: Modes of Computation - probabilistic computation; F.1.3 [Computation by Abstract Devices]: Complexity Classes - reducibility and completeness; I.2.6 [Artificial Intelligence]: Learning - concept learning; induction
Identifying hierarchical structure in sequences: A linear-time algorithm
1997
Cited by 131 (3 self)
SEQUITUR is an algorithm that infers a hierarchical structure from a sequence of discrete symbols by replacing repeated phrases with a grammatical rule that generates the phrase, and continuing this process recursively. The result is a hierarchical representation of the original sequence, which offers insights into its lexical structure. The algorithm is driven by two constraints that reduce the size of the grammar, and produce structure as a by-product. SEQUITUR breaks new ground by operating incrementally. Moreover, the method’s simple structure permits a proof that it operates in space and time that is linear in the size of the input. Our implementation can process 50,000 symbols per second and has been applied to an extensive range of real world sequences.
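To make "replacing repeated phrases with a grammatical rule" concrete, here is a small sketch that repeatedly replaces the most frequent repeated pair of adjacent symbols with a new rule. This is a deliberately simplified offline pair-replacement scheme, not SEQUITUR itself: it is neither incremental nor linear-time, and the names `infer_grammar`, `expand`, and the `R1`, `R2`, ... rule labels are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def count_digrams(symbols: List[str]) -> Counter:
    """Count non-overlapping occurrences of each adjacent pair (so 'aaa' has one 'aa')."""
    counts: Counter = Counter()
    last_start: Dict[Tuple[str, str], int] = {}
    for i in range(len(symbols) - 1):
        pair = (symbols[i], symbols[i + 1])
        if last_start.get(pair) == i - 1:
            continue  # overlaps the occurrence counted one position earlier
        counts[pair] += 1
        last_start[pair] = i
    return counts


def replace_digram(symbols: List[str], pair: Tuple[str, str], name: str) -> List[str]:
    """Replace each left-to-right occurrence of `pair` with the rule label `name`."""
    out: List[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(name)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def infer_grammar(sequence: str) -> Tuple[List[str], Dict[str, List[str]]]:
    """Return (top-level symbols, rules) such that expanding the rules yields `sequence`."""
    symbols = list(sequence)
    rules: Dict[str, List[str]] = {}
    while True:
        counts = count_digrams(symbols)
        if not counts:
            break
        pair, n = counts.most_common(1)[0]
        if n < 2:
            break  # no pair repeats, so no further rule is worth creating
        name = f"R{len(rules) + 1}"
        rules[name] = list(pair)
        symbols = replace_digram(symbols, pair, name)
    return symbols, rules


def expand(symbols: List[str], rules: Dict[str, List[str]]) -> str:
    """Recursively expand rule labels back into the original terminal symbols."""
    return "".join(expand(rules[s], rules) if s in rules else s for s in symbols)


top, rules = infer_grammar("abcabcabc")
print(top, rules)
print(expand(top, rules))  # → abcabcabc
```

The nesting of rules (one rule referring to another) is what gives the hierarchical representation the abstract describes, even in this simplified form.
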
Learning Subsequential Transducers for Pattern Recognition Interpretation Tasks
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1993
Cited by 88 (14 self)
Abstract. The “interpretation” framework in pattern recognition (PR) arises in the many cases in which the more classical paradigm of “classification” is not properly applicable, generally because the number of classes is rather large or simply because the concept of “class” does not hold. A very general way of representing the results of interpretations of given objects or data is in terms of sentences of a “semantic language” in which the actions to be performed for each different object or datum are described. Interpretation can therefore be conveniently formalized through the concept of formal transduction, giving rise to the central PR problem of how to automatically learn a transducer from a training set of examples of the desired input-output behavior. This paper presents a formalization of the stated transducer learning problem, as well as an effective and efficient method for the inductive learning of an important class of transducers, namely, the class of subsequential transducers. The capabilities of subsequential transductions are illustrated through a series of experiments that also show the high effectiveness of the proposed learning method in obtaining very accurate and compact transducers for the corresponding tasks. Index Terms: Formal languages, inductive inference, learning, rational transducers, subsequential functions, syntactic pattern recognition.
Inductive Inference, DFAs and Computational Complexity
2nd Int. Workshop on Analogical and Inductive Inference (AII, 1989
Cited by 73 (1 self)
This paper surveys recent results concerning the inference of deterministic finite automata (DFAs). The results discussed determine the extent to which DFAs can be feasibly inferred, and highlight a number of interesting approaches in computational learning theory.
Discovering Algebraic Specifications from Java Classes
ECOOP, 2003
Cited by 68 (4 self)
We present and evaluate an automatic tool for extracting algebraic specifications from Java classes. Our tool maps a Java class to an algebraic signature and then uses the signature to generate a large number of terms. The tool evaluates these terms and, based on the results of the evaluation, proposes equations. Finally, the tool generalizes equations to axioms and eliminates many redundant axioms. Since our tool uses dynamic information, it is not guaranteed to be sound or complete. However, we manually inspected the axioms generated in our experiments and found them all to be correct.
Diversity-based Inference of Finite Automata
Journal of the ACM, 1994
Cited by 63 (1 self)
Abstract. We present new procedures for inferring the structure of a finite-state automaton (FSA) from its input/output behavior, using access to the automaton to perform experiments. Our procedures use a new representation for finite automata, based on the notion of equivalence between tests. We call the number of such equivalence classes the diversity of the automaton; the diversity may be as small as the logarithm of the number of states of the automaton. For the special class of permutation automata, we describe an inference procedure that runs in time polynomial in the diversity and log(1/δ), where δ is a given upper bound on the probability that our procedure returns an incorrect result. (Since our procedure uses randomization to perform experiments, there is a certain controllable chance that it will return an erroneous result.) We also discuss techniques for handling more general automata. We present evidence for the practical efficiency of our approach. For example, our procedure is able to infer the structure of an automaton based on Rubik’s Cube (which has approximately 10^19 states) in about 2 minutes on a DEC MicroVax. This automaton is many orders of magnitude larger than possible with previous techniques, which would require time proportional at least to the number of global states. (Note that in this example, only a small fraction (10^-14) of the global
Wrapper Maintenance: A Machine Learning Approach
Journal of Artificial Intelligence Research, 2003
Cited by 54 (13 self)
The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year.

