Results 1-10 of 66
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
, 2001
Cited by 299 (7 self)
The paper investigates techniques for extracting data from HTML sites through the use of automatically generated wrappers. To automate the wrapper generation and the data extraction process, the paper develops a novel technique to compare HTML pages and generate a wrapper based on their similarities and differences. Experimental results on real-life data-intensive Web sites confirm the feasibility of the approach.
Workflow Mining: Discovering process models from event logs
 IEEE Transactions on Knowledge and Data Engineering
, 2003
Cited by 239 (42 self)
Contemporary workflow management systems are driven by explicit process models, i.e., a completely specified workflow design is required in order to enact a given workflow process. Creating a workflow design is a complicated, time-consuming process, and typically there are discrepancies between the actual workflow processes and the processes as perceived by the management. Therefore, we have developed techniques for discovering workflow models. The starting point for such techniques is a so-called "workflow log" containing information about the workflow process as it is actually being executed. We present a new algorithm to extract a process model from such a log and represent it in terms of a Petri net. However, we will also demonstrate that it is not possible to discover arbitrary workflow processes. In this paper we explore a class of workflow processes that can be discovered. We show that the α-algorithm can successfully mine any workflow represented by a so-called SWF-net. Key words: Workflow mining, Workflow management, Data mining, Petri nets.
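The abstract does not spell out how a log is turned into a model. As a rough illustration only, the sketch below computes the log-based ordering relations (directly-follows, causal, parallel, unrelated) that are commonly used as the first step of this kind of mining; it does not build the Petri net. The function name `ordering_relations` and the sample log are hypothetical and are not taken from the paper.

```python
from itertools import product


def ordering_relations(log):
    """Derive ordering relations from a workflow log.

    `log` is a list of traces; each trace is a list of activity names
    in the order they were executed (an assumed, simplified log format).
    """
    activities = sorted({a for trace in log for a in trace})

    # a directly-follows b: b appears immediately after a in some trace.
    follows = {(t[i], t[i + 1]) for t in log for i in range(len(t) - 1)}

    # Causal: a is followed by b, but b is never followed by a.
    causal = {(a, b) for (a, b) in follows if (b, a) not in follows}

    # Parallel: a and b follow each other in both orders.
    parallel = {(a, b) for (a, b) in follows if (b, a) in follows}

    # Unrelated: neither order is ever observed.
    unrelated = {
        (a, b)
        for a, b in product(activities, repeat=2)
        if (a, b) not in follows and (b, a) not in follows
    }
    return follows, causal, parallel, unrelated


# Hypothetical log: B and C are interleaved, E is an alternative branch.
sample_log = [["A", "B", "C", "D"], ["A", "C", "B", "D"], ["A", "E", "D"]]
_, causal, parallel, _ = ordering_relations(sample_log)
print(sorted(causal))
# [('A', 'B'), ('A', 'C'), ('A', 'E'), ('B', 'D'), ('C', 'D'), ('E', 'D')]
print(sorted(parallel))
# [('B', 'C'), ('C', 'B')]
```

In this toy log, the causal pairs suggest where places of a net would connect activities, while the parallel pair marks B and C as concurrent.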
Discovering Models of Software Processes from Event-Based Data
 ACM Transactions on Software Engineering and Methodology
, 1998
Cited by 232 (7 self)
In this article we describe a Markov method that we developed specifically for process discovery, as well as two additional methods that we adopted from other domains and augmented for our purposes. The three methods range from the purely algorithmic to the purely statistical. We compare the methods and discuss their application in an industrial case study.
XTRACT: A System for Extracting Document Type Descriptors from XML Documents. Bell Labs Tech. Memorandum
, 1999
Cited by 103 (4 self)
XML is rapidly emerging as the new standard for data representation and exchange on the Web. An XML document can be accompanied by a Document Type Descriptor (DTD) which plays the role of a schema for an XML data collection. DTDs contain valuable information on the structure of documents and thus have a crucial role in the efficient storage of XML data, as well as the effective formulation and optimization of XML queries. In this paper, we propose XTRACT, a novel system for inferring a DTD schema for a database of XML documents. Since the DTD syntax incorporates the full expressive power of regular expressions, naive approaches typically fail to produce concise and intuitive DTDs. Instead, the XTRACT inference algorithms employ a sequence of sophisticated steps that involve: (1) finding patterns in the input sequences and replacing them with regular expressions to generate “general” candidate DTDs, (2) factoring candidate DTDs using adaptations of algorithms from the logic optimization literature, and (3) applying the Minimum Description Length (MDL) principle to find the best DTD among the candidates. The results of our experiments with real-life and synthetic DTDs demonstrate the effectiveness of XTRACT’s approach in inferring concise and semantically meaningful DTD schemas for XML databases.
Learning Models of Intelligent Agents
, 1996
Cited by 83 (2 self)
Agents that operate in a multiagent system need an efficient strategy to handle their encounters with other agents involved. Searching for an optimal interactive strategy is a hard problem because it depends mostly on the behavior of the others. In this work, interaction among agents is represented as a repeated two-player game, where the agents' objective is to look for a strategy that maximizes their expected sum of rewards in the game. We assume that agents' strategies can be modeled as finite automata. A model-based approach is presented as a possible method for learning an effective interactive strategy. First, we describe how an agent should find an optimal strategy against a given model. Second, we present an unsupervised algorithm that infers a model of the opponent's automaton from its input/output behavior. A set of experiments that show the potential merit of the algorithm is reported as well. Introduction: In recent years, a major research effort has been invested in desi...
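To make the modeling assumption concrete, here is a minimal sketch of strategies represented as finite automata playing a repeated two-player game. It is an illustration under assumptions, not the paper's learning algorithm: the Moore-machine encoding, the `MooreStrategy` and `play` names, the two example strategies, and the prisoner's-dilemma payoff table are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MooreStrategy:
    """A strategy as a finite automaton: each state emits one action,
    and the opponent's last action selects the next state."""
    start: str
    output: dict      # state -> own action
    transition: dict  # (state, opponent action) -> next state


# Cooperate first, then repeat the opponent's previous action.
TIT_FOR_TAT = MooreStrategy(
    start="c",
    output={"c": "C", "d": "D"},
    transition={("c", "C"): "c", ("c", "D"): "d",
                ("d", "C"): "c", ("d", "D"): "d"},
)

# Defect regardless of what the opponent does.
ALWAYS_DEFECT = MooreStrategy(
    start="d",
    output={"d": "D"},
    transition={("d", "C"): "d", ("d", "D"): "d"},
)

# Assumed stage-game rewards: (row player's reward, column player's reward).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}


def play(a, b, rounds):
    """Run the repeated game and return each player's sum of rewards."""
    state_a, state_b = a.start, b.start
    total_a = total_b = 0
    for _ in range(rounds):
        act_a, act_b = a.output[state_a], b.output[state_b]
        reward_a, reward_b = PAYOFF[(act_a, act_b)]
        total_a += reward_a
        total_b += reward_b
        # Each automaton observes the other's action and moves on.
        state_a = a.transition[(state_a, act_b)]
        state_b = b.transition[(state_b, act_a)]
    return total_a, total_b


print(play(TIT_FOR_TAT, ALWAYS_DEFECT, 5))  # (4, 9)
print(play(TIT_FOR_TAT, TIT_FOR_TAT, 5))    # (15, 15)
```

Given such a model of the opponent, an agent could compare candidate strategies by simulating them against the model, which is the kind of evaluation a model-based approach relies on.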
Diversity-based Inference of Finite Automata
 Journal of ACM
, 1994
Cited by 74 (1 self)
We present new procedures for inferring the structure of a finite-state automaton (FSA) from its input/output behavior, using access to the automaton to perform experiments. Our procedures use a new representation for finite automata, based on the notion of equivalence between tests. We call the number of such equivalence classes the diversity of the automaton; the diversity may be as small as the logarithm of the number of states of the automaton. For the special class of permutation automata, we describe an inference procedure that runs in time polynomial in the diversity and log(1/δ), where δ is a given upper bound on the probability that our procedure returns an incorrect result. (Since our procedure uses randomization to perform experiments, there is a certain controllable chance that it will return an erroneous result.) We also discuss techniques for handling more general automata. We present evidence for the practical efficiency of our approach. For example, our procedure is able to infer the structure of an automaton based on Rubik's Cube (which has approximately 10^19 states) in about 2 minutes on a DEC MicroVax. This automaton is many orders of magnitude larger than possible with previous techniques, which would require time proportional at least to the number of global states. (Note that in this example, only a small fraction (10^-14) of the global ...
Probably Approximately Correct Learning
 Proceedings of the Eighth National Conference on Artificial Intelligence
, 1990
Cited by 40 (1 self)
This paper surveys some recent theoretical results on the efficiency of machine learning algorithms. The main tool described is the notion of Probably Approximately Correct (PAC) learning, introduced by Valiant. We define this learning model and then look at some of the results obtained in it. We then consider some criticisms of the PAC model and the extensions proposed to address these criticisms. Finally, we look briefly at other models recently proposed in computational learning theory. Introduction: It's a dangerous thing to try to formalize an enterprise as complex and varied as machine learning so that it can be subjected to rigorous mathematical analysis. To be tractable, a formal model must be simple. Thus, inevitably, most people will feel that important aspects of the activity have been left out of the theory. Of course, they will be right. Therefore, it is not advisable to present a theory of machine learning as having reduced the entire field to its bare essentials. All ...
Opponent Modeling in a Multiagent System
 Adaptation and Learning in Multiagent Systems, Lecture Notes in Artificial Intelligence 1042
, 1995
Cited by 34 (0 self)
Agents that operate in a multiagent system need an efficient strategy to handle their encounters with other agents involved in that system. Searching for an optimal interactive strategy is a hard problem because it depends mostly on the behavior of the others. In this work, interaction among agents is represented as a repeated two-player game, where an agent's objective is to look for a strategy that maximizes its expected sum of rewards in the game. We assume that agents' strategies can be modeled as finite automata. A model-based reasoning approach is presented as a possible method for learning an efficient interactive strategy. First, we describe how an agent should find an optimal strategy against a given model. Second, we present a heuristic algorithm that infers a model of the opponent's automaton from its input/output behavior. A set of experiments that show the potential merit of the algorithm is reported as well. Keywords: Opponent modeling, Model-based reasoning, Finite au...
Efficient Reinforcement Learning
 In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory
, 1994
Cited by 32 (3 self)
In this paper we propose a new formal model for studying reinforcement learning, based on Valiant's PAC framework. In our model the learner does not have direct access to every state of the environment. Instead, every sequence of experiments starts in a fixed initial state and the learner is provided with a "reset" operation that interrupts the current sequence of experiments and starts a new one (from the initial state). We do not require the agent to learn the optimal policy but only a good approximation of it with high probability. More precisely, we require the learner to produce a policy whose expected value from the initial state is ε-close to that of the optimal policy, with probability no less than 1 − δ. For this model, we describe an algorithm that produces such an (ε, δ)-optimal policy, for any environment, in time polynomial in N, K, 1/ε, 1/δ, 1/(1 − β) and r_max, where N is the number of states of the environment, K is the maximum number of actions in a...
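The success criterion in this abstract can be written compactly. The notation below is assumed for illustration and is not taken from the paper: $s_0$ is the fixed initial state, $V^{\pi}(s_0)$ is the expected value of policy $\pi$ from $s_0$, $\pi^{*}$ is the optimal policy, and $\hat{\pi}$ is the policy the learner outputs. With $\varepsilon$ as the closeness parameter and $\delta$ as the allowed failure probability, the requirement reads:

```latex
\Pr\left[\, V^{\pi^{*}}(s_0) - V^{\hat{\pi}}(s_0) \le \varepsilon \,\right] \ge 1 - \delta
```

The probability is taken over whatever randomness affects the learner's experiments, so a smaller $\delta$ or $\varepsilon$ demands more experiments, consistent with the stated polynomial dependence on $1/\varepsilon$ and $1/\delta$.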
Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data
, 2008
Cited by 24 (5 self)
Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real-world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
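The definition of a k-occurrence regular expression can be checked mechanically. The sketch below computes the smallest k for which a given content-model expression is a k-ORE by counting how often each alphabet symbol appears. It only illustrates the definition and is not the learning algorithm from the paper; the function name `occurrence_bound`, the tokenization rule (symbols are identifier-like element names), and the example expressions are assumptions.

```python
import re
from collections import Counter


def occurrence_bound(expr: str) -> int:
    """Return the smallest k such that every alphabet symbol occurs
    at most k times in `expr` (0 for an expression with no symbols).

    Symbols are assumed to be element names made of letters, digits,
    underscores and hyphens; operators such as , | * + ? ( ) are ignored.
    """
    symbols = re.findall(r"[A-Za-z_][\w\-]*", expr)
    return max(Counter(symbols).values(), default=0)


# Every element name occurs once, so this is a 1-ORE.
print(occurrence_bound("title, author+, year?"))               # 1
# "author" occurs twice, so the smallest valid k is 2.
print(occurrence_bound("title, author+, (editor | author)*"))  # 2
```

A learner that tries increasing values of k, as the abstract describes, would only accept candidate expressions whose bound does not exceed the current k.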