Results 1–10 of 24
Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data
, 2008
Abstract
Cited by 40 (8 self)
Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real-world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
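The k-occurrence restriction is easy to check mechanically. Below is a minimal sketch (not from the paper; the function name and the crude element-name tokenizer are my own) that computes the smallest k for which an expression is a k-ORE, simply by counting how often each alphabet symbol appears in the expression:

```python
import re
from collections import Counter

def min_occurrence_k(expr):
    """Smallest k for which expr is a k-occurrence regular expression
    (k-ORE): the maximum number of times any alphabet symbol appears
    in the expression itself.  Element names are extracted with a
    crude identifier tokenizer."""
    counts = Counter(re.findall(r"[A-Za-z]\w*", expr))
    return max(counts.values()) if counts else 0

print(min_occurrence_k("(a+b)* a (a+b)*"))        # 3: 'a' occurs three times
print(min_occurrence_k("title (author)+ year?"))  # 1: a single-occurrence expression
```

A 1-ORE is exactly a single-occurrence expression, which connects this class to the SOREs of the next entry.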
Inference of Concise Regular Expressions and DTDs
Abstract
Cited by 24 (4 self)
We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML documents, a problem that basically reduces to learning concise regular expressions from positive example strings. We identify two classes of concise regular expressions—the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs)—that capture the vast majority of expressions used in practical DTDs. For the inference of SOREs we present several algorithms that first infer an automaton for a given set of example strings and then translate that automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton-to-regular-expression rewrite technique which is of independent interest. When only a very small amount of XML data is available, however (for instance when the data is generated by Web service requests or by answers to queries), these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel learning algorithm CRX that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that CRX performs very well within its target class on very small datasets.
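The first phase the abstract describes, inferring an automaton from example strings, can be loosely illustrated with a single-occurrence automaton: one state per alphabet symbol, with an edge (x, y) whenever y immediately follows x in some example. This sketch (the function name and the 'src'/'snk' marker states are my own, and it omits the repair and translation-to-SORE steps entirely) is only meant to show the idea:

```python
def build_soa(examples):
    """Single-occurrence automaton: one state per alphabet symbol, plus
    hypothetical 'src'/'snk' marker states; edge (x, y) is added whenever
    y immediately follows x in some example word."""
    edges = set()
    for word in examples:
        prev = "src"
        for sym in word:
            edges.add((prev, sym))
            prev = sym
        edges.add((prev, "snk"))
    return edges

print(sorted(build_soa(["ab", "aab", "b"])))
# [('a', 'a'), ('a', 'b'), ('b', 'snk'), ('src', 'a'), ('src', 'b')]
```

Because every symbol is a single state, any regular expression read off such an automaton uses each symbol at most once, which is the defining property of a SORE.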
Schema evolution for XML: A consistency-preserving approach
 In Mathematical Foundations of Computer Science, number 3153 in LNCS
, 2004
Abstract
Cited by 10 (3 self)
This paper deals with updates of XML documents that satisfy a given schema, e.g., a DTD. In this context, when a given update violates the schema, it may still be accepted, which then requires changing the schema. Our method is intended to be used by a data administrator who is an expert in the application domain of the database, but who is not required to be a computer science expert. Our approach consists of proposing different schema options derived from the original one. The method is consistency-preserving: documents valid with respect to the original schema remain valid. The schema evolution is implemented by an algorithm (called GREC) that performs changes on the graph of a finite-state automaton and generates regular expressions for the modified graphs. Each regular expression proposed by GREC is a choice of schema offered to the administrator.
Generating XML Structure Using Examples and Constraints
, 2008
Abstract
Cited by 7 (0 self)
This paper presents a framework for automatically generating structural XML documents. The user provides a target DTD and an example of an XML document, called a GenerateXMLByExample document, or GxBE document for short. GxBE documents use a natural declarative syntax, which includes XPath expressions and the function count. Using GxBE documents, users can express important global and local characteristics of the desired target documents, and can require satisfaction of XPath expressions from a given workload. This paper explores the problem of efficiently generating a document that satisfies a given DTD and GxBE document.
A Characterization of Thompson Digraphs
 Discrete Applied Mathematics
, 1999
Abstract
Cited by 5 (2 self)
A finite-state machine is called a Thompson machine if it can be constructed from a regular expression using Thompson's construction. We call the underlying digraph of a Thompson machine a Thompson digraph. We characterize Thompson digraphs and, as one application of the characterization, we give an algorithm that generates an equivalent regular expression from a Thompson machine in time linear in the number of states. Although the construction is simple, it is novel in that the usual constructions of equivalent regular expressions from finite-state machines produce regular expressions that have size exponential in the size of the given machine, in the worst case. The construction provides a first step in the construction of small expressions from finite-state machines. In 1968, Thompson [8] introduced his inductive construction of a finite-state machine from a regular expression. Thompson's construction is elegant and efficient. Although Kleene [5] gave an inductive con...
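Thompson's construction itself is standard and compact enough to sketch. The version below (my own minimal rendering, operating on a postfix regex with '.' for concatenation, '|' for union and '*' for star) builds the ε-NFA fragments inductively, together with a small simulator to exercise the result:

```python
class NFA:
    """Mutable ε-NFA under construction: trans maps each state to a
    list of (symbol, target) pairs, with symbol None meaning ε."""
    def __init__(self):
        self.trans = {}
        self.n = 0
    def state(self):
        s = self.n
        self.n += 1
        self.trans[s] = []
        return s

def thompson(postfix):
    """Thompson's construction over a postfix regex:
    '.' = concatenation, '|' = union, '*' = star, anything else a literal.
    Returns the NFA and a (start, accept) fragment."""
    nfa, stack = NFA(), []
    for c in postfix:
        if c == '.':
            f2, f1 = stack.pop(), stack.pop()
            nfa.trans[f1[1]].append((None, f2[0]))
            stack.append((f1[0], f2[1]))
        elif c == '|':
            f2, f1 = stack.pop(), stack.pop()
            s, t = nfa.state(), nfa.state()
            nfa.trans[s] += [(None, f1[0]), (None, f2[0])]
            nfa.trans[f1[1]].append((None, t))
            nfa.trans[f2[1]].append((None, t))
            stack.append((s, t))
        elif c == '*':
            f = stack.pop()
            s, t = nfa.state(), nfa.state()
            nfa.trans[s] += [(None, f[0]), (None, t)]
            nfa.trans[f[1]] += [(None, f[0]), (None, t)]
            stack.append((s, t))
        else:
            s, t = nfa.state(), nfa.state()
            nfa.trans[s].append((c, t))
            stack.append((s, t))
    return nfa, stack.pop()

def accepts(nfa, frag, word):
    """Simulate the Thompson machine via ε-closures."""
    def closure(states):
        todo, seen = list(states), set(states)
        while todo:
            s = todo.pop()
            for sym, t in nfa.trans[s]:
                if sym is None and t not in seen:
                    seen.add(t)
                    todo.append(t)
        return seen
    cur = closure({frag[0]})
    for ch in word:
        cur = closure({t for s in cur for sym, t in nfa.trans[s] if sym == ch})
    return frag[1] in cur

nfa, frag = thompson("ab|*")  # (a|b)* in postfix
print(accepts(nfa, frag, "abba"), accepts(nfa, frag, ""))  # True True
```

Each fragment has a single start and a single accept state, which is what gives Thompson digraphs the regular shape the paper characterizes.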
Compact and Fast Algorithms for Regular Expression Search
 International Journal of Computer Mathematics (IJCM)
, 2004
Abstract
Cited by 3 (0 self)
This paper describes an improvement of the brute-force determinization algorithm in the case of homogeneous NFAs, as well as its application to pattern matching. Brute-force determinization with limited memory may yield a partially determinized automaton, but its bounded complexity makes it a fail-safe procedure, unlike the classical subset construction.
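As a loose illustration of determinization under a memory bound (this is my own sketch of the classical subset construction with a crude state budget, not the paper's algorithm for homogeneous NFAs):

```python
from collections import deque

def determinize(nfa_trans, start, max_states=None):
    """Classic subset construction.  nfa_trans maps an NFA state to a
    list of (symbol, target) pairs.  If max_states is given, expansion
    stops once the budget is reached, leaving a partially determinized
    automaton (a crude analogue of bounded-memory determinization)."""
    dfa = {}
    queue = deque([frozenset([start])])
    while queue:
        S = queue.popleft()
        if S in dfa:
            continue
        if max_states is not None and len(dfa) >= max_states:
            break  # budget exhausted: remaining subsets stay unexpanded
        dfa[S] = {}
        for s in S:
            for sym, t in nfa_trans.get(s, []):
                dfa[S].setdefault(sym, set()).add(t)
        for sym, T in dfa[S].items():
            dfa[S][sym] = frozenset(T)
            queue.append(frozenset(T))
    return dfa

# Hypothetical NFA for (a|b)*ab over states 0, 1, 2.
nfa = {0: [("a", 0), ("b", 0), ("a", 1)], 1: [("b", 2)]}
print(len(determinize(nfa, 0)))  # 3 subset states
```

Without a budget the subset construction can produce exponentially many states; the cutoff is what makes the bounded variant fail-safe at the price of incomplete determinization.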
Series-Parallel Automata and Short Regular Expressions
, 2009
Abstract
Cited by 3 (1 self)
Computing short regular expressions equivalent to a given finite automaton is a hard task. In this work we present a class of acyclic automata for which it is possible to obtain, in time O(n² log n), an equivalent regular expression of size O(n). A characterisation of this class is made using properties of the underlying digraphs, which correspond to the class of series-parallel digraphs. Using this characterisation we present an algorithm for the generation of automata of this class and an enumerative formula for the underlying digraphs with a given number of vertices.
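The series-parallel property of a two-terminal digraph can be tested with the classical series/parallel reduction rules: the digraph is series-parallel iff the reductions collapse it to a single edge from source to sink. This sketch (my own, not the paper's characterisation algorithm) illustrates the test:

```python
from collections import Counter

def is_series_parallel(edge_list, s, t):
    """Apply the classic reductions to a two-terminal multidigraph:
    parallel reduction merges duplicate edges, series reduction bypasses
    an internal vertex with in-degree 1 and out-degree 1.  The digraph
    is series-parallel iff it collapses to the single edge (s, t)."""
    edges = Counter(edge_list)
    changed = True
    while changed:
        changed = False
        for e in list(edges):          # parallel reduction
            if edges[e] > 1:
                edges[e] = 1
                changed = True
        indeg, outdeg = Counter(), Counter()
        for (u, v), m in edges.items():
            outdeg[u] += m
            indeg[v] += m
        for v in list(indeg):          # series reduction
            if v not in (s, t) and indeg[v] == 1 and outdeg[v] == 1:
                u = next(e[0] for e in edges if e[1] == v)
                w = next(e[1] for e in edges if e[0] == v)
                del edges[(u, v)], edges[(v, w)]
                edges[(u, w)] += 1
                changed = True
                break                  # degrees changed: recompute
    return set(edges) == {(s, t)} and edges[(s, t)] == 1

# The diamond s->a->t, s->b->t is series-parallel.
print(is_series_parallel([("s", "a"), ("a", "t"), ("s", "b"), ("b", "t")], "s", "t"))  # True
```

Adding the chord ("a", "b") to the diamond produces the classic non-series-parallel "N" pattern, on which the reductions get stuck.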
Assisting XML Schema Evolution that Preserves Validity
 Proc. of SBBD 2007
, 2007
Abstract
Cited by 3 (2 self)
We consider the problem of XML schema evolution that preserves the validity of existing documents with respect to the original schema. The aim of such schema evolution is to fit new needs without revalidating all existing valid XML documents. We propose an approach to assist users in specifying schema updates that have no impact on the validity of existing documents. An XML schema is modeled as a set of regular expressions, each constraining the content model of XML elements. Given the user's needs, we work on Glushkov graphs representing the regular expressions E in the schema. In this way, we directly identify the places in E that may be changed while preserving validity.
Identifying a reduced DTD from marked up documents
Abstract
Cited by 2 (1 self)
This paper describes a method for the automatic generation of simplified DTDs from a source DTD and a sample of marked-up documents. The purpose is to create the minimal DTD with which all the documents in the sample comply. In this way, new files can be created and parsed using this simplified DTD while remaining compliant with the original, more general one. The pruned DTD makes the task of markup easier, especially for inexperienced XML writers. This tool was used to obtain simplified versions of the Text Encoding Initiative DTD for use at the Miguel de Cervantes digital library.
Block-Deterministic Regular Languages
, 1998
Abstract
Cited by 2 (1 self)
We introduce the notions of blocked, block-marked and block-deterministic regular expressions. We characterize block-deterministic regular expressions with deterministic Glushkov block automata. The results can be viewed as a generalization of the characterization of one-unambiguous regular expressions with deterministic Glushkov automata. In addition, when a language L has a block-deterministic expression E, we can construct a deterministic finite-state automaton for L that has size linear in the size of E. A regular language is one-unambiguous, according to Brüggemann-Klein and Wood [4], if there is a deterministic Glushkov automaton for the language. An alternative definition of one-unambiguity based on regular expressions is that each position in a regular expression has at most one following position for each symbol in the expression's alphabet. The latter definition is used to define unambiguous content model groups in the Standard Generalized Markup Language (SGM...
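The position-based definition of one-unambiguity quoted above can be checked directly once the marked expression's first and follower sets are known. In the sketch below (my own; the Glushkov first/follow computation is omitted and the sets are supplied by hand) an expression is rejected as soon as some follower set, or the first set, contains two distinct positions carrying the same symbol:

```python
def one_unambiguous(first, follow, symbol_of):
    """Position-based test: the expression is one-unambiguous only if
    neither the first set nor any follower set contains two distinct
    positions labelled with the same alphabet symbol."""
    def distinct_symbols(positions):
        seen = {}
        for p in positions:
            a = symbol_of[p]
            if a in seen and seen[a] != p:
                return False
            seen[a] = p
        return True
    return distinct_symbols(first) and all(distinct_symbols(f) for f in follow.values())

# Marked expression for (a1 + b2)* a3 -- the classic non-one-unambiguous example.
symbol_of = {1: "a", 2: "b", 3: "a"}
first = {1, 2, 3}
follow = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: set()}
print(one_unambiguous(first, follow, symbol_of))  # False: positions 1 and 3 both carry 'a'
```

The block-deterministic classes of the paper relax exactly this position-per-symbol condition by letting blocks of symbols, rather than single symbols, label the positions.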