Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data, 2008
Cited by 39 (7 self)
Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
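The defining property of a k-ORE, that each alphabet symbol occurs at most k times, can be checked mechanically. The following is a minimal sketch, not the learning algorithm from the paper. The function name `occurrence_bound` and the textual expression syntax (element names as identifiers, everything else treated as an operator) are assumptions made for illustration.

```python
import re
from collections import Counter


def occurrence_bound(expr: str) -> int:
    """Return the smallest k such that `expr` is a k-ORE.

    Assumption: alphabet symbols are identifier-like tokens; parentheses,
    '|', '?', '*', '+', commas and whitespace are operators or separators.
    """
    symbols = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", expr)
    counts = Counter(symbols)
    # An expression without symbols trivially has bound 0.
    return max(counts.values(), default=0)


# Every element name occurs once, so this is a 1-ORE.
print(occurrence_bound("title, author+, (year | date)?"))
# The symbol 'a' occurs twice, so this is a 2-ORE but not a 1-ORE.
print(occurrence_bound("a (b | c)* a"))
```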
Finite Automata, Digraph Connectivity, and Regular Expression Size - ICALP, 2008
Inference of Concise Regular Expressions and DTDs
Cited by 25 (4 self)
We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML documents, a problem that basically reduces to learning concise regular expressions from positive example strings. We identify two classes of concise regular expressions, the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs), that capture the vast majority of expressions used in practical DTDs. For the inference of SOREs we present several algorithms that first infer an automaton for a given set of example strings and then translate that automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton-to-regular-expression rewrite technique which is of independent interest. When only a very small amount of XML data is available, however (for instance when the data is generated by Web service requests or by answers to queries), these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel learning algorithm CRX that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that CRX performs very well within its target class on very small datasets.
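The first step described above, inferring an automaton from example strings, can be illustrated with a simple construction in which every symbol is its own state and an edge is added for each pair of adjacent symbols seen in the sample. This is a hedged sketch of one plausible construction, not necessarily the one used in the paper. The names `infer_soa`, `accepts`, `SRC` and `SNK` are hypothetical, and the translation of the automaton to a SORE is not shown.

```python
SRC, SNK = "<src>", "<snk>"  # hypothetical markers for start and end of a string


def infer_soa(samples):
    """Collect the edges of an automaton with one state per symbol.

    An edge (p, q) means that q was observed directly after p in some sample;
    SRC and SNK mark the beginning and the end of a string.
    """
    edges = set()
    for word in samples:
        prev = SRC
        for sym in word:
            edges.add((prev, sym))
            prev = sym
        edges.add((prev, SNK))
    return edges


def accepts(edges, word):
    """Check whether `word` follows only edges of the inferred automaton."""
    prev = SRC
    for sym in word:
        if (prev, sym) not in edges:
            return False
        prev = sym
    return (prev, SNK) in edges


samples = [["a", "b", "c"], ["a", "c"], ["a", "b", "b", "c"]]
edges = infer_soa(samples)
print(accepts(edges, ["a", "b", "b", "b", "c"]))  # unseen string, still accepted
print(accepts(edges, ["b", "c"]))                 # no string in the sample starts with b
```

For this sample the automaton accepts exactly the strings described by the expression `a b* c`, in which each symbol occurs once.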
Simplifying XML Schema: Effortless Handling of Nondeterministic Regular Expressions, 2009
Cited by 16 (10 self)
Whether beloved or despised, XML Schema is currently the only industrially accepted schema language for XML and is unlikely to become obsolete any time soon. Nevertheless, many nontransparent restrictions unnecessarily complicate the design of XSDs. For instance, complex content models in XML Schema are constrained by the infamous unique particle attribution (UPA) constraint. In formal language theoretic terms, this constraint restricts content models to deterministic regular expressions. As the latter constitute a semantic notion and no simple corresponding syntactic characterization is known, it is very difficult for non-expert users to understand exactly when and why content models do or do not violate UPA. In the present paper, we therefore investigate solutions to relieve users of the burden of UPA by automatically transforming nondeterministic expressions into concise deterministic ones defining the same language or constituting good approximations. The presented techniques facilitate XSD construction by bringing the complexity of the design task closer to that of the modeling task. In addition, our algorithms can serve as a plug-in for
Simplifying XML Schema: Single-Type Approximations of Regular Tree Languages, 2010
Cited by 11 (4 self)
XML Schema Definitions (XSDs) can be adequately abstracted by the single-type regular tree languages. It is well known that these form a strict subclass of the robust class of regular unranked tree languages. Sadly, in this respect, XSDs are not closed under the basic operations of union and set difference, complicating important tasks in schema integration and evolution. The purpose of this paper is to investigate how the union and difference of two XSDs can be approximated within the framework of single-type regular tree languages. We consider both optimal lower and upper approximations. We also address the more general question of how to approximate an arbitrary regular tree language by an XSD and consider the complexity of associated decision problems.
Tight Bounds on the Descriptional Complexity of Regular Expressions, 2009
Cited by 9 (2 self)
We improve on some recent results on lower bounds for conversion problems for regular expressions. In particular we consider the conversion of planar deterministic finite automata to regular expressions, study the effect of the complementation operation on the descriptional complexity of regular expressions, and the conversion of regular expressions extended by adding intersection or interleaving to ordinary regular expressions. Almost all obtained lower bounds are optimal, and the presented examples are over a binary alphabet, which is best possible.
Provably shorter regular expressions from deterministic finite automata (Extended Abstract) - Developments in Language Theory, 2008
Cited by 8 (3 self)
We study the problem of finding good elimination orderings for the state elimination algorithm, which is one of the most popular algorithms for the conversion of finite automata into equivalent regular expressions. Based on graph separator techniques we are able to describe elimination strategies that remove states in large induced subgraphs of the underlying automaton that are “simple” (e.g., independent sets or subgraphs of bounded treewidth), and that lead to regular expressions of moderate size. In particular, we show that there is an elimination ordering such that every language over a binary alphabet accepted by an n-state deterministic finite automaton has alphabetic width at most O(1.742^n), which is, to our knowledge, the algorithm with currently the best known performance guarantee. Finally, we apply our technique to the question of the effect of language operations on regular expression size. In the case of the intersection operation we prove an upper bound which matches, up to a small factor, a lower bound recently obtained in [9, 10], and thus settles an open problem stated in [7].
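To make the role of the elimination ordering concrete, the following is a minimal sketch of textbook state elimination on a DFA, producing a pattern in Python `re` syntax. It does not implement the separator-based strategies of the paper. The function names, the dictionary encoding of the DFA, and the use of string length as a stand-in for expression size are assumptions made for illustration.

```python
import re
from itertools import product


def union(r, s):
    """Union of two edge labels; None stands for the empty language."""
    if r is None:
        return s
    if s is None:
        return r
    return f"(?:{r}|{s})"


def star(r):
    """Kleene star of a self-loop label; a missing loop contributes the empty word."""
    return "" if r is None else f"(?:{r})*"


def eliminate(states, start, finals, transitions, order):
    """Convert a DFA to a regular expression by eliminating states in `order`.

    transitions maps (state, symbol) -> state; order must list every state once.
    Returns None if the accepted language is empty.
    """
    init, fin = object(), object()  # fresh initial and final states
    edge = {}

    def add(p, q, r):
        edge[(p, q)] = union(edge.get((p, q)), r)

    for (p, a), q in transitions.items():
        add(p, q, re.escape(a))
    add(init, start, "")
    for f in finals:
        add(f, fin, "")

    remaining = set(states) | {init, fin}
    for s in order:
        remaining.discard(s)
        loop = star(edge.get((s, s)))
        preds = [p for p in remaining if (p, s) in edge]
        succs = [q for q in remaining if (s, q) in edge]
        # Reroute every path p -> s -> q around the eliminated state s.
        for p, q in product(preds, succs):
            add(p, q, edge[(p, s)] + loop + edge[(s, q)])
        for key in [k for k in edge if s in k]:
            del edge[key]
    return edge.get((init, fin))


# DFA over {a, b} accepting words with an even number of a's.
delta = {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}
r1 = eliminate([0, 1], 0, {0}, delta, order=[1, 0])
r2 = eliminate([0, 1], 0, {0}, delta, order=[0, 1])
print(r1)
print(len(r1), len(r2))
```

Both orderings yield expressions for the same language, but eliminating state 1 first gives the shorter one here, which is the effect that a good elimination ordering is meant to exploit.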
Succinctness of pattern-based schema languages for XML - In Database Programming Languages 2007, LNCS 4797
Cited by 6 (2 self)
Martens et al. defined a pattern-based specification language equivalent in expressive power to the widely adopted XML Schema definitions (XSDs). This language consists of rules of the form (r, s) where r and s are regular expressions and can be seen as a type-free extension of DTDs with vertical regular expressions. Sets of such rules can be interpreted in either an existential or a universal way. In the present paper, we study the succinctness of both semantics w.r.t. each other and w.r.t. the common abstraction of XSDs in terms of single-type extended DTDs. The investigation is carried out relative to three kinds of vertical pattern languages: regular, linear, and strongly linear patterns. We also consider the complexity of the simplification problem for each of the considered pattern-based schemas.
Short Regular Expressions from Finite Automata: Empirical Results - CIAA 2009. LNCS, 2009
Cited by 5 (2 self)
We continue our work [H. Gruber, M. Holzer: Provably shorter regular expressions from deterministic finite automata (extended abstract). In Proc. DLT, LNCS 5257, 2008] on the problem of finding good elimination orderings for the state elimination algorithm, one of the most popular algorithms for the conversion of finite automata into equivalent regular expressions. Here we tackle this problem both from the theoretical and from the practical side. First we show that the problem of finding optimal elimination orderings can be used to estimate the cycle rank of the underlying automata. This gives good evidence that the problem under consideration is difficult, to a certain extent. Moreover, we conduct experiments on a large set of carefully chosen instances for five different strategies to choose elimination orderings, which are known from the literature. Perhaps the most surprising result is that a simple greedy heuristic by [M. Delgado, J. Morais: Approximation to the smallest regular expression for a given regular language. In Proc. CIAA, LNCS 3317, 2004] almost always outperforms all other strategies, including those with a provable performance guarantee.