Results 1–10 of 15
XTRACT: A System for Extracting Document Type Descriptors from XML Documents. Bell Labs Tech. Memorandum, 1999
"... XML is rapidly emerging as the new standard for data representation and exchange on the Web. An XML document can be accompanied by a Document Type Descriptor (DTD) which plays the role of a schema for an XML data collection. DTDs contain valuable information on the structure of documents and thus ha ..."
Abstract

Cited by 101 (4 self)
 Add to MetaCart
XML is rapidly emerging as the new standard for data representation and exchange on the Web. An XML document can be accompanied by a Document Type Descriptor (DTD), which plays the role of a schema for an XML data collection. DTDs contain valuable information on the structure of documents and thus have a crucial role in the efficient storage of XML data, as well as the effective formulation and optimization of XML queries. In this paper, we propose XTRACT, a novel system for inferring a DTD schema for a database of XML documents. Since the DTD syntax incorporates the full expressive power of regular expressions, naive approaches typically fail to produce concise and intuitive DTDs. Instead, the XTRACT inference algorithms employ a sequence of sophisticated steps that involve: (1) finding patterns in the input sequences and replacing them with regular expressions to generate "general" candidate DTDs, (2) factoring candidate DTDs using adaptations of algorithms from the logic optimization literature, and (3) applying the Minimum Description Length (MDL) principle to find the best DTD among the candidates. The results of our experiments with real-life and synthetic DTDs demonstrate the effectiveness of XTRACT's approach in inferring concise and semantically meaningful DTD schemas for XML databases.
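The MDL selection step described in this abstract can be sketched in a few lines. Everything below (the `mdl_score` function, its operator-counting data cost, and the toy candidate content models) is an illustrative assumption of mine, not XTRACT's actual encoding scheme:

```python
import math
import re

def mdl_score(candidate: str, sequences: list[str]) -> float:
    """Toy MDL score for a candidate DTD content model, given as a regex.
    Theory cost: one unit per regex character. Data cost: a rough log2
    bit count for resolving each choice operator per covered sequence.
    Illustrative stand-in only; the real encoding is more elaborate."""
    pattern = re.compile(candidate)
    theory = len(candidate)
    # each *, ?, | is a choice point the data encoding must resolve
    choices = sum(candidate.count(op) for op in "*?|")
    data = 0.0
    for seq in sequences:
        if not pattern.fullmatch(seq):
            return math.inf  # a candidate DTD must cover every document
        data += choices * math.log2(len(seq) + 1)
    return theory + data

# child-element sequences observed under some element (one letter per tag)
observed = ["ab", "abb", "abbb"]
candidates = ["ab*", "(a|b)*", "ab|abb|abbb"]
best = min(candidates, key=lambda c: mdl_score(c, observed))
```

Even this crude score captures the trade-off the paper describes: the overly general `(a|b)*` pays a high data cost, the exhaustive disjunction pays a high theory cost, and the concise `ab*` wins.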
An MDL Method for Finding Haplotype Blocks and for Estimating the Strength of Haplotype Block Boundaries, 2003
Discovering Patterns and Subfamilies in Biosequences
In Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, 1996
"... We consider the problem of automatic discovery of patterns and the corresponding subfamilies in a set of biosequences. The sequences are unaligned and may contain noise of unknown level. The patterns are of the type used in PROSITE database. In our approach we discover patterns and the respective su ..."
Abstract

Cited by 25 (6 self)
 Add to MetaCart
We consider the problem of automatic discovery of patterns and the corresponding subfamilies in a set of biosequences. The sequences are unaligned and may contain noise of unknown level. The patterns are of the type used in the PROSITE database. In our approach we discover the patterns and the respective subfamilies simultaneously. We develop a theoretically substantiated significance measure for a set of such patterns and an algorithm approximating the best pattern set and the subfamilies. The approach is based on the minimum description length (MDL) principle. We report a computing experiment that correctly finds subfamilies in the family of chromo domains and reveals new strong patterns.

Keywords: pattern discovery, sequence motifs, machine learning, protein subfamilies, PROSITE, clustering, algorithms, Bayesian inference, MDL principle

Introduction. The problem that we are considering in this paper is, given a set of biosequences, to find simultaneously subsets sharing interesting, biologicall...
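To make the setting concrete, here is a minimal sketch of matching PROSITE-style patterns and grouping sequences by them. The `prosite_to_regex` helper handles only a tiny subset of PROSITE syntax, and the first-match grouping is a crude hypothetical stand-in for the paper's simultaneous MDL-driven discovery, not its algorithm:

```python
import re

def prosite_to_regex(pat: str) -> str:
    """Convert a tiny subset of PROSITE pattern syntax to a Python regex.
    Handles fixed residues, 'x' wildcards, x(n) repeats, and [..] choices.
    Real PROSITE syntax is richer; this is illustrative only."""
    out = []
    for elem in pat.split("-"):
        m = re.fullmatch(r"x\((\d+)\)", elem)
        if m:
            out.append("." * int(m.group(1)))   # x(n): n arbitrary residues
        elif elem == "x":
            out.append(".")                      # single arbitrary residue
        else:
            out.append(elem)                     # literal residue or [AB] choice
    return "".join(out)

def subfamilies(seqs: list[str], patterns: list[str]) -> dict[str, list[str]]:
    """Assign each sequence to the first pattern it contains -- a crude
    stand-in for discovering patterns and subfamilies simultaneously."""
    groups = {p: [] for p in patterns}
    for s in seqs:
        for p in patterns:
            if re.search(prosite_to_regex(p), s):
                groups[p].append(s)
                break
    return groups
```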
REtree: an efficient index structure for regular expressions
Bell Laboratories Tech. Memorandum, 2002
"... Abstract. Due to their expressive power, regular expressions (REs) are quickly becoming an integral part of language specifications for several important application scenarios. Many of these applications have to manage huge databases of RE specifications and need to provide an effective matching mec ..."
Abstract

Cited by 23 (0 self)
 Add to MetaCart
Due to their expressive power, regular expressions (REs) are quickly becoming an integral part of language specifications for several important application scenarios. Many of these applications have to manage huge databases of RE specifications and need to provide an effective matching mechanism that, given an input string, quickly identifies the REs in the database that match it. In this paper, we propose the REtree, a novel index structure for large databases of RE specifications. Given an input query string, the REtree speeds up the retrieval of matching REs by focusing the search and comparing the input string with only a small fraction of the REs in the database. Even though the REtree is similar in spirit to other tree-based structures that have been proposed for indexing multidimensional data, RE indexing is significantly more challenging since REs typically represent infinite sets of strings with no well-defined notion of spatial locality. To address these new challenges, our REtree index structure relies on novel measures for comparing the relative sizes of infinite regular languages. We also propose innovative solutions for the various REtree operations, including the effective splitting of REtree nodes and computing a "tight" bounding RE for a collection of REs. Finally, we demonstrate how sampling-based approximation algorithms can be used to significantly speed up the performance of REtree operations. Preliminary experimental results with moderately large synthetic data sets indicate that the REtree is effective in pruning the search space and easily outperforms naive sequential search approaches.
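The need to compare the "sizes" of infinite regular languages can be approximated very naively by counting the strings a regex accepts up to a length cutoff. The sketch below (function name, alphabet, and cutoff are my own illustration, not the paper's measure) shows how such a truncated count already induces a useful ordering on REs:

```python
import itertools
import re

def language_size(regex: str, alphabet: str = "ab", max_len: int = 6) -> int:
    """Proxy for the 'size' of a possibly infinite regular language:
    count the strings of length <= max_len that the RE accepts.
    The REtree paper uses more principled measures; this truncated
    count is only an illustration of the comparison problem."""
    pat = re.compile(regex)
    count = 0
    for n in range(max_len + 1):
        for tup in itertools.product(alphabet, repeat=n):
            if pat.fullmatch("".join(tup)):
                count += 1
    return count
```

Under this proxy, `a*` (7 accepted strings up to length 6 over {a, b}) compares as much "smaller" than `(a|b)*` (127 strings), matching the intuition that a tighter bounding RE should denote a smaller language.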
Ordinal Mind Change Complexity of Language Identification
"... The approach of ordinal mind change complexity, introduced by Freivalds and Smith, uses (notations for) constructive ordinals to bound the number of mind changes made by a learning machine. This approach provides a measure of the extent to which a learning machine has to keep revising its estimate o ..."
Abstract

Cited by 18 (6 self)
 Add to MetaCart
The approach of ordinal mind change complexity, introduced by Freivalds and Smith, uses (notations for) constructive ordinals to bound the number of mind changes made by a learning machine. This approach provides a measure of the extent to which a learning machine has to keep revising its estimate of the number of mind changes it will make before converging to a correct hypothesis for languages in the class being learned. Recently, this notion, which also yields a measure for the difficulty of learning a class of languages, has been used to analyze the learnability of rich concept classes. The present paper further investigates the utility of ordinal mind change complexity. It is shown that for identification from both positive and negative data and n ≥ 1, the ordinal mind change complexity of the class of languages formed by unions of up to n + 1 pattern languages is only ω ×_O notn(n) (where notn(n) is a notation for n, ω is a notation for the least limit ordinal, and ×_O represents ordinal multiplication of notations). This result nicely extends an observation of Lange and Zeugmann...
Discovering Unbounded Unions of Regular Pattern Languages from Positive Examples (Extended Abstract)
In Proceedings of the 7th International Symposium on Algorithms and Computation (ISAAC)
"... ) Alvis Br¯azma 1 Esko Ukkonen 2 Jaak Vilo 2 1 Institute of Mathematics and Computer Science, University of Latvia 29 Rainis Bulevard, LV1459 Riga, Latvia abra@cclu.lv 2 Department of Computer Science, University of Helsinki P.O.Box 26, FIN00014 University of Helsinki, Finland ukkonen,vilo ..."
Abstract

Cited by 8 (3 self)
 Add to MetaCart
Alvis Brāzma (Institute of Mathematics and Computer Science, University of Latvia, 29 Rainis Bulevard, LV-1459 Riga, Latvia; abra@cclu.lv), Esko Ukkonen and Jaak Vilo (Department of Computer Science, University of Helsinki, P.O. Box 26, FIN-00014 University of Helsinki, Finland; {ukkonen,vilo}@cs.helsinki.fi)

The problem of learning unions of certain pattern languages from positive examples is considered. We restrict attention to the regular patterns, i.e., patterns in which each variable symbol can appear only once, and to the substring patterns, a subclass of the regular patterns of the form xαy, where x and y are variables and α is a string of constant symbols. We present an algorithm that, given a set of strings, finds a good collection of patterns covering this set. The notion of a 'good covering' is defined as the most probable collection of patterns likely to be present in the examples, assuming a simple probabilistic model, or equivalently using the Minimum Description...
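A minimal sketch of covering a string set with substring patterns xαy: enumerate candidate substrings α occurring in the examples and greedily pick the pattern that covers the most remaining strings. This greedy set cover is a hypothetical stand-in for the paper's probabilistic/MDL covering criterion, and the fixed substring length `k` is my own simplification:

```python
def substring_candidates(seqs: list[str], k: int) -> set[str]:
    """All length-k substrings occurring in the examples; each substring
    alpha stands for the pattern x-alpha-y ('contains alpha')."""
    cands = set()
    for s in seqs:
        for i in range(len(s) - k + 1):
            cands.add(s[i:i + k])
    return cands

def greedy_cover(seqs: list[str], k: int = 3) -> list[str]:
    """Greedy approximation of a small pattern collection covering all
    examples; a stand-in for the paper's 'good covering' criterion."""
    uncovered = set(seqs)
    chosen = []
    cands = substring_candidates(seqs, k)
    while uncovered:
        # pick the substring pattern matching the most uncovered strings
        best = max(cands, key=lambda a: sum(1 for s in uncovered if a in s))
        hit = {s for s in uncovered if best in s}
        if not hit:
            break  # remaining strings are shorter than k; leave uncovered
        chosen.append(best)
        uncovered -= hit
    return chosen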
Parsimony Hierarchies for Inductive Inference
 Journal of Symbolic Logic
"... Freivalds defined an acceptable programming system independent criterion for learning programs for functions in which the final programs were required to be both correct and "nearly" minimal size, i.e, within a computable function of being purely minimal size. Kinber showed that this parsimony requi ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
Freivalds defined an acceptable-programming-system-independent criterion for learning programs for functions in which the final programs were required to be both correct and "nearly" minimal size, i.e., within a computable function of being purely minimal size. Kinber showed that this parsimony requirement on final programs limits learning power. However, in scientific inference, parsimony is considered highly desirable. A lim-computable function is (by definition) one calculable by a total procedure allowed to change its mind finitely many times about its output. Investigated is the possibility of assuaging somewhat the limitation on learning power resulting from requiring parsimonious final programs by use of criteria which require the final, correct programs to be "not-so-nearly" minimal size, e.g., to be within a lim-computable function of actual minimal size. It is shown that some parsimony in the final program is thereby retained, yet learning power strictly increases. Considered, then, are lim-computable functions as above but for which notations for constructive ordinals are used to bound the number of mind changes allowed regarding the output. This is a variant of an idea introduced by Freivalds and Smith. For this ordinal-notation-complexity-bounded version of lim-computability, the resultant learning criteria form finely graded, infinitely ramifying, infinite hierarchies intermediate between the computable and the lim-computable cases. Some of these hierarchies, for the natural notations determining them, are shown to be optimally tight.
GraSS: Graph Structure Summarization
"... Large graph databases are commonly collected and analyzed in numerous domains. For reasons related to either space efficiency or for privacy protection (e.g., in the case of social network graphs), it sometimes makes sense to replace the original graph with a summary, which removes certain details a ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
Large graph databases are commonly collected and analyzed in numerous domains. For reasons of either space efficiency or privacy protection (e.g., in the case of social network graphs), it sometimes makes sense to replace the original graph with a summary, which removes certain details about the original graph topology. However, this summarization process leaves the database owner with the challenge of processing queries that are expressed in terms of the original graph but must be answered using the summary. In this paper, we propose a formal semantics for answering queries on summaries of graph structures. At its core, our formulation is based on a random-worlds model. We show that important graph-structure queries (e.g., adjacency, degree, and eigenvector centrality) can be answered efficiently and in closed form under these semantics. Further, based on this approach to query answering, we formulate three novel graph partitioning/compression problems. We develop algorithms for finding a graph summary that least affects the accuracy of query results, and we evaluate our proposed algorithms using both real and synthetic data.
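The random-worlds semantics for adjacency queries can be sketched directly: a summary stores, for each pair of supernodes, the number of edges between them, and the expected adjacency of two original nodes is that count divided by the number of possible edges between their supernodes. The function names and dict-based representation below are my own simplification, not the paper's formulation:

```python
def summarize(edges, partition):
    """Build a toy graph summary: partition maps supernode -> member
    nodes; the summary records, per supernode pair, the edge count.
    Queries are then answered in expectation over the random worlds
    (original graphs) consistent with these counts."""
    block = {v: b for b, members in partition.items() for v in members}
    counts = {}
    for u, v in edges:
        key = tuple(sorted((block[u], block[v])))
        counts[key] = counts.get(key, 0) + 1
    return block, counts

def expected_adjacency(u, v, block, counts, partition):
    """P(edge u-v) = observed edges between the supernodes of u and v,
    divided by the number of possible node pairs between them."""
    a, b = block[u], block[v]
    e = counts.get(tuple(sorted((a, b))), 0)
    if a == b:
        n = len(partition[a])
        possible = n * (n - 1) // 2   # unordered pairs within a supernode
    else:
        possible = len(partition[a]) * len(partition[b])
    return e / possible if possible else 0.0
```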
DTD Inference from XML Documents: The XTRACT Approach
"... XML is rapidly emerging as the new standard for data representation and exchange on the Web. Document Type Descriptors (DTDs) contain valuable information on the structure of XML documents and thus have a crucial role in the efficient storage and querying of XML data. Despite their importance, howev ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
XML is rapidly emerging as the new standard for data representation and exchange on the Web. Document Type Descriptors (DTDs) contain valuable information on the structure of XML documents and thus have a crucial role in the efficient storage and querying of XML data. Despite their importance, however, DTDs are not mandatory, and it is quite possible for documents in XML databases not to have accompanying DTDs. In this paper, we present an overview of XTRACT, a novel system for inferring a DTD schema for a database of XML documents. Since the DTD syntax incorporates the full expressive power of regular expressions, naive approaches typically fail to produce concise and intuitive DTDs. Instead, the XTRACT inference algorithms employ a sequence of sophisticated steps that involve: (1) finding patterns in the input sequences and replacing them with regular expressions to generate "general" candidate DTDs, (2) factoring candidate DTDs using adaptations of algorithms from the logic optimization literature, and (3) applying the Minimum Description Length (MDL) principle to find the best DTD among the candidates.