XTRACT: A System for Extracting Document Type Descriptors from XML Documents
In ACM SIGMOD, 2000
Cited by 85 (4 self)
XML is rapidly emerging as the new standard for data representation and exchange on the Web. An XML document can be accompanied by a Document Type Descriptor (DTD) which plays the role of a schema for an XML data collection. DTDs contain valuable information on the structure of documents and thus have a crucial role in the efficient storage of XML data, as well as the effective formulation and optimization of XML queries. In this paper, we propose XTRACT, a novel system for inferring a DTD schema for a database of XML documents. Since the DTD syntax incorporates the full expressive power of regular expressions, naive approaches typically fail to produce concise and intuitive DTDs. Instead, the XTRACT inference algorithms employ a sequence of sophisticated steps that involve: (1) finding patterns in the input sequences and replacing them with regular expressions to generate "general" candidate DTDs, (2) factoring candidate DTDs using adaptations of algorithms from the logic optimization...
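As a rough illustration of step (1) in the abstract above (replacing repeated patterns in an input sequence with a regular expression), here is a toy sketch. It is not the XTRACT algorithm: the function name `generalize`, the `min_repeat` threshold, and the rule of collapsing only consecutive runs of one element name are all assumptions made for this example. The output is written in the style of a DTD content model.

```python
from itertools import groupby


def generalize(children, min_repeat=2):
    """Collapse consecutive runs of one child element name into `name*`.

    `children` is the sequence of child element names seen under one parent.
    `min_repeat` (hypothetical knob) is the shortest run that gets generalized.
    """
    parts = []
    for name, run in groupby(children):
        count = len(list(run))
        if count >= min_repeat:
            parts.append(f"{name}*")       # generalize the whole run
        else:
            parts.extend([name] * count)   # keep short runs literally
    return "(" + ", ".join(parts) + ")"


print(generalize(["title", "author", "author", "author", "year"]))
```

A real inference system would also have to merge candidates across many documents and choose among them, which the abstract attributes to later steps; this sketch covers only the single-sequence generalization idea.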
Discovering Patterns and Subfamilies in Biosequences
In Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, 1996
Cited by 24 (6 self)
We consider the problem of automatically discovering patterns and the corresponding subfamilies in a set of biosequences. The sequences are unaligned and may contain noise of unknown level. The patterns are of the type used in the PROSITE database. In our approach we discover patterns and the respective subfamilies simultaneously. We develop a theoretically substantiated significance measure for a set of such patterns and an algorithm approximating the best pattern set and the subfamilies. The approach is based on the minimum description length (MDL) principle. We report a computational experiment that correctly finds subfamilies in the family of chromo domains and reveals new strong patterns.
Keywords: pattern discovery, sequence motifs, machine learning, protein subfamilies, PROSITE, clustering, algorithms, Bayesian inference, MDL principle
Introduction. The problem that we are considering in this paper is, given a set of biosequences, to find simultaneously subsets sharing interesting, biologicall...
An MDL Method for Finding Haplotype Blocks and for Estimating the Strength of Haplotype Block Boundaries
2003
Re-tree: an efficient index structure for regular expressions
Bell Laboratories Tech. Memorandum, 2002
Cited by 19 (0 self)
Due to their expressive power, regular expressions (REs) are quickly becoming an integral part of language specifications for several important application scenarios. Many of these applications have to manage huge databases of RE specifications and need to provide an effective matching mechanism that, given an input string, quickly identifies the REs in the database that match it. In this paper, we propose the RE-tree, a novel index structure for large databases of RE specifications. Given an input query string, the RE-tree speeds up the retrieval of matching REs by focusing the search and comparing the input string with only a small fraction of REs in the database. Even though the RE-tree is similar in spirit to other tree-based structures that have been proposed for indexing multidimensional data, RE indexing is significantly more challenging since REs typically represent infinite sets of strings with no well-defined notion of spatial locality. To address these new challenges, our RE-tree index structure relies on novel measures for comparing the relative sizes of infinite regular languages. We also propose innovative solutions for the various RE-tree operations including the effective splitting of RE-tree nodes and computing a "tight" bounding RE for a collection of REs. Finally, we demonstrate how sampling-based approximation algorithms can be used to significantly speed up the performance of RE-tree operations. Preliminary experimental results with moderately large synthetic data sets indicate that the RE-tree is effective in pruning the search space and easily outperforms naive sequential search approaches.
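For context, the naive sequential search that the abstract uses as its baseline can be sketched in a few lines. This is only the baseline, not the RE-tree itself. The function name `matching_res` is hypothetical, and Python's `re` syntax is assumed as a stand-in for the paper's RE specifications.

```python
import re


def matching_res(patterns, query):
    """Return every pattern whose language contains `query`.

    Naive baseline: the query is compared against each RE in turn,
    so the cost grows linearly with the size of the RE database.
    """
    return [p for p in patterns if re.fullmatch(p, query) is not None]


database = ["a*b", "(ab)+", "a.c"]
print(matching_res(database, "ab"))
```

An index such as the one the abstract proposes aims to avoid most of these per-pattern comparisons by pruning groups of REs that cannot match.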
Discovering Unbounded Unions of Regular Pattern Languages from Positive Examples (Extended Abstract)
In Proceedings of the 7th International Symposium on Algorithms and Computation (ISAAC)
Cited by 8 (3 self)
Alvis Brāzma (1), Esko Ukkonen (2), Jaak Vilo (2)
(1) Institute of Mathematics and Computer Science, University of Latvia, 29 Rainis Bulevard, LV-1459 Riga, Latvia, abra@cclu.lv
(2) Department of Computer Science, University of Helsinki, P.O. Box 26, FIN-00014 University of Helsinki, Finland, ukkonen,vilo@cs.helsinki.fi
Abstract. The problem of learning unions of certain pattern languages from positive examples is considered. We restrict ourselves to regular patterns, i.e., patterns where each variable symbol can appear only once, and to substring patterns, a subclass of regular patterns of the form xαy, where x and y are variables and α is a string of constant symbols. We present an algorithm that, given a set of strings, finds a good collection of patterns covering this set. The notion of a "good covering" is defined as the most probable collection of patterns likely to be present in the examples, assuming a simple probabilistic model, or equivalently using the Minimum Description...
Parsimony Hierarchies for Inductive Inference
Journal of Symbolic Logic
Cited by 2 (1 self)
Freivalds defined an acceptable programming system independent criterion for learning programs for functions in which the final programs were required to be both correct and "nearly" minimal size, i.e., within a computable function of being purely minimal size. Kinber showed that this parsimony requirement on final programs limits learning power. However, in scientific inference, parsimony is considered highly desirable. A lim-computable function is (by definition) one calculable by a total procedure allowed to change its mind finitely many times about its output. Investigated is the possibility of assuaging somewhat the limitation on learning power resulting from requiring parsimonious final programs by use of criteria which require the final, correct programs to be "not-so-nearly" minimal size, e.g., to be within a lim-computable function of actual minimal size. It is shown that some parsimony in the final program is thereby retained, yet learning power strictly increases. Considered, then, are lim-computable functions as above but for which notations for constructive ordinals are used to bound the number of mind changes allowed regarding the output. This is a variant of an idea introduced by Freivalds and Smith. For this ordinal notation complexity bounded version of lim-computability, the power of the resultant learning criteria forms finely graded, infinitely ramifying, infinite hierarchies intermediate between the computable and the lim-computable cases. Some of these hierarchies, for the natural notations determining them, are shown to be optimally tight.
GraSS: Graph Structure Summarization
Cited by 1 (0 self)
Large graph databases are commonly collected and analyzed in numerous domains. For reasons of either space efficiency or privacy protection (e.g., in the case of social network graphs), it sometimes makes sense to replace the original graph with a summary, which removes certain details about the original graph topology. However, this summarization process leaves the database owner with the challenge of processing queries that are expressed in terms of the original graph, but are answered using the summary. In this paper, we propose a formal semantics for answering queries on summaries of graph structures. At its core, our formulation is based on a random worlds model. We show that important graph-structure queries (e.g., adjacency, degree, and eigenvector centrality) can be answered efficiently and in closed form using these semantics. Further, based on this approach to query answering, we formulate three novel graph partitioning/compression problems. We develop algorithms for finding a graph summary that least affects the accuracy of query results, and we evaluate our proposed algorithms using both real and synthetic data.
AN MDL METHOD FOR FINDING HAPLOTYPE BLOCKS AND FOR ESTIMATING THE STRENGTH OF HAPLOTYPE
We describe a new method for finding haplotype blocks based on the use of the minimum description length principle. We give a rigorous definition of the quality of a segmentation of a genomic region into blocks, and describe a dynamic programming algorithm for finding the optimal segmentation with respect to this measure. We also describe a method for finding the probability of a block boundary for each pair of adjacent markers: this gives a tool for evaluating the significance of each block boundary. We have applied the method to the published data of Daly et al. [1]. The results are in relatively good agreement with the published results, but also show clear differences in the predicted block boundaries and their strengths. We also give results on the block structure in population isolates.
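The dynamic programming idea mentioned in the abstract can be sketched generically: if each candidate block has a cost and the total cost of a segmentation is the sum of its block costs, the optimal segmentation follows from a standard prefix recurrence. The sketch below assumes exactly that additive form. The names `best_segmentation` and `toy_cost` are hypothetical, and `toy_cost` is an invented stand-in, not the MDL measure defined in the paper.

```python
def best_segmentation(n, cost):
    """Split markers 0..n-1 into contiguous blocks with minimum total cost.

    `cost(i, j)` is the cost of one block covering markers i..j-1.
    Returns (total_cost, blocks), where blocks is a list of (start, end) pairs.
    """
    best = [0.0] + [float("inf")] * n   # best[j]: optimal cost of markers 0..j-1
    back = [0] * (n + 1)                # back[j]: start of the last block in that optimum
    for j in range(1, n + 1):
        for i in range(j):
            candidate = best[i] + cost(i, j)
            if candidate < best[j]:
                best[j], back[j] = candidate, i
    blocks, j = [], n
    while j > 0:                        # walk the back-pointers from the end
        blocks.append((back[j], j))
        j = back[j]
    return best[n], blocks[::-1]


# Toy cost (assumption): a fixed charge per block plus a penalty for mixed blocks.
data = "aaabbb"


def toy_cost(i, j):
    return 1 + (len(set(data[i:j])) - 1) * (j - i)


total, blocks = best_segmentation(len(data), toy_cost)
print(total, blocks)
```

The recurrence evaluates every (start, end) pair once, so it makes a quadratic number of cost evaluations in the number of markers.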
On the Equivalence Problem for E-Pattern Languages (Extended Abstract)
In Theoretical Computer Science, 1996
Enno Ohlebusch (1) and Esko Ukkonen (2)
(1) Technische Fakultät, University of Bielefeld, P.O. Box 100131, 33501 Bielefeld, Germany, email: enno@TechFak.Uni-Bielefeld.DE
(2) Department of Computer Science, University of Helsinki, P.O. Box 26, FIN-00014 Helsinki, Finland, email: Esko.Ukkonen@cs.Helsinki.FI
Abstract. On the one hand, the inclusion problem for nonerasing and erasing pattern languages is undecidable; see [JSSY95]. On the other hand, the language equivalence problem for NE-pattern languages is trivially decidable (see [Ang80a]) but the question of whether the same holds for E-pattern languages is still open. It has been conjectured by Jiang et al. [JSSY95] that the language equivalence problem for E-pattern languages is also decidable. In this paper, we introduce a new normal form for patterns and show, using the normal form, that the language equivalence problem for E-pattern languages is decidable in many special cases. We conjecture that our normal form procedure decides ...

