XMill: an Efficient Compressor for XML Data
, 1999
Cited by 165 (0 self)
We describe a tool for compressing XML data, with applications in data exchange and archiving, which usually achieves about twice the compression ratio of gzip at roughly the same speed. The compressor, called XMill, incorporates and combines existing compressors in order to apply them to heterogeneous XML data: it uses zlib, the library function for gzip, a collection of datatype specific compressors for simple data types, and, possibly, user defined compressors for application specific data types. 1 Introduction We have implemented a compressor/decompressor for XML data, to be used in data exchange and archiving, that achieves about twice the compression rate of general-purpose compressors (gzip), at about the same speed. The tool can be downloaded from www.research.att.com/sw/tools/xmill/. XML is now being adopted by many organizations and industry groups, like the healthcare, banking, chemical, and telecommunications industries. The attraction in XML is that it is a self-describi...
Processing XML Streams with deterministic automata
, 2003
Cited by 107 (3 self)
We consider the problem of evaluating a large number of XPath expressions on an XML stream. Our main contribution consists in showing that Deterministic Finite Automata (DFA) can be used effectively for this problem: in our experiments we achieve a throughput of about 5.4 MB/s, independent of the number of XPath expressions (up to 1,000,000 in our tests). The major problem we face is that of the size of the DFA. Since the number of states grows exponentially with the number of XPath expressions, it was previously believed that DFAs cannot be used to process large sets of expressions. We make a theoretical analysis of the number of states in the DFA resulting from XPath expressions, and consider both the case when it is constructed eagerly, and when it is constructed lazily. Our analysis indicates that, when the automaton is constructed lazily, and under certain assumptions about the structure of the input XML data, the number of states in the lazy DFA is manageable. We also validate our findings experimentally, on both synthetic and real XML data sets.
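The lazy construction mentioned in this abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a simplified path language (only `/`, `//`, tag names, and `*`), and the names `parse_path`, `LazyDfa`, and `run` are hypothetical. DFA states are sets of NFA states, and a transition is computed only the first time the stream actually needs it, then cached.

```python
from typing import Dict, FrozenSet, List, Tuple

Step = Tuple[bool, str]      # (uses the '//' axis?, tag name or "*")
NfaState = Tuple[int, int]   # (expression index, number of steps matched)
DfaState = FrozenSet[NfaState]


def parse_path(path: str) -> List[Step]:
    """Parse a linear path such as '/a//b/*' (must start with '/')."""
    steps: List[Step] = []
    i = 0
    while i < len(path):
        descendant = path.startswith("//", i)
        i += 2 if descendant else 1
        j = path.find("/", i)
        j = len(path) if j == -1 else j
        steps.append((descendant, path[i:j]))
        i = j
    return steps


class LazyDfa:
    """DFA over element tags whose transitions are built on demand."""

    def __init__(self, paths: List[str]) -> None:
        self.exprs = [parse_path(p) for p in paths]
        self.start: DfaState = frozenset((i, 0) for i in range(len(self.exprs)))
        # Transition table, filled lazily: (state, tag) -> state.
        self.table: Dict[Tuple[DfaState, str], DfaState] = {}

    def step(self, state: DfaState, tag: str) -> DfaState:
        key = (state, tag)
        if key not in self.table:
            nxt = set()
            for i, pos in state:
                steps = self.exprs[i]
                if pos == len(steps):
                    continue  # fully matched; children are not matches
                descendant, name = steps[pos]
                if descendant:
                    nxt.add((i, pos))      # '//' may skip this element
                if name in ("*", tag):
                    nxt.add((i, pos + 1))  # this element satisfies the step
            self.table[key] = frozenset(nxt)
        return self.table[key]

    def matches(self, state: DfaState) -> List[int]:
        """Indices of the expressions that select the current element."""
        return sorted(i for i, pos in state if pos == len(self.exprs[i]))


def run(dfa: LazyDfa, events: List[Tuple[str, str]]) -> List[Tuple[str, List[int]]]:
    """Feed ('start', tag) / ('end', tag) events; report matches per element."""
    stack = [dfa.start]
    out = []
    for kind, tag in events:
        if kind == "start":
            stack.append(dfa.step(stack[-1], tag))
            hits = dfa.matches(stack[-1])
            if hits:
                out.append((tag, hits))
        else:
            stack.pop()
    return out
```

In this sketch the table only ever holds transitions for tag sequences that occur in the input, which is the intuition behind the lazy DFA staying small on regularly structured data.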
Indexing and Retrieval for Genomic Databases
- IEEE Transactions on Knowledge and Data Engineering
, 2002
Cited by 40 (6 self)
Genomic sequence databases are widely used by molecular biologists for homology searching. Amino-acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationally intensive local alignments on selected sequences only and to reduce the costs of the alignments that are attempted. We present an index-based approach for both selecting sequences that display broad similarity to a query and for fast local alignment. We show experimentally that the indexed approach results in significant savings in computationally intensive local alignments, and that index-based searching is as accurate as existing exhaustive search schemes.
Horizontal Query Optimization on Ordered Semistructured Data
, 1999
Cited by 12 (0 self)
The exchange and storage of XML data is becoming increasingly important. In contrast to conventional semistructured data [4, 1], the labels in a document-oriented representation such as XML are ordered. Furthermore, regular expressions (DTDs) describe the horizontal (and vertical) structure of the data. Traditional query languages for semi-structured data ignore the horizontal order and are therefore limited in their expressiveness and optimizability. We describe a query language for querying ordered semistructured data. This query language provides primitives for specifying more powerful queries on ordered semistructured data. Furthermore, we describe how horizontal type information in DTDs is used to optimize queries based on finite automata.
A Query Interface for Heterogenous Biological Data Sources
- University of Pennsylvania
, 1994
Cited by 6 (5 self)
This report presents our preliminary effort in connecting Kleisli and CPL to these systems.
Database Techniques for Biological Materials and Methods
- In 1st Inter'l Conf. on Intelligent Systems for Molecular Biology
, 1993
Cited by 5 (3 self)
The biological sciences produce an enormous research literature every year. Research papers are highly structured documents whose content is not captured using the traditional techniques of information retrieval: keywords and flat text. This is especially true of the Materials & Methods section of experimental papers. A great deal of highly structured information is packed into this section. It involves logical and temporal sequences of operations that combine and operate on materials using various instruments and depending on many parameters. We are designing and implementing databases that will allow this complex knowledge to be represented, stored in object-oriented databases and retrieved. We are developing an application of this technology called the Laboratory Notebook. This application is a software system that will contain personal laboratory information as well as have access to databases of Materials & Methods sections drawn from the literature. 1 Introduction. Biology is a very large and diverse field. The primary output of the enterprise is its published research literature, which consists of about 600,000 papers every
Sequence Comparison Using a Relational Database Approach
- In Proceedings of the International Database Engineering and Applications Symposium (IDEAS)
, 1997
Cited by 1 (1 self)
A variety of heterogenous data sources is available in the field of molecular biology. Our focus lies on the biological sequence data, i.e. data maintained in collections like EMBL or SWISS-PROT. We propose a relational model based on an Entity-Relationship approach for this discourse world. This is the foundation of a flexible architecture useful for a variety of purposes, e.g. for sequence comparison. In this paper we present our first application system within this architecture, the sequence analysis tool NNSAT. The description of this system is accompanied by a short review of the data model and a section on the problem of sequence comparison in which we propose a small modification of the Needleman and Wunsch alignment method. Keywords: Gene and protein sequences, ER model, Relational model, Sequence alignment, Applications on relational systems 1 Introduction The field of molecular biology offers a variety of challenges for computer scientists. Among them are finding suitable...
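For readers unfamiliar with the Needleman and Wunsch method named in this abstract, the following is a minimal sketch of the textbook global alignment algorithm. It does not include the modification the paper proposes, which the abstract does not describe. The function name and the scoring parameters (`match`, `mismatch`, and a linear `gap` penalty) are assumptions chosen for illustration.

```python
from typing import List, Tuple


def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> Tuple[int, str, str]:
    """Return (optimal global alignment score, aligned a, aligned b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two symbols
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a: List[str] = []
    out_b: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

The table takes time and space proportional to the product of the two sequence lengths, which is why exhaustive pairwise alignment against a large sequence collection is expensive.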
Pruning Nested Data Values Using Branch Expressions With Wildcards
Branch expressions are presented as a means of expressing structural queries over nested data contained in data exchange formats. We demonstrate their utility in pruning large data structures by using them to specify a form of parse optimization; and we show that their evaluation can be done in linear time with a constant amount of memory. Wildcards that range over subtrees of a data structure are introduced and a method for eliminating wildcards is described. We then demonstrate how we have embedded branch expressions into a more general system to express a richer class of queries. Finally, optimizations for migrating operations from the general system into the more efficient branch expression system are described. 1 Introduction In the biomedical community, a vast amount of public data continues to be stored, queried, transmitted, and viewed using data exchange formats (e.g. ASN.1, ACE, EMBL, PIR, and PDB). These formats have varying degrees of implicit or explicit syntactic structu...
Building Neural and Logical Networks
- WIRN’99 - The 11-th Italian Workshop on Neural Nets
, 1999
The solution of binary classification problems is obtained by employing a new learning method, called Hamming Clustering (HC). It is able to build in a constructive way a two-layer perceptron with binary weights, which can be easily implemented by means of conventional logical ports.

