Suffix Trees and their Applications in String Algorithms, 1993
Cited by 14 (0 self)
The suffix tree is a compacted trie that stores all suffixes of a given text string. This data structure has been used extensively in pattern matching on strings and trees, with a wide range of applications, such as molecular biology, data processing, text editing, term rewriting, interpreter design, information retrieval, abstract data types and many others. In this paper, we survey some applications of suffix trees and some algorithmic techniques for their construction. Special emphasis is given to the most recent developments in this area, such as parallel algorithms for suffix tree construction and generalizations of suffix trees to higher dimensions, which are important in multidimensional pattern matching. Work partially supported by the ESPRIT BRA ALCOM II under contract no. 7141 and by the Italian MURST Project "Algoritmi, Modelli di Calcolo e Strutture Informative". Part of this work was done while the author was visiting AT&T Bell Laboratories. Email: grossi@di.uni...
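To make the definition in the abstract above concrete, here is a minimal Python sketch of the underlying idea. It is an assumption-laden simplification, not a construction from the surveyed paper: it builds an uncompacted suffix trie (quadratic space) rather than a compacted suffix tree, and the names `SuffixTrie` and `contains` are hypothetical.

```python
class SuffixTrie:
    """Uncompacted trie of all suffixes of a text.

    A real suffix tree merges chains of single-child nodes into one edge
    and can be built in linear time; this naive version uses O(n^2) space
    and is meant only to illustrate the idea.
    """

    def __init__(self, text):
        self.root = {}
        # Insert every suffix text[start:] character by character.
        for start in range(len(text)):
            node = self.root
            for ch in text[start:]:
                node = node.setdefault(ch, {})

    def contains(self, pattern):
        """Return True if pattern occurs somewhere in the text."""
        node = self.root
        for ch in pattern:
            if ch not in node:
                return False
            node = node[ch]
        return True


trie = SuffixTrie("banana")
```

Because every substring of the text is a prefix of some suffix, a pattern lookup walks at most one node per pattern character, regardless of the text length.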
Cache-Conscious Automata for XML Filtering - In ICDE, 2005
Cited by 9 (1 self)
Hardware cache behavior is an important factor in the performance of memory-resident, data-intensive systems such as XML filtering engines. A key data structure in several recent XML filters is the automaton, which is used to represent the long-running XML queries in main memory. In this paper, we study the cache performance of automaton-based XML filtering through analytical modeling and system measurement. Furthermore, we propose a cache-conscious automaton organization technique, called the hot buffer, to improve the locality of automaton state transitions. Our results show that (1) our cache performance model for XML filtering automata is highly accurate and (2) the hot buffer improves the cache performance as well as the overall performance of automaton-based XML filtering.
Effect of Node Size on the Performance of Cache-Conscious B+-Trees - In Proc. of SIGMETRICS, 2003
Cited by 9 (1 self)
In main-memory databases, the number of processor cache misses has a critical impact on the performance of the system. Cache-conscious indices are designed to improve performance by reducing the number of processor cache misses that are incurred during a search operation. Conventional wisdom suggests that the index’s node size should be equal to the cache line size in order to minimize the number of cache misses and improve performance. As we show in this paper, this design choice ignores additional effects, such as the number of instructions executed and the number of TLB misses, which play a significant role in determining the overall performance. To capture the impact of node size on the performance of a cache-conscious B+-tree (CSB+-tree), we first develop an analytical model based on the fundamental components of the search process. This model is then validated with an actual implementation, demonstrating that the model is accurate. Both the analytical model and experiments confirm that using node sizes much larger than the cache line size can result in better search performance for the CSB+-tree.
Generalized hash teams for join and group-by - In Proc. of the 25th VLDB Conference, 1999
Cited by 8 (2 self)
We propose a new class of algorithms that can be used to speed up the execution of multi-way join queries or of queries that involve one or more joins and a group-by. These new evaluation techniques make it possible to perform several hash-based operations (join and grouping) in one pass without repartitioning intermediate results. These techniques work particularly well for joining hierarchical structures, e.g., for evaluating functional join chains along key/foreign-key relationships. The idea is to generalize the concept of hash teams as proposed by Graefe et al. [GBC98] by indirectly partitioning the input data. Indirect partitioning means partitioning the input data on an attribute that is not directly needed for the next hash-based operation, and it involves the construction of bitmaps to approximate the partitioning for the attribute that is needed in the next hash-based operation. Our performance experiments show that such generalized hash teams perform significantly better than conventional strategies for many common classes of decision support queries.
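The indirect partitioning described in the abstract above can be illustrated with a small Python sketch. This is only one plausible reading of the abstract, not the authors' implementation: the schema (orders carrying `orderkey` and `custkey`, line items carrying only `orderkey`), the function name `indirect_partition`, and the parameters `num_partitions` and `bitmap_bits` are all assumptions made for illustration.

```python
def indirect_partition(orders, lineitems, num_partitions=2, bitmap_bits=64):
    """Partition orders on custkey and line items indirectly via bitmaps.

    orders:    list of (orderkey, custkey) tuples
    lineitems: list of orderkey values (no custkey available)

    Hypothetical sketch: each partition keeps a bitmap over hashed
    orderkeys; a line item is routed to every partition whose bitmap has
    the corresponding bit set, so the bitmaps only approximate the
    partitioning and may produce false positives.
    """
    order_parts = [[] for _ in range(num_partitions)]
    bitmaps = [0] * num_partitions
    for orderkey, custkey in orders:
        p = hash(custkey) % num_partitions
        order_parts[p].append((orderkey, custkey))
        # Record which orderkeys ended up in partition p.
        bitmaps[p] |= 1 << (hash(orderkey) % bitmap_bits)

    item_parts = [[] for _ in range(num_partitions)]
    for orderkey in lineitems:
        bit = 1 << (hash(orderkey) % bitmap_bits)
        for p in range(num_partitions):
            if bitmaps[p] & bit:
                item_parts[p].append(orderkey)
    return order_parts, item_parts


order_parts, item_parts = indirect_partition(
    orders=[(1, 10), (2, 11), (3, 10)],
    lineitems=[1, 1, 2, 3],
)
```

In this sketch, line items land in the same partition as the orders they join with, even though they were never partitioned on `custkey` directly. A later grouping on `custkey` could then run partition by partition, at the cost of occasionally duplicating a line item into a partition where it finds no match.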
Overlay Striping and Optimal Parallel I/O for Modern Applications - Parallel Computing, 1998
Cited by 7 (2 self)
Disk array systems are rapidly becoming the secondary-storage media of choice for many emerging applications with large storage and high bandwidth requirements. Striping data across the disks of a disk array introduces significant performance benefits, mainly because the effective transfer rate of the secondary storage is increased by a factor equal to the stripe width. However, the choice of the optimal stripe width is an open problem: no general formal analysis has been reported and intuition alone fails to provide good guidelines. As a result, one occasionally finds contradictory recommendations in the literature. With this work we first contribute an analytical calculation of the optimal stripe width. Second, we recognize that the optimal stripe width is sensitive to the multiprogramming level, which is not known a priori and fluctuates with time. Thus, calculations of the optimal stripe width are, by themselves, of little practical use. For this reason we propose a novel str...
ConceptBase - A Deductive Object Base - Journal of Intelligent Information Systems, Special Issue on Advances in Deductive Object-Oriented Databases, 1993
Cited by 6 (0 self)
Deductive object bases attempt to combine the advantages of deductive relational databases with those of object-oriented models. We review modeling and optimization issues encountered during the development of ConceptBase, a prototype deductive object base supporting the Telos data model. We also report on a number of application experiences in the field of metadata management.
An analytical study of object identifier indexing - In Proceedings of the 9th International Conference on Database and Expert Systems Applications, DEXA’98, 1998
Cited by 6 (2 self)
The object identifier index of an object-oriented database system is typically 20% of the size of the database itself, and for large databases, only a small part of the index fits in main memory. To avoid index retrievals becoming a bottleneck, efficient buffering strategies are needed to minimize the number of disk accesses. In this report, we develop analytical cost models which we use to find optimal sizes of index page buffer and index entry cache, for different memory sizes, index sizes, and access patterns. Because existing buffer hit estimation models are not applicable to index page buffering in the case of tree-based indexes, we have also developed an analytical model for index page buffer performance. The cost gain from using the results in this report is typically on the order of 200-300%. Thus, the results should be valuable in optimizers and tools for configuration and tuning of object-oriented database systems.
METU Object-Oriented DBMS - In Advances in Object-Oriented Database Systems, 1994
Cited by 5 (2 self)
METU Object-Oriented DBMS includes the implementation of a database kernel, an object-oriented SQL-like language and a graphical user interface. Kernel functions are divided between a SQL Interpreter and a C++ compiler. Thus the interpretation of functions is avoided, increasing the efficiency of the system. The functions compiled by C++ are used by the system through the Function Manager. The system is realized on the Exodus Storage Manager (ESM), thus exploiting some of the kernel functions readily provided by ESM. The additional functions provided by the MOOD kernel are the optimization and interpretation of SQL statements, dynamic linking of functions, and catalog management. An original query optimization strategy based on the object-oriented features of the language is developed. For this purpose, formulas for the selectivity of a path expression and for the cost of forward and backward path traversals are derived, and join sizes are estimated. New strategies for ordering the joins and path expressions are also developed. A graphical user interface, namely MoodView, is implemented on the MOOD kernel. MoodView provides the database programmer with tools and functionalities for every phase of OODBMS application development. The current version of MoodView allows a database user to design, browse, and modify database schema interactively. MoodView can automatically generate graphical displays for complex and multimedia database objects, which can be updated through the object browser. Furthermore, a database administration tool, a full-screen text editor, a SQL-based query manager, and a graphical indexing tool for spatial data, i.e., R-trees, are also implemented.
Informed prefetching of collective input/output requests - Proceedings of SC99, 1999
Cited by 5 (0 self)
Optimizing collective input/output (I/O) is important for improving throughput of parallel scientific applications. Current research suggests that a specialized collective application programming interface, coupled with system-level optimizations, is necessary to obtain good I/O performance. Unfortunately, collective interfaces require an application to disclose its entire access pattern to fully reorder I/O requests, and cannot flexibly utilize additional memory to improve performance. In this paper we propose and analyze a method of optimizing collective access patterns using informed prefetching that is capable of exploiting any amount of available memory to overlap I/O with computation. We compare this approach to disk-directed I/O, an efficient implementation of a collective I/O interface. Moreover, we prove that under certain conditions, a per-processor prefetch depth equal to the number of drives can guarantee sequential disk accesses for any collectively accessed file. In empirical studies, a prefetch horizon of one to two times the number of disks per processor is sufficient to match the performance of disk-directed I/O for sequentially allocated files. Finally, we develop accurate analytical models to predict the throughput of informed prefetching for collective reads as a function of the per-processor prefetch depth.
A Selectivity Model for Fragmented Relations in Information Retrieval, 2001
Cited by 4 (4 self)
New application domains cause today's database sizes to grow rapidly, posing great demands on technology. Data fragmentation facilitates techniques (like distribution, parallelization, and main-memory computing) for meeting these demands. Also, fragmentation might help improve the efficient processing of query types such as top N. Database design and query optimization require a good notion of the costs resulting from a certain fragmentation. Our mathematically derived selectivity model facilitates this. Once its two parameters have been computed based on the fragmentation, after each (though usually infrequent) update, our model can forget the data distribution, resulting in fast and quite good selectivity estimation. We show experimental verification for Zipfian distributed IR databases. Keywords: selectivity, fragmentation, Zipf, information retrieval, databases 1 Introduction Efficient and effective processing of large amounts of data is of crucial importance in most computer applications,...

