Multi-resolution bitmap indexes for scientific data
- ACM Transactions on Database Systems
Cited by 11 (1 self)
The unique characteristics of scientific data and queries cause traditional indexing techniques to perform poorly on scientific workloads, occupy excessive space, or both. Refinements of bitmap indexes have been proposed previously as a solution to this problem. In this paper, we describe the difficulties we encountered in deploying bitmap indexes with scientific data and queries from two real-world domains. In particular, previously proposed methods of binning, encoding, and compressing bitmap vectors either were quite slow for processing the large-range query conditions our scientists used, or required excessive storage space. Nor could the indexes easily be built or used on parallel platforms. In this paper, we show how to solve these problems through the use of multi-resolution, parallelizable bitmap indexes, which support a fine-grained tradeoff between storage requirements and query performance. Our experiments with large data sets from two scientific domains show that multi-resolution, parallelizable bitmap indexes occupy an acceptable amount of storage while improving range query performance by roughly a factor of 10, compared to a single-resolution bitmap index of reasonable size.
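To make the storage versus query-time tradeoff concrete, here is a minimal sketch of a two-resolution bitmap index over integer data. It is an illustration under stated assumptions, not the paper's implementation: the class name `MultiResolutionIndex`, the fixed two levels, the default coarse bin width of 10, and the use of Python integers as uncompressed bitmaps are all hypothetical choices made for this example.

```python
from collections import defaultdict


def build_bitmaps(values, width):
    """Map bin id -> bitmap (a Python int); bit i is set when row i falls in that bin."""
    bitmaps = defaultdict(int)
    for row, v in enumerate(values):
        bitmaps[v // width] |= 1 << row
    return bitmaps


class MultiResolutionIndex:
    """Hypothetical two-level index: one bitmap per value plus one per coarse bin."""

    def __init__(self, values, coarse_width=10):
        self.coarse_width = coarse_width
        self.fine = build_bitmaps(values, 1)               # high resolution
        self.coarse = build_bitmaps(values, coarse_width)  # low resolution

    def range_query(self, lo, hi):
        """Return (matching rows, number of bitmaps read) for lo <= value < hi."""
        result, reads = 0, 0
        v = lo
        while v < hi:
            if v % self.coarse_width == 0 and v + self.coarse_width <= hi:
                # A whole coarse bin lies inside the range: one read covers it.
                result |= self.coarse.get(v // self.coarse_width, 0)
                v += self.coarse_width
            else:
                # Range edges fall back to the fine-grained bitmaps.
                result |= self.fine.get(v, 0)
                v += 1
            reads += 1
        rows = [i for i in range(result.bit_length()) if (result >> i) & 1]
        return rows, reads


index = MultiResolutionIndex([3, 15, 27, 42, 8, 19, 33, 25])
rows, reads = index.range_query(5, 35)
print(rows, reads)
```

In this toy case the query touches 12 bitmaps, where a single-resolution index with one bitmap per value would touch 30. The cost is the extra storage for the coarse level.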
Supporting Ranking and Clustering as Generalized Order-By and Group-By
- In SIGMOD Conference, 2007
Cited by 7 (1 self)
The Boolean semantics of SQL queries cannot adequately capture the “fuzzy” preferences and “soft” criteria required in non-traditional data retrieval applications. One way to solve this problem is to add a flavor of “information retrieval” into database queries by allowing fuzzy query conditions and flexibly supporting grouping and ranking of the query results within the DBMS engine. While ranking is already supported natively by all major commercial DBMSs, support for flexible grouping is still very limited (i.e., group-by). In this paper, we propose to generalize group-by to enable flexible grouping (clustering specifically) of the query results. Different from clustering in data mining applications, our focus is on supporting efficient clustering of Boolean results generated at query time. Moreover, we propose to integrate ranking and clustering with Boolean conditions, forming a new type of ClusterRank query to allow structured data retrieval. Such an integration is nontrivial in terms of both semantics and query processing. We investigate various semantics of this type of query. To process such queries, a straightforward approach is to simply glue together the techniques developed for ranking-only and clustering-only queries. This approach is costly since existing techniques treat both ranking and clustering as blocking post-processing tasks on Boolean query results. We propose a summary-based evaluation method that uses bitmap indexes to seamlessly integrate Boolean conditions, clustering, and ranking. Our experimental study shows that our approach significantly outperforms the straightforward one and maintains high clustering quality.
Breaking the curse of cardinality on bitmap indexes
- in SSDBM 2008
Cited by 5 (4 self)
Bitmap indexes are known to be efficient for ad-hoc range queries that are common in data warehousing and scientific applications. However, they suffer from the curse of cardinality, that is, their efficiency deteriorates as attribute cardinalities increase. A number of strategies have been proposed, but none of them addresses the problem adequately. In this paper, we propose a novel binned bitmap index that greatly reduces the cost to answer queries, and therefore breaks the curse of cardinality. The key idea is to augment the binned index with an Order-preserving Bin-based Clustering (OrBiC) structure. This data structure significantly reduces the I/O operations needed to resolve records that cannot be resolved with the bitmaps. To further improve the proposed index structure, we also present a strategy to create single-valued bins for frequent values. This strategy reduces index sizes and improves query processing speed. Overall, the binned indexes with OrBiC greatly improve query processing speed, and are 3 to 25 times faster than the best available indexes for high-cardinality data.
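The sketch below illustrates why a binned index needs a candidate check, and how keeping each bin's values together limits the work for that check. It is only loosely in the spirit of the clustering idea described in the abstract and is not the OrBiC structure itself: the class `BinnedIndex`, the in-memory per-bin lists, and the `checked` counter (a stand-in for I/O cost) are assumptions made for illustration.

```python
import bisect


class BinnedIndex:
    """Hypothetical binned bitmap index with values clustered per bin."""

    def __init__(self, values, boundaries):
        # Bin b holds values with boundaries[b] <= v < boundaries[b + 1].
        # All values are assumed to lie in [boundaries[0], boundaries[-1]).
        self.boundaries = boundaries
        nbins = len(boundaries) - 1
        self.bitmaps = [0] * nbins                   # bit i set: row i is in the bin
        self.clusters = [[] for _ in range(nbins)]   # (row, value) pairs per bin
        for row, v in enumerate(values):
            b = bisect.bisect_right(boundaries, v) - 1
            self.bitmaps[b] |= 1 << row
            self.clusters[b].append((row, v))

    def range_query(self, lo, hi):
        """Return (matching rows, values examined) for lo <= value < hi."""
        hits, checked = 0, 0
        for b, bitmap in enumerate(self.bitmaps):
            b_lo, b_hi = self.boundaries[b], self.boundaries[b + 1]
            if b_hi <= lo or b_lo >= hi:
                continue                  # bin lies entirely outside the range
            if lo <= b_lo and b_hi <= hi:
                hits |= bitmap            # bin fully covered: the bitmap suffices
            else:
                # Edge bin: the bitmap alone cannot decide, so examine only
                # the values stored with this bin.
                for row, v in self.clusters[b]:
                    checked += 1
                    if lo <= v < hi:
                        hits |= 1 << row
        rows = [i for i in range(hits.bit_length()) if (hits >> i) & 1]
        return rows, checked


index = BinnedIndex([0.5, 2.5, 4.1, 7.9, 3.3, 9.2, 5.0, 6.6], [0, 2, 4, 6, 8, 10])
print(index.range_query(3, 8))
```

Here only the two values in the edge bin [2, 4) are examined. The other matching rows come straight from the bitmaps.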
Performance of Multi-Level and Multi-Component Compressed Bitmap Indexes
- 2007
Cited by 4 (4 self)
Bitmap indexes are known as the most effective indexing methods for range queries on append-only data, especially for low cardinality attributes. Recently, bitmap indexes were also shown to be just as effective for high cardinality attributes when certain compression methods are applied. There are many different bitmap indexes in the literature but no definite comparison among them has been made, largely because there is no accurate prediction of their index sizes and search time. This paper presents a systematic evaluation of two large subsets of compressed bitmap indexes that use multi-component and multi-level encodings. We combine extensive analyses with ample experimental results to confirm them, whereas earlier studies of these indexes are either empirical or for uncompressed indexes only. Our analyses provide highly accurate predictions that agree with test measurements. These analyses not only identify the best methods in terms of index size and query processing cost, but also reveal new ways of using multi-level methods that significantly improve their performance. Using the best parameters obtained through analyses, we produce three two-level indexes with the optimal computational complexity. Furthermore, the fastest two-level indexes are predicted and observed to be 5 to 10 times faster than other well-known indexes.
Indexing Scientific Data
- 2007
Cited by 2 (0 self)
The ability to extract information from collected data has always driven science. Today’s large computers and automated sensing technologies collect terabytes of data in a few weeks. Extracting information from such large amounts of data is like trying to find a needle in a haystack. For efficient information extraction, we need disk-based indexing schemes that can efficiently handle queries restricting ranges on dozens of attributes. Unfortunately, the unique characteristics of scientific data and queries cause traditional indexing techniques to have poor performance on scientific workloads, occupy excessive space, or both. Bitmap indexes were proposed as a solution to these problems. However, in experiments with scientific data and queries, we found that previously proposed variations of bitmap indexes either were quite slow or required excessive storage for processing the large-range query conditions our scientists used. Scientists also told us that bitmap indexes, though smaller than traditional indexes, were too large for scientific data warehouses. Our scientists also wanted an efficient method to consolidate the data points returned by the indexes into larger, more meaningful regions of interest. To address these three problems, we introduced multi-resolution bitmap indexes, which group
Magellan: A Searchable Metadata Architecture for Large-Scale File Systems
- 2009
Cited by 2 (0 self)
As file systems continue to grow, metadata search is becoming an increasingly important way to access and manage files. However, existing solutions that build a separate metadata database outside of the file system face consistency and management challenges at large scales. To address these issues, we developed Magellan, a new large-scale file system metadata architecture that enables the file system’s metadata to be efficiently and directly searched. This allows Magellan to avoid the consistency and management challenges of a separate database, while providing performance comparable to that of other large file systems. Magellan enables metadata search by introducing several techniques to metadata server design. First, Magellan uses a new on-disk inode layout that makes metadata retrieval efficient for searches. Second, Magellan indexes inodes in data structures that enable fast, multi-attribute search and allow all metadata lookups, including directory searches, to be handled as queries. Third, a query routing technique helps keep the search space small, even at large scales. Fourth, a new journaling mechanism enables efficient update performance and metadata reliability. An evaluation with real-world metadata from a file system shows that, by combining these techniques, Magellan is capable of searching millions of files in under a second, while providing metadata performance comparable to, and sometimes better than, other large-scale file systems.
Data Parallel Bin-Based Indexing for Answering Queries on Multi-Core Architectures
Cited by 1 (1 self)
The multi-core trend in CPUs and general purpose graphics processing units (GPUs) offers new opportunities for the database community. The increase of cores at exponential rates is likely to affect virtually every server and client in the coming decade, and presents database management systems with a huge, compelling disruption that will radically change how processing is done. This paper presents a new parallel indexing data structure for answering queries that takes full advantage of the increasing thread-level parallelism emerging in multi-core architectures. In our approach, our Data Parallel Bin-based Index Strategy (DP-BIS) first bins the base data, and then partitions and stores the values in each bin as a separate, bin-based data cluster. In answering a query, the procedures for examining the bin numbers and the bin-based data clusters offer the maximum possible level of concurrency; each record is evaluated by a single thread and all threads are processed simultaneously in parallel. We implement and demonstrate the effectiveness of DP-BIS on two multicore architectures: a multi-core CPU and a GPU. The concurrency afforded by
Efficient Data Compression Scheme using Dynamic Huffman Code Applied on Arabic Language
Cited by 1 (0 self)
The development of an efficient compression scheme to process the Arabic language represents a difficult task. This paper applies dynamic Huffman coding, a variable-length bit coding scheme, to data compression for the Arabic language. Experimental tests have been performed on both Arabic and English text. A comparison is made to measure the efficiency of compression on both Arabic and English text. A comparison is also made between the compression rate and the size of the file to be compressed. It has been found that as the file size increases, the compression ratio decreases for both Arabic and English text. The experimental results show that the average message length and the efficiency of compression on Arabic text are better than those on English text. The results also show that the main factor that significantly affects compression ratio and average message length is the frequency of the symbols in the text.
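To illustrate how symbol frequencies drive average message length, the following sketch computes Huffman code lengths for a short string. Note the substitution: it builds a static Huffman code from known frequencies, not the dynamic (adaptive) variant the paper uses, and the function names are hypothetical. It assumes non-empty input.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a static Huffman code."""
    freq = Counter(text)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def average_code_length(text):
    """Average number of bits per symbol under the static Huffman code."""
    lengths = huffman_code_lengths(text)
    freq = Counter(text)
    return sum(freq[s] * lengths[s] for s in freq) / len(text)


print(average_code_length("abcdabcd"))   # four equally frequent symbols
print(average_code_length("aaaaaabc"))   # one dominant symbol
```

With four equally frequent symbols every code is 2 bits long. When one symbol dominates, it receives a 1-bit code and the average drops to 1.25 bits per symbol.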
Analysis of Basic Data Reordering Techniques
Cited by 1 (0 self)
Data reordering techniques are applied to improve the space and time efficiency of storage and query systems in various scientific and commercial applications. Run-length encoding is a prominent approach to compression in many areas, and its performance is significantly enhanced by achieving longer and fewer “runs” through data reordering. In this paper we theoretically study two reordering techniques, namely lexicographical order and Gray code order. We analyze these two methods in the context of bitmap indexes, which are known to have high query performance. We take into account the two commonly used bitmap encodings: equality and range. Our analysis indicates that, when we have all the possible data tuples, both ordering methods perform the same with equality encoding. However, Gray code achieves better compression with range encoding. Experimental results are provided to validate the theoretical analysis.
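As a small illustration of why reordering helps run-length encoding, the sketch below counts runs in the equality-encoded bitmaps of a single column before and after sorting its rows. It does not reproduce the paper's comparison of lexicographical and Gray code orders over multi-attribute tuples. The function names and sample data are assumptions, and the number of runs is used only as a rough proxy for compressed size.

```python
from itertools import groupby


def run_length_encode(seq):
    """Collapse a sequence into (value, run length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(seq)]


def equality_bitmaps(column):
    """Equality encoding: one bit vector per distinct value in the column."""
    return {v: [1 if x == v else 0 for x in column] for v in sorted(set(column))}


def total_runs(column):
    """Total number of runs across all equality-encoded bitmaps of the column."""
    return sum(len(run_length_encode(bits))
               for bits in equality_bitmaps(column).values())


column = [2, 0, 1, 0, 2, 1, 0, 2]
print(total_runs(column))          # rows in arrival order
print(total_runs(sorted(column)))  # rows reordered by value
```

For this column, sorting reduces the total from 17 runs to 7.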

