Results 1 – 8 of 8
Simple and space-efficient minimal perfect hash functions
 In Proc. of the 10th Intl. Workshop on Data Structures and Algorithms
, 2007
Abstract

Cited by 14 (7 self)
Abstract. A perfect hash function (PHF) h: U → [0, m − 1] for a key set S is a function that maps the keys of S to unique values. The minimum amount of space to represent a PHF for a given set S is known to be approximately 1.44n²/m bits, where n = |S|. In this paper we present new algorithms for construction and evaluation of PHFs of a given set (for m = n and m = 1.23n), with the following properties: 1. Evaluation of a PHF requires constant time. 2. The algorithms are simple to describe and implement, and run in linear time. 3. The amount of space needed to represent the PHFs is within a factor of 2 of the information-theoretic minimum. No previously known algorithm has all of these properties. To our knowledge, any algorithm in the literature with the third property either requires exponential time for construction and evaluation, or uses near-optimal space only asymptotically, for extremely large n.
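As an illustration of the definition above (this is not the paper's construction, and the keys and table are made up), a few lines of Python that check whether a candidate function is a PHF for a key set S, i.e., injective on S with values in [0, m − 1]:

```python
def is_perfect(h, keys, m):
    """Check that h maps each key in `keys` to a distinct slot in [0, m - 1]."""
    seen = set()
    for k in keys:
        v = h(k)
        if not (0 <= v < m) or v in seen:
            return False
        seen.add(v)
    return True

keys = ["apple", "banana", "cherry"]
# A hypothetical lookup table; any injective map on `keys` qualifies as a PHF.
table = {"apple": 0, "banana": 1, "cherry": 2}
h = table.__getitem__

print(is_perfect(h, keys, len(keys)))            # True: m = n, so h is even minimal (an MPHF)
print(is_perfect(lambda k: 0, keys, len(keys)))  # False: every key collides on slot 0
```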
External perfect hashing for very large key sets
 In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM’07)
, 2007
Abstract

Cited by 13 (2 self)
A perfect hash function (PHF) h: S → [0, m − 1] for a key set S ⊆ U of size n, where m ≥ n and U is a key universe, is an injective function that maps the keys of S to unique values. A minimal perfect hash function (MPHF) is a PHF with m = n, the smallest possible range. Minimal perfect hash functions are widely used for memory-efficient storage and fast retrieval of items from static sets. In this paper we present a distributed and parallel version of a simple, highly scalable and near-space-optimal perfect hashing algorithm for very large key sets, recently presented in [4]. The sequential implementation of the algorithm constructs an MPHF for a set of 1.024 billion URLs of average length 64 bytes collected from the Web in approximately 50 minutes using a commodity PC. The parallel implementation proposed here achieves the following performance using 14 commodity PCs: (i) it constructs an MPHF for the same set of 1.024 billion URLs in approximately 4 minutes; (ii) it constructs an MPHF for a set of 14.336 billion 16-byte random integers in approximately 50 minutes, with a performance degradation of 20%; (iii) one version of the parallel algorithm distributes the description of the MPHF among the participating machines and evaluates it in a distributed way, faster than the centralized function.
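The idea of distributing an MPHF's description across machines can be sketched in miniature: route each key to a bucket, let each bucket owner answer for its own keys, and add per-bucket offsets so the global range stays [0, n − 1]. This is an illustrative sketch only, not the paper's algorithm; in particular, a sorted list stands in for each machine's local MPHF (rank in sorted order is injective and minimal), and the routing hash is an arbitrary choice:

```python
import hashlib

def bucket_of(key, num_machines):
    # Deterministic routing hash (illustrative choice, not the paper's).
    d = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(d[:4], "big") % num_machines

def build_distributed(keys, num_machines):
    # Each machine holds only its own bucket of keys.
    buckets = [[] for _ in range(num_machines)]
    for k in keys:
        buckets[bucket_of(k, num_machines)].append(k)
    local = [sorted(b) for b in buckets]     # stand-in for a real local MPHF
    # Prefix sums give each machine's offset into the global [0, n - 1] range.
    offsets, acc = [], 0
    for b in local:
        offsets.append(acc)
        acc += len(b)
    return local, offsets

def evaluate(key, local, offsets):
    b = bucket_of(key, len(local))
    return offsets[b] + local[b].index(key)  # global slot = offset + local rank

keys = ["u%03d" % i for i in range(100)]
local, offsets = build_distributed(keys, num_machines=4)
slots = sorted(evaluate(k, local, offsets) for k in keys)
print(slots == list(range(100)))             # True: minimal and perfect on [0, 99]
```

Evaluation touches only one machine's data, which is why a distributed description can still answer queries quickly.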
Practical perfect hashing in nearly optimal space
 Information Systems
Abstract

Cited by 2 (1 self)
A hash function is a mapping from a key universe U to a range of integers, i.e., h: U → {0, 1, ..., m − 1}, where m is the size of the range. A perfect hash function for some set S ⊆ U is a hash function that is one-to-one on S, where m ≥ |S|. A minimal perfect hash function for some set S ⊆ U is a perfect hash function with a range of minimum size, i.e., m = |S|. This paper presents a construction for (minimal) perfect hash functions that combines theoretical analysis, practical performance, expected linear construction time and nearly optimal space consumption for the data structure. For n keys and m = n the space consumption ranges from 2.62n to 3.3n bits, and for m = 1.23n it ranges from 1.95n to 2.7n bits. This is within a small constant factor of the theoretical lower bounds of 1.44n bits for m = n and 0.89n bits for m = 1.23n. We combine several theoretical results into a practical solution that has turned perfect hashing into a very compact data structure for solving the membership problem when the key set S is static and known in advance. By taking the memory hierarchy into account we can construct (minimal) perfect hash functions for over a billion keys in 46 minutes using a commodity PC. An open source implementation of the algorithms is available.
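The "small constant factor" claim can be sanity-checked with a few lines of arithmetic using only the numbers quoted in the abstract:

```python
# Space figures from the abstract, in bits per key:
# (best achieved, worst achieved, information-theoretic lower bound)
bounds = {
    "m = n":     (2.62, 3.3, 1.44),
    "m = 1.23n": (1.95, 2.7, 0.89),
}
for setting, (lo, hi, opt) in bounds.items():
    # Ratio of achieved space to the lower bound.
    print(f"{setting}: within {lo/opt:.2f}x to {hi/opt:.2f}x of the lower bound")
```

So the construction sits roughly within a factor of 1.8 to 3 of optimal in both regimes, consistent with the claim.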
Near-Optimal Space Perfect Hashing Algorithms
Abstract

Cited by 2 (0 self)
Abstract. A perfect hash function (PHF) is an injective function that maps keys from a set S to unique values. Since no collisions occur, each key can be retrieved from a hash table with a single probe. A minimal perfect hash function (MPHF) is a PHF with the smallest possible range, that is, the hash table size is exactly the number of keys in S. Unlike other hashing schemes, MPHFs completely avoid the wasted space and wasted time needed to deal with collisions. The study of perfect hash functions started in the early 1980s, when it was proved that the information-theoretic lower bound for describing a minimal perfect hash function is approximately 1.44 bits per key. Although the proof indicates that it would be possible to build an algorithm capable of generating optimal functions, no one was able to obtain a practical algorithm that could be used in real applications. Thus, there was a gap between theory and practice. The main result of the thesis fills this gap, lowering the space complexity needed to represent MPHFs that are useful in practice from O(n log n) to O(n) bits. This allows the use of perfect hashing in applications for which it was previously not considered a good option. This explicit construction of PHFs is something that the data structures and algorithms community has been looking for since the 1980s.
GERINDO: Managing and Retrieving Information in Large Document Collections
 Departamento de Ciência da Computação, UFMG, Belo Horizonte
, 2007
Abstract

Cited by 1 (0 self)
We present in this report a summary of the main results produced in the five years of the GERINDO research project. The aim of this project is to address the increasing demand for software tools capable of dealing with information available in large document collections, such as the World Wide Web. It involves efforts of researchers from three Brazilian universities to develop core technologies for a number of document management applications demanded by today’s information society. These efforts are concentrated in six main research topics: document categorization, semi-structured data management, agents and focused crawlers, information retrieval models and searching techniques, efficiency issues, and data mining. Besides specific contributions in these six research topics, the project has stimulated interaction among the researchers of the three universities, who have worked together to solve challenging problems using a combination of different approaches. Moreover, the project has promoted other collaborations with research groups from ...
Perfect hashing for data management applications
, 2007
Abstract

Cited by 1 (0 self)
Perfect hash functions can potentially be used to compress data in connection with a variety of data management tasks. Though there has been considerable work on how to construct good perfect hash functions, there is a gap between theory and practice among all previous methods for minimal perfect hashing. On one side, there are good theoretical results without experimentally proven practicality for large key sets. On the other side, there are algorithms with theoretically analyzed time and space usage that assume truly random hash functions are available for free, which is an unrealistic assumption. In this paper we attempt to bridge this gap between theory and practice, using a number of techniques from the literature to obtain a novel scheme that is theoretically well-understood and at the same time achieves an order-of-magnitude increase in performance compared to previous “practical” methods. This improvement comes from a combination of a novel, theoretically optimal perfect hashing scheme that greatly simplifies previous methods, and the fact that our algorithm is designed to make good use of the memory hierarchy. We demonstrate the scalability of our algorithm by considering a set of over one billion URLs from the World Wide Web, of average length 64, for which we construct a minimal perfect hash function on a commodity PC in a little more than 1 hour. Our scheme produces minimal perfect hash functions using slightly more than 3 bits per key. For perfect hash functions in the range {0, ..., 2n − 1} the space usage drops to just over 2 bits per key (i.e., one bit more than optimal for representing the key). This is significantly below what has been achieved previously for very large values of n.
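To put the per-key figures in context, here is the total size of the function description at the billion-key scale the abstract mentions (arithmetic only; "slightly more than 3" and "just over 2" are taken as exactly 3 and 2 for the estimate):

```python
n = 1_000_000_000                 # roughly the billion-URL set from the abstract
for bits_per_key in (3.0, 2.0, 1.44):
    mb = n * bits_per_key / 8 / 1e6   # bits -> bytes -> megabytes
    print(f"{bits_per_key} bits/key -> {mb:.0f} MB total")
```

That is, even at 3 bits per key the entire MPHF for a billion URLs fits in a few hundred megabytes of RAM, far smaller than the key set itself.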
GigaHash: Scalable Minimal Perfect Hashing for Billions of URLs
"... A minimal perfect hash function maps a static set of ..."
Wikipedia in the Pocket: Indexing Technology for Near-duplicate Detection and High Similarity Search (Demonstration)
Abstract
We develop and implement a new indexing technology which allows us to use complete (and possibly very large) documents as queries, while having retrieval performance comparable to a standard term query. Our approach aims at retrieval tasks such as near-duplicate detection and high-similarity search. To demonstrate the performance of our technology we have compiled the search index “Wikipedia in the Pocket”, which contains about 2 million English and German Wikipedia articles. This index—along with a search interface—fits on a conventional CD (0.7 gigabyte). The ingredients of our indexing technology are similarity hashing and minimal perfect hashing.
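The "similarity hashing" ingredient can be illustrated with a minimal minhash-style sketch. This is an illustrative stand-in, not the demo's actual method: two documents that share most of their word shingles produce sketches that agree in most positions, so a whole document can serve as a query against precomputed sketches.

```python
import hashlib

def shingles(text, k=3):
    """Set of overlapping k-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(shingle_set, num_perm=64):
    # One seeded hash per sketch position; keep the minimum hash value per seed.
    sketch = []
    for seed in range(num_perm):
        sketch.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set))
    return sketch

def similarity(a, b):
    """Fraction of agreeing sketch positions (estimates Jaccard similarity)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

doc1 = "perfect hashing turns a static key set into a compact index structure"
doc2 = "perfect hashing turns a static key set into a very compact index structure"
doc3 = "an entirely different sentence about wikipedia articles on a compact disc"
s1, s2, s3 = (minhash(shingles(d)) for d in (doc1, doc2, doc3))
print(similarity(s1, s2) > similarity(s1, s3))   # near-duplicates score higher
```

In a deployed index, minimal perfect hashing would then map each sketch fingerprint to its slot in a static, compact table, which is what makes the CD-sized index plausible.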