## Performance in Practice of String Hashing Functions (1997)

Venue: Proc. Int. Conf. on Database Systems for Advanced Applications

Citations: 22 (7 self)

### BibTeX

@INPROCEEDINGS{Ramakrishna97performancein,
  author = {M.V. Ramakrishna and Justin Zobel},
  title = {Performance in Practice of String Hashing Functions},
  booktitle = {Proc. Int. Conf. on Database Systems for Advanced Applications},
  year = {1997},
  pages = {215--223}
}

### Abstract

String hashing is a fundamental operation, used in countless applications where fast access to distinct strings is required. In this paper we describe a class of string hashing functions and explore its performance. In particular, using experiments with both small sets of keys and a large key set from a text database, we show that it is possible to achieve performance close to that theoretically predicted for hashing functions. We also consider criteria for choosing a hashing function and use them to compare our class of functions to other methods for string hashing. These results show that our class of hashing functions is reliable and efficient, and is therefore an appropriate choice for general-purpose hashing.

### Citations

8550 | Introduction to Algorithms - Cormen, Leiserson, et al. - 1990 |

Citation Context: ...ashing functions H is universal if, for a given table size T and any pair of valid keys s1 and s2, the number of hashing functions h ∈ H such that h(s1) = h(s2) is less than or equal to |H|/T [2]. That is, for a randomly-chosen hashing function the probability that s1 and s2 hash to the same value is less than or equal to 1/T. In practice universality means that, with high probability, a r...

699 | The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition - Knuth - 1998 |

433 | Algorithms in C - Sedgewick - 1990 |

Citation Context: ...-behaved, but the operations required in the conversion and hashing make it unlikely that they would be faster than shift-add-xor. For example, consider Cormen, Leiserson, and Rivest [2] and Sedgewick [17], which are two of the better-known recent algorithms texts. Cormen, Leiserson, and Rivest [2] suggest that strings be converted to numbers through radix conversion. For alphanumeric strings, implicit...

183 | Overview of the first Text REtrieval Conference (TREC-1) - Harman - 1992 |

Citation Context: ...d hand-edited to remove errors and nonsense [18]. Another was trec, a file of 1,073,726 distinct words (that is, contiguous alphabetic strings) extracted from the first 3 gigabytes of the TREC data [6]; this data contains the full-text of newspaper articles, abstracts, and scientific journals. In our experiments we have focused on certain table sizes and load factors, to allow comparison with previ...

78 | Expected length of the longest probe sequence in hash code searching - Gonnet - 1981 |

Citation Context: ... expressed fear of this possibility by concluding that "hashing would be inappropriate for certain real-time applications such as air traffic control, where people's lives are at stake". However, Gonnet [5] proved that such fears of hashing are baseless, since the probability of the worst case is, in his words, ridiculously small. Gonnet proposed a measure for the worst case of hashing based on the leng...

45 | Phonetic string matching: lessons from information retrieval - Zobel, Dart - 1996 |

Citation Context: ... key sets. (However, results for all of the key sets were similar.) One was names, a file of 31,918 distinct surnames extracted from Internet news articles and hand-edited to remove errors and nonsense [18]. Another was trec, a file of 1,073,726 distinct words (that is, contiguous alphabetic strings) extracted from the first 3 gigabytes of the TREC data [6]; this data contains the full-text of newspaper a...

42 | Practical performance of Bloom filters and parallel free-text searching - Ramakrishna - 1989 |

Citation Context: ...is often the basic data structure in applications such as symbol tables in compilers and account names in password files. Hashing is also used in applications such as spell checking and Bloom filters [15]. In databases, hashing is important, not just for indexing, but also for operations such as joins and inverted-file construction. The performance of a hashing scheme depends primarily on two factors:...

31 | The Second Text Retrieval Conference - Harman, editor - 1994 |

Citation Context: ...s and hand-edited to remove errors and nonsense [18]. Another was trec, a file of 1,073,726 distinct words (that is, contiguous alphabetic strings) extracted from the first 3 gigabytes of the TREC data [6]; this data contains the full-text of newspaper articles, abstracts, and scientific journals. In our experiments we have focused on certain table sizes and load factors, to allow comparison with previo...

22 | Fast hashing of variable-length text strings - Pearson - 1990 |

Citation Context: ...n Database Systems for Advanced Applications, Melbourne, Australia, April 1-4, 1997. that has attracted surprisingly little research. Some recent papers have examined specific string hashing functions [12, 13] but how these functions compare to the analytically-predicted performance of hashing is unknown. Moreover, good choice of hashing function is crucial to efficiency. It is often assumed that for a given ...

9 | Hashing Functions - Knott - 1975 |

Citation Context: ...nction. There has been much research addressing the problems of overflow and collisions. Hashing functions have received less attention, but analytically the behaviour of hashing is now well understood [3, 7, 10, 11, 14]. However, in much of the work on hashing it is assumed that the keys are integers, while in practice keys are often strings of alphanumeric characters, an aspect of hashing Proceedings of the Fifth In...

9 | Key-to-address transform techniques: A fundamental performance study on large existing formatted files - Lum, Yuen, et al. - 1971 |

Citation Context: ...nction. There has been much research addressing the problems of overflow and collisions. Hashing functions have received less attention, but analytically the behaviour of hashing is now well understood [3, 7, 10, 11, 14]. However, in much of the work on hashing it is assumed that the keys are integers, while in practice keys are often strings of alphanumeric characters, an aspect of hashing Proceedings of the Fifth In...

9 | Hashing practice: analysis of hashing and universal hashing - Ramakrishna - 1988 |

Citation Context: ...nction. There has been much research addressing the problems of overflow and collisions. Hashing functions have received less attention, but analytically the behaviour of hashing is now well understood [3, 7, 10, 11, 14]. However, in much of the work on hashing it is assumed that the keys are integers, while in practice keys are often strings of alphanumeric characters, an aspect of hashing Proceedings of the Fifth In...

5 | Practical Perfect Hashing - Cormack, Horspool, et al. - 1985 |

Citation Context: ...her valuable properties are perfection, where the hashing function is collision-free, and order-preservation, where the sort order of the hash values is the same as the sort order of the original keys [1, 4, 16]. Both are valuable in specific applications; perfect hashing functions can be used for lookup in static tables, for example, because it may then not be necessary to store the keys. However, such funct...

5 | Selecting a hashing algorithm - McKenzie, Harries, et al. - 1990 |

Citation Context: ...n Database Systems for Advanced Applications, Melbourne, Australia, April 1-4, 1997. that has attracted surprisingly little research. Some recent papers have examined specific string hashing functions [12, 13] but how these functions compare to the analytically-predicted performance of hashing is unknown. Moreover, good choice of hashing function is crucial to efficiency. It is often assumed that for a given ...

5 | File organization using composite perfect hashing - Ramakrishna, Larson - 1989 |

Citation Context: ...her valuable properties are perfection, where the hashing function is collision-free, and order-preservation, where the sort order of the hash values is the same as the sort order of the original keys [1, 4, 16]. Both are valuable in specific applications; perfect hashing functions can be used for lookup in static tables, for example, because it may then not be necessary to store the keys. However, such funct...

3 | General performance analysis of key-to-address transformation methods using an abstract file concept - Lum - 1973 |

2 | Order-preserving minimal hash functions and information retrieval - Fox, Chen, et al. - 1991 |

Citation Context: ...her valuable properties are perfection, where the hashing function is collision-free, and order-preservation, where the sort order of the hash values is the same as the sort order of the original keys [1, 4, 16]. Both are valuable in specific applications; perfect hashing functions can be used for lookup in static tables, for example, because it may then not be necessary to store the keys. However, such funct...

2 | Practical performance of Bloom filters and parallel free-text searching - Ramakrishna - 1989 |

Citation Context: ...ble is often the basic data structure in applications such as symbol tables in compilers and account names in password files. Hashing is also used in applications such as spell checking and Bloom filters [15]. In databases, hashing is important, not just for indexing, but also for operations such as joins and inverted-file construction. The performance of a hashing scheme depends primarily on two factors: ...

2 | Expected worst-case performance of hash files - Larson - 1982 |

Citation Context: ...ng quite small, that is, not dramatically greater than would be given by dividing the keys evenly amongst the buckets. Larson extended these results for the general case of bucket size greater than 1 [9]. We now use these analytical results, for both average-case and worst-case behaviour of a class of ideal hashing functions, as a yardstick for evaluating the behaviour in practice of classes of strin...

1 | Distribution dependent hashing functions and their characteristics - Deutscher, Sorenson, et al. - 1975 |

1 | Expected worst-case performance of hash files - Larson - 1982 |

Citation Context: ...ng quite small, that is, not dramatically greater than would be given by dividing the keys evenly amongst the buckets. Larson extended these results for the general case of bucket size greater than 1 [9]. We now use these analytical results, for both average-case and worst-case behaviour of a class of ideal hashing functions, as a yardstick for evaluating the behaviour in practice of classes of strin...
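
### Illustrative sketch

The citation contexts above mention two ways of hashing a string: radix conversion (attributed there to Cormen, Leiserson, and Rivest) and a function called shift-add-xor. The Python below is a minimal sketch of both ideas, not the paper's code. This record does not give the exact update rule for shift-add-xor, so the rule shown, the shift amounts `left=5` and `right=2`, the `seed` parameter, the 32-bit mask, and the radix of 128 are all assumptions made for illustration, as are the function names.

```python
MASK = 0xFFFFFFFF  # assumption: emulate 32-bit unsigned arithmetic


def radix_hash(key: str, table_size: int, radix: int = 128) -> int:
    """Treat the key as a base-`radix` number and reduce it modulo table_size."""
    h = 0
    for ch in key:
        # Horner's rule; reducing at every step keeps the intermediate value small.
        h = (h * radix + ord(ch)) % table_size
    return h


def shift_add_xor(key: str, table_size: int, seed: int = 0,
                  left: int = 5, right: int = 2) -> int:
    """Assumed shift-add-xor form: xor h with (h << left) + (h >> right) + char."""
    h = seed & MASK
    for ch in key:
        h ^= ((h << left) + (h >> right) + ord(ch)) & MASK
    return h % table_size


if __name__ == "__main__":
    # Hypothetical keys and table size, chosen only to show the call pattern.
    for word in ["hash", "table", "string"]:
        print(word, radix_hash(word, 1009), shift_add_xor(word, 1009, seed=31))
```

Varying `seed` yields a family of functions rather than a single one, which is the setting the universality definition in the first citation context refers to. Whether this particular family meets that definition is not something the sketch establishes.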