Good Hash Functions
By W. Clawpaws on Sunday 22 April 2007, 00:35 - Random Tips - Permalink
I happened to want to create a hash table with integer keys and went looking for a suitable function. As usual, Google is your friend. And as usual, once you start researching things on the 'net, hours can go by.
Thomas Wang has a good discussion of various integer hash functions, but that also lead me elsewhere to discussions of good hash functions in general.
In the past, I've found that many of the hash functions that are claimed as being better than Knuth's classic string hash function don't actually prove to be any better by most metrics, and some seem to be much worse.
For example, one popular hash on the street these days seems to be Paul
Hsieh's SuperFastHash. It does run quickly, and on the whole its statistical
properties seem to shake out reasonably well. But when you look at the actual
integers it returns, in my tests using /usr/share/dict/web2 on my
Mac, there seem to be a far more collisions than you'd statistically expect.
Statisitically, you'd expect about six collisions in the 32-bit space. Knuth's
hash function has only five, and they're very dissimilar words, namely:
227010540: autovivisection grovelings 890239928: dialypetalous mumpishness 2851341963: anisostemonous umbellifer 3508170762: ctenodactyl fuliginousness 3909438781: prerogativity puzzleheaded
The SuperFastHash function, on the other hand has 59 collisions, an order of magnitude more. Here are a representative few:
432696082: Cotinga Cotonam 535511585: miscoin misfond 631000912: amidine aminity 668950620: untossed unworked 738886349: hennin penman 749072160: revisible rewirable
Notice that the words that hash the same seem somehow similar. That's just weird.
In addition to his own hash function, Paul Hsieh also has some other useful code on his site, including a hash test program comparing several different hash implementations for speed, and a portable implementation of stdint.h.
The FNV (a.k.a. Fowler/Noll/Vo) hash is another hash function that seems popular these days. It seems broadly similar to Knuth's hash function, but does a better job of distributing hashes for short strings across the full 32-bit space for hashes. For example, Knuth's hash hashes bat and cat to 137867 and 139236 respectively, but FNV hashes them to 950299920 and 1587996537. Like Knuth's hash function, FNV places single-letter words in adjacent spots (although there is an alternative version, FNV-1a, that avoids this problem), Here are the collisions in 32-bit space for FNV.
374764810: diabolically koilanaglyphic 1055878936: deuteropathic vertebrosacral 1290893597: parer vila 1408982841: basiotribe narcotinic 1713658462: averral climatical 3129894270: Scorpididae transposer
Bob Jenkins published an article in Dr. Dobbs journal in 1997, providing a good hash function of his own, and has continued to tweak his code since. His page on hashing has lots of good stuff, including links to his code. His hash function is no slouch, and is the only one I looked at that maps single characters to radically different positions. Below are his 32-bit collisions, again with about the distribution you'd expect:
728135544: chorda fingerbreadth 733592810: stockily virginally 893264706: combaron unlimited 1456871225: gaspingly secularistic 1486736111: unbodied Yankee 2683815022: blackpoll Paharia 2947362466: Borinqueno unskewed 3298503807: distributress granulator
Thanks to Paul Hsieh's test program, here are some performance numbers for these different implementations (as benchmarked on my aging PowerBook G4):
FNVHash : 3.9300s knuthHash : 2.9700s BobJenkins : 2.4600s SuperFastHash : 2.2800s
Running on some other architectures, I find that FNV and Knuth are really about the same (the difference between the two seems to be a G4 artifact). On the whole, although it may look like there's a big difference between the algorithms, in my experience, I've found that the hash function, (or even the whole hash table implementation!) isn't really the bottleneck. In other words, if you make your hash function twice as fast, usually no one will notice.
Paul Hsieh's SuperFastHash may be a tiny bit faster than Bob Jenkins's hash, but I think not enough to really stand out, and its strange collisions worry me. Bob Jenkins's hash function is probably the best and the one to use if you want an industrial-strength hash, but it is massive and complex. FNV may be slower, but it's short and sweet, just two mystery constants to remember. But if I have to write it myself, from memory, I'm still going to go with Knuth. Usually, Knuth's slightly odd pattern really won't matter.
For more, see Wikipedia's coverage of hash tables, which also has pretty good coverage of hash functions.
Comments
You should take a look at the last version of Bob Jenkins's hash, which is called Lookup3. The old one that included in Paul Hsieh's test program is significantly slower. Theoretically, Lookup3 has about the same speed as Paul Hsieh's hash, but it has much better collision properties. It is the only hash that passes the frog.c test (sets of keys that are all the same length, with all bytes zero, except with a few bits set). See more details on Bob Jenkins's page.
As to speed concern, in my admittedly not very scientific tests, Lookup3 performed better than Paul Hsieh's hash with maximum optimization and about as good as Paul Hsieh's hash with the typical optimization.
The test is done by Paul Hsieh's test modified to include Lookup3. I've included the results of two runs for each optimization setting.
$gcc -O3 -march=athlon-xp hash.c -o hash
==========================
$./hash
BobJenkins : 1.8100s
SuperFastHash : 1.5500s
Lookup3 : 1.3800s
==========================
BobJenkins : 1.8100s
SuperFastHash : 1.5600s
Lookup3 : 1.3600s
==========================
$gcc -O2 hash.c -o hash
==========================
$./hash
BobJenkins : 1.8100s
SuperFastHash : 1.5400s
Lookup3 : 1.5700s
==========================
$./hash
BobJenkins : 1.8100s
SuperFastHash : 1.5500s
Lookup3 : 1.5800s