Life with Lunchhooks

Sunday 22 April 2007

Good Hash Functions

I happened to want to create a hash table with integer keys and went looking for a suitable function. As usual, Google is your friend. And as usual, once you start researching things on the 'net, hours can go by.

Thomas Wang has a good discussion of various integer hash functions, but that also led me elsewhere, to discussions of good hash functions in general.

In the past, I've found that many of the hash functions that are claimed as being better than Knuth's classic string hash function don't actually prove to be any better by most metrics, and some seem to be much worse.

For example, one popular hash on the street these days seems to be Paul Hsieh's SuperFastHash. It does run quickly, and on the whole its statistical properties seem to shake out reasonably well. But when you look at the actual integers it returns, in my tests using /usr/share/dict/web2 on my Mac, there seem to be far more collisions than you'd statistically expect. Statistically, you'd expect about six collisions in the 32-bit space. Knuth's hash function has only five, and they're between very dissimilar words, namely:

227010540:  autovivisection grovelings
890239928:  dialypetalous mumpishness
2851341963: anisostemonous umbellifer
3508170762: ctenodactyl fuliginousness
3909438781: prerogativity puzzleheaded

The SuperFastHash function, on the other hand, has 59 collisions, an order of magnitude more. Here are a representative few:

432696082: Cotinga Cotonam
535511585: miscoin misfond
631000912: amidine aminity
668950620: untossed unworked
738886349: hennin penman
749072160: revisible rewirable

Notice that the words that hash the same seem somehow similar. That's just weird.

In addition to his own hash function, Paul Hsieh also has some other useful code on his site, including a hash test program comparing several different hash implementations for speed, and a portable implementation of stdint.h.

The FNV (a.k.a. Fowler/Noll/Vo) hash is another hash function that seems popular these days. It's broadly similar to Knuth's hash function, but does a better job of distributing hashes for short strings across the full 32-bit space. For example, Knuth's hash maps bat and cat to 137867 and 139236 respectively, but FNV maps them to 950299920 and 1587996537. Like Knuth's hash function, FNV places single-letter words in adjacent spots (although there is an alternative version, FNV-1a, that avoids this problem). Here are the collisions in 32-bit space for FNV:

374764810:  diabolically koilanaglyphic
1055878936: deuteropathic vertebrosacral
1290893597: parer vila
1408982841: basiotribe narcotinic
1713658462: averral climatical
3129894270: Scorpididae transposer

Bob Jenkins published an article in Dr. Dobb's Journal in 1997, providing a good hash function of his own, and has continued to tweak his code since. His page on hashing has lots of good stuff, including links to his code. His hash function is no slouch, and it's the only one I looked at that maps single characters to radically different positions. Below are its 32-bit collisions, again with about the distribution you'd expect:

728135544: chorda fingerbreadth
733592810: stockily virginally
893264706: combaron unlimited
1456871225: gaspingly secularistic
1486736111: unbodied Yankee
2683815022: blackpoll Paharia
2947362466: Borinqueno unskewed
3298503807: distributress granulator

Thanks to Paul Hsieh's test program, here are some performance numbers for these different implementations (as benchmarked on my aging PowerBook G4):

FNVHash         :  3.9300s
knuthHash       :  2.9700s
BobJenkins      :  2.4600s
SuperFastHash   :  2.2800s

Running on some other architectures, I find that FNV and Knuth are really about the same (the difference between the two seems to be a G4 artifact). On the whole, although it may look like there's a big difference between the algorithms, in my experience the hash function (or even the whole hash table implementation!) isn't really the bottleneck. In other words, if you make your hash function twice as fast, usually no one will notice.

Paul Hsieh's SuperFastHash may be a tiny bit faster than Bob Jenkins's hash, but I think not enough to really stand out, and its strange collisions worry me. Bob Jenkins's hash function is probably the best and the one to use if you want an industrial-strength hash, but it is massive and complex. FNV may be slower, but it's short and sweet, just two mystery constants to remember. But if I have to write it myself, from memory, I'm still going to go with Knuth. Usually, Knuth's slightly odd pattern really won't matter.

For more, see Wikipedia's coverage of hash tables, which also has pretty good coverage of hash functions.

Sunday 11 February 2007

Universal Binaries without XCode

OS X inherited fat binary technology from NextStep. Back in the NextStep days, the incantation was easy: you'd just add -arch i386 -arch ppc to all your compilation/linking/library commands and you'd be all set.

With OS X, Apple made it "even easier" — just a check box in XCode. And for projects that still use things like Makefiles, they give you detailed instructions for Building an Open Source Universal Binary. Great, right? Not so much, because those instructions essentially tell you how to make XCode manage the whole build, which is, frankly, nuts.

If you try to go old school and pass -arch i386 -arch ppc to gcc, all seems fine until you try to link, at which point it dies horribly. It turns out that the "standard" developer libraries are thin, not fat. So, to link your program, you need to pass -syslibroot /Developer/SDKs/MacOSX10.4u.sdk to the linker to have it find some libraries with the proper amount of universal goodness. For a C++ project, the relevant incantation is

   g++ -Wl,-syslibroot,/Developer/SDKs/MacOSX10.4u.sdk -arch i386 -arch ppc

Why didn't they just say that? Maybe they were too embarrassed...

Understanding mdfind

I love Apple's Spotlight in theory, but in practice I hate the GUI implementation. On my 1.33GHz G4 laptop, having it try to search while I'm still trying to type the first few characters is both painful and usually useless. And, to make matters worse, when using Spotlight from the Finder, it seems to crash the Finder about 50% of the time.

So, half the time I end up just using locate or find. But I keep thinking that I should really be using mdfind. But the manual page is less than helpful. It says

The query can be a string or a query expression.

which isn't very helpful because it doesn't tell you what valid query expressions look like.

Some people tell you to just do a search in the Finder, save it as a saved search/smart folder, and then go peek in ~/Library/Saved Searches. There you'll find evil XML (or binary) plist files, which are unreadable by normal people. But here's a useful trick to render them in human readable form (i.e., the old-style ASCII plist format). If the file ends with a .plist extension (e.g., ~/blah/foobar.plist), you can use defaults read ~/blah/foobar (note the lack of the .plist — the defaults command insists on adding it). If it doesn't end with .plist, you can make a temporary symlink that does.

So, you can look at saved searches, but learning from examples only goes so far. You can also use mdls on existing known files to find potential attributes to use in your search.

But what you really need is to know the query expression syntax and the metadata attributes. Why they can't just tell you about these references on the mdfind man page, I don't know.

This documentation is a good start, except that it is still fairly sparse. From what I can tell, the inRange operator doesn't work, or at least doesn't work on dates. This works to find the files I've changed in the last day:

   mdfind '(kMDItemFSContentChangeDate >= $ && (kMDItemFSContentChangeDate < $ && (kMDItemContentTypeTree = "public.content")'

but this one doesn't

   mdfind '(inRange(kMDItemFSContentChangeDate,$,$ && (kMDItemContentTypeTree = "public.content")'

The kMDItemContentTypeTree = "public.content" part is to weed out updated cache files and the like, although apparently Makefiles don't qualify as content (none of the importers recognize them, I guess), so they get weeded out too. sigh

Still, I'm closer than ever to weaning myself off locate and find. Maybe.