Pop Count

Larry Osterman recently blogged about some various methods for computing the population count (or pop count) of a machine word. For the uninitiated, pop count is the canonical name for the function which counts the number of asserted bits in a binary number. Larry’s blog entry was excellent, as usual, but it was of the comments that blew me away. Jeu George added a link to his blog describing the MIT HAKMEM pop count. Here it is:

unsigned popCount( unsigned x )
   unsigned r = x - ((x >> 1) & 033333333333)
                  - ((x >> 2) & 011111111111);
   return ((r + (r >> 3)) & 030707070707) % 63;

I spent hours figuring out how it worked.

It’s amazing, and quite concise, but I had a nagging worry: it contains a non-power-of-two modulus operation. I expect that most x86 compilers will generate an expensive “div” instruction to accomplish this.

In the spirit of “assume nothing, measure everything”, here’s my pop count test program.

There are four functions:

  1. popCountA: A naive O(N) loop where N is the number of bits up-to-and-including the most-significant asserted bit.
  2. popCountB: Somewhat less naive O(N) loop, where N is the number of asserted bits.
  3. popCountC: Shift-and-add algorithm that runs in constant time for a 32-bit word.
  4. popCountD: The MIT HAKMEM algorithm.

And here are the results, on two (unfortunately different) platforms. The results are in cycles, (min,max,avg):

Linux turing #1 SMP Tue Aug 9 00:32:49 PDT 2005 i686 GNU/Linux
Dual 1.6GHz AMD AthlonMP
gcc version 4.0.2 20050917 (prerelease) (Debian 4.0.1-8)

popCountA (20,15682,96)
popCountB (22,14075,64)
popCountC (21,34733,21)
popCountD (15,13896,20)

Microsoft(R) Windows(R) XP Professional x64 Edition 5.2.3790 Service Pack 1 Build 3790
Dual 2.6 GHz AMD Opteron
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.42 for 80×86

popCountA (15,58477,66)
popCountB (5,1770,39)
popCountC (5,11025,10)
popCountD (43,29174,49)

Even with -march=athlon gcc absolutely refuses to generate an imul instruction in popCountC. Instead, it does a “strength reduction” that actually hurts performance. Microsoft, however, generates the imul — which explains how it’s popCountC time beats the pants off gcc’s. Both compilers generate the dreaded div instruction in popCountD, as I expected. Thus, they are both slow. I can’t yet explain why gcc outperformed Microsoft here.

So, although the MIT HAKMEM version is neat, it appears that popCountC (when using the imul instruction) is the fastest pop count algorithm on my hardware. If I had a platform that could do blazingly fast mod operations, perhaps this would be different.


%d bloggers like this: