Fast SSE Select Operation

Some SIMD architectures have a select instruction which combines two vector registers based on a third vector mask. I lust for such an instruction, but silly little two-operand SSE cannot currently support it.

Of course, people build it out of more primitive ops. For example, here’s what Apple suggests when porting to SSE from Altivec (which supports select):

(Note: this has been translated into Microsoft syntax)

_mm_sel_ps_apple(const __m128& a, const __m128& b, const __m128& mask)
    // (b & mask) | (a & ~mask)
    return _mm_or_ps( _mm_and_ps( b, mask ), _mm_andnot_ps( mask, a ) );

We mask b, inverse-mask a, and slam the result together with an or. This is the easy and obvious solution — and the one I used for years.

Here’s a better way:

_mm_sel_ps_xor(const __m128& a, const __m128& b, const __m128& mask)
    // (((b ^ a) & mask)^a)
    return _mm_xor_ps( a, _mm_and_ps( mask, _mm_xor_ps( b, a ) ) );

Witness the amazing power of xor! Here we calculate the bitwise difference between a and b. Then we mask and selectively apply this difference to convert some of the a’s into b’s. This is freaking genius. (I wish I could claim to have invented it.)

Here’s the assembly that my compiler generates for the two sequences. First, the naive way:

 movaps      xmm2, XMMWORD PTR [r8] ; mask
 movaps      xmm1, XMMWORD PTR [rdx] ; b
 andps       xmm1, xmm2
 movaps      xmm0, xmm2
 andnps      xmm0, XMMWORD PTR [rcx] ; a
 orps        xmm0, xmm1

Go-go gadget bit-twiddler:

 movaps      xmm1, XMMWORD PTR [rcx] ; a
 movaps      xmm0, XMMWORD PTR [rdx] ; b
 xorps	      xmm0, xmm1
 andps	      xmm0, XMMWORD PTR [r8] ; mask
 xorps	      xmm0, xmm1

It may not be obvious, but the 2nd sequence is much better because all it’s operations are commutative. Once this little select routine is inlined, a good register allocator will arrange to kill whatever operand may happen to be dead-out. By comparison, the first sequence constrains register allocation with that pesky noncommutative andnps instruction — which has to destroy the mask.

(credit to Jim Conyngham and the MD5 Wikipedia page)


4 Responses to “Fast SSE Select Operation”

  1. 1 Kent April 4, 2007 at 5:51 pm

    Hey bud,

    I love this optimization too, but it’s not perfect. It has a tiny flaw if you pass two references to the same memory location. If that happens the xor’s will wipe out all bits, no matter the mask. Ideally, if you want to merge the same memory location, you want to get the source back unmodified. You can only use the xor approach if you can guarantee that &a != &b.

    If you can’t, you need to use the andnot approach.

  2. 2 Mark April 4, 2007 at 10:32 pm

    I think my version is OK with const references, right? The compiler must not modify either of the source memory locations.

  3. 3 Mark April 5, 2007 at 12:49 pm

    Wendy helped me to understand why my code is *not* susceptible to the problem you describe, Kent. The answer does not have much of anything to do with “const”, despite what I said previously.

    The reason my version is safe is quite simply that my _mm_sel_ps_xor function does not ever modify any of its parameters.

    Sure, the const qualification helps prevent a mistake, but as long as I am careful not to modify my reference parameters, I’m golden.

    Insert head-slapping “duh” here.

    For illustrative purposes, here’s a version that is vulnerable to the problem you describe:

    _mm_sel_ps_xor_BAD_BAD_BAD( __m128& a, __m128& b, const __m128& mask)
        b = _mm_xor_ps( b, a );
        return _mm_xor_ps( a, _mm_and_ps( mask, b ) );

  1. 1 Revisiting Fast SSE Select at mark++ Trackback on April 8, 2008 at 3:51 pm
Comments are currently closed.

%d bloggers like this: