Revisiting Fast SSE Select

I started thinking about this SSE select operation again, probably round about the time I learned of AMD’s SSEPlus and its logical_bitwise_choose function.

(Awesome library, terrible function name. But I digress…)

There are two alternatives at hand. One is the classic, obvious method:

  ; ((a & MASK) | (b & ~MASK))
  ; xmm0 = MASK
  ; xmm1 = a
  ; xmm2 = b
  andps  xmm1, xmm0
  andnps xmm0, xmm2
  orps   xmm0, xmm1

The other method uses fancy bit-twiddling:

  ; (((a ^ b) & MASK) ^ b)
  ; xmm0 = b
  ; xmm1 = a
  ; xmm2 = MASK
  xorps xmm0, xmm1
  andps xmm0, xmm2
  xorps xmm0, xmm1

I’ve long regarded the 2nd sequence as better, but I’m no longer sure. Take a look at the DAGs for the two sequences:

Obvious Method
Xor Method

The xor method avoids destroying the mask, but requires three serially-dependent instructions. The “obvious” method provides parallelism, but destroys the mask.

This is an example of a classic compiler phase-ordering problem. Optimal instruction selection depends on knowledge of the “liveness” of the mask variable.

Advertisements

5 Responses to “Revisiting Fast SSE Select”


  1. 1 Azeem Jiva April 8, 2008 at 7:08 pm

    I think your arrows are backward…

  2. 2 Mark April 9, 2008 at 12:15 pm

    You are probably imagining that they indicate data flow, but I read them as “depends on.” Sorry if this is non-standard.

  3. 3 Azeem Jiva April 10, 2008 at 8:13 am

    Ahh that makes more sense.

  4. 4 Brent Hollingsworth May 22, 2008 at 1:17 pm

    I have updated the function name. It is now logical_bitwise_select().

    What do you think it should be called?

  5. 5 Mark May 23, 2008 at 8:13 pm

    I think that’s a better name, although I’m not sure “logical” adds much. What’s the alternative? An arithmetic select? I might just call it bitwise_select.


Comments are currently closed.




%d bloggers like this: