Archive for the 'Programming' Category

A Pair of VIM Tricks

I often lose track of my vim sessions and try to re-open a file. This results in an annoying error dialog (which I reflexively dismiss) and kicks off a hunt for that pre-existing vim session.

A coworker pointed out how silly this was, and suggested that vim should simply auto-raise the existing window. As I have discovered, vim ships with the “edit existing” macro which does just that. I’ve added this to my vimrc:

source $VIMRUNTIME/macros/editexisting.vim

Also, I’m pretty sick and tired of .swp and ~ files cluttering up my directories. Now I use the following two commands to keep them off to the side:

set backupdir=$TEMP//
set directory=$TEMP//


Revisiting Fast SSE Select

I started thinking about this SSE select operation again, probably round about the time I learned of AMD’s SSEPlus and its logical_bitwise_choose function.

(Awesome library, terrible function name. But I digress…)

There are two alternatives at hand. One is the classic, obvious method:

  ; ((a & MASK) | (b & ~MASK))
  ; xmm0 = MASK
  ; xmm1 = a
  ; xmm2 = b
  andps  xmm1, xmm0
  andnps xmm0, xmm2
  orps   xmm0, xmm1

The other method uses fancy bit-twiddling:

  ; (((a ^ b) & MASK) ^ b)
  ; xmm0 = b
  ; xmm1 = a
  ; xmm2 = MASK
  xorps xmm0, xmm1
  andps xmm0, xmm2
  xorps xmm0, xmm1

I’ve long regarded the 2nd sequence as better, but I’m no longer sure. Take a look at the DAGs for the two sequences:

Obvious Method
Xor Method

The xor method avoids destroying the mask, but requires three serially-dependent instructions. The “obvious” method provides parallelism, but destroys the mask.

This is an example of a classic compiler phase-ordering problem. Optimal instruction selection depends on knowledge of the “liveness” of the mask variable.

More Stupid x86 Assembly Tricks

A couple of weeks ago I went hunting for a better way to compute x!=0 on x86. Eventually, I came up with a cute carry-flag trick and blogged about it.

(Note: I’m not branching on this comparison — that would be easy. Instead I want the value of the comparison in a general-purpose register. I should have made this explicitly clear in my original post. Alas, I did not. Doh.)

My goal was to avoid using setcc, because partial-register writes are the devil.

Try as I might, I couldn’t imagine a way generalize my solution so that it would also work for x==0. Someone suggested that I try the GNU superoptimizer (PDF, code), so I did.

At first I was a bit disappointed that the superoptimizer didn’t discover my sequence for x!=0. I think, maybe, the cost heuristics are outdated. (It should model xor reg,reg as being really cheap†.)

Turns out that the superoptimizer is still really clever anyway. It was a source of some great ideas. I’m delighted with what “we” came up with for x==0:

Old method; naive and literal:

85 c9           test     ecx, ecx
0f 94 c0        sete     al
0f b6 c0        movzx    eax, al

New method:

31 c0           xor      eax, eax
83 f9 01        cmp      ecx, 1
11 c0           adc      eax, eax

Once again the new method avoids the setcc and thus avoids insert semantics. As a nice bonus, we save a byte of code.

†This doesn’t actually depend on the input register at all. It’s essentially a “load-zero” instruction. Modern processors understand this and schedule accordingly.

Programmer Productivity, Again

Steve McConnell has just recently blogged on the perennial subject of programmer productivity. In his article he supports the “10x” hypothesis, a fact which should surprise no one (His blog is called “10x Software Development” ).

I’ve also blogged about this previously, and although I am personally uncomfortable with “10x”, McConnell has clearly done his homework. The article includes a detailed list of references.

I guess it’s time to get reading.

PS: I find Kaitlin Duck Sherwood’s musings on this subject to be quite compelling.

Maybe on April 1st I’ll rename this blog “1/10th Software Development”

Fast x86 Integer to Boolean

Consider the following C code:

   int int_to_bool( int i )
      return i == 0 ? 0 : 1;

If you run this thru your favorite x86 compiler, there’s a good chance you’ll see either a setcc instruction, or a cmovcc instruction. The latter was added in Pentium Pro, and thus cannot always be generated. The former has the unfortunate requirement that it write a byte register destination, which invokes the dreaded “insert semantics” of x86.

Here’s a sequence I dreamed up that avoids these problems (input in ecx):

   33 DB     xor         eax,eax
   3B D9     cmp         eax,ecx
   13 DB     adc         eax,eax

In this case I’m using cmp in such a way that operand order is important. I need to subtract my input from zero. (The compare instruction is just a subtract which only sets flags.) As it turns out, this sets the carry flag to exactly the answer I want to return. All that’s left is to extract the carry flag, and the quickest way to do that is to perform add-with-carry into a zero.

In summary: 6 bytes of code, 3 simple ubiquitous ALU instructions and — best of all — no merge or partial register stall issues.

Microsoft’s C compiler also does this kind of superoptimzer-inspired bit-twiddly magic for integer absolute value.

The x86 general-purpose registers are divided into sub-registers of varying widths. The ax register, for example, refers to the low 16-bits of the eax register. Similarly al refers to the lower byte. When writing to these sub-registers the upper bits of the register are preserved. This can hurt performance on a modern dynamically-scheduled processor. For more detail:

ooPSLA Impressions

Hey cool! I’ve been quoted on the ooPSLA Impressions page (hint: Ctrl-F Santaniello). Bonus points if you can find the picture of the back of my head.

OOPSLA 2006 was a hoot. If you haven’t already, check out my trip report.

380 Picoseconds

Please excuse me while I toot my own horn. Take a look at this:

   C:\latency >run
   imul    : 57 - 53 = 4
   lea shl : 56 - 53 = 3
   just lea: 55 - 53 = 2
   just shl: 54 - 53 = 1

That’s right, bitches, I am dynamically measuring the latency of a single x86 instruction — accurate down to one cycle! That’s ~380 picoseconds on my hardware.

This is really hard (impossible?) to do without a serializing read time-stamp counter instruction.