Archive for the 'AMD' Category


My employer launched a new branding strategy today:

… It’s the power of Fusion. … It’s where customer needs, dreams and desires bond with our own passion for engineering.

Uhh, sounds like someone needs to take a cold shower. I just don’t get it.

Good thing I’m not in marketing, I guess.

Revisiting Fast SSE Select

I started thinking about this SSE select operation again, probably round about the time I learned of AMD’s SSEPlus and its logical_bitwise_choose function.

(Awesome library, terrible function name. But I digress…)

There are two alternatives at hand. One is the classic, obvious method:

  ; ((a & MASK) | (b & ~MASK))
  ; xmm0 = MASK
  ; xmm1 = a
  ; xmm2 = b
  andps  xmm1, xmm0
  andnps xmm0, xmm2
  orps   xmm0, xmm1

The other method uses fancy bit-twiddling:

  ; (((a ^ b) & MASK) ^ b)
  ; xmm0 = a
  ; xmm1 = b
  ; xmm2 = MASK
  xorps xmm0, xmm1
  andps xmm0, xmm2
  xorps xmm0, xmm1

I’ve long regarded the 2nd sequence as better, but I’m no longer sure. Take a look at the DAGs for the two sequences:

[Figure: dependency DAGs for the obvious method and the xor method]

The xor method avoids destroying the mask, but requires three serially-dependent instructions. The “obvious” method provides parallelism, but destroys the mask.

This is an example of a classic compiler phase-ordering problem. Optimal instruction selection depends on knowledge of the “liveness” of the mask variable.
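To convince myself the two formulas really compute the same thing, here’s a minimal scalar sketch in C. It models the bitwise logic on a 32-bit word rather than on an SSE register, and the function names are my own inventions for illustration, not anything from SSEPlus.

```c
#include <assert.h>
#include <stdint.h>

/* "Obvious" select: bits of a where mask is 1, bits of b where mask is 0.
   The two ANDs are independent of each other. */
uint32_t select_obvious(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & mask) | (b & ~mask);
}

/* Xor select: same result, but each step needs the previous step's output. */
uint32_t select_xor(uint32_t a, uint32_t b, uint32_t mask)
{
    return ((a ^ b) & mask) ^ b;
}

/* Spot-check that both forms agree on a few mask patterns. */
void check_select(void)
{
    const uint32_t a = 0xAAAA5555u, b = 0x12345678u;
    const uint32_t masks[] = { 0u, 0xFFFFFFFFu, 0xFFFF0000u, 0x0F0F0F0Fu };
    for (int i = 0; i < 4; i++)
        assert(select_obvious(a, b, masks[i]) == select_xor(a, b, masks[i]));
}
```

The data dependencies are visible in the source: the obvious form has two independent ANDs feeding an OR, while the xor form is a straight chain of three operations.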

No, Donald Knuth, Half the TLB is Not “Wasted”

Donald Knuth is a genius. If he suffered significant head trauma, was sleep-deprived for days, and then pounded 10 shots of Jager in 10 minutes while holding his breath, his pinky toe could still beat me at Scrabble. I mean, the man is so smart that I am legally retarded by comparison. I should be arrested for even speaking his name.

Donald Knuth apparently gave a Valentine’s Day lecture, and at least one fella came away with this in his notebook:

He talked about how most userland programs were still 32-bit and how operating systems and processors had moved on to 64-bits and made the point that in this scenario, half of the bits in the TLB were being wasted.

Hmm, no. That just sounds wrong.

A TLB is this pretty wicked gizmo that performs address translations literally in one nanosecond via the magic of caching and SRAM. I’m not a hardware guy, but I know that this SRAM crap doesn’t come cheap. Would anyone ever waste it? I think not.

AMD64 chips employ a simple canonical address rule which limits the effective virtual address space to 48 bits while allowing for future expansion. The architecture, in turn, caps physical addresses at 52 bits.

It’s pretty clear that we didn’t invent this idea. The MMU on the DEC Alpha, for example, left the upper 21 address bits unused.

So, sure, the page table entries themselves hold 64 bits, but the hardware need only map subsets, and I’ll bet you lunch that the TLB’s SRAM structures are sized for this requirement.
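For the curious, the canonical address rule can be sketched in a few lines of C. This is only my illustration of the 48-bit case (the function name is made up): an address is canonical when bits 63 through 47 are all zeros or all ones.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if va is canonical for a 48-bit virtual address space,
   i.e. bits 63..47 are either all zeros or all ones. */
int is_canonical48(uint64_t va)
{
    uint64_t top = va >> 47;            /* the 17 bits 63..47 */
    return top == 0 || top == 0x1FFFFu;
}
```

Since the upper bits carry no extra information, a translation structure only has to look at the low 48 bits. With 4 KiB pages that leaves 36 bits of virtual page number to match.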

More Stupid x86 Assembly Tricks

A couple of weeks ago I went hunting for a better way to compute x!=0 on x86. Eventually, I came up with a cute carry-flag trick and blogged about it.

(Note: I’m not branching on this comparison — that would be easy. Instead I want the value of the comparison in a general-purpose register. I should have made this explicitly clear in my original post. Alas, I did not. Doh.)

My goal was to avoid using setcc, because partial-register writes are the devil.

Try as I might, I couldn’t imagine a way to generalize my solution so that it would also work for x==0. Someone suggested that I try the GNU superoptimizer (PDF, code), so I did.

At first I was a bit disappointed that the superoptimizer didn’t discover my sequence for x!=0. I think, maybe, the cost heuristics are outdated. (It should model xor reg,reg as being really cheap†.)

Turns out that the superoptimizer is still really clever anyway. It was a source of some great ideas. I’m delighted with what “we” came up with for x==0:

Old method; naive and literal:

85 c9           test     ecx, ecx
0f 94 c0        sete     al
0f b6 c0        movzx    eax, al

New method:

31 c0           xor      eax, eax
83 f9 01        cmp      ecx, 1
11 c0           adc      eax, eax

Once again the new method avoids the setcc and thus avoids insert semantics. As a nice bonus, we save a byte of code.
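If the flag juggling isn’t obvious, here is a rough C model of the new sequence. The helper names are mine, and the carry flag is simulated with an ordinary variable: cmp sets carry when the unsigned subtraction borrows, and ecx - 1 borrows only when ecx is zero.

```c
#include <assert.h>
#include <stdint.h>

/* Carry flag after "cmp a, b": set when the unsigned subtraction a - b borrows. */
static uint32_t carry_of_cmp(uint32_t a, uint32_t b)
{
    return a < b;
}

/* Models: xor eax,eax / cmp ecx,1 / adc eax,eax */
uint32_t is_zero(uint32_t ecx)
{
    uint32_t eax = 0;                     /* xor eax, eax */
    uint32_t cf  = carry_of_cmp(ecx, 1);  /* cmp ecx, 1   */
    eax = eax + eax + cf;                 /* adc eax, eax */
    return eax;
}
```

The adc into a zeroed register simply copies the carry flag into bit 0 of eax, which is the whole point.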

†This doesn’t actually depend on the input register at all. It’s essentially a “load-zero” instruction. Modern processors understand this and schedule accordingly.

Fast x86 Integer to Boolean

Consider the following C code:

   int int_to_bool( int i )
   {
      return i == 0 ? 0 : 1;
   }

If you run this thru your favorite x86 compiler, there’s a good chance you’ll see either a setcc instruction, or a cmovcc instruction. The latter was added in Pentium Pro, and thus cannot always be generated. The former has the unfortunate requirement that it write a byte register destination, which invokes the dreaded “insert semantics” of x86.

Here’s a sequence I dreamed up that avoids these problems (input in ecx):

   33 C0     xor         eax,eax
   3B C1     cmp         eax,ecx
   13 C0     adc         eax,eax

In this case I’m using cmp in such a way that operand order is important. I need to subtract my input from zero. (The compare instruction is just a subtract which only sets flags.) As it turns out, this sets the carry flag to exactly the answer I want to return. All that’s left is to extract the carry flag, and the quickest way to do that is to perform add-with-carry into a zero.

In summary: 6 bytes of code, 3 simple ubiquitous ALU instructions and — best of all — no merge or partial register stall issues.
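Here’s a small C sketch that mirrors the three instructions one at a time and sits next to the reference version for comparison. The function names are hypothetical, and the carry flag is just a local variable standing in for the real thing.

```c
#include <assert.h>
#include <stdint.h>

/* Reference version: the plain C from the text. */
int int_to_bool_ref(int i)
{
    return i == 0 ? 0 : 1;
}

/* Models: xor eax,eax / cmp eax,ecx / adc eax,eax */
int int_to_bool_carry(int i)
{
    uint32_t ecx = (uint32_t)i;
    uint32_t eax = 0;           /* xor eax, eax */
    uint32_t cf  = eax < ecx;   /* cmp eax, ecx: 0 - ecx borrows unless ecx == 0 */
    eax = eax + eax + cf;       /* adc eax, eax */
    return (int)eax;
}
```

Negative inputs work too: as an unsigned value they are large and nonzero, so the subtraction from zero still borrows.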

Microsoft’s C compiler also does this kind of superoptimizer-inspired bit-twiddly magic for integer absolute value.
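I haven’t reproduced the compiler’s actual output here, so treat the following as an assumption about the general shape of the trick: one well-known branchless absolute value builds an all-ones or all-zeros mask from the sign, then does a conditional two’s-complement negate with xor and subtract. The function name is mine.

```c
#include <assert.h>
#include <stdint.h>

/* Branchless absolute value. Returns unsigned so that the most negative
   input has a representable result. */
uint32_t abs_branchless(int32_t x)
{
    uint32_t ux   = (uint32_t)x;
    uint32_t mask = 0u - (uint32_t)(x < 0);  /* all ones if negative, else zero */
    return (ux ^ mask) - mask;               /* flip bits and add one when negative */
}
```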

The x86 general-purpose registers are divided into sub-registers of varying widths. The ax register, for example, refers to the low 16 bits of the eax register. Similarly, al refers to the low byte. When writing to these sub-registers, the upper bits of the register are preserved. This can hurt performance on a modern dynamically-scheduled processor, because the write must be merged with the register’s previous contents.
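A tiny C model makes the “insert semantics” concrete. These helpers are purely illustrative: the new value of eax is a function of the old value, which is exactly the dependency that a full 32-bit write would not have.

```c
#include <assert.h>
#include <stdint.h>

/* Writing al: the low byte is replaced and the upper 24 bits of the old
   eax are kept, so the result depends on the previous value of eax. */
uint32_t write_al(uint32_t old_eax, uint8_t value)
{
    return (old_eax & 0xFFFFFF00u) | value;
}

/* Writing ax: the low 16 bits are replaced and the upper 16 are kept. */
uint32_t write_ax(uint32_t old_eax, uint16_t value)
{
    return (old_eax & 0xFFFF0000u) | value;
}
```

For example, write_al(0x12345678, 0xAB) gives 0x123456AB: the 0x123456 part has to come from whatever instruction last wrote eax.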

Linux-Based In-Flight Entertainment

Delta’s in-flight entertainment system runs Linux. Notice Tux below?

[Image: Delta’s in-flight Linux boot screen]

I was lucky enough to witness a couple of reboots on a flight from Atlanta. The boot spew also indicates that they use an AMD Geode processor.

Native Quad Core

If Intel’s stuff is quad core, then two motorcycles bolted together are a car.

[Image: Barcelona and Kentsfield]

Power Saving with the Athlon MP

My server uses a pair of older Athlon MP chips which do not support PowerNow. Since this computer runs 24/7, I suspect it’s responsible for a substantial chunk of my monthly power bill. Today I discovered a way to reduce the power consumption.

First, I used my trusty Kill-A-Watt to get a baseline measurement. The server draws roughly 207 W at idle:

Before: 207 Watts

Now for the fun part. I applied this ACPI kernel patch to my kernel and built the amd76x_pm module. This patch enables the ACPI C2 and C3 processor states, which for some reason are otherwise disabled.

A quick reboot and modprobe later, I was greeted with this:

After: 113 Watts
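A back-of-the-envelope sketch of what that difference means over a year, using only the two readings above. It assumes the box sits at idle 24/7 for all 365 days, which is my simplification, and the helper names are made up.

```c
#include <assert.h>

/* Idle wattage readings from the Kill-A-Watt. */
enum { WATTS_BEFORE = 207, WATTS_AFTER = 113 };

/* Reduction in idle power draw, in watts. */
int watts_saved(int before, int after)
{
    return before - after;
}

/* Energy saved over a year of round-the-clock idle, in watt-hours. */
long annual_wh_saved(int before, int after)
{
    return (long)watts_saved(before, after) * 24L * 365L;
}
```

That works out to 94 W less at idle, or roughly 823 kWh per year under the always-idle assumption.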

There are a handful of warnings on the net about crashes, etc., when using the amd76x_pm code. So far, so good for me, but I’ll update this post if I experience any instability.

Update, Sunday 4:30 PM: Kernel Panic. Doh! Reboot and cross fingers…

Update, Monday 6:30 AM: Another crash. Although I didn’t mention this previously, I also enabled CONFIG_HIGH_RES_TIMERS and CONFIG_NO_HZ (Dynamic Ticks) in my new kernel. Maybe one of those is the culprit. I’m going to do the easiest thing, and pull amd76x_pm.ko for a while. We’ll see how it goes.

380 Picoseconds

Please excuse me while I toot my own horn. Take a look at this:

   C:\latency >run
   imul    : 57 - 53 = 4
   lea shl : 56 - 53 = 3
   just lea: 55 - 53 = 2
   just shl: 54 - 53 = 1

That’s right, bitches, I am dynamically measuring the latency of a single x86 instruction — accurate down to one cycle! That’s ~380 picoseconds on my hardware.

This is really hard (impossible?) to do without a serializing read time-stamp counter instruction.
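The arithmetic behind those numbers is simple enough to sketch in C. The measurement harness itself isn’t shown here, so this only covers the bookkeeping: subtract the timing overhead, then convert cycles to time. The 2600 MHz figure used in the example is my assumption, inferred from the ~380 ps cycle time, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Latency in cycles: measured count minus the fixed timing overhead. */
uint64_t net_cycles(uint64_t measured, uint64_t baseline)
{
    return measured - baseline;
}

/* Duration of `cycles` clock cycles in picoseconds at `mhz` MHz.
   One cycle at f MHz lasts 1,000,000 / f picoseconds (truncated). */
uint64_t cycles_to_ps(uint64_t cycles, uint64_t mhz)
{
    return cycles * 1000000u / mhz;
}
```

At an assumed 2600 MHz, one cycle comes to about 384 ps, and the 4-cycle imul result to about 1.5 ns.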

You’re Welcome

[Image: AMD64 Inside]