Chris Hecker is Wrong About OoO Execution

Here is a quote from Chris Hecker at GDC 2005:

Modern CPUs use out-of-order execution, which is there to make crappy code run fast. This was really good for the industry when it happened, although it annoyed many assembly language wizards in Sweden.

I first heard this when Chris was quoted by Pete Isensee (from the XBOX 360 team) in his NWCPP talk a year ago. Maybe Chris was kidding. I don’t know. What I do know is:

  1. He is wrong
  2. Smart people are believing him
  3. It’s time to set the record straight

Processors implement dynamic scheduling because sometimes the ideal order for a given sequence of instructions can only be known at runtime. In fact, the ideal order can change each time the instructions are executed.

Imagine your binary contains the following very simple code:


     mov rax, [foo]
     mov rbx, [bar]

Two loads — that’s all. Let’s assume that each of the loads misses the cache 10% of the time. Often, one will miss but the other will hit. If you have an in-order machine, and the first load misses, you are forced to wait — you cannot proceed to the second load, and you cannot hide any of the miss latency.

No matter how much of an uber assembly coder you are, you are going to be forced to choose an order for these two loads. More likely, your compiler will make this choice for you. Either way, that choice will be wrong at least some of the time.

An OoO processor can do the right thing every time.
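
To see the same point in C, here is a minimal sketch (the function name sum_two is made up; foo and bar just echo the assembly above): whichever order the compiler emits the loads in is frozen into the binary, even though which load misses can change on every call.

     /* A C analogue of the two-load example above. The names are
      * illustrative only. The compiler must commit to one static order
      * for the two loads; an in-order core stalls if the first one
      * happens to miss, while an OoO core can let whichever load hits
      * proceed and hide some of the miss latency. */
     long sum_two(const long *foo, const long *bar)
     {
         long a = *foo;   /* may hit or miss the cache on any given call */
         long b = *bar;   /* independent of the load above */
         return a + b;
     }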


8 Responses to “Chris Hecker is Wrong About OoO Execution”


  1. Nathan Zook June 11, 2007 at 6:19 pm

    I think your example may be making his point. Good coding with a good compiler will minimize the number of occurrences of two successive loads to different cache segments.

    Having said that, OoO allows a processor to execute almost any real code faster by utilizing parallelism in the code. We’re all very stuck on the Lucas model, but this model is fundamentally inefficient. There is no reason to wait to fetch the next instruction until the current one is retired unless it is some sort of conditional branch, and we’ve known for a long time how to place winning bets on that outcome as well. Indeed, most code allows for a substantial amount of parallelism at the instruction level (more if you’re not stuck in a register-poor architecture).

    Having said that, two key points remain. To my knowledge, major manufacturers are addicted to what I call un-forward design. Processors released in 1999 were designed to run SPEC95 really fast. SPEC95 was based on code written in 1993 or 1994. That code was designed to run fast on processors designed in 1990-1991. And THOSE processors were designed….

    Even within an architecture, processors need to be designed to provide compiler writers with a coding environment that allows them to quickly write optimizers that deliver good results. I’m guessing that processor development is slowed 25% or more (taking four years to see three years worth of improvement) because of this.

    For a good example of what I’m talking about, look at the almost ten-year period that it has taken to get SIMD up to speed in x86.

    The other issue is that there is a whole lot of really, REALLY bad code out there that looks okay because OoO allows 50x execution speed. (And yes, I’m counting fetches and stores, including instruction translation and caches. That all enables OoO execution.)

  2. Mark July 6, 2007 at 1:40 am

    No, no, I’m sure I am not making his point. I’m not talking about bank or index conflicts. You have to imagine that these two loads are in a huge program — a veritable sea of other cache accesses.

    He was suggesting that ninja assembly coders know their machines so well that they don’t need no stinkin’ OoO execution. But that is tantamount to saying that, for any sequence of instructions, the ideal schedule can be known statically, prior to runtime — this is just not true! Only at runtime do we actually have enough information to make the correct decision.

    Cache misses are, I think, *the* classic example.

    You have to treat the caches statistically. There are probably umpteen threads in your system, all polluting “your” cache. Even if they aren’t, your typical app is a mess of control flow. The proper schedule for loads A and B might change every 17th execution. Or maybe the proper schedule is data-dependent. Perhaps load A misses on successive even digits of pi 🙂

    Oh and before I forget I want to mention the OoO elephant in the room. Unless your new design comes with a new ISA you are going to run legacy binaries. Today’s ideal static schedule is tomorrow’s mess.

    But, let me take a moment to heartily agree with your comment on “un-forward design”. We build these huge backward-looking trace libraries, and then use them to decide how to build the processors of the future. This just seems twisted and wrong to me.

  3. Nathan Zook July 6, 2007 at 11:52 am

    Are you talking about pipelining or OoO? I see cache misses as the classic case when you want to allow pipelining, but OoO does not occur until you have an instruction dependent on a cache miss that you allow other instructions to flow around.

    What OoO allows you to do would be to interleave instruction streams. So if the controlling load for stream A blocks, then stream B, controlled by the non-blocking load, can still make some progress. If this is what you contemplate, then you at least have a meaningful example.

    Now the question is: how much of a speed difference is there between OoO and pipelined execution of this stream, specifically designed to showcase OoO? The answer is completely dependent upon the probability of cache misses.

    And that IS his point.

    Without OoO, compiler writers (and we few remaining assembly writers) would be very serious about ensuring that our caches DID NOT miss. Moreover, the reduced cache misses take pressure off of the memory bus, which has been a major system bottleneck since the mid nineties.

    Likewise with conditional branches. OoO only wins on branch misses. But again, hard-to-predict branches are (outside of a certain narrow class of computations) a classic sign of bad code.

    As for changes to the ISA, the purpose of a new processor is to outperform the old. While the improvement can dramatically improve the performance of bad code, almost by definition it can only hurt good code if the new design involves things like increasing cache latencies, reducing cache sizes, or increasing branch misses or penalties. These are contrary to the goal of improving performance.

    Note that I am NOT claiming that OoO does not allow even good code to run faster. It does. But its effect on good code is far, far less significant than its effect on bad. Thus the annoyed Swedish wizards. And the claim that its purpose is to run bad code fast.

  4. Mark July 6, 2007 at 1:09 pm

    OK Nathan, help me out here.

    Maybe I’m really talking about dynamic scheduling. Here’s a quote from Patterson’s CS252 slides (emphasis mine):

    Dynamic scheduling:
    Hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior. It handles cases when dependences unknown at compile time. It allows the processor to tolerate unpredictable delays such as cache misses, by executing other code while waiting for the miss to resolve.

    I suppose I have been conflating dynamic scheduling and OoO execution, when in reality they are separate, although related. Can you explain the difference to me? Is it something pedantic about what “execution” means?

    As you suggest, I could have written my example like this, instead:

    
       mul rax, [foo]
       div rbx, [bar]
    

    Now I have some work on the end of those “controlling” loads. Does that help?

  5. Nathan Zook July 9, 2007 at 4:51 pm

    Well, in theory I guess you could have static scheduling with OOE, and even dynamic scheduling without, but that sure looks like “a whole lot of work for nothing” to me.

    Scheduling and execution are different, and the difference becomes important when instructions have various execution latencies. (A good example of this is the lazy Susan result reservation system that the K7 uses in the FPU). But in general, dynamic scheduling is the primary mechanism by which OOE is implemented.

    But yes, your new example is a fine case where OOE might occur–except that division, integer or floating point, is invariably microcoded, and everybody stalls on ucoded instructions! I’m not pointing this out to be pedantic. The fact jumped out at me when I saw your code. It is precisely this sort of knowledge that optimizers (machine and human) have to keep in mind when generating code. Since we are assuming no dependencies between the instructions, we code them the other way around. Then the mul can go first–IF the div ucode does not fill the reservation stations–which it probably will, as integer divide is notoriously slow. (I’ve not seen K8 uarch, so I don’t know for sure…)

    So NOW we can talk about that cache miss. But more importantly, we see that even with OOE, a single cache miss can easily stall a system. Really bad code is likely to be helped less than just bad code for this reason.

    But my purpose in pointing out all of these things is this–a serious optimizer is going to be keeping up with hundreds (or thousands) of like details when selecting the algorithm to implement. The difference between good code and bad is keeping up with all of this when selecting the algorithm. Note that I’m NOT talking about the instructions themselves–really optimized code starts by considering the entire ISA when planning how to attack a problem.

    Since my asm days may be behind me, and I’m not likely to do any interviews soon, I’m going to blow my favorite interview question by way of example:

    “Suppose you are maintaining a piece of code in gcc. Inside a critical inner loop, you need to find the most significant one in a thirty-two bit value. What do you do?”

    The perfect answer is “It depends”. 🙂

    If the hw folks have made our lives easy, they’ve implemented a cntlz instruction that executes in one cycle. We use the gcc asm facility to access it. Unfortunately, this is a rarely-used instruction, so in most implementations (such as the K7), it is done in microcode. (So it is rarely used: see un-forward design.) If this is the case, we need to understand the data.
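
    As a rough sketch of that easy case: a gcc new enough to provide __builtin_clz (which I believe maps to cntlzw on PPC and to a bsr-based sequence on x86) lets you write something like the following; msb_index_hw is just a made-up name.

       /* Sketch of the "let the hardware count it" route, assuming
        * __builtin_clz is available. __builtin_clz(0) is undefined,
        * so the zero case is handled explicitly. */
       static inline int msb_index_hw(unsigned int x)
       {
           return x ? 31 - __builtin_clz(x) : -1;   /* bit index of the top set bit */
       }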

    If each bit is independent, random, and balanced, the fastest way may be to test one bit at a time from the top. This algorithm depends primarily on the branch mispredict penalty, as each branch is 50/50.
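
    A sketch of that top-down scan, with made-up names:

       /* Illustrative top-to-bottom scan: test one bit at a time starting
        * at bit 31. On random, balanced data the cost is dominated by
        * branch mispredictions. */
       static int msb_index_topdown(unsigned int x)
       {
           int bit;
           for (bit = 31; bit >= 0; bit--)
               if (x & (1u << bit))
                   return bit;
           return -1;   /* no bit set */
       }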

    Another method would be to take four bits at a time, and use them to index into a register which decodes them. This gives a 93.75% chance of correct prediction on the branch.
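
    Roughly like this, using a small table as the decoder (illustrative names again):

       /* Illustrative nibble-at-a-time scan: examine four bits per step
        * and decode the top set bit of a non-zero nibble with a 16-entry
        * table. A random nibble is non-zero 15 times out of 16, hence the
        * 93.75% prediction figure above. */
       static int msb_index_nibble(unsigned int x)
       {
           static const signed char top_bit[16] =
               { -1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3 };
           int shift;
           for (shift = 28; shift >= 0; shift -= 4) {
               unsigned int nib = (x >> shift) & 0xFu;
               if (nib)
                   return shift + top_bit[nib];
           }
           return -1;   /* x was zero */
       }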

    If the data is some sort of balanced one-hot indicator, then a divide-and-conquer method is likely best, probably with a four-bit lookup for the final pass.
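
    For instance, something along these lines:

       /* Illustrative divide-and-conquer scan: halve the search range on
        * each step, then finish the last four bits with the same small
        * table. Only three range-halving tests instead of a long chain
        * of single-bit tests. */
       static int msb_index_divconq(unsigned int x)
       {
           static const signed char top_bit[16] =
               { -1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3 };
           int base = 0;
           if (x == 0)
               return -1;
           if (x & 0xFFFF0000u) { base += 16; x >>= 16; }
           if (x & 0x0000FF00u) { base +=  8; x >>=  8; }
           if (x & 0x000000F0u) { base +=  4; x >>=  4; }
           return base + top_bit[x & 0xFu];
       }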

    If this is some sort of exception counter, the most probable value might be 0, followed by 1, and so on. In this case, we examine the data from the bottom up!
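
    For that case, a bottom-up version might look like this (again, an illustrative sketch):

       /* Illustrative bottom-up scan for data where small values dominate
        * (e.g. an exception counter that is usually 0 or 1): the loop
        * exits almost immediately in the common case. */
       static int msb_index_bottomup(unsigned int x)
       {
           int bit = -1;
           while (x) {
               bit++;
               x >>= 1;
           }
           return bit;   /* -1 when x was zero */
       }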

    In other cases, we might know that 99% of the time, the value is between 4 and 60. A custom algorithm could exploit this fact.

    Now suppose that we decide that the algorithm that we want is precisely what the hw folks implemented in microcode. Let the hw do it, right? Wrong. Ucode involves a fetch penalty. It drains the pipe. It holds off other execution units for no cause. The advantage is in code (and possibly data) fetches.

    On PPC processors, for instance, the word copy and register save functions are ignored by compilers for precisely this reason. As a result, I saw a silent fail of one of these instructions actually classified as a “low priority” problem. I concur with that rating, by the way.

    Which doesn’t mean that it is necessarily wrong to use the ucoded function–it means that you have to know everything to get it right.

    Now, suppose a coder chooses the wrong algorithm. That is, he writes bad code. Then all the OOE hw is going to get really busy, and try to pull his bacon out of the fire. His bad code may well end up looking pretty good.

  6. Mark July 9, 2007 at 5:57 pm

    Ok so dynamic scheduling and OoOE are inextricably linked. And neither can be made redundant by Swedish assembly wizards, no matter how wizardly they might be. We just cannot escape the fact that cache misses are a dynamic runtime event.

    I guess I could have shown my two loads as the heads of some dependency chains of (non-microcoded) instructions. Although, how often do we load data and throw it away? I mean, honestly…

    BTW, we don’t get LZCNT until Barcelona [1]. The closest thing we have currently is probably one of the bitscan instructions (BSF, BSR). I’m pretty sure those are vector-path on K8.

    [1] http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf
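
    For what it’s worth, reaching BSR from C through the gcc asm facility might look roughly like this (msb_index_bsr is a made-up name; BSR leaves its destination undefined when the input is zero, so that case is guarded):

       /* Hedged sketch of the BSR route via gcc inline asm. AT&T operand
        * order is source, then destination. */
       static inline int msb_index_bsr(unsigned int x)
       {
           unsigned int i;
           if (x == 0)
               return -1;
           __asm__ ("bsr %1, %0" : "=r" (i) : "r" (x));
           return (int) i;
       }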

  7. Nathan Zook July 9, 2007 at 8:13 pm

    BSR–it’s in my 486 manual. I used it for AMD on the K7. Ucoded divide-and-conquer. cntlz(w|d) on PPC. It’s been a while.

    ===

    But you’re missing my point. My point is that OOE helps bad code much more than it helps good code. Moreover, I’m saying that really good pipelined-only code may well outperform bad code with lots of OOE help because good code is going to load the system generally more efficiently than bad.

  8. Mark July 9, 2007 at 8:48 pm

    I’ll concede that OoOE helps poorly-scheduled code more than it helps well-scheduled code. All code, however, has cache misses, and dynamic scheduling is a big win there.

    I don’t want folks to go around believing that we design and build crutches for lazy programmers.

    But again, we should acknowledge the elephant in the room. We live in a world of binary compatibility. The ISA is king. “Good” code is only good until you run it on another micro-architecture. If the legends are true, this was the impetus behind Tomasulo’s invention in the first place.





