Surprising Effects of Volatile Qualifier

The classic semantics of C/C++ volatile are: “Please, Mr. Compiler, do exactly the reads and writes I specify.” Consider the following program:


   int x = 0;
   void foo( int c )
   {
      while( --c )
         x += c;
   }

In this case x is just a regular integer, so the compiler is free to register-allocate it, generating this:


      sub     ecx, 1
      je      end ; zero-trip loop
      mov     eax, DWORD PTR x
   looptop:
      add     eax, ecx
      sub     ecx, 1
      jne     looptop
      mov     DWORD PTR x, eax
   end:

Notice how x is loaded before the loop begins, allocated to the register eax, and has its final value stored back to memory after the loop. Now let’s see what happens when x is qualified as volatile:


      sub     ecx, 1
      je      end ; zero-trip loop
   looptop:
      add     DWORD PTR x, ecx  ; add-to-memory
      sub     ecx, 1
      jne     looptop
   end:

We are now doing a so-called read-modify-write instruction inside the loop body: the add is performed directly to (and from) memory.
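For reference, the volatile variant of the source is identical except for the declaration of x; a minimal restatement:

```cpp
// Same function as before, but x is now volatile, so the compiler
// must perform a real memory access for every read and write of x.
volatile int x = 0;

void foo(int c)
{
    while (--c)
        x += c;   // emitted as an add-to-memory instruction on x86
}
```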

In Visual Studio 2005 the semantics of the volatile qualifier were expanded to reflect modern use of the keyword in multi-threaded code (for an example, see the Wikipedia entry for Double-Checked Locking). This can lead to some surprising behavior:


   volatile unsigned x;
   unsigned y;
   void foo( int c )
   {
       while( --c )
          x ^= y;
   }

The previous version of Visual Studio (2003) generates this:


      mov     eax, DWORD PTR c
      dec     eax
      je      end ; zero-trip loop
      mov     ecx, DWORD PTR y
   looptop:
      mov     edx, DWORD PTR x
      xor     edx, ecx
      dec     eax
      mov     DWORD PTR x, edx
      jne     looptop
   end:

Since x is volatile, it is loaded and stored inside the loop body. The other global, y, is enregistered outside of the loop. This is very sensible. Look what happens when we use VS 2005:


      mov     eax, DWORD PTR c
      sub     eax, 1
      je      end
   looptop:
      mov     ecx, DWORD PTR y
      xor     DWORD PTR x, ecx
      sub     eax, 1
      jne     looptop
   end:

VS 2005 generates the xor-to-memory for x, which is probably an improvement over the old discrete load and store. More important is what has happened to y. It is now repeatedly loaded from memory inside the loop, despite being totally non-volatile!
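If the reload of y matters for performance, one workaround (my sketch, not from the original post) is to hoist the non-volatile load by hand into a local variable, which the compiler can freely enregister:

```cpp
volatile unsigned x;   // every access must touch memory
unsigned y;

void foo(int c)
{
    unsigned t = y;    // load y once, outside the loop
    while (--c)
        x ^= t;        // only the volatile x is read/written per iteration
}
```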

There is a long, complex explanation behind this which I will save for another day. For now, just be aware that the volatile qualifier in VS 2005 has extra-standard behavior that can impair the compiler’s ability to optimize your code.
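The double-checked locking pattern mentioned above is the kind of multi-threaded code these extended semantics were meant to support. A rough sketch, using modern std::mutex for brevity (VS 2005 itself predates it; the names Widget, g_lock, and instance are illustrative only):

```cpp
#include <mutex>

class Widget { };   // illustrative payload type

std::mutex g_lock;
Widget* volatile g_instance = nullptr;  // under VS 2005: volatile read = acquire,
                                        // volatile write = release

Widget* instance()
{
    if (!g_instance) {                       // first check, no lock taken
        std::lock_guard<std::mutex> guard(g_lock);
        if (!g_instance)                     // second check, under the lock
            g_instance = new Widget;         // release store publishes the object
    }
    return g_instance;
}
```

The acquire/release ordering is exactly what makes this pattern safe under VS 2005's extended volatile, and exactly what forces the compiler to be so conservative around any volatile access.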

Use it with caution.


2 Responses to “Surprising Effects of Volatile Qualifier”


  1. Samuel A. Falvo II, August 3, 2007 at 2:31 pm

    Although I am a huge fan of the RISC concept, registers really are overrated. Since a register is just an explicitly named level-0 data cache, it follows that effective addresses into memory should be just as fast as a register hit (ignoring logic propagation delays, obviously — I’m not THAT dense!). It only takes a 2-read-1-write data cache interface, which can be effectively emulated by a cache driven by a CPU clock twice as fast as the execution core’s pipeline.

    Yes, it is more transistors than a normal register array. But, then, the Intel philosophy is to just throw more transistors at things anyway. For all we know, Intel and AMD processors are already doing precisely this, which is likely why your compilers elected to perform “to-memory” operations instead of “to-register” operations.

  2. Mark, August 4, 2007 at 12:17 am

    Gosh, where do I begin?

    A register and a cache are different.

    You have to ask a cache, “Do you currently have the value of memory location X?” This is called a probe and it takes time. The answer can come back “no” (miss) or “yes” (hit). Modern CPUs require something on the order of 3 cycles to access their first level cache.

    There are no register file “hits” or “misses”. The code asks for register #10, you go get it. That’s it. It’s always there. Register files can be read/written multiple times in a single clock cycle. A register-to-register move, for example, takes exactly 1 cycle.

    It’s not an accident that today’s CPUs have synchronous L1 caches. It’s not because hardware designers didn’t ever think, “Hey, what if we made the cache run twice as fast?”

    We make everything run as fast as it possibly can. This is measured in picoseconds, which is an absolute measurement. Clock cycles are relative. At 3 GHz a clock is 330 picoseconds. So on my hardware, an ADD takes 330 picoseconds. An L1 cache hit takes 990.

    You can slow your execution units down if you want. You can perform the ADD in 2790 picoseconds, and then claim your cache runs “twice as fast.” I won’t be impressed until you can do a cache probe in less than 990 picoseconds.

    –Mark

    PS: The “to-memory” operations above are due to the keyword “volatile” from the C language. http://en.wikipedia.org/wiki/Volatile_variable





