rdtsc in the Age of Sandybridge

Years ago I wrote a program that would dynamically measure the speed of my CPU. It would record the current number of CPU clock ticks with __rdtsc() (the C/C++ intrinsic for the rdtsc instruction), then wait some period of time (measured with QueryPerformanceCounter()), and then see how many CPU ticks had elapsed during that period. It was fascinating, in a pointless and geeky way, to see how the CPU clock speed changed as the load changed.

rdtsc has changed a lot since then. My Sandybridge CPU is rated at 2.20 GHz and the rdtsc instruction on it returns a rock-solid 2.195 GHz at all times. I could feel ripped off that my 2.20 GHz CPU runs about 0.2% slow, but it turns out that isn’t the full story.

By using CPU-Z or Intel’s overnamed “Intel(R) Turbo Boost Technology Monitor 2.0” I can see that when my CPU is idle it drops to very low speeds, and when it is under heavy load it actually spikes up to around 3.0 GHz (2.993 GHz according to CPU-Z, which is probably more trustworthy than Intel’s potentially biased CPU speed monitor).

The actual peak CPU speed depends on the overall load of the CPU and GPU, which means that if you optimize your code really well then you might push the CPU so hard that it can’t overclock itself quite as much. That sentence just about makes my head explode.

On the code that I care about I find that my Sandybridge CPU usually runs at its peak speed of about 3.0 GHz. I can’t get it to thermally limit itself, even when pegging all four hyperthreaded cores.

What this means is that if I use __rdtsc() to measure how many clock cycles it takes to execute a function I am actually measuring how many ‘standardized’ clock cycles it takes to execute that function. The actual number of ‘real’ clock cycles is likely to be higher by a factor of 3.0 / 2.2, or about 36%. My inner loop in a program I work on occasionally takes about 3.0 cycles to execute as measured by __rdtsc(), but that is actually 4.09 ‘real’ CPU cycles.

Here’s the test code if you want to try this on your computer:

#include <windows.h>
#include <intrin.h>
#include <stdio.h>
#include <tchar.h>

// Use QueryPerformanceCounter to get a nice accurate time-stamp.
__int64 GetQPCTime()
{
    LARGE_INTEGER qpcTime;
    QueryPerformanceCounter(&qpcTime);
    return qpcTime.QuadPart;
}

// Use QueryPerformanceFrequency to interpret the results of GetQPCTime()
__int64 GetQPCRate()
{
    LARGE_INTEGER qpcRate;
    QueryPerformanceFrequency(&qpcRate);
    return qpcRate.QuadPart;
}

int _tmain(int argc, _TCHAR* argv[])
{
    const DWORD msDuration = 1000;
    const int iterations = 6;
    const double qpcRate = (double)GetQPCRate();

    // Measure the CPU speed reported by __rdtsc() when the
    // machine is mostly idle — CPU-Z shows the CPU is running slow.
    printf("Idle CPU speed.\n");
    for (int i = 0; i < iterations; ++i)
    {
        __int64 rdtscStart = __rdtsc();
        __int64 qpcStart = GetQPCTime();
        Sleep(msDuration);
        __int64 rdtscElapsed = __rdtsc() - rdtscStart;
        __int64 qpcElapsed = GetQPCTime() - qpcStart;
        printf("    Clock speed = %1.3f GHz\n", 1e-9 *
                rdtscElapsed / (qpcElapsed / qpcRate));
    }

    // Measure the CPU speed reported by __rdtsc() when the
    // machine is busy — CPU-Z shows the CPU is running fast.
    printf("Busy CPU speed.\n");
    for (int i = 0; i < iterations; ++i)
    {
        __int64 rdtscStart = __rdtsc();
        __int64 qpcStart = GetQPCTime();
        DWORD startTick = GetTickCount();
        for (;;)
        {
            DWORD tickDuration = GetTickCount() - startTick;
            if (tickDuration >= msDuration)
                break;
        }
        __int64 rdtscElapsed = __rdtsc() - rdtscStart;
        __int64 qpcElapsed = GetQPCTime() - qpcStart;
        printf("    Clock speed = %1.3f GHz\n", 1e-9 *
                rdtscElapsed / (qpcElapsed / qpcRate));
    }

    return 0;
}

The output of this program is so consistent that it is boring. The rdtsc instruction ticks along at the advertised speed, regardless of whether the CPU is running faster or slower than that speed.

Idle CPU speed.
    Clock speed = 2.195 GHz
    Clock speed = 2.195 GHz
    Clock speed = 2.195 GHz
    Clock speed = 2.195 GHz
    Clock speed = 2.195 GHz
    Clock speed = 2.195 GHz
Busy CPU speed.
    Clock speed = 2.195 GHz
    Clock speed = 2.195 GHz
    Clock speed = 2.195 GHz
    Clock speed = 2.195 GHz
    Clock speed = 2.195 GHz
    Clock speed = 2.195 GHz

When running the first loop (the idle loop) the Intel monitoring tool shows:

image

and when running the second loop (the busy loop) it shows:

image

When running the first loop (the idle loop) CPU-Z shows:

image

and when running the second loop (the busy loop) it shows:

image

If you need a consistent timer that works across cores and can be used to measure time then this is good news. If you want to measure actual CPU clock cycles then you are out of luck. If you want consistency across a wide range of CPU families then it sucks to be you.

Update: section 16.11 of the Intel System Programming Guide documents this behavior of the Time-Stamp Counter. Roughly speaking it says that on older processors the clock rate changes, but on newer processors it remains uniform. It finishes by saying, of Constant TSC, “This is the architectural behavior moving forward.”

Wikipedia mentions how to check for Constant TSC on Linux.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x faster. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And juggle.

17 Responses to rdtsc in the Age of Sandybridge

  1. Pingback: Coherent Labs » Timestamps for performance measurements

  2. Pingback: Timestamps for performance measurements | Prog stuff

  3. KJB says:

    Divide frequency by stock multi (CPU MSR) -> gets bus speed -> multiply by current multi (also a CPU MSR) -> CPU frequency in realtime.

  4. E.L. Wisty says:

    Reblogged this on Pink Iguana.

  5. Some old games use this exact same technique of yours to calculate the CPU speed to set up settings and to calculate timings.

    Then… they break on Turbo Boost.

    Any idea on how to make those apps get the Turbo Boosted clock?

    • brucedawson says:

      If a game is so sensitive to the clock speed that it breaks when the CPU runs 10-20% faster than expected then you are probably doomed. You could try turning off TurboBoost, but even that might be insufficient because they are presumably extremely sensitive to CPU *performance*, which will necessarily vary depending on load, especially in the presence of hyperthreads.

      Sorry, no idea.

  6. Alexander S says:

    Bruce – did you ever come across cases where the cost of QueryPerformanceCounter varies across boxes with no rhyme or reason? I am looking at two systems right now, and a test program (that just continually calls QueryPerformanceCounter) indicates that the cost of the call is 6-7 times greater on one box (even though the box is just somewhat worse – like 2.5 vs 3.3 GHz).

    BIOS on the box does not seem to allow direct configuration of the HPET, but I would assume it is the same on both boxes (some Dell PowerEdges)

    I also saw that the cost of calls to QueryPerformanceCounter is quite high when running in a virtual machine environment, but there it is at least somewhat understandable, since it requires emulation or at least a few indirections. I can’t find a good explanation of the differences for straight-to-metal boxes…

    Our code does a lot of self-measurements using QueryPerformanceCounter, and the rate of calls is very significant.

  7. brucedawson says:

    I’ve never investigated. You could try grabbing an ETW profile to see if the CPU sampling reveals anything, but it may not.

    • Alexander S says:

      I did – it just points me to the HalpHpetProgramRolloverTimer function in hal.dll, and that’s about it. But I guess I should compare traces of the good box and the bad box…

      • brucedawson says:

        You can group samples by module/function/address, drill down to HalpHpetProgramRolloverTimer, then sort by address to see a rough heat map of where time is spent. You can then compare to a disassembly of the function from a live kernel debugging view.

        Maybe there’s a memory-mapped read or write which is much slower on some systems. Just a random guess.

        • Alexander S says:

          Interesting tidbit… “What QueryPerformanceCounter actually does is up to the HAL (with some help from ACPI). The performance folks tell me that, in the worst case, you might get it from the rollover interrupt on the programmable interrupt timer. This in turn may require a PCI transaction, which is not exactly the fastest thing in the world. It’s better than GetTickCount, but it’s not going to win any speed contests. In the best case, the HAL may conclude that the RDTSC counter runs at a constant frequency, so it uses that instead”

        • Alexander S says:

          Ha!
          You even commented there!
          http://blog.strafenet.com/2014/08/07/use-the-source-part-2-queryperformancecounter/

          So, depending on TscQpcEnabled, QueryPerformanceCounter either directly calls rdtsc (which is cheap), or makes a system call into the kernel (which, I assume, is really expensive if called too often)…

          Now, the big question is how do I find out the value of TscQpcEnabled on a particular system, and whether or not there is a way to affect it…

          • brucedawson says:

            Hah, I did – although apparently I was unable to learn much from that article. Pity.

            BTW, you say:

            > It’s better than GetTickCount, but it’s not going to win any speed contests.

            GetTickCount is actually very fast – it just has to read a memory location, I believe, that is incremented by the timer interrupt. It’s just not very precise.

          • Alexander S says:

            Nah, those were not my words, I was citing some blog post. They are saying that the rollover interrupt is better than GetTickCount (better resolution), but the rollover interrupt is way slower than GetTickCount due to PCI access and other such silliness.

  8. Is there a way to access TscQpcEnabled and reset it?
