Defective Heat Sinks Causing Garbage Gaming

Posted on August 6, 2013 by brucedawson

Sometimes being hot is not cool.

When Valve’s customers have performance problems in our games we sometimes ask them to send in xperf traces for us to examine. In some cases this lets me find performance bugs that didn’t show up during our testing, and fixing these issues makes our games run faster for everyone.

In some situations, however, what I find is significantly stranger. In at least four cases this year (2013) these traces showed performance problems caused by thermal throttling – CPUs overheating enough that they throttle back their performance in order to stop themselves from melting, and the computers’ owners had no idea.

Update October, 2015: IMHO, if a CPU is advertised as an n-core, X-GHz part then an inability to run all n cores at X GHz is false advertising or defective hardware. If you care about performance then you should buy computers that advertise their specs and you should hold the manufacturers to them. Thermal throttling of phones and ultra-thin laptops may be okay, but it should be documented that they cannot always run at full speed.

I want to share this because I think it’s nerdy cool/horrible, because developers should know about this possibility, and because there may be a lot of people out there whose games are running poorly because of overheating.

Thermal throttling is extremely difficult to detect in xperf traces. It is done automatically by the CPU or the motherboard, and the operating system (OS) doesn’t realize that it is happening. I use the ETW toolset for these investigations but its CPU frequency graphs show the CPU running at full throttle and its power management events say that all is well, and yet…

Thermal throttling confirmations

One might reasonably ask how I know that these computers – customer machines that I have never seen – were being thermally throttled. Before I explain what originally made me suspect thermal throttling I’ll explain the confirmations that I received after doing my analysis.

In the first case the customer was running an AMD Phenom processor. When I told the customer of my suspicions they said that they had recently replaced their processor heat sink. When they put the old heat sink back on their performance problems went away. Myth confirmed!

In the second case the thermal throttling theory was never confirmed but is still strongly suspected. Myth plausible?

In the third case the customer was running an Intel Core i3 550 processor and after analysis suggested thermal throttling I asked them to install and run RealTemp, which shows readings from the CPU’s temperature sensors. The screen to the right shows the results: both cores reached temperatures of 105 degrees Celsius – hotter than the boiling point of water! The “LOG” indicators under Thermal Status indicate that the CPUs were thermally throttled. Once again, myth confirmed.

In the fourth case the customer was running an AMD FX-8150 processor. They had already run temperature monitoring software and it showed reasonable temperatures, but the trace clearly showed that the CPU was being throttled down to 30% of normal speed. This frequency throttling was confirmed using CPU-Z, and when they disabled AMD Turbo Core (thus reducing the maximum CPU speed) the problem went away. CPU throttling confirmed, cause uncertain.

In all cases the customer had initially said that the problem must be with our game because only our game was hitting these mysterious slowdowns. I don’t doubt their story, but this shows how difficult it is to find the cause of a slowdown. I rely on xperf traces because they give me enough information to identify slowdowns through science instead of through guessing.

Thermal throttling signatures

Because the CPU doesn’t tell the OS when it is thermally throttling, there are no direct indicators in an xperf trace that the CPU has been slowed down. The trace will show the frame rate dropping suddenly with the game CPU bound, but that can have many causes. The first suspicious sign is that every part of the frame loop runs more slowly. Normally a frame rate drop is caused by the load on one or two systems increasing, so this sudden across-the-board slowdown is unusual.

UIforETW has several options to make detection of thermal throttling easier. If Intel Power Gadget is installed then UIforETW will record the CPU temperature, and UIforETW also periodically measures the actual CPU frequency.

In order to interpret the data correctly it is important to understand your game’s architecture. Some systems, like rendering, will happily use all available CPU time. Other systems, perhaps game simulation or audio, run a fixed number of times per second with their workload per second roughly constant. So, when a CPU is running more slowly the tasks with a fixed workload will take longer. This leaves less time for rendering, so the amount of time spent in rendering will actually drop. It takes a careful eye to distinguish increased load (such as more simulation or audio work) from a slower CPU.
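To make the fixed-workload argument concrete, here is a minimal Python sketch. It assumes a single-threaded frame loop, and the numbers are made up for illustration (a simulation plus audio load that needs 20% of a full-speed core); the function name is hypothetical and nothing here comes from an actual trace.

```python
def time_budget(fixed_share_at_full_speed, clock_fraction):
    """Split one second of CPU time between fixed-rate work and rendering.

    fixed_share_at_full_speed: fraction of a second the fixed-rate systems
        (simulation, audio) need when the CPU runs at its rated clock.
    clock_fraction: actual clock divided by rated clock (1.0 = not throttled).
    Returns (fixed_share, render_share) as fractions of wall-clock time.
    """
    # The same number of cycles takes proportionally longer on a slower clock.
    fixed_share = min(1.0, fixed_share_at_full_speed / clock_fraction)
    # Rendering gets whatever wall-clock time is left over.
    render_share = 1.0 - fixed_share
    return fixed_share, render_share


# Hypothetical game: simulation plus audio need 20% of a full-speed core.
for clock in (1.0, 0.5, 0.27):
    fixed, render = time_budget(0.20, clock)
    print(f"clock {clock:.0%}: fixed work {fixed:.0%}, rendering {render:.0%}")
```

In this toy model the fixed work grows from 20% of each second to roughly 74% when the clock falls to 27% of normal, so the time spent rendering shrinks even though rendering is the system that would happily use more.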

If a subsystem is taking longer to run than normal this can have many causes: CPU starvation (other threads stealing CPU time), page faults, disk I/O, or other things that would cause the CPU to not run the subsystem’s code. CPU throttling should only be suspected if a subsystem is taking more CPU time – that is, time spent executing code, not just elapsed time. Luckily, the WPA CPU usage graphs show us how much CPU time is consumed. The CPU Usage (Sampled) data shows a statistical estimate of where CPU time was spent, handy for finding which subsystems are consuming CPU time.

(see Xperf for Excess CPU Consumption: WPA edition and The Lost Xperf Documentation–CPU sampling for more on the xperf sampling profiler)

The CPU Usage (Precise) data gives us an extremely accurate measure of how much CPU time our process is consuming in total. This lets us be certain that we are correctly accounting for all CPU time.

(see Xperf Wait Analysis–Finding Idle Time and The Lost Xperf Documentation–CPU Scheduling for more on xperf’s precise CPU usage)

Another reason why the CPU might run our code more slowly is cache contention – other code evicting our code or data from the cache. This would require that there be other code running that is using the cache heavily, and in all cases that I examined it was clear that there was not enough other code running for this to be a problem.

So, we know that the CPU is taking longer to do what we believe is a constant workload. But in order to confirm that this slowdown is caused by the processor running more slowly, we really need something else that has a constant CPU workload so that we can see if it is also running more slowly. Luckily we have just such a thing.

Audiodg

The xperf traces showed a sudden drop in frame rate and a sudden increase in CPU consumption in our game, but that could still represent a bug in our code. An extra vote was needed – something else that should have a constant CPU load so I could see if it also slowed down. I found that in the Windows Audio Device Graph Isolation process, also known as audiodg. This process does audio processing. Its CPU overhead does change occasionally, but in my experience it is very stable during normal gameplay. So, it was very interesting when I noticed that the CPU consumption of audiodg increased by a factor of 2.4 at exactly the same time that our game started slowing down.

Enough talk – here’s a pretty picture:

This is a Windows Performance Analyzer (WPA – the xperf trace viewer) screenshot from an xperf trace which caught a customer’s CPU in the act of being throttled. The blue diamonds at the top represent frame boundaries. The game started at 100-300 fps (a solid line of diamonds) and then plummeted to about 15 fps. The jagged blue graph represents CPU consumption by the game process – the spikes on the right are variations in the number of active game threads, peaking once per frame.

The red graph along the bottom is the CPU consumption of audiodg. You can see, plain as day, that audiodg’s CPU consumption increases significantly (about 2.4x) at exactly the same time that the frame rate drops.

As an additional vote take a look at the green spikes. Those are from sidebar.exe which is updating the clock gadget once a second. After the CPU slows down it takes more CPU time to update the clock. That is what made me certain that the CPU was going slower, because the load on those two processes is quite constant, so if they are taking longer to run it must be because the CPU is slower.

Of course, one explanation for this would be that audiodg and sidebar were the cause of the problem. Correlation is sometimes causation, so maybe they were starving the game of CPU time. That is certainly something that had to be considered, but it was clearly not the case. Audiodg and sidebar went from using about 1% of CPU time to about 2.4% of CPU time – they couldn’t starve anybody.

This graph is from another customer, zoomed in to show the dramatic increase in audiodg CPU usage, from 0.79% of total CPU power to 2.37% – a threefold increase at precisely the time when the frame rate drops. Meanwhile the CPU Frequency graph says that all CPUs are running steadily at 3.6 GHz – but that just isn’t true.
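The reasoning behind these "extra votes" is simple arithmetic, sketched here in Python. The helper names are hypothetical and the inputs are the approximate percentages quoted above. The sketch assumes that each process does the same work before and after the slowdown and that the extra CPU time is due purely to a slower clock.

```python
def cpu_time_ratio(before_pct, after_pct):
    """How many times more CPU time a constant workload needs after the change."""
    return after_pct / before_pct


def implied_clock_fraction(before_pct, after_pct):
    """Clock speed, as a fraction of normal, that would explain the change.

    Assumes the workload is constant and the slowdown is purely due to clock.
    """
    return before_pct / after_pct


# audiodg in the trace from the other customer: 0.79% -> 2.37% of CPU time.
print(round(cpu_time_ratio(0.79, 2.37), 2))
print(round(implied_clock_fraction(0.79, 2.37), 2))

# audiodg and sidebar in the first trace: about 1% -> about 2.4%.
print(round(implied_clock_fraction(1.0, 2.4), 2))
```

If several unrelated constant-load processes all imply a similar clock fraction at the same moment, a slower CPU is a much simpler explanation than each of them independently picking up more work.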

Measuring CPU frequency

Diagnosing these problems is tricky, and I am a fundamentally lazy person so I decided to write some code to make my job easier. I wrote some test code that measures the frequency of a processor. This code starts by creating one high priority thread for each logical core on the system. Every five seconds these threads wake up. They call a function that does 500,000 dependent integer adds, which on any modern processor should take 500,000 clock cycles. They use QueryPerformanceCounter to time the code and infer the clock frequency. Because this function will sometimes be slowed down by an interrupt I call it seven times and retain the fastest clock frequency, which I emit into the xperf trace. It’s crude, but effective. Here are some actual results from a machine that was hitting performance problems, graphed in Excel:

3.889241
3.897883
3.889241
3.889241
3.889241
3.889241
3.889241
3.889241
3.889241
3.889241
1.050957
1.044076
3.889241

UIforETW contains an updated and improved version of this frequency measurement code, sampling the frequency every three seconds. And, with WPA 10 you can graph the results inside WPA.

I find it remarkable how stable the results are – except when the CPU suddenly and catastrophically dropped to 27% of its previous clock rate, sending the frame rate plummeting. The drop in frequency was perfectly correlated with the game performance drop, increased audiodg CPU usage, and real-time frequency monitoring from CPU-Z.
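As a rough illustration of how readings like the ones listed above could be checked automatically, here is a small Python sketch. It assumes the values are in GHz, uses the median as the "normal" frequency, and picks an arbitrary 90% threshold; the function name is made up, and this is not how UIforETW or WPA process the data.

```python
from statistics import median

# The thirteen samples listed above (assumed here to be GHz).
readings = [3.889241, 3.897883] + [3.889241] * 8 + [1.050957, 1.044076, 3.889241]


def find_drops(samples, threshold=0.9):
    """Return (index, fraction_of_baseline) for samples below threshold * baseline.

    The median is used as the baseline so that a few throttled samples do not
    drag it down. The 0.9 threshold is an arbitrary choice for illustration.
    """
    baseline = median(samples)
    return [(i, s / baseline) for i, s in enumerate(samples)
            if s < threshold * baseline]


for index, fraction in find_drops(readings):
    print(f"sample {index}: {fraction:.0%} of baseline")
```

On this data the two low samples are flagged, both at about 27% of the baseline. A real monitor would also need to know whether the machine was under load at the time, since a frequency drop on an idle machine is legitimate power saving.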

What to do?

If you suspect your PC is not performing as well as it should then it is worth checking to see if your CPU is overheating or otherwise being throttled. There are a number of tools which can help you do this. Keep in mind that RealTemp (Intel only) is the only one that I have used, and I’m not actually endorsing any of these, but here are a few options:

SpeedFan – http://www.almico.com/speedfan.php
RealTemp – http://www.techpowerup.com/realtemp/
CoreTemp – http://alcpu.com/CoreTemp/
Speccy – http://www.piriform.com/speccy
Intel® Extreme Tuning Utility (with thermal throttling graph!) – http://www.intel.com/content/www/us/en/motherboards/desktop-motherboards/desktop-boards-software-extreme-tuning-utility.html
Intel Power Gadget – https://software.intel.com/en-us/articles/intel-power-gadget-20

In the most recent case the temperature monitoring tools insisted that all was well, but the CPU frequency was still dropping during gameplay. It is fine for your CPU’s frequency to drop when your machine is under light load, or when running on battery – that saves a lot of power which can extend your battery life, reduce your power bill, and keep your house cool. It’s also fine if your machine doesn’t stay at its Turbo Boost or Turbo Core frequency (temporarily raised frequencies) for long. However, your CPU should be able to maintain its rated frequency under load. If it cannot then your machine is not behaving correctly, either due to bad design or defective parts. Therefore, even if your machine is not overheating, you may want to try monitoring its CPU frequency to see if it is dropping when your game performance drops. To be precise, if you are running a game that is CPU bound and your CPU frequency drops when game performance drops then the reduced frequency is probably the problem. Many of the temperature monitoring tools can display CPU frequency, or you can try one of these tools:

Intel® Turbo Boost Technology Monitor – download from here
CPU-Z – http://www.cpuid.com/softwares/cpu-z.html

If you suspect your CPU is overheating then there are a few steps that you can try:

Open the case and check for dust, especially on the heat sink, fans, and the vents to the outside. Your CPU can only be cooled effectively if cool air comes from outside the case and is pulled over the heat sink by the fan. Dust can be removed manually or with compressed air.
Make sure your computer is not in an enclosed space. A computer in a stereo cabinet may not get enough cool air.
If you have replaced the heat sink then be sure that it is rated for your processor, is firmly attached and is using the recommended thermal paste. CPUs need to dissipate a lot of heat and tiny obstacles can slow this process.

Unfortunately some computer cases are just designed poorly. If your case is badly designed then your CPU may be trying to cool itself with recirculated hot air. In one test a poorly designed case was fixed by simply adding a plastic tube that directed cool air from the case vent to the CPU fan, and this lowered the CPU temperature by 20-25 degrees Celsius! However, I don’t recommend trying to fix poorly designed cases yourself – buying a case that is properly designed is a better option.

Please share your experiences with finding (and fixing!) overheating problems.

Extrapolating from anecdotes

I’m sure that there are a lot of unsuspecting people with this problem, but I have no idea how many because it’s tough to extrapolate from my highly biased sample to the computing population at large. It’s tempting to write a test that will proactively look for this problem, but since thermal throttling is workload dependent it is impossible for such a test to say whether the games that you play will trigger thermal throttling.

Data from a range of customers playing games showed several percent were being significantly thermally throttled.

I’m hopeful that this post will raise awareness of the issue and that the suggestions will let users detect whether or not they are hitting this problem. In the end I suspect that most cases of poor game performance are, in fact, due to bugs in the game or other less esoteric causes – but sometimes you need to look to your hardware.

Code for measuring CPU frequency

The code for measuring CPU frequency can now be found in UIforETW, right here.
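For readers who want to see the shape of the technique without opening UIforETW, here is a rough Python sketch. It is not the UIforETW code: an interpreted loop cannot count clock cycles, so this only yields a relative "adds per second" figure that can be compared over time, and it runs on one thread instead of one high priority thread per logical core. The function names are made up for illustration.

```python
import time


def dependent_adds(count):
    """Chain of integer adds; each add needs the previous result."""
    total = 0
    for _ in range(count):
        total += 1
    return total


def timed_run(count):
    """Seconds taken by one run of the fixed workload."""
    start = time.perf_counter()
    dependent_adds(count)
    return time.perf_counter() - start


def best_rate(count=500_000, runs=7, timer=timed_run):
    """Adds per second from the fastest of several runs.

    Taking the fastest run discards runs that were slowed by interrupts
    or context switches, as described in the post.
    """
    fastest = min(timer(count) for _ in range(runs))
    return count / fastest


if __name__ == "__main__":
    print(f"{best_rate():,.0f} adds per second")
```

Calling `best_rate()` every few seconds and comparing each result against the first one gives a crude relative-speed graph. A sudden, sustained drop while the machine is under load would be the signal to investigate further with a proper tool.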

I hope that some day the ETW code in Windows that provides the CPU frequency will be fixed to detect thermal throttling – and a temperature provider would also be nice. ~~Chapter 14 in the Intel Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1 would be a good starting point…~~ UIforETW now has the ability to measure CPU temperature directly, as long as Intel Power Gadget is installed. Recommended.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048


This entry was posted in Investigative Reporting, Programming, xperf and tagged thermal throttling.

53 Responses to Defective Heat Sinks Causing Garbage Gaming

Tom Forsyth says:

August 6, 2013 at 6:14 pm

Some case designs are inherently broken in unobvious ways. Two completely different machines from different vendors both had a fan pulling air in from underneath – behind the front panel. The only thing keeping this air channel open is the small rubber feet on the bottom of the case. Which works fine sitting on an office desk, but guess what happens when you put it on almost any sort of carpet in your own house? Yup, totally blocked. The solution in both cases was to tear off a cardboard flap from the box it came in and put that under the front feet, raising them enough to get good airflow. I bet very few people even realize this is a problem though.

Andy Simpson (@aiusepsi) says:

August 6, 2013 at 7:09 pm

Would it be practical for you to do anything about this from the Steam end? Having the Steam overlay pop up a warning about high CPU temperatures might be worthwhile.

- brucedawson says:
  
  August 6, 2013 at 7:42 pm
  
  Possible, but there are some challenges. Temperature reading seems to require administrator privileges, and there are a bunch of tricks (different CPUs have their temperatures read differently, sensors may need calibration, may get stuck, etc.) so it’s certainly not an easy problem. Frequency measurements in Steam may work better, but we also have to try to distinguish between legitimate power saving and thermal throttling. It’s something we’re talking about.
  
  - Matias Goldberg says:
    
    August 7, 2013 at 6:52 pm
    
    I agree on the challenges.
    Though, since you work at Valve, you have access to massive survey statistics.
    
    Collect CPU temp readings from all systems, cataloged by CPUID (I don’t know if the motherboard model affects this, though), and record the average, mode, std. dev., min, and max temp readings.
    Then use this data to compare against the user’s reading. Calibration isn’t needed because you’re comparing relative values against records from the same model. You’re not interested in whether it’s reading 70°C or 30°C, but rather in whether it’s 20°C off from the average at a given load.
    
    Still, there’s the problem of sensors getting stuck and admin privileges. Admin priv. can be solved if the feature is optional (with a nice name like “Protect your home” “Take care of your rig if it overheats”; so people feel inclined to enable it).
    Usually sensor issues arise when the application went to or came from hibernation/sleep/turned off monitor; other than that, real sensor problems don’t happen often.
    
    PS. Or do the same with measured frequency instead of temps.
    
    It’s just an idea, certainly ambitious and may not work, but I just felt like saying it 🙂
    
    - brucedawson says:
      
      August 7, 2013 at 10:11 pm
      
      Apparently the safe temperatures vary from CPU to CPU within the same model, just due to factory differences. I think some information about this gets burned into the CPU itself, so it’s accessible, but it’s more finicky stuff to deal with when we really want to be working on games.
      
      Temperature is handy in a tool that customers use because they can see how the CPU temperature responds to load. However I’m not sure it’s the right thing for an automatic monitoring system. For that we really want to know whether the CPU was throttled. Intel sets a bit when this happens, although again it requires administrator privileges to read.
      
      Another route is to measure the frequency. I think my measurements are quite accurate, and their overhead is fairly low, but we still have to distinguish between frequency throttling under load and frequency throttling due to lack-of-load (legitimate power saving). So many little complications.
      
Romulo says:

August 6, 2013 at 7:33 pm

Some years ago I had an Athlon 4000+ with Radeon HD600 graphics; it was THE laptop back then… When TF2 got released I could play the game on max settings but only for a few minutes; the game would be really slow after a while… I discovered after some time that it was due to throttling (back then I used CPU-Z and RightMark CPU Clock). The cheap solution ended up being undervolting and a good amount of Arctic Silver thermal paste. It took me days to find out and I never thought I would hear complaints about that after so many years…

Nathan Reed says:

August 6, 2013 at 8:41 pm

Did you ever determine why the xperf trace didn’t show the change in CPU frequency? What would cause xperf to miss that, while tools like CPU-Z would show the true frequency?

- brucedawson says:
  
  August 6, 2013 at 11:55 pm
  
  The xperf trace displays whatever the ETW POWER provider puts in the trace and my understanding is that this provider records the frequency which the operating system has asked the CPU to run at. Somebody needs to write code to emit ETW events with the *actual* CPU frequency. CPU-Z could do that, but it would be better if Windows did.
  
  I’m not sure but I think that CPU-Z would not detect Intel chips slowing down because they don’t actually lower the clock rate, they just skip a bunch of clocks. But I’m not sure. Tricky stuff.
  
  In short, CPU-Z displays the clock rate accurately because that is its primary job.
  
jrrr says:

August 6, 2013 at 10:15 pm

It’s good to understand the profiler symptoms of throttling, but this seems needlessly indirect. The computer should vocally complain during an emergency throttling event – “Help, I’m overheating!”

- brucedawson says:
  
  August 6, 2013 at 11:56 pm
  
  Yes. That. +100. I will be making that suggestion to Microsoft.
  
SmallStepForMan says:

August 6, 2013 at 10:52 pm

Modern many-core systems actually don’t allow all cores to be used simultaneously. They will temporarily disable a core to allow it to cool down, which means all processing must be transferred to a freshly awoken core. The entire CPU halts while the transfer of registers/caches happens, and the pipelines/translators/predictors for the new core have to be reinitialised. And all of this happens behind the OS’s back …

- brucedawson says:
  
  August 6, 2013 at 11:50 pm
  
  Ummm. That is wrong. A properly cooled many-core system can run all cores at their rated speed, or beyond. I have tested this on many computers. I agree that there are some computers that cannot do this. We call them ‘broken’. I recommend purchasing computers that aren’t broken.
  
  The whole transfer of registers/caches behind the operating system’s back is something I have never heard of. You’ll have to provide references.
  
BinarySplit says:

August 7, 2013 at 1:45 am

I’d like to point out that often developers are more at fault than users.
I’ve encountered a lot of games that have vsync turned off by default. I’ve also encountered quite a few games that will use 100% CPU (or at least 100% of one or two cores) even when minimized. Reckless decisions/oversights like this result in bad experiences for the users. Even if their computer doesn’t overheat, their fans will become noisy, their power bill will be higher, and they may even burn themselves if they’re using a laptop.
You should actually expect that if you run as fast as the CPU/GPU will let you, that a substantial proportion of your customer base will have issues. Consider this: of the 7 discrete GPUs I’ve owned in my life, 5 of them have had overheating issues in some games. Generally the solution has been to turn on vsync, or even vsync @ half refresh rate. I don’t live in a particularly hot climate, and my computers’ cases have been clean and well ventilated. I’ve just had terrible luck with the GPUs that I’ve bought.

- Juha Hiekkamäki says:
  
  August 7, 2013 at 6:20 am
  
  Yes, the sad reality is that running as fast as possible will cause problems on many systems. This is not the developer’s fault, though. It’s a hardware/driver problem, as software shouldn’t be able to cause overheating no matter what it does. It’s not possible to know what is “too much” and how much to cap the performance, as this is something that is not (and shouldn’t be) visible outside. A system that cannot handle the load created by an application is broken.
  
  We have a game that generates FurMark-type temperatures on the GPU. Sadly there is very little we can do about it. Temperature is not a concept that is visible in the APIs. It wouldn’t matter even if it was, as drivers already have the capability to limit performance based on temperature (and other metrics). It doesn’t need interaction with the software.
  
  That said, I do regret shipping our game without vsync on by default. It also took a few patches to add in options to limit the frame rate. Giving users ways to control power usage is good and should be encouraged. Having vsync on by default would have solved some (but certainly not all) of the problems – as well as reducing the default power usage. Still, the fact that artificial caps are needed for stability, and that users have to use trial and error to find settings that work on their system, is a very bad user experience and simply should not be needed.
  
- brucedawson says:
  
  August 7, 2013 at 8:55 am
  
  Many customers don’t want v-sync so I don’t think you can ‘blame’ developers for not having it on by default. I think games should have an option to throttle back to 30 fps (for better thermals and battery life on laptops) but I absolutely think that a CPU/GPU combo that cannot sustain its rated speed on all cores is defective. If it can’t maintain X GHz on n-cores then it’s not an n-core X GHz machine!
  
  Business idea: Resell dual-core CPUs as four-core CPUs but warn that if you use more than two cores it will throttle the others. Or, resell 3.2 GHz CPUs as 3.6 GHz because that’s what their TurboBoost peak frequency is. Genius!
  
  Data point: the last trace I looked at was being thermally throttled even though its average CPU usage was 31% (across four cores/eight threads).
  
  - BinarySplit says:
    
    August 7, 2013 at 12:46 pm
    
    I absolutely agree that hardware that overheats in normal conditions is defective, but the fact of the matter is that there are a lot of defective parts out there.
    
    Often people can’t return their defective products because they don’t actually realize they’re defective, don’t have time to build evidence that it’s definitely that specific component that’s defective, can’t live without their computer for the time it takes to replace a part, or are outside of the part’s warranty period.
    
    You should take “average CPU usage” with a grain of salt on machines with hyperthreading. The first 50% counts for 95% of the processing power and power consumption, the second 50% is just the hyperthreading cores that don’t really do much. In reality, the CPU in that data point is running at about 60% of its maximum power output. It’s definitely defective, but not as bad as your statement suggests.
    
    I don’t think defaulting vsync to off is a smart move. If a user cares about vsync, then they know how to change it. But users who don’t care about vsync often don’t even know why they would want it. The “clueless” part of the market is who you should tailor your default settings to, because if they get a bad frame rate and don’t know what settings to change to fix it, they’ll just quit your game and give you a bad review.
    
    Fun fact: With vsync on, on a 60 Hz monitor, a game that plays at 30 FPS actually looks better than at 45 FPS. If a game is running at 45 FPS instead of 60, the hardware is probably struggling. There is possibly some benefit for gamedevs to detect average frame rates between 35 and 55 FPS and switch to a 30 FPS fixed frame rate.
    
    - brucedawson says:
      
      August 7, 2013 at 1:09 pm
      
      I was wondering if anybody would call me on my 30% usage number — you are correct that 60% is actually more meaningful. In most cases using the ‘unused’ hyperthreads will not significantly increase total instructions-per-second.
      
      I agree with your other points also.
      
Doug Binks (@dougbinks) says:

August 7, 2013 at 7:43 am

Note that if you need to detect GPU throttling of Intel GPUs (which share the power and thermal budget with the CPU cores), take a look at the sample GPU Detect: http://software.intel.com/en-us/vcsource/samples/gpu-detect. I don’t know what the equivalents are for NVIDIA and AMD unfortunately.

Intel’s free Graphics Performance Analyzer also has some useful tools for this type of analysis on both GPU and CPU.

- brucedawson says:
  
  August 7, 2013 at 8:50 am
  
  Thanks for the pointer — I’ll take a look.
  
Aditya Rawat (@eddywebs) says:

August 7, 2013 at 8:22 am

Is there any CPU temperature measurement utility for Mac OS X or Linux?

- Ralph says:
  
  August 7, 2013 at 10:03 am
  
  For Linux you can use the sensors command-line utility, part of lm-sensors. To my surprise the readings are often inaccurate, due to lack of documentation, bugs in the BIOS, etc. It turns out that utilities like SpeedFan paper over these problems.
  
Nate says:

August 7, 2013 at 8:59 am

A question about your code: How do you get Windows to run all the threads at the same time? In my experience, waking up the threads via semaphore will just make them runnable, and Windows will then run them at some (inconstant) time in the future. There is no guarantee that they will run at the same time. Maybe it does not actually matter?

- brucedawson says:
  
  August 7, 2013 at 9:17 am
  
  I just call ReleaseSemaphore(numCores). This isn’t guaranteed, but from examining the traces I can see that it generally works. When it doesn’t it’s because (I think) some of the threads are parked and take a ms or so to wake up. It seems to work very well. On hyperthreaded machines it is best to have both threads running my code, but having the other thread on a core idle during the test should give the same results, by design.
  
Ralph says:

August 7, 2013 at 9:59 am

“In the fourth case the customer was running an AMD FX-8150 processor. They had already run temperature monitoring software and it showed reasonable temperatures…”

This matches my experience. I have an AMD FX-6100 CPU with stock heat sink and an ASRock N68C-GS FX mainboard. The Thermal Throttling setting in the BIOS causes the CPU to massively throttle when the heat sensor reads about 45C, which is easy to hit with only two cores out of six in use!! I couldn’t find a way to change the threshold, so I had to disable it.

Unfortunately, even running a stress test on three cores is enough to trigger CPU shut down (at ~67C, though it is rated for max 70C). This is because the stock heatsink for the FX-6100 is awful. It’s a much cheaper lighter version of the heatsinks for other AMD CPUs.

- brucedawson says:
  
  August 7, 2013 at 6:10 pm
  
  Thanks Ralph — that’s very helpful. I’ll pass this along to the customer — and to AMD.
  
  I wonder why AMD’s maximum temperature is 70C compared to 105C for Intel. Maybe the sensors are measuring different things?
  
  - BinarySplit says:
    
    August 8, 2013 at 1:28 am
    
    Intel have always led in handling overheating. On one model of the Pentium 4, it was actually possible to completely remove the heatsink and the CPU would continue running, albeit very slowly.
    
    I remember back when I was a teenage OC’er, Intel CPUs had in-core temperature sensors, but AMD still used a temperature sensor mounted underneath the socket. This meant AMD chips would report temperatures 20-30 degrees lower than the actual core temperature. Surely AMD must have fixed this by now though. My first thought is that maybe Ralph’s tool is looking at one of the mobo sensors instead of the CPU temp sensor?
    
    Reply
    - Ralph says:
      
      August 8, 2013 at 12:14 pm
      
      AMD chips do have on-chip temperature sensors, but unfortunately the value they report is offset, and it isn’t documented by how much. Also I think that sensor is closer to some cores than others, because it seems quite inconsistent (as tasks bounce from core to core, I assume).
      
      Since it’s been a while I tried reenabling thermal throttling and looking at the numbers. It seems I misremembered a bit. Throttling happens when the on-mobo CPU sensor reads ~57C (steady state), and the sensor in the CPU reads ~42 (which translates to roughly 57C too I think). Still a way lower threshold than needed, and I still hit it with only 2 cores in use (and TurboCore on). It throttles from 3.3GHz down to 1.4GHz.
      
      The k10temp Linux kernel module for reading the on-CPU sensor from recent AMD CPUs claims to report the throttling threshold, but the numbers are nonsense.
      
      Reply
Darth Continent says:

August 7, 2013 at 1:18 pm

I once used a Peltier device cannibalized from an old Coleman cooler to cool my CPU. However, I didn’t anticipate that it would cool so well that condensation would form around the CPU’s pins. After powering up the system I started getting random restarts, and when I tore down the Peltier, heatsink, and CPU, there was a fine layer of moisture surrounding the pins!

Reply
Poolpy says:

August 7, 2013 at 2:00 pm

The function CallNtPowerInformation(ProcessorInformation, …) will return the real-time CPU frequency for each core. It is very handy for finding out whether the CPU is severely throttling or not.

Reply
- brucedawson says:
  
  August 7, 2013 at 3:51 pm
  
  Unfortunately this doesn’t seem to work. I wrote some test code that grabs CurrentMhz every two seconds and calculated the minimum and maximum across all of my CPU cores. The results never changed even as CPU-Z showed my CPU frequency bouncing all over as load changed. CurrentMhz (on 64-bit Windows 7, Intel processors) seems to always show the nominal frequency — 3201 MHz for my desktop, 2201 for my laptop. I’m awaiting results from some people who have throttling issues but I don’t have high hopes.
  
  Reply
Tim says:

August 7, 2013 at 10:09 pm

I discovered a similar issue on a large percentage of the corporate laptops at my last workplace. I was eventually able to prove the problem by running CPU-Z, CoreTemp and Prime95. I could watch the CPU temp rise when Prime95 started and then it would peak and CPU-Z would show the clock speed drop and the temperatures start to drop. At the same time the fan would be going crazy.

This was all caused by the heat pipe failing on the CPU heatsink. I never even considered that a heatsink could fail up until that point in time. Touching the different parts of the heatsink on an overheating laptop and comparing this to a working laptop showed the problem. It also showed me how amazingly well heat pipes work when they aren’t broken.

Reply
KristofMattei says:

August 8, 2013 at 5:50 am

I recall seeing events in the Event Viewer stating that the system firmware has lowered the CPU frequency.

Reply
Mark Santaniello says:

August 10, 2013 at 1:28 pm

Hi Bruce,

Clever stuff.

One problem with your add-loop is that you are assuming a lot about the CPU micro-architecture. It will likely always be a fixed amount of work, but I wouldn’t be surprised if it didn’t always run at 1 cycle per iteration on some architectures under some conditions (alignment, etc.). Just randomly plucking something out of the air here: Bulldozer’s integer clusters have a shared front-end, including fetch. Are you sure you couldn’t be fetch-limited?

I expect you know this, but for the benefit of your readers, there’s a reasonable explanation as to why most of the “normal” methods for reading clock speed don’t show you T-states (thermal throttling) or turbo. In the beginning, we had the TSC (time-stamp counter) and you could just compare that to the real-time clock. A lot of folks used RDTSC not specifically because they cared to know the CPU clock speed, but rather just to have low-latency access to a high-resolution clock. This worked pretty well back when we had single-core machines and no dynamic frequency scaling. Once we got to multi-core and DVFS, people complained that the TSC was unsynchronized across cores (“Time is going backwards?!”) and also that the frequency changed with P-states. So eventually we moved the TSC down to the “uncore” and made its frequency invariant. Now people complain that it doesn’t tell you the true clock speed 🙂

There’s always some internal performance counter that gives the true speed including throttling / turbo, usually with a weird name like “clocks not halted”. On recent Intel you just compare IA32_FIXED_CTR1 (0x30A) against IA32_FIXED_CTR2 (0x30B). I’m sure AMD has something similar. This is likely what the various Windows GUI tools you mentioned are doing under the hood.
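
The arithmetic behind that comparison can be sketched roughly as follows. This is only an illustration, not code from the post: it assumes you already have two samples of each counter (reading them needs privileged access, which is not shown), that the reference counter ticks at the nominal frequency while the core is not halted, and that neither counter wrapped between samples. The function name, the sample values, and the 3.6 GHz nominal frequency are all hypothetical.

```python
def effective_frequency_hz(core_start, core_end, ref_start, ref_end, nominal_hz):
    """Estimate the average unhalted clock speed between two counter samples.

    core_*: samples of an "unhalted core cycles" counter (assumed input).
    ref_*:  samples of an "unhalted reference cycles" counter, assumed to
            tick at nominal_hz whenever the core is not halted.
    """
    core_delta = core_end - core_start
    ref_delta = ref_end - ref_start
    if ref_delta <= 0:
        # No reference ticks means the core was halted (or the samples are bad).
        raise ValueError("reference counter did not advance")
    # The ratio of the two deltas scales the nominal frequency up (turbo)
    # or down (throttling).
    return nominal_hz * core_delta / ref_delta


# Hypothetical samples from a throttled core on a nominally 3.6 GHz part.
throttled = effective_frequency_hz(0, 1_050_000, 0, 3_600_000, 3.6e9)
print(f"{throttled / 1e9:.2f} GHz")
```

Both counters only advance while the core is unhalted, so the ratio describes the clock speed while running, not overall utilization. An idle core produces too few ticks to give a meaningful ratio.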

These things typically require a privileged instruction like RDMSR, so you need a driver. I suppose Valve could distribute one with Steam and have folks opt-in.

In the future, Windows should probably just do this and expose it via ETW if they don’t already.

Reply
- brucedawson says:
  
  August 12, 2013 at 12:43 pm
  
  It’s true that I am assuming a lot about the micro-architecture. On in-order cores my loop would have a different cycle count because the loop control would use some cycles. On the Xbox 360/PS3 CPUs the cycle count would differ even more because they are in-order and they also have a two-cycle latency on integer add.
  
  I’m mostly concerned with out-of-order x86, so the main error in my measurements, I think, is the Bulldozer/hyperthreading shared resources. That may be why I measure slightly less than the expected frequency.
  
  However, on any CPU micro-architecture the results should be consistent (as long as my threads run simultaneously on all cores and as long as at least one of my tests completes without being interrupted), and indeed they are. So, while I should be careful about claiming that I’ve measured the exact CPU frequency, I am pretty confident that the slowdowns I’m seeing are real. That is, when I measure 3.6 GHz it might actually be higher, but when I measure a slowdown from 3.6 GHz to 1.05 GHz I’m certain that that is real.
  
  I hear that Windows 8 has a system for including CPU counters with ETW events, but they ran out of time and didn’t expose a way to use it, so it’s not much use yet. Unhalted core cycles is one possible CPU counter, although you do have to make sure that the CPU that you are measuring is kept busy during the measurement.
  
  Reply
Shane Creamer says:

August 12, 2013 at 4:38 pm

Hey Bruce, great blog and article.

I was thinking there may be a method to determine load for your CPU throttling due to heat vs. idle scenarios.

Since Performance Monitor (Perfmon) data is callable by multiple methods (Perflib APIs, Logman.exe, PowerShell, WMI, etc.), perhaps that could be a processor-vendor-neutral method on Windows (I assume Unix/Linux top has a similar counter?) to determine if there is active load on the OS?

Method #1 – Perfmon’s IO Data Operations/Sec counter to measure load.
In the Windows utility Performance Monitor, the Process object has a counter called IO Data Operations/Sec. It measures the IOps generated by that process. A typical idle workstation will generate 1,000 IOps or less under _Total, while a busy OS (playing a Steam game) will generate 5,000-50,000 IOps depending on the workload type, number of threads, and processor cores that are servicing the load.

Method #2 – Perfmon’s Process % Processor Usage counter to measure load.
As another possible option you could even use the Perfmon counter Process\% Processor Usage. If, say, Half-Life 2 (hl2.exe) is running 4 threads, you could expect to see Process\hl2.exe\% Processor Usage be 300-400% on a 4-core system (you have to divide by the number of cores to get your CPU usage expressed as a system-wide 0-100% with this counter).
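
The division mentioned in the parenthesis is simple enough to sketch. The helper name is hypothetical and the sample figure is illustrative; this only shows the arithmetic, not how the counter itself would be read.

```python
def system_wide_cpu_percent(process_percent, core_count):
    """Scale a per-process value (0 to 100 * core_count) to a 0-100 range."""
    if core_count < 1:
        raise ValueError("core_count must be at least 1")
    return process_percent / core_count


# Illustrative: a process reported at 350% on a 4-core system.
print(system_wide_cpu_percent(350.0, 4))
```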

Just hoping one of these methods might be useful since they are portable, lightweight, non-invasive, and can be called by multiple mechanisms.

Reply
- brucedawson says:
  
  August 12, 2013 at 4:49 pm
  
  I don’t think IOPS will be particularly helpful since (for these purposes) I’m totally focused on CPU work, not disk work.
  
  I’m not sure how to use processor usage for this either, at least not directly. When I saw CPU throttling it didn’t actually change the CPU usage significantly. But, high processor usage together with frequency throttling is interesting. I might also use GetThreadTimes for that.
  
  Reply
  - Shane Creamer says:
    
    August 12, 2013 at 5:10 pm
    
    Actually I/O Data Operations/Sec appears useful for any I/O, not just disk I/O. Just for fun I fired up Steam and ran a Perfmon session against Borderlands 2 and then Left 4 Dead 2 while each was playing, paused, and playing again.
    
    Findings:
    =========
    The IO Data Operations/Sec for both processes took a dramatic tick upward, from typically 200-300 I/Os per second at idle with no disk access (no disk LED light and no activity on Logical Disk (*)\Disk Transfers/Sec) to, while playing the games in a firefight, up to 2,000/sec for Borderlands 2 and about 1,700/sec for Left 4 Dead 2.
    
    The counter for \Process\borderlands2\% Processor Usage on an 8-core system went from 72% when paused to just about 275% during a firefight, or 2.75 cores active at 95%-100%, confirmed by the Processor and Processor Information objects at a mix of about 80% user mode / 20% kernel mode activity.
    
    Just wanted to show that Perfmon *may have* what you are looking for if GetThreadTimes doesn't pan out.
    
    Reply
    - brucedawson says:
      
      August 12, 2013 at 5:43 pm
      
      I’m not sure what “any I/O” means in this context. It might include file I/O (a lot of file I/O doesn’t hit the disk because of the system disk cache), it might include reads/writes through pipes, it might include network I/O, and it might include other things. Unfortunately I can’t think of anything that it might include that would be a good proxy for CPU usage. What do *you* think might be included in the I/O counts?
      
      I really want clock cycles per second, and conveniently enough I can measure that pretty easily.
      
      Reply
      - brucedawson says:
        
        August 12, 2013 at 5:46 pm
        
        Aside: how come perfmon, just like xperfview/WPA, doesn’t document the meaning of its counters? These are counters with very precise meanings and even if the rough concept is obvious from the name, the subtleties (such as what counts as I/O) are not obvious and should be made explicit.
        
        Somebody needs to do an equivalent of my Lost Xperf Documentation series for perfmon.
        
        Reply
Randell Jesup says:

August 16, 2013 at 3:49 pm

Timely article, and really good. (Say hi to Chris Green…)

I just today diagnosed a similar problem running loopback WebRTC tests on my Lenovo W520 (Sandy Bridge Core i7 2820QM laptop, 2.4GHz nominal (max ~3GHz)). I’d be running a call, sitting around 25% in the browser (maybe 35% overall), and every 4-5 minutes for 1 minute the CPU would jump to 60% in the browser, 90-95% overall (and we’d start missing audio deadlines).

Mystifying. I used our internal Gecko profiler and Process Explorer to notice that in one of these episodes, a whole load of threads jumped from around 2.5-3% CPU to 9-10%, and the MediaStreamGraph thread went from 50%-60% idle to close to flat out except for waiting for a D3D10 frame release lock (separate bug) – and that was taking the same time it always was (waiting), while every single function (AEC, Opus encode, etc.) was taking 2-3x as many cycles. xperf traces were more confusing since they didn’t show waits, and I hid the other tasks. And as you show, it said it was a constant frequency, killing my idea of throttling for a time.

The next clue was that Process Explorer showed that a background backup task that sits at a few percent was jumping during these episodes to 10-15%. But killing it had no real impact. So I went back to my hypothesis that somehow throttling was involved, though that made no sense as it was on wall power, set to Max performance, and only 30ish% busy.

CPU-Z initially confirmed the clock was dropping from 2.3-2.4GHz to 800MHz(!). Speccy showed that the CPU temp was slowly rising to ~95C, then throttling and dropping to ~75C before going up to 3GHz for a minute, then 2.3-2.4GHz until the next episode. Later on I had more trouble provoking it (in a less “crowded” and cooler air environment), but it eventually did recur, though this time at around 90C. The weird thing is that at other times it was sitting steadily at 95-97C and near 100% CPU use much of the time doing a compile, and staying at 2.3-2.4GHz all the time.

It’s almost as if it reduced the frequency since it knew the overall use was low enough that it could get away with it, but likely that’s just chance.

So: poor man’s measurement tool: Process Explorer. You can drill into a task or even threads to see if constant-use items bump up all at once.

Reply
- shanecreamer says:
  
  August 16, 2013 at 5:14 pm
  
  Hey Randell, did your Xperf trace capture wait analysis data? If so, when you shifted Xperf into wait analysis mode did you see a difference in DPCs/sec, interrupt time, processor frequency, or processor usage?
  
  If you are not used to xperf wait analysis captures, I can go dig up the blogs that discuss the providers to hook so that you can try it/experiment with it and see if it helps.
  
  I helped identify an interesting storport >16 processor performance issue on a Win2008 R2 SQL server last year using wait analysis, and am now a big fan of it.
  
  Kind Regards,
  Shane Creamer
  
  Reply
  - rjesup says:
    
    August 16, 2013 at 6:58 pm
    
    No – I never saw how to capture wait data (and it would be useful in tracking down the other bug that caused it to wait there; an interlock between two running browsers on D3D10 buffer releases)
    
    Reply
    - brucedawson says:
      
      August 16, 2013 at 10:38 pm
      
      Wait analysis is awesome. It is a critical part of the xperf puzzle. The sampling profiler lets you identify what your threads are doing when they are running, and wait analysis lets you identify why they aren’t running. You need both.
      
      Because I have a blog post for all seasons, here is my article on doing wait analysis:
      
      Xperf Wait Analysis–Finding Idle Time
      
      It hasn’t been updated for WPA, but together with the other WPA articles it should be understandable.
      
      Good job on finding another thermal throttling instance. The excessive temperature together with CPU throttling while under load sounds like it clinches the diagnosis.
      
      Reply
Reşit Şahin says:

August 20, 2013 at 1:01 pm

Hi there,

As a game player I am experiencing this heating problem. I play Football Manager 2013 regularly. I have a Lenovo i7 laptop with 12 GB of RAM and I live in Turkey. The laptop was working perfectly in the winter, but when it is around 30+ C in the summer the fan runs non-stop. As a result I started experiencing very slow gameplay.

I suspect that the CPU slows down for some reason or the OS is working on some background stuff. I am able to get back to faster gameplay but I am not sure what the reason is. I think it can be about heat and also possibly about power saving features of the CPU. The two are related.

I am also a software developer myself. I would like to say that game developers pay more attention to graphics and sound than to the gameplay experience. If the game is lagging at 10 frames per second, do I care about great graphics?

Games could simply reduce the workload when frame rates are low. This would not be very difficult. As an example, it is possible to reduce simulation accuracy in Football Manager 2013, but not for the league you play! So it would be nice to be able to reduce it even further.

Another issue is that my FM 2013 uses only 1 GB of RAM out of 12 GB and it causes page faults. It could be optimised to run faster when memory is available.

I would like to be able to play any game on an i7 + 12 GB laptop without performance problems. The only solution seems to be for game developers to care more about the gameplay experience than about nice graphics.

Reply
- brucedawson says:
  
  August 20, 2013 at 3:05 pm
  
  If you have 12 GB of RAM and FM 2013 only uses 1 G then that is fine. The OS will use much of the rest of the memory for caching of disk data and that will make your system responsive.
  
  The paging that you are seeing is probably soft-faults — mapping pages in and out of FM 2013 without touching the disk. These are harmless. It is only hard faults that are a performance concern. Unfortunately Task Manager does not distinguish between hard and soft faults. You need to use Resource Monitor or xperf to distinguish between them.
  
  It’s not too surprising that your laptop cannot cool itself properly when the room temperature is 30+ C. Cooling systems in laptops have a challenging job at 20 C and probably aren’t specced to maintain full CPU speed at 30+ C. You could try verifying this by monitoring the temperature and CPU speed as discussed in the article. You may also be able to cap the frame rate in order to reduce the load and prevent overheating, or else change the graphical settings and the resolution to something less demanding.
  
  Reply
  - Resit says:
    
    August 20, 2013 at 11:31 pm
    
    Actually I have installed many apps that measure the system and CPU temperature. I have installed the “Intel Turbo Boost” tool that you described. It shows the frequency of the CPU. There was no bar at all when I first opened it. I have also installed another Intel tool which shows the power used by the CPU, and a tool called hardware monitor.
    
    Yesterday, after writing the comment, I wanted to have another try at playing FM and checking system stats. There is one very important and handy Lenovo tool which you can use to choose power profiles. The tool manages the power usage of the WiFi card, the CPU and, I guess, also the GPU. When I choose Energy Saving mode, the CPU uses 5-10 watts. But when I switch to performance mode, the CPU uses 30-40 watts, the CPU and system temperatures go from ~60 C to ~80 C, and the laptop fan starts producing tremendous noise.
    
    I am not sure, but the performance of the game seems to stay the same. Sometimes it uses 100% memory, so I believe that it uses threading to simulate games, but the overall performance does not seem to change much. Anyway, it is working at acceptable levels now.
    
    I would advise anybody to try a power management tool when dealing with high temperatures and game performance.
    
    In 2005 we only worked on source code to improve app performance. But nowadays process synchronisation, cooling and power management seem to be important topics.
    
    Overall I would like to thank you for this nice article. I have learned many things about heat/power-related CPU stuff as a developer.
    
    Reply
    - brucedawson says:
      
      August 21, 2013 at 8:14 am
      
      The 5-10 and 30-40 Watts of power that you mention — is that while the system is idle? If their power management is causing the CPU to use 30-40 Watts when idle then that is a terrible waste. But if it’s causing the CPU to use 5-10 Watts when busy then the CPU is probably running well below its peak rate.
      
      Anyway, it is normal and expected that a busy CPU (running a game) should draw a lot more power than an idle CPU. And there is nothing wrong with a game keeping a CPU busy. You should enable v-sync to stop games from running at ridiculously high frame rates, and you should keep your system cool so that the CPU doesn’t overheat.
      
      Reply
Resit says:

August 22, 2013 at 3:04 am

5-10 watts and 30-40 watts are used when the game is running with CPU usage of 10%-20% across 8 CPU threads.

Reply
- brucedawson says:
  
  August 23, 2013 at 9:09 pm
  
  It sounds like your system has insufficient cooling to run at 30-40 Watts, but 5-10 Watts doesn’t allow enough CPU power to run the game. It sounds like better cooling and/or an intermediate power saving option is needed.
  
  Reply