What Outranks Thread Priority?

This investigation started, as so many of mine do, with me minding my own business, not looking for trouble. In this case all I was doing was opening my laptop lid and trying to log on.

The first few times that this resulted in a twenty-second delay I ignored the problem, hoping that it would go away. The next few times I thought about investigating, but performance problems that occur before you have even logged on are trickier to solve, and I was feeling lazy.

When I noticed that I was avoiding closing my laptop because I dreaded the all-too-frequent delays when opening it, I realized it was time to get serious.

Luckily I had recently fixed UIforETW’s circular-buffer tracing to make it reliable, so I started it running and waited for the next occurrence. It didn’t take long.


Posted in Investigative Reporting, Performance, uiforetw, xperf | 9 Comments

Big Project Build Times–Chromium

A Twitter discussion on build times and source-file sizes got me interested in doing some analysis of Chromium build times. I had some ideas about what I would find (lots of small source files causing much of the build time) but I inevitably found some other quirks as well, and I’m landing some improvements. I learned how to use d3.js to create pretty pictures and animations, and I have some great new tools.

As always, this blog is mine and I do not speak for Google. These are my opinions. I am grateful to my many brilliant coworkers for creating everything which made this possible.

The Chromium build tools make it fairly easy to do these investigations (much easier than my last build-times post), and since it’s open source anybody can replicate them. My test builds took up to 6.2 hours on my four-core laptop but I only had to do that a few times and could then just analyze the results.


Posted in Chromium, Performance, Programming | 25 Comments

Creating a Public Symbol Server, Easily

I’ve been a big fan of symbol servers for years. They are a part of the Microsoft/Windows ecosystem that is far better than anything I have seen for other operating systems. With Microsoft’s and Chrome’s symbol servers configured I can download a user’s Chrome crash and start analyzing it immediately – with code bytes and symbols – without knowing or caring what OS or Chrome version that user was running. Since Chrome’s symbols are source-indexed I will even get source-code popping up automatically, and all of this is available to anyone who is interested – no special Google privileges required.


Posted in Debugging, Programming, Symbols, uiforetw | 16 Comments

Bulk ETW Trace Analysis in C#

ETW traces record a wealth of information about how a Windows system is behaving. When analyzing a new and unknown problem there is no replacement for loading the trace into WPA and following the clues to a solution. The thrill of the hunt and the creative challenge of finding a visualization that will reveal the root cause never gets old (too nerdy? Sorry – I do enjoy this).

But sometimes you want to extract some commonly found piece of information from multiple traces, and doing this manually is tedious and error prone.


Posted in uiforetw, xperf | 11 Comments

O(n^2), again, now in WMI

I recently hit some multi-minute delays on my workstation. After investigating I found that the problem was due to a lock being held for five minutes, and during that time the lock-holder was mostly just spinning in a nine-instruction loop.

Coming up with a good title for my blog posts is critical but I immediately realized that the obvious title of “48 processors blocked by nine instructions” was taken already by a post from less than a month earlier. The number of processors blocked is different, and the loop is slightly longer but, really, it’s just deja vu all over again. So, while I will explain the new issue that I found I first want to discuss the question of why this keeps happening.

Why does this happen?

Roughly speaking it’s due to an observation which I’m going to call Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
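A minimal sketch of why that sweet spot is so treacherous, assuming a hypothetical duplicate check as the workload (the function names are illustrative and this is not code from any of the bugs discussed here). It counts the work done by an all-pairs check and by a set-based check on the same input:

```python
def has_duplicate_quadratic(items):
    """Naive all-pairs check; returns (found, comparisons made)."""
    comparisons = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            comparisons += 1
            if items[i] == items[j]:
                return True, comparisons
    return False, comparisons


def has_duplicate_linear(items):
    """Set-based check; returns (found, membership lookups made)."""
    seen = set()
    lookups = 0
    for item in items:
        lookups += 1
        if item in seen:
            return True, lookups
        seen.add(item)
    return False, lookups


if __name__ == "__main__":
    for n in (10, 100, 1000):
        data = list(range(n))  # no duplicates, so both versions do maximum work
        _, quad = has_duplicate_quadratic(data)
        _, lin = has_duplicate_linear(data)
        print(f"n={n}: quadratic={quad} linear={lin}")
```

With ten items the quadratic version does 45 comparisons against 10 lookups, a gap nobody would notice in testing. With a thousand items it does 499,500 against 1,000, and the gap keeps widening as production data grows.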


Posted in Investigative Reporting, Programming, Quadratic, Rants, uiforetw, xperf | 46 Comments

Heap Snapshots–Tracing All Heap Allocations

I’ve recently started using heap snapshots on Windows to track heap allocations. I was able to use heap snapshots to record call stacks for all outstanding allocations in Chrome’s browser process over a full two weeks, letting me account for pretty much every byte of memory consumed.

Since then I have used heap snapshots to find wasteful memory usage in the Windows heap, a memory leak in a security tool injected into Chrome, and many details of Chrome’s memory usage that I was not previously aware of.

I first read about heap snapshots here. This page gives the mechanics of how to record a heap snapshot but it spends very little time explaining what heap snapshots are or how to use them effectively.


Posted in Documentation, Performance, Programming, uiforetw, xperf | 24 Comments

63 Cores Blocked by Seven Instructions

I seem to have a habit of writing about super powerful machines whose many cores are laid low by misuse of locks. So. Yeah. It’s that again.

But this one seems particularly impressive. I mean, how often do you have one thread spinning for several seconds in a seven-instruction loop while holding a lock that stops sixty-three other processors from running? That’s just awesome, in a horrible sort of way.

Update, November 2020: Microsoft has formally acknowledged the bug and said that it is fixed. They said that “root cause is that the search for free regions of the volsnap CoW bitmap were incorrectly unbounded” and I have inferred that this is most likely to happen on new drives where the free regions are uninterrupted by allocated regions that would themselves stop the search.

Contrary to popular belief I don’t actually have a machine with 64 logical processors, and I’ve never seen this particular problem. But a friend hit this problem, asked for help (that is, nerd-sniped me), and I decided it was interesting enough to look at. They sent me an ETW trace that contained enough information for me to craft a tweet-for-help which resolved the issue swiftly.


Posted in Investigative Reporting, Performance, Programming, Quadratic, uiforetw, xperf | 40 Comments

We Need a Carbon Tax, Now, That Goes Up Over Time

If we tax fossil fuels – making them more expensive – then the awesome power and creativity of the free market will create diverse alternatives and efficiencies with minimal additional government intervention.  We will ultimately save money, be healthier, and slow the irreversible transformation of our climate.

Every year we extract billions of tons of hydrocarbons from the ground and from forests and burn them. Not surprisingly this has added hundreds of billions of tons of CO2 to the atmosphere and the oceans. CO2 in the atmosphere traps heat, and CO2 in the oceans makes them more acidic. Because of all this the glaciers and icecaps are melting, temperatures and ocean levels are rising, and corals are dying. Exxon’s scientists warned about this in 1982, but Exxon, like other oil companies, has continued funding climate-change denial. When the situation is bad enough to lead Bill Nye to drop the f-bomb then maybe we should pay attention.

Posted in Commuting, Environment | 70 Comments

Taskbar Latency and Kernel Calls

I work quickly on my computer and I get frustrated when I am forced to wait on an operation that should be fast. A persistent nuisance on my over-powered home laptop is that closing windows on the taskbar is slow. I right-click on an entry, wait for the menu to appear, and then select “Close window”. The mouse movement should be the slow part of this but instead I find that the delay before the menu appears is the longest component.

Sources on Twitter say that a fix is in fast-ring builds as of October 2019, build 18999, but unfortunately this won’t ship to a stable release until early 2020. And in late October Microsoft said “We’ve done some work to improve the launch speed of the taskbar jump lists”, with a thank you from one of the developers.
September 2019: ~700 ms to focus change
April 2020: 250-300 ms (optimizations? or different data?) to focus change
June 2020: 100 ms to focus change!!! The fix has landed in Windows 10 2004.
Unfortunately a deeper investigation shows that it takes at least another 100 ms from when the menu gets focus to when it is painted. So, the latency is a still-visible 200 ms from mouse-release to pixels on the screen. This is the absolute best case on a high-performance laptop, and apparently that’s as good as it’s going to get. Pity.

This has been bothering me for a long time but I had been showing uncharacteristic self control and had resisted being distracted. Until today, when I finally broke down and grabbed an ETW trace.

This post was written as a test of speed-blogging. Total time from finding the issue and sarcastically tweeting about it to publishing the initial post was about 90 minutes.

The ETW trace records me right-clicking on the taskbar to close two Explorer windows. I used UIforETW’s Tracing to file with the default options, giving me a 20.9 MB trace of the issue. You can see a video of the analysis on YouTube.

Posted in Investigative Reporting, Performance, uiforetw, xperf | 41 Comments

O(n^2) in CreateProcess

So many possible introductions to this one:

  • Windows 7: Sheesh, I sure am slow at creating processes
  • Windows 10: Hold my beer…

Or how about:

  • A) How long does CreateProcess take on Windows?
  • B) How long would you like it to take?
  • A) You mean you can make it as fast as I want?
  • B) No, I can make it as *slow* as you want

O(n^2) algorithms that should be linear are the best.

Note that, despite breathless and click-baity claims to the contrary, the performance of Chrome and Chromium was never affected by this bug. Only Chromium’s tests were affected, and that slowdown has been 100% mitigated.

CFG ended up being a big part of this issue, and eight months earlier I had hit a completely unrelated CFG problem, written up here.

I often find odd performance issues all on my own, but sometimes they are given to me. So it was when I returned from vacation to find that I’d been CCed on an interesting looking bug. Vivaldi had reported “Unit test performance much worse on Win10 than Win7”. unit_tests were taking 618 seconds on Win10, but just 125 seconds on Win7.

Update, April 23, 2019: Microsoft received the initial “anomaly” report on the 15th, the repro steps on the 21st, and announced a fix today. Quick work! The fix is in 1903 (back ported) and in 2004. I don’t know what other Windows 10 versions received the fix.

By the time I looked at the bug it was suspected that CreateProcess running slowly was the problem. My first guess was that the problem was UserCrit lock contention caused by creating and destroying default GDI objects. Windows 10 made these operations far more expensive, I’d already written four blog posts about the issues that this causes, and it fit the symptoms adequately well.


Posted in Investigative Reporting, Performance, Quadratic, xperf | 29 Comments