A few years ago I did a lot of thinking and writing about floating-point math. It was good fun, and I learned a lot in the process, but sometimes I go a long time without actually using that hard-earned knowledge. So, I am always inordinately pleased when I end up working on a bug which requires some of that specialized knowledge. Here then is the second of (at least) three tales of floating-point bugs that I have investigated in Chromium (part one is here, part three is here). And this time I actually fixed the bug, both in Chromium, and then in googletest so that future generations will be spared some confusion.
The behavior of the Windows scheduler changed significantly in Windows 10 2004 (aka the April 2020 version of Windows), in a way that will break a few applications. There appears to have been no announcement, and the documentation has not been updated. This isn’t the first time this has happened, but this change seems bigger than last time. So far I have found three programs that hit problems because of this silent change.
The short version is that calls to timeBeginPeriod from one process now affect other processes less than they used to. There is still an effect, and thread delays from Sleep and other functions may be less consistent than they used to be (see [updated] section below), but in general processes are no longer affected by other processes calling timeBeginPeriod.
A few years ago I did a lot of thinking and writing about floating-point math. It was good fun, and I learned a lot in the process, but sometimes I go a long time without actually using that hard-earned knowledge. So, I am always inordinately pleased when I end up working on a bug which requires some of that specialized knowledge. Here then is the first of (at least) three tales of floating-point bugs that I have investigated in Chromium (part two is here, part three is here). This is a short one.
I write a lot about investigations into tricky bugs – CPU defects, kernel bugs, transient 4-GB memory allocations – but most bugs are not that esoteric. Sometimes tracking down a bug is as simple as paying attention to server dashboards, spending a few minutes in a profiler, or looking at compiler warnings.
Here then are three significant bugs that I found and fixed, all of which were sitting in the open, just waiting for somebody to notice.
In May 2019 I was asked to look at a potentially serious Chrome bug. I initially misdiagnosed it as unimportant, thus wasting two valuable weeks, and when I rejoined the investigation it was the number one browser-process crash in Chrome’s beta channel. Oops.
On June 6th, the same day I realized I had misinterpreted the crash data, the bug was marked as ReleaseBlock-Stable meaning that we couldn’t ship our new Chrome version to most of our users until we figured out what was going on.
Update, August 2021: a reader of another of my blog posts told me how to trace GDI object leaks, with call stacks! This makes any reproducible GDI object leak truly trivial to investigate. I created a batch file to implement GDI object tracing. For analysis tips just look at my handle-leak blog post.
The crash was happening because we were running out of GDI (Graphics Device Interface) objects, but we didn’t know what type of GDI objects. Our crash data gave us no clues as to where the problem was happening, and we couldn’t reproduce the problem.
Several of us worked hard on the bug on June 6th and 7th, testing out theories but not making any clear progress. Then, on June 8th I went to check my email and Chrome immediately crashed. It was the crash.
This investigation started, as so many of mine do, with me minding my own business, not looking for trouble. In this case all I was doing was opening my laptop lid and trying to log on.
The first few times that this resulted in a twenty-second delay I ignored the problem, hoping that it would go away. The next few times I thought about investigating, but performance problems that occur before you have even logged on are trickier to solve, and I was feeling lazy.
When I noticed that I was avoiding closing my laptop because I dreaded the all-too-frequent delays when opening it, I realized it was time to get serious.
Luckily I had recently fixed UIforETW’s circular-buffer tracing to make it reliable, so I started it running and waited for the next occurrence. It didn’t take long.
A Twitter discussion on build times and source-file sizes got me interested in doing some analysis of Chromium build times. I had some ideas about what I would find (lots of small source files causing much of the build time) but I inevitably found some other quirks as well, and I’m landing some improvements. I learned how to use d3.js to create pretty pictures and animations, and I have some great new tools.
As always, this blog is mine and I do not speak for Google. These are my opinions. I am grateful to my many brilliant coworkers for creating everything which made this possible.
The Chromium build tools make it fairly easy to do these investigations (much easier than my last build-times post), and since it’s open source anybody can replicate them. My test builds took up to 6.2 hours on my four-core laptop but I only had to do that a few times and could then just analyze the results.
I’ve been a big fan of symbol servers for years. They are a part of the Microsoft/Windows ecosystem that is far better than anything I have seen for other operating systems. With Microsoft’s and Chrome’s symbol servers configured I can download a user’s Chrome crash and start analyzing it immediately – with code bytes and symbols – without knowing or caring what OS or Chrome version that user was running. Since Chrome’s symbols are source-indexed I will even get source-code popping up automatically, and all of this is available to anyone who is interested – no special Google privileges required.
ETW traces record a wealth of information about how a Windows system is behaving. When analyzing a new and unknown problem there is no replacement for loading the trace into WPA and following the clues to a solution. The thrill of the hunt and the creative challenge of finding a visualization that will reveal the root cause never gets old (too nerdy? Sorry – I do enjoy this).
But sometimes you want to extract some commonly found piece of information from multiple traces, and doing this manually is tedious and error prone.
I recently hit some multi-minute delays on my workstation. After investigating I found that the problem was due to a lock being held for five minutes, and during that time the lock-holder was mostly just spinning in a nine-instruction loop.
Coming up with a good title for my blog posts is critical, but I immediately realized that the obvious title of “48 processors blocked by nine instructions” was taken already by a post from less than a month earlier. The number of processors blocked is different, and the loop is slightly longer but, really, it’s just deja vu all over again. So, while I will explain the new issue that I found, I first want to discuss the question of why this keeps happening.
Why does this happen?
Roughly speaking it’s due to an observation which I’m going to call Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
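To make that scaling concrete, here is a minimal sketch in Python. It is not the code from the bug described above; the function names and the deduplication task are hypothetical, chosen only because "scan a list for each new item" is a common way for O(n^2) to sneak into production. The quadratic version counts its comparisons so the growth can be seen without relying on timing.

```python
def dedupe_quadratic(items):
    """Remove duplicates by scanning a list of already-seen items.

    Returns (unique_items, comparisons). Each new item is compared
    against every item kept so far, so the work grows as O(n^2) when
    most items are distinct.
    """
    unique = []
    comparisons = 0
    for item in items:
        found = False
        for seen in unique:
            comparisons += 1
            if seen == item:
                found = True
                break
        if not found:
            unique.append(item)
    return unique, comparisons


def dedupe_linear(items):
    """Remove duplicates using a set for membership checks.

    Same result and order as dedupe_quadratic, but each lookup is
    O(1) on average, so the whole pass is O(n).
    """
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


if __name__ == "__main__":
    # With n distinct items the quadratic version does n*(n-1)/2
    # comparisons: ten times the input costs about a hundred times the work.
    for n in (100, 1000):
        _, comparisons = dedupe_quadratic(list(range(n)))
        print(n, comparisons)
```

With a hundred items the quadratic version is unnoticeable, which is how it passes testing; it is only when the input grows by a few orders of magnitude in the field that the cost shows up.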