Part of my job always seems to include crash analysis. A program crashes on a customer’s machine, a minidump is uploaded to the cloud, and it might be my desk that it appears on when Monday morning rolls around. The expectation is that I can make sense of it so that we can ship more reliable software.
The Windows ecosystem makes this as easy as possible because the debuggers will automatically download the binaries and the debug symbols, and if source indexing has been applied then they will also automatically download the correct source files as needed.
Chrome has a public symbol server and uses source indexing so debugging Chrome is accessible for any developer.
TL;DR – be wary of buying a Lenovo laptop or any other laptop that uses a Synaptics touch pad until Synaptics ships a fixed driver. Their driver has a memory leak and they have a battery-life bug that causes Windows to repeat the same system-scan once a second. So far Synaptics has failed to respond so the wait for a fix must be assumed to be infinite.
Update: October 15, 2017: an updated driver for my laptop was released September 11, 2017, six days after I published this post. However I have yet to see the driver on Windows Update, Lenovo’s System Update, or in any communications from Lenovo or Synaptics. So, I don’t know if or how ‘normal’ customers are supposed to get this fix. I only found out through the kindness of a stranger.
Now for more details to back up my claims…
My new laptop has great battery life and I want to keep it that way so if I notice something consuming CPU time for no good reason I investigate. If it’s a web page with battery-wasting ads I close it, and if it’s a process that keeps doing background work I’ll kill it.
So, when I noticed that Task Manager said that WmiPrvSE.exe was consuming ~0.7% of my CPU time (~5.6% of a core) continuously I investigated.
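The two percentages above are related by simple arithmetic. A minimal sketch, assuming Task Manager reports usage as a share of all logical processors and that the machine has eight of them (the function name and the processor count are illustrative assumptions, not something the tool exposes):

```python
def percent_of_one_core(total_cpu_percent: float, logical_processors: int) -> float:
    """Convert a share of total CPU time into a share of a single logical processor."""
    # Hypothetical helper: Task Manager-style percentages are averaged over
    # every logical processor, so scaling by the count gives the per-core load.
    return total_cpu_percent * logical_processors


# 0.7% of the whole machine, spread over an assumed eight logical processors.
per_core = percent_of_one_core(0.7, 8)
print(f"{per_core:.1f}% of a core")
```

A number that looks negligible as a fraction of the whole machine is a steadier drain when viewed as a fraction of one core, which is why it is worth investigating on a laptop.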
Earlier this month I wrote about how Windows 10 holds a lock during too much of process destruction, which both serializes this task and causes mouse-cursor hitches and UI hangs (because the same lock is used for these UI tasks).
I thought I’d use this as an excuse to dig slightly deeper into what is going on using a clunky-but-effective ETW profiling technique. This technique shows that a 48-instruction loop is consuming a huge chunk of the CPU time while the lock is held – the 80/20 rule is alive and well. And, thanks to some discussion on Hacker News I now have an idea of what that function does and why it got so much more expensive in Windows 10 Anniversary Edition.
This story begins, as they so often do, when I noticed that my machine was behaving poorly. My Windows 10 work machine has 24 cores (48 hyper-threads) and they were 50% idle. It has 64 GB of RAM and that was less than half used. It has a fast SSD that was mostly idle. And yet, as I moved the mouse around it kept hitching – sometimes locking up for seconds at a time.
Update Oct 29, 2017: a video showing how to see if the bug is fixed can be found here, and the bug is fixed in the 17025 insider preview builds.
Update Nov 20, 2017: the fix has made it to Creators Update (RS2) which means I can now build Chrome without encountering micro-hangs!
So I did what I always do – I grabbed an ETW trace and analyzed it. The result was the discovery of a serious process-destruction performance bug in Windows 10.
I just got a new laptop (Lenovo P51, four cores, eight threads, 32 GB RAM, multiple drive bays). My old machine was more than six years old so it was probably overdue. I wanted to record some of the reasons for the upgrade, and the process, if only for myself, so here we go.
One would think that the main reason to upgrade a six-year-old laptop would be the hardware. Bigger, faster, etc., but it turns out that software was as big a factor.
I’ve written in the past about how to compare floating-point numbers for the common scenario where two results should be similar but may not be identical. In that scenario it is reasonable to use an AlmostEqual function for comparisons. But there are actually cases where floating-point math is guaranteed to give perfect results, or at least perfectly consistent results. When programmers treat floating-point math as a dark art that can return randomly wrong results then they do themselves (and the IEEE-754 committee) a disservice.
A common example given is that in IEEE floating-point math 0.1 + 0.2 does not equal 0.3. This is true. However this “odd” behavior is then extrapolated in some ill-defined way to suggest that all floating-point math is wrong, in unpredictable ways. The linked discussion then used one of my blog posts to justify their incorrect analysis – hence this article.
In fact, IEEE floating-point math gives specific guarantees, and when you can use those guarantees you can sometimes draw strong conclusions about your results. Failing to do so leads to a cascade of uncertainty in which any outcome is possible, and analysis is impossible.
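A short sketch of the distinction, using Python floats (which are IEEE-754 doubles on mainstream platforms). The variable names are illustrative, and `math.isclose` stands in here for an AlmostEqual-style comparison; it is not the function from the earlier posts:

```python
import math

# 0.1, 0.2 and 0.3 have no exact binary representation, so each literal is
# rounded and the sum lands on a different double than the literal 0.3.
sum_of_tenths = 0.1 + 0.2
print(sum_of_tenths == 0.3)        # False

# The result is wrong in a predictable way: the same operation on the same
# inputs always rounds to the same double.
print(sum_of_tenths == 0.1 + 0.2)  # True

# When inputs and result are all exactly representable, the answer is exact.
exact_sum = 0.5 + 0.25
print(exact_sum == 0.75)           # True

# Integers up to 2**53 are exactly representable in a double, so integer
# arithmetic in that range is exact as well.
largest_below = float(2**53 - 1)
print(largest_below + 1.0 == float(2**53))  # True

# For results that should be similar but may not be identical, a tolerance
# comparison is the appropriate tool.
print(math.isclose(sum_of_tenths, 0.3))     # True
```

The exact comparisons succeed because the guarantees apply to those inputs, while the tolerance comparison is reserved for the one case where rounding is actually expected.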
I’m lucky enough to live just 2 km (1.25 miles) away from the place where I work. Because of this – and because I dislike driving – I tend to commute in a variety of non-car ways. A few months into my new job I noticed that I tended to use about six different commute methods on a regular basis: walking, running, cycling, unicycling, inline skating, and taking a bus. Having that many commute methods got me thinking: how many commute methods could I come up with? Could I commute to work using a different method every work day for a month?
And so was born the commute challenge. After much procrastination I tried this challenge in April 2017. One month, twenty work days, twenty different commute methods.