Taskbar Latency and Kernel Calls

I work quickly on my computer and I get frustrated when I am forced to wait on an operation that should be fast. A persistent nuisance on my over-powered home laptop is that closing windows on the taskbar is slow. I right-click on an entry, wait for the menu to appear, and then select “Close window”. The mouse movement should be the slow part of this but instead I find that the delay before the menu appears is the longest component.

Sources on twitter say that a fix is in fast-ring builds as of October 2019, build 18999, but unfortunately this won’t ship to a stable release until early 2020. And in late October Microsoft said “We’ve done some work to improve the launch speed of the taskbar jump lists”, with a thank you from one of the developers.
September 2019: ~700 ms to focus change
April 2020: 250-300 ms (optimizations? or different data?) to focus change
June 2020: 100 ms to focus change!!! The fix has landed in Windows 10 2004.
Unfortunately a deeper investigation shows that it takes at least another 100 ms from when the menu gets focus to when it is painted. So, the latency is a still-visible 200 ms from mouse-release to pixels on the screen. This is the absolute best case on a high-performance laptop, and apparently that’s as good as it’s going to get. Pity.

This has been bothering me for a long time but I had been showing uncharacteristic self control and had resisted being distracted. Until today, when I finally broke down and grabbed an ETW trace.

This post was written as a test of speed-blogging. Total time from finding the issue and sarcastically tweeting about it to publishing the initial post was about 90 minutes.

The ETW trace records me right-clicking on the task bar to close two Explorer windows. I used UIforETW’s Tracing to file with the default options, giving me a 20.9 MB trace of the issue. You can see a video of the analysis below on youtube. Continue reading

Posted in Investigative Reporting, Performance, uiforetw, xperf | Tagged , , , | 40 Comments

O(n^2) in CreateProcess

So many possible introductions to this one:

  • Windows 7: Sheesh, I sure am slow at creating processes
  • Windows 10: Hold my beer…

Or how about:

  • A) How long does CreateProcess take on Windows?
  • B) How long would you like it to take?
  • A) You mean you can make it as fast as I want?
  • B) No, I can make it as *slow* as you want

O(n^2) algorithms that should be linear are the best.

Note that, despite breathless and click-baity claims to the contrary, the performance of Chrome and Chromium was never affected by this bug. Only Chromium’s tests were affected, and that slowdown has been 100% mitigated.

CFG ended up being a big part of this issue, and eight months earlier I had hit a completely unrelated CFG problem, written up here.

I often find odd performance issues all on my own, but sometimes they are given to me. So it was when I returned from vacation to find that I’d been CCed on an interesting looking bug. Vivaldi had reported “Unit test performance much worse on Win10 than Win7”. unit_tests were taking 618 seconds on Win10, but just 125 seconds on Win7.

Update, April 23, 2019: Microsoft received the initial “anomaly” report on the 15th, the repro steps on the 21st, and announced a fix today. Quick work! The fix is in 1903 (back ported) and in 2004. I don’t know what other Windows 10 versions received the fix.

By the time I looked at the bug it was suspected that CreateProcess running slowly was the problem. My first guess was that the problem was UserCrit lock contention caused by creating and destroying default GDI objects. Windows 10 made these operations far more expensive, I’d already written four blog posts about the issues that this causes, and it fit the symptoms adequately well.

Continue reading

Posted in Investigative Reporting, Performance, xperf | Tagged | 29 Comments

Exercises in Emulation: Xbox 360’s FMA Instruction

Years ago I worked in the Xbox 360 group at Microsoft. We were thinking about releasing a new console, and we thought it would be nice if that console could run the games of the previous console.

Emulation is always hard, but it is made more challenging when your corporate masters keep changing CPU types. The Xbox one – sorry, the original Xbox – used an x86 CPU. The Xbox two – sorry, the Xbox 360 – used a PowerPC CPU. The Xbox three – sorry, the Xbox One – used an x86/x64 CPU. These ISA flip-flops did not make life easy.

imageI made some contributions to the team that taught the Xbox 360 how to emulate a lot of the original Xbox games – emulating x86 on PowerPC – and was given the job title Emulation Ninja for that work*. Then I was asked to help investigate what it would take to emulate the Xbox 360’s PowerPC CPU with an x64 CPU. To set expectations, I’ll mention up front that I didn’t find a satisfactory solution.

Continue reading

Posted in Floating Point | Tagged , , , , | 23 Comments

When Your Profiler Lies

Last week I wrote about the performance consequences of inadvertently loading gdi32.dll into processes that are created and destroyed at very high rates. This week I want to share some techniques for digging deeper into this behavior, and the odd things that I found when trying to use them.

When I first wrote UIforETW I noticed that an inordinate amount of the size of the traces it recorded was coming from the Microsoft-Windows-Win32k provider. This provider records useful information about UI hangs and which window is active, but some less useful events were filling the trace buffers and crowding out the interesting ones. The most verbose events were the ExclusiveUserCrit, ReleaseUserCrit, and SharedUserCrit events and they were routinely generating 75% of the Microsoft-Windows-Win32k event traffic. So I stopped recording those events and forgot about them until quite recently. And that’s funny because those events record exactly the information that is needed for investigating all of these UI hangs – theoretically.

Continue reading

Posted in Investigative Reporting, Performance, Programming, uiforetw, xperf | Tagged , , | 20 Comments

A Not-Called Function Can Cause a 5X Slowdown

Subtitle: Making Windows Slower Part 3: Process Destruction

In the summer of 2017 I wrestled with a Windows performance problem. Process destruction was slow, serialized, and was blocking the system input queue, leading to repeated short mouse-movement hangs when building Chrome. The root cause was that Windows was wasting a lot of time looking up GDI objects during process destruction, and it did this while holding the system-global user32 critical section. I shared the details in this blog post: 24-core CPU and I can’t move my mouse.

Microsoft fixed the bug, and I moved on with my life, but then it appeared to have come back. I heard complaints that the LLVM test suite was running slowly, with frequent input hangs.

Continue reading

Posted in Investigative Reporting, Performance, Programming | Tagged , , | 14 Comments

Making Windows Slower Part 2: Process Creation

Windows has long had a reputation for slow file operations and slow process creation. Have you ever wanted to make these operations even slower? This weeks’ blog post covers a technique you can use to make process creation on Windows grow slower over time (with no limit), in a way that will be untraceable for most users!

And, of course, this post will also cover how to detect and avoid this problem.

Continue reading

Posted in Code Reliability, Investigative Reporting, Performance, Programming | Tagged , , | 20 Comments

Commute Challenge 2018

Bruce Dawson in wet suit with commute methods, copyright 2018 425 Business MagazineI’ll bet I had more fun commuting during September 2018 than you did.

In April 2017 I gave myself the challenge of commuting to work using a different method every workday for a month – twenty ways in twenty days! The write up and video are here. It was great fun and it also served as a joyous celebration of the many ways to make commuting more fun than sitting alone in traffic.

In September 2018 I did the same thing, with nineteen new methods. It was a lot of work but a huge amount of fun, and I’ve got the video to prove it. I got assistance from friends, coworkers, and strangers in ways that I would not have thought possible, and the second commute challenge actually worked.

Continue reading

Posted in Commuting, Environment, Unicycling | 11 Comments

24-core CPU and I can’t type an email (part two)

In my last post I promised to give more details about some rabbit holes that I went down during the investigation, including page tables, locks, WMI, and a vmmap bug. Those details are here, along with updated code samples. But first, a really quick summary of the original issue:

In the last post I talked about how every time a CFG-enabled process allocates executable memory some Control Flow Guard (CFG) memory is allocated as well. Windows never frees the CFG memory so if you keep allocating and freeing executable memory at different addresses then your process can accumulate an arbitrary amount of CFG memory. Chrome was doing this and that was leading to an essentially unbounded waste of memory, and hangs on some machines.

And, I have to say, hangs are hard to avoid if VirtualAlloc starts running more than a million times slower than normal.

Continue reading

Posted in Investigative Reporting, memory, Performance | Tagged , , , | 26 Comments

24-core CPU and I can’t type an email (part one)

I wasn’t looking for trouble. I wasn’t trying to compile a huge project in the background (24-core CPU and I can’t move my mouse), I was just engaging in that most mundane of 21st century tasks, writing an email at 10:30 am. And suddenly gmail hung. I kept typing but for several seconds but no characters were appearing on screen. Then, suddenly gmail caught up and I resumed my very important email. Then it happened again, only this time gmail went unresponsive for even longer. Well that’s funny

CFG ended up being a big part of this issue, and eight months later I hit a completely unrelated CFG problem, written up here.

I have trouble resisting a good performance mystery but in this case the draw was particularly strong. I work at Google, making Chrome, for Windows, focused on performance. Investigating this hang was actually my job. And after a lot of false starts and some hard work I figured out how Chrome, gmail, Windows, and our IT department were working together to prevent me from typing an email, and in the process I found a way to save a significant amount of memory for some web pages in Chrome.

Continue reading

Posted in Investigative Reporting, memory, Performance | Tagged , , , | 58 Comments

Making Windows Slower Part 1: File Access

Tortoises are slowWindows has long had a reputation for slow file operations and slow process creation. Have you ever wanted to make these operations even slower? This weeks’ blog post covers a technique you can use to make all file operations on Windows run at one tenth their normal speed (or slower), in a way that will be untraceable for most users!

And, of course, this post will also cover how to detect and avoid this problem.

Continue reading

Posted in Investigative Reporting, Performance, Programming, xperf | Tagged , , | 25 Comments