24-core CPU and I can’t type an email (part one)

I wasn’t looking for trouble. I wasn’t trying to compile a huge project in the background (24-core CPU and I can’t move my mouse), I was just engaging in that most mundane of 21st century tasks, writing an email at 10:30 am. And suddenly gmail hung. I kept typing for several seconds but no characters were appearing on screen. Then, suddenly gmail caught up and I resumed my very important email. Then it happened again, only this time gmail went unresponsive for even longer. Well that’s funny

CFG ended up being a big part of this issue, and eight months later I hit a completely unrelated CFG problem, written up here.

I have trouble resisting a good performance mystery but in this case the draw was particularly strong. I work at Google, making Chrome, for Windows, focused on performance. Investigating this hang was actually my job. And after a lot of false starts and some hard work I figured out how Chrome, gmail, Windows, and our IT department were working together to prevent me from typing an email, and in the process I found a way to save a significant amount of memory for some web pages in Chrome.

This investigation had so many rabbit holes that I’m going to save some of the digressions for a follow-on post (which is here) , but this one will entirely explain the hangs.

As usual I had UIforETW running in the background, tracing to circular buffers, so I just had to type Ctrl+Win+R and the buffers, representing the last thirty seconds or so of system activity, were saved to disk. I loaded it into Windows Performance Analyzer (WPA), but I couldn’t definitively find the hang.

When a Windows program stops pumping messages there will be ETW events emitted to indicate exactly where this happened, so those types of hangs are trivial to find. But apparently Chrome kept on pumping messages. I looked for a time when one of Chrome’s crucial threads either went into a busy loop or went completely idle, but I couldn’t find anything conclusive. There were some spots where Chrome went mostly idle, but even then all of the key threads kept chugging along, so I couldn’t be sure where the hang was – Chrome might have just been going idle because there was nothing to do:

image

imageUIforETW has a built-in key logger which is often useful when identifying key points in a trace. However, for obvious security reasons it defaults to anonymizing the keystrokes, treating every number typed as a ‘1’ and every letter as an ‘A’. This was making it hard to find the exact moment of the hang so I turned the input tracing from “Private” to “Full” and waited for the hang. At around 10:30 the next morning the hang happened again. I saved the trace buffers, and put these notes into UIforETW’s trace information field:

Was typing “defer to those with more scuba experience” and gmail hung around the end of “those” and then resumed partway through “experience”. gmail tab was PID 27368.

Clearly just a normal work-related discussion of commute methods, but the important thing is that now I had a way of locating the hang in the ETW trace. I loaded the trace, looked at the keyboard data in the Generic Events field (the events are emitted by UIforETW itself, each one is a purple diamond in the screen shot below) and suddenly I could see exactly where the hang was, and clearly the dip in CPU usage was 100% correlated:

image

Okay, but why is Chrome stopping? Here’s some foreshadowing: not shown in these screenshots is that WmiPrvSE.exe was consuming an entire CPU hyperthread whenever this happened. But that shouldn’t matter. I have a 24-core/48-thread machine so consuming one hyperthread means my machine is still about 98% idle.

I then zoomed in to the clearly-significant period where Chrome was idle and looked in particular at CrRendererMain in chrome.exe (27368), the gmail tab process.

Aside: I just want to say thank you to 2015 me for asking Microsoft to improve their thread naming mechanisms, and thank you to Microsoft for doing everything I suggested – thread names in WPA are great!

The problem suddenly became clearer. During the 2.81 s hang this thread was scheduled and ran 440 separate times. Normally running every 6 ms is enough to make a program responsive, but for some reason it wasn’t making any progress. I noticed that every time it woke up it was on the same stack. Simplified down it was:

chrome_child.dll (stack base)
KernelBase.dll!VirtualAlloc
ntoskrnl.exe!MiCommitVadCfgBits
ntoskrnl.exe!MiPopulateCfgBitMap
ntoskrnl.exe!ExAcquirePushLockExclusiveEx
ntoskrnl.exe!KeWaitForSingleObject (stack leaf)

Chrome is calling VirtualAlloc, which tries to update some “CfgBits” and needs to acquire a lock. At first I assumed that Chrome was calling VirtualAlloc 440 times, which seemed like a weird thing to do, but the reality was perhaps weirder. Chrome called VirtualAlloc once and a lock needed to be acquired. Chrome was signaled when that lock was available but – for 439 times in a row – when Chrome woke up and tried to acquire the lock it was unavailable. The lock had been reacquired by the process that had just released it.

This is because many Windows locks are, by design, not fair and if a thread releases a lock and then tries to reacquire it immediately then it can, in cases like this, reacquire it every single time. Starvation. More on this next time.

The Readying Process – the one that signaled Chrome to go grab the lock – in every case was WmiPrvSE.exe, releasing the lock on this stack:

ntoskrnl.exe!KiSystemServiceCopyEnd (stack base)
ntoskrnl.exe!NtQueryVirtualMemory
ntoskrnl.exe!MmQueryVirtualMemory
ntoskrnl.exe!MiUnlockAndDereferenceVad
ntoskrnl.exe!ExfTryToWakePushLock (stack leaf)

I got distracted by WMI for a while (more on that next time) but eventually I wrote a program to recreate WMI’s behavior. I had CPU sampling data that showed me where WmiPrvSE.exe was spending its time (some edits and simplifications made):

WmiPerfClass.dll!EnumSelectCounterObjects (stack base)
WmiPerfClass.dll!ConvertCounterPath
pdh.dll!PdhiTranslateCounter
pdh.dll!GetSystemPerfData
KernelBase.dll!blah-blah-blah
advapi32.dll!blah-blah-blah
perfproc.dll!blah-blah-blah
perfproc.dll!GetProcessVaData
ntdll.dll!NtQueryVirtualMemory
ntoskrnl.exe!NtQueryVirtualMemory
ntoskrnl.exe!MmQueryVirtualMemory
ntoskrnl.exe!MiQueryAddressSpan
ntoskrnl.exe!MiQueryAddressState
ntoskrnl.exe!MiGetNextPageTable (stack leaf)

It turns out that reproducing the slow scanning from the sampling data was quite easy. The interesting part of the call stack above is NtQueryVirtualMemory, which is used to scan process memory, and is called by the suggestively named GetProcessVaData, where Va probably stands for Virtual Address. My VirtualScan program just called NtQueryVirtualMemory in the obvious loop to scan the address space of a specified process, the code worked, it took a really long time to scan the gmail process (10-15 seconds), and it triggered the hang. But why?

The advantage of writing the scanning loop myself was that I could start generating statistics. NtQueryVirtualMemory returns data on each contiguous range of address space that has matching attributes. That might mean all reserved, all committed with particular protection settings, etc. The gmail process had about 26,000 of these blocks but I found another process (WPA, as it turns out) that had 16,000 memory blocks but could be scanned very quickly.

At some point I looked at the gmail process with vmmap and noticed that the gmail process had a fair amount of memory (361,836 KiB) and a lot of separate blocks (49,719) coming from a single reservation in the Shareable category – a reservation with a size of 2,147,483,648 KiB, aka 2 TiB. Wot?

image

I happened to know that the 2 TiB reservation is used for Control Flow Guard (CFG), and I remembered that “CFG” showed up on the call stacks where the gmail Chrome process was waiting – MiCommitVadCfgBits was on the stack. Maybe the large number of blocks in the CFG region were the problem!

Control Flow Guard (CFG) is used to try to stop exploits. Its 2 TiB reservation is a sparse bitmap which indicates which addresses (across 128 TiB of user-mode address space) are valid indirect branch targets. So I modified my virtual memory scanner to count how many blocks were in the CFG region (just look for the 2 TiB reservation) and how many blocks were tagged as executable. Since CFG memory is used to describe executable memory I expected to see one block of CFG memory for each block of executable memory in the process. Instead I saw 98 blocks of executable memory and 24,866 blocks of committed CFG memory. A significant discrepancy:

     Scan time,  Committed, page tables, committed blocks
Total: 41.763s, 1457.7 MiB,    67.7 MiB,  32112, 98 code blocks
CFG:   41.759s,  353.3 MiB,    59.2 MiB,  24866

vmmap shows reservations and commited memory as blocks and my scanning tool only counts committed blocks – that’s why vmmap shows 49,684 blocks where my scanning tool reports 24,866

In hindsight it is obvious, but what if the CFG bitmap never gets cleaned up? What if memory gets allocated in the CFG bitmap when you allocate executable memory, but it doesn’t get freed when you free the executable memory. That would explain this behavior.

End-to-end repro

The next step was to write a VAllocStress program that allocates and then frees thousands of blocks of executable memory at random addresses. This executable needs to have CFG enabled and needs to be 64-bit and I hit some road blocks because I didn’t really know what I was trying to do but it worked. After allocating and freeing this many blocks of executable memory this program should sit in a loop where it occasionally try to allocate/free more executable memory, and note when this is slow. Here is my basic algorithm for VAllocStress:

  1. Spin in a loop a lot of times:
    1. Allocate some executable memory with VirtualAlloc at a random address
    2. Free the memory
  2. Then, spin in an infinite loop:
    1. Sleep for 500 ms (don’t want to be a CPU hog)
    2. Allocate some executable memory with VirtualAlloc at a fixed location
    3. Print a message if the VirtualAlloc call takes more than ~500 ms
    4. Free the memory

That’s it. It’s super simple. And it was so satisfying when it worked. I ran it and then got my VirtualScan program to scan my VAllocStress process. I quickly got confirmation that the CFG block was horribly fragmented, and the scanning took a long time. And my VAllocStress program hung for the duration of the scan!

At this point I had simulated the CFG fragmentation, the long scanning times, and the hang. Huzzah!

Root-cause

It turns out that v8 (Chrome’s JavaScript engine) has CodeRange objects for managing code-gen, and each CodeRange object is constrained to a 128 MiB address range. This should be small enough to avoid the rampant CFG allocation I was seeing.

But what if you have multiple CodeRange objects, and what if those get allocated at random addresses and then freed? I instrumented the CodeRange constructor, left gmail running, and found the smoking gun. Every couple of minutes another CodeRange object was created (and destroyed). Being in the debugger made it easy to find that WorkerThread::Start was allocating these objects, and suddenly everything makes sense:

  1. gmail uses service workers, probably for implementing offline mode
  2. These appear and disappear every few minutes because that’s how service workers behave
  3. Each worker thread gets a temporary CodeRange object which allocates some executable pages for JITted JavaScript code from a random location in the 47-bit process address space
  4. The 2 TB CFG memory reservation gets some entries every time new code pages are allocated
  5. The CFG allocations are never freed
  6. NtQueryVirtualMemory is painfully slow when scanning CFG memory (about 1 ms per block) for reasons that I don’t understand

The slow scanning of CFG memory is actually fixed in Windows 10 RS4 (April 2018 Update), which briefly had me wondering if all of this investigation was pointless. But it’s not.

Memory

The CFG reservation starts out as just an address reservation – no memory is actually committed. As executable pages are allocated however parts of the CFG reservation are turned into committed memory, using real memory pages. And those pages never get freed. If you keep allocating and freeing randomly located blocks of executable memory then the CFG region will grow arbitrarily large! Well, that’s not quite true. The CFG memory block is best thought of a cache with bounded size. However that is little consolation when the bounded size is 2 TiB, per-process!

The worst I’ve seen so far is when my gmail tab had been running for eight days and had accumulated 353.3 MiB of CFG memory, and 59.2 MiB of page-tables to map that memory, for a total waste of about 400 MiB. For some reason most people see either no symptoms or much milder symptoms than I do.

You can also leak CFG memory by loading a DLL and then unloading it. CFG memory is allocated to track its valid branch targets, and that memory is not freed when the DLL is unloaded. This is normally a very minor problem, but if you repeatedly load and free the DLL, forcing it to a new address each time then you can leak arbitrarily large amounts of CFG memory in a slightly different way. Did I do this? Of course I did.

Resolution

The v8 (Chrome’s JavaScript engine) team has updated their code so that it reuses addresses for CodeRange objects, which will squish this bug quite nicely. Microsoft already fixed their CFG memory scanning performance problems. Maybe some day Microsoft will free CFG regions when executable memory is freed, at least in the easy cases where large address ranges are freed. And a vmmap bug was reported.

Most of my coworkers and our customers never noticed this problem. I only noticed it because:

  • I had gmail’s offline mode enabled
  • I was running on an old version of Windows 10
  • Our IT department was running regular WMI scans of our computers
  • I was paying attention
  • I got lucky

It seems highly improbably that, as one of the most qualified people to diagnose this bug, I was the first to notice it. If any of the factors above had not been true then I would not have experienced this hang and the associated memory waste might never have been found.

By the way, the reason that the hangs kept happening at 10:30 am is just because that’s when our IT team ran their inventory scans. If you want to trigger a scan manually, you can go to Control Panel | Configuration Manager | Actions | Select ‘Hardware Inventory Cycle’ and then Run Now.

clip_image002

Source code

If you want to play along at home you can grab the source-code to VAllocStress and VirtualScan from github, here.

Rabbit holes

This investigation included a lot of rabbit holes. Some were important (page tables), some were instructive (locks), and some were purely a waste of time (WMI details). I’ll talk more about these, and a vmmap bug, in a follow-up post. If you want more gory details – some of which ended up being wrong or irrelevant – you can follow the whole investigation at crbug.com/870054.

Hacker news discussion is here.

Reddit discussion is here.

Russian translation is here.

Japanese translation is here.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048
This entry was posted in Investigative Reporting, memory, Performance and tagged , , , . Bookmark the permalink.

28 Responses to 24-core CPU and I can’t type an email (part one)

  1. As a layperson swimming in a sea of ignorance, I understood 60% of that, I tried though…
    M̶u̶r̶d̶e̶r̶ Memory mystery solved…

  2. Imran says:

    Loved every bit of it.. bless you be.

  3. xyl says:

    make sure hardware side is ok, clog usually start at circuity degrade, or power related. Good troubleshooting steps at end to end repro, try that myself by cramping system with other heavy tasking and eventually finished, what am l missing there compare to your situation?

  4. Jim Jones says:

    Nicely written.

  5. cxw says:

    Congrats on the find, and thanks for the write-up! Always nice when it happens to the right person. Some years ago I was working on a hardware project and became the first to smoke an expensive component. Same situation – I was getting paid to figure out that sort of thing 🙂 . (Turned out to be pure Murphy – there were two ways to plug it in when there should have only been one.) Anyway, nice job on this one!

  6. Cycologist says:

    Try Linux, fewer mysterious OS hangs and stalling is virtually unheard of.

    • dmz73 says:

      Tried Linux – Ubuntu 18.04. After suspend and resume Bluetooth related process went rogue and chewed up 100% of CPU (dual core). System was slow as molasses and mostly unresponsive. Could not open task manager or even switch to console. After minutes of this the virtual console switch worked but that too was unresponsive. Only thing that worked was forced power off. At least on Windows task manager usually works and allows me to kill bad process.

  7. Jared Still says:

    Thanks for this; I have been having this problem for quite some time, and it is most frustrating. I have most recently experienced it just a couple days ago.
    Thanks again.

    • brucedawson says:

      There’s no guarantee that this is the problem that you are experiencing (there are a *lot* of things that can cause hangs) but if it’s the same issue then the problem will magically go away when the fix ships (M69 or M70 version of Chrome)

  8. Very interesting bug, but have you heard of this one? Full kernel freezes (in Windows) when closing Chromium based applications. I’ve seen Chrome do it, Discord, Google Play Music Desktop Player, and Visual Studio Code. It usually happens after the process has been running for days, and I’ve had it happen on both my laptop and a friend’s (who has the same CPU as mine).

    I’ve also seen (in Discord at least) unexplained input delays when scrolling while. I also profiled Discord while it was happening, but it just seems that all the frames are late, and it doesn’t seem to happen when I use software rendering.

    Perhaps both are potentially related to Intel’s graphics drivers not interacting well with Chromium?

    • brucedawson says:

      What do you mean by a full kernel freeze? The whole machine locks up? Does it ever respond again?

      If it ever responds again then you could record an ETW trace (tinyurl.com/etwcentral has instructions on using UIforETW for this), open a bug at crbug.com, and attach a link to the trace. CC me on it.

      If the system never responds then that is probably a driver bug.

  9. sikander says:

    I don’t have a Control Panel | Configuration Manager ???

  10. jarmenz says:

    Like a modern detective )

  11. Nelson says:

    I am way too ignorant of the inner workings of Windows software. To make it worse I am, ahem, elderly now, and learning is no longer as quick and sure as it was in my “yout”. Nonetheless, I scanned this to try, once again, to experience the thrill of following a rabbit down its hole. Toward the end I happened upon your comment,

    “It seems highly improbably that, as one of the most qualified people to diagnose this bug, I was the first to notice it.”

    I was reminded of a thought from Eric Raymond, a man in the open software world whom I admire. In his book “The Cathedral and the Bazaar” he noted “given enough eyeballs, all bugs are shallow”. Your eyeballs, apparently, were the pair needed to see that bug wriggling at the bottom of the muck we affectionately call code. Nice read. Thanks for posting.

  12. Kyle Sabo says:

    Part of the problem here comes from the legacy performance counter infrastructure (circa 1993). There is no way to enumerate the counters provided by each provider other than to ask each provider to collect every counter available. This is a rather large oversight, but it mostly works because collecting counters should be very fast. Except for that one counter that enumates address spaces…

    The next part of the problem is that WMI promises a list of performance counters currently available on the machine. The way it does that is by enumerating all the counters, which means collecting one of every legacy counter… It’s likely that your IT department doesn’t care about this particular counter, but it gets collected as a side effect of finding counters they do care about.

    As you noted, this was addressed in the 1803 release of Windows 10. If you query WMI for the list of available counters you should see the memory one conspicuously missing from the list.

    • brucedawson says:

      Thanks for the info Kyle. I hadn’t realized that the memory counter was missing from the list, but I did notice that scanning the address spaces is *much* faster in 1803. So that means that the performance issue was fixed in two independent ways.

      Do you know why the querying would happen multiple times? When I left VAllocStress running it would report whenever it noticed that it was being scanned and these reports would always be in clusters of about eight. It appeared that WMI was scanning its address space that many times. Is that likely to be a WMI weirdness or a Google IT weirdness? This is why the hang would happen multiple times at 10:30 am (and it’s not just because one scan takes that long – it is definitely different scans of the same process).

      • Kyle Sabo says:

        After it comes alive, WMI refreshes the list of counters every 2 minutes until the WMI helper process closes due to inactivity.

        • brucedawson says:

          Ah. That makes perfect sense and is terrifying, especially given the possibility that collecting the counters might take more than 2 minutes. I now have something else to watch for…

  13. Michael says:

    24-cores cpu cannot be fast and the problem in 24 cores.
    Better cpu “for gmail” is 8700k 5.0 GHz.

    • brucedawson says:

      Well, I need to use my computer for more than just gmail, so I guess I’m stuck with 24 cores. And, chrome is multi-threaded enough to get some benefit from multi-core. And finally, once this fix lands (it should be in M69, which releases in a few weeks), the gmail hangs will be gone.

      • George Tsiros says:

        you can plainly see that core count is irrelevant.
        hell, most of the hardware is irrelevant.
        Only thing that matters is the software.

  14. kfsone says:

    I’ve been seeing weird slowdowns of the new console host, e.g open a powershell window and do a `gci -r c:\` and then ask visual studio to build a hello world app. The console window will drop to a crawl. I most often see it from inside ipython.

    I guess it’s time to install the sdk 🙂

    • brucedawson says:

      Are you running at 200% magnification? I’ve heard of issues like this in that configuration. But, the only way to know is to grab and analyze a trace – good luck!

  15. Benjamin J. Abt says:

    It’s October 2018 now and this issue still has not been solved by Gmail. Every time I log in the CPU usage jumps from 5%-10% while browsing other websites up to 50-80% with a Gmail tab opened – and that is on a Core 2 Quad Q9650, which is not a slouch at all. As soon as I log off from Gmail, the CPU use drops to 5% again! How can such a simple and basic thing as a webmail tab consume as much CPU power as a 3D shooter at high resolutions? There must be some malicious code be running in that tab …

    • brucedawson says:

      The issue that I reported on has been fixed. You must be hitting a different issue. You could have some malicious or inefficient code injected into that tab – uBlock perhaps?

      I would try with extensions disabled (in an incognito window?) to see if that helps. If you are using Chrome then you can report a performance bug at crbug.com and use chrome://tracing to provide additional details.

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.