GDI leaks and the importance of luck

Machu PicchuIn May 2019 I was asked to look at a potentially serious Chrome bug. I initially misdiagnosed it as unimportant, thus wasting two valuable weeks, and when I rejoined the investigation it was the number one browser-process crash in Chrome’s beta channel. Oops.

On June 6th, the same day I realized I had misinterpreted the crash data, the bug was marked as ReleaseBlock-Stable meaning that we couldn’t ship our new Chrome version to most of our users until we figured out what was going on.

The crash was happening because we were running out of GDI (Graphics Device Interface) objects, but we didn’t know what type of GDI objects, our crash data gave us no clues as to where the problem was happening, and we couldn’t reproduce the problem.

Several of us worked hard on the bug on June 6th and 7th, testing out theories but not making any clear progress. Then, on June 8th I went to check my email and Chrome immediately crashed. It was the crash.

The irony is rich. While I was searching through changes and crash reports and brainstorming about what could possibly be causing Chrome’s browser process to leak GDI objects the GDI object count in my browser was creeping inexorably upward, and on the morning of the 8th it tipped over the magic 10,000 number. At that point one of our GDI object allocations failed and we intentionally crashed the browser. This was an improbably lucky break.

Bugs that are reproducible are, inevitably, fixable. All I had to do was figure out how I had triggered the bug and we would be able to fix it.

First, some background

imageIn most places in Chromium’s code when we try to allocate a GDI object we check to see if the allocation succeeded. If the allocation failed then we record some information on the stack and then intentionally crash, as you can see in this source code. We intentionally crash because if we can’t allocate GDI objects then we can’t render to the screen – it’s better to report the issue (if crash reporting is enabled) and restart the process rather than display a blank UI. The default limit is 10,000 GDI objects per process and we usually use just a few hundred so if we hit that limit then something has gone terribly wrong.

When we get one of the crash reports showing a GDI object allocation failure we have a call stack and all sorts of other useful information. Great! The problem is that these crash dumps aren’t necessarily related to the bug. That’s because the code that is leaking GDI objects and the code that reports the failure may not be the same code.

That is, roughly speaking we had these two types of code:

void GoodCode() {
   auto x = AllocateGDIObject();
   if (!x)
     CollectGDIUsageAndDie ();
   UseGDIObject(x);
   FreeGDIObject(x);
}

void BadCode() {
   auto x = AllocateGDIObject();
   UseGDIObject(x);
}

The good code notices when an allocation has failed and reports it, whereas the bad code both ignores failures and leaks the objects, thus setting up the good code to take the fall.

Chromium is a few million lines of code. We didn’t know which function was buggy, and we didn’t even know what type of GDI object was being leaked. One of my coworkers added code to troll through the Process Environment Block before crashing, to get counts by GDI object type, but for all of the enumerated types (DCs, regions, bitmaps, palettes, fonts, brushes, pens, and unknown) the counts totaled less than one hundred. Weird.

It turns out that objects we directly allocate are in this table, but objects created on our behalf by the kernel are not, and live somewhere in the Windows object manager. This means that GDIView is just as blind to the problem as we are (and GDIView is only helpful anyway with a local repro). Because we were leaking cursors, and cursors are USER32 objects with associated GDI objects, and those GDI objects are allocated by the kernel we were blind to what was happening.

The misinterpretation

Our CollectGDIUsageAndDie function has a catchy name I think you’ll agree. Very evocative.

The problem is that it did too much. CollectGDIUsageAndDie checked for about a dozen different types of GDI object allocation failures and, due to inlining, they all ended up getting the same crash signature – they all crashed in the main function and got bucketed together. So one of my coworkers wisely landed a change to break out the different checks to different (non-inlined) functions. This meant that we could now tell at a glance which check was failing.

Unfortunately this meant that when we started getting crash reports from CrashIfExcessiveHandles I confidently said “this isn’t a crash spike – it’s just a signature change.”

That was a mistake. It was a crash spike and a signature change. It was a dessert topping and a floor wax. Oops. Sloppy analysis there Dawson. No cookies for you.

Back to our story

At this point I knew that something I had done on June 7th had used up almost 10,000 GDI objects in a day. If I could figure out what it was then I could crack the case.

imageWindows’ Task Manager has an optional GDI objects column which can be used to look for leaks. On June 7th I was working from home, connecting to my machine at work, and I had that column enabled on my work machine as I ran tests and tried to come up with a repro scenario. But meanwhile it was the web browser on my home machine which was leaking GDI objects.

The main thing I had been using the browser on my home machine for was to connect to my work machine, using the Chrome Remote Desktop (CRD) app. So, I enabled the GDI objects column on my home machine and started experimenting. It didn’t take long to get results.

In fact, the bug timeline shows that it took just 35 minutes to go from “I’ve hit the crash” (2:00 pm) to “it’s something to do with CRD” to “it’s cursors.” Did I mention how much easier bug investigations are when you have a local repro?

It turned out that every time the CRD app (or any Chrome app?) changed cursors it would leak six GDI objects. Wiggling the mouse over the right part of the screen when using Chrome Remote Desktop could leak hundreds of GDI objects per minute, and thousands per hour. Once I started paying attention to the GDI objects column in task manager this quickly became apparent – I could see the GDI objects count leap up whenever the cursor changed.

After a month of no progress this problem had gone from intractable to a simple fix. I created a hack fix and then one of my coworkers (apparently I didn’t do any actual work on this bug) created the real fix. It was uploaded at 11:16 am on June 10th, and landed at 1:00 pm. A few merges later and the bug was vanquished.

That’s it?

Fixing bugs is nice, but what we really want is a way to make sure the bugs never come back. Using C++ objects to manage resources (RAII FTW!) is clearly the right thing to do, but in this case our WebCursor class had a bug.

When it comes to memory leaks there is a robust set of systems available. Microsoft has heap snapshots, Chromium has heap profiling in the wild and leak sanitizer on test machines. But GDI object leak detection seems to have been neglected. The Process Information Block has incomplete information, some GDI objects can only be enumerated in kernel mode, and there isn’t a single chokepoint for allocating or freeing objects to allow easy tracing. This wasn’t the first GDI object leak I’ve dealt with and it won’t be the last, because there isn’t a robust way of tracking them. Some suggestions for Windows-Next:

  • Make it trivial to get counts for all GDI object types, without requiring arcane PEB reading (and without skipping cursors)
  • Have a supported way to intercept and trace all GDI object creations and destructions, for robust tracking – including those created indirectly
  • Make sure this is documented

That’s it. This sort of tracking wouldn’t even be that hard, since GDI objects are necessarily constrained in a way that memory isn’t. It would be great to make these quirky but unavoidable GDI objects safer to use. Please?

Reddit discussion is here. Twitter thread starts here.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048
This entry was posted in Bugs, Chromium, Debugging, Programming and tagged . Bookmark the permalink.

24 Responses to GDI leaks and the importance of luck

  1. My strategy with this type of thing is often to build an inbetween ‘manager’. So you get gdiManager->AllocateGDIObject(). Where you can then add an optional registration to see what is being eaten (and a nice DebugPrint() function or even SaveState(), CompareState() to check changes). Second strategy is to name each and every resource object you use. It’s so nice to be able to print a list of objects and say: ‘GDI object 464 “TabCursor” ‘ instead of having only numbers. It helps immensely when you then see loads of ‘TabCursor’ entries. 😉
    Ofcourse, this means you need the discipline of no-one allocating their own GDI objects. If everyone goes through classes like ScopedDC then it’s easy. It depends.

    • brucedawson says:

      Sure, that’s a good strategy, but how do you implement it? It’s difficult to find a complete list of GDI *types*, and a complete list of functions that allocate and free GDI objects is even more difficult to obtain.

      So, yes, I agree with that strategy and I have done it on smaller projects. But if you end up with a leak anyway then that means that you have missed instrumenting a GDI object allocation – how do you find it? A truly robust system would – like for memory – offer a *guarantee* that all allocations could be tracked, instead of a best effort. A GDI-object-allocation/deallocation call back would be perfect.

      • I admit you must have control over all the source code. Perhaps there are hooks in Win32? Sometimes I’ve used something like ‘#define AllocateGDIObject xxx’ so that source that still uses direct calls will fail to compile. (and in your manager ‘#undef AllocateGDIObject’. True, you’ll have to refactor all the code that’s using GDI’s but sometimes its worth it, given enough frustration. 😉 I did something similar with Windows handles, which are also ungraspable it seems. They come & go asynchronously. Never really got it nailed entirely, and these days where there is so much code and external party dll’s, it’s hard to get a grasp.

        • brucedawson says:

          Yep, definitely possible. But it’s hard to work up the enthusiasm for making this sort of change throughout all of Chromium (together with the dozens/hundreds of open source projects it includes) when I’m not even aware of a complete list of all functions that allocate GDI objects. In particular, the object we were directly leaking was a cursor, which isn’t even a GDI object.

          We could, and maybe we should, but it’s been much more cost effective to use wrapper classes and investigate as needed – and hope we don’t have another leak scare. We will see.

  2. There are programs that display the separate types of allocated GDI objects for each process. Like GDIView (https://www.nirsoft.net/utils/gdi_handles.html).

    • brucedawson says:

      Do you know if they display counts of cursor objects? That is one that we could not get.

      But what we really need is code to do this, so we can get the numbers from our customers who are hitting the bug. Local numbers are less useful.

  3. Back in the days of Win16 I worked on a graphically intense product, and in those days the GDI heap was shared across all apps, so running out of GDI was a serious crash risk even if we weren’t the one using all the objects. The niche nature of the product meant a lot of our users also used certain other products… and if one of those products crashed from a lack of GDI objects somehow we’d get the blame.

    So we wrapped the GDI objects with our own C++ GDI objects and would poll the GDI heap regularly, and when things were tight, our C++ wrappers could release their GDI object, and then try to allocate it again when it was next required. And many of the GDI wrappers (eg pens, fonts) would, if the allocation failed, instead get and use a stock pen or stock font as they’re always available and so we always had a valid handle to use.
    https://docs.microsoft.com/en-us/windows/win32/api/wingdi/nf-wingdi-getstockobject

    So when GDI resources were tight, our program would slow down in its redraws as cached objects were freed and reallocated, and in extreme cases the UI would turn monochrome and chunky (stock pens are black and white etc, stock font was the system font), but we wouldn’t crash. And if some other program crashed, we could at least point out that we’d actually visibly released almost all of our GDI handles.

    • brucedawson says:

      Interesting. One of the GDI objects we would hit failures on is a device context in a paint message. I don’t know how a program is supposed to continue from there.

      Being resilient in the days of shared GDI object limitations makes sense. In today’s context it just means papering over a bug – better to restart (and restore tabs) rather than limp along.

      • Very much agreed, and sure, not all GDI objects have stock implementations. It was more to say that as a technique it gave the user (or at least the developer when using your own software) a strong visual clue when handles were in short supply and we could effectively audit how many handles we had active at any one time and how many were shared behind the C++ wrappers etc.
        These days we regularly find Outlook or other “enterprise grade” apps choking on thousands of handles (not just GDI) where they’re presumably thought to be effectively unlimited, until they cause such issues.

  4. rdonch says:

    It looks like the fix might have a slight bug in it too – if a WebCursor object is assigned to itself, its platform data will be freed, leaving it unusable.

  5. Pavel Medvedev says:

    Hi Bruce,

    thanks for the story!

    It seems this bug was caused by the “Rule of 3” violation (in C++03). And in C++11 I would prefer to have such resources non-copyable, only moveable.

    I wonder if it were possible to prevent this kind of errors with a static code analyzer.

  6. Adrian says:

    I was under the impression that all the painting in Chrome used DirectX APIs rather than GDI, and that modern versions of Windows had distinctly separate stacks rendering with them. Sure, I could imagine a handful of GDI objects for compatibility (like the aforementioned cursors, and perhaps some window icons for the desktop manager), but those might number in the dozens rather than hundreds.

  7. Thanks, great writeup! One thing I didn’t get is – how did you arrive at the conclusion “it’s cursors”?

    • brucedawson says:

      I guess I could have been clearer about that (I just added a sentence) but I was just watching the GDI objects column in task manager to see what made it go up, and I quickly found that everytime the cursor changed in the CRD app the GDI object count would go up by six.

  8. akraus1 says:

    In the past I have used memory dumps to check the allocated GDI objects: https://aloiskraus.wordpress.com/2016/06/25/show-gdi-handles-by-type-in-windbg/ but you are right that cursors are not listed there. It did work reasonably well and I still find it hard to believe that we have Handle Leak tracing with ETW but not GDI/User Object tracing which is easy to implement but no one wants to invest in “legacy” stuff at MS. Someone notified me that this approach may be broken because that kernel memory which is mapped into your address space is now no longer accessible due to some past security fixes… The fun part is that MS has been lingering around what UI technology we should use instead. Now more and more Electron/Chromium based UIs are created which are truly cross platform. I guess I will still debug GDI leaks inside the browser in the future ;-). I have done also a specific ETW enabled hooking library to detect similar leaks (http://geekswithblogs.net/akraus1/archive/2016/01/30/172079.aspx) when I did know the exact allocating methods. That was fun to build and I did learn a good amount of how x64 exception handling works which is really hard stuff.

    • brucedawson says:

      ETW has handle leak tracing? That sounds handy. I’ll have to look into that. I’ve heard that GDI tracing is available in the next Windows 10 SDK – I’m unsure about details.

      • akraus1 says:

        WPRUI has Handle Usage which includes the usual handle types. I think you know this? I have written about it here: http://geekswithblogs.net/akraus1/archive/2016/03/14/173308.aspx. This is also exposed via xperf.

      • Chris Davis says:

        You can use the new HUD (Heads Up Display) tool from MS. By far the best tool for diagnosing GDI leaks I have come across.

        HUD (“Head Up Display”) is a performance profiler that monitors the following data in real time or via Event Trace Log (*.etl) analysis, showing associated call stacks for create/destroy/read/write operations as appropriate. HUD displays the following information in real-time: module load/unload history, thread create/destroy history, registry activity, file I/O activity, hangs (soft and hard), GDI/User handle activity, GDI leak detection, heap allocation activity, insights that call out perf issues automatically detected by HUD. No per-app instrumentation or setup is required. Just attach to a running process or have HUD launch and attach to a new process and you’re up and running. NOTE: HUD will resolve call stack symbols based on the symbol path specified in the _NT_SYMBOL_PATH environment variable, or the MS public symbol server if _NT_SYMBOL_PATH is not set.

        https://www.microsoft.com/en-us/download/details.aspx?id=100813

        • akraus1 says:

          HUD is great! Is the hudconfig file format described somewhere? I would like to configure sever symbol paths into HUD or I will resort back to _NT_SYMBOL_PATH as you have mentioned. What I miss is the support for JIT compiled managed code. The call stacks there show only raw numbers which are not helpful.

          For Win32K Handle Tracing it seems to use
          “Microsoft-Windows-Win32k”:0x8000030000240000:0x4
          These enable for the Win32K Provider
          0x40000 UIPI
          0x200000 ThreadInfo
          0x20000000000 UserHandleOperation
          0x10000000000 GdiHandleOperation
          0x8000000000000000 Microsoft-Windows-Win32k/Tracing

          where I suspect that Use/GdiHandleOperation are the new flags which are needed to enable GDI/User leak tracing since Windows 10 1903. Along with call stacks a very welcome addition. WPA can decode these events but it misses the correlation feature.

          The kernel session enables AntiStarvation which is a flag which is not known to me. Is that somewhere described? Besides these custom kernel stack flags are enabled

          What are these for? Any hints? I have created a wpr HUD profile which lets you tailor what you want to record. Later you can snip out the process in question of a huge ETL file which is the main use case for this tool I think.

  9. Pingback: GDI leaks and the importance of luck | Random ASCII – tech blog of Bruce Dawson – stopthefud – ProjectQ

  10. akraus1 says:

    The stack tracing flags have not survived. Here are they: Stack Tracing: ImageLoad 0x0443 0x0444 0x0449 ThreadCreate CSwitch ReadyThread CreateKey OpenKey DeleteKey QueryKey SetValue DeleteValue QueryValue 0x0911 0x0912 0x0913 0x0914 0x0915 Profile ImageUnload.

    Most interesting are of course the hex numbers. What type of stacks are enabled in the kernel there?

  11. What do you think about lowering the maximum number of GDI objects on debug builds to e.g. 1000? At Apple we did something similar where we lowered the memory ceiling of various programs on debug builds so if they went above where we would expect the app would get killed and we would get a crash log and some memory info. There are certainly problems with this, especially if you rely on the build at all — we did this with SpringBoard, and we had set the memory limit too low, so it was crashing fairly often (~once a day) and every time it crashed it brought every other app on the system down with it. Eventually people complained at us too much for it to be worth doing and we raised the memory limit.

    • brucedawson says:

      I think it’s a valid strategy. Doing that for canary builds would be particularly good because those typically update daily so you a slow leak could easily fail to reach 10,000 before a restart. It’s all a delicate balancing act to flush out the problems without causing “false positives”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.