Determinism Bugs, Part Two, Kernel32.dll

It was literally the day after I cracked the __FILE__ determinism bug that I hit a completely different build determinism issue. I was asked to investigate why the Windows build number reported for Chrome crashes on Windows 11 was lagging behind what was reported by winver. For example, Chrome crashes on 10.0.22000.376 were being reported as happening on 10.0.22000.318. After some code spelunking I found that crashpad retrieves the Windows version number from kernel32.dll, so I focused on that.

Aside: crashpad grabs the Windows version number from kernel32.dll instead of using GetVersionExW (which is deprecated, BTW) because the GetVersion* functions will frequently lie about the Windows version for compatibility reasons. For crash reporting we really want the actual-no-lies-we-can-handle-the-truth version number, and kernel32.dll used to be the best way to get this.

That’s when things got weird.

I used chrome://crash/ to trigger a Chrome crash and then loaded the crash into windbg, and looked at the version information for kernel32.dll with the command “lm v m kernel32”:

[Image: windbg output with the kernel32.dll version mismatch annotated]

Can you see the problem? kernel32.dll appears to be reporting that it is version .318 and .376. No wonder our crash reporting system is confused!

Then things got weirder. For some reason I looked at the crash dump on a different machine and the results were different. Now kernel32.dll was being reported as version .318 and .347. How can the same crash dump be reporting different version information? I was starting to feel a bit unhinged, and was starting to think I should resurrect my original plan of going to circus school.

But before pulling out my tight wire, juggling pins, and unicycles I decided to investigate a bit more closely. I attached windbg to a Chrome process on my Windows 11 machine and ran “lm v m kernel32” again. Now it said that its version number was consistently .318. Somehow it felt better to know that the first version number was always .318, but the second one depended on the phase of the moon.

At this point it’s important to understand how minidumps and symbol servers work.

A minidump records the minimum information needed in order to diagnose a crash. This includes the contents of the stacks from all threads, a few hundred bytes of memory from wherever registers are pointing, information about all loaded and unloaded modules, and a few other snippets. In all cases the idea is to record as little information as possible while still being able to accurately reproduce as much process state as possible. Some memory (most heap memory and global variables) is not recorded, but it is okay to have some information omitted. It is not, however, okay to have some information which is incorrect.

The minidump only records minimal information about the loaded modules, but a crash-dump analyst wants to be able to load symbols, disassemble all functions, etc., and that is where symbol servers come in. The minidump records enough information about loaded modules (a few hundred bytes) to contain the crucial identifiers which allow the debugger to download the full DLL or EXE and the PDB from the symbol servers where Microsoft and Chrome publish their DLLs, EXEs, and PDBs.

So, windbg loads a minidump, looks at the timestamp, image size, and image name information, and uses that to download the full DLL or EXE files.
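The lookup key can be sketched in a few lines. This is a minimal illustration, assuming the conventional symbol-store layout for PE files (image name, then the header timestamp as eight hex digits followed by the image size in hex, then the image name again). The function name and the header values are made up for the example and are not the real kernel32.dll values; hex case conventions also vary between tools.

```python
def symbol_server_path(image_name: str, time_date_stamp: int, size_of_image: int) -> str:
    """Build the relative symbol-server path for a PE image.

    Assumes the conventional layout:
    <name>/<TimeDateStamp, 8 uppercase hex digits><SizeOfImage, lowercase hex>/<name>
    """
    key = f"{time_date_stamp:08X}{size_of_image:x}"
    return f"{image_name}/{key}/{image_name}"


# Hypothetical header values, chosen only to show the format.
path = symbol_server_path("kernel32.dll", 0x5F3A1B2C, 0xBD000)
print(path)  # kernel32.dll/5F3A1B2Cbd000/kernel32.dll
```

Note that nothing in the key depends on the version resource, which matters for what follows.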

So…

Apparently the memory saved in the minidump contains the first version number displayed by windbg for kernel32.dll, so it is consistent. But the second version number comes from the copy of kernel32.dll downloaded from the symbol server, and that was inconsistent.

I then used Sysinternals’ sigcheck to look at the kernel32.dll DLLs in the local symbol server cache on my two development machines. It confirmed that they had versions .347 and .376. It was weird that the symbol server was returning a mismatched copy of kernel32.dll, but even weirder that it had returned two mismatched copies. A quick check of the file dates explained that. The .347 version had been retrieved on December 7th, and the .376 version had been retrieved on December 22nd. And suddenly it all made sense.

Microsoft built Windows 11 version .318. It shipped a new kernel32.dll and pushed it to its symbol servers. Then Microsoft built Windows 11 version .347. It didn’t ship the new kernel32.dll but it pushed it to its symbol servers, overwriting the previous version. Then Microsoft built Windows 11 version .376. Once again it didn’t ship the new kernel32.dll, but it pushed it to its symbol servers.

All three versions of kernel32.dll had the same timestamp, image size, and image name, so they all occupied the same slot in the symbol server, and overwrote each other. The version that was in the local symbol server cache depended on when you first retrieved that “version” (timestamp, image size, image name triplet) from the symbol server.

At this point it all made sense except for – why? Why is Microsoft building different versions of kernel32.dll that have the same symbol server identifier?

From a technical point of view it is fairly obvious what is happening. Microsoft has deterministic builds. In order for builds to be deterministic the timestamp can no longer be the actual build time. Instead the timestamp is based on a hash of the code segment and probably some other data, but crucially the timestamp hash does not include the version number. So, if only the version number changes the timestamp stays the same and the version number in the file you retrieve from the symbol server may not match what was on the user’s machine.
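A toy model shows how the collision arises. It assumes the timestamp is the first four bytes of a SHA-256 hash over the code bytes only. The hash function and inputs Microsoft actually uses are not known here, so `fake_timestamp`, the byte strings, and the image size are purely illustrative.

```python
import hashlib


def fake_timestamp(code_bytes: bytes) -> int:
    """Toy stand-in for a deterministic-build timestamp: a 32-bit value
    derived from a hash of the code only (an assumption for illustration)."""
    return int.from_bytes(hashlib.sha256(code_bytes).digest()[:4], "little")


def symbol_key(code_bytes: bytes, size_of_image: int) -> str:
    """Symbol-server style key: hashed timestamp followed by the image size."""
    return f"{fake_timestamp(code_bytes):08X}{size_of_image:x}"


code = b"\x48\x89\x5c\x24\x08"  # identical code bytes in both builds
build_318 = {"code": code, "version": "10.0.22000.318", "size": 0xBD000}
build_376 = {"code": code, "version": "10.0.22000.376", "size": 0xBD000}

key_318 = symbol_key(build_318["code"], build_318["size"])
key_376 = symbol_key(build_376["code"], build_376["size"])

# The version strings differ but the keys match, so the two files
# compete for a single slot on the symbol server.
print(key_318 == key_376)                            # True
print(build_318["version"] == build_376["version"])  # False
```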

That’s pretty annoying, actually. It’s especially annoying if you’re investigating a version-number bug like I was, but even when you’re not doing that it is confusing, and seems to violate the fundamental guarantees of symbol servers.

So, the technical explanation is simple enough, but the design question is perplexing. It seems to me that a fundamental tenet of symbol servers is that the symbol server identifier uniquely identifies a particular file. Once you break that assumption all bets are off. I am saddened that Microsoft has decided to do this, and I hope that they fix this bug. The whole concept of minidumps and symbol servers is on shaky ground until this is addressed.

These symbol-server overwrites have been going on since at least July 2020 when Michael Maltsev first blogged about them. Surprisingly enough, at least some of the experts in this area at Microsoft appear to not have been aware, and apparently I wasn’t paying close enough attention to notice until now.

You can find the three overlapping versions of kernel32.dll that I have found in this Google Drive folder. The folder also contains the minidump that started this investigation. I filed an issue for this problem on github.

What about Chromium?

I mentioned in the previous blog post that Chrome also has deterministic builds, so we have also had to deal with this problem. The solution that we settled on is to use the timestamp from the last commit as the build timestamp. This isn’t perfect – it means that some binaries that would otherwise be identical are instead slightly different – but I strongly believe that it’s better than the alternative of claiming that files are the same when they aren’t. The commit timestamp solution also means that the timestamp still has meaning as a date, and it means that hash collisions are a complete non-issue.
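A minimal sketch of the commit-timestamp idea, assuming a git checkout and using `git log -1 --format=%ct` to read the time of the last commit. This is not Chromium’s actual build script, and the function names are hypothetical.

```python
import subprocess
from datetime import datetime, timezone


def last_commit_epoch(repo_dir: str = ".") -> int:
    """Return the committer time of HEAD as seconds since the Unix epoch."""
    out = subprocess.check_output(
        ["git", "log", "-1", "--format=%ct"], cwd=repo_dir, text=True
    )
    return int(out.strip())


def as_pe_timestamp(epoch_seconds: int) -> int:
    """Check that the value fits the 32-bit PE TimeDateStamp field."""
    if not 0 <= epoch_seconds <= 0xFFFFFFFF:
        raise ValueError("timestamp does not fit in 32 bits")
    return epoch_seconds


def describe(epoch_seconds: int) -> str:
    """Show that the timestamp still reads as a real (UTC) date."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%d")


# Example with a fixed value instead of a live repository:
print(describe(as_pe_timestamp(1640131200)))  # 2021-12-22
```

In a real build the value would come from `as_pe_timestamp(last_commit_epoch())`, so two builds from different commits get different timestamps even if the generated code is byte-for-byte identical.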

The initial crashpad bug, wherein the wrong OS version was being reported in crashes, was fixed, after some discussion with Microsoft (probably on twitter), by reading the current OS version from the registry. At some point the same issue should be fixed in Chromium itself.
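For reference, here is a rough sketch of that registry lookup, using the CurrentBuild and UBR values under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion that come up in the comments below. Crashpad itself is C++; this uses Python’s standard `winreg` module only for brevity, and the function names are hypothetical.

```python
import sys


def format_build(current_build: str, ubr: int) -> str:
    """Combine the two registry values as CurrentBuild.UBR."""
    return f"{current_build}.{ubr}"


def read_windows_build() -> str:
    """Read CurrentBuild and UBR from the registry (Windows only)."""
    import winreg  # standard library, only available on Windows

    key_path = r"SOFTWARE\Microsoft\Windows NT\CurrentVersion"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        current_build, _ = winreg.QueryValueEx(key, "CurrentBuild")  # string
        ubr, _ = winreg.QueryValueEx(key, "UBR")                     # integer
    return format_build(current_build, ubr)


if sys.platform == "win32":
    print(read_windows_build())
else:
    # Not on Windows: just show the formatting with example values.
    print(format_build("22000", 376))  # 22000.376
```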

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048

19 Responses to Determinism Bugs, Part Two, Kernel32.dll

  1. Paul says:

    “filed an issue” link is wrong

  2. Richard Critten says:

    “I filed an issue for this problem on github.” – the issue link is the same as the previous Google Drive folder link

  3. Leo Davidson says:

    All the timestamps in PDB files are complete garbage (because Microsoft decided it was clever to re-use the timestamp fields for completely unrelated metadata and force us to crossreference cryptic release numbers to work out how up to date someone’s version of Windows is), so why not the version numbers as well, I guess?

    They sure do like to make our lives interesting, in between making us wait 15 minutes, even on gigabit fiber internet, for symbols to download in series over the world’s slowest protocol with the worst latency handling imaginable, while the Visual Studio UI and/or debuggee is/are completely hung, usually with zero indication anything is even happening, other than the fact the UI won’t respond, or how long is left. 😦

    I’ve given up hope of it ever improving as it’s been this way for about 15 years and MS don’t seem to care (as if they care about anything affecting other developers using their decaying platforms these days).

    • brucedawson says:

      Well, given the timestamp field’s use in symbol servers Microsoft had no choice but to adjust their use if they wanted deterministic builds (which are, to be clear, a good thing). However I agree that it is frustrating that they decided to use some sort of binary hash rather than a modified date. Setting the timestamp to the last commit time, at least on release builds, would have preserved a valuable piece of information. I hope that they fix this issue by doing exactly that.

  4. Leonardo Santagada says:

    Some questions are left unanswered:

    1) Why is crashpad using the kernel32.dll version as the windows version instead of using official win32api for that? That seems to be the main problem here.

    2) Are those kernel32.dll files exactly the same except for metadata? Then maybe the metadata should be left alone and those be made exactly the same.

    I don’t see using the hash of the data as what is used to say two binaries are the same… as in effect they are, if after all the changes done to a repo an artifact is still the same it’s much better for everyone if they are treated as the same (much easier to reason that a bug dependent only on the behavior of some dll that hasn’t changed is still there on a new windows release). Feels like that is preferable to every commit to a monorepo invalidating all artifacts produced from it like you suggested.

    (well, for everything there is exceptions of course, on the metadata might be some UAC or some other value that completely change how the dll is loaded).

    • brucedawson says:

      Good questions:
      1) Using the kernel32.dll version is pretty common because it avoids the version lies that you get when asking the OS for the version number. It used to work well when kernel32.dll was updated with every update, but now it doesn’t work as well. I’m talking to Microsoft about that.
      2) kernel32.dll is exactly the same except for the metadata. Simply not publishing the new versions to the symbol server (not overwriting) would be an acceptable solution.

      I agree that changing an artifact merely because some commit to an irrelevant file (a .md file perhaps) is not ideal. That said, the current behavior is unacceptable. I think that the current behavior is fine for files that are not pushed to a symbol server. As soon as the artifacts are pushed to a symbol server the rules change and the files have to be _identical_ or else have a different timestamp. This seems manageable – just set the timestamps differently for builds whose artifacts get pushed to symbol servers, because retrieving a file that is different – even just in the version number – is basically lying to developers.

  5. Z.T. says:

    A (timestamp, image size, image name) triplet doesn’t seem like a good idea. A hash truncated to timestamp (8 bytes?) is not enough. They should use a full hash, something like sha256, as part of the key.

    • brucedawson says:

      You might be right, but you’ll have to hop in a time machine and go back twenty years to make that suggestion. Changing how symbol servers behave at this point brings on a host of backwards compatibility problems.

      Even back then, when symbol servers were first created, they had to work within the constraints of the PE file format. A full hash of the binary was not stored in the first few hundred bytes of the image file, so it could not easily be used as a key. Any alternative “better” implementation needs to respect the constraints.

  6. Oliver says:

    Just wondering, is there any reason you can’t tap into either the native API RtlGetVersion (effectively taking OSVERSIONEXW) or KUSER_SHARED_DATA? To the best of my knowledge none of these lie. The reference to KUSER_SHARED_DATA was originally gleaned from the Windows Internals book by Yosifovich, the RtlGetVersion is something I’ve been using for a long time myself. Also enlightening the studies by Geoff Chappell (“site:”-search his page for “RtlGetVersion GetVersionEx”). Interestingly he notes on his page about RtlGetNtVersionNumbers (_not_ to be confused with RtlGetVersion): “The RtlGetNtVersionNumbers function gets Windows version numbers directly from NTDLL.” … much like you do with kernel32.dll.

    • brucedawson says:

      From a contact at Microsoft I got the advice to get the version from the registry, now that getting it from kernel32.dll can lag. Specifically they recommended going to Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion and looking at CurrentBuild and UBR. winver displays the version number as CurrentBuild.UBR. I could look at RtlGetVersion but I’ll probably follow this unofficial Microsoft advice instead.

  7. syeberman says:

    Commit timestamps can be faked (`git commit --date`). Does the Chrome build protect against two different builds having the same date?

    • brucedawson says:

      That’s an interesting question, but I think it is not a concern for Chromium builds because committers don’t have permission to commit directly and therefore I don’t think they can fake the dates. Chromium developers such as myself commit using https://chromium-review.googlesource.com (Gerrit). It enforces permissions and approvals and presumably also stops date-fakery from happening. Presumably.

  8. Chris Guzak says:

    if you specify maxversiontested values for the current OS in your manifest, GetVersionExW won’t lie.

  9. Alex says:

    In my debugging adventures, when things look _weird_ and I have a full-memory minidump at hand, I would invoke WinDBG command
    !for_each_module !chkimg @#ModuleName
    to ensure that binaries are not patched with hooks.

    However, somewhere around Win10, reproducible builds came into effect, and WinDBG would now complain like “Error for ntdll: Could not find image file for the module. Make sure binaries are included in the symbol path”. I debugged WinDBG with WinDBG and found that the problem lies in the `ext!FindSymCallBack` function, which fails because the image loaded from the symbol server has a different timestamp (this is contrary to your findings).

    Eventually I resorted to patching WinDBG code to skip timestamp test, and voila, WinDBG command now works. It however now finds version resource as “patched”, because, as you already noticed, while executable code is the same, version could differ in sibling reproducible builds. So I now use
    !for_each_module !chkimg -ss .text @#ModuleName
    to only check executable code and ignore resources.

    • brucedawson says:

      Huh. This seems very weird. I have also used that command and I haven’t seen that problem, and I don’t understand why that would happen. Were you using the latest windbg? Some old software doesn’t like timestamps with (seemingly) nonsensical dates.

      Thanks for the command example for just comparing the .text section – that seems useful.
