No Start Menu for You

I tend to launch most programs on my Windows 10 laptop by typing the <Win> key, then a few letters of the program name, and then hitting enter. On my powerful laptop (SSD and 32 GB of RAM) this process usually takes as long as it takes me to type these characters, just a fraction of a second.

Usually.

Sometimes, however, it takes longer. A lot longer. As in, tens of seconds. The slowdowns are unpredictable but recently I was able to record an Event Tracing for Windows (ETW) trace of one of these delays. With a bit of help from people on twitter I was able to analyze the trace and understand why it took about a minute to launch notepad.

Before I get in to the analysis I have two warnings/disclaimers: 1) I have a good understanding of the problem, but I do not have a solution and 2) if you are seeing identical symptoms that doesn’t mean that your root cause is the same as mine, but I will give some hints on how to see if it is.

My analysis of the trace (trace is here, installer for the analysis tools is here, analysis tutorials are here, feel free to follow along) started with my looking at the input events and Window in Focus graph in Windows Performance Analyzer (WPA), both shown below (zoomed in a bit for maximum detail):

Input events and Window in Focus graphs in WPA

The first diamond in the Multi-Input row shows when I pressed the Windows key, with subsequent key presses (including pressing enter) clumped together shortly afterwards. The units on the x-axis are seconds so we can see that all of the typing took about 3/4 of a second.

The input events are injected into the trace by my UIforETW trace-recording tool. These input events are one of the reasons I prefer UIforETW over Microsoft’s trace-recording tools. I’ve hidden the table-view but it lists which keys were pressed, while anonymizing letters and numbers to prevent UIforETW from being a key logger. Input events can be a critical tool in helping know where/when to look in traces to understand what is happening.

The input events help establish context, but the Window in Focus events really tell the tale. We can see that the SearchApp (start menu?) gains focus as soon as I press the <Win> key but then nothing else happens for more than thirteen seconds. That’s the problem, visualized.

But why?

The next step is to see what is causing the delay. A quick glance at the CPU Usage (Precise) and Disk Usage graphs showed that the CPU and disk were almost 100% idle, so the start menu must be waiting on something else:

Window in Focus, CPU Usage (Precise) and Disk Usage graphs in WPA

When a process is idle for a while when you wish that it was doing work then the challenge is to figure out what it was waiting on. I looked at the context-switch events in CPU Usage (Precise). Some of the SearchApp threads were named (yay!) but not all of them were and I couldn’t find the main thread to see what it was waiting on, so I had to poke around and hope something became obvious. I zoomed in on the burst of CPU activity just before notepad launched and I noticed that WerFault.exe and wermgr.exe both started getting busy. Correlation is not causation, but it sure is suspicious.

Note that WER stands for Windows Error Reporting – the system that sends crash dumps back to Microsoft for analysis so that software reliability can be improved

Looking at the Processes table showed me that the command line for WerFault.exe was “C:\WINDOWS\system32\WerFault.exe -u -p 17804 -s 2124”. That suggests that Windows Error Reporting was being asked to record information for crashed process 17804, and when I looked in the Processes table for that Process ID (PID) I found “RuntimeBroker.exe <Microsoft.Windows.Search> (17804)”. Well now. Doesn’t that name look relevant?

A look at all of the “Transient” processes (those that started or ended during the trace timeline) was quite revealing:

Window in Focus and Processes (Lifetime By Process) graphs in WPA

WerFault.exe and RuntimeBroker.exe (17804) (the top of the two RuntimeBroker.exe processes) were both running when I started recording the trace and both ended at about the same time, and WerFault.exe was handling a crash in RuntimeBroker.exe. Notice also that a new copy of RuntimeBroker.exe starts running when the old copy goes away. Now we’re starting to have an explanation:

  1. RuntimeBroker.exe crashes
  2. WerFault.exe deals with the crash, keeping the RuntimeBroker.exe process open
  3. Then a new RuntimeBroker.exe launches and provides whatever it is that SearchApp.exe needed

Now we have a new question: why is WerFault.exe sitting idle for so long?

I looked at the CPU Usage (Precise) data and saw that WerFault.exe has at least thirteen threads, none of them named (come on Microsoft – thread names are really helpful!) but the main thread was easily identifiable as the one using the most CPU time. I then sorted by Time Since Last and noticed that at one point the main thread had been waiting to run for 15.572 s. In fact it was probably waiting even longer, but the start of its wait was before the start of the trace and therefore unknowable. You can find more details on how to do idle-analysis here.

The stack where the main WerFault.exe thread was waiting for 15.572 s is shown below:

WerFault.exe wait call stack

The summary would be that it was waiting in UploadReport.

RuntimeBroker.exe crash call stackSo now we understand the problem. RuntimeBroker.exe crashed (due to heap corruption, according to the call stack in the RuntimeBroker.exe crash dump, shown to the right) and it took more than 15 seconds to upload the crash dump, presumably due to my flaky hotel WiFi. During this time my start menu was inoperable.

This deserves reiterating. My start menu was hung due to the combination of heap corruption and WerFault.exe deciding that it needed to upload the crash dump before releasing the old process so that a new one could be started.

It took two bugs (the heap corruption and the upload-before-restarting) to make this hang happen, but happen it did.

Secondary WerFault.exe hung threadWe can even go deeper. The UploadReport function was blocked for 15.567 s and the Readying Process/Readying Thread Id shows us who ultimately unblocked the function. That turned out to be another WerFault.exe thread which was blocked in some CHttpRequest functions, as show above. That doesn’t add significantly to the understanding of the problem, but does demonstrate nicely how you can trace a hang backwards through multiple processes and threads.

Watching for this problem

In general if you want to understand why your computer is performing badly you need to record and analyze a trace. However if you want to see if you are hitting this particular problem then there are easier steps that you can follow.

The first step is to configure the local recording of crash dumps. This is a good idea in general because it lets you monitor the stability of your computer over your time.

Then, with crash dumps being recorded if you see a Start Menu hang you can just look in %localappdata%\crashdumps and see if there is a recent RuntimeBroker.exe crash. If so then you are presumably seeing this bug.

Waiting on upload

Raymond Chen gives several reasons for why Windows Error Reporting doesn’t restart crashed processes before uploading the report (circa 2012) but I don’t find those reasons entirely compelling, especially in the start-menu case. As long as you kill the old process before starting the new one – and answer the DLL-version questions from a crash dump – most of the problems he points out are avoidable. Exponential backoff on process restarts can address the rest. And, the consequences of waiting can, as we have seen, be arbitrarily long start-menu hangs, with no indication that a crash is the problem. There is also some confusion about the behavior – maybe the design has changed in the last ten years.

Fixing the crash

Application verifier settings - pageheap enabled for RuntimeBroker.exeHeap corruption bugs can be extremely difficult to find and fix, but this one seems like it might be easy. I turned on pageheap on RuntimeBroker.exe, killed the relevant version to get it to restart and apply the pageheap settings and it started crashing every time I opened the start menu. I configured WER to save full crash dumps and soon had a half-dozen crash dumps with full details of what was happening.

The crashes normally happen on this call stack:

Crash with this on the stack: windows_cortana_Desktop!Windows::Cortana::EdgeProfileInfoLifetimeTraits::Destroy<Windows::Cortana::EdgeProfileInfo>

With pageheap enabled the crash happens on a very similar call stack, but slightly earlier. The crash happens earlier (and more reliably) because with page heap when you free memory it is unmapped, so dereferencing it reliably causes a crash, instead of reading from the freed memory:

Crash with this on the stack: windows_cortana_Desktop!Windows::Cortana::EdgeProfileInfoLifetimeTraits::Destroy<Windows::Cortana::EdgeProfileInfo>

The crash happens when dereferencing [rcx] so I ran its value through the the !heap command (see my pageheap blogpost for details) and got this call stack:

Complicated call stack showing when memory was freed

The only complication is that this doesn’t happen on all Windows 10 machines. There seems to be some required state that makes it happen or not and I don’t know what is. I will say that I’m happy to share the crash dumps with anyone at Microsoft who wants to investigate.

I don’t know the code and don’t understand what is happening but I’ve dealt with enough use-after-free bugs to say that this is probably straightforward to investigate and fix using the crash dumps. Although, I got a couple of different crash call stacks so there might be multiple bugs.

Conclusions

I ended my initial twitter thread by saying that these hangs were making me cranky and had me wondering if Windows 10 was abandonware. Since then I’ve been told that people seem to be seeing start-menu hangs on Windows 11, but that is not necessarily the same problem.

To be clear, Microsoft has the technology to record traces on start menu hangs on customer machines. These traces would show roughly the same thing as my trace. They also receive crash dumps from customer machines. They might even have a way of correlating them (if not then they should hook that up). And they created pageheap which makes use-after-free crashes easy to investigate.

So, why hasn’t this been addressed? On my laptop I see that RuntimeBroker.exe has crashed, on average, every second day this year. That is too many start-menu hangs for my tastes. I don’t know how long it has been happening so maybe a fix is on the way – if so that would be great to hear. If not then I will continue to be cranky and I hope that this serves as a good reminder of the importance of using all that fancy telemetry to address issues like this.

Or, maybe I’m just unlucky and I’m one of the few people (not the only person) who is hitting this crash.

In short, I am really pleased with the tools that Microsoft has created and released to let me analyze performance issues such as these. However I wish that I didn’t have to use them so often on Windows itself.

Online discussion

Hacker news

Initial twitter discussion

Twitter announcement of blogpost

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048
This entry was posted in Code Reliability, Debugging, Investigative Reporting, Performance, Programming, Rants, uiforetw, xperf and tagged , , . Bookmark the permalink.

19 Responses to No Start Menu for You

  1. djmips says:

    Funny – after reading I went to check my CrashDumps folder and I had 10 RuntimeBroker.exe dmp files with roughly the same call stack and all starting on 1/14 2023 with the most recent one from today at 11:06 PM (1/17 2023)

  2. anreiz says:

    Crashdumps shows RuntimBroker.exe.dmp from 2 days ago & 8 days ago. Always helps the crankyness to better understand the issue so kudos!

  3. ignacy130 says:

    Microsoft has a huge problem with anything related to search. Windows Search is unusable, but I’ve counted on the ability to quickly launch apps via win+r… for nothing. It’s hard to believe that a quick text search is so hard and one need to use Everything or some 3rd party launchers.

  4. kasper93 says:

    I have just-in-time debugger enabled on my main machine. I don’t like when something is crashing in the background without my knowledge. I would rather fix or disable offending party.

    That said I’ve seen this RuntimeBroker.exe crash, but only few times, definitely nothing reproducible enough to care about. I’ve seen also WindowsSearch.exe crash some time ago, but this doesn’t happened anymore.

    Nice writeup, always getting to the bottom of things 🙂

  5. Stuart Axon says:

    Thank you for looking at this!

    I use Win-Key+typing in GNOME on Linux to launch everything, but I’ve always found this really painful on windows for some reason – I can’t pin it down but it feels very janky.

  6. randommarius says:

    I had similar issues. Windows search also does too much phoning home in my opinion.

    My solutions was to use their Power Toys version of Run. It’s much faster and more configurable.

  7. ALegg says:

    I’ve observed similar crashes on Win11, where, after some VPN/ISP issues, dns requests were timeouting (& whole internet was not working on that machine for ~60sec) – that also caused start menu to be unusable, because apparently searching the web first is more important for their business, than just showing “cmd” as App to the user, and as result it showed nothing at all.

    No idea was it related to the issue You found, as I didn’t check the dumps (it was 3-4 weeks back), and since that was my first contact with Win11, it certainly did increase my inner hate toward “MS/new UI fancy things/breaking basic stuff with pointless features I will not use”.

    Thanks Bruce for being vocal about it, You’re the hero we don’t deserve!

  8. Klaus Kjærgaard says:

    I had some of the same issues – just with start menu being too slow. I disabled web search for start menu, and it works, is fast, only finds apps (and other local stuff; documents, folder, settings – but only when you navigate to that specified search target).
    Bottomline – turning off web search in start menu = speed

    regedit/use at own discretion:
    HKCU\Software\Policies\Microsoft\Windows\Explorer
    “DisableSearchBoxSuggestions”=dword:00000001

  9. Alan Auerbach says:

    Hey Bruce!

    I just became aware of this post. I’ll check the trace to see if there’s anything unexpected with regard to the WER behavior, but in general, as Raymond noted, we hold the crashing process during our initial exchange with Watson to see if there are any additional (secondary) requests for data that need the original process still alive to satisfy. That initial exchange has a timeout value of 15 seconds for HTTP connect and two minutes for HTTP response. For certain processes (e.g., the shell proper, DWM) we have a much smaller timeout, as we know those processes are crucial to user responsiveness. This case is interesting because it’s a RunTimeBroker.exe processes servicing the shell. We’ll look into this to see if we can detect and improve that case.

    Ordinarily, once that initial exchange with Watson is complete, we release the crashing process and dump from the cloned process. One exception to that is when the user has configured local dumps. 🙂

    • brucedawson says:

      Wait – so configuring local dumps might be making the hang worse? I don’t understand the logic behind that.

      For small-ish processes it would be more efficient to do a full memory dump and then satisfy _all_ questions from that, rather than making the user wait. At least in this case. I know, it’s complicated.

      What about the crash, however? I haven’t heard anything back on whether this crash was known, is fixed, will be fixed, etc. I’m hopeful that nobody has asked for my crash dumps because the bug is already fixed, but nobody has actually said that.

      And, I think I saw at least two different crash call stacks when I enabled pageheap, so I suspect that there are actually two bugs.

      Thanks for the information. Let me know if you need anything more information.

  10. Aaron Avery says:

    Thanks for researching that, Bruce. It’s been quite a while since my start menu was that bad (or even non-functional). I landed on Keypirinha as my configurable third-party launcher of choice and it provides a much better experience for most things. Once in a while I’ll want some control panel (Settings) page that it refuses to find and I go back to the start menu for that. But having it index things like RDP machine names and Putty sessions saves time.

  11. akraus1 says:

    I had also long start menu popup times because I had once connected with an business account. Since then every start menu popup would check if my machine is part of an Active Directory. It kept trying until I deleted the cached credentials.

    Could I get your ETW Trace to check if ETWAnalyzer could help to make similar analysis easier?

    ETWAnalyzer -extract All -fd xxx.etl -symserver MS
    cd Extract
    ETWAnalyzer -dump process -crash -timefmt s
    ETWAnalyzer -dump CPU -processName WerFault;RuntimeBroker -methods * -MinMaxWaitMs 1000 -FirstLastDuration s s

    That should give you good insights. It also supports to resolve symbols from Json files which are much smaller and quicker to transfer (https://github.com/Siemens-Healthineers/ETWAnalyzer/releases/tag/2.5.4.0)

  12. akraus1 says:

    EtwAnalyzer -extract all -fd “D:\Media\Blog\BruceDawsonStartMenuHang\2023-01-12_07-58-22 slow start menu.etl” -allcpu -SymbolServer ms
    Since not much CPU is used I extract all CPU with -allCPU which extracts data also for methods with etwanalyzer %f% -dump process -crash
    12.01.2023 15:58:01 2023-01-12_07-58-22 slow start menu
    PID: 17024 Start: Stop: 2023-01-12 07:58:17.585 Duration: RCode: 0 Parent: 17804 WerFault.exe -u -p 17804 -s 2124
    PID: 17804 Start: Stop: 2023-01-12 07:58:17.596 Duration: RCode: HEAP_CORRUPTION Parent: 1152 RuntimeBroker.exe -Embedding

    which shows Heap corruption right there.

    The HTTP upload is more tricky since it is async. The last seen time would be relevant here

    D:\>etwanalyzer %f% -dump cpu -pn wermgr;runtimebroker -methods *upload* -fld s s
    CPU ms Wait ms Ready ms Last-First First(s) Last(s) Method
    12.01.2023 15:58:01 2023-01-12_07-58-22 slow start menu
    wermgr.exe(23136) + -upload
    1 ms 0 ms 0 ms 0,000 s 16.803 16.803 CHttpRequest::UploadData
    9 ms 934 ms 0 ms 0,945 s 15.858 16.803 TpUpload
    9 ms 934 ms 0 ms 0,945 s 15.858 16.803 CHttpRequest::UploadAndFetchResource
    10 ms 46 ms 0 ms 0,057 s 15.784 15.841 CReportManager::UploadReport
    24 ms 934 ms 1 ms 0,962 s 15.841 16.803 CWatsonTpTransport::DoUpload
    60 ms 56 ms 2 ms 0,126 s 15.715 15.841 DoCoreUpload

    When you know that RuntimeBrokder with PID 17804 did crash you can query for the methods just before the wer method was called
    This uses -zt (ZeroTime) feature to dump the last seen methods 1s before WerpWaitForCrashReporting was called of the crashed process. That can be handy to see what did happen:

    D:\>EtwAnalyzer %f% -dump CPU -pn 17804 -methods * -zt Last WerpWaitForCrashReporting -fld s s -MinMaxLast -1 0 -SortBy Last -TopNMethods 40
    CPU ms Wait ms Ready ms Last-First First(s*) Last(s*) Method
    12.01.2023 15:58:01 2023-01-12_07-58-22 slow start menu
    RuntimeBroker.exe(17804) – -Embedding
    5 ms 187 ms 0 ms 0,013 s -0.013 -0.000 ThreadInvokeWorker
    5 ms 187 ms 0 ms 0,013 s -0.013 -0.000 RPC_INTERFACE::DispatchToStubWithObject
    5 ms 187 ms 0 ms 0,013 s -0.013 -0.000 DispatchToStubInCNoAvrf
    5 ms 187 ms 0 ms 0,013 s -0.013 -0.000 RPC_INTERFACE::DispatchToStubWorker
    1 ms 185 ms 0 ms 0,013 s -0.013 -0.000 RtlFreeHeap
    0 ms 185 ms 0 ms 0,013 s -0.013 -0.000 Windows::Foundation::Collections::Internal::SimpleVectorView<Windows::Cortana::EdgeProfileInfo,Windows::Foundation::Collections::Internal::Vector<Windows::Cortana::EdgeProfileInfo,Windows::Cortana::EdgeProfileInfoEqualPredicate,Windows::Cortana::EdgeProfileInfoLifetimeTraits,Windows::Foundation::Collections::Internal::VectorOptions >,Windows::Cortana::EdgeProfileInfoLifetimeTraits,XWinRT::IntVersionTag,1>::`scalar deleting destructor’
    0 ms 185 ms 0 ms 0,013 s -0.013 -0.000 Microsoft::WRL::Details::RuntimeClassImpl<Microsoft::WRL::RuntimeClassFlags,1,1,0,Windows::Foundation::Collections::IVectorView,Windows::Foundation::Collections::IIterable,Microsoft::WRL::FtmBase>::Release
    0 ms 185 ms 0 ms 0,013 s -0.013 -0.000 ::operator
    0 ms 185 ms 0 ms 0,013 s -0.013 -0.000 ObjectMethodExceptionHandlingAction< >
    0 ms 185 ms 0 ms 0,013 s -0.013 -0.000 CStdIdentity::ReleaseCtrlUnk
    0 ms 185 ms 0 ms 0,013 s -0.013 -0.000 CStdMarshal::DisconnectWorker_ReleasesLock
    0 ms 185 ms 0 ms 0,013 s -0.013 -0.000 CStdMarshal::Disconnect
    0 ms 185 ms 0 ms 0,013 s -0.013 -0.000 CRemoteUnknown::RemReleaseWorker
    0 ms 185 ms 0 ms 0,013 s -0.013 -0.000 Microsoft::WRL::Details::RuntimeClassImpl<Microsoft::WRL::RuntimeClassFlags,1,1,0,Windows::Foundation::Collections::IVector,Windows::Foundation::Collections::IIterable,Microsoft::WRL::FtmBase>::Release
    0 ms 185 ms 0 ms 0,013 s -0.013 -0.000 Windows::Foundation::Collections::Internal::Vector<Windows::Cortana::EdgeProfileInfo,Windows::Cortana::EdgeProfileInfoEqualPredicate,Windows::Cortana::EdgeProfileInfoLifetimeTraits,Windows::Foundation::Collections::Internal::VectorOptions >::`scalar deleting destructor’
    0 ms 185 ms 0 ms 0,013 s -0.013 -0.000 Windows::Cortana::EdgeProfileInfoLifetimeTraits::Destroy
    0 ms 188 ms 0 ms 0,013 s -0.013 -0.000 ZwWaitForMultipleObjects
    0 ms 185 ms 0 ms 0,013 s -0.013 -0.000 WerpWaitForCrashReporting
    0 ms 185 ms 0 ms 0,013 s -0.013 -0.000 RtlReportExceptionHelper
    0 ms 185 ms 0 ms 0,013 s -0.013 -0.000 RtlReportException
    0 ms 185 ms 0 ms 0,013 s -0.013 -0.000 RtlReportFatalFailure$filt$0
    0 ms 185 ms 0 ms 0,013 s -0.013 -0.000 __C_specific_handler
    0 ms 185 ms 0 ms 0,013 s -0.013 -0.000 RtlpExecuteHandlerForException

    • brucedawson says:

      That’s interesting that the process return code is in the trace – I didn’t know that.

      Getting RuntimeBroker CPU activity before it crashed will work poorly because it crashed before the trace recording started. That said, you did seem to catch something interesting. Is that from context switches or sampling? It’s a bit hard to read – it feels like some stack entries need some template goop elided for better readability. Plus, I’m not sure what the time-stamp/function-name lines actually mean. It looks like a call stack, but why is the (mostly unchanging) time listed on each line?

      Interesting stuff

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.