When I run into a problematically slow program I immediately reach for a profiler so that I can understand the problem and either fix it or work around it.
This guidance applies even when the slow program is a profiler.
And so it is that I ended up using Windows Performance Toolkit to profile Windows Performance Toolkit. Again. And not for the last time. The good news is that once again I was able to learn enough about the problem to come up with a very effective workaround (now available in UIforETW) in the Traces list context menu, as Strip chrome symbols, easily extendable for other problematic PDBs.
I’ve previously documented slowdowns and unreliabilities with WPT symbol loading in Xperf Symbol Loading Pitfalls. Today’s blog post documents a new problem with slow WPT symbol loading.
And, it appears that I’m not the only person hitting new WPT/WPA symbol loading slowdowns (details here):
What’s my motivation?
I recently moved to Google where I’m working on Chrome for Windows. It wasn’t long before I fired up xperf (aka Windows Performance Toolkit or WPT) to see what Chrome’s performance looked like. I profiled the public build and used the public symbol server to automatically download full symbols.
(you can find the Chrome symbol server at https://chromium-browser-symsrv.commondatastorage.googleapis.com, changed from http for greater security)
I soon noticed that Windows Performance Analyzer (WPA) was taking a very long time to load the symbols for chrome.dll and chrome_child.dll. The symbols for these two DLLs were taking a total of about 25 minutes to be loaded which was making using the profiler quite cumbersome.
WPT translates symbols into .symcache files in c:\symcache so that subsequent uses of the same symbols are much faster.
25 minutes is plenty enough time to investigate so I recorded another profile during this time – profiling the profile viewer. This meta-profile showed that the vast majority of the wpa.exe main-thread samples – 97% of them in the screenshot below – were in the main thread inside of msdia120.dll – which is different from the previous problem:
The first time I analyzed this I found that almost all of the msdia120.dll samples were in Mod1::TranslateFileId, called through SymCachePdb::LoadInlineeLines.
Tracking source file numbers for inlined functions is something that only happens if you enable enhanced debugging information (previously enabled with /d2Zi+, now with /Zo, discussed in more detail here), so the initial evidence suggested that it was Chrome’s use of /d2Zi+ that was causing its symbols to take so long to be processed.
Aside: if /Zo is the problem then that makes this whole performance problem particularly annoying. My tests show that WPA uses a mapping between addresses and functions, and nothing else. It doesn’t make use of line-number information. But see the comments for where to view the inlining information in WPA.
As I continued to monitor this issue, talking to some people at Microsoft about it, I found that the results were actually variable. On some Chrome builds the symbols loaded quite quickly, and on another build the time was spent almost entirely in these three msdia120.dll functions:
(normally I use the stack view but Module/Function is handy to see what module and function the CPU is actually spending its time in)
The lack of consistency is peculiar, but not critical in this case. Without the source code to WPA or msdia120.dll there’s not much point in theorizing unless it leads to a workaround.
Chrome’s PDB files are big – up to 775 million bytes.
Whether the slowdown is because of /d2Zi+ or just because Chrome’s PDBs are big doesn’t really matter. Either way, I thought that maybe making the PDBs smaller would reduce how long it took WPA to process them. The Windows debugger toolkit comes with pdbcopy.exe which can be used to strip private information (types, file names, line numbers, etc.) from PDBS so I decided to use that. And it worked.
I now have a Python script that does the following steps immediately after recording a trace:
- xperf -i trace.etl -a symcache -dbgid
- Scan the output from step 1 for references to chrome.dll and chrome_child.dll
- Use RetrieveSymbols.exe (source code here based on ideas discussed here) to download the symbols based on the guid, age, and pdbname recorded in step 2
- Use pdbcopy -p to copy the symbols to a temporary directory, stripping private information and making the PDB files ~85% smaller
- Set _NT_SYMBOL_PATH to point just to the temporary directory with the stripped symbols
- xperf -i trace.etl -symbols -a symcache -build
The invocation of the symcache action in step 1 gets xperf to dump information about all of the symbols referenced in the trace. That information gives us everything we need to retrieve the symbols from the Chrome symbol server.
(have I mentioned that I love symbol servers? Read Symbols the Microsoft Way for details)
The invocation of the symcache action in step 6 builds the .symcache files from the PDB files. During this step xperf can only see the stripped symbols and it processes them extremely quickly.
~25 minutes to process the Chrome symbols from within WPA
29 seconds to pdbcopy and process the Chrome symbols prior to running WPA
50 times faster. Works for me. This fix, configured to run automatically for Chrome’s symbols if Chrome developer is checked in the Settings dialog, can be found in https://github.com/google/UIforETW, which is an open-source tool for recording and managing traces.
Command-line processing of ETW traces
Xperf has a number of options for command-line processing of ETW trace files. If you run “xperf -help processing” you can get a list of them. These xperf actions are documented here but the symcache action is missing. The documentation is pretty light so the lack of symcache documentation doesn’t make much difference.
The xperf processing commands have mostly been made obsolete by wpaexporter (example usage shown here) but they are still occasionally useful. I think the symcache and heap actions are the only ones I ever use now.
If the profilers or msdia120.dll were open source then this would have made my investigation easier and would have let me create a true solution.
However, profiling these closed source tools still let me come up with a workaround and the time savings (especially across other Windows developers at Google) will quickly justify the effort.
I think that this slow symbol loading represents both a bug in WPA (requesting data that it appears to not use) and in msdia120.dll (taking too long to load data). I’ve contacted the WPT team and filed a Visual Studio bug. The bug has some sample traces attached to make seeing the problem even easier. If you are hitting this issue or other MSDIA slowdowns then consider commenting on or voting on the bug.
Update: some useful comments to this post explain how WPA uses this extra information so it’s not actually a WPA bug. But WPA should have an option to not use this information since it is usually useless and very expensive to obtain.
WPT ships with a fairly old copy of msdia120.dll (built June 16, 2013). I tried using the VS 2013 Update 3 version (built June 25, 2014) but it didn’t seem to make any difference.
Working around this bug would be easier if there was a tool to translate a PDB file to a .symcache file. In fact there is such a tool, called symcachegen, documented here but it was never actually released 😦
The .symcache files made from stripped Chrome PDBs are about half the size as those made from non-stripped PDBs. It has been suggested that there might be extra value in those larger .symcache files. However I haven’t seen any extra value so until I see it I’ll go with 50x faster and 50% smaller as a nice consolation prize.