After upgrading to Visual Studio 2013 I noticed that find-in-files had a problem when searching directories. The VS IDE would repeatedly hang, rendering it completely useless for the duration of the search. I filed a bug, complete with ETW traces and a detailed analysis of what was going on.
This is a success story, of sorts. Microsoft has now released Visual Studio 2013 Update 2 which ameliorates the problem. They seem to have changed the behavior so that the worker threads don’t start searching until the directory scan has finished, I think. So VS should now avoid a self-inflicted denial of service. However the VS main thread can still hang if anybody else (another copy of Visual Studio perhaps?) is hitting the disk. So the hangs still happen, at best slightly less frequently.
It may seem like I pick on Visual Studio a lot, as I’ve written quite a few blog posts pointing out its code-gen and performance problems. But it’s tough love really. I use Visual Studio for hours every day so I notice every little blip. And, I meticulously analyze and report many of these blips, in hopes of getting them fixed.
There have been hot debates about what to do when closed-source software has problems. The choices range from vague and impotent rants, to moving to open source tool chains, to precisely targeted bug reports in hopes of getting a fix.
I’m not a big fan of vagueness or impotence. While I’m using Linux more than ever before I’m still dependent on Visual Studio for much of my work. Therefore, I’m glad that I know how to analyze these performance problems. This lets me file useful bugs that are more likely to be addressed, and it sometimes lets me find workarounds. My time is too valuable to waste it on guessing at performance problems, in anybody’s software, and when these problems preferentially happen on my machine, then I want to rant potently. Plus, I enjoy these investigations, and I always learn something while doing them.
What’s going on
As I mentioned in my last post, I have ETW tracing running 24×7 so when my machine hangs I can easily record a trace.
(see Xperf Basics: Recording a Trace (the ultimate easy way) for how to use UIforETW for 24×7 tracing)
When analyzing a trace it’s important to be able to find the problem area. In the case of a UI hang you need to find exactly when the UI hung so that your investigation is focused on the actual problem area. Without knowing the precise (to the microsecond) location of the hang it can be difficult to distinguish between a thread that is idle because it has nothing to do and a thread that is failing to respond to the user. Again, see the previous post for details on the UI Delays graph. In this case WPA’s UI Delays graph showed exactly when Visual Studio was hanging. The well-formed question that I then investigated was:
Why is thread 16,020 of process devenv.exe (7104) failing to pump messages for 4.68 s starting at 120.36364531 s into the trace?
This was not the only hang. During the 28.7 s of data that I examined, Visual Studio was hung (on and off) for 18.425 s, but the 4.68 s hang was the longest.
Having chosen this particular hang I zoomed in on the specified time range.
With the thread Id from the UI Delays data I looked at the CPU Usage (Precise) data for that thread. Visual Studio has many threads, so even with a thread number it can be time consuming to find the thread of interest in the WPA tables. To simplify this task I started by sorting by NewProcess to find the right process, and then expanded that to show the NewThreadId column. I then sorted by that column to find the right thread and expanded that to show the NewThreadStack:
I could see that thread 16,020 was about 99% idle (49.7 ms CPU Usage over a 4.68 s time period) so it was time to find out what the thread was waiting on. So, I went back to sorting by one of the TimeSinceLast columns so that as I drilled down into the NewThreadStack I would be taken down the call stacks that represented the longest waits.
TimeSinceLast measures how long since a thread last ran. Not running for a while is not inherently a bad thing, but when a thread is supposed to be pumping messages then it is bad.
The two TimeSinceLast columns have different summarizing functions. One represents the sum of all TimeSinceLast values for all context switches summarize in that row whereas the other column represents the maximum TimeSinceLast value. Sort by whichever one you are interested in.
When you drill down into a call stack you may find that it is many levels deep. Knowing which parts of the stack to ignore and which parts to pay attention to is an art form. It’s particularly artsy when looking at call stacks that start out in somebody else’s code and go in-and-out of Windows code. There are, I think, two main tricks to extracting understanding from the call stacks shown by WPA.
The first trick is to sort by the column that you care about (typically one of the TimeSinceLast columns when looking at CPU Usage (Precise) data), drilling in to the top stack, and ignoring other call stacks. Many fascinating function names will pop up on secondary call stacks and, unless they show sufficiently large values on your data column of choice, you should ignore them. No matter how interesting those eight context switches with SCM::FDoIdleLoop on their call stack may look you have to ignore them, because 177.816 us is just not interesting when you’re on the trail of 4,643,814.815 us:
A busy Windows system can have tens of thousands of context switches per second and if you don’t restrict yourself to the ones that matter then you’ll go mad. Blinders are essential.
The second trick is to ignore most of the entries on the call stack. The function names on a call stack tell a story, but most of the names are superfluous adjectives and if you can learn to filter out the irrelevant ones then you’ll be much better off.
To the right we have one of the important wait-stacks from this Visual Studio hang:
This call stack – where the Visual Studio UI thread waited for more than a second as part of the 4.68 second hang – is 92 entries in total. However with a bit of opinionated culling I can reduce this call stack to the following nine entries.
- _RtlUserThreadStart – thread start, good sign, we have a full call stack
- devenv.exe!WinMain – looks like the main thread
- user32.dll!DispatchMessageW – handling some message I guess
- user32.dll!InternalCallWinProc – calling back to the Visual Studio window procedure
Now our flowery and verbose story is much simpler. Visual Studio’s main thread was handling a message and during handling of this message it enumerated one or more directories, and one of the calls to get the next entry went into the kernel where it waited (for over a second).
Even this can be simplified: Visual Studio’s main thread is hanging because it is doing directory enumeration on the main thread.
Since we know that this hang is correlated with find-in-files it seems reasonable to assume that the main thread is scanning the file system to decide what files to search. Indeed, a superficial analysis might end with “the problem is that Visual Studio is doing disk I/O on the main thread”.
But we can do a better analysis than that.
This initial analysis is superficial. While my source code is not on an SSD (I work on too many projects for the code to all fit) it is on a fairly fast disk. It is unreasonable for a single disk operation to take over a second to return, and these long delays are happening repeatedly, so something else is going on. It could be my disk failing, but that wouldn’t explain why my coworkers were having the same problem. Curious…
If you expand the Storage area in Graph Explorer then you can open up the Disk Usage graph. In this trace I immediately noticed that disk usage was pegged at 100%:
That initially sounds bad, but a moments thought suggests that this is normal and expected for a find-in-files operation – it should be disk I/O limited in most cases.
Disk Usage measures what portion of time the disk is busy. It is not calculated from some theoretical estimate of how many seeks can be done or how much data can be transferred per second. Instead it is equal to the percentage of the time when the work queue for the disk is not empty. At any given instant a disk is either 100% busy (servicing requests) or 0% busy (queue is empty). This is similar to CPU usage – at any give instant an individual core is either running code or not.
If we arrange the disk usage table to group by Process then we can see what processes are doing disk I/O. In this trace I saw that devenv, SavService, and System were doing the majority of disk I/O. devenv is expected. System is also expected because the system process does a lot of disk I/O on behalf of other processes. SavService is the anti-virus process on my machine, so it is possible that it is the problem. Grouping by IO Type showed that SavService was doing reads, and grouping by Path Tree shows that it was reading from the same directory as devenv and System so it is probably also doing disk I/O on behalf of devenv.
After a bit more arranging of columns I ended up with this view:
Disk 0 is my boot disk and it is essentially idle. Disk 1 is my data disk and it is being slammed by 931 disk operations in the 4.68 s region selected. A rather mediocre 9.9 MB of data is read during this time, which suggests that the poor mechanical read head is spending all of its time bouncing around.
The Disk Service Time – the sum of the time the disk spent servicing the individual requests – is 4.61 s, which matches up with the claim that the disk usage is around 100%. The IO Time, however, tells us something different. How can we have an IO Time of 37.5 s in a 4.68 s region?
IO Time represents how long IO takes from the point of view of a thread. If a single thread is accessing the disk then IO Time should be similar to Disk Service Time, perhaps with a bit of overhead added. However if N threads are all hammering the disk as hard as possible then typically we would expect each thread to have to wait for its disk I/O and for disk I/O from each of the other threads (assuming a perfectly fair round-robin), so IO Time would typically be N times the Disk Service Time.
Aside: if lots of threads are hammering the disk then it is possible that the disk will have to spend a higher percentage of its time seeking, so Disk Service Time per request may increase.
With that in mind we can now estimate that there are eight different threads reading from the disk. From the Path Tree column we can see that all of these threads are reading from the same directories, so this is not conflict from unrelated tasks, this appears to be Visual Studio parallelizing find-in-files.
To paraphrase the immortal Fred Brooks, “adding threads to a disk bound job makes it slower.” Or, at least, having lots of threads is not going to make it faster.
All questions in ETW trace analysis can be answered by adding or rearranging columns. If we want to confirm how many Visual Studio threads are hitting the disk we just need to add the Thread ID column and place it right after the Process column:
This shows us thirteen Visual Studio threads reading from the data disk. The main thread in this trace is 16,020 and, based purely on the count of disk operations, it appears to be doing one type of work, while farming out other types of work to num-procs worker threads (my machine has a twelve-thread CPU). The main thread appears to be spending virtually all of its time waiting on disk I/O, whereas the worker threads are typically spending about half of their time waiting on disk I/O, so presumably they are spending the rest of their time doing something else.
If a thread is doing asynchronous I/O then IO Time does not equate to waiting time, but I saw no signs of asynchronous I/O.
I briefly looked at the CPU Usage (Sampled) data to see what the other threads are doing but it showed few samples from the other I/O threads. So I looked at the CPU Usage (Precise) data to see what thread 9,964 (the thread with the least IO Time) was waiting on. It’s edited-down call stack looked something like this:
Ah hah. The presence of savonaccess.sys (anti-virus!) probably explains the discrepancy between my initial estimate of eight I/O threads and the reality of thirteen. The twelve worker threads are spending some of their time waiting on the disk, and some of their time waiting on the anti-virus driver. Could anti-virus be the culprit?
The anti-virus software might be slowing down the find-in-files but… there is actually still no evidence that it is. Despite the best efforts of savonaccess.sys to slow down disk I/O, the disk is still pegged. The ultimate limiting factor of the performance of find-in-files on my system is still the speed of the disk, as shown by the 100% disk Usage to read the files and directories to be searched.
Just because something shows up on the call stack doesn’t mean it is to blame. If my disk was faster or if I had fewer hardware threads then perhaps anti-virus would be the limiting factor, but in this case it seems clear that it is not.
When looking at a trace like this it is easy to get deep into the guts and start obsessing about the wrong things. Anti-virus! Must be bad! But even if anti-virus was slowing down find-in-files (it isn’t) it’s important to remember that find-in-files throughput is not even the purpose of this investigation!
The purpose of this investigation is to find out why Visual Studio hangs. And we’re almost there. I just need to add one more column…
The cryptic initalism IACB stands for Initiated After Completed Before. It is a measure of how many disk operations jumped the queue in front of the disk operation(s) on the current row. Disk firmware and disk drivers can reorder I/O requests in order to improve throughput, by minimizing head movement, or for whatever secret reasons they might have. This is a feature of modern disks that is also known as Native Command Queuing. Whatever the reason, the IACB column is a reminder that the disks on your system are not shared fairly, and sometimes you lose.
When we add the IACB column, group by it, and sort by it then we can see the problem, plain as day:
The highlighted row shows a read whose Disk Service Time was just 4.36 ms, but whose IO Time (the time from the request to the results) was a whopping 1.077 s. The IACB explains why – an impressive 288 other I/O requests jumped the queue in front of this hapless read operation! Presumably this improved total drive throughput, but at a high cost in latency – a classic tradeoff.
Since all of the disk I/O in this trace is coming from one source – find-in-files – we can finally identify the root cause of the problem:
This is the moment you’ve all been waiting for
When Visual Studio does a find-in-files it needs to build a collection of files to search, and it needs to search within those files. The searching within the files is done by worker threads, however the building of the collection of files to search is done on the main UI thread. This arrangement works fine when the search is scoped to Entire Solution or Current Project, but when it is scoped to a directory, with sub-folders included, then things can go awry.
If you have enough hardware threads, and if your disk decides to aggressively reorder disk operations, then the main thread can end up waiting arbitrarily long for some disk operations to return. These long delays happen because the worker threads may repeatedly jump the queue ahead of the main thread. Since the main thread I/O is synchronous this means that the main thread will hang for arbitrarily long times.
In short, the twelve worker threads were performing an aggressive denial-of-service attack on the main thread, and that is the ultimate cause of the UI hangs!
The initial analysis of this problem didn’t proceed in the straight line shown here, but even with all the false steps it was quite a quick process. Analyzing the trace took under an hour – a lot less time than writing this blog post.
- Don’t do disk I/O on the main thread – get one of the worker threads to do the directory scanning. Presumably this is the best option.
- Don’t let the worker threads start searching until the list of files to search has been built. This reduces the frequency of the hangs, but it doesn’t prevent them, especially if two copies of Visual Studio are searching simultaneously.
- Do asynchronous I/O on the main thread. This should also work, but you have to be careful to use asynchronous operations for every single I/O operation on the main thread or else the problem will still occur.
- Do all of the I/O on the main thread. This solution seems heretical, and I don’t recommend it, but it would be an improvement in most cases. If only one thread is doing disk I/O then the I/O operations should typically complete quickly so the main thread should normally not be hung for “too long”.
- Use fewer worker threads. This is more of a mitigation than a solution. If there were fewer threads hammering the disk then the main thread would probably be blocked less frequently. Even though this isn’t a solution to the hanging problem it should still be done because having twelve I/O threads makes no sense.
Point number five is important. It makes no sense to have the number of I/O threads be proportional to the number of hardware threads. If twelve I/O threads are ideal then there should be twelve on all computers, not just those with twelve hardware threads. It’s a bit tricky in this case because the I/O threads are also doing the actual searching of the file data – I saw “msenv.dll!CVsFinder::BoyerMooreScan” on the worker thread sampled data. But still. Twelve I/O threads?
It’s worth acknowledging that when the disk cache is fully populated (on subsequent searches) the performance characteristics will be very different. Then the search should be CPU limited and the twelve-way parallelism should give huge speedups. Optimizing for all cases is tricky.
There was yet another thread that appeared to be the owner of all of the I/O threads – it had msenv.dll!FindInFileEnumParallel on its call stack. I only mention this because this thread also had stackclr.dll!ThreadNative::SpinWait on its stack – about 66 ms of spinning over a hundred ms time period. Grrr. I hate busy waits.
Your mileage may vary
Many of my coworkers – perhaps all of them – have complained about this bug. But other machines – including my home machine with its clunky/slow laptop drive – never hit this problem, and apparently Microsoft developers had not hit it. Is it because my home machine only has eight hardware threads instead of twelve? Is it a difference in how the hard disk behaves? A difference in anti-virus software? I don’t know. The design of find-in-files is clearly vulnerable to this problem, and on some machines this vulnerability becomes manifest.
Update 2 to Visual Studio 2013 is supposed to fix this bug:
And indeed it hangs less frequently. With the fix I can now say that the Visual Studio 2013 find-in-files is better than it was for previous versions. Find-in-solution no longer returns duplicate results for files that are listed multiple times, and that is a huge convenience.
I enjoy finding performance problems in the software that I use, especially when I can get the problems fixed or find a workaround. If you enjoyed this article then you might enjoy these other articles about discoveries made using ETW/xperf:
- Transient 4 GB allocation by Western Digital driver – now fixed
- Visual Studio hang from Perforce paths in the output window – fixed in VS 2013
- mshtml in the VC++ compiler process – bug filed
- Overheating CPUs causing performance problems
- Visual Studio Single Step Slowdowns – disable Show Threads in Source
- Visual Studio Breakpoint Hangs – avoid excessive breakpoints
- PowerPoint slowdowns from too-large background images – use smaller images
- Windows Live Photo Gallery problems – and workarounds found
These articles give more background information to understand how to do this type of analysis.
- ETW Central – your one-stop shopping page for ETW information
- You Got Your Web Browser in my Compiler – brief discussion of the UI Delays graph
- Xperf Basics: Recording a Trace (the ultimate easy way)
- The Lost Xperf Documentation–CPU Usage (Precise)
- The Lost Xperf Documentation–Disk Usage
- In Praise of Idleness – an anti busy-wait manifest0
Reddit discussion is here.
The connect bug can be found here.