Profiling the profiler: working around a six minute xperf hang

Anytime my computer is a little bit slow I’m likely to record a trace and take a quick look. If I’m lucky I’ll find an easy workaround, in which case I’ve got a potential blog post and a smoother running computer. And, recording a trace to my uber-fast SSD only takes a few seconds, so the cost is minimal.

But, a few months ago I noticed that it was taking six minutes to record a simple trace, whether I used wprui or my batch files. This sort of delay really takes all the fun and spontaneity out of random performance investigations.

Spoiler alert: I profiled xperf and found a workaround

Posted in Investigative Reporting, Performance, Programming, xperf | 8 Comments

ETW Trace Compression (and xperf syntax refresher)

Despite extolling the virtues of wprui for recording ETW traces in earlier posts, I’ve actually returned to using xperf.exe in batch files to do most of my trace recording. It gives me more precise control over what is recorded, and where, and with Windows 8+ it has another advantage: trace compression!

As usual the trace compression feature is lightly documented so I’m going to explain it here, and while I’m at it I’ll explain a bit more about recording traces with xperf.

xperf syntax

Posted in Performance, Programming, xperf | 13 Comments

Knowing Where to Type ‘Zero’

Some code optimizations require complex data structures and thousands of lines of code. But, in a surprising number of cases, significant improvements can be made by simple changes – sometimes as simple as typing a single zero. It’s like the old story of the boilermaker who knows the right place to tap with his hammer – he sends an itemized bill for $0.50 for tapping the valve, and $999.50 for knowing where to tap.

Posted in Performance | 17 Comments

Home Network Printer Setup That Works

Whenever I add a network printer to one of my Windows computers at home I end up with a reference to a hard-coded IP address. That means that the next time my home router reboots and assigns a different IP address, I lose the ability to print. Having the printer configured to a hard-coded IP address is like browsing to a web site by its numeric IP address instead of by its name.

In order to ensure reliable printing for my family I have had to do some printer configuration jujitsu and I want to share my steps here, if only so that I’ll remember them next time.

Posted in Computers and Internet, Rants | 11 Comments

Hidden Costs of Memory Allocation

It’s important to understand the cost of memory allocations, but this cost can be surprisingly tricky to measure. It seems reasonable to measure this cost by wrapping calls to new[] and delete[] with timers. However, for large buffers these timers may miss over 99% of the true cost of these operations, and these hidden costs are larger than I had expected.

Further complicating these measurements, it turns out that some of the cost may be charged to another process and will therefore not show up in any timings that you might plausibly make.
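As a rough illustration of where some of those hidden costs can land, here is a minimal Python sketch. It is a stand-in, not the method from the post: it uses an anonymous `mmap` instead of `new[]`/`delete[]`, and the helper name `time_phases` is hypothetical. It times four phases separately: creating the mapping, the first write to each page (where the OS has to supply a zeroed page), a second write to each page, and releasing the mapping. Python's loop overhead inflates the per-page timings and the split between phases varies by OS, so compare the phases with each other rather than treating the numbers as real allocator costs. It also cannot see any cost that is charged to another process.

```python
import mmap
import time


def time_phases(size: int = 64 * 1024 * 1024) -> dict:
    """Hypothetical helper: time allocation, first touch, second touch and free separately.

    Uses an anonymous mmap as a stand-in for a large new[]/delete[] pair.
    """
    page = mmap.PAGESIZE
    t0 = time.perf_counter()
    buf = mmap.mmap(-1, size)  # anonymous mapping; pages are typically supplied on demand
    t1 = time.perf_counter()
    for offset in range(0, size, page):
        buf[offset] = 1  # first write to each page: may trigger a page fault and zeroing
    t2 = time.perf_counter()
    for offset in range(0, size, page):
        buf[offset] = 2  # second write: the same pages should already be mapped
    t3 = time.perf_counter()
    buf.close()  # release the mapping
    t4 = time.perf_counter()
    return {
        "allocate": t1 - t0,
        "first_touch": t2 - t1,
        "second_touch": t3 - t2,
        "free": t4 - t3,
    }


if __name__ == "__main__":
    for phase, seconds in time_phases().items():
        print(f"{phase:>12}: {seconds * 1000:.3f} ms")
```

If the first-touch phase turns out much slower than the second, that gap is the kind of cost that a timer wrapped around only the allocation call would never see.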

Posted in Investigative Reporting, Performance, xperf | 24 Comments

Slow Symbol Loading in Microsoft’s Profiler, Take Two

When I run into a problematically slow program I immediately reach for a profiler so that I can understand the problem and either fix it or work around it.

This guidance applies even when the slow program is a profiler.

And so it is that I ended up using Windows Performance Toolkit to profile Windows Performance Toolkit. Again. The good news is that once again I was able to learn enough about the problem to come up with a very effective workaround.

Posted in Investigative Reporting, xperf | 16 Comments

Intel Underestimates Error Bounds by 1.3 quintillion

Intel’s manuals for their x86/x64 processors clearly state that the fsin instruction (calculating the trigonometric sine) has a maximum error, in round-to-nearest mode, of one unit in the last place. This is not true. It’s not even close.

The worst-case error for the fsin instruction for small inputs is actually about 1.37 quintillion units in the last place, leaving fewer than four bits correct. For huge inputs it can be much worse, but I’m going to ignore that.
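To make "units in the last place" concrete, and to show how the approximation of pi used for argument reduction can wreck an otherwise trivial result, here is a hedged Python sketch. It is not Intel's algorithm. It assumes a simplified model in which sin(x) for x near pi is computed as pi_approx - x, done exactly and then rounded to double. The 66-bit value of pi is an assumption about the kind of approximation hardware might use; it is not stated in the excerpt above. The reference pi comes from Machin's formula, the helper names are hypothetical, and errors are measured in double-precision ULPs, so the numbers will not match the extended-precision figure quoted above.

```python
import math
from fractions import Fraction


def arctan_inv(n: int, terms: int = 40) -> Fraction:
    """arctan(1/n) from its Taylor series, summed exactly with Fractions."""
    total = Fraction(0)
    for k in range(terms):
        total += Fraction((-1) ** k, (2 * k + 1) * n ** (2 * k + 1))
    return total


# Reference pi from Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239).
PI = 16 * arctan_inv(5) - 4 * arctan_inv(239)


def sin_near_pi(x: float, pi_approx: Fraction) -> float:
    """Hypothetical model: sin(x) = sin(pi - x) ~= pi - x for x very close to pi.

    The subtraction is exact (Fraction arithmetic), so the only errors are the
    error in pi_approx and the final rounding to double.
    """
    return float(pi_approx - Fraction(x))


def ulp_error(computed: float, exact: Fraction) -> float:
    """|computed - exact| in units in the last place of the double nearest exact."""
    return float(abs(Fraction(computed) - exact) / Fraction(math.ulp(float(exact))))


x = math.pi  # the double closest to pi
exact = PI - Fraction(x)  # sin(x); the dropped cubic term (about 3e-49) is far below one ULP
# Assumption: pi truncated to 66 significant bits (2 integer bits + 64 fraction bits).
pi66 = Fraction(math.floor(PI * 2**64), 2**64)

print(ulp_error(sin_near_pi(x, PI), exact))  # at most 0.5: only the final rounding
print(ulp_error(sin_near_pi(x, pi66), exact))  # enormous: the error in pi66 dominates
print(sin_near_pi(x, Fraction(math.pi)))  # 0.0: a double-precision pi cancels x completely
```

Every step except the final rounding is exact, so any error the script reports is attributable to the pi approximation rather than to Python's own floating point.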

I was shocked when I discovered this. Both the fsin instruction and Intel’s documentation are hugely inaccurate, and the inaccurate documentation has led to poor decisions being made.

The great news is that when I shared an early version of this blog post with Intel they reacted quickly and the documentation is going to get fixed!

Posted in Floating Point, Investigative Reporting, Programming | 101 Comments