Despite extolling the virtues of wprui for recording ETW traces (here and here), I’ve actually returned to using xperf.exe in batch files to do most of my trace recording. It gives me more precise control over what is recorded, and where, and with Windows 8+ it has another advantage: trace compression!
As usual the trace compression feature is lightly documented so I’m going to explain it here, and while I’m at it I’ll explain a bit more about recording traces with xperf.
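As a taste of what such a batch file can look like, here is a minimal sketch. The provider keywords and stackwalk flags are illustrative choices, not a recommendation, and the -compress option on the merge step requires the Windows 8+ version of the toolkit:

```bat
rem Start kernel logging with sampled profiling and call stacks.
xperf -on PROC_THREAD+LOADER+PROFILE -stackwalk profile -f kernel.etl

rem ... run the scenario being profiled ...

rem Stop the kernel logger, then merge to the final trace with compression.
xperf -stop
xperf -merge kernel.etl trace.etl -compress
```

The final merged trace is what you load into WPA; the compressed form can be dramatically smaller than the raw .etl files.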
Some code optimizations require complex data structures and thousands of lines of code. But, in a surprising number of cases, significant improvements can be made by simple changes – sometimes as simple as typing a single zero. It’s like the old story of the boilermaker who knows the right place to tap with his hammer – he sends an itemized bill for $0.50 for tapping the valve, and $999.50 for knowing where to tap.
Whenever I add a network printer to one of my Windows computers at home I end up with a reference to a hard-coded IP address. That means that the next time my home router reboots and assigns a different IP address, I lose the ability to print. Having the printer configured to a hard-coded IP address is like browsing to 126.96.36.199 instead of www.google.com.
To ensure reliable printing for my family I have had to do some printer configuration jujitsu, and I want to share my steps here, if only so that I’ll remember them next time.
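The underlying distinction is just name resolution versus a fixed address. A tiny sketch of the difference, using Python’s standard socket module – "localhost" here stands in for a printer hostname and is an assumption, not part of the actual setup:

```python
import socket

def printer_address(hostname):
    # A name lookup returns whatever address the host currently has,
    # so a name-based configuration survives DHCP handing out a new IP.
    return socket.gethostbyname(hostname)

# "localhost" is a placeholder for a hypothetical printer hostname.
print(printer_address("localhost"))
```

A hard-coded IP skips this lookup entirely, which is exactly why it breaks when the router reassigns addresses.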
It’s important to understand the cost of memory allocations, but this cost can be surprisingly tricky to measure. It seems reasonable to measure this cost by wrapping calls to new and delete with timers. However, for large buffers these timers may miss over 99% of the true cost of these operations, and these hidden costs are larger than I had expected.
Further complicating these measurements, it turns out that some of the cost may be charged to another process and will therefore not show up in any timings that you might plausibly make.
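One way to see these hidden costs is to separate reserving address space from touching it. This Python sketch – using an anonymous mmap as a stand-in for a large allocation, which is an illustrative choice rather than the original measurement – times the mapping itself against the first write to every page, where the OS does the real work of demand-zero faulting:

```python
import ctypes
import mmap
import time

SIZE = 256 * 1024 * 1024  # 256 MB

# Creating the mapping is cheap: the OS hands back address space but
# typically commits no physical pages yet.
t0 = time.perf_counter()
buf = mmap.mmap(-1, SIZE)
map_s = time.perf_counter() - t0

# Writing every byte forces demand-zero page faults; this is the cost
# that a timer wrapped around the allocation call never sees.
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
t0 = time.perf_counter()
ctypes.memset(addr, 1, SIZE)
touch_s = time.perf_counter() - t0

print(f"map: {map_s * 1e3:.3f} ms, first touch: {touch_s * 1e3:.3f} ms")
```

On Windows, part of the remaining cost – zeroing freed pages – is done by the system’s zero-page thread, which is one way the expense ends up charged to another process.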
When I run into a problematically slow program I immediately reach for a profiler so that I can understand the problem and either fix it or work around it.
This guidance applies even when the slow program is a profiler.
And so it is that I ended up using Windows Performance Toolkit to profile Windows Performance Toolkit. Again. The good news is that once again I was able to learn enough about the problem to come up with a very effective workaround.
Intel’s manuals for their x86/x64 processor clearly state that the fsin instruction (calculating the trigonometric sine) has a maximum error, in round-to-nearest mode, of one unit in the last place. This is not true. It’s not even close.
The worst-case error for the fsin instruction for small inputs is actually about 1.37 quintillion units in the last place, leaving fewer than four bits correct. For huge inputs it can be much worse, but I’m going to ignore that.
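To make “units in the last place” concrete, here is a small Python sketch – not the fsin algorithm itself – that counts the representable doubles between two values by reinterpreting their bits as ordered integers. A short Taylor polynomial stands in for an approximation whose ULP error grows with the input:

```python
import math
import struct

def ordered(x):
    # Map a double to an integer such that adjacent representable
    # doubles map to adjacent integers (sign-magnitude to linear).
    u = struct.unpack("<Q", struct.pack("<d", x))[0]
    return u if u < 1 << 63 else (1 << 63) - u

def ulp_distance(a, b):
    # Number of representable doubles between a and b.
    return abs(ordered(a) - ordered(b))

def taylor_sin(x):
    # A short Taylor polynomial: accurate near zero, increasingly
    # wrong as |x| grows, loosely analogous to fsin's degradation.
    return x - x**3 / 6 + x**5 / 120 - x**7 / 5040

for x in (0.1, 1.0, 3.0):
    err = ulp_distance(taylor_sin(x), math.sin(x))
    print(f"x={x}: {err} ULPs of error")
```

An error of one ULP means the result is the representable double adjacent to the correctly rounded one; 1.37 quintillion ULPs means almost all of the bits are wrong.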
I was shocked when I discovered this. Both the fsin instruction and Intel’s documentation are hugely inaccurate, and the inaccurate documentation has led to poor decisions being made.
The great news is that when I shared an early version of this blog post with Intel they reacted quickly and the documentation is going to get fixed!