About

PXL_20240206_191720608.PORTRAIT (1) This is the blog of Bruce Dawson. I’m a programmer working at Google (great company!) on Chrome for Windows (awesome browser) and hacking around at home. I started this blog while I was working for Valve and continued it at Google. Prior to that I worked for Microsoft where I received excellent training on performance, debugging, security, and reliability. Prior to that… various other companies. This blog tends to include a random assortment of programming tidbits that I find interesting, information about unicycling, rants about Windows Live Photo Gallery, and occasional drink recipes.

The opinions stated here are my own, not necessarily those of my employer.

You can reach me at brucedawson at gmail.com.

93 Responses to About

Aaron Roberts says:

March 2, 2012 at 11:18 am

Bruce (I believe that’s your name), I’d wanted to get ahold of you and see if you have any insight into effective software standards. While there are tons of books and articles on things people should do, I haven’t see case studies or post-mortems where a team’s coding standards were examined to determine how useful they were as part of the development process. For example, a formal coding standard of 137 pages, detailing naming conventions, bracketing, etc may be theorectically great, but if developers can’t digest the whole thing, its probably going to end up unused. In contrast, a one page synopsis and 10 page set of examples, may be too thin for teams. I’d love to contact you directly and hear your thoughts.

Reply
- brucedawson says:
  
  December 12, 2012 at 11:32 pm
  
  Huh — I thought I’d replied to this, but I guess not.
  
  It’s good to have some basic standardization for how code should be laid out — variable naming conventions, spacing, parentheses, etc. 137 pages is too much, but 5-10 is well worth it.
  
  Beyond that I suspect that code reviews are the best way to ensure both quality and consistency.
  
  Reply
Malini Kothapalli says:

April 4, 2013 at 11:43 am

Or you could let your development environment help you format your code. If you use an IDE for your day to day coding, you may find that it can do a lot of that stuff for you. In my case, I have setup Eclipse to do most of that stuff for me.

Reply
- Malini Kothapalli says:
  
  April 4, 2013 at 11:49 am
  
  I couldn’t edit my earlier post, so I am replying to my own post. I wanted to make it clear that IDE can not only be used to auto format the code, but it can also help you follow a naming convention for your constants, variables, class names, class files, class header files, etc.
  
  Reply
Zeke Odins-Lucas says:

August 12, 2013 at 3:58 pm

hey, bruce! nice blog. a coworker pointed it to me, and as I was reading it, I thought, he sounds familiar… – Zeke

Reply
Sarkie says:

October 21, 2013 at 3:55 am

Random Question: Why do you have the O2 Area on your blog image ?

Reply
- brucedawson says:
  
  October 21, 2013 at 8:30 am
  
  Random Answer: I lived in London for a year, took the picture on a flight in to London, I liked the picture, and I was able to edit it to the necessary aspect ratio. Also, I’m too artistically lazy to bother reconsidering this choice.
  
  Reply
  - Sarkie says:
    
    October 21, 2013 at 9:23 am
    
    Random Reply: I didn’t really notice it till it didn’t load, assumed it was a default picture, so thought I’d ask. Brilliant picture taking into the account you were on a flight.
    
    Reply
GTHK says:

June 1, 2014 at 7:28 pm

RSS in thunderbird plus this blog results in the same posts reappearing multiple times, I have five copies of everything now. 😦

Reply
- brucedawson says:
  
  June 1, 2014 at 10:50 pm
  
  Five copies? Wow. I’ve got two copies of my blog in my RSS feed in Outlook, and two copies of a couple of other blogs. I don’t know what causes it. WordPress bug?
  
  Reply
John says:

December 2, 2014 at 10:05 am

Hi Bruce, I’ve enjoyed your ETW blog entries and training videos and shared them with my co-workers — thanks!

Question: do you know of a way to aggregate WPR collected function weights from different stacks, e.g. to identify critical low level functions that are called by different call paths (like malloc and free)? For example KCacheGrind allows you to sort by function weight and call count to easily identify the aggregated weight of low level functions called from different call paths and also shows a nice call graph which can highlight this (e.g. see http://kcachegrind.sourceforge.net/html/Screenshots.html). Is there any way to export the WPR collected data into Excel or some other format that could maybe then be translated so that KCacheGrind or gprof2dot/graphviz could highlight aggregated hot spots. Supposedly this used to be possible (see http://stackoverflow.com/questions/4394606/beyond-stack-sampling-c-profilers/4453999#4453999) but I can’t figure out how to do the necessary CSV export from WPR. Thanks for your time.

Reply
- brucedawson says:
  
  December 2, 2014 at 11:48 pm
  
  I covered exporting of CPU sampled data to text format in this blog post:
  
  Summarizing Xperf CPU Usage with Flame Graphs
  
  It’s definitely tricky, and it was only after I wrote the post that somebody gave me the hints needed to perfect it.
  
  I usually find that the table view (grouping by stack, or by module, function and address) is sufficient (together with grouping by process, thread ID, or whatever else seems appropriate). The butterfly view (show all stacks leading to a particular function) is also helpful. Therefore I rarely export the data. I find dynamically exploring it in WPA suits most of my needs.
  
  Reply
Matthew says:

February 19, 2015 at 10:32 pm

Hi Bruce! Thanks for the blog and also for your videos on wintellect. I watched them all and lean’t a great deal about WPR and WPA. I can now use it to track memory leaks hotspots and long waits, as well as measure slow frames.

From your blog I get the impression you enjoy investigating strange performance problems, and are knowledgeable about general system performance, so you may be able to work this one out.

http://stackoverflow.com/questions/28579750/files-loading-slower-on-second-run-of-application-with-repro-code

Testing if files exist gets slower after the first run of a program, and remains slow until the folder the file is in is renamed. I’ve tried using my ETW skills on this one but am drawing a blank. I suspect its something to do with NTFS however can’t be sure. Enjoy!

Reply
- brucedawson says:
  
  February 20, 2015 at 9:14 am
  
  I do like puzzles. I just posted a comment on the question which I will reproduce here:
  
  Consider uploading an ETW trace so that people can investigate without having to run the repro code. This also serves as an archive of how it was behaving, and will include many pertinent details such as amount of memory, type of disk, OS version, etc
  
  Reply
steve says:

September 9, 2015 at 1:14 pm

how do you say the word “ghoti?

Reply
- brucedawson says:
  
  September 9, 2015 at 6:17 pm
  
  You pronounced it “ghoti”. Using the phonetic alphabet that would be /fiSH/.
  
  Reply
Saleel Kudchadker says:

March 16, 2016 at 6:48 pm

Your ETW series is really amazing! Thanks very much, The tool is very methodical and can be used to debug almost anything. I guess it a matter of time to get more familiar with it. I only use GPUView so far but its probably inspired by ETW\Xperf.

Reply
- brucedawson says:
  
  March 16, 2016 at 11:28 pm
  
  I’m glad you like it. I have never mastered GPUView – you should write about how to use it, since I know it does a few things that WPA can’t.
  
  Reply
  - ejsalil says:
    
    March 17, 2016 at 10:57 am
    
    Yes. I’ll get on one soon. Its probably good to organize thoughts and write something helpful for everyone!
    
    Reply
Jim P says:

April 15, 2016 at 9:00 am

What are your thoughts on Rust?

Reply
- brucedawson says:
  
  April 15, 2016 at 9:05 am
  
  I know virtually nothing about it, so my thoughts are {}
  
  Reply
Milian Wolff says:

May 19, 2016 at 11:07 am

Hey Bruce,

I could not find your mail address, so I hope putting this down as a comment here is OK.

First up, thanks a lot for your blog posts on xperf and WPA – much appreciated. I have some questions on the latter, which you may help me with:

The tools I’ve used so far, most notably perf and VTune, give you different “visualizations” for call stack data associated to e.g. CPU samples. WPA, as I see it, only offers me the top-down view. Can I somehow view the data in a bottom-up manner? Is there maybe also a caller/callee view, i.e. some way to get a flat list of symbols in a process with the self and inclusive cost?

Alternatively, is there some trick to handle deep call stacks? In Visual Studio e.g. I can aggregate call stacks if they don’t introduce branches and do not differ in their sample cost from the parent symbol. Right now, I’m always getting mad at WPA for forcing me to click dozens of times to expand a call stack until I find the actual interesting point in my application…

Also, do you have contacts to people actually working on WPA? I think it would be a good idea for them to add a flame graph visualization as well. It is my current favorite way to visualize the output of perf e.g.

Then, I wonder whether there are some tricks for application developers. I see the value in analyzing the full picture of the system, as many times it shows the odd interactions between processes that one never would have seen otherwise. That said, sometimes I only want to look at my application and nothing else. Is there an easy way to filter all the visualizations in WPA on a certain executable e.g.? I have found ways to filter individual views, but it’s cumbersome to repeat that step for every view.

Finally, I wonder about the custom xperf events. Is this the recommended way on Windows to add static trace points, and should frameworks (like Qt e.g.) ship with xperf events? If you have knowledge about Linux or Solaris/Mac systems, do these events compare to Systemtap or DTrace static trace points? Is there maybe some good documentation on custom xperf events that also tells me more about the overhead of these events, and whether they should be shipped in e.g. release builds or only compiled in on demand?

Thanks a lot again, hope to learn some more tricks from you!

Reply
- brucedawson says:
  
  May 19, 2016 at 7:10 pm
  
  I hear that the next version of WPA may include flame graphs. I have not seen them but I am hopeful.
  
  Yes, it is possible to view caller/callee data on any call stack. Right click on a stack entry and select View Callers or View Callees from the context menu. This is covered in my WPA training videos. You can also change from viewing the call stack to viewing samples by process/module/function – the Randomascii Exclusive (module and function) view preset gives you that. Different views expose different information. You should also get used to fearlessly rearranging (and adding/removing) WPA columns. All questions can be answered by rearranging columns and changing the sort key.
  
  You also don’t have to click to expand stacks. Just choose the appropriate sort key and keep pressing the right arrow key – much faster.
  
  Some people like to filter graphs to a particular process. I usually don’t bother. The noise doesn’t really bother me – I just look at the areas of interest.
  
  I recommend shipping with custom ETW events built in. The ETWProviders*.dll exist for this purpose and the events are very low overhead. A few thousand per second is totally reasonable in release code. Microsoft ships Windows/IE/Edge with *tons* of these events.
  
  Reply
  - Milian Wolff says:
    
    August 31, 2016 at 8:27 am
    
    Thanks a lot for your replies!
    
    I just tested the latest update on W10, and it now sports a basic flame graph view – awesome! Much easier that way.
    
    Regarding top-down/bottom-up call stacks: Doing it via the context menu means I first have to drill down to select a function, then select to see its callers. What I’m missing is a configuration on the Stack column to set the direction. I.e. right now it’s top-down. I want it to be bottom-up. VTune makes the difference (and value!) of both versions quite apparent. This does not seem to be possible with WPA, or am I simply misusing the context menu?
    
    Also, is there a way to get file + line numbers for symbols in the stack view?
    
    Thanks again.
    
    Reply
    - brucedawson says:
      
      August 31, 2016 at 9:06 am
      
      I had forgotten that a flame graph view had been added – CPU Usage (Sampled) flame by Process, Stack, or configure as appropriate. Thanks for pointing that out!
      
      The way that stacks are collapsed in WPA makes reversed stacks impractical (illogical even) without first selecting a point to reverse them from (as with the filtering by callers to a particular function). I’m not sure how VTune does it so I can’t really compare.
      
      I have asked for file+line numbers (and source server support) but no dice. You’ll notice there are no file or line number columns available, and the .symcache files omit that information. Maybe the Windows 11 version, but I doubt it.
      
      Reply
Marc Chiarini says:

August 29, 2016 at 1:12 am

Hi Bruce, wonderful insights all over this blog! I had a quick question I hope you can answer. Is it possible to use WPA to analyze L1/L2/L3 CPU cache statistics? If not, have you needed to do it with some other tool? I have a suspicion that my multi-threaded program might be suffering from false sharing…an increase in threads (and consequently, number of cores used) results in non-linear performance degradation. I’m moderately certain that it’s not caused by locks. Anyway, any insight you could offer would be much appreciated.

Reply
- brucedawson says:
  
  August 29, 2016 at 8:54 am
  
  I keep hearing rumors that CPU performance counters will be made available in ETW, but so far, no dice.
  
  What I usually do is profile my code on Linux and use perf to monitor CPU performance counters. In most cases the results should be applicable on Windows.
  
  You should at least be able to use ETW to see if locks (spinning in them or waiting on them) are the problem.
  
  Reply
Tomer Ben Arye says:

September 13, 2016 at 1:37 pm

Hi Bruce.
I tried to tweet you with small question.
We are having the call for DX9 present() delayed randomly ( 1 sec video stutter)
I used your UIforETW but a software is blocking user keystrokes.
Even if we wrote a program that sends those keys , the computer blocks it.
What is our alternative to this key combo?

( going to study your second video course – thanks for that! )

Reply
- brucedawson says:
  
  September 13, 2016 at 11:32 pm
  
  Because UIforETW is elevated you probably can’t send keystrokes or other messages to it except from other elevated processes. You should probably hack the source to add the functionality you want.
  
  I implemented the type of code you are talking about – programmatically detecting a slowdown and recording the trace – at a previous job, but I don’t have the code anymore. Let me know if you come up with a reusable solution and maybe it can be rolled into the main release.
  
  Reply
Peter N Gregory says:

October 8, 2016 at 8:13 am

Hi Bruce Dawson,

I love the stuff about unicycles. Tried to get hold of Greg Harper about his small sun & planet geared hub. However, Greg appears to have retired now and they don’t seem to have a forwarding address for him at Washington University.
Do you know if Greg’s 1:1 or 1.5:1 ratio hubs made it into production, please? I’m emailing from Olde England and we’re not all that clued up about such things yet.
Thanking you in anticipation of your reply, Peter N Gregory

Reply
- brucedawson says:
  
  October 8, 2016 at 10:36 am
  
  Greg never commercialized his hub, but Schlumpf did, and the Schlumpf version can be shifted on the fly – pretty magical. See this blog post for details:
  
  Riding Faster on One Wheel: Geared Unicycles
  
  Or do some searches on Schlumpf unicycles to find things like this:
  
  http://krisholm.com/en/gear/component/kh-schlumpf
  
  Reply
  - Peter Gregory says:
    
    November 8, 2016 at 10:40 am
    
    Sorry if this message is popping up in the wrong place, Bruce Dawson. I’m not that au fait with your usual means of communication.
    
    I have an iterative algorithm to estimate the length of side of a square equivalent in area to a circle of given size. I wrote it by hand onto an A4 size piece of paper.
    
    Would this be of any interest to you?
    
    Kind regards, Peter Gregory
    
    Reply
    - brucedawson says:
      
      November 8, 2016 at 11:31 am
      
      Squaring the circle? That sounds in keeping with my article from many years ago here:
      
      http://www.cygnus-software.com/misc/pidigits.htm
      
      Unfortunately that site doesn’t allow commenting, so feel free to reply here with a link to a scan.
      
      Reply
Peter N Gregory says:

October 9, 2016 at 9:34 am

Thanks for your message, Bruce. In the event, my message reached Greg via his retirement email address at Washington University.

For a single speed, fixed-wheel version, Greg used gears from QTC (Quality Transmission Components) at
http://qtcgears.com/

I will try browsing their catalogue during the week.

Kind regards, Peter

Reply
Andrew says:

November 17, 2016 at 1:25 pm

Enjoyed and learned a lot from your posts for ETW. Thanks a lot!
Have a question, I want to write a program to run xperf to capture certain OS & driver events through a Fast Boot cycle, a.k.a. Fast Startup in WPR/ADK (not full cold boot), including both shutdown and resume phases. However in Fast Boot, all the user processes are terminated so how would I be able to do it? Clearly WPR and ADK have that capability but I can’t directly use them for other reasons (plus I want to know how!). Much appreciated if you can give me any suggestion 🙂

Reply
- brucedawson says:
  
  November 17, 2016 at 4:44 pm
  
  Look at bin\etwrecord.bat. You’ll have to split it into two parts, one to run before boot and one to run after. That might work, but probably not.
  
  Or, you might need to use xbootmgr. Unfortunately I have no recent experience with it, but there are many examples on the web.
  
  Why can’t you use wpr? It’s not my favorite tool but I do use it sometimes.
  
  Reply
Giovanni Ciavarelli says:

May 29, 2017 at 10:26 am

Hi Bruce! Are you the same Bruce Dawson of “The Duel” ? 😀 I started a very small “Dos Memories” blog and of course …. If It’s you, would you please reply a couple question I’d like to post? 😀 Thanks! 😀
P.s.
Oh yeah! I’ve been shameless! 😀

Reply
- brucedawson says:
  
  May 29, 2017 at 8:44 pm
  
  Test Drive II: The Duel? Yeah, I worked on the Amiga version of that, a long time ago.
  
  Reply
  - Reini says:
    
    August 3, 2022 at 12:39 pm
    
    Test Drive absolutely deserves a remake with the same advanced engine and with lots of more cars and tracks. Even on the Amiga improvents are easely possible with all those CF SD harddrives and Ram extensions (ACA500…) I enjoy it still today. https://photos.app.goo.gl/4VtbdnkGtLEfdweS9
    
    Reply
    - Reini says:
      
      August 3, 2022 at 1:11 pm
      
      improvements… meant with all the better hardware today or with emulation… of course I can imagine the programming on old engine and in the emulation would be big challenge.
      
      Reply
ChenA says:

July 18, 2017 at 8:20 pm

Hi, Bruce. i have some question to consult you, i failed to start and open a user mode realtime heap trace, EnableTraceEx2 failed when run on win7 64bit, return 1168, success on win10.
detail code is on https://github.com/chena1982/MemCheck/blob/master/ETWTraceSession.cpp line 112.
i search on the internet, don’t find any document about this, do you have any suggestion?
Thanks.

Reply
- brucedawson says:
  
  July 18, 2017 at 8:22 pm
  
  Sorry, no idea. I haven’t used the ETW APIs very much – I generally just shell out to xperf and let it deal with them.
  
  Reply
Selman Genc says:

July 24, 2017 at 7:43 am

Hi Bruce, I’m working with ETW and I want to display logged events in a different graph than the standard Generic Events graph. I guess I need to create a custom wpaProfile but I haven’t found any documentation about it, about the entries in the xml file. Do you know any documentation or something that can help me with this? Thanks.

Reply
- brucedawson says:
  
  July 24, 2017 at 9:56 pm
  
  I keep meaning to blog about this…
  
  If you use UIforETW and you use the supplied startup profile (go to Settings and click Copy Startup Profile) then you will get multiple custom views for the generic events graph. These use different filters (only showing different providers) and different graphing types. Open up the View Editor and explore, and then use Manager Presets to save your custom settings as a new preset.
  
  Reply
  - Selman Genc says:
    
    August 14, 2017 at 12:42 am
    
    Yes, I have read the blog posts, they were really helpful. I have one question, when I log values increasing randomly like 54,128,251, one value per second, and view them in WPA, I set the value column’s aggregation mode to Sum but there is a zigzag in the graph, like this: http://imgur.com/GyXPj3H. Is that normal, if not how can I get rid of it? I think it’s because I’m logging one event per second but the timeline’s sensitivity is in microseconds so there is a gap where no value is logged and that causes the zigzag. But I wanted to make sure 🙂
    
    Reply
    - brucedawson says:
      
      August 14, 2017 at 9:01 am
      
      It looks like aliasing between the display resolution and the data resolution. Try zooming in and out. Or, try changing the graph type – the selector is to the right of “Provider, Task, Opcode”
      
      Reply
      - Selman Genc says:
        
        November 24, 2017 at 12:42 am
        
        Hi again, I have another question 🙂 This doesn’t look like a resolution problem, even if I zoom in the zigzag is still there. What I want to achieve is: I’m logging increasing values like 100, 130, 150 and when I look at the graph I want to see what value I logged in a particular time. And the graph should look like a rising line instead of zigzag. I’m logging same values multiple times per second, in that case I want to see only one value for that particular second, not the sum or the average of them. I have tried different settings in WPA but couldn’t figure this out. This is what the logged values look like in PerfView:
        
        Reply
        
        brucedawson says:
        
        November 24, 2017 at 9:25 am
        
        I don’t know, but you don’t say what you are actually seeing now. You should share (on Google drive perhaps) a short trace, a .wpaProfile file showing what you tried, and a screenshot showing what it looked like – then maybe somebody will have some ideas.
        
        Reply
      - Selman Genc says:
        
        November 28, 2017 at 1:45 am
        
        You are right, sorry, I have tried different options in WPA, when I set the value field’s aggregation mod false and select this option in Graph Configuration:
        
        I get this:
        
        If I try to change the option to “Cumulative over time”, I get a graph exactly how I want, but the value that’s shown is the sum of all values, instead of a particular value I logged in that particular time:
        
        Do I need to change the way how I log the values? I guess I need to log a value just once per second if I want to see only that value there.
        
        Reply
        
        brucedawson says:
        
        November 28, 2017 at 8:23 am
        
        Without a trace, and a .wpaProfile file containing some of your attempts, it is very difficult for anyone to help.
        
        Reply
      - Selman Genc says:
        
        December 11, 2017 at 10:35 pm
        
        I have uploaded the trace file and the profile to google drive if you want to take a look:
        
        https://drive.google.com/open?id=1HQvkTRtsJ8Xbb0jkjoUgO7mg-q2Re4wO
        
        Reply
        
        brucedawson says:
        
        December 11, 2017 at 10:39 pm
        
        It says I don’t have permission to view the files. You should make them readable by all if you want help.
        
        Reply
        
        brucedawson says:
        
        December 15, 2017 at 10:05 pm
        
        I’m not sure what you’re trying to do or what you are expecting. You appear to be logging values from dozens of different task names under the NetsparkerEventSource. The graphing column (the one to the right of the blue bar is set to ‘sum’ which means that in the table you are getting the sums of the values for each type of event – I’m not sure what that is for.
        
        I agree that WPA seems to be interpreting the sum oddly when it graphs it, but I think you’re asking it to do something meaningless. The sum of all of the different event types? I guess?
        
        I suspect you want to filter down to a particular Task Name (select it, right-click, filter to selection) and you might want to View Editor, Advanced, Graph Configuration and look at the options there. You can also create presets that embed filtering – that’s what the Generic Events views in my default profile do – see the Filter settings.
        
        Reply
      - Selman Genc says:
        
        December 13, 2017 at 5:35 am
        
        Oh, sorry, my bad. Permission granted.
        
        https://drive.google.com/file/d/1HQvkTRtsJ8Xbb0jkjoUgO7mg-q2Re4wO/view?usp=sharing
        
        Reply
      - Selman Genc says:
        
        December 13, 2017 at 5:37 am
        
        Btw I think the notify system is not working for comments, I don’t get any notification via email when you reply even though I select the box, fyi.
        
        Reply
Marek says:

July 27, 2017 at 2:50 pm

Hello, i liked your posts about ETW .. Just now i’ve started to have a short (second or two) hiccups of chrome after upgrading to AMD driver ver 17.7.2 and immediately thought about using ETW to look for culprit but i wasn’t successful .. Don’t know how to reach to AMD guys soo… Here is ETW Trace https://mega.nz/#!RtpjSSZT!S4vXTI5b-f7Ss9v8zcOCEO1Ti33IKooElTMyQN-L-30 .. If you’ll have time and interest to look at that i’ll be happy 🙂 for now i’ll just rollback to older drivers. Thanks

Reply
- Marek says:
  
  July 27, 2017 at 2:57 pm
  
  Oh and BTW: you can see the hiccup in UI Delays
  
  Reply
  - brucedawson says:
    
    July 27, 2017 at 9:18 pm
    
    Since it’s a Chrome hang I took a look. The analysis was fairly straightforward although it is not clear *why* closesocket took five seconds to return. I filed crbug.com/749946. Please make additional comments there. In particular:
    1) Wired network?
    2) Did rolling back the drivers help?
    3) Anything else we should know, given that it is a networking hang caused by a function that should always return quickly failing to?
    
    Reply
Pavel Celba says:

August 4, 2017 at 9:27 am

This may interest you, it’s computation error problem: https://www.quora.com/unanswered/C++-double-End-integer-N-N-1-Now-compute-step-number-to-increment-double-S-End-N-1-What%E2%80%99s-the-double-number-C-so-that-S-+-C-*-N-1-End-and-S-+-C-*-N-1-is-as-close-to-End-number-as-possible

Reply
Jeff Stokes says:

February 15, 2018 at 7:20 am

Bruce, have you heard of time travel inversion being a thing in Server 2012R2 or 2016? Where ETW won’t open even with a /tti switch on WPA?

Reply
- brucedawson says:
  
  February 15, 2018 at 8:20 pm
  
  I haven’t. It’s more likely a hardware problem than an OS issue. I’ve always had good luck with recording traces, on multiple OSes.
  
  Reply
Raymond says:

March 16, 2018 at 12:57 pm

You wouldn’t be interested in written a decent tutorial of basic ETW & xperf? I’ve really enjoyed your blogs in the past and curious on your tips&tricks.

Reply
- brucedawson says:
  
  March 16, 2018 at 3:53 pm
  
  You can find a link to my ETW & xperf tutorials (and documentation) here:
  
  ETW Central
  
  (or use this more memorable short link: https://tinyurl.com/etwcentral)
  
  Reply
altiano says:

October 16, 2018 at 3:14 am

Hi Bruce, what book (or online articles) would you recommend me to understand windows internals for beginner?

Reply
- brucedawson says:
  
  October 16, 2018 at 9:13 am
  
  I’m not sure that “Windows internals” and “beginners” belong in the same sentence, but “Windows 7 Internals: Part 1” is an excellent reference. Once you have enough of a base of understanding you can learn a lot through careful examination of ETW traces – they show you much of what the OS is *really* doing.
  
  Reply
  - altiano says:
    
    October 16, 2018 at 7:19 pm
    
    After searching on Internet, I got : Windows Internals 7th Edition Part 1.
    Seems like an easy to read for me that there is a introduction part on the first chapters explaining the concept.
    
    Reply
    - Jeff Stokes says:
      
      October 16, 2018 at 7:24 pm
      
      Sorry, if I may interject, this may be what you are looking for.
      
      Apologies for intruding.
      
      Reply
Richard Harvey says:

January 10, 2019 at 9:01 pm

Hello Bruce,
I’m a researcher in climate science/climate change where we build and run huge 3D numerical models of the atmosphere, ocean and land surface in order to understand and project what the (warmer) climate will plausibly be like in the coming century. These codes are almost exclusively written in Fortran 77/90/95, include hundreds of thousands of lines, and obviously use floating-point math to solve the model’s differential equations. Up until recently all models have been using 64-bit precision as a standard, but for the past 10 years or so more and more models are starting to use 32-bit precision to take advantage of its speed and storage gains. In a recent paper of mine (Harvey & Verseghy, 2015) I explored some of the negative consequences of using single precision (your C floats) when dealing with the very slow conduction of heat in deep soil in a so-called “land surface model”. On the other hand, in Dawson et al. (2017) my paper was carried forward and suggested a simple way to use double precision for soil temperature while keeping the rest of the model at single precision, thus having the best of both worlds. The question I have is : did you delve into scientific computing at all in your forays of floating point math? Do you have any insight/advice to share that are targeted specifically to this field? Could you write something about this in one of your future posts?

Best regards,
Richard Harvey

Refs:

Harvey R, Verseghy DL (2016) The reliability of single precision computations in the simulation of deep soil heat diffusion in a land surface model. Clim Dyn 46:3865–3882. https://doi.org/10.1007/
s00382-015-2809-5

Dawson, A., P. D. Düben, D. A. MacLeod, & T. N. Palmer, 2017: Reliable low precision simulations in land surface models, Climate Dynamics (2018) 51:2657–2666. https://doi.org/10.1007/s00382-017-4034-x

Reply
- brucedawson says:
  
  January 10, 2019 at 10:16 pm
  
  I hope that the stability of your algorithms is well understood, since 32-bit precision doesn’t leave a lot of room for error. Unfortunately I haven’t dealt with scientific computing. I’ve seen what can go wrong in floating-point math in games, and readers have shared their conundrums, but I feel like I’m mostly a purveyor of anecdotes.
  Dawson et al. – huh.
  
  Reply
  - Richard Harvey says:
    
    January 11, 2019 at 10:23 am
    
    Thanks for you reply Bruce. It is unfortunate indeed, because floating point errors are not a priority in general in scientific computing, and I feel they should be more now thanks to the recent move to 32-bit models, and someone like you but also familiar with scientific code could be very useful!
    
    Reply
  - Richard Harvey says:
    
    January 11, 2019 at 10:28 am
    
    …however do note that our codes contain many, many floating point comparisons, e.g., if some temperature is below/above freezing (0.0) and nearly all of those are done like IF (TEMP .LT. 0.0) THEN… instead of some epsilon. I would be very interested in knowing whether some of our results are plainly wrong, or whether this is just an academic question.
    
    Reply
    - brucedawson says:
      
      January 12, 2019 at 11:33 am
      
      Some compilers warn about floating-point equality comparisons such as “if (f1 == f2)” but don’t warn about comparisons such as “if (f1 < 0.0f)", even though the "<" comparisons can be just as dangerous. Due to imprecision in measurements and calculations a "<" comparison really has three results – less than, greater than, or ambiguous, where ambiguous should be returned whenever the numbers are "close enough". But what counts as close enough? And what do you do then? Dealing with "close enough", if it is even possible, often just propagates the uncertainty and leads to worse problems later on.
      
      Messy stuff. Sorry, no silver bullet.
      
      Reply
      - Richard Harvey says:
        
        January 12, 2019 at 1:38 pm
        
        In fact, perhaps the question should be: “are my results wrong?” instead of “are there floating point errors in my code?”… these are subtly different questions.
        
        Reply
Stefan Winterstein says:

July 18, 2020 at 4:28 am

Stumbled over your name when reading about the lastest Chrome update and had an immediate flashback: THE Bruce Dawson, from the Amiga days?! 30 years ago, CygnusEd has been the program I’ve been using most on the Amiga, and Chrome has been the same for the last 10 (?) years… it’s software like this that makes working with computers a blast, so thanks for that!

Reply
- brucedawson says:
  
  July 20, 2020 at 10:24 pm
  
  Yep, same person. I can take most of the credit for CygnusEd (Steve LaRoque worked on the early versions) but Chrome was awesome long before I got there. I just hope I’m helping make it a bit better.
  
  Reply
Trung0246 says:

August 22, 2020 at 3:22 am

Hi, after I read about your blog for usage of the tool called UIForETW, I tried to use it to debug some weird issues as I tried to capture a specific lag spike. I’m not sure if its approriate to ask here but can you help me take a look at my trace? For some reason, in every couple minute, the process “Service Host: DNS Client” have a very high usage (but I’m not sure if the thing is the root cause). I tried to search for solution related to it but no vain. The only hope I have is to ask you to look at my trace and see the spike (since I don’t have sufficient knowledge to read the trace).

Reply
- brucedawson says:
  
  August 22, 2020 at 11:08 am
  
  Very high CPU usage from another process might be the cause, but you would be better of being more rigorous in your analysis. If you look at the posts linked from https://randomascii.wordpress.com/2015/09/24/etw-central/ you will find tutorials on how to investigate problems. Unfortunately I can’t do this for you. Good luck!
  
  Reply
Phil Miller says:

April 13, 2021 at 5:51 am

Just a quick comment regarding your Chrome Blog article about eliminating copies and allocations in webcam usage:
There’s an additional hidden benefit of your optimization beyond the direct and indirect costs you measured. Every page fault will update an entry in the TLB and in caches of the page table. By avoiding them, you’re leaving those caches undisturbed, reducing overhead for the rest of the process. If you have a high-level metric for performance in that scenario, I bet it would improve by more than the percentage that you measured for the code your bug directly touched on.

Reply
- brucedawson says:
  
  April 13, 2021 at 11:19 am
  
  Avoiding the copy will also avoid polluting the L1 and L2 and L3 caches which is probably even more significant. So, you are probably correct that the actual savings are even greater.
  
  Measuring those improvements requires lower noise levels than we have in our tests so I guess I’ll just take those as an added reason to be happy without ever knowing the actual amount.
  
  Context:
  https://blog.chromium.org/2021/04/dont-copy-that-surface.html
  
  Reply
Andrew Porter says:

July 4, 2021 at 11:25 am

Hello, Bruce! Just curious: you wouldn’t happen to know the average, worst case, and best case latencies for FSINCOS/FSIN on Intel, AMD, and esoteric platforms implementing x87, would you?

Also, I’ve just managed to randomly stumble upon your blog, so excuse any ignorance on my part, but have you ranted yet about how cringe the “optimize later” and “premature optimization” quote taken sorely out of context from Donald Knuth are and how everyone seems to be indifferent to optimization and why we should instead see new hardware as the opportunity for better latency and more features rather than incentive to be lazy and indifferent to performance?

Thanks!

In Pax Christ,
Andrew

Reply
- brucedawson says:
  
  July 5, 2021 at 5:09 pm
  
  Sorry, I don’t have any latency or throughput numbers for fscincos/fsin. Agner Fog might.
  
  I haven’t ranted about that – it seems like to vague a complaint as a general thing for my tastes – there are too many specific issues. As in, I am far more interested in specific instances of unacceptably poorly performing code.
  
  Reply
  - Andrew Porter says:
    
    July 6, 2021 at 7:10 am
    
    Thanks, I’ll ask him
    
    Ah, that’s fair. Well since you say specific, I have a small list of specifics. Indifference in programming related to: branching; memory, both in transient or persistent memory–everything takes so much memory when I know it could use less; garbage collection (whoever got the bright idea that program heap should function like cache memory should be slapped across the face and jailed for a heinous crime); benchmarking vs. cycle counting; and probably other things I could add, but those are my primary concerns, I’d say, when I think of things I’ve come across where that quote has practically been used as an excuse to be inefficient.
    
    Reply
    - brucedawson says:
      
      July 10, 2021 at 2:14 pm
      
      I mean _really_ specific, as in “task-bar context menus call ReadFile hundreds of thousands of times, causing noticeable latency” or “now that the ReadFile bug is fixed the task-bar context menus are faster, but still too slow”. Vague ranting isn’t very practical and I prefer to rant about specific issues that if fixed would deliver noticeable and useful improvements.
      
      Programmer time is expensive and finite so it is wise to apply the engineering mindset and optimize the most important things.
      
      Reply
      - Andrew Porter says:
        
        July 11, 2021 at 7:29 am
        
        That’s quite fair. However, the project I’m working on will have a standard library, so I have the privilege of innovating fast subroutines so that the less important things in the best case will already be optimal, and in the worst case, faster than otherwise. When I implement something, I always look for the best implementation, and if it doesn’t exist, I look for it, write it once, and never worry about it again because high-level optimization is pretty straightforward (imo); it’s the low-level stuff that’s hard to optimize I have found so far.
        
        For example, a _really_ specific problem that I’d love to rant about is how slow hardware divide is, even though you could make an ALU that can be easily reconfigured for divide or multiply under the principle that x*y = z and y = z/x for sums of powers of two reduces to 2^0/x + 2^1/x + … + 2^n/x for division, and 2^0 * x + 2^1 *x + … + 2^n * x for n bits, so you could just have as many bit shifting units as you can cram into the ALU and then set for mul or div mode and then compute and they should execute with similar times and incredibly faster than 3c for mul and 41c for div, but maybe even as fast as 1c with throughput scaled by the number of ALUs like most other simple ALU ops like bit shifts and sums, but unfortunately, it looks like I’ll have to make my own hardware before that ever happens… so for now, I’ll use fixed-point log_2 software divide as 2^(log_2(z) – log_2(x)) to be on par with hardware multiply.
        
        Pleasure chatting, Bruce!
        
        Reply
Reinhold says:

September 14, 2021 at 4:50 am

Hello Bruce, I would like to ask if you can tell us a little bit about your time on working on Test Drive 1 and 2 for the Amiga? How have you done the track layouts and the cars? I am dreaming about a modable version for Test Drive 2 (only on harddisk, Compact Flash etc.) where users can add new cars, tracks, objects similar to the Cannonball Project for Out Run. https://www.youtube.com/watch?v=t-93kDC8Vac | https://github.com/djyt/cannonball/wiki PS: I am after all these years a fan of Test Drive on the Amiga. Best regards, Reini

Reply
- brucedawson says:
  
  September 14, 2021 at 11:38 am
  
  The cars were just hand-drawn bitmaps, in a half-dozen (probably fewer, actually) sizes. The track layout was just an array of angles or some-such (left/right and up/down), and maybe some flags to say whether there was a rock wall present. It was extremely crude and simple, necessarily, and was definitely just fake-3D.
  
  The Amiga’s 32-colour palette let the game look really good – so much better than the PC version – but the code was nothing special and not really Amiga specific.
  
  When I worked on Grand Prix Cycles the next year I had time to implement fancy transitions using the Copper chip to get 60 fps full-screen animation when going between menus. That was fun and it felt good to make use of the Amiga hardware. That was back in the days where there was no PM/approval-process/process so I just did that because I thought it was cool and it shipped. Crazy times.
  
  Reply
  - Dostalgia says:
    
    August 2, 2022 at 12:13 am
    
    Was it too early for VGA graphic?
    
    Reply
    - brucedawson says:
      
      August 3, 2022 at 8:43 am
      
      Test Drive was created in 1987, which is also when VGA graphics were introduced. I can’t remember if the PC version of Test Drive supported this amazing new standard, but it certainly had to support earlier (ugly) graphics standards because few machines had VGA at the time.
      
      Reply
Andrei Vieru says:

October 20, 2021 at 8:43 am

Hi Bruce, came across your work on IdleWakeups in Chromium repository tools area: https://github.com/chromium/chromium/tree/master/tools/win/IdleWakeups

Hoping for a little bit of your insight. Specifically in the readme it’s mentioned that “By default, CPU usage is normalized to one CPU core, with 100% meaning one CPU core is fully utilized.” But when reading through source I’m not seeing where “number of cores” is being taken into account when calculating cpu usage.

Similarly, in your awesome UIforETW repo, when parsing ETL files:
https://github.com/google/UIforETW/blob/186d450b8ff62ddda44571eb38f028f52ffa85b2/TraceProcessors/CPUSummary/CPUSummary.cs#L177

“Per core” is again mentioned but not seeing use of processor count as part of calculations.

Perhaps I’m confused by the concept of “per core” cpu usage, could you share any insight on this?

Thank you

Reply
- brucedawson says:
  
  October 20, 2021 at 6:52 pm
  
  In both cases I think that per-core is implicit in the data source. In many cases the data source is CPU seconds used or some equivalent, and if you divide this by elapsed time then you get the average CPU usage in cores, and then you just multiply by 100%.
  
  That is, if a process uses 13 seconds of CPU time over a ten second period of time then on average it us using 1.3 CPU cores or 130% CPU utilization.
  
  Normalizing to a percentage of total available CPU power requires dividing by the number of CPUs.
  
  Reply
innerpeace says:

August 1, 2022 at 6:24 pm

Hi Bruce,
I’m working on exporting ETL data via WPAExporter and then post process it later. I’m working with WPA to comes up with custom profiles which can then be fed to WPAExporter.
I have had success with most of the raw data in WPA, but there is one thing puzzling me about the “CPU Usage Precise Utilization by Process and thread”, Table data vs graph data.
In the table, I’m able to see various things about “Time since last”, Waits, Switch in Time, CPU usage %, CPU usage in ms etc, for a given process and a thread.

However the graph of this raw data, that WPA puts out is called “% CPU usage using resource time as [Switch-In Time, Switch-in Time + New Switch-in Time] Aggregation: Sum”

I see that the X-axis of this graph is “Switch-in Time” working as a timeline. But I cannot figure out what the Y-Axis corresponds to. None of the columns from the table data correspond to Y-Axis.
sum of “Switch-in Time” and “New Switch-in Time” does not correspond to Y-Axis either.

Could you shed some light on where does WPA come up with the Y-Axis numbers of CPU usage in this Precise graph?

Reply
- brucedawson says:
  
  August 3, 2022 at 8:41 am
  
  FWIW, I don’t use WPAExporter anymore. I prefer trace processing (https://randomascii.wordpress.com/2020/01/05/bulk-etw-trace-analysis-in-c/).
  
  For the graph the Y-axis is percentage utilization. There is some smoothing going on because typically there is more detail than can be displayed but if you zoom in enough then the Y-axis becomes quantized to multiplies of 1/num-logical-processors. That is, at any point in time each logical processor is either in-use or not, and the Y-axis when zoomed in reflects that. When you zoom out it averages it in some manner.
  
  Reply
  - innerpeace says:
    
    August 3, 2022 at 11:14 am
    
    > becomes quantized to multiplies of 1/num-logical-processors.
    Ohh Thanks! that makes a lot of sense,
    
    > I prefer trace processing
    Yeah trace processing is great on its own, way more flexible and can do more things with it. However, I’m currently limited by a tech stack built in JS that needs to produce this analysis.
    One option I considered was to analyze the trace files in C# and convert them to WASM (using https://github.com/unoplatform/Uno.Wasm.Bootstrap) and use WASMed Trace processing in my JS tech stack. However it seems like Trace processing APIs are just Foreign function interfaces, that in turn use native Windows APIS from kernel32.dll which implement OpenTrace and ProcessTrace APIs (https://docs.microsoft.com/en-us/windows/win32/api/evntrace/nf-evntrace-processtrace). So they live outside of the sandbox created by WASM.
    
    I have opened an issue on Microsoft’s issue-only repo for trace processing, but haven’t received any traction on it yet. https://github.com/microsoft/eventtracing-processing/issues/7
    
    Reply