C++ 11 std::async for Fast Float Format Finding

Posted on March 11, 2012 by brucedawson

After a recent post on float precision there was some debate about round-tripping of floats. My claim was that if you print a float with printf(“%1.8e”, f); and then scan it back in then you are guaranteed to get back exactly the same float, but there was some skepticism.

Bold claims need careful proof, and this post shows how C++ 11 can be leveraged to verify this statement faster and more easily than with old-style C++.

The best way to prove my claim is to write some test code. The code to see whether you can round-trip floats to decimal and back without ‘drift’ is pretty simple. You just have to loop through all two billion positive floats (you can use this code as a starting point) and for each float, sprintf it to ASCII and then sscanf it back to a float. If you always get back exactly the same number then the hypothesis is proven.

However this is time-consuming. On my laptop it took about fifty minutes to run. Normally multi-threading it would be too much work for a throwaway task like this, but with a fresh install of VS 11 beta on my laptop I decided to give std::async a try. It worked really well and gave me a 3.6x speedup.

In the code below you can see calls to futures.push_back where I queue up asynchronous calls to FindDriftingFloats(start, end). A bit later I loop through those futures, retrieving the results. This code will automatically parallelize across up to 255 cores, and increasing the level of parallelism is just a matter of doing smaller batches by incrementing start.i by a smaller amount.

The cool thing is that, for the first time, it is possible to write portable multithreaded C++ code.

Here’s the code – focus on the futures vector.

#include <future>
#include <vector>
#include <stdio.h>

/* See
Tricks With the Floating-Point Format

for the potential portability problems with the union and bit-fields below.
*/
union Float_t
{
    Float_t(float num = 0.0f) : f(num) {}
    // Portable extraction of components.
    bool Negative() const { return (i >> 31) != 0; }
    int32_t RawMantissa() const { return i & ((1 << 23) - 1); }
    int32_t RawExponent() const { return (i >> 23) & 0xFF; }

    int32_t i;
    float f;
#ifdef _DEBUG
    struct
    {   // Bitfields for exploration. Do not use in production code.
        uint32_t mantissa : 23;
        uint32_t exponent : 8;
        uint32_t sign : 1;
    } parts;
#endif
};

// Print each float in the range and verify that it round-trips.
// This function is designed to be called on multiple threads.
// Therefore no global variables should be modified.
int FindDriftingFloats(Float_t start, Float_t end)
{
    int failures = 0;
    // These print statements may occur out of order due to this
    // code being run on multiple threads.
    printf("Scan %1.8ef to %1.8ef (%x to %x)\n",
                start.f, end.f, start.i, end.i);

    while (start.f < end.f)
    {
        char buffer[30];
        sprintf_s(buffer, "%1.8e", start.f);
        float fRead;
        int count = sscanf_s(buffer, "%f", &fRead);
        if (count != 1 || fRead != start.f)
        {
            printf("Conversion failure on %1.8e (%08x)!\n",
                        start.f, start.i);
            ++failures;
        }

        start.i += 1;
    }

    return failures;
}

int main(int argc, char* argv[])
{
    auto startTime = clock();
    std::vector<std::future<int>> futures;

    // This represents infinity. We stop here, without processing
    // this number.
    const uint32_t maxRep = 255 << 23;
    Float_t start(0);
    while (start.i < maxRep)
    {
        Float_t end;
        // Increment by any amount. 1<<23 means that each exponent
        // is processed as one batch, but any positive integer will do.
        end.i = start.i + (1 << 23);
        if (end.i >= maxRep)
            end.i = maxRep;

        // Call FindDriftingFloats(start, end) on another thread.
        futures.push_back(std::async(std::launch::async,
                FindDriftingFloats, start, end));
        // Advance to the next range.
        start = end;
    }

    // Record the results. The results may not complete in order, but we
    // wait for them in order.
    int totalFailures = 0;
    for (size_t i = 0; i < futures.size(); ++i)
    {
        int failures = futures[i].get();
        totalFailures += failures;
        // Cast size_t to a type that can be portably printed. unsigned
        // long long would be better if the values could actually be large.
        printf("%d failures found in range %u.\n", failures, (unsigned)i);
    }

    printf("%d floats failed to round-trip to decimal.\n", totalFailures);
    auto elapsedTime = clock() - startTime;
    printf("Took %1.3fs\n", elapsedTime / double(CLOCKS_PER_SEC));

    return 0;
}

Typical output is shown here:

Scan 0.00000000e+000f to 1.17549435e-038f (0 to 800000)
Scan 4.70197740e-038f to 9.40395481e-038f (1800000 to 2000000)
Scan 1.88079096e-037f to 3.76158192e-037f (2800000 to 3000000)
Scan 3.76158192e-037f to 7.52316385e-037f (3000000 to 3800000)
Scan 1.17549435e-038f to 2.35098870e-038f (800000 to 1000000)
Scan 2.35098870e-038f to 4.70197740e-038f (1000000 to 1800000)
Scan 9.40395481e-038f to 1.88079096e-037f (2000000 to 2800000)
Scan 7.52316385e-037f to 1.50463277e-036f (3800000 to 4000000)
Scan 1.50463277e-036f to 3.00926554e-036f (4000000 to 4800000)
Scan 3.00926554e-036f to 6.01853108e-036f (4800000 to 5000000)
Scan 6.01853108e-036f to 1.20370622e-035f (5000000 to 5800000)
Scan 1.20370622e-035f to 2.40741243e-035f (5800000 to 6000000)
Scan 2.40741243e-035f to 4.81482486e-035f (6000000 to 6800000)
Scan 4.81482486e-035f to 9.62964972e-035f (6800000 to 7000000)
Scan 9.62964972e-035f to 1.92592994e-034f (7000000 to 7800000)
0 failures found in range 0.
Scan 1.92592994e-034f to 3.85185989e-034f (7800000 to 8000000)
0 failures found in range 1.
0 failures found in range 2.
0 failures found in range 3.

…

Scan 8.50705917e+037f to 1.70141183e+038f (7e800000 to 7f000000)
0 failures found in range 246.
Scan 1.70141183e+038f to 1.#INF0000e+000f (7f000000 to 7f800000)
0 failures found in range 247.
0 failures found in range 248.
0 failures found in range 249.
0 failures found in range 250.
0 failures found in range 251.
0 failures found in range 252.
0 failures found in range 253.
0 failures found in range 254.
0 floats failed to round-trip to decimal.
Took 830.424s

Note that the “Scan …” printouts don’t always appear in order. That is because they are being done on separate threads and their precise timing is at the whim of the Windows scheduler.

Note also that fifteen jobs were started before the first result was retrieved. That may indicate that the first job took a particularly long time (results are retrieved in order), or it may mean that the main thread was starved for CPU time and was briefly unable to retrieve results.

My laptop has four hyperthreaded cores. The fact that I only got a speedup of 3.6x is interesting, and probably has to do with the single-threaded CPU frequency being higher due to TurboBoost, together with lackluster improvements from hyperthreading. It happens. I did verify that all eight hardware threads were pegged the entire time.

That’s it. Multi-threading done easily enough that even on run-once jobs like this I can multithread my tests in less time than the 36 minutes that multi-threading will save me. Win-win.

The total time to run to run the multithreaded test was under 14 minutes. On a desktop machine with more and faster cores it could easily be done in half that time or less – not bad for running a test on two billion floats.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048

View all posts by brucedawson →

This entry was posted in Floating Point, Programming, Visual Studio and tagged float, precision, print, scanf. Bookmark the permalink.

24 Responses to C++ 11 std::async for Fast Float Format Finding

Jeff Preshing says:

March 11, 2012 at 9:19 pm

Really enjoyed this one Bruce. Interesting, fun and a practical intro to a C++11 module.

Reply
zeuxcg says:

March 12, 2012 at 2:52 pm

You can use clock() and avoid Windows dependency entirely. You don’t get high-frequency timing, but you don’t get it with GetTickCount either – maybe std::chrono fixes that.

Reply
- brucedawson says:
  
  March 12, 2012 at 8:48 pm
  
  Thanks for the clock() suggestion. I’ve updated the post to use that. It looks like clock() just uses GetTickCount() under the covers so the results should be identical, but more portable. I suspect the resolution is identical, which means 10-20 ms on most platforms, which is plenty sufficient for my needs.
  
  Plus, clock() gave me an excuse to use ‘auto’.
  
  Reply
  - Roman says:
    
    June 28, 2013 at 2:07 pm
    
    clock(), however, returns processor time across all threads (that is, on systems that actually implement it correctly), so while it’s portable, it doesn’t give equivalent results across platforms.
    
    Reply
    - brucedawson says:
      
      June 29, 2013 at 8:38 am
      
      Well dang. I wanted wall-clock time, not processor time. I guess I used the wrong function. I’ll have to fix that.
      
      Reply
Pingback: Intermediate Floating-Point Precision | Random ASCII
Guillaume Couture-Le (@GuillaumeCL) says:

March 22, 2012 at 7:20 am

You can probalby get a bit of a better speedup if you use less threads. The std::launch::async policy says that every task will run on a new thread. If instead you just create 4 async tasks you should be able to get a better speedup. See this blog post for details:
http://blog.corensic.com/2011/10/10/async-tasks-in-c11-not-quite-there-yet/

Reply
- brucedawson says:
  
  March 22, 2012 at 9:32 pm
  
  Thanks for the pointer. It was good to read a detailed analysis of std::async by somebody who seems to know a lot more about it than I do. However I seem to be avoiding the problems which the author describes. When my code runs I can see that my process is using 11 threads on my 8-core system (15 threads on my 12-core system). That seems just about perfect as it soaks up all available processor time, but doesn’t over commit.
  
  To be clear, I create 255 tasks with std::async, but the CRT seems to be running them on a number of threads that is appropriate for my computer.
  
  Reply
  - Michal Mocny says:
    
    March 30, 2012 at 7:25 pm
    
    Thanks for this reply, I was going to make the same suggestion about over committing, so I am glad to know that the CRT implementation is doing the right thing here.
    
    Still, there is a way to query for number of native threads supported on a system (std::thread::hardware_concurrency) and then spawn only that many async tasks, to achieve this behavior with higher portability.
    
    (also, its trivial and picky, but is there a reason to use if (end.i >= maxRep) instead of std::min()?)
    
    Reply
brucedawson says:

March 30, 2012 at 8:14 pm

I assume you don’t want to spawn only #cores different tasks because then you could end up going briefly idle. I think it’s going to take a while to figure out the best practices. Also, the main purpose of this post was just to show what could now be thrown together very quickly. I’m glad it’s efficient on VC++. Beyond that, we shall see.

No particular reason for the ‘if’ instead of std::min(). Personal preference for this particular case? Habit?

Reply
Matthew Miner says:

April 4, 2012 at 8:06 am

Fascinating (thanks for this and related posts re: floating point).

You made me curious, so I read up a bit on HyperThreadingTechnology. If I may presume to add to your excellent work:

Given a core with HT, what you really have is one execution unit, and two “everything else” – instruction fetch, decode, etc. When one thread “stalls” while addressing a cache miss, for example, the other thread steps in to keep the execution unit busy.

In your sample app, which is small enough to easily fit in cache, and not particularly memory intensive, once a thread starts running, I imagine there is little reason for it to stall and give the “other thread” access to the execution unit. So, the second thread doesn’t _really_ get access to the execution unit until the first thread ends, making the core’s behavior essentially single core (in spite of the HT).

So, for this sort of workload (all threads are small function, CPU/register intensive), HT is not particularly useful.

That gives you (effectively) four cores (so, four times faster), minus overhead (Windows, process explorer, etc.), or about 3.6x speed up. I suspect you will find comparable performance if you disable HT.

I also realize now that it’s important to group my threads appropriately. If I have a choice, I don’t want to put two small, register intensive threads on one core (one will starve), while putting two memory intensive big funcs on my other core (they’ll be fine, but I won’t get the benefits I want for my first two threads). What I’d prefer to do is spread my two big funcs across the two cores at relatively high prio, and put my two small, register intensive, tasks on the same two separate cores at lower prio.

Reply
- brucedawson says:
  
  April 4, 2012 at 1:18 pm
  
  With HT the most you can generically say is that some resources are shared, and some are per-thread. The registers are always per-thread, and typically most of the execution resources (caches and execution units for example) are shared. I’m not sure about instruction decode — I thought that was shared in most Intel HT processors.
  
  Cache misses are one opportunity for HT to improve efficiency, since they leave large gaps where one thread is idle and the other can run full speed. However they are not the only case. Any time that one thread is not using 100% of the execution resources then the other thread can use them. This can happen if you have one thread doing integer heavy work and another thread doing FP heavy work — they may both run at top speed. It can also happen if, due to data dependencies, one thread cannot fully use the execution resources. For instance, imagine this code:
  
  add eax, eax
  add eax, ebx
  add eax, ecx
  add eax, edx
  add eax, eax
  …
  
  Because each instruction is dependent on the result of the previous one, and because integer add has one cycle latency (on most CPUs) this code will execute one instruction per cycle. Most modern Intel CPUs can do three integer adds per cycle, so this code leaves the main integer ALUs two thirds idle. Therefore, if this code is running on both hyperthreads of a single CPU then it can run at full speed on both threads, thus doubling throughput.
  
  In other words, long dependency chains are an opportunity for hyperthreading to give a speedup. I modified the test code that I used for this post:
  
  Fractal and Crypto Performance
  
  to test the endless chain of dependent adds listed below, and I confirmed that hyperthreads then give almost perfect scaling (I got 6.9 times better performance on eight threads compared to one on my four-core CPU).
  
  So, apparently the printf/sscanf code that I was exercising is fully using some type of execution resource
  
  Out-Of-Order cores are massively complicated, and imperfectly documented, so predicting exactly what will happen is very difficult.
  
  Reply
Pingback: Floating-point complexities | Random ASCII
Pingback: Exceptional Floating Point | Random ASCII
Pingback: That’s Not Normal–the Performance of Odd Floats | Random ASCII
Pingback: Game Developer Magazine Floating Point | Random ASCII
Pingback: Top Eight Entertaining Blog Facts for 2012 | Random ASCII
Pingback: Float Precision Revisited: Nine Digit Float Portability | Random ASCII
Pingback: Comparing Floating Point Numbers, 2012 Edition | Random ASCII
Pingback: There’s Only Four Billion Floats–So Test Them All! | Random ASCII
Caleb says:

February 1, 2014 at 7:04 pm

I’ve been really enjoying your blog! I got here when I was having issues with floating point comparisons, and have just finished writing a simple multithreaded fractal generator with some of these examples. Thanks for this.

Reply
Alexander Riccio says:

June 23, 2014 at 12:01 am
Code analysis found a tiny bug:
```
 printf("%d failures found in range %d.\n", failures, i);
```
Should really be:
```
printf( "%d failures found in range %lu.\n", failures, i );
```
As size_t is an unsigned type.
Reply
- brucedawson says:
  
  June 23, 2014 at 8:05 am
  
  Yep, the code is wrong. But so is your fix. Switching to ‘u’ is correct, but the more important thing is to get the size right, and adding ‘l’ does not do that. ‘l’ specifies long, and there is no guarantee that size_t is a long int. On different platforms long is either the same size, smaller, or larger than size_t.
  
  %zu is the C99 way to specify this, but is not yet supported by VC++. So the correct fix is to cast the result to some other type (unsigned, unsigned long, unsigned long long) and print that.
  
  Code analysis is often misleading when it comes to printing size_t. I have never seen it recommend a portable fix. I look forward to VC++ 14 which will support %zu.
  
  I fixed the code. Thanks.
  
  Reply
  - Alexander Riccio says:
    
    June 23, 2014 at 10:35 am
    Gah! I hate format specifiers.
    Sidenote:
    using
    
    std::launch::async|std::launch::deferred
    
    instead of
    
    std::launch::async
    
    alone seems to result in a slight, but measurable performance boost.
    
    Concurrency Visualizer shows a ton of preemption & synchronization with std::launch::async alone
    
    (loved the article!)
    Reply