## Exercises in Emulation: Xbox 360’s FMA Instruction

Years ago I worked in the Xbox 360 group at Microsoft. We were thinking about releasing a new console, and we thought it would be nice if that console could run the games of the previous console.

Emulation is always hard, but it is made more challenging when your corporate masters keep changing CPU types. The Xbox one – sorry, the original Xbox – used an x86 CPU. The Xbox two – sorry, the Xbox 360 – used a PowerPC CPU. The Xbox three – sorry, the Xbox One – used an x86/x64 CPU. These ISA flip-flops did not make life easy.

I made some contributions to the team that taught the Xbox 360 how to emulate a lot of the original Xbox games – emulating x86 on PowerPC – and was given the job title Emulation Ninja for that work*. Then I was asked to help investigate what it would take to emulate the Xbox 360’s PowerPC CPU with an x64 CPU. To set expectations, I’ll mention up front that I didn’t find a satisfactory solution.

## FMA != MMA

One of the things that worried me was fused multiply add, or FMA instructions. These instructions take three input parameters and they multiply the first two, and then add the third. The ‘fused’ part of fused multiply add means that no rounding is done until the end of the operation. That is, the multiply is done to full precision, and then the add is done, and only then is the result rounded to the final answer.

To demonstrate this in a concrete fashion lets imagine that we are using decimal floating-point numbers with two digits of precision. Let’s imagine this calculation, shown as a function:

FMA(8.1e1, 2.9e1, 4.1e1),  or 8.1e1 * 2.9e1 + 4.1e1,  or 81 * 29 + 41

81*29 is equal to 2349 and if we add 41 we get 2390. Rounded to two digits we get 2400 or 2.4e3.

But if we don’t have FMA then we have to do the multiply, get 2349, which would be rounded to two digits of precision and get 2300 (2.3e3). Then we would add 41 to get 2341, which would be rounded again to get a final answer of 2300 (2.3e3), which is less accurate than the FMA answer of 2400.

Aside 1: FMA(a,b, -a*b) calculates the error in a*b, which is kind of cool

Aside 2: One side effect of aside 1 is that x = a * b – a * b may not return zero if your compiler automatically generates FMA instructions

So, clearly FMA gives more accurate results than separate multiply and add instructions. Well, let’s not go that far, but let’s agree that if you need to multiply two numbers and then add a third then FMA is going to be more accurate than the alternatives. And, FMA instructions often have lower latency than a multiply followed by an add instruction. On the Xbox 360 CPU the latency and throughput of FMA was the same as for fmul or fadd so using an FMA instead of an fmul followed by a dependent fadd would halve the latency.

## Emulating FMA

The Xbox 360 compiler generated FMA instructions all the time, both vector and scalar. We weren’t sure if the x64 CPUs we chose would support these instructions, so emulating them quickly and accurately was going to be crucial. Our emulation of these would have to be perfect because I knew from my previous experience with emulating floating-point math that “pretty close” tended to mean that characters would fall through the floor, cars would bounce out of the world, etc.

So, what does it take to emulate FMA instructions perfectly on an x64 CPU that doesn’t support them?

Luckily the vast majority of floating-point math in games is done to float (32-bit) precision, and I was quite happy to use double (64-bit precision) instructions in the emulation of FMA.

Emulating float-precision FMA with double-precision floating-point math seems like it should be easy (narrator’s voice: it isn’t. floating-point is never easy). A float has 24 bits of precision and a double has 53 bits of precision. This means that if you convert your input floats to double precision (a lossless conversion) you can then do the multiply with no error. That is, it only takes 48 bits of precision to store fully accurate results, and you have more than that, so it’s all good.

Next, you need to do the add. You just need to take the float addend, convert to double, and add to the result of the multiply. Because there is no rounding during the multiply the only rounding happens after the add, which is exactly what is needed to emulate FMA. Our logic is perfect. Declare victory and go home.

## So close…

But it doesn’t work. Or, at least, it fails for some inputs. Take a minute to guess why.

Hold music goes here…

It fails because the definition of FMA is that the multiply and the add happen at full precision and then the result is rounded to float precision. We have mostly managed that.

The multiply happens without rounding, and then after the add happens there is rounding. This sounds like what we were trying to do. But the rounding after the add is to double precision. After that we need to store the result to float precision, at which point rounding happens again.

Uggh. Double rounding.

It can be a bit hard to visualize this, so let’s return to our decimal floating-point formats where single-precision is two decimal digits, and double-precision is four. And let’s imagine that we’re calculating FMA(8.1e1, 2.9e1, 9.9e-1), or 81 * 29 + .99.

The fully accurate answer to this is 2,349.99 or 2.34999e3. Rounded to single precision (two digits) that is 2.3e3. Let’s see where things go wrong when we try to emulate this.

When we do a double-precision multiply of 81 and 29 we get 2,349. So far so good.

When we add .99 we get 2,349.99. So far so good.

That result is rounded to double precision and we get 2,350 (2.350e3). Uh oh.

We round that to single precision and, by the rules of IEEE round-to-nearest-even we get 2400 (2.4e3). That’s the wrong answer. It has a slightly greater error than the correctly rounded result given by the FMA instruction.

Now, you might try to claim that the problem is with IEEE’s round-to-nearest-even. However, for any rounding rule that you might come up with there is a case where the double rounding will give you a different answer from a true FMA.

## So what happened?

I didn’t find an entirely satisfactory answer to this problem.

I left the Xbox team long before the Xbox One shipped and I haven’t paid any attention to it since then, so I don’t know what they decided to do. Modern x64 CPUs have FMA instructions which can emulate this perfectly. It may also be possible to use the x87 FPU to emulate an FMA, somehow – I can’t remember what I concluded about that. Or maybe they decided that it was close enough.

## Side note

If you apply for a mortgage when your job title is emulation ninja then you are in a quandary. If you write that on the mortgage application then you look like a lunatic. If you write “software engineer” then you get in trouble when the mortgage broker calls your employer. You know. Hypothetically.

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048
This entry was posted in Floating Point and tagged , , , , . Bookmark the permalink.

### 16 Responses to Exercises in Emulation: Xbox 360’s FMA Instruction

1. Typo says:

Typo: “81*82 is equal to 2349”

2. sandinak1 says:

The side note made me ROTFL for a few hours. A nice statement on all the “well what do you want to be called” titles.

• John Simon says:

Yeah, HR is not a place to get cute. I forget the details, but apparently there’s someone who can only take on work in Canada as a ‘Shader Wizard’.

• brucedawson says:

It definitely falls into the “you can’t make this up” category. Getting a panicked call from my wife saying that closing on our house was in jeopardy due to a VP giving me a fun “present” was a surreal experience. It all worked out, although I don’t remember how it was resolved

• kaleberg says:

I lucked out. The first start up I worked for was so small that I was the only one in the office when my bank called to verify my employment and salary. We didn’t have an HR department or a receptionist or any of that fancy stuff.

3. Marc C says:

Bruce, thanks for this. Out of curiosity, what are the main game functions in which FMA is used? Matrix multiplication and convolution?

• brucedawson says:

Matrix multiplication is one obvious case but multiply-and-accumulate is extremely common, and the compiler would generate an FMA for “x = a * b + c;” (by default) which is the sort of expression that developers write all over.
The FMA instruction had the same latency as fmul and fadd individually, so using fma would halve the latency of these calculations, so using them as important.

4. Steven says:

Hi Bruce, did you work with Michael Brundage? He seems to be a fellow Emulation Ninja who worked on Xbox 360. Don’t know him personally, just a fan of Xbox lore. Thanks for the great work on my favorite game console!

• brucedawson says:

I have worked with Michael (fellow Emulation Ninja) and I work nearby him now. In fact, he appears in one of my commute challenge videos:

5. ariwn says:

There’s something wrong with your first example. You are talking about a 2 digit mantissa, but you are using 141 which has 3 digits

• brucedawson says:

Dang, that’s what I get for updating my examples at the last minute. Yeah. 41 was the original number. I changed to 141 because I liked the flow better and totally failed to notice that it made no sense. Fixed. Thanks for pointing that out.

6. Sandeep says:

Hello,

In aside1, what does it mean by phrase “calculates the error in a*b”?

• brucedawson says:

If a and b are floating-point numbers and there result is stored in another floating-point number of the same size then the result will, necessarily, be rounded. There will be error introduced. The expression FMA(a,b, -a*b) exactly calculates the error in a*b.

7. Rale says:

There was algorithm for exact emulation of FMA using only working precision, I guess for fp32 some time can be saved if it was mixed with fp64 (initial splitting), but without benchmark it is impossible to be exact. It was published in 2009. Was there any specific reason against using that algorithm?

• brucedawson says:

I was not aware of that algorithm, and my sloppy investigation was probably done around the same time – I think I left the Xbox team in 2009 so I might even have been experimenting with this in 2008. I’m not sure.

This site uses Akismet to reduce spam. Learn how your comment data is processed.