Years ago I worked in the Xbox 360 group at Microsoft. We were thinking about releasing a new console, and we thought it would be nice if that console could run the games of the previous console.
Emulation is always hard, but it is made more challenging when your corporate masters keep changing CPU types. The Xbox one – sorry, the original Xbox – used an x86 CPU. The Xbox two – sorry, the Xbox 360 – used a PowerPC CPU. The Xbox three – sorry, the Xbox One – used an x86/x64 CPU. These ISA flip-flops did not make life easy.
I made some contributions to the team that taught the Xbox 360 how to emulate a lot of the original Xbox games – emulating x86 on PowerPC – and was given the job title Emulation Ninja for that work*. Then I was asked to help investigate what it would take to emulate the Xbox 360’s PowerPC CPU with an x64 CPU. To set expectations, I’ll mention up front that I didn’t find a satisfactory solution.
FMA != MMA
One of the things that worried me was fused multiply add, or FMA instructions. These instructions take three input parameters and they multiply the first two, and then add the third. The ‘fused’ part of fused multiply add means that no rounding is done until the end of the operation. That is, the multiply is done to full precision, and then the add is done, and only then is the result rounded to the final answer.
To demonstrate this in a concrete fashion lets imagine that we are using decimal floating-point numbers with two digits of precision. Let’s imagine this calculation, shown as a function:
FMA(8.1e1, 2.9e1, 4.1e1), or 8.1e1 * 2.9e1 + 4.1e1, or 81 * 29 + 41
81*29 is equal to 2349 and if we add 41 we get 2390. Rounded to two digits we get 2400 or 2.4e3.
But if we don’t have FMA then we have to do the multiply, get 2349, which would be rounded to two digits of precision and get 2300 (2.3e3). Then we would add 41 to get 2341, which would be rounded again to get a final answer of 2300 (2.3e3), which is less accurate than the FMA answer of 2400.
Aside 1: FMA(a,b, -a*b) calculates the error in a*b, which is kind of cool
Aside 2: One side effect of aside 1 is that x = a * b – a * b may not return zero if your compiler automatically generates FMA instructions
So, clearly FMA gives more accurate results than separate multiply and add instructions. Well, let’s not go that far, but let’s agree that if you need to multiply two numbers and then add a third then FMA is going to be more accurate than the alternatives. And, FMA instructions often have lower latency than a multiply followed by an add instruction. On the Xbox 360 CPU the latency and throughput of FMA was the same as for fmul or fadd so using an FMA instead of an fmul followed by a dependent fadd would halve the latency.
The Xbox 360 compiler generated FMA instructions all the time, both vector and scalar. We weren’t sure if the x64 CPUs we chose would support these instructions, so emulating them quickly and accurately was going to be crucial. Our emulation of these would have to be perfect because I knew from my previous experience with emulating floating-point math that “pretty close” tended to mean that characters would fall through the floor, cars would bounce out of the world, etc.
So, what does it take to emulate FMA instructions perfectly on an x64 CPU that doesn’t support them?
Luckily the vast majority of floating-point math in games is done to float (32-bit) precision, and I was quite happy to use double (64-bit precision) instructions in the emulation of FMA.
Emulating float-precision FMA with double-precision floating-point math seems like it should be easy (narrator’s voice: it isn’t. floating-point is never easy). A float has 24 bits of precision and a double has 53 bits of precision. This means that if you convert your input floats to double precision (a lossless conversion) you can then do the multiply with no error. That is, it only takes 48 bits of precision to store fully accurate results, and you have more than that, so it’s all good.
Next, you need to do the add. You just need to take the float addend, convert to double, and add to the result of the multiply. Because there is no rounding during the multiply the only rounding happens after the add, which is exactly what is needed to emulate FMA. Our logic is perfect. Declare victory and go home.
But it doesn’t work. Or, at least, it fails for some inputs. Take a minute to guess why.
Hold music goes here…
It fails because the definition of FMA is that the multiply and the add happen at full precision and then the result is rounded to float precision. We have mostly managed that.
The multiply happens without rounding, and then after the add happens there is rounding. This sounds like what we were trying to do. But the rounding after the add is to double precision. After that we need to store the result to float precision, at which point rounding happens again.
Uggh. Double rounding.
It can be a bit hard to visualize this, so let’s return to our decimal floating-point formats where single-precision is two decimal digits, and double-precision is four. And let’s imagine that we’re calculating FMA(8.1e1, 2.9e1, 9.9e-1), or 81 * 29 + .99.
The fully accurate answer to this is 2,349.99 or 2.34999e3. Rounded to single precision (two digits) that is 2.3e3. Let’s see where things go wrong when we try to emulate this.
When we do a double-precision multiply of 81 and 29 we get 2,349. So far so good.
When we add .99 we get 2,349.99. So far so good.
That result is rounded to double precision and we get 2,350 (2.350e3). Uh oh.
We round that to single precision and, by the rules of IEEE round-to-nearest-even we get 2400 (2.4e3). That’s the wrong answer. It has a slightly greater error than the correctly rounded result given by the FMA instruction.
Now, you might try to claim that the problem is with IEEE’s round-to-nearest-even. However, for any rounding rule that you might come up with there is a case where the double rounding will give you a different answer from a true FMA.
So what happened?
I didn’t find an entirely satisfactory answer to this problem.
I left the Xbox team long before the Xbox One shipped and I haven’t paid any attention to it since then, so I don’t know what they decided to do. Modern x64 CPUs have FMA instructions which can emulate this perfectly. It may also be possible to use the x87 FPU to emulate an FMA, somehow – I can’t remember what I concluded about that. Or maybe they decided that it was close enough.
- Good discussion on twitter.
- This article discusses a technique for solving this problem.
- Discussion on Hacker news
- Discussion on reddit/r/programming
- Discussion on reddit/r/emulation
If you apply for a mortgage when your job title is emulation ninja then you are in a quandary. If you write that on the mortgage application then you look like a lunatic. If you write “software engineer” then you get in trouble when the mortgage broker calls your employer. You know. Hypothetically.