The recent reveal of Meltdown and Spectre reminded me of the time I found a related design bug in the Xbox 360 CPU – a newly added instruction whose mere existence was dangerous.
Back in 2005 I was the Xbox 360 CPU guy. I lived and breathed that chip. I still have a 30-cm CPU wafer on my wall, and a four-foot poster of the CPU’s layout. I spent so much time understanding how that CPU’s pipelines worked that when I was asked to investigate some impossible crashes I was able to intuit how a design bug must be their cause. But first, some background…
The Xbox 360 CPU is a three-core PowerPC chip made by IBM. The three cores sit in three separate quadrants, with the fourth quadrant containing a 1-MB L2 cache – you can see the different components in the picture at right and on my CPU wafer. Each core has a 32-KB instruction cache and a 32-KB data cache.
Trivia: Core 0 was closer to the L2 cache and had measurably lower L2 latencies.
The Xbox 360 CPU had high latencies for everything, with memory latencies being particularly bad. And, the 1-MB L2 cache (all that could fit) was pretty small for a three-core CPU. So, conserving space in the L2 cache in order to minimize cache misses was important.
CPU caches improve performance due to spatial and temporal locality. Spatial locality means that if you’ve used one byte of data then you’ll probably use other nearby bytes of data soon. Temporal locality means that if you’ve used some memory then you will probably use it again in the near future.
But sometimes temporal locality doesn’t actually happen. If you are processing a large array of data once-per-frame then it may be trivially provable that it will all be gone from the L2 cache by the time you need it again. You still want that data in the L1 cache so that you can benefit from spatial locality, but having it consuming valuable space in the L2 cache just means it will evict other data, perhaps slowing down the other two cores.
Normally this is unavoidable. The memory coherency mechanism of our PowerPC CPU required that all data in the L1 caches also be in the L2 cache. The MESI protocol used for memory coherency requires that when one core writes to a cache line that any other cores with a copy of the same cache line need to discard it – and the L2 cache was responsible for keeping track of which L1 caches were caching which addresses.
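The coherency rule described above – a write by one core forces every other core to discard its copy of the line – can be sketched as a toy state machine. This is a minimal illustration of MESI-style invalidation, not the real hardware protocol:

```cpp
#include <cassert>
#include <vector>

// Toy MESI model: the state of a single cache line in each core's L1.
enum State { Modified, Exclusive, Shared, Invalid };

struct ToyCaches {
    std::vector<State> line;  // per-core state of one cache line
    explicit ToyCaches(int cores) : line(cores, Invalid) {}

    void read(int core) {
        // A read pulls the line in; any other holder drops to Shared.
        bool others = false;
        for (int c = 0; c < (int)line.size(); ++c)
            if (c != core && line[c] != Invalid) { line[c] = Shared; others = true; }
        line[core] = others ? Shared : Exclusive;
    }
    void write(int core) {
        // A write invalidates every other copy – the guarantee xdcbt broke,
        // because the L2 directory never learned that the L1 held the line.
        for (int c = 0; c < (int)line.size(); ++c)
            if (c != core) line[c] = Invalid;
        line[core] = Modified;
    }
};
```

On the real chip the L2 acted as the directory that made `write()`'s invalidation possible; a line brought in with xdcbt was invisible to it.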
But, the CPU was for a video game console and performance trumped all so a new instruction was added – xdcbt. The normal PowerPC dcbt instruction was a typical prefetch instruction. The xdcbt instruction was an extended prefetch instruction that fetched straight from memory to the L1 d-cache, skipping L2. This meant that memory coherency was no longer guaranteed, but hey, we’re video game programmers, we know what we’re doing, it will be fine.
I wrote a widely-used Xbox 360 memory copy routine that optionally used xdcbt. Prefetching the source data was crucial for performance; normally the routine used dcbt, but if the caller passed in the PREFETCH_EX flag it would prefetch with xdcbt instead. This was not well thought out. The prefetching was basically:

if (flags & PREFETCH_EX)
    __xdcbt(src + offset);   // extended prefetch – skips L2 (intrinsic names illustrative)
else
    __dcbt(src + offset);    // normal, coherent prefetch
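Expanded into a runnable sketch, the hazard looks roughly like this. The intrinsic names are stand-ins that just record what was prefetched – this is an illustration of the loop's behavior, not the shipped routine:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

const size_t kLineSize = 128;     // Xbox 360 cache-line size
const uint32_t PREFETCH_EX = 1;   // flag from the post

// Stand-ins for the real prefetch intrinsics – here they just record
// which addresses were given to the extended (incoherent) prefetch.
std::vector<const uint8_t*> g_extended;
void fake_dcbt(const uint8_t*) {}
void fake_xdcbt(const uint8_t* p) { g_extended.push_back(p); }

// Sketch of the copy routine's prefetch pass: it steps a cache line at
// a time up to and including the line holding the last source byte, so
// with PREFETCH_EX the final line – which may also hold adjacent heap
// metadata – is pulled into L1 incoherently.
void prefetch_source(const uint8_t* src, size_t len, uint32_t flags) {
    for (size_t offset = 0; offset < len; offset += kLineSize) {
        if (flags & PREFETCH_EX)
            fake_xdcbt(src + offset);
        else
            fake_dcbt(src + offset);
    }
}
```

For a 1000-byte copy the last prefetched line covers offsets 896–1023, so 24 bytes beyond the buffer come along for the ride.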
A game developer who was using this function reported weird crashes – heap corruption crashes, but the heap structures in the memory dumps looked normal. After staring at the crash dumps for a while I realized what a mistake I had made.
Memory that is prefetched with xdcbt is toxic. If it is written by another core before being flushed from L1 then two cores have different views of memory and there is no guarantee their views will ever converge. The Xbox 360 cache lines were 128 bytes and my copy routine’s prefetching went right to the end of the source memory, meaning that xdcbt was applied to some cache lines whose latter portions were part of adjacent data structures. Typically this was heap metadata – at least that’s where we saw the crashes. The incoherent core saw stale data (despite careful use of locks), and crashed, but the crash dump wrote out the actual contents of RAM so that we couldn’t see what happened.
So, the only safe way to use xdcbt was to be very careful not to prefetch even a single byte beyond the end of the buffer. I fixed my memory copy routine to avoid prefetching too far, but while waiting for the fix the game developer stopped passing the PREFETCH_EX flag and the crashes went away.
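The fix can be sketched with simple cache-line arithmetic: xdcbt is only safe on lines that lie entirely inside the buffer, since a partial line at either end shares bytes with neighbouring allocations. This is an illustration of that constraint, assuming the 128-byte lines from the post, not the actual shipped code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const size_t kLineSize = 128;  // Xbox 360 cache-line size

// The range of cache lines lying *entirely* inside [src, src+len) –
// the only lines safe to xdcbt. Partial lines at either end must be
// fetched with the coherent dcbt (or not prefetched at all).
struct LineRange { uintptr_t first, end; };

LineRange safe_xdcbt_range(uintptr_t src, size_t len) {
    uintptr_t first = (src + kLineSize - 1) & ~(kLineSize - 1);  // round up
    uintptr_t end   = (src + len) & ~(kLineSize - 1);            // round down
    if (end < first) end = first;  // buffer smaller than one full line
    return { first, end };
}
```

Note that a small or badly aligned buffer may contain no full line at all, in which case xdcbt must not be used on any of it.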
The real bug
So far so normal, right? Cocky game developers play with fire, fly too close to the sun, marry their mothers, and a game console almost misses Christmas.
But, we caught it in time, we got away with it, and we were all set to ship the games and the console and go home happy.
And then the same game started crashing again.
The symptoms were identical. Except that the game was no longer using the xdcbt instruction. I could step through the code and see that. We had a serious problem.
I used the ancient debugging technique of staring at my screen with a blank mind, let the CPU pipelines fill my subconscious, and I suddenly realized the problem. A quick email to IBM confirmed my suspicion about a subtle internal CPU detail that I had never thought about before. And it’s the same culprit behind Meltdown and Spectre.
The Xbox 360 CPU is an in-order CPU. It’s pretty simple really, relying on its high frequency (not as high as hoped despite 10 FO4) for performance. But it does have a branch predictor – its very long pipelines make that necessary. Here’s a publicly shared CPU pipeline diagram I made (my cycle-accurate version is NDA only, but looky here) that shows all of the pipelines:
You can see the branch predictor, and you can see that the pipelines are very long (wide on the diagram) – plenty long enough for mispredicted instructions to get up to speed, even with in-order processing.
So, the branch predictor makes a prediction and the predicted instructions are fetched, decoded, and executed – but not retired until the prediction is known to be correct. Sound familiar? The realization I had – it was new to me at the time – was what it meant to speculatively execute a prefetch. The latencies were long, so it was important to get the prefetch transaction on the bus as soon as possible, and once a prefetch had been initiated there was no way to cancel it. So a speculatively-executed xdcbt was identical to a real xdcbt! (a speculatively-executed load instruction was just a prefetch, FWIW).
And that was the problem – the branch predictor would sometimes cause xdcbt instructions to be speculatively executed and that was just as bad as really executing them. One of my coworkers (thanks Tracy!) suggested a clever test to verify this – replace every xdcbt in the game with a breakpoint. This achieved two things:
- The breakpoints were not hit, thus proving that the game was not executing xdcbt instructions.
- The crashes went away.
I knew that would be the result and yet it was still amazing. All these years later, and even after reading about Meltdown, it’s still nerdy cool to see solid proof that instructions that were not executed were causing crashes.
The branch predictor realization made it clear that this instruction was too dangerous to have anywhere in the code segment of any game – controlling when an instruction might be speculatively executed is too difficult. The branch predictor for indirect branches could, theoretically, predict any address, so there was no “safe place” to put an xdcbt instruction. And, if speculatively executed it would happily do an extended prefetch of whatever memory the specified registers happened to randomly contain. It was possible to reduce the risk, but not eliminate it, and it just wasn’t worth it. While Xbox 360 architecture discussions continue to mention the instruction I doubt that any games ever shipped with it.
I mentioned this once during a job interview – “describe the toughest bug you’ve had to investigate” – and the interviewer’s reaction was “yeah, we hit something similar on the Alpha processor”. The more things change…
Thanks to Michael for some editing.
How can a branch that is never taken be predicted to be taken? Easy. Branch predictors don’t maintain perfect history for every branch in the executable – that would be impractical. Instead simple branch predictors typically squish together a bunch of address bits, maybe some branch history bits as well, and index into an array of two-bit entries. Thus, the branch predict result is affected by other, unrelated branches, leading to sometimes spurious predictions. But it’s okay, because it’s “just a prediction” and it doesn’t need to be right.
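The aliasing described above can be sketched with a classic table of two-bit saturating counters. The table size and index hash here are illustrative, not the Xbox 360's actual predictor:

```cpp
#include <cassert>
#include <cstdint>

// Tiny branch predictor: 16 two-bit saturating counters indexed by a
// few address bits. Unrelated branches that hash to the same slot
// pollute each other's prediction.
struct TwoBitPredictor {
    uint8_t counter[16] = {};  // 0..3, all start at "strongly not taken"

    int index(uint64_t pc) const { return (pc >> 2) & 15; }
    bool predict(uint64_t pc) const { return counter[index(pc)] >= 2; }
    void update(uint64_t pc, bool taken) {
        uint8_t& c = counter[index(pc)];
        if (taken) { if (c < 3) ++c; }
        else       { if (c > 0) --c; }
    }
};
```

With this hash, branches 64 bytes apart (such as 0x100 and 0x140) share a counter, so training one branch "taken" makes the other – which may never have been taken in its life – predicted taken too. That is exactly how an xdcbt that is never architecturally reached can still be speculatively executed.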
Discussions of this post can be found on hacker news (hacker news in 2021), r/programming, r/emulation, and twitter.
A somewhat related (Xbox, caches) bug was discussed a few years ago here.
How did the interview go, by the way? :p
Thanks. And the interview went well enough to get me an offer
let me guess was it at google 🙂
It actually wasn’t, and I turned down the offer, but good guess
Seems to me the prefetch instruction should be safe as long as memcpy always breaks before the second-to-last loop iteration? Assuming the CPU never predicts more than one loop round… and prefetches that aren’t behind a branch should be safe. Certainly would be playing with fire though.
I wonder if processors could use a “prediction barrier.”
Making guesses about branch predictors is a very dangerous game. The main risk, however, is that xdcbt might be predicted to be executed on a block where it is not safe at all. Thus, we could end up doing an extended prefetch on the first cache line of an unintended block.
And, if we intentionally xdcbt a block of memory we have to be *extremely* careful about what we do with that memory (no freeing, no other-core use) until we are certain it is flushed from L1, which takes, how long to be sure?
This is a good example of how the logical principle, quoting Wikipedia, “ex falso (sequitur) quodlibet (EFQ), ‘from falsehood, anything (follows)’ ” intersects with computer science. The illusion that caches sync transparently, and that failed branch predictions “didn’t happen”, are delicate fictions maintained by all of the rules of the system playing along. By creating the xdcbt instruction, you introduced a logical contradiction in those rules. At that point you lost control of the consequences; a system subject to EFQ diverges. (Note that a result of “can’t be so bad because other things appear to still work as designed” is also a subset of “anything” in the “anything follows!”)
This is why I love computer science – the chance to see supposedly theoretical ideas manifest in real effects. I like to call computer science “experimental philosophy” in analogy to the experimental/theoretical divide among physicists. Philosophers are doing “theoretical philosophy”; those of us who create systems are doing “experimental philosophy.”
I like it! The intersection of philosophy and computer science.
I can’t be sure, but given the information about the incoherent memory views when using the prefetch, couldn’t that just be a side effect of the very relaxed memory model of the PowerPC? Without the right sync instructions it should be possible to get some data structures into a “nonsense” state if read by one core while another core is writing them (the PowerPC memory model actually supports multiple coalescing store buffers, which can in practice result in reordered writes as seen from another core).
BTW, I had a very “debugging” experience with some lockless memory queue in that processor as well due to having 2 threads in a single core entering the critical section of a lwarx/stwcx in lockstep resulting in neither of them being able to finish (until one of the system tasks, which liked to hop between physical threads, broke the lock-step). Fun times.
I consider this separate from the relaxed memory model, even ignoring the branch predict issue, but I guess that’s a philosophical question. In theory (without the branch predict issue) xdcbt was manageable, although technically you had to flush the cache lines when done with dcbf to be really certain.
Super enlightening article – thanks!
But the part I didn’t get is why a talented guy like you – who even dips his feet into philosophy – works for Google.
Is this a case of money trumps everything?
I greatly enjoy working for Google. I get to work with many talented individuals and solve fascinating problems – I work best as part of a large team.
But was it really fixed even if you didn’t use any xdcbt instructions anywhere in your code? Isn’t it possible that some random data could be reinterpreted to contain xdcbt?
The checking of the Execute bit is done early enough to prevent data from being speculatively executed – good thing!
What if it jumps to a misaligned address in the instruction stream?
Misaligned addresses are not supported in the PowerPC instruction stream. All instructions are four bytes in size and four-byte aligned. Therefore you can trivially control the exact set of instructions in your program, unlike x86 where every address can be interpreted as an instruction.
Would branch hints have helped the bug? or would it still cause the bug when the branch hint was wrong?
I can’t remember how PowerPC branch hints work but if the branch hint overrode the branch predictor then it could have avoided the bug.
On the other hand, this way of avoiding the bug is like juggling flaming chainsaws while unicycling on a tight-wire over Niagara falls with sharks circling below and pterodactyls circling above. At some point you have to wonder whether the risk is actually justified.
We still have similar issues with modern hardware, such as the Xbox One, when you share memory between the CPU and the GPU. Two separate PUs with separate caches that can easily have different views of the same memory. They come with plenty of rope to hang yourself with.
We had a similar bug on Shadow of War. Memory shared between CPU and GPU, allocated by a heap allocator. Two data structures ended up straddling a single cache line. First data structure pulled the cache line into the GPU cache, the second didn’t refresh the cache because it was already there. The second data structure was unrelated to the first but it contained a GPU write address. The GPU had the wrong address and infrequently stomped over random game memory. There was, of course, no repro for the bug. I was lucky enough to have figured this out in two days. It was a good day when I did. 😛
The fix was to allocate all shared memory at cache line boundaries and pad the allocation out to the next cache line. This assured that we didn’t get random data structures coming along for the ride into the various caches. Alternatively we could have flushed the caches but that would have had a considerable performance impact.
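The fix described here – line-aligned allocations padded out to whole cache lines – can be sketched in a few lines. The 64-byte line size is taken from the comment above; the function names are illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

const size_t kLineSize = 64;  // CPU/GPU shared cache-line size in this example

// Round a size up to a whole number of cache lines so that no
// unrelated allocation ever shares a line with this one.
size_t pad_to_line(size_t size) {
    return (size + kLineSize - 1) & ~(kLineSize - 1);
}

// Line-aligned, line-padded allocation – a sketch of the described
// fix. aligned_alloc requires the size to be a multiple of the
// alignment, which pad_to_line guarantees.
void* alloc_shared(size_t size) {
    return aligned_alloc(kLineSize, pad_to_line(size));
}
```

The cost is a little internal fragmentation per allocation – far cheaper than flushing caches on every CPU/GPU handoff.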
I may be oversimplifying the issue but it seems that if the memory prefetched with xdcbt were allocated on a cache line boundary and padded to the next cache line then you could use xdcbt all day without issue.
It sounds like it would still happen. The author mentions that he fixed the prefetch not to touch cache lines unrelated to the memory it was prefetching.
Also, given the description of your problem: compute shader output to some structured buffer, or were you doing something tricky with render targets? I share your pain, fellow GCN programmer!
We would kindly ask the CP to update GPU addresses in some situations. It became a habit to never touch descriptors on the CPU after they’d started being used by the GPU. We’d either create a new descriptor with updated values (like the GPU address) and start submitting that on the CPU side, or use the GPU to update the old one in place.
I don’t remember the scenario exactly at this point. I know the first half of the 64 bytes were being used for a tiny cbuffer, probably a $Globals for one of the draw calls. It used maybe 32 bytes of the cache line. The rest of the bytes got allocated for something like a descriptor. We were making sure that the descriptors were not being modified more than once per frame and once sent to the GPU they were not touched until the GPU was done with that frame. We had not considered the case where the cache line was already in the GPU cache and not refreshed. (because the cbuffer pulled the whole 64 bytes in)
In my case there were two issues. The first was sloppy use of xdcbt – prefetching beyond the ends of the buffer, or using xdcbt on memory where coherency would matter. The second was speculative xdcbt execution.
The first issue was avoidable if you were very careful, but the second issue? Good luck.
Hmm, a somewhat unrelated topic, but based on your photo I’m now curious – can you explain why they place masks and “print” on the apparently and obviously unusable parts of the silicon wafer (closer to the edges, where even the physical edges of the chip-to-be don’t fit onto the silicon)?
I was wondering that myself. It seems like they could optimize wafer usage by shifting some dice left or right, and they could optimize for printing time by skipping the ones that are partially off the edge, but ??? I am not a hardware person and I no longer work with hardware people closely enough to be able to ask. There must be reasons why the patterns continue and stay aligned.
They need to be aligned in at least one direction, because they’re cut into pieces using diamond circular saws, which cut all the way through. I suspect the alignment in the other direction is technically more flexible, but it may be easier to produce the masks in a pure grid.
Leaving off the ones that are cut off by the edges is done sometimes; I’ve definitely seen wafers with that. I’m not sure how large the masks normally are, but I think they typically expose the whole wafer to a single large mask for each step. In that case it makes no difference to processing time whether those edge dies are there or not.
If instead they’re using a smaller but not single-die mask — say, 16 dies at once — then that in itself causes the edge dies to also be exposed.
Thanks, Jasper and Bruce – these explanations make sense! Tbh I was surprised since most of the wafers I saw looked like this – http://thumbs1.ebaystatic.com/d/l225/m/m7NfmZ0uCMZMSXacLyz2EMA.jpg but quick google search reveals other possibilities – http://www.millice.com.sg/wp-content/uploads/wafer.png . My first thought would be that such masks were initially produced for bigger wafers and used on smaller ones during test runs, but maybe this is over-complication.
Awesome story, Bruce, thanks for sharing!
Thanks Sergiy. Long time no see.
It’s nice to hear the other side of the story, since I was the game developer… Thank you Bruce.
Simon! That’s cool. So far I haven’t shared the name of the game. Should I, you?
It was a 1st party launch title on Xbox 360, Kameo: Elements of Power.
If I remember correctly, other games had completed certification but unfortunately had to go back through to make sure they didn’t have this. Our mean time to failure shot up so much that test couldn’t believe it. We were at 1.5 hours before the change, over 26 hours after.
I can’t express my gratitude enough, for what you and others did to find this, after all the nonbelievers. I’m forever in debt to you.
We also had interesting thread-locking issues, but that’s another story, and one that Jeff at Rad Tools diagnosed and helped fix.
Jeff is awesome.
Random synchronicity: I was out ice climbing with my youngest (adult) child on Saturday and that reminded them of Kameo – pure coincidence, the day before I posted this Kameo related post.
Is not having the instruction in your code enough? I don’t know the details about PPC, but can there be constant pools in the code section (ala ARM)? Is speculatively executing from data sections prevented, too? This really seems like it could blow up anytime.
That’s an interesting idea. I can’t remember the compiler details anymore but some PPC constants are most efficiently loaded from constant data, and that could have been in the code segment (for IP-relative addressing) and one of those constants could be an xdcbt pattern.
The odds of the branch predictor predicting that address would be low, but not zero. Nice.
Interesting article – I understood ~50% of it. But if I understand correctly, a wrong prediction by the branch predictor caused the problem, and the predictor was too random, not keeping a proper history. Couldn’t AI be taught to learn what to predict and what has never been predicted? Or is everything in the world too slow to interfere with internal CPU workings?
Branch predictors have to be insanely fast because they are on the critical path of the CPU. And ultimately they are *predicting* the future and they will sometimes be wrong. Whether you use AI or more sophisticated or less sophisticated methods the predictor will sometimes be wrong, and this was an example of where a single incorrect prediction could cause failure. Unacceptable.
Of course, hindsight being 20/20, it’s very easy to say this after the fact, and it’s clearly not something that came to my mind when I first learnt of branch speculation, but speculatively executing an instruction that cannot be reversed or cancelled still sounds a bit crazy. Your coworker’s example hits the nail right on the head: “int” was obvious enough not to be executed speculatively; it was sorcerer’s-apprentice-level stuff to let the cache be affected.
Reading this article completely broke down the barrier in my mind between machine code and processor internals. When we were younger we didn’t care much about how processors worked; it was magical and somewhat sacred. Well, we had a table of 6502 instruction timings, and we knew that crossing a 256-byte page could have a performance impact. We knew that the MC68000’s moveq was faster than clr.l, and that we should align words and longwords on even addresses even on the 68020 – it can load words from odd addresses, but the latter takes a double memory access. But these were “properties” and “rules” of the instructions; they were known things, and we trusted the processors, and nobody wanted to dig deeper and unveil the internal mechanisms.
But, as we know, layers leak. No exception.
Hey there, nice article. I have 1 question though.
How did putting breakpoints fix the problem? Can you please explain?
The initial problem was executing xdcbt instructions. The second problem was speculatively executing xdcbt instructions.
When you set a breakpoint you are actually replacing the instruction at that location with a breakpoint instruction. So now the CPU is speculatively executing a breakpoint instruction, which is harmless. There were no longer any xdcbt instructions to speculatively execute.
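The breakpoint test can be sketched as patching instruction words in an in-memory copy of the code segment. The xdcbt pattern below is a made-up placeholder, not the real encoding; 0x7FE00008 is the standard PowerPC unconditional trap (`trap`, i.e. `tw 31,0,0`):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// PowerPC unconditional trap: tw 31,0,0 – the natural "breakpoint"
// instruction to patch in.
const uint32_t kTrap = 0x7FE00008;

// Replace every word matching 'pattern' with a trap – the test Tracy
// suggested, applied to an in-memory copy of the code segment. This
// works because every PowerPC instruction is one aligned 32-bit word.
int replace_with_breakpoints(std::vector<uint32_t>& code, uint32_t pattern) {
    int replaced = 0;
    for (uint32_t& word : code)
        if (word == pattern) { word = kTrap; ++replaced; }
    return replaced;
}
```

If the traps never fire but the crashes stop, the only remaining explanation is speculative execution of the original instructions.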
Oh yeah, I get it now. Thanks Bruce!
Very nice article, Bruce!
The PowerPC pipeline was brutal, for sure. I avoided the xdcbt like plague, cache coherency was too scary, and IIRC the savings in term of cycles were not huge. Fortunately, dcbt and dcbz to the rescue!
Funny how HW can have issues like that. I remember in the early PS3 SPU days we got a rare random crash with DMAs. By design, due to pipelining, an SPU would always execute the instruction after a branch (two instructions in a row executed). Except in some cases, if the branch was mispredicted and the instruction just after the branch happened to store a DMA command, you had a potential for failures much later, at DMA completion. It took us days to figure out a reliable repro so we could narrow down the random SPU crashes. Sony ended up changing the compiler so it would only inject a nop after the branch instruction.
So wait, the SPUs would always execute the instruction after a branch, except sometimes they wouldn’t? That’s pretty nasty. Funny how hardware bugs end up neutering cool features.
One correction for any readers – it’s not the “PowerPC” pipeline that was so horrible, it was the particular implementation of the PowerPC core used in the Xbox 360 and PS3. I read “The Race for a New Game Machine: Creating the Chips Inside the XBox 360 and the Playstation 3” and I was disappointed that they never acknowledged that the CPU that they labored over was basically a clunker – horrible instructions-per-cycle, lower frequency than hoped, and power hungry. They bet on high frequencies and lost (but Sony and Microsoft shipped the consoles anyway)
“I used the ancient debugging technique of staring at my screen with a blank mind, let the CPU pipelines fill my subconscious” – also known as the Feynman Algorithm.
> Trivia: Core 0 was closer to the L2 cache and had measurably lower L2 latencies.
I remember this for some reason! Do you remember approximately *how much* the latencies varied?
I think it was 2-4 cycles extra latency or something like that. I love that you can *see* the extra path length (and therefore latencies) on the wafer I still have hanging on my wall
I stumbled on this post while looking for a used xbox360 console. Fascinating article, it’s above my head on the computer science side of things but very interesting nonetheless.