Forgetfulness in Computers

We count on computer hardware to be 100% reliable, and we tend to blame software for crashes. This is usually the correct thing to do, but not always.

I just read a Wired article on errors in memory chips, and the 2012 paper it references, which discusses hard versus soft errors. The summary is that memory chips sometimes forget things, and a lot of these errors are ‘hard’ errors that are caused by flaws in the chips. One strategy that the authors recommend is ‘page retirement’ – not using memory pages that have a history of failures.

This reminds me of an experience I had a long time ago with a bad memory chip.

Twenty-five years ago I was working on the Commodore Amiga. Since I had lots (2.5 MiB!!!) of memory I set up a RAM disk (that would survive reboots) to make my compiles faster. This all worked fine until my RAM disk started having read errors. That’s not supposed to happen. When these errors continued after a reboot I investigated. I wrote a simple test program to scan all memory – writing and reading various values to ensure that they would stick. I quickly found that one particular bit at one particular address was faulty.
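A minimal sketch of that kind of scan, written in Python and run against a simulated memory rather than real hardware. The `FlakyMemory` class, the test patterns, and the stuck-at-zero fault are all assumptions made for illustration; the original test program is not shown in this post.

```python
class FlakyMemory:
    """Simulated RAM with one bit that is stuck at zero (hypothetical fault model)."""

    def __init__(self, size, bad_addr, bad_bit):
        self._cells = bytearray(size)
        self._bad_addr = bad_addr
        self._bad_mask = 1 << bad_bit

    def write(self, addr, value):
        if addr == self._bad_addr:
            value &= ~self._bad_mask & 0xFF  # the faulty bit never holds a 1
        self._cells[addr] = value

    def read(self, addr):
        return self._cells[addr]

    def __len__(self):
        return len(self._cells)


# All zeros, all ones, and two alternating patterns, so that every bit
# is asked to hold both a 0 and a 1 at some point.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def scan(memory):
    """Return a sorted list of (address, bit) pairs that failed to hold a value."""
    faults = set()
    for pattern in PATTERNS:
        # Write the whole range first, then read it all back, so each value
        # has to survive in memory between the write and the read.
        for addr in range(len(memory)):
            memory.write(addr, pattern)
        for addr in range(len(memory)):
            diff = memory.read(addr) ^ pattern
            for bit in range(8):
                if diff & (1 << bit):
                    faults.add((addr, bit))
    return sorted(faults)


if __name__ == "__main__":
    ram = FlakyMemory(size=4096, bad_addr=1234, bad_bit=3)
    print(scan(ram))  # [(1234, 3)]
```

A real memory tester also has to worry about caches, about not overwriting its own code and data, and about faults that only show up under particular access patterns. None of that is modeled here.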

The Amiga didn’t have virtual memory, so the faulty physical address was always seen by the processor at the same address. How convenient. So, I wrote a tiny program that allocated that address when my computer started up, to prevent the RAM disk or other software from using it, and I got back to work, now with just 2.499999 MiB of memory. Simple. And effective.
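The same idea, generalized to the page retirement strategy mentioned earlier, can be sketched as a toy allocator that refuses to hand out any page containing a known-bad address. `PagePool`, `PAGE_SIZE`, and the method names are hypothetical, and retiring a whole page is coarser than reserving only the few bytes around the bad address as described above.

```python
PAGE_SIZE = 4096  # assumed page size, chosen only for the illustration


class PagePool:
    """Toy allocator that hands out page numbers and skips retired ones."""

    def __init__(self, total_pages):
        self._free = set(range(total_pages))
        self._retired = set()

    def retire_address(self, addr):
        """Permanently withdraw the page containing a known-bad address."""
        page = addr // PAGE_SIZE
        self._free.discard(page)
        self._retired.add(page)
        return page

    def allocate(self):
        """Return the lowest free page number, never a retired one."""
        if not self._free:
            raise MemoryError("no free pages")
        page = min(self._free)
        self._free.remove(page)
        return page

    def free(self, page):
        """Give a page back to the pool, unless it has been retired."""
        if page in self._retired:
            raise ValueError("page is retired")
        self._free.add(page)


if __name__ == "__main__":
    pool = PagePool(total_pages=4)
    pool.retire_address(5000)  # address 5000 lives in page 1
    print([pool.allocate() for _ in range(3)])  # [0, 2, 3]
```

On a machine with virtual memory this bookkeeping has to live in the operating system, because an application no longer sees stable physical addresses.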

The laptop I use today has 8 GiB of RAM, which is more than 64 billion bits, with a transistor and capacitor per bit. That’s a lot of really tiny components, and I hope they are all working more reliably than the far smaller memory from so long ago.
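For the curious, the arithmetic behind those numbers is short enough to check directly. The helper name `bits_in` is just for illustration.

```python
MIB = 2**20
GIB = 2**30


def bits_in(num_bytes):
    """Number of bits in the given number of bytes."""
    return num_bytes * 8


laptop_bits = bits_in(8 * GIB)
amiga_bits = bits_in(int(2.5 * MIB))

print(f"{laptop_bits:,}")  # 68,719,476,736
print(f"{amiga_bits:,}")   # 20,971,520
```

That is roughly 68.7 billion bits today against about 21 million on the Amiga.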

I wonder when consumer machines will have to come with ECC memory to detect and protect against these failures?

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8.

7 Responses to Forgetfulness in Computers

  1. Paul says:

    I’m in a position where I work with a program which downloads, processes, and writes large amounts of data. I’ve seen many reports of data-corruption, but very nearly every one of these cases I’ve ever personally examined showed a single-bit error written out to disk — consistent with RAM errors (confirmed in many cases by a memory-diagnostic).

    Are there any good numbers out there on rates of RAM/HDD errors? I do plenty of consistency-checking on the data I’m working with, though I feel I’ll never be free of the paranoia that there’s some insidious bug in my code. x_x

  2. Martin D says:

    It’s certainly worth keeping in mind that memory isn’t perfect. I read a pretty interesting article about this recently, this guy did some fairly detailed debugging on what was possibly a cosmic ray flipping one of his bits.

  3. Was it just the IBM PCs that had parity RAM back in the olden times?
    I remember being astonished when reading that the new IBM 5150 PC would waste 11% of its RAM doing a parity check. That was 2K of RAM wasted in the bottom-of-the-range 16K PC. That was twice the memory of a ZX81.
    Then kilobytes of memory with parity, now gigabytes of memory with no parity.
    Tell me there’s some sense to this?

    • brucedawson says:

      The lack of some sort of error checking and correction these days does seem very odd. I’m not sure that parity is worthwhile — crashing your process or system because of a parity error is not necessarily an improvement (sometimes it avoids data corruption, sometimes it makes no difference), but ECC memory would be great. My unsubstantiated guess is that ECC memory would be quite cheap if it was put in a significant percentage of consumer machines.

      • Billco says:

        ECC memory _is_ cheap, at least the unbuffered kind. The problem is the performance hit of having to read and write the extra parity bit for each byte. In theory, this results in 12.5% less net memory bandwidth. In practice, the hit is a little worse as it also incurs some latency to perform the parity calculation.

        One feature I am quite fond of is memory mirroring, something you might find on high-end server boards. It is essentially RAID-1 at the module level, so if you have 48gb installed, you would see 24gb usable and the other half is set aside as a mirrored copy – all handled at the chipset level. In some cases, you can even double the redundancy. With RAM being so darned cheap, I could see this becoming a viable option for desktop systems. These days, a 4gb stick can be had for about $40 or less. Few users will need more than 8gb, at least for the next 2-3 years, so why not fill up those slots and enjoy some peace of mind ? It’s the sort of thing Intel and AMD could bake into all their CPUs at minimal cost.

        • brucedawson says:

          > Few users will need more than 8gb

          ‘Need’ is rather ambiguous. What is often underappreciated is that almost all users will actually benefit from more memory. If you leave your computer running for long periods of time then the memory will be used for disk caching, which can significantly reduce delays. There are diminishing returns, but I’m not sure we’re at the level where using half of your memory for reliability is a good move.

          Also, memory mirroring is a very expensive way to get minimal benefits. If the two banks disagree then you don’t know which one is correct. It’s little better than parity memory.

          ECC allows *correcting* of errors, and ECC shouldn’t have a significant performance cost. Instead of having, say, 64 memory lines, you instead have 72. The extra lines add cost, but shouldn’t harm performance, and I’m sure that the ECC calculation itself is relatively cheap.

          Wikipedia estimates 2-3 percent performance cost.
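The difference between parity and ECC that comes up in the comments above can be illustrated with a small sketch. It uses a textbook Hamming(7,4) code, which protects 4 data bits with 3 check bits. ECC memory is commonly described as using a wider code of the same family (64 data bits plus 8 check bits), so this is a scaled-down illustration and not a description of any particular memory controller. The function names are hypothetical.

```python
def parity_bit(bits):
    """Even-parity bit: 1 if the number of set bits is odd."""
    return sum(bits) % 2


def hamming_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out as positions 1..7."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # checks positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming_decode(code):
    """Return (data bits, position of the corrected bit, or 0 if none)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # equals the 1-based position of a single flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    data = [1, 0, 1, 1]

    # Parity: a single flipped bit is detected, but not located.
    stored = data + [parity_bit(data)]
    stored[2] ^= 1
    print(parity_bit(stored) != 0)  # True

    # Hamming: a single flipped bit is located and repaired.
    code = hamming_encode(data)
    code[4] ^= 1  # flip position 5
    print(hamming_decode(code))  # ([1, 0, 1, 1], 5)
```

Plain Hamming(7,4) will miscorrect if two bits flip. Codes that also detect double-bit errors add one more overall parity bit, which is not shown here.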
