We count on computer hardware to be 100% reliable, so we tend to blame software for crashes. Blaming software is usually correct, but not always.
I just read a Wired article on errors in memory chips, and the 2012 paper it references that discusses hard versus soft errors. The summary is that memory chips sometimes forget things, and a lot of these errors are ‘hard’ errors caused by physical flaws in the chips. One strategy the paper recommends is ‘page retirement’ – not using memory pages that have a history of failures.
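Page retirement amounts to a blacklist in the physical-page allocator. Here is a minimal sketch of the idea, assuming a trivially simple allocator; the names and structure are illustrative, not taken from the paper:

```c
/* Minimal sketch of page retirement: frames that have shown errors are
 * marked retired and never handed out again. Illustrative only. */
#include <stdbool.h>
#include <stddef.h>

#define NUM_FRAMES 4096

static bool frame_retired[NUM_FRAMES]; /* set when a frame shows errors */
static bool frame_in_use[NUM_FRAMES];

/* Call this when a frame has a history of failures,
 * e.g. repeated ECC corrections at the same address. */
void retire_frame(size_t frame)
{
    frame_retired[frame] = true;
}

/* Returns a usable frame number, or (size_t)-1 if none are free. */
size_t alloc_frame(void)
{
    for (size_t f = 0; f < NUM_FRAMES; ++f) {
        if (!frame_in_use[f] && !frame_retired[f]) {
            frame_in_use[f] = true;
            return f;
        }
    }
    return (size_t)-1;
}
```

Once a frame is retired it never circulates again, so a single flaky bit costs at most one page of capacity.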
This reminds me of an experience I had a long time ago with a bad memory chip.
Twenty-five years ago I was working on the Commodore Amiga. Since I had lots (2.5 MiB!!!) of memory, I set up a RAM disk (one that would survive reboots) to make my compiles faster. This all worked fine until my RAM disk started having read errors. That’s not supposed to happen. When the errors persisted across a reboot, I investigated. I wrote a simple test program that scanned all of memory, writing and reading various values to check that they stuck. I quickly found that one particular bit at one particular address was faulty.
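My original program is long gone, but the idea is simple. Here is a minimal sketch of that kind of scanner in modern C, testing an allocated buffer rather than raw physical addresses (which the unprotected Amiga would have allowed):

```c
/* Sketch of a simple memory test: write a few patterns to a range of
 * memory and read them back, reporting any bit that fails to hold its
 * value. Illustrative, not the original Amiga program. */
#include <stdio.h>
#include <stdlib.h>

static void scan(volatile unsigned char *mem, size_t len)
{
    static const unsigned char patterns[] = { 0x00, 0xFF, 0x55, 0xAA };
    for (size_t p = 0; p < sizeof(patterns); ++p) {
        unsigned char pat = patterns[p];
        for (size_t i = 0; i < len; ++i)
            mem[i] = pat;
        for (size_t i = 0; i < len; ++i) {
            unsigned char got = mem[i];
            if (got != pat)
                printf("Fault at %p: wrote 0x%02X, read 0x%02X (bad bits 0x%02X)\n",
                       (void *)&mem[i], pat, got, (unsigned)(pat ^ got));
        }
    }
}

int main(void)
{
    size_t len = 1 << 20; /* test 1 MiB at a time */
    unsigned char *buf = malloc(len);
    if (!buf)
        return 1;
    scan(buf, len);
    free(buf);
    return 0;
}
```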
The Amiga didn’t have virtual memory, so the faulty physical address always appeared to the processor at the same location. How convenient. So, I wrote a tiny program that allocated that address when my computer started up, to prevent the RAM disk or other software from using it, and I got back to work, now with just 2.499999 MiB of memory. Simple. And effective.
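I don’t remember the exact mechanism I used, but AmigaOS’s exec library offers AllocAbs(), which allocates memory at a specified absolute address, so a hypothetical sketch would look something like this (the address and size here are placeholders, not the real faulty address):

```c
/* Sketch: reserve a known-bad address at startup so no other
 * allocation can land on it. AmigaOS-specific; uses exec's AllocAbs()
 * to allocate at an absolute address. */
#include <exec/types.h>
#include <proto/exec.h>
#include <stdio.h>

#define BAD_ADDRESS  ((APTR)0x00234560UL) /* placeholder faulty address */
#define RESERVE_SIZE 8UL                  /* small block covering the bad bit */

int main(void)
{
    APTR block = AllocAbs(RESERVE_SIZE, BAD_ADDRESS);
    if (block == NULL) {
        printf("Could not reserve the bad address - already in use?\n");
        return 1;
    }
    /* Deliberately never freed: the whole point is to keep this
     * address out of circulation for the rest of the session. */
    printf("Bad address reserved.\n");
    return 0;
}
```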
The laptop I use today has 8 GiB of RAM, which is more than 68 billion bits (8 × 2^30 bytes × 8 bits per byte), with a transistor and capacitor per bit. That’s a lot of really tiny components, and I hope they are all working more reliably than the far smaller memory from so long ago.
I wonder when consumer machines will have to come with ECC memory to detect and protect against these failures.