It’s 2016 and Windows still displays drive and file sizes using base-2 size prefixes. My 1 TB SSD is shown as 916 GB, and a 449 million byte video file is shown as 428 MB. That is, Windows still insists that “MB” means 2^20 and “GB” means 2^30, even when dealing with non-technical customers.
- This makes no sense.
- Just because some parts of computers are base 2 doesn’t mean all parts are base 2.
- And, actually, most of the visible parts of computers are base-10.
As a concrete example of why this matters, imagine that I have one GB free on my disk and I want to know how many 20 MB files will fit. With base-10 the answer is trivial: 50. With base-2 we find that we can fit 51 20 MiB files into one GiB. Even worse is that if you have a bunch of 20,000,000 byte files then you can fit 53 of them into a GiB. That sort of nonsense should not be exposed to consumers. It should be kept for the computer nerds who need to know, and even then it should be saved for those places where base-2 makes sense.
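The three answers above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the constant names and the `files_that_fit` helper are made up for illustration.

```python
# Base-10 and base-2 unit sizes, in bytes.
GB = 10**9
MB = 10**6
GiB = 2**30
MiB = 2**20

def files_that_fit(free_bytes: int, file_bytes: int) -> int:
    """Number of whole files of file_bytes that fit in free_bytes."""
    return free_bytes // file_bytes

print(files_that_fit(GB, 20 * MB))       # 50: base 10 throughout
print(files_that_fit(GiB, 20 * MiB))     # 51: base 2 throughout
print(files_that_fit(GiB, 20_000_000))   # 53: base-10 files, base-2 free space
```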
As another example, let’s say that I’m releasing a new version of UIforETW and I want to see whether the new package is bigger, smaller, or the same size as the old one. With base-10 prefixes this is easy, even if I have to compare KB to MB. But with base-2 prefixes I have to apply a scaling factor of 1.024 or 1.048576 before I can do the comparisons. Is 38,616 KB bigger, smaller, or the same size as 37.7 MB? Can you tell without a calculator?
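For anyone who did reach for a calculator, here is the same comparison as a small Python sketch. The `kib_to_mib` helper is hypothetical, and the two sizes are simply the ones quoted above.

```python
def kib_to_mib(kib: float) -> float:
    """Convert base-2 'KB' (KiB) to base-2 'MB' (MiB)."""
    return kib / 1024

old_size_kib = 38_616   # shown as "38,616 KB"
new_size_mib = 37.7     # shown as "37.7 MB"

# With base-2 prefixes a division by 1024 is needed before comparing.
print(kib_to_mib(old_size_kib))                            # 37.7109375
print(round(kib_to_mib(old_size_kib), 1) == new_size_mib)  # True

# With base-10 prefixes the conversion is just a decimal-point shift.
print(old_size_kib / 1000)                                 # 38.616
```

To the precision displayed, the two sizes are the same, which is not at all obvious from looking at them.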
So just stop it. Base 2 prefixes should only be used when there is a compelling advantage for the typical user, and for file and drive sizes in Windows Explorer there are no such advantages. If you think I’m wrong (and I know that lots of people do) then be sure to explain exactly why base-2 size prefixes make sense in the context of file and drive sizes.
My specific use case is that I often end up seeing file sizes in exact bytes – from the “dir” command or from other sources. Eight digit numbers are inconvenient so I want to convert these numbers to MB before sharing them. Dividing by 1,048,576 is much more difficult than dividing by 1,000,000 and I see zero advantage to doing the more complicated division. But, if I do the simple/obvious division then I get different answers from Windows Explorer. Hence this rant.
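A rough sketch of the two conversions, using the video file from the opening paragraph and assuming it is exactly 449,000,000 bytes (the real file size is not given). Both function names are invented for this example.

```python
def to_decimal_mb(num_bytes: int) -> str:
    """Format a byte count using base-10 megabytes (1 MB = 1,000,000 bytes)."""
    return f"{num_bytes / 1_000_000:.1f} MB"

def to_binary_mib(num_bytes: int) -> str:
    """Format a byte count using base-2 mebibytes (1 MiB = 1,048,576 bytes)."""
    return f"{num_bytes / 1_048_576:.1f} MiB"

size = 449_000_000  # assumed exact size, for illustration only

print(to_decimal_mb(size))  # 449.0 MB, doable in your head
print(to_binary_mib(size))  # 428.2 MiB, what Windows Explorer labels "MB"
```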
In this article I am going to use words like thousands, millions, billions, and trillions when talking about base-10; KiB, MiB, GiB, and TiB when talking about base 2; and kB, MB, GB, and TB when quoting other people’s usage. I hope that this will always make it clear what I’m trying to say.
Things that are naturally base 2
Let me say up front that memory sizes, address-space sizes, virtual memory page sizes, cache sizes, register sizes, sector sizes, cluster sizes, and probably a few other things that I’ve forgotten about are naturally base 2. Cool. So, when talking about these things you should use base-2 based size prefixes.
However, the only one of these that is ever exposed to a consumer is memory size. A computer might have 8 GiB of RAM and describing that as 8.59 billion bytes is just cumbersome. So go for it and use base-2 prefixes for memory. And, if you want to tell consumers about page sizes and sector sizes then feel free to use base-2 prefixes – but really, why would a consumer care?
Amusingly enough, some Dell brochures have a blanket disclaimer that “GB refers to one billion bytes” and they carefully footnote this even on their memory sizes. This means that when Dell sells you an 8 GB computer they are technically only promising you 7.45 GiB. That’s just weird. It means that they are lying about how much memory their computers contain, but in the wrong direction!
Base 2 prefixes make sense for memory capacity because memory chips have a power-of-two capacity. Base 2 prefixes make sense for address space because n bits can identify 2^n different addresses. Page sizes are base 2 because it allows for easy bit masking to select the page number and the address within the page. Bit masking is, in fact, one of the main advantages of base 2. So yeah, base 2 has its place.
But its place is not everywhere.
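For the one place where base 2 clearly earns its keep, here is a minimal sketch of the page-masking trick mentioned above. It assumes a 4 KiB page size, which is common but is my assumption rather than something stated in this article, and the names are illustrative.

```python
PAGE_SHIFT = 12                 # assumed 4 KiB pages: 2**12 bytes
PAGE_SIZE = 1 << PAGE_SHIFT     # 4096
PAGE_MASK = PAGE_SIZE - 1       # 0xFFF, the low 12 bits

def split_address(addr: int) -> tuple[int, int]:
    """Split an address into (page number, offset within the page)."""
    return addr >> PAGE_SHIFT, addr & PAGE_MASK

page, offset = split_address(0x12345678)
print(hex(page), hex(offset))   # 0x12345 0x678
```

With a page size of, say, 4,000 bytes, the shift and mask would have to become a division and a remainder.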
Things that come in base-10 sizes
The list of things that are best represented by base 10 includes CPU frequencies, Ethernet speeds, hard drive sizes, and flash drive sizes. One GHz is actually one billion Hz, Gigabit Ethernet runs at one billion bits per second, one TB drives are actually one trillion bytes, and a 32 GB flash drive is actually 32 billion bytes.
Some of these may seem surprising, but the question to ask yourself is “why should (technology x) use base 2?” If there is no compelling reason to use base 2 then using base 10 is the appropriate choice because it then matches the number system that human beings use. Base 10 should be the default, and base 2 should only be used when there is a compelling reason, such as for memory related technologies. Because base 10 is the default, the designers of oscillating crystals, Ethernet, hard drives, and flash drives have sensibly used base 10.
There are some interesting implications from frequencies being base 10, and memories being base 2. If you have 4 GiB of RAM and a bus that can read 256 billion bytes of memory per second then you might think that you could read all of memory 64 times per second, right? But you can’t, because the frequency is base 10 and the memory size is base 2, which adds a 7.4% mismatch. Because 4 GiB is actually 4.29 billion bytes this bus can only read all memory about 60 times per second.
Yes, there is also usually overhead for memory refresh cycles and what-not, which means that the actual number of read-all-memory passes per second is even lower. My point is that in addition to allowing for that overhead you also need to adjust for GB versus GiB.
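The numbers in that example, ignoring refresh and any other overhead, work out as in this short sketch (the helper name is invented for illustration):

```python
GIB = 2**30

def full_passes_per_second(bandwidth_bytes_per_s: int, memory_bytes: int) -> float:
    """How many times per second the bus could read all of memory."""
    return bandwidth_bytes_per_s / memory_bytes

bandwidth = 256 * 10**9   # 256 billion bytes per second (base 10)
memory = 4 * GIB          # 4 GiB of RAM (base 2)

print(256 // 4)                                             # 64, the naive answer
print(f"{full_passes_per_second(bandwidth, memory):.1f}")   # 59.6
print(f"{(GIB / 10**9 - 1) * 100:.1f}%")                    # 7.4%, the GiB/GB mismatch
```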
In fact, one of the things that sparked this article was a press-release talking about memory chips that had 256 GB/s of bandwidth. The article then breathlessly pointed out that four of these chips would have 1 TB/s of bandwidth. This is almost certainly wrong. The chips probably have 256 billion bytes per second of bandwidth, so four of them would have 1.024 trillion B/s of bandwidth – neither 1.0 trillion B/s nor 1.0 TiB/s. A minor error, but it amused me.
Wait a minute, flash memory is base 10?
A lot of geeks are surprised when they find out that the capacity of flash memory drives is measured with base-10 units. Given that thumb drives are always 8, 16, 32, or 64 GB it seems reasonable to assume that the “GB” refers to GiB. But it doesn’t. Grab a few flash drives and take a look at their capacity. I just looked at the “32 GB” SD card for my camera and its capacity is 31.91 billion bytes. If flash drives were using base 2 prefixes then that should be 34.36 billion bytes – it’s not even close.
But those should be base 2!!!
Really? Why ‘should’ some of these technologies be based on base-2? There is clearly no reason for frequencies to be base 2, so they aren’t.
Hard drive capacity is the product of sector size (base 2) times sectors/track times tracks/platter times number of platters. Constraining those last three numbers to be powers of two would be ridiculous. One small power of two doesn’t make the whole package a power of two. And, since the capacities aren’t powers of two, there is no good reason to clumsily represent the capacities with base-2 prefixes. Describing 320 billion bytes as 298 GiB doesn’t help anything.
One could argue that hard drive manufacturers use base 10 because it makes their drives look bigger, and I’m sure they don’t mind that aspect of it. But, base 10 being financially convenient isn’t enough to justify the claims of a vast hard drive conspiracy. The hard drive manufacturers are simply using the most convenient and standard units because there is no compelling reason to do otherwise.
Flash drives are more surprising because the underlying chips have power of two raw capacity. But flash drive manufacturers necessarily overprovision in order to leave space for wear leveling, spare sectors, etc. Flash memory already does complex remapping of ‘addresses’ so constraining themselves to power of two capacities would have no benefits. The reason why flash drives normally have sizes like 8, 16, 32, or 64 GB is probably because the 7.4% to 10% overhead that this provides is conveniently close to what they need. If the amount of spare capacity changes then flash drives could end up being sold with 120 GB or 130 GB capacities.
Does it matter?
You could reasonably say that it doesn’t matter if we display base 2 units to the user because they don’t care. That’s a terrible argument because if they don’t care we shouldn’t display any numbers at all. If we are going to display numbers to the user then they should be base 10 unless there is a compelling argument for base 2. For file sizes and disk sizes there is no compelling argument – and this is something that OS X does right.
Do you really want to tell your parents that a 1 TB drive is more than twice as big as a 500 GB drive? Or that a 1,010 GB drive is smaller than a 1 TB drive? This is the sort of madness that base-2 causes, for no good reason. The mixing of base-2 and base-10 is even worse, because you can’t even come close to fitting 320 files that Windows says are 1 GB onto a drive that you purchased as 320 GB – you won’t even fit 300.
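Each of those claims is a one-line calculation. A quick sketch, with a hypothetical helper for the mixed-unit case:

```python
GB = 10**9    # what the drive box means by "GB"
GiB = 2**30   # what Windows means by "GB"

def windows_gb_files_that_fit(drive_decimal_gb: int) -> int:
    """How many 1 GiB files fit on a drive sold as drive_decimal_gb base-10 GB."""
    return (drive_decimal_gb * GB) // GiB

print(windows_gb_files_that_fit(320))  # 298, not 320 and not even 300

# In pure base 2, 1 TiB is 1024 GiB:
print(1024 / 500)    # 2.048, so "1 TB" is more than twice "500 GB"
print(1010 < 1024)   # True, so "1,010 GB" is smaller than "1 TB"
```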
Do you really enjoy explaining to your friends and relatives why Windows is telling them that their brand new hard drive is smaller than the size listed on the box?
But what about computer nerds, surely they should use base 2 for everything, shouldn’t they? No – only when it makes sense. Using base 2 for anything except memory and bit masks leads to ambiguity and to errors. If you use the wrong unit then you will add 2.4%, 4.9%, 7.4% or 10% error (for KiB, MiB, GiB, and TiB). There are probably many calculations of disk or memory bandwidth that have been off because of the MiB/million discrepancy, and the errors only get worse as disks and frequencies get larger.
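Those four percentages are just the ratio of each base-2 prefix to its base-10 namesake. A small sketch (the function name is made up for illustration):

```python
def prefix_error_percent(power: int) -> float:
    """Percent by which the base-2 prefix exceeds the base-10 one.

    power is 1 for kilo/kibi, 2 for mega/mebi, 3 for giga/gibi, 4 for tera/tebi.
    """
    return (2 ** (10 * power) / 10 ** (3 * power) - 1) * 100

for name, power in [("KiB", 1), ("MiB", 2), ("GiB", 3), ("TiB", 4)]:
    print(f"{name}: {prefix_error_percent(power):.1f}%")
# KiB: 2.4%
# MiB: 4.9%
# GiB: 7.4%
# TiB: 10.0%
```

Each step up multiplies in another factor of 1.024, which is why the gap keeps growing with the size of the unit.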
As an example of pointless base-2 consider this Event Viewer screenshot (from this tweet):
The same size is described in bytes, KiB, and MiB. This leads to having leading digits of 64, 65, and 67. The mismatched units and the base-2 units serve no purpose other than to hide the fact that the various sizes are actually not the same (the size in bytes is 4 KiB larger than the other two). Converting between different units shouldn’t require tricky math, unless justified by critical underlying physical realities (such as cache sizes).
I used to work at Microsoft so I know something about how they think and I’m sure that the main reason they still use base-2 units in Windows Explorer is simply because that is what they have always done. Fear of breaking something, somewhere, will probably keep them on base-2 prefixes forever. But I want to do my part to convince developers to not repeat Microsoft’s mistake.
If you’re going to show sizes using base 2 then I recommend that you acknowledge the nerdiness of this in the most honest way possible – use hexadecimal. Or cut off your thumbs and we’ll switch the whole world to octal.