It’s 2016 and Windows still displays drive and file sizes using base-2 size prefixes. My 1 TB SSD is shown as 931 GB, and a 449 million byte video file is shown as 428 MB. That is, Windows still insists that “MB” means 2^20 and “GB” means 2^30, even when dealing with non-technical customers.
This makes no sense. Just because some parts of computers are base 2 doesn’t mean all parts are base 2. And, actually, most of the visible parts of computers are base 10.
As a concrete example of why this matters, imagine that I have one GB free on my disk and I want to know how many 20 MB files will fit. With base-10 the answer is trivial: 50. With base-2 we find that we can fit fifty-one 20 MiB files into one GiB. Even worse is that if you have a bunch of 20,000,000 byte files then you can fit 53 of them into a GiB. That sort of nonsense should not be exposed to consumers. It should be kept for the computer nerds who need to know, and even then it should be saved for those places where base-2 makes sense.
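The packing arithmetic above is easy to check; here’s a quick Python sketch using the sizes from the example:

```python
GB = 1000**3    # decimal gigabyte
GiB = 1024**3   # binary gibibyte
MB = 1000**2
MiB = 1024**2

files_base10 = GB // (20 * MB)      # base 10: trivially 50
files_base2 = GiB // (20 * MiB)     # base 2: 1024 MiB / 20 MiB -> 51, plus a remainder
files_mixed = GiB // 20_000_000     # mixed bases: 53 files fit, plus a remainder

print(files_base10, files_base2, files_mixed)  # 50 51 53
```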
As another example, let’s say that I’m releasing a new version of UIforETW and I want to see whether the new package is bigger, smaller, or the same size as the old one. With base-10 prefixes this is easy, even if I have to compare KB to MB. But with base-2 prefixes I have to apply a scaling factor of 1.024 or 1.048576 before I can do the comparisons. Is 38,616 KB bigger, smaller, or the same size as 37.7 MB? Can you tell without a calculator?
Here’s an example from Windows Explorer. This file is described by the main Explorer window as being 38,617 KB. But the properties window describes it as being 37.7 MB and as being 39,542,912 bytes. So, 37.7, 38.6, or 39.5 MB? Which is it? This inconsistency appears as a bug to anybody who isn’t intimately familiar with the 2.4% discrepancy which accumulates with each new prefix.
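All three of Explorer’s numbers derive from the same byte count. A quick sketch of the conversions (assuming Explorer rounds its KB column up, which matches the 38,617 figure):

```python
import math

size_bytes = 39_542_912

kb_column = math.ceil(size_bytes / 1024)  # Explorer's "KB" column: 38,617
mib = size_bytes / 1024**2                # properties dialog "MB": 37.7
mb = size_bytes / 1000**2                 # the decimal MB a human expects: 39.5

print(f"{kb_column:,} KB, {mib:.1f} MiB, {mb:.1f} MB")
```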
So just stop it. Base 2 prefixes should only be used when there is a compelling advantage for the typical user, and for file and drive sizes in Windows Explorer there are no such advantages. If you think I’m wrong (and I know that lots of people do) then be sure to explain exactly why base-2 size prefixes make sense in the context of file and drive sizes.
My specific use case is that I often end up seeing file sizes in exact bytes – from the “dir” command or from other sources. Eight digit numbers are inconvenient so I want to convert these numbers to MB before sharing them. Dividing by 1,048,576 is much more difficult than dividing by 1,000,000 and I see zero advantage to doing the more complicated division. But, if I do the simple/obvious division then I get different answers from Windows Explorer. Hence this rant.
In this article I am going to use words like thousands, millions, billions, and trillions when talking about base-10, KiB, MiB, GiB, and TiB when talking about base 2, and kB, MB, GB, and TB when quoting other people’s usage. I hope that this will always make it clear what I’m trying to say.
Things that are naturally base 2
Let me say up front that memory sizes, address-space sizes, virtual memory page sizes, cache sizes, register ranges, sector sizes, cluster sizes, and probably a few other things that I’ve forgotten about are naturally base 2. Cool. So, when talking about these things you should use base-2 based size prefixes.
However, the only one of these that is ever exposed to a consumer is memory size. A computer might have 8 GiB of RAM and describing that as 8.59 billion bytes is just cumbersome. So go for it and use base-2 prefixes for memory. And, if you want to tell consumers about page sizes and sector sizes and cache sizes then feel free to use base-2 prefixes – but really, why would a consumer care?
Amusingly enough, some Dell brochures have a blanket disclaimer that “GB refers to one billion bytes” and they carefully footnote this even on their memory sizes. This means that when Dell sells you an 8 GB computer they are technically only promising you 7.45 GiB. That’s just weird. It means that they are lying about how much memory their computers contain, but in the wrong direction!
Base 2 prefixes make sense for memory capacity because memory chips have a power-of-two capacity. Base 2 prefixes make sense for address space because n bits can identify 2^n different addresses. Page sizes are base 2 because it allows for easy bit masking to select the page number and the address within the page. Bit masking is, in fact, one of the main advantages of base 2. So yeah, base 2 has its place.
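A minimal sketch of that bit-masking advantage, assuming the common 4 KiB page size: because the page size is a power of two, splitting an address into page number and offset is just a shift and a mask, with no division needed.

```python
PAGE_SHIFT = 12                     # 2**12 == 4096, a 4 KiB page
PAGE_MASK = (1 << PAGE_SHIFT) - 1   # 0xFFF, the offset-within-page bits

address = 0x00402ABC
page_number = address >> PAGE_SHIFT   # 0x402
offset = address & PAGE_MASK          # 0xABC

# Reassembling them recovers the original address exactly.
assert (page_number << PAGE_SHIFT) | offset == address
```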
But its place is not everywhere.
Things that come in base-10 sizes
The list of things that are best represented by base 10 includes CPU frequencies, Ethernet speeds, hard drive sizes, and flash drive sizes. One GHz is actually one billion Hz, Gigabit Ethernet runs at one billion bits per second, one TB drives are actually one trillion bytes, and a 32 GB flash drive is actually 32 billion bytes.
Some of these may seem surprising, but the question to ask yourself is “why should (technology x) use base 2?” If there is no compelling reason to use base 2 then using base 10 is the appropriate choice because it then matches the number system that human beings use. Base 10 should be the default, and base 2 should only be used when there is a compelling reason, such as for memory related technologies. Because base 10 is the default, the designers of oscillating crystals, Ethernet, hard drives, and flash drives have sensibly used base 10.
There are some interesting implications from frequencies being base 10, and memories being base 2. If you have 4 GiB of RAM and a bus that can read 256 billion bytes of memory per second then you might think that you could read all of memory 64 times per second, right? But you can’t, because the frequency is base 10 and the memory size is base 2, which adds a 7.4% mismatch. Because 4 GiB is actually 4.29 billion bytes this bus can only read all memory about 60 times per second.
Yes, there is also usually overhead for memory refresh cycles and what-not which means that the actual read-all-memory passes per second is even lower. My point is that in addition to allowing for that overhead you also need to adjust for GB versus GiB.
In fact, one of the things that sparked this article was a press-release talking about memory chips that had 256 GB/s of bandwidth. The article then breathlessly pointed out that four of these chips would have 1 TB/s of bandwidth. This is almost certainly wrong. The chips probably have 256 billion bytes per second of bandwidth, so four of them would have 1.024 trillion B/s of bandwidth – neither 1.0 trillion B/s nor 1.0 TiB/s. A minor error, but it amused me.
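The bus example above, spelled out as arithmetic (a sketch; the 256 billion bytes/s and 4 GiB figures are from the example):

```python
bandwidth = 256 * 10**9   # bus bandwidth: frequencies are base 10
ram = 4 * 2**30           # 4 GiB of RAM: memory is base 2

naive_passes = 256 / 4            # the tempting answer: 64 passes/s
actual_passes = bandwidth / ram   # ~59.6 passes/s, before refresh overhead

print(naive_passes, actual_passes)
```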
Wait a minute, flash memory is base 10?
A lot of geeks are surprised when they find out that the capacity of flash memory drives is measured with base-10 units. Given that thumb drives are always 8, 16, 32, or 64 GB it seems reasonable to assume that the “GB” refers to GiB. But it doesn’t. Grab a few flash drives and take a look at their capacity. I just looked at the “32 GB” SD card for my camera and its capacity is 31.91 billion bytes. If flash drives were using base 2 prefixes then that should be 34.36 billion bytes – it’s not even close.
But those should be base 2!!!
Really? Why ‘should’ some of these technologies be based on base-2? There is clearly no reason for frequencies to be base 2, so they aren’t.
Hard drive capacity is the product of sector size (base 2) times sectors/track times tracks/platter times number of platters. Constraining those last three numbers to be powers of two would be ridiculous. One small power of two doesn’t make the whole package a power of two. And, since the capacities aren’t powers of two, there is no good reason to clumsily represent the capacities with base-2 prefixes. Describing 320 billion bytes as 298 GiB doesn’t help anything.
One could argue that hard drive manufacturers use base 10 because it makes their drives look bigger, and I’m sure they don’t mind that aspect of it. But, base 10 being financially convenient isn’t enough to justify the claims of a vast hard drive conspiracy. The hard drive manufacturers are simply using the most convenient and standard units because there is no compelling reason to do otherwise.
Flash drives are more surprising because the underlying chips have power-of-two raw capacity. But flash drive manufacturers necessarily over-provision in order to leave space for wear leveling, spare sectors, etc. Flash memory already does complex remapping of ‘addresses’ so constraining themselves to power-of-two capacities would have no benefits. The reason why flash drives normally have sizes like 8, 16, 32, or 64 GB is probably because the 7.4% to 10% overhead that this provides is conveniently close to what they need. If the amount of spare capacity changes then flash drives could end up being sold with 120 GB or 130 GB capacities.
Does it matter?
You could reasonably say that it doesn’t matter if we display base 2 units to the user because they don’t care. That’s a terrible argument because if they don’t care we shouldn’t display any numbers at all. If we are going to display numbers to the user then they should be base 10 unless there is a compelling argument for base 2. For file sizes and disk sizes there is no compelling argument – and this is something that OSX does right.
Do you really want to tell your parents that a 1 TB drive is more than twice as big as a 500 GB drive? Or that a 1,010 GB drive is smaller than a 1 TB drive? This is the sort of madness that base-2 causes, for no good reason. The mixing of base-2 and base-10 is even worse, because you can’t even come close to fitting 320 files that Windows says are 1 GB onto a drive that you purchased as 320 GB – you won’t even fit 300.
Do you really enjoy explaining to your friends and relatives why Windows is telling them that their brand new hard drive is smaller than the size listed on the box?
But what about computer nerds, surely they should use base 2 for everything, shouldn’t they? No – only when it makes sense. Using base 2 for anything except memory and bit masks leads to ambiguity and to errors. If you use the wrong unit then you will add 2.4%, 4.9%, 7.4% or 10% error (for KiB, MiB, GiB, and TiB). There are probably many calculations of disk or memory bandwidth that have been off because of the MiB/million discrepancy, and the errors only get worse as disks and frequencies get larger.
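Those error percentages can be derived directly; each step up the prefix ladder multiplies in another factor of 1.024:

```python
errors = {}
for power, name in [(1, "KiB"), (2, "MiB"), (3, "GiB"), (4, "TiB")]:
    # How much larger the binary prefix is than its decimal cousin, in percent.
    errors[name] = round((1024**power / 1000**power - 1) * 100, 2)

print(errors)  # {'KiB': 2.4, 'MiB': 4.86, 'GiB': 7.37, 'TiB': 9.95}
```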
As an example of pointless base-2 consider this Event Viewer screenshot (from this tweet):
The same size is described in bytes, KiB, and MiB. This leads to having leading digits of 64, 65, and 67. The mismatched units and the base-2 units serve no purpose other than to hide the fact that the various sizes are actually not the same (the size in bytes is 4 KiB larger than the other two). Converting between different units shouldn’t require tricky math, unless justified by critical underlying physical realities (such as cache sizes).
I used to work at Microsoft so I know something about how they think and I’m sure that the main reason they still use base-2 units in Windows Explorer is simply because that is what they have always done. Fear of breaking something, somewhere, will probably keep them on base-2 prefixes forever. But I want to do my part to convince developers to not repeat Microsoft’s mistake.
If you’re going to show sizes using base 2 then I recommend that you acknowledge the nerdiness of this in the most honest way possible – use hexadecimal. Or cut off your thumbs and we’ll switch the whole world to octal.
“base 10 is the appropriate choice because it then matches the number system that human beings use”
Hey, great link. I’m surprised it doesn’t mention base 60 (sexagesimal) used by the Sumerians – the reason that minutes, hours, and degrees work the way they do. And then there’s the odd mixed-base of non-metric systems (12 inches to a foot, three feet to a yard, three teaspoons to a tablespoon, etc.).
I could change it to say “the number system that human beings use the most”, but I doubt I will.
IIRC from college – “Base Burroughs” from the 1960s – base 12 for one digit, base 10 for the others -> 100 = 120 (base 10). Burroughs line printers had a 120 width I think.
Burroughs B2500’s and later (to B49xx – RPI-ACM had an old B2700 in ~1983) had BCD ALUs and a base-10 memory address setup.
Base 2 makes math easier when it matters though, and that would probably be my argument against at least on the technical side. For example, if you want to talk about disks in terms like blocks (e.g. 512b/4096b). An even better example might be IPv4 vs IPv6.
Using imperial in software is probably still arguably worse.
Except it never matters. Computers are quite good at performing calculations, humans not so. And these calculations only happen when displaying data for the user.
I wholly agree with you on the people side. On the computer side, division and/or conversion rules are arguably computationally more expensive though. For example, IPv4 needs large expensive routing tables to route packets. IPv6 can just use bitmasks.
On the computer side actual numbers are inevitably *stored* in binary. BCD instructions still exist in x86 but nobody uses them.
I remember seeing those reading through the Intel manuals.
I can see you’ve never programmed in COBOL. Do you have a credit card? Somewhere there is a server where a single thread of COBOL is screaming along doing BCD arithmetic seeing if they can charge you interest or a late payment fee. Most of the money of the planet is processed in BCD. That’s why Intel will never ditch the BCD instructions, there’s too much money in it.
Does anybody (other than the COBOL use mentioned in this thread) actually use those instructions? Do compilers even generate code for them?
The BCD instructions aren’t even available in x64 mode. They were removed so that those one-byte opcodes could be used for more valuable purposes. I doubt they were even used in x86 mode because they really don’t offer a lot of value, and they certainly aren’t required for dealing with BCD numbers.
> Base 2 makes math easier
Well… base 2 makes *some types* of math easier *for computers*.
So, if you need to do some bit masking, in the computer, then use base 2. And disks do have sectors that are a small power of two, so bit masking is used to calculate the address within a sector. But none of this has any relationship to what should be shown to consumers.
Although I mostly agree with your article there is one big advantage of base 2 (when used with correct prefixes) which is its unambiguousness:
If it says 1 GiB, it IS 1 GiB (base 2).
If it says 1 GB then it SHOULD be base 10 (like storage manufacturers always have used) ooor it could be base 2 (e.g. Windows and possible other applications/OS).
That’s why I would prefer my OS and applications to use base 2 for storage with correct prefixes (yes, I’m looking at YOU, Microsoft!). Also RAM manufacturers should use correct prefixes.
I’m fine with using base 10 everywhere else, even storage manufacturers, though I think it wouldn’t be much of a problem to always print both numbers on the packages for reference.
Yep, the ambiguity is the worst part of MB/GB/TB. When it matters I’ll often say “20 million bytes” to avoid the ambiguity. Sometimes I’ll use MiB/GiB/TiB, but I rarely find that valuable for file sizes.
But I don’t find the ambiguity a compelling enough reason to prefer MiB/GiB/TiB for file and drive sizes. And if the ambiguity is a problem we should invent some equally ugly prefixes that unambiguously mean base-10 – perhaps MeB/GeB/TeB – since the bad computer nerds stole the originals.
I really don’t like the ambiguity and I think you made a good proposal.
But instead of ‘e’ (MeB…) we could use ‘o’ because it looks more like the zero in “base 10” (OK OK, it’s actually that this results in more funny words: KoB, MoB, GoB, PoT, EoT) 🙂
Maybe a question could be when to use it. I think it should *always* be used for everything computer storage related (i.e. bytes and bits) (of course except for those rare cases when base 2 actually is preferred, e.g. RAM):
GB -> GoB; GB/s -> GoB/s; Mbit/s -> Mobit/s
(don’t forget hectobit per second: hbit/s -> Hobit/s, hehe)
This whole thing reminds me of the frustrating semantics of ‘date’ in programming. Is it a time, is it just a calendar day, does it include a timezone, is it interoperable with that date stored there? The lack of consistency leads to misunderstanding, mistakes, time waste and annoyance.
I think Raymond Chen provided some explanation: https://blogs.msdn.microsoft.com/oldnewthing/20090611-00/?p=17933/
Good link. But I think he correctly answered the wrong question. The real question is “why does Explorer use base-2 MB instead of base-10 MB?” – and for that the evidence is much murkier. Many parts of the industry (OSX, hard-drive vendors, computer OEMs) do follow this path, and it is a simpler path (IMHO).
Say it with me: Using base-2 for file sizes makes no sense. Using base-2 for drive sizes makes no sense. Using base-2 for file sizes makes no sense. Using base-2 for drive sizes makes no sense.
Actually it makes perfect sense and I for one don’t go along with or like this renaming of established computer conventions just to come inline with weights and measures. A MB is 1024KB which is 1024x1024Bytes REGARDLESS if it is computer memory or computer storage. The best solution would have been TO LEAVE THE WHOLE THING ALONE !
I agree that the established computer convention of base-10 MB/GB/TB for hard-drive sizes should be retained and we should leave the whole thing alone.
Doing this allows for much easier conversions between the units (1.732 GB equals 1,732 MB, nice and easy) and between raw byte counts (1,732,000,000 bytes equals 1.732 GB – try doing that in your head with binary GB). It’s nice that using base-10 for file and drive sizes has no real disadvantages because drive and file sizes have no significant relationship to powers of two (unlike memory and cache sizes).
Leave it alone at base 10.
The real problem IMO is not the choice of base 2 vs. 10; it’s notation. Suffixes like Kb and words like Kilobyte have meant base-2 multipliers for a few decades (eternity in CS) before this debate really started. Why not adopt new suffixes for base-10 instead? For example KdB / “kidebyte” (Kilo-decimal-bytes), then nobody would be fighting over this.
Of course “Kilo” means 1000, and K/Kilo are SI standard suffixes for base-10. But consider also this: in a decimal Kb, the multiplier may be a nice base-10 number but the byte is not a fundamental unit; a byte is 2^8 bits, and a bit is 2^1 possible values. So the decimal variant is irremediably broken… it’s a Frankenstein like Kilo-inches or Mega-ounces. That’s ultimately what drags me into the position of sticking to base-2 multipliers AND insisting on their right to the nicer suffixes. Not just because tradition and convenience for engineers, but because it’s the only coherent unit system.
Too bad that we have some base-10 HW standards like “Gigabit Ethernet”; but those are arbitrary, there’s no reason why that spec couldn’t have defined as a binary-Giga – the numbers are close enough that, by the time technology was good enough to transmit 10^9 bits, it was certainly good enough for 2^30. (And again, Gigabit is not anymore a nice round decimal number when you convert it to 125 million bytes per second.) There are zero cases I know where the choice for base-10 is “natural”; there are only two cases, those where electronics and digital logic dictate base-2, and those that are completely arbitrary (almost always related to timing, because we can manufacture oscillators of any frequency we want and time is orthogonal to the “space” factors from base-2-dominated logic).
I agree with your point (and I’d personally go for powers of 2 everywhere). If you’re going to use base-10 multipliers then capacity should sensibly be measured in bits, not the rather arbitrary groups of 8-bits we call bytes. ‘Gigabit Ethernet’ is therefore internally consistent.
Of course everyone would get mightily annoyed when they got the ‘bit’ version of things they were expecting in ‘byte’s but I’m sure they’d get over it.
Capacity is measured in bytes because in most cases that is the indivisible atom of computing. Most processors can read or write no less than a byte. Memory allocations are requested in terms of byte counts. Bit granularity is, in most cases, not possible.
Byte granularity is also appropriate because a 32-bit computer can address 2^32 bytes of memory – the address of any byte can fit in a 32-bit register. Not so with bit granularity. Byte is the appropriate base unit.
Networking uses bits – I don’t know why. Tradition?
It’s bits/s because they are normally sent one after another -> It’s the raw speed of the data communication independent of how bytes are assembled later, i.e. of how many bits they consist, if there is a start bit, stop bits, parity etc (e.g. RS232).
You can convert the speed to bytes/s (e.g. assuming octets), but it won’t represent some actual possible data throughput because of the overhead of all the involved protocols (starting with Ethernet frames (header, addresses, checksums) to higher level protocols e.g. used for file transfers).
I think it’s good to keep bits/s because it cannot be easily converted to some meaningful bytes/s value.
If we’re going to apply the argument that the user doesn’t need to know implementation details and should have the thing that’s easiest to deal with in maths I don’t see why the user should need to know about bytes.
Or we could make up a new ten-bit byte because it’s more metric.
Incidentally, Nintendo cartridge storage capacities are measured in megabits, but as far as I can figure out that’s a base-2 mega.
Not sure if serious…
I’m not sure how moving away from the 8-bit byte gains us anything. It adds a new type and a new confusion, and fails to simplify anything. If bytes were ten bit then it would make conversions between bits and bytes slightly simpler, but it’s not, and pretending doesn’t make it so.
kB and MB and GB have meant *both* base-2 and base-10 for decades. Check your history.
The number of bits in a byte is irrelevant. There are eight ounces in a cup and 256 tablespoons in a gallon, but that doesn’t mean we should fill our cars with %1100 gallons of gas. Please explain why bits-in-a-byte matters in the slightest.
Base-2 should be used only where it maps much more cleanly to the underlying technology – which for consumer visible numbers means memory capacity and *nothing else*. Therefore, base-10 should be used for everything else (and *is* used for most other things) because we decided centuries ago that base-10 was best for human math.
The benefits of using base-10 for ease of human calculation (metric, currency, *everywhere*) are huge and base-2 needs a preponderance of evidence and advantages to overcome that.
Easy. 8 bits IS a byte. There are only ever 8 bits in a byte as that is what a byte is!
Sure. Nobody is disputing that. It’s just not relevant to how many bytes are in a GB.
12 is divisible in more ways than 10, and therefore base 12 is obviously better than base 10 – why do you think most layout grids are base 12? If you’re going to push for arbitrarily changing a well-established standard, why encourage what is essentially a lateral move designed to dumb things down for the poorly educated? Let’s move to a system that’s actually better. Base 12 for everything!
Come now, base 12? The Sumerians had it right and we should be using base 60. Learning our timestables would be more cumbersome, but most numbers would then have far fewer digits, and the divisors are many and varied.
Hard drives are just another level in the storage subsystem, so using base 10 with those while measuring RAM and cache in base 2 creates more confusion than it clarifies, because you end up with a change in units at an _arbitrary_ level in the stack. That’s why sectors are sized to a power of 2, because they have to interoperate with the lower levels of the stack.
Reporting of disk drive capacities has evolved. I remember when some hard drive companies used a MB unit that was neither 1000 * 1000 bytes nor 1024 * 1024 bytes but 1000 * 1024 bytes. And nowadays, many manufacturers report formatted capacity rather than raw capacity (though they rarely talk about which level of formatting they refer to).
How about display resolutions? There are two versions of 4K, neither of which are 4000 pixels wide. There’s 3840 x 2160 and 4096 x 2160. Oh, look, another use of K for base 2.
“You could reasonably say that it doesn’t matter if we display base 2 units to the user because they don’t care. That’s a terrible argument because if they don’t care we shouldn’t display any numbers at all.”
That’s a huge non-sequitur. When we’re up to gigabytes and terabytes, most people probably no longer care whether it’s base 10 or base 2, because the difference is relatively insignificant, but the first couple digits do still matter–the order of magnitude matters. Even the base-10 reporting disc drive manufacturers are only reporting one or two sig figs.
> Hard drives are just another level in the storage subsystem
Hmmm. No. They are the first level in the storage subsystem. They are the last level in the memory hierarchy. So, one could justify discussing page-file sizes in terms of binary GB, but that’s it.
> When we’re up to gigabytes and terabytes, most people probably no longer care
> whether it’s base 10 or base 2, because the difference is relatively insignificant
Well that’s a funny thing to say because the discrepancy in the TB range is larger than at any of the smaller prefixes. At 9.95% – which rounds to 10% – it is plenty large enough to affect the first couple of digits. 1.0 TiB versus 1.1 trillion bytes seems important to me.
“…one megabyte is one million bytes of information. This definition has been incorporated into the International System of Quantities.”
Type “man resize2fs” in linux for some LOL:
BEGIN QUOTE (translate underline to UPPERCASE):
Note: when kilobytes is used above, I mean REAL, power-of-2 kilobytes, (i.e., 1024 bytes), which some politically correct folks insist should be the stupid-sounding “kibibytes”. The same holds true for megabytes, also sometimes known as “mebibytes”, or gigabytes, as the amazingly silly “gibibytes”. Makes you want to gibber, doesn’t it?
… well at least it’s free (as in free beer) software.
I’m a little late on the comment here (I was avoiding all technology during vacation …), but …
Whenever I’m presented with a problem like this, where the same nomenclature means two different things, I tend towards a solution where you create a new name with a specific definition. In this case, we already have one created for us. I would just use e-notation.
I know what the main objection here is, which is that many or most consumers don’t currently understand e-notation. But we’re talking about the same customers who have already had to learn what Kb, Mb, Gb, Tb, etc. mean. Surely they can figure out e-notation if everybody started using it for memory sizes (and RAM sizes, and chip speeds, etc.).
/just my 2e1 bits …
I’m not sure what you mean by memory sizes as distinct from RAM sizes, and I’m not sure why you’re suggesting the e-notation (KiB/MiB, etc.) for chip speeds. It is needed for RAM sizes and nothing else.
Sorry, brain fart on the RAM vs. *hard drive* sizes, but in any case I was mistaken thinking the your Dell example was regarding hard drive rather than RAM. So total confusion.
My suggestion to use e-notation for everything is so the laypeople don’t need to learn the prefixes after giga- and tera-. Although point taken, it’s probably going to be a while before we even need terahertz, much less petahertz.
You seem to be mainly arguing in defense of the average user. To which I’d say to the average user the measurement of data is abstract and imprecise. A user is really only concerned about the relative size of things, not the exact number of bytes per order of magnitude. The way data is addressed (and often stored) makes powers of two intuitive, at least for the people concerned with that type of thing. Why do it both ways, at the expense of ambiguity and confusion, all on account of the average user that doesn’t know the difference?
I am also arguing in favor of myself. Just a few days ago I was copying 150 GB of data over a 100 M-bit Ethernet. I estimated the time as 150 * 10 * 8 seconds. Oops – don’t forget to multiply by 1.074 for the GiB to GB conversion factor because Windows describes file/directory sizes using base-2 GB, but Ethernet is base 10. Even if Ethernet frequencies were base 2 there would still be a 1.024 adjustment for the GB/MB conversion factor. Aaarrrgghh! It’s ridiculous.
Base-2 KB/MB/GB/TB are rarely useful. They make sense (outside of programming) for memory capacity only. We should use base 10. Or, we should standardize on base-2 and start referring to 95.4 M-bit Ethernet.
In other words, standardizing on base 2 is never going to happen. Standardizing on base 10 for everything-except-memory-capacity would be easy, and would confuse no one.
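For what it’s worth, the copy-time estimate from a few comments up works out like this (a sketch; it ignores network overhead):

```python
shown_size = 150 * 2**30     # Windows reports 150 "GB", which is really 150 GiB
link_rate = 100 * 10**6 / 8  # 100 Mbit/s Ethernet (base 10), in bytes/s

naive_seconds = 150 * 10 * 8             # the back-of-envelope guess: 12,000 s
actual_seconds = shown_size / link_rate  # ~12,885 s (about 7.4% longer)

print(naive_seconds, round(actual_seconds))
```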
Don’t get me wrong, I’m for base 10 too, just not with the old prefixes.
But if you do calculations like a dumb user would do, you will always get half-assed results.
For instance you remembered to incorporate the base 2 prefixes but then completely ignored any network overhead.
(In numbers: Assuming you are using TCP/IP over Ethernet with no jumbo frames, no packets lost and no additional overhead of higher level protocols you have about 5% of overhead. This is comparable to MiB vs MB (~5%) and also GiB vs GB (~7%))
So maybe a user is 10% off… it probably won’t matter. And if exact numbers matter someone will be using integers without prefixes anyway (just bytes).
And by the “old prefixes” you mean the ones that were used for base-10 first.
Yes, you should allow for network overhead. It is unfortunate that the errors are cumulative, so that the total error (if you ignore GiB versus GB and network overhead) is over 12%.
And if a 10% error doesn’t matter then let’s use base 10. Base 10 is the math that we are taught from kindergarten. The base-2 believers have to justify why it should be used, and I’m hearing nothing compelling.
IMHO it was mixed since the very beginning. I’m solely talking about prefixes used with Bytes and Bits. Using proper SI prefixes with SI units should always stay the same as it ever was. I.e. km, GHz etc.
With TB, GB, MB, kB or even KB you just never know what you get. Even if all decide let’s use base 10 there will be legacy stuff out there for a very long time.
IMHO new base 10 prefixes for Bytes and Bits which everybody should use henceforth unless base 2 is really beneficial (RAM) is the best solution.
I’m not sure that a *third* type of prefix (on top of MB/MiB) is going to solve anything. When I need to be unambiguous I say millions or MiB.
Maybe we need to sue memory makers who claim to have 4 GB of RAM with 256 GB/s of read bandwidth for their base 2/base 10 inconsistency.
Maybe if it were the other way around, e.g. they promised 4 GiB and then it’s only 4 GB. But as it actually is, you could sue and they wouldn’t really care.
Regarding another prefix:
Someone like Microsoft has been using its prefixes for decades. They won’t change it: if, for instance, some disk had 1.23 TB (base 2) in Windows 10 and suddenly 1.35 TB (base 10) in Windows 11, customers would be confused (understandably).
IF they change something it’s more likely the prefix from TB -> TiB, but that’s not what you actually want.
They promise 256 GB/s of bandwidth and only deliver 256 billion bytes/s. If GB really does “mean” binary GB then they are under delivering.
Giga means billion and you know that. What they say wrongly is 4 GB but then they actually overdeliver. Nobody can sue them because of that.
Sadly, drive manufacturers were sued multiple times for saying that giga means billions, and had to settle. Explorer’s usage of giga as 2^30 probably hurt their cause. I agree that Giga *should* mean billion, but wishing has not yet made it so.
Things like hard drives are often subdivided into power-of-two-sized storage chunks of 512 to 4096 bytes. While one could say that 1 GB represents 1,953,125 sectors of 512 bytes, or that 1 TB represents 244,140,625 clusters of 4096 bytes, such treatment won’t work for MB when using either size of block, nor for GB when using 4096-byte blocks.
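The divisibility arithmetic here is easy to check; a quick Python sketch:

```python
# Which decimal sizes divide evenly into power-of-two blocks?
print(10**9 / 512)      # 1953125.0   -- 1 GB is a whole number of 512 B sectors
print(10**12 / 4096)    # 244140625.0 -- 1 TB is a whole number of 4096 B clusters
print(10**6 / 512)      # 1953.125    -- but 1 MB is not
print(10**9 / 4096)     # 244140.625  -- nor is 1 GB in 4096 B clusters
```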
Problems were avoided for the 2^10 prefix by making the letter uppercase “K”, as distinct from the lowercase “k”; the former prefix may be pronounced “kay” and the latter “kilo”. A “32 kay-hertz crystal” would be 32768Hz; a “32 kilohertz crystal” would be 32000Hz (both frequencies are used, though the former is far more common). It’s too bad other prefixes never worked out so nicely in writing, since the same pronunciation-based distinctions could otherwise work just fine for them.
Hard drives are divided into 512 or 4,096 byte sectors. So what? It’s a boring implementation detail. Since the total sizes of these drives have *zero* correlation to powers of two I don’t think you’ve made a compelling case. There are 256 tablespoons in a US gallon, but that doesn’t mean we should use base-2 prefixes for tanker trucks. If the number of sectors was usually a power of two then I’d buy your argument, but in fact that is essentially *never* the case.
You say that 32,768 Hz is more common than 32,000 Hz – got any reference for that? In my experience frequencies are almost all base-10.
I think programmers assume that base-2 prefixes make sense because they think that *all* computer units are base two. It just ain’t so.
Hard drives have a total size which is an integer number of allocation units. It’s possible for a file to take exactly 1.000MiB. It is not possible on most systems for a file to take exactly 1.000 million bytes. I’d also suggest that “million”, “billion”, and “trillion”, are the same length as “mega”, “giga”, and “tera”. Only “quadrillion” is longer than the corresponding prefix “peta”. I suppose “Mebi”, “Gibi” etc. aren’t totally horrible, but a good system should allow hybrid sizes (e.g. multiples of 1,024,000).
As for 32768Hz vs 32000Hz, I don’t have sales figures available for the two kinds of crystals, but if you examine chip manufacturer’s datasheets, you’ll find a lot of chips which accept a 32,768Hz crystal and report the number of whole seconds elapsed. For some reason, the chips rarely allow read-out of the raw number of counts, and an annoying number insist upon formatting the data as year/month/day/hour/minute/second, often using BCD(!). I have yet to see one that uses a 32,000Hz crystal; even the one I’ve used that allowed a 1/100 second readout produced that by sometimes requiring 327 pulses per count and sometimes requiring 328, rather than always requiring 320.
If 32,768 Hz is more common that is an interesting fact. However higher frequencies (GPU, CPU, networking, and memory clocks) are all base 10 so it doesn’t change my basic claim which is that base ten is far more prevalent in computing than most people realize.
I’m afraid I don’t find your file/disk size arguments compelling. Most math is done in decimal. You need a compelling advantage to justify presenting users with base-2 prefixes. I, for one, don’t want to explain to users that twenty 100-MB files are smaller than two 1-GB files. That’s just dumb.
Hybrid sizes are the most confusing idea possible. Please, just don’t. In fact, that’s probably at the root of my annoyance with base-2 prefixes. When you say 640 GiB you are saying 6.40 * 10^2 * 2^30. Either use hexadecimal for your file/drive sizes with base-2 prefixes, or use base-10 for everything.
Here’s the kicker…SI does NOT cover usage for bytes. One could argue that the use of SI prefixes is incorrect in the first place, base whatever it may be.
That was very compelling. It really is annoying in terabytes, and probably will be even more ridiculous when home computers start having petabytes. Frankly if you’re not in tech, it makes no sense at all and has wasted tens of thousands of pages of explanations in help files, docs, hardware box printouts and probably many more.
When transferring files between, let’s say, hard drives and the speed is 50 MB/s, is that base 10 or base 2? What about in the IDM download box, or in torrent applications – is it base 2 or base 10?
Unfortunately there is no standardization. You would have to ask the developers, but they might give you the wrong answer if they aren’t aware of all of the places where base-10 is used in computing.
I’m so confused by this article.
What do you think a bit is? How many unique measurable states can a single bit represent?
Count with me now, is that 2, or 10? It is 2, isn’t it.
So as long as x86 is “binary-compatible” (which, last I checked, it still is), I’m honestly confused why you think the onus is on CS to use base-10 because you and others like it more in your world, when in fact they’re measuring something that is at its very core base 2.
Maybe once you’ve got a quantum computer or a quantum storage device you can credibly make an argument in favor of developers not using base-2, but I suspect it won’t be any easier to argue in favor of base-10 then, either, and for the same reason — the fundamental unit the system is based on will be however many unique states we can measure and use computationally the way we use on and off today.
It makes sense to use base-2 KiB/MiB/GiB when talking about things whose sizes are powers-of-two. It makes no sense to use base-2 for things whose sizes are powers-of-ten.
In the powers-of-ten category are disk sizes (hard drives, SSDs, memory sticks), network speeds, and clock speeds. So, until you get an Ethernet port that runs at 104.8576 MHz or a CPU that runs at some multiple of 1.073741824 I’m afraid that base ten makes a lot of sense.
Why does it matter? With base-10 units I can easily tell how many 20 MB files will fit if I have 1 GB free (50). With base-2 units it is much less obvious (especially to consumers) that you can fit 51 20 MiB files into 1 GiB – and yet Windows persists in showing base-2 units to consumers.
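The file-fitting arithmetic above can be spelled out in a few lines of Python:

```python
GB, MB = 10**9, 10**6      # decimal prefixes: mental math works
GiB, MiB = 2**30, 2**20    # binary prefixes: slack accumulates

print(GB // (20 * MB))     # 50 -- the answer a consumer expects
print(GiB // (20 * MiB))   # 51 -- one "extra" file appears
print(GiB // 20_000_000)   # 53 -- three extra for 20-million-byte files
```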
I realize this is a long-standing bikeshed/religious issue for people… but computer memory isn’t base10-sized. (Although I suspect by memory sticks you mean USB or SD memory.) And disk drives were “historically” reported in base-2 sizes, like memory (and mixing units is even more confusing to people), even though they basically never were an “even” base-2 size. The base storage of disk drives (and USB/memory sticks) is base-2 units (sectors and allocation blocks for drives, flash memory blocks for USB/SD). The total storage isn’t a nice single-1-bit-base-2 value, but it’s still a base-2 number – a disk with 96 (0x60) 512-byte sectors has 49152 bytes (i.e. 48KB) — any drive based on power-of-two sectors is fundamentally power-of-2-sized device, and if you check the actual capacity it’s almost never going to be an even MiB/GiB size.
Some drive manufacturers started quoting sizes in base-10 because it let them give a larger number (sometimes over a threshold number), and competition forced all the others to follow suit – and left us with this mess.
And yes, some languages like REXX purposely decided their target audience would be less confused if all language-based limits were power-of-10. But they’re rare, and modern languages rarely put such limits on things. The general public doesn’t really know base-2 well if at all, so you’re right there – though I would also suggest they’re not likely to be bitten by the difference, except maybe when buying an HD – and then only because of the confusion; rarely would they say “ah, a 1GB SD card, I can store exactly 20 of my 50MB files!” — especially as the OS and filesystem uses space too, so a 1GB drive won’t hold 1GB of data files anyway – you’ll never get 20 if it’s all base-10. With base two, you won’t get 21 either – but you might get 20.
We’ll never get drive makers to switch back. If you want to be fully consistent, computer vendors would need to change memory sizes to base-10. Not 2GB of memory, but 2.147GB! (made from two 1.0737GB memory sticks!)
I have never claimed that you should use base-10 numbers to describe memory sizes (or cache sizes, or cache-line sizes, or memory pages – the list goes on). Knocking down that straw-man does not advance your argument.
Drive manufacturers almost certainly do prefer base-10 because it makes their numbers bigger. But it is also more sensible, and easier to understand for consumers.
> any drive based on power-of-two sectors is fundamentally power-of-2-sized device
Uh, no. Sectors are power of two, and that’s it. Are you going to claim that miles are fundamentally a power-of-2 sized measurement because they are an even multiple of 32? No. Similarly, the claim that drives are fundamentally power-of-2-sized devices because they have billions of sectors that are 512 bytes is illogical. There is no *consumer* relevance to the sector size, there are *consumer* benefits to describing them as powers-of-ten, therefore that’s what we should do.
True – though other people do advance that strawman. And my argument that drive sizes are powers-of-2 due to being based on power-of-2 sectors may be correct technically, but is (as you point out) irrelevant to users. Part of it is just my annoyance at people insisting on XiB notation (and worse, the verbal version of that) for things that traditionally were well-understood (like memory). File sizes live in the gray area – traditionally we used to report them in power-of-2 units, which kinda made sense when machines had 1MB (MiB 😉 ) of memory, no VM, and a file size could matter in terms of how much memory it consumed. And also most of the users at that time were used to measuring things in 1024 units. But that’s long ago.
I’m probably still overreacting to people wanting to rewrite all the wikipedia articles about things like the C64 with “64 KiB”. 😉 Which is technically correct, but jarring to someone around during that era. (and I lived and breathed hard drive sizes during the era when some vendors started to switch, and confusion reigned.) I’ll go back to kicking kids off my lawn.
Ah, this is maybe where the confusion is stemming from.
That’s twice now you’ve claimed HDD/SSD are not fundamentally base 2.
Yes they are. Especially hard drives. And it has nothing whatsoever to do with sector size, which, by the by, is not the smallest unit of measure even by their own reckoning, and thus not your base. What is a sector made up of? 512 bytes. What’s a byte? 8 bits. What’s a bit? ON/OFF.
Especially old-school magnetic disks from which all this is based very much started off just trying to realistically record and play back a sequence of BINARY data — 0’s and 1’s — THERE and NOT THERE. The fundamental unit of storage is not a sector, it is a *bit*.
A bit is binary. Thus, it absolutely makes ONLY sense to measure all file-related things in base-2, because fundamentally what you are storing is a sequence of binary data. That you might on top of that construct all sorts of logical patterns that you can measure in your own different way is fine, at the end of the day something is still going to poke a hole in a punch card or not poke a hole in a punch card, and that’s all it can do, is it not?
I understand that these days hard disks and other things might be doing more clever tricks, may be bulking together data transfers and such so that they may not be fundamentally reading/writing individual bits anymore, but some larger grouping thereof, but it is still a GROUPING OF BITS. Which is why file sizes should and properly are base-2 calculated. Because they’re STORED in base 2.
Or am I still missing something?
The fact that we store bits has no earthly relevance. First of all, storage capacities and file sizes don’t deal with bits, they deal with bytes. Now bytes are fundamentally base-2, without a doubt, but that still has no relevance. Because storage capacities and file sizes deal with how many bytes there are. That’s why sector size is *almost* relevant, but what bytes are made of is irrelevant.
Base-10 should be used for consumer visible measurements for the same reason that metric is better than imperial measurements – because it’s more familiar and intuitive and, unless you’re writing code that deals with sectors, it is a perfectly usable abstraction.
By the way, if you want to use base-2 then you should be consistent. Instead of saying that a sector has 512 bytes you should say that it has %1000000000 bytes or 0x200 bytes – because *mixing* base-2 and base-10 is abhorrent.
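Those fully consistent renderings are easy to produce; e.g. in Python:

```python
# 512 rendered consistently in base 2 and base 16,
# instead of the usual base-10 digits paired with a base-2 prefix.
print(bin(512))   # 0b1000000000
print(hex(512))   # 0x200
```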
I think we’ll have to agree to disagree here, and on multiple fronts. It has every relevance. You’re suggesting in a binary world we just start measuring in base 10 because you want us to, and because of confusion caused by the actions of greedy marketing people who did it first.
What’s abhorrent about 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, and so on? What makes these numbers worse than 5, 10, 15, 20, 25? I don’t find it any more difficult to go backwards and forwards in either system, and again, the fundamental thing being measured and the whole ecosystem it lives in is in fact binary, so..
Or, expounding on that, you’re suggesting that because some marketing guys decided to alter a product’s marketing material to base-10, in fundamental disagreement with the underlying physics of what the device is doing and the base-2 world in which it operates, has always operated, and for a very long time was pretty unremarkable “industry standard” and un-debated an issue that it was being measured in this manner, *thereby allowing them to in essence sell you less of the product whilst keeping the number on the box the same, increasing profits at in effect no cost to them because it’s a burden on the consumer*, we should alter everything else to follow suit because the results of that decision have caused debate and confusion, as they knew very well it would and likely did not care even one tiny little bit?
Maybe we should boycott hard drive manufacturers who have arbitrarily decided to try to undercut their customer base by creating this issue in the first place, and refusing to simply report their product in the binary base-2 system that all the operating systems of the day used (indeed, many have now swapped, likely based on logic like you’re using — ultimately serving the interests of a couple of greedy corporate advertising executives somewhere, good job). This is entirely their own grave they’ve dug, and you’re suggesting we throw them a rope and get away with it? 🙂
I mean at one point they were literally selling drives in this manner when every single reasonable use of the product by the consumer was going to report in binary base-2. You’re suggesting this is the fault of everyone else, not the seller of the product in that space in that world? They had it right, all these other guys, totally wrong?
But even on the merits — I’m still not buying this. How about a different track. Why are you so willing to admit defeat on this issue when discussing memory/RAM, as if there’s a difference? What’s the difference you see that makes it OK to continue to report base-2 for RAM usage that is falling apart for you on file sizes? I don’t grok that, either, and perhaps it is related!
Base-10 for file sizes and drive sizes makes sense because they are visible to consumers and they make the math easier for those consumers. And, the sizes of those files typically have no base-2 relevance (except for being a multiple of a sector sizes if you look deep enough, but that is irrelevant for large files). In short, base-2 is confusing for consumers (why is 0.99 TiB equal to 1013 GiB?) and offers no advantages.
Memory sizes, on the other hand, are most easily expressed for consumers in base-2. As in, 16 GiB of RAM rather than 17.2 GB. There is a real base-2 significance across the full range of sizes. And, for software developers there is often *great* convenience in using base-two. From a post I am working on right now: a single level-4 page table page can address 2 MiB, a level-3 page table page can address 1 GiB, and so on. MMU pages are 4 KiB, cache sizes are 32 KiB. All of these are much more easily expressed using base-2, which makes all of the relevant bit-masking and shifting work, as in the VirtualScan tool, here: https://github.com/randomascii/blogstuff/tree/master/cfg
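That page-table arithmetic falls out of two power-of-two constants; a sketch assuming 4 KiB pages with 512 eight-byte entries per table page, as on x86-64:

```python
PAGE = 4 * 2**10     # 4 KiB page
ENTRIES = 512        # eight-byte entries fill one 4 KiB page-table page

per_table = ENTRIES * PAGE       # one bottom-level table page maps 2 MiB
per_dir = ENTRIES * per_table    # one page up the hierarchy maps 1 GiB
print(per_table // 2**20, "MiB,", per_dir // 2**30, "GiB")
```

None of this works out to round numbers if the sizes are expressed in base 10.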
Consumers should only be exposed to base-2 if it clearly simplifies things. This is justified for memory, which almost always comes in an exact power of two, but is not justified for network speeds, CPU clock cycles, or hard-drive sizes.
That is all.
But this has cleared up nothing? You’re acting like it’s totally logical to talk about and address RAM in base-2 because:
a) nobody does it, anyway, at least not most consumers
b) everything devolves all the way down easily in base 2
Here’s my problem – A is demonstrably false (gamers talk RAM all day long, hosting companies, anyone doing virtualization [rapidly becoming everyone]/cloud, RAM is a pretty commonplace thing to be monitoring actively these days), and B applies exactly the same to hard disks as it does to RAM. RAM stores bits. Hard disks store bits. That we’ve arbitrarily chosen to clump together bits into bytes and bytes into sectors on hard disks is almost irrelevant — the underlying thing on both devices is the same: storing sequences of bits, a binary base-2 thing. If it’s good for the goose, it is good for the gander. Heck, I’m still not even clear why you keep saying a hard disk devolves down to sectors as if that’s somehow different. Sectors are base-2, because they’re a grouping of bytes which are a grouping of bits.
Hard disks all come in base-2 sizes from the manufacturer. Still. To this day. As you say, they address in sectors. Sectors are base-2. They’re either 512 bytes or 4096 bytes, generally, but I mean, you could almost arbitrarily pick the size, except that *it would always be a base-2 number*, why? Because it’s a clump of BYTES. Bytes as in 8 bits. That they choose to sell them in a box marketed as base-10 for capacity is their choice and should never have become our problem.
We should end this chain; it’s unlikely this is adding anything, or that anyone will be convinced – but I’ll back up Bruce here by noting that while RAM is typically sized in “nice” power-of-two sizes (2GB, 4GB, or even 3GB); disks (while made of power-of-two sectors) don’t have total sizes that are nice powers of two, and haven’t pretty much forever – and files *really* don’t have powers-of-two sizes, so when you look at a list of file sizes (in powers-of-10) there’s an advantage if the disk is sized and discussed in powers-of-10 – even if it isn’t “exact”.
I agree that it would be better for all end-user stuff to use the same units at the hardware and software level, but on the other hand I think it will create more chaos and confusion than it solves.
Even being a programmer myself I always thought that of all the bit- and byte-related stuff only hard drives use base 10 units (just didn’t bother to dig into this problem before).
The first reason why base 2 is still used in almost all software, other than historical, is consistency.
As you wrote, RAM was always produced and measured with base 2 units, while other stuff was always mostly base 10. If you will use different units for memory and other stuff, it will be harder for users to compare memory size with other sizes and explain certain size limitations caused by register sizes and memory addressing.
For example, it will be confusing to have 8 GiB of RAM and 8.59 GB page file which is used to store RAM (with both sizes being actually equal). It will be confusing to state that the maximum file size is limited to 4 GiB while your other file sizes will be measured in GB instead or to state it in GB and explain why it’s limited to these “random” 4.3 GB.
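As a quick check of that “random” number:

```python
limit = 2**32          # the classic 32-bit file-size limit
print(limit / 2**30)   # 4.0 -- a tidy "4 GiB"
print(limit / 10**9)   # 4.294967296 -- the "random" 4.3 GB
```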
Why does Windows still use base 2 units instead of base 10? I think Raymond Chen’s article posted here earlier indirectly answers this question as well: because almost nobody in the software world has adopted this either.
Being the most popular OS in the world, changing them now would be a major undertaking for the whole industry, probably comparable to the Y2K problem, and will cause total mayhem for the next few years (which will be worsened even further by the fact that the prefixes used will still be the same). If Windows changes its units, all software will need to do the same to talk the same language to its users (and for abandonware this inconsistency will stay forever). Not only software itself, but its requirements will need to be updated as well (even for software that never displayed any bytes info to its users). Also documentation, computer books and what not…
So, OS X has changed its units already, but what about other Mac software? Is it easier to use now? If so, what was their experience?
Suggesting that other software developers use base 10 on their own is a no-go either. They’ll be the black sheep among other software and have to answer countless “Why?”s from their users.
I think you capture the problems with changing to base-10 pretty well. However I think you underestimate the problems of the current usage of base-2. The use of base-2 causes internal inconsistency _within_ explorer. At this moment I’m looking at a file that explorer’s window describes as being 38,617 KB. If you bring up the properties it describes it as 37.7 MB (39,542,912 bytes).
So, 37.7, 38.6, or 39.5? Which is it? If you aren’t intimately familiar with the 2.4% variation between subsequent power-of-two prefixes then this looks like a bug. I think that this sort of confusion is far more common than the rare times when regular users need to deal with binary prefixes.
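All three numbers come from the same byte count; a Python sketch (the ceiling on the KB figure is an assumption about how Explorer rounds):

```python
import math

size = 39_542_912  # bytes, the file from the example above

print(f"{math.ceil(size / 2**10):,} KB")  # 38,617 KB -- Explorer's list view
print(f"{size / 2**20:.1f} MB")           # 37.7 MB   -- the properties dialog
print(f"{size / 10**6:.1f} MB")           # 39.5 MB   -- plain decimal mega
```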
Windows is wrong, but Windows hates to change. I understand that.
I am very opinionated on this. Because hard drive manufacturers got sued, and LOST, over something the programmers are actually doing wrong (using the wrong prefixes). If programmers of OSes don’t know the proper prefixes, we are in trouble as a world.
Eehm, do you sue your marketers for selling you tomatoes in the vegetable section, when clearly they are fruits?
We people adapt quite easily when it is convenient for us, yet we make such a fuss, like when people like Bruce think ‘oh no, we shouldn’t expose lay people to the binary system’.
There would be absolutely no problem if all stayed in base 2. Base 2 has a little overprovisioning inherent, so you could store a little more than one thousand 1 MB files in 1 GB of storage, yet nobody does it anyway. You don’t wanna see your system crawl to a halt because there is less than 10% of free disk space (you know, all those pesky page-files, disk fragmentation and log files you don’t have any control over anyway, let alone lusers).
I really don’t care about clock frequency or network speed etc., as overhead, in-hardware compression, conversion loss etc. make direct calculation useless anyway. Personally I have no problem if a download takes 1.5 times longer than “advertised”, as long as it is not ten or a hundred times longer (because of maybe ‘best effort’ … all the neighbors downloading the latest great season of GoT (← insert your favorite show) at the same time, or any other ISP snafu?)
If all stayed in base 2 we would be fine. But that means no more saying “1.5 times longer” – that should be 1.1 times longer. And “10% free disk space”? What is that nonsense? 10/100 free disk space? Do you mean half free? Do you mean 0.000110011… free? I’m unclear on your meaning.
The problem is that “428 MB”, as used by Windows, is a mixture of base-10 (428) and base-2 (MB) and this causes confusion. It means that 1020 MB is smaller than 1.0 GB. Can we not all agree that that is surprising to most people? 0x3FC MB being less than 1.0 GB (which is 0x400 MB) – now that makes sense.
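The surprise is easy to demonstrate; a minimal Python sketch:

```python
MiB, GiB = 2**20, 2**30

a = 1020 * MiB   # shown as "1020 MB" under Windows' binary prefixes
b = 1 * GiB      # shown as "1.0 GB"

assert a < b                 # so 1020 "MB" is smaller than 1.0 "GB"
print(a / 10**9, a > 10**9)  # yet it is over a billion bytes
```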
Your main point seems to be “meh, it’s all approximate anyway”, which is fair enough, but that just says that you shouldn’t care whether MB is base-2 or base-10, and yet you cared enough to post a comment!
If this confusion had benefits then sure, go for it. However in all of the comments on this blog post I have yet to see any convincing description of the benefits of mixed bases.
Well, it seems we have IBM to thank for this mess, as the common convention was inherited from them onwards… https://en.m.wikipedia.org/wiki/Timeline_of_binary_prefixes 🙂
When I was a kid the only OS around had 1024-bit bytes and 1024-byte kilobytes. It was like that for decades. Changing anything like that in an OS will break something. It’s great when one doesn’t have to worry about backcompat.