OK, so here’s the thing for me: I kind of suck at maths. Not so much in a “can’t solve a differential equation to save his life” way (though I couldn’t), but more in a “still counts on fingers” sort of way. Mathematics of all sorts has always been anathema to me, and as I get older I increasingly take the approach that this means I have to be doubly careful about getting anything I do with numbers right.
When we’re working with data protection systems, there are a few significant numbers we need to be aware of, and more importantly, we need to understand what the differences between those numbers mean for our solution.
Take megabits per second vs megabytes per second. There’s quite a big difference between the two, and it’s one of the first things I’ll seek to qualify when I’m talking to a customer about links between sites. Part of the confusion is what might be best described as a lackadaisical approach to capitalisation, but it’s critical to understanding what you can and can’t do. Most of the time, general IT people aren’t as careful as they should be about specifying which unit they mean when describing network speeds. Here’s the difference, of course: 100 Mb/s (megabits per second) = 12.5 MB/s (megabytes per second). So if you want to work out whether you can replicate backup data from a remote site into the central DC, it’s really important to understand whether your link will support replication at 400 Mbps or 400 MB/s.
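If, like me, you’d rather double-check this sort of arithmetic than trust it to mental maths, a quick back-of-the-envelope sketch in Python does the job. The 400 Mb/s link and the 2 TB of nightly change below are made-up figures purely for illustration.

```python
# Back-of-the-envelope conversion between megabits per second and
# megabytes per second, and what that means for replication time.
# The link speed and nightly change figures are hypothetical.

link_mbps = 400                        # hypothetical link: 400 megabits per second
link_MBps = link_mbps / 8              # 8 bits per byte -> 50 megabytes per second

nightly_change_GB = 2000               # hypothetical amount of backup data to replicate
nightly_change_MB = nightly_change_GB * 1000

replication_hours = nightly_change_MB / link_MBps / 3600

print(f"{link_mbps} Mb/s = {link_MBps:.1f} MB/s")
print(f"Replicating {nightly_change_GB} GB takes roughly {replication_hours:.1f} hours")
```

Run it and you’ll see the same link quoted in megabytes per second is an eighth of the number quoted in megabits per second, which is exactly the trap people fall into.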
OK, the next one is gibibytes/tebibytes vs gigabytes/terabytes. At some point there was a memo passed around to everyone in IT making it clear when we use gibibytes and tebibytes vs gigabytes and terabytes, but I was obviously sick that day, and I missed the memo for years. The “ibis”, for want of a better term, are base 2, while the original nomenclature is base 10. Where that’s important, of course, is understanding whether you’re asking for 100 Gibibytes or 100 Gigabytes of storage. After all, 100 Gibibytes = 107.374 Gigabytes. That may not sound like a lot, but as capacities ramp up and multiple workloads are involved, it can make a significant difference over time.
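The same sort of sanity check works for the base 2 vs base 10 difference; the 100 below is just an arbitrary requested capacity.

```python
# Base 2 vs base 10: the same "100" is a different number of bytes
# depending on whether you mean gibibytes or gigabytes.

GiB = 2**30        # 1,073,741,824 bytes
GB = 10**9         # 1,000,000,000 bytes

requested = 100    # arbitrary capacity, purely for illustration

print(f"{requested} GiB = {requested * GiB / GB:.3f} GB")    # 107.374 GB
print(f"{requested} GB  = {requested * GB / GiB:.3f} GiB")   # 93.132 GiB
```

Seven percent either way doesn’t sound like much until you multiply it across every workload in the environment.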
Then there’s deduplication ratios. Deduplication can be expressed in two different ways. I personally like using the X:1 deduplication ratio, because it’s quite straightforward: if you get a deduplication ratio of 25:1 against 5 TB of data, then you store 200 GB. The other way you can talk about deduplication is the percentage reduction. Do you get 80% reduction, 90% reduction, 98% reduction, 99% reduction? Does the difference between a 98% and a 99% reduction even matter? Consider the table below.
| Original Data Size (GB) | % Reduction | Data Stored (GB) |
|---|---|---|
| 5000 | 80% | 1000 |
| 5000 | 82% | 900 |
| 5000 | 84% | 800 |
| 5000 | 86% | 700 |
| 5000 | 88% | 600 |
| 5000 | 90% | 500 |
| 5000 | 92% | 400 |
| 5000 | 94% | 300 |
| 5000 | 96% | 200 |
| 5000 | 98% | 100 |
| 5000 | 99% | 50 |
It does, of course. A 99% reduction stores only half as much data as a 98% reduction (50 GB vs 100 GB in the table above): it turns out there’s actually a substantial difference between a 98% reduction and a 99% reduction. (For reference, a 25:1 deduplication ratio is effectively equivalent to a 96% reduction.)
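If you want to convert between the two ways of expressing deduplication yourself, a small sketch like the following (reusing the 5 TB example from the table) covers it.

```python
# Converting between an X:1 deduplication ratio and a percentage
# reduction, using the 5 TB (5000 GB) example from the table above.

original_gb = 5000

def stored_from_ratio(original, ratio):
    """Data stored (same units as original) for an X:1 deduplication ratio."""
    return original / ratio

def stored_from_reduction(original, percent):
    """Data stored (same units as original) for a given percentage reduction."""
    return original * (1 - percent / 100)

def reduction_from_ratio(ratio):
    """Percentage reduction equivalent to an X:1 deduplication ratio."""
    return (1 - 1 / ratio) * 100

print(f"25:1 ratio on {original_gb} GB   -> {stored_from_ratio(original_gb, 25):.0f} GB stored")
print(f"25:1 ratio as a reduction        -> {reduction_from_ratio(25):.0f}%")
print(f"98% reduction on {original_gb} GB -> {stored_from_reduction(original_gb, 98):.0f} GB stored")
print(f"99% reduction on {original_gb} GB -> {stored_from_reduction(original_gb, 99):.0f} GB stored")
```

The last two lines are the point: each extra percentage point at the top end of the reduction scale has an outsized effect on how much you actually store.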
Bandwidth speed, storage number base, and deduplication vs reduction ratios: they’re the fundamental numbers you need to get right when understanding what your environment is capable of. Everything else hinges on them. If you get bandwidth wrong, you may not be able to replicate your backup data within the window you have; if you get your storage number base (base 2 vs base 10) wrong, you may end up with less storage than you anticipated; and if you don’t understand the difference between deduplication ratios and reduction percentages, you won’t know how much logical data you’ll actually be able to store in your backup system. Everything else in your environment will be built on those considerations.