I assume you’ve heard of the wheat and the chessboard problem. It often gets presented as part of the history of chess, or as some fable to teach the importance of understanding what you’re agreeing to. The way I heard it when I was a child was along the following lines:
A king was out for a ride one day when he passed an old lady beside a bridge, begging for alms. Ignoring her, he pressed on, but as he was riding over the bridge, it collapsed, and he fell into the river and would have drowned had the old lady not jumped to his aid. When she got him to the banks of the river, he said, “You’ve saved my life. Whatever you want, you will have.” To which she replied, “Put one grain of rice on the first square of a chess board. Then double that for the second square, double the second square for the third, and so on, until all the squares are filled.”
Wheat and Chessboard Problem
Of course, the wheat and chessboard problem is fundamentally a lesson in exponential growth, and it works just as well for understanding compound growth.
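For anyone who hasn't run the numbers: doubling from a single grain across all 64 squares comes to 2^64 − 1 grains in total, a little over 18 quintillion. If you want to check it yourself, a quick Python snippet (my own illustration, not part of the original fable) does the job:

```python
# Total grains across a 64-square chessboard: 1 on the first square, doubling each time.
total_grains = sum(2 ** square for square in range(64))  # equals 2**64 - 1
print(total_grains)  # 18446744073709551615
```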
For backup and recovery systems in particular, it's critical to understand your change rates and growth rates, especially when you're planning a refresh of your environment. Obviously, since I work in pre-sales, this is something I deal with on a regular basis, and my guidance on this is as follows:
- Garbage in, garbage out: the more accurate your understanding of your change and growth rates, the more accurate the solution sizing and design will be.
- Not all workloads in your environment will have the same daily change rate.
- Not all workloads in your environment will have the same annual growth rate.
- Your daily change rates are probably lower than you think.
- Your annual growth rates are also probably lower than you think.
Number 1, above, is to me an immutable truism. If the data regarding your workloads – type, size, change and growth – are not correct, the only way a solution is going to be correct is sheer dumb luck.
Items 2 and 3, above, I'm willing to posit, are almost invariably truisms as well. I'd normally expect to see different workload change and growth rates based not only on the business function, but also on the workload type. So even if a customer-facing service utilises both a database and a fileserver, those two systems, despite supporting the same function, might have radically different change and growth rates.
Items 4 and 5 are usually correct, though they're the ones with the greatest flexibility. If you have servers that simply accumulate data on a daily basis (video feeds, data warehousing, and so on), there's a greater chance you'll have bigger change and growth rates than we would normally see. Generally speaking, though, it's not unusual to see relatively low change and growth rates across large numbers of datasets within an environment.
Those change rates and growth rates will clearly have a significant impact on the overall solution requirements. To see what I'm talking about, let's consider a range of dataset sizes and daily change rates; the figures in the table below are the GB of changed data generated each day.
Data Size (GB) | 1% Change | 2% Change | 4% Change | 5% Change | 10% Change | 20% Change |
---|---|---|---|---|---|---|
50 | 0.5 | 1 | 2 | 2.5 | 5 | 10 |
250 | 2.5 | 5 | 10 | 12.5 | 25 | 50 |
500 | 5 | 10 | 20 | 25 | 50 | 100 |
1000 | 10 | 20 | 40 | 50 | 100 | 200 |
1500 | 15 | 30 | 60 | 75 | 150 | 300 |
2000 | 20 | 40 | 80 | 100 | 200 | 400 |
5000 | 50 | 100 | 200 | 250 | 500 | 1000 |
10000 | 100 | 200 | 400 | 500 | 1000 | 2000 |
As you can see, even relatively small daily change rates generate a reasonable amount of new data each day. I'm often told there's a 20% daily change rate, but when we extrapolate what those numbers actually mean, it more often than not turns out to have been a guess rather than a hard number.
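If you want to reproduce the table, or run the same calculation against your own environment, the arithmetic is trivial: changed data per day is simply the dataset size multiplied by the daily change rate. Here's a minimal Python sketch using the illustrative sizes and rates from the table above (swap in your own figures as needed):

```python
# Changed data per day = dataset size (GB) * daily change rate.
# The sizes and rates below are the illustrative values from the table above.
dataset_sizes_gb = [50, 250, 500, 1000, 1500, 2000, 5000, 10000]
daily_change_rates = [0.01, 0.02, 0.04, 0.05, 0.10, 0.20]

for size_gb in dataset_sizes_gb:
    daily_change_gb = [round(size_gb * rate, 1) for rate in daily_change_rates]
    print(size_gb, daily_change_gb)
```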
Annual growth rates are equally misunderstood for a lot of datasets. Let's look at those same dataset sizes with a variety of annual growth rates, compounded over 3 years (all figures in GB):
Data Size (GB) | 1% YoY | 2% YoY | 5% YoY | 10% YoY | 20% YoY | 50% YoY |
---|---|---|---|---|---|---|
50 | 51.5 | 53.1 | 57.9 | 66.6 | 86.4 | 168.8 |
250 | 257.6 | 265.3 | 289.4 | 332.8 | 432.0 | 843.8 |
500 | 515.2 | 530.6 | 578.8 | 665.5 | 864.0 | 1687.5 |
1000 | 1030.3 | 1061.2 | 1157.6 | 1331.0 | 1728.0 | 3375.0 |
1500 | 1545.5 | 1591.8 | 1736.4 | 1996.5 | 2592.0 | 5062.5 |
2000 | 2060.6 | 2122.4 | 2315.3 | 2662.0 | 3456.0 | 6750.0 |
5000 | 5151.5 | 5306.0 | 5788.1 | 6655.0 | 8640.0 | 16875.0 |
10000 | 10303.0 | 10612.1 | 11576.3 | 13310.0 | 17280.0 | 33750.0 |
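The 3-year figures are straightforward compound growth: projected size = starting size × (1 + annual growth rate) ^ years. A minimal sketch, again using the illustrative sizes and rates from this post, is below; changing `years` to 5 reproduces the next table.

```python
# Compound growth: projected size = starting size * (1 + annual growth rate) ** years.
# The sizes and growth rates below are the illustrative values used in this post.
dataset_sizes_gb = [50, 250, 500, 1000, 1500, 2000, 5000, 10000]
annual_growth_rates = [0.01, 0.02, 0.05, 0.10, 0.20, 0.50]

def projected_size_gb(start_gb: float, growth_rate: float, years: int) -> float:
    """Dataset size after compounding annual growth for the given number of years."""
    return start_gb * (1 + growth_rate) ** years

for size_gb in dataset_sizes_gb:
    row = [round(projected_size_gb(size_gb, rate, years=3), 1) for rate in annual_growth_rates]
    print(size_gb, row)
```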
I find that predicting growth over 3 years is about as accurate as you'll get within a solution view. Beyond three years, unless you've got an extremely measurable and linear change pattern, the likelihood that extrapolated growth for, say, 5 years (a common request) is accurate is actually pretty minimal. (Which returns us to truism #1: garbage in, garbage out.) To see what I mean, consider the same growth rates, now extrapolated out to 5 years:
Data Size (GB) | 1% YoY | 2% YoY | 5% YoY | 10% YoY | 20% YoY | 50% YoY |
---|---|---|---|---|---|---|
50 | 52.6 | 55.2 | 63.8 | 80.5 | 124.4 | 379.7 |
250 | 262.8 | 276.0 | 319.1 | 402.6 | 622.1 | 1898.4 |
500 | 525.5 | 552.0 | 638.1 | 805.3 | 1244.2 | 3796.9 |
1000 | 1051.0 | 1104.1 | 1276.3 | 1610.5 | 2488.3 | 7593.8 |
1500 | 1576.5 | 1656.1 | 1914.4 | 2415.8 | 3732.5 | 11390.6 |
2000 | 2102.0 | 2208.2 | 2552.6 | 3221.0 | 4976.6 | 15187.5 |
5000 | 5255.1 | 5520.4 | 6381.4 | 8052.6 | 12441.6 | 37968.8 |
10000 | 10510.1 | 11040.8 | 12762.8 | 16105.1 | 24883.2 | 75937.5 |
As you can see, when we stretch to 5 years of annual growth, those workload sizes get quite large, even for the smallest starting points. It also means an error in the starting size compounds: if a workload is actually 50GB rather than an assumed 250GB, then at 20% annual growth the 5-year projection is roughly 124GB instead of 622GB, a difference of nearly 500GB for a single dataset. (While it's common to see RFPs issued with an assumption of 5 years of growth, I honestly think that for the most part a more sensible approach is to size for 3 years of growth with a requirement to accommodate extra growth in years 4 and 5.)
There's an old saying, 'measure twice, cut once', which applies to anything involving clothes, upholstery, and the like. It's that saying that brings me to the point of this post: if you're looking at refreshing or otherwise changing your environment, it's worth spending the time and effort to gather as much data as is practicable.
Sometimes you might have this data available. You might have a strong capacity monitoring and management process within your environment that can chart, on a system-by-system or dataset-by-dataset basis, what your daily change and annual growth rates are. If you've got those details, that's exactly the information you need to get a solution sized with the greatest accuracy. In my experience, though, most environments don't have this information to hand to that degree, so the best (and most likely) option is to run a comprehensive assessment of your environment. That's where tools such as LiveOptics come in: LiveOptics can review the current state of your environment to gather significant amounts of dataset information, and it can also run continuously over a defined period to help gather information such as change rates.
In many senses, developing a view of the required size of a solution is a fairly straightforward mathematical process (I say this as someone who still counts on my fingers). The solution sizing itself isn't rocket science (though it can be a long process, depending on the number of datasets to be evaluated); it's gathering the input data to get the sizing right that takes the most careful consideration. Once you get that right, you can evaluate a refresh or change to your environment with relative ease.
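To make that 'straightforward mathematical process' concrete, here's a deliberately simplified sketch of how a daily change rate and an annual growth rate feed into a capacity estimate. It assumes a naive model of one full copy plus daily incrementals kept for a retention period, projected to the end of year 3, and it ignores deduplication, compression and anything vendor-specific; the function and parameter names are my own illustration, not any particular product's sizing methodology.

```python
# A deliberately simplified capacity estimate: one full copy plus daily incrementals
# retained for `retention_days`, projected out over `years` of compound annual growth.
# It ignores deduplication, compression and vendor specifics; it only shows where the
# daily change rate and annual growth rate enter the calculation.
def estimate_backup_capacity_gb(front_end_gb: float, daily_change_rate: float,
                                annual_growth_rate: float, retention_days: int = 30,
                                years: int = 3) -> float:
    grown_gb = front_end_gb * (1 + annual_growth_rate) ** years      # size at end of period
    full_copy_gb = grown_gb                                          # one full backup copy
    incrementals_gb = grown_gb * daily_change_rate * retention_days  # retained daily changes
    return full_copy_gb + incrementals_gb

# Example: a 1000GB fileserver with 2% daily change and 10% annual growth, 30-day retention.
print(round(estimate_backup_capacity_gb(1000, 0.02, 0.10), 1))  # ~2129.6 GB
```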
If you found this interesting, be sure to check out Data Protection: Ensuring Data Availability.