This is the fifth and final part of our four part series “Data Lifecycle Management”. (By slipping in an aside article, I can pay homage to Douglas Adams with that introduction.)
So far in data lifecycle management, I’ve discussed:
- The basic data lifecycle;
- The importance of being archived (and deleted);
- The need to delete as well as archive
Now we need to get to our final part – the need to archive rather than just blindly deleting.
You might think that this and the previous article are at odds with one another, but in actual fact, I want to talk about the recklessness of deliberately using a backup system as a safety net to facilitate data deletion rather than incorporating archive into data lifecycle management.
My first introduction to deleting with reckless abaddon was at a University that instituted filesystem quotas, but due to their interpretation of academic freedom, could not institute mail quotas. Unfortunately one academic got the crafty notion that when his home directory filled, he’d create zip files of everything in the home directory and email it to himself, then delete the contents and start afresh. Violá! Pretty soon the notion got around, and suddenly storage exploded.
Choosing to treat a backup system as a safety net/blank cheque for data deletion is really quite a devilishly reckless thing to do. It may seem “smart” since the backup system is designed to recover lost data, but in reality it’s just plain dumb. It creates two very different and very vexing problems:
- Introduces unnecessary recovery risks
- Hides the real storage requirements
In the first instance: if it’s fixed, don’t break it. Deliberately increasing the level of risk in a system is, as I’ve said from the start, a reckless activity. A single backup glitch and poof! that important data you deleted because you temporarily needed more space is never, ever coming back. Here’s an analogy: running out of space in production storage? Solution? Turn off all the mirroring and now you’ve got DOUBLE the capacity! That’s the level of recklessness that I think this process equates to.
The second vexing problem it creates is that it completely hides the real storage requirements for an environment. If your users and/or administrators are deleting required primary data willy-nilly, you don’t ever actually have a real indication of how much storage you really need. On any one day you may appear to have plenty of storage, but that could be a mirage – the heat coming off a bunch of steaming deletes that shouldn’t have been done. This leads to over-provisioning in a particularly nasty way – approving new systems or new databases, etc., thinking there’s plenty of space, when in actual fact, you’ve maybe run out multiple times.
That is, over time, we can describe storage usage and deletion occurring as follows:
This shows very clearly the problem that happens in this scenario – as multiple deletes are done over time to restore primary capacity, the amount of data that is deleted but known to be required later builds to the point where its not physically possible to have all of it residing on primary storage any longer should it be required. All we do is create a new headache while implementing at best a crude workaround.
In fact, in this new age of thin provisioning, I’d suggest that the companies where this is practiced rather than true data lifecycle management have a very big nightmare ahead of them. Users and administrators who are taught data management on the basis of “delete when it’s full” are going to stomp all over the storage in a thin provisioning environment. Instead of being a smart idea to avoiding archive, in a thin provisioning environment this could very well leave storage administrators in a state of breathless consternation – and systems falling over left, right and centre.
And so we come to the end of our data lifecycle discussion, at which point it’s worthwhile revisiting the diagram I used to introduce the lifecycle:
Let me know when you’re all done with it and I’ll archive 🙂