Why Data Classification in Backup is Pointless

You know that saying, “Data is the new oil”? If you work in IT and haven’t heard of it, I’d be surprised. In fact, I’ve cited it a few times myself, including in the second edition of Data Protection: Ensuring Data Availability. The phrase caught on as much as anything because it strikes a chord — for so long, oil was the driving force of the economy, and it seemed reasonable to assume data had supplanted it.

But, here’s the rub: data isn’t really the new oil.

You see, data is irrelevant.

Yes, I’m going to say that again: data is irrelevant.

That is, data without context is irrelevant.

Talking about data as being critical to the business is akin to talking about a DNS server being a business outcome. Unless your business sells DNS services, a DNS server is not a business outcome. Even if your business does sell DNS services, a DNS server is only one aspect of a business outcome, because there’s always more wrapped up in a ‘product’ than a single server, or even a clustered server.

Going all the way back to Enterprise Systems Backup and Recovery, I’ve made it clear that system dependency mappings (essential for disaster recovery and business continuity planning) have to map the dependencies all the way through to the business function.
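To make that concrete, here is a minimal sketch of what such a mapping might look like in Python. Every name in it (the business functions, the applications, the servers) is hypothetical and invented purely for illustration; a real map comes from people who understand the business, not from a dictionary typed into a script.

```python
# Hypothetical dependency map: each key depends on the items in its list.
# All names are illustrative assumptions, not taken from a real environment.
DEPENDS_ON = {
    "customer-billing": ["billing-app"],
    "inventory-tracking": ["inventory-app"],
    "billing-app": ["oracle-cluster", "dns"],
    "inventory-app": ["nas-01", "dns"],
}

# The top of the map: the things the business actually cares about.
BUSINESS_FUNCTIONS = {"customer-billing", "inventory-tracking"}


def systems_for(function, depends_on=DEPENDS_ON):
    """Return every system a business function depends on, directly or not."""
    seen = set()
    stack = list(depends_on.get(function, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited; also guards against cycles
        seen.add(node)
        stack.extend(depends_on.get(node, []))
    return seen


def functions_for(system, depends_on=DEPENDS_ON, functions=BUSINESS_FUNCTIONS):
    """Return the business functions affected if the given system is lost."""
    return {f for f in functions if system in systems_for(f, depends_on)}
```

Asking `functions_for("dns")` in this toy map returns both business functions, which is the point: the DNS server only matters because of what sits above it.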

It may be hard for many of us in IT to accept, but the business doesn’t give a damn about whether the Oracle database is clustered. The business doesn’t give a damn about whether the NAS platform uses RAID-5 or RAID-6. And it certainly doesn’t give a damn about whether systems are patched or not.

If you work in IT, particularly infrastructure or systems administration, and have tried to explain to someone (not in IT) at a party what you do, you know this in your soul. The blank look of despair on their face as they try to extricate themselves from the “What do you do for work?” question may well be seared into your memory.

People who don’t work in IT, particularly leaders in business, care about outcomes. Outcomes for the business. Can revenue for the business be increased? Can the environmental footprint of the business be decreased? By the same token, they’re not really concerned with whether the DNS service or database cluster is recoverable. They want to know whether the customer billing business function is recoverable. They want to know whether the inventory tracking function is available whenever it’s required. They want to know whether the customer ordering platform is always operational.

What powers those functions? Most of them don’t care. To think otherwise is to think that every car owner is intrinsically concerned with the thermal and acoustic characteristics of their muffler.

And because of that, contextually unaware data classification at the time of backup is pointless. The context, of course, being: “what’s the business function of this data?”

If you think a backup environment can intelligently ascertain the business function of your data based on its file extension, metadata or even actual content in ‘real-time’ as data is being read, I have a bridge I’d like to sell you.

I’ve long maintained that data classification is a primary-storage process, but it’s more than that. It’s still a human-centric process, and will remain so for some time to come. Sure, there’s software that’ll tell you whether you have some PCI data lying around. And sure, there’s software that can monitor your network and tell you which systems are talking to what servers and storage platforms, but all of this information requires human intervention and human understanding.

Until you have AI-level software that can monitor your environment and tell you what your business functions are, and then map all of those functions with full dependency analysis back to individual systems, any supposed in-backup classification beyond simple client tagging is a placebo. Anyone who tries to tell you otherwise is trying to sell you a bridge.

There are some things you can’t automate your way out of, at least not with the current level of technology. This is one of those things. You might even say that this is a good example of the difference between things that can be automated and autonomous systems. Automatic data classification requires both the former and the latter, and we don’t have the latter yet.

That’s why it’s essential that data classification is performed where the data is stored, not when the data is copied, and with humans involved. To be truly useful in guiding next-step functionality, that classification needs to be by business function, not by file extension. By classifying data when it’s originally handled or stored, appropriate backup policies can be developed.
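As a rough illustration of what “by business function” could mean in practice, the sketch below looks up a backup policy through a system’s assigned business function rather than through anything about its files. The system names, function names and policy numbers are all made up for the example, and the classification table stands in for decisions that people have recorded where the data lives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupPolicy:
    rpo_hours: int       # maximum tolerable data loss, in hours
    retention_days: int  # how long backup copies are kept


# Hypothetical policies, keyed by business function (not by file type).
# The numbers are placeholders, not recommendations.
POLICY_BY_FUNCTION = {
    "customer-billing": BackupPolicy(rpo_hours=1, retention_days=2555),
    "inventory-tracking": BackupPolicy(rpo_hours=4, retention_days=365),
}

# Hypothetical classification, recorded by people who know the business.
SYSTEM_FUNCTION = {
    "oracle-cluster": "customer-billing",
    "nas-01": "inventory-tracking",
}


def policy_for(system):
    """Look up a system's backup policy via its business function.

    Unclassified systems raise LookupError so that a person has to decide,
    instead of the code guessing from file extensions or content.
    """
    function = SYSTEM_FUNCTION.get(system)
    if function is None:
        raise LookupError(f"{system} has no business function assigned")
    return POLICY_BY_FUNCTION[function]
```

The deliberate choice here is the failure mode: a system nobody has classified gets an error and a human conversation, not a default guessed at backup time.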

2 thoughts on “Why Data Classification in Backup is Pointless”

  1. Hi Preston,

    Thanks for this blog post. Do you have any software recommendations for dependency mapping? I’ve been looking for that for years.

    1. Hi Urgo,

      I don’t delve into data classification software often – it’s very much a speciality on its own.

      Dell EMC’s consulting division leverages some custom/owned software as part of consulting services to help with the dependency mapping for systems and software. As it’s just part of the overall process, I don’t recall the name of the software offhand, I’m sorry.

      Cheers,
      Preston.
