Feb 12 2017
 

On January 31, GitLab suffered a significant incident resulting in data loss. In their own words: the replica of their production database was deleted, the production database was then accidentally deleted as well, and it then turned out their backups hadn’t been running. They got systems back from snapshots, but not without permanently losing some data. This in itself is an excellent example of the need for multiple data protection strategies: your data protection should not represent a single point of failure within the business, so having layered approaches to achieve a variety of retention times, RPOs and RTOs – and to guard against cascading failures – is always critical.

To their credit, they’ve published a comprehensive postmortem and Root Cause Analysis (RCA) of the entire incident (here), and must be applauded for being so open about everything that went wrong – as well as the steps they’re taking to avoid it happening again.

Server on Fire

But I do think some of the statements in the postmortem and RCA require a little more analysis, as they’re indicative of some of the challenges that take place in data protection.

I’m not going to speak to the scenario that led to the production database, rather than the replica, being deleted. This falls into the category of “ooh crap” system administration mistakes that, sadly, many of us will make in our careers. As the saying goes: accidents happen. (I have literally been in the situation of accidentally deleting a production database rather than its replica, and I can well and truly sympathise with any system or application administrator making that mistake.)

Within GitLab’s RCA under “Problem 2: restoring GitLab.com took over 18 hours”, several statements were made that irk me as a long-term data protection specialist:

Why could we not use the standard backup procedure? – The standard backup procedure uses pg_dump to perform a logical backup of the database. This procedure failed silently because it was using PostgreSQL 9.2, while GitLab.com runs on PostgreSQL 9.6.

As evidenced by a later statement (see the next RCA statement below), the procedure did not fail silently; instead, GitLab chose to filter the output of the backup process in a way that they did not monitor. There is, quite simply, a significant difference between “failing silently” and “silently ignoring results”. The latter is a far more accurate description than the former. A command that fails silently is one that exits with no error condition or alert. Instead:

Why did the backup procedure fail silently? – Notifications were sent upon failure, but because of the Emails being rejected there was no indication of failure. The sender was an automated process with no other means to report any errors.

The pg_dump command didn’t fail silently, as previously asserted. It generated output which was silently ignored due to a system configuration error. Yes, a system failed to accept the emails, and a system therefore failed to deliver the emails, but at the end of the day a human failed to notice, or otherwise check, why the backup reports were not being received. This is actually a critical reason why we need zero error policies – in data protection, no error should be allowed to continue without investigation and rectification, and a change in (or absence of) reporting or monitoring data for data protection activities must itself be treated as an error for investigation.
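To make the distinction concrete, here’s a minimal sketch of what “actively check, don’t filter” might look like around a pg_dump based backup. This is purely illustrative Python – the paths, size threshold, database name and alerting hook are my assumptions for the example, not GitLab’s actual tooling:

```python
#!/usr/bin/env python3
"""Minimal sketch of a zero error policy wrapper around a pg_dump backup.

Illustrative only: the paths, size threshold, database name and alerting
hook below are assumptions for the example, not GitLab's actual tooling.
"""
import subprocess
import sys
from pathlib import Path

DUMP_PATH = Path("/backup/db.dump")         # hypothetical destination
MIN_EXPECTED_BYTES = 100 * 1024 * 1024      # hypothetical sanity threshold


def alert(message: str) -> None:
    """Surface the failure somewhere a human cannot avoid seeing it.

    In a real environment this would page someone, raise a ticket or mark
    the job failed in monitoring -- anything but an email that can bounce
    unnoticed.
    """
    print(f"BACKUP FAILURE: {message}", file=sys.stderr)
    sys.exit(1)


def run_backup() -> None:
    result = subprocess.run(
        ["pg_dump", "--format=custom", f"--file={DUMP_PATH}", "production_db"],
        capture_output=True,
        text=True,
    )
    # A non-zero exit is an error that must be investigated, not filtered.
    if result.returncode != 0:
        alert(f"pg_dump exited with {result.returncode}: {result.stderr.strip()}")
    # Output on stderr is treated as an error until a human says otherwise.
    if result.stderr.strip():
        alert(f"pg_dump produced warnings: {result.stderr.strip()}")
    # A missing or implausibly small dump is also a failure, even on exit 0.
    if not DUMP_PATH.exists() or DUMP_PATH.stat().st_size < MIN_EXPECTED_BYTES:
        alert("dump file missing or implausibly small")
    print(f"Backup OK: {DUMP_PATH} ({DUMP_PATH.stat().st_size} bytes)")


if __name__ == "__main__":
    run_backup()
```

The specific checks matter far less than the principle: every anomaly must terminate somewhere a human cannot avoid seeing it, rather than in an email that can silently bounce.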

Why were Azure disk snapshots not enabled? – We assumed our other backup procedures were sufficient. Furthermore, restoring these snapshots can take days.

Simple lesson: If you’re going to assume something in data protection, assume it’s not working, not that it is.

Why was the backup procedure not tested on a regular basis? – Because there was no ownership, as a result nobody was responsible for testing the procedure.

There are two phrases in that answer which should serve as a dire warning: “there was no ownership” and “nobody was responsible”. This is a mistake many businesses make, but I don’t for a second believe there was no ownership. Instead, there was a failure to understand ownership. Looking at the “Team | GitLab” page, I see:

  • Dmitriy Zaporozhets, “Co-founder, Chief Technical Officer (CTO)”
    • From a technical perspective the buck stops with the CTO. The CTO does own the data protection status for the business from an IT perspective.
  • Sid Sijbrandij, “Co-founder, Chief Executive Officer (CEO)”
    • From a business perspective, the buck stops with the CEO. The CEO does own the data protection status for the business from an operational perspective, and by virtue of having the CTO report directly to them.
  • Bruce Armstrong and Villi Iltchev, “Board of Directors”
    • The Board of Directors is responsible for ensuring the business is running legally, safely and financially securely. They indirectly own all procedures and processes within the business.
  • Stan Hu, “VP of Engineering”
    • Vice-President of Engineering, reporting to the CEO. If the CTO sets the technical direction of the company, an engineering or infrastructure leader is responsible for making sure the company’s IT works correctly. That includes data protection functions.
  • Pablo Carranza, “Production Lead”
    • Reporting to the Infrastructure Director (a position currently open). Data protection is a production function.
  • Infrastructure Director:
    • Although currently an open position (with responsibilities falling back to Sid – see above), the Infrastructure Director is another link in the chain of responsibility and ownership for data protection functions.

I’m not calling these people out to shame them, or to rub salt into their wounds – mistakes happen. But I am suggesting GitLab has abnegated its collective responsibility by simply stating “there was no ownership” when, as evidenced by their “Team” page, there was – plenty of it, in fact. It was simply not appropriately understood along the technical lines of the business, and indeed right up into the senior operational lines of the business.

You don’t get to say that no-one owned the data protection functions. Only that no-one understood they owned the data protection functions. One day we might stop having these discussions. But clearly not today.

 

Ransomware is a fact of life

Feb 01 2017
 

The NetWorker usage survey for 2016 has just finished. One of the questions I asked in this most recent survey was as follows:

Has your business been struck by ransomware or other data destructive attacks in the past year?

(_) Yes

(_) No

(_) Don’t know

(_) Prefer not to say

With the survey closed, I wanted to take a sneak peek at the answer to this question.

Ransomware, as many of you would know, is the term coined for viruses and other attacks that leave data erased or encrypted, with prompts to pay a ‘ransom’ in order to get the data back. Some businesses may choose to pay the ransom; others choose not to. If you’ve got a good data protection scheme you can save yourself from a lot of ransomware situations, but the looming threat – which has already occurred in some instances – is ransomware combined with systems penetration, resulting in backup servers being deliberately compromised and data-destructive attacks being made on primary data. I gave an example of EMC’s solution to that sort of devastating 1-2 punch attack last November.

Ransomware is not going away. We recently saw massive numbers of MongoDB databases being attacked, and law enforcement agencies consider it a growing threat and a billion-dollar-a-year (or more) industry for the attackers.

So what’s the story then with NetWorker users and ransomware? There were 159 respondents to the 2016 NetWorker usage survey, and the answer breakdown was as follows:

  • No – 48.43%
  • Don’t know – 11.32%
  • Prefer not to say – 9.43%
  • Yes – 30.82%

An August 2016 article in the Guardian suggested that up to 40% of businesses had been hit by ransomware, and by the end of 2016 other polls were suggesting the number was edging towards 50%.

Ransomware Percentages

I’m going to go out on a limb and suggest that at least 50% of respondents who answered “Prefer not to say” were probably saying so because it had happened and they didn’t want to mention it. (That’s understandable, and very common.) I’ll also go out on a limb and suggest that at least a third of respondents who answered “Don’t know” probably had been hit, but the attack may have been resolved through primary storage or other recovery options that left the individual respondents unaware.

At the very base numbers though, almost 31% of respondents knew they had definitely been hit by ransomware or another data-destructive attack, and with the extrapolations above we might be forgiven for believing the number was closer to 38.9%.
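For transparency, here’s the back-of-envelope arithmetic behind that figure – a sketch only, assuming exactly half of the “prefer not to say” group and exactly a third of the “don’t know” group, rounded down to whole respondents:

```python
# Back-of-envelope check of the extrapolation above.
total = 159
yes = 49          # 30.82% of 159 respondents
prefer_not = 15   # 9.43%
dont_know = 18    # 11.32%

extrapolated = yes + prefer_not // 2 + dont_know // 3   # 49 + 7 + 6 = 62
print(f"{100 * extrapolated / total:.2f}%")             # 38.99% -- roughly 39%
```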

The Guardian article was based on a survey of Fortune 500 senior IT executives, and ransomware at its most efficacious is targeted and combined with other social engineering techniques such as spear phishing – so it’s no wonder the “big” companies report high numbers of incidents: they’re being targeted more deliberately. The respondents to the NetWorker survey, however, came from all geographies and from businesses of all sizes, ranging from a few clients to thousands or more.

Bear in mind that being hit by ransomware is not a case of “lightning never strikes twice”. At a briefing I attended in the USA last year, we were told that one business alone had been hit by more than 270 cases of ransomware since the start of the year. Anecdotally, even those customers of mine who mention having been hit by ransomware talk about it in terms of multiple incidents, not just a single one.

Now as much as ever before, we need robust data protection, and air-gapped data protection for sensitive data – the Isolated Recovery Site (IRS) is something you’ll hear more of as ransomware gets more prevalent.

NetWorker users have spoken – ransomware is a real and tangible threat to businesses around the world.

I’ll be aiming to have the full report published by mid-February, and I’ll contact the winner of the prize at that time too.

How many copies do I need?

May 24 2016
 

So you’ve got your primary data stored on one array and it replicates to another array. How many backup copies do you need?

Copies

There’s no doubt we’re spawning more and more copies and pseudo-copies of our data. So much so that EMC’s new Enterprise Copy Data Management (eCDM) product was announced at EMC World. (For details on that, check out Chad’s blog here.)

With many production data sets spawning anywhere between 4 and 10 copies, and sometimes a lot more, a question that gets asked from time to time is: why would I need to duplicate my backups?

It seems a fair question if you’re using array to array replication, but let’s stop for a moment and think about the different types of data protection being applied in this scenario:

Replication without Cloning

Let’s say we’ve got two sites, production and disaster recovery, and for the sake of simplicity, a single SAN at each site. The two SANs replicate between one another. Backups are taken at one of the sites – in this example, the production site. There’s no duplication of the backups.

Replication is definitely a form of data protection, but its primary purpose is to provide a degree of fault tolerance – not true fault tolerance of course (that requires more effort), but the idea is that if the primary array is destroyed, there’s a copy of the data on the secondary array and it can take over production functions. Replication can also factor into maintenance activities – if you need to repair, update or even replace the primary array, you can failover operations to the secondary array, work on the primary, then fail back when you’re ready.

In the world of backups there’s an old saying however: nothing corrupts faster than a mirror. The same applies to replication…

“Ahah!”, some interject at this point, “What if the replication is asynchronous? That means if corruption happens in the source array we can turn off replication between the arrays! Problem solved!”

Over a decade ago I met an IT manager who felt the response to a virus infecting his network would be to have an operator run into the computer room and use an axe to quickly chop all the network connections away from the core switches. That might actually be more successful than relying on noticing corruption ahead of asynchronous replication windows and disconnecting replication links.

So if there’s corruption in the primary array that infects the secondary array – that’s no cause for concern, right? After all, there’s a backup copy sitting there waiting and ready to be used. Not quite: replication isn’t just for minor types of fault tolerance or being able to switch production during maintenance operations, it’s also for those really bad disasters, such as something taking out your datacentre – and if the datacentre you lose is the one holding your only backup copies, those backups are gone along with the primary array.

At this point it’s common to ‘solve’ the problem by moving the backups onto the secondary site (even if they run cross-site), creating a configuration like the following:

Replication, cross site backup

The thinking goes like this: if there’s a disaster at the primary site, the disaster recovery site not only takes over, but all our backups are there waiting to be used. If there’s a disaster at the disaster recovery site instead, then no data has been lost because all the data is still sitting on the production array.

Well, that’s only true in one very special circumstance: if you only need to keep backups for one day.

Backups typically offer reasonably poor RPO and RTO compared to things like replication, continuous data protection, continuous availability, snapshots, etc. But they do offer historical recoverability often essential to meet compliance requirements. Having to provide a modicum of recoverability for 7 years is practically the default these days – medical organisations typically have to retain data for the life of the patient, engineering companies for the lifespan of the construction, and so on. That’s not all backups of course – depending on your industry you’ll likely generate your long term backups either from your monthlies or your yearlies.

Aside: The use of backups to facilitate long term retention is a discussion that’s been running for the 20 years I’ve been working in data protection, and that will still be going in a decade or more. There are strong, valid arguments for using archive to achieve long term retention, but archive requires a data management policy, something many companies struggle with. Storage got cheap and the perceived cost of doing archive created a strong sense of apathy that we’re still dealing with today. Do I agree with that apathy? No, but I still have to deal with the reality of the situation.

So let’s revisit the failure scenarios that can arise with off-site backups but no backup duplication:

  • If there’s a disaster at the primary site, the disaster recovery site takes over, and all backups are preserved
  • If there’s a disaster at the secondary site, the primary site is unaffected, but the replica of production data and all backups are lost – both the short term operational recovery backups and the longer term compliance/legal retention backups

Is that a risk worth taking? I had a friend move interstate recently. The day after he moved in, his neighbour’s house burnt down. The fire spread to his house and destroyed most of his possessions. He’d been planning on getting his contents insurance updated the day of the fire.

Bad things happen. Taking the risk that you won’t lose your secondary site isn’t really operational planning, it’s casting your fate to the winds and relying on luck. The solution below though doesn’t rely on luck at all:

Replication and Duplicated Backups

There’s undoubtedly a cost involved; each copy of your data has a tangible cost, regardless of whether it’s a primary or a secondary copy. Are there some backups you won’t copy? That depends on your requirements: there may, for instance, be test systems you need to back up but don’t need a secondary copy of. Such decisions still have to be made on a risk vs cost basis.
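If it helps, the decision being argued for here reduces to a very simple test: does the loss of any single site leave you with at least one surviving backup copy? A small illustrative sketch of that test in Python, with an entirely made-up copy inventory:

```python
# Illustrative only: does the loss of any single site destroy every backup copy?
copies = [
    {"name": "production data", "site": "production", "type": "primary"},
    {"name": "replica",         "site": "dr",         "type": "replica"},
    {"name": "backup",          "site": "production", "type": "backup"},
    {"name": "backup clone",    "site": "dr",         "type": "backup"},  # remove this line to model the single-backup design
]

sites = {c["site"] for c in copies}
for lost_site in sites:
    surviving_backups = [c for c in copies
                         if c["type"] == "backup" and c["site"] != lost_site]
    status = "OK" if surviving_backups else "ALL BACKUPS LOST"
    print(f"Loss of {lost_site} site: {status}")
```

Run against either of the earlier “backups at one site only” designs, the answer for one of the two sites is always “all backups lost”.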

Replication is all well and good, but it’s not a get-out-of-gaol-free card for avoiding cloned backups.

Dec 05 2011
 

Backup Metrics

When I discuss backup and recovery success metrics with customers, the question that keeps coming up is “what are desirable metrics to achieve?” I.e., if you were to broadly look at the data protection industry, what should we consider to be suitable metrics to aim for?

Bearing in mind I preach at the altar of Zero Error Policies, one might think that my aim is a 100% success rate for backups, but this isn’t quite the case. In particular, I recognise that errors will periodically occur – the purpose of a zero error policy is to eliminate repetitive errors and ensure that no error goes unexplained. It is not, however, a blanket requirement that no error ever happens.

So what metrics do I recommend? They’re pretty simple:

  • Recoveries – 100% of recoveries should succeed.
  • Backups – 95-98% of backups should succeed.

That’s right – 100% of recoveries should succeed. Ultimately it doesn’t matter how successful (or apparently successful) your backups are; it’s the recoveries that matter. Remembering that we equate data protection to insurance policies, you can see that the goal is that 100% of “insurance claims” can be fulfilled.

Since 100% of recoveries should succeed, that metric is easy enough to understand – for every one recovery done, one recovery must succeed.

For backups though, we have to consider what constitutes a backup. In particular, if we consider this in terms of NetWorker, I’d suggest that you want to consider each saveset as a backup. As such, you want 95-98% of savesets to succeed.

This makes it relatively easy to confirm whether you’re meeting your backup targets. For instance, if you have 20 Linux hosts in your backup environment (including the backup server), and each host has 4 filesystems, then you’ll have around 102 savesets on a nightly basis:

  • 20 x 4 filesystems = 80 savesets
  • 20 index savesets
  • 1 bootstrap saveset
  • 1 NMC database saveset

98% of 102 is 100 savesets (rounded), and 95% of 102 is 97 savesets, rounded. I specify a range there because on any given day it should be OK to hit the low mark, so long as a rolling average hits the high mark or, at bare minimum, sits comfortably between the low and the high mark for success rates. Of course, this is again tempered by the zero error policy guidelines; effectively, as much as possible, those errors should be unique or non-repeating.
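As a rough illustration of how those numbers might be tracked – and this is only a sketch in Python, not tied to any actual NetWorker reporting interface; the environment size and thresholds are the example figures above:

```python
# Sketch only: evaluating saveset success against the 95-98% range,
# using a rolling average rather than a single night's run.
from collections import deque

SAVESETS_PER_NIGHT = 102      # 80 filesystem + 20 index + bootstrap + NMC
LOW_WATERMARK = 0.95
HIGH_WATERMARK = 0.98
ROLLING_DAYS = 14

history = deque(maxlen=ROLLING_DAYS)   # successful-saveset counts per night


def assess_night(successful: int) -> str:
    history.append(successful)
    nightly = successful / SAVESETS_PER_NIGHT
    rolling = sum(history) / (len(history) * SAVESETS_PER_NIGHT)
    if nightly < LOW_WATERMARK:
        return f"FAIL: {nightly:.1%} tonight is below the {LOW_WATERMARK:.0%} floor"
    if rolling < HIGH_WATERMARK:
        return (f"WARN: {nightly:.1%} tonight is acceptable, but the "
                f"{ROLLING_DAYS}-day average ({rolling:.1%}) should trend "
                f"towards {HIGH_WATERMARK:.0%}")
    return f"OK: {nightly:.1%} tonight, {rolling:.1%} rolling average"


# Example: 97 successful savesets meets the low watermark for a single night.
print(assess_night(97))
```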

You might wonder why I don’t call for a 100% success rate with backups. Quite frankly, much as that might be highly desirable, the nature of a backup system – touching so many parts of an operating IT environment – also makes it one of the most vulnerable systems to unexpected events. You can design the hell out of a backup system, but you’ll still get an error if a client crashes mid-way through its backup, or a tape drive fails. So what that 2-5% failure allowance actually covers are the “nature of the beast” failures: hardware issues, Murphy’s Law, and OS/software issues.

Those are metrics you not only can depend on, but you should depend on, too.

Jun 21 2011
 

In “Distribute.IT reveals shared server data loss – News – iTnews Mobile Edition” (June 21, 2011), we’re told:

Distribute.IT has revealed that production data and backups for four of its shared servers were erased in a debilitating hack on its systems over a week ago.

“In assessing the situation, our greatest fears have been confirmed that not only was the production data erased during the attack, but also key backups, snapshots and other information that would allow us to reconstruct these Servers from the remaining data,” the company reported.

You may think I’m about to say the fault lies solely with the hack – and anyone conducting such a malicious attack is certainly being particularly unpleasant. But the simple truth is that such an attack should not be capable of rendering a company unable to recover its data.

It suggests multiple design failures on the part of Distribute.IT:

  • Backups were not physically isolated; regardless of whether an attacker can erase the current backup, or all the backups on nearline storage, there should be backup copies that are sent off-site, beyond the reach of such an attack;
  • Alternatively, if there were offsite, physically isolated backups, they were not sufficiently secured;
  • Retention policies seem inappropriately short; why could they not recover from, say, a week ago, or two weeks ago? The loss of some data, even under a sustained hack, should be somewhat reversible if longer-term backups can be recovered from. Instead, we’re told: “we have been advised by the recovery teams that the chances for recovery beyond the data and files so far retrieved are slim”.

It’s also worth noting that this goes to demonstrate a worst case scenario about snapshots – they’re typically reliant on some preservation of original data (either running disks, or ensuring that the amount of data deleted/corrupted doesn’t exceed snapshot capacity).

I’m not crowing about data loss – I completely sympathise with Distribute.IT on this incident. However, it is undoubtedly the case that with an appropriately designed backup system, this level of data destruction should not have happened to them.

Apr 16 2011
 

Martin Glassborow, aka @storagebod, and I had a bit of a discussion via Twitter, which came down to the following:

  • Martin feels the default backup policy within an environment should be to backup nothing;
  • I feel the default backup policy within an environment should be to backup everything.

Now the interesting thing is, we both actually meet in the middle, but just start from different points.

Martin has discussed his reasoning behind his default policy here, in “Don’t BackUp“, which I encourage you to read before continuing. There is, indeed, as Martin suggested in a tweet to me last night, a nice absolutism in either approach – don’t backup, or backup everything. Yet, neither is really the case.

My approach – starting from “backup everything” – rests on the following assumptions:

  1. Hardware can fail.
  2. Software can fail.
  3. Humans can make errors.
  4. Processes can fail.

By my very nature I think I’m perfectly suited to working in the backup space. I’ve always been into backup. On the Vic-20, when I was learning to program, I’d always save my programs onto two different tapes. On the Commodore 64, I’d always save my programs and documents onto two different disks. When I went to the PC, I’d always have a copy on a hard drive, and a copy on a floppy drive.

Martin’s approach is this:

Making it policy that nothing gets backed-up unless requested takes out all ambiguity. There can be no assumptions about what is being backed-up, it makes it someone’s responsibility as opposed to an assumed default.

There is, undoubtedly, logic in what Martin suggests, but it’s not a logical starting point I can personally reconcile myself with, for the fundamental reason that it (IMHO) assumes that everyone who interacts with the system understands the system and the nature of their interaction.

It in fact runs completely contrary to an axiom in user desktop/laptop backup approaches – if you leave backups up to the users, nothing will get backed up. That holds true for pretty much every business I’ve ever interacted with, from the most technical to the least.

It’s for that reason, that lack of total systems awareness and data responsibility from all users of any environment, that my approach starts from the other end. Backup everything.

But I don’t really mean it. I abhor wastage. Recently, I’ve learnt that wastage comes in many forms, which is why the decision to move interstate and re-evaluate what I/we own has been cleansing. (See the article “deconstruction of falling stars” over at my personal blog for a bit more on that front.)

As I abhor wastage, I don’t actually believe you should backup everything within your environment. Sure, some vendors might like that notion – infinite tapes, disk, storage, snapshots, you name it. But it’s neither practical nor commercial reality to do this.

No, there is a middle ground. For me, the sweet spot is what I always come back to:

It is always better to backup a little more than you need, and waste some storage media, than it is to not backup quite enough, and be unable to recover.

So if your tape usage is say, 5-10% higher than it should be, or your VTL/B2D environment is 5-10% bigger than it really needs to be, I’m not concerned. (If it’s a crazy amount, like 100% more, then there’s a problem – a serious problem that has arisen from a lack of capacity planning, etc.)

I’ve seen IT sites where NetWorker agents have been deployed on every server within the environment, and when I’ve done a coverage analysis, I’ve seen servers that have this as the saveset:

/etc/hosts

Just that. Nothing more, nothing less. (You couldn’t get much less anyway.) I’ve equally seen sites where not only was a hot backup done of the production Oracle database via a module, but the database files were backed up as part of the filesystem backup, and then export/dumps were generated and backed up as well. Overkill? Yes. Were some backups unrecoverable? Yes.

Both are very clear examples of wastage, but I’ll tell you the difference.

The latter – backing up too much – is wastage of time and money. Neither is pleasant, and both can hurt the bottom line of a company, yet that’s where it stops.

The former – backing up only what is explicitly requested, nothing more – is corporate wastage. There’s a little monetary wastage involved (why spend the money on an agent to back up a single file?), but the real wastage is that it could waste the company. Unable to recover legally required files because someone forgot to request they be backed up? Hello, lost lawsuit. Unable to recover financial data that proves your company has correctly paid its taxes because someone forgot to request it be backed up? Hello, double tax payments. For me it triggers thoughts of every possible nightmare scenario a company might experience, right through to total dissolution and loss of the company itself.

In my book, I make the differentiation between what I call inclusive and exclusive backup products. I define:

  • An inclusive backup product is one where you have to explicitly specify what gets backed up. By default, nothing is backed up unless you specify it.
  • An exclusive backup product is one where you have to explicitly specify what doesn’t get backed up. By default, everything is selected and you have to winnow that selection down yourself.

The first I consider the hallmark of a workgroup backup product, where cost reduction is the primary focus. The second I consider a fundamental requirement for a product to earn the “enterprise backup product” badge of honour. Without it, there is a distinct lack of trust.
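To illustrate the difference in defaults, here’s a trivial Python sketch – no actual product’s configuration syntax is implied, and the filesystem names are made up:

```python
# Trivial sketch of inclusive vs exclusive selection defaults.
def inclusive_selection(requested: set[str]) -> set[str]:
    """Inclusive model: only what someone remembered to request is protected."""
    return requested


def exclusive_selection(all_filesystems: set[str], excluded: set[str]) -> set[str]:
    """Exclusive model: everything is protected unless deliberately excluded."""
    return all_filesystems - excluded


filesystems = {"/", "/var", "/home", "/tmp", "/oracle/data"}

# Inclusive: a forgotten filesystem is silently unprotected.
print(inclusive_selection({"/etc/hosts"}))          # -> {'/etc/hosts'}

# Exclusive: a forgotten exclusion merely wastes a little media.
print(exclusive_selection(filesystems, {"/tmp"}))   # -> the other four filesystems
```

The asymmetry in failure modes is the whole point: forgetting something under the inclusive model costs you data; forgetting something under the exclusive model costs you a little media.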

While I can understand Martin’s starting point – and that he, too, moves towards the middle of making sure the right things are backed up – I can’t agree that it’s the best approach.

I’ve seen, heard of, and witnessed too many IT war stories.

Apr 15 2011
 

In the past I’ve talked about the importance of having zero error policies.

In “What is a zero error policy?“, I said:

Having a zero error policy requires the following three rules:

1. All errors shall be known.

2. All errors shall be resolved.

3. No error shall be allowed to continue to occur indefinitely.

If you’ve not read that article, I suggest you go read it, as well as the follow-up article, “Zero error policy management“.

I’m going to make, and stand by, with fervid determination, the following assertion:

If you do not have a zero-error policy for your backups, you do not have a valid backup system.

No ifs, no buts, no maybes, no exceptions.

Why? Because across all the sites I’ve seen, regardless of size, regardless of complexity, the only ones that actually work properly are those where every error is captured, identified, and dealt with. Only at those sites would I point and say “they have every chance of meeting their SLAs”.
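What does “captured, identified, and dealt with” look like in practice? At minimum, something like an error register. The sketch below is Python with entirely hypothetical thresholds and structure; it simply applies the three rules – record every error, require explicit resolution, and escalate anything that repeats or goes stale:

```python
# Sketch of an error register enforcing the three zero error policy rules.
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class BackupError:
    client: str
    description: str
    first_seen: date
    occurrences: int = 1
    resolved: bool = False


MAX_OPEN_DAYS = 7       # hypothetical escalation threshold
MAX_OCCURRENCES = 3

register: dict[tuple[str, str], BackupError] = {}


def record_error(client: str, description: str, today: date) -> None:
    """Rule 1: all errors shall be known -- nothing is filtered or discarded."""
    key = (client, description)
    if key in register and not register[key].resolved:
        register[key].occurrences += 1
    else:
        register[key] = BackupError(client, description, today)


def escalations(today: date) -> list[BackupError]:
    """Rules 2 and 3: unresolved, repeating or stale errors demand attention."""
    return [
        err for err in register.values()
        if not err.resolved
        and (err.occurrences >= MAX_OCCURRENCES
             or today - err.first_seen > timedelta(days=MAX_OPEN_DAYS))
    ]
```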

In my book, I introduce the notion that just deploying software and thinking you have a backup system is like making a sacrifice to a volcano. So, without a zero error policy, what does a network diagram of your IT environment look like?

It looks like this:

Network diagram of backup environment without zero error policies

37 Signals and the “end of the IT department”

Mar 02 2011
 

The folks over at 37 Signals published a little piece of what I would have to describe as crazy fiction, about how the combination of cloud and more technically savvy users means that we’re now seeing the end of the IT department.

I thought long and hard about writing a rebuttal here, but quite frankly, their lack of logic made me too mad to publish the article on my main blog, where I try to be a little more polite.

So, if you don’t mind a few strong words and want to read a rebuttal to 37 Signals, check out my response here.

Jan 14 2011
 

This is the fifth and final part of our four part series “Data Lifecycle Management”. (By slipping in an aside article, I can pay homage to Douglas Adams with that introduction.)

So far in data lifecycle management, I’ve discussed:

Now we need to get to our final part – the need to archive rather than just blindly deleting.

You might think that this and the previous article are at odds with one another, but in actual fact, I want to talk about the recklessness of deliberately using a backup system as a safety net to facilitate data deletion rather than incorporating archive into data lifecycle management.

My first introduction to deleting with reckless abandon was at a university that instituted filesystem quotas but, due to its interpretation of academic freedom, could not institute mail quotas. Unfortunately one academic got the crafty notion that when his home directory filled, he’d create zip files of everything in it, email them to himself, then delete the contents and start afresh. Voilà! Pretty soon the notion got around, and suddenly storage exploded.

Choosing to treat a backup system as a safety net/blank cheque for data deletion is really quite a devilishly reckless thing to do. It may seem “smart” since the backup system is designed to recover lost data, but in reality it’s just plain dumb. It creates two very different and very vexing problems:

  • Introduces unnecessary recovery risks
  • Hides the real storage requirements

In the first instance: if it’s fixed, don’t break it. Deliberately increasing the level of risk in a system is, as I’ve said from the start, a reckless activity. A single backup glitch and poof! that important data you deleted because you temporarily needed more space is never, ever coming back. Here’s an analogy: running out of space in production storage? Solution? Turn off all the mirroring and now you’ve got DOUBLE the capacity! That’s the level of recklessness that I think this process equates to.

The second vexing problem it creates is that it completely hides the real storage requirements for an environment. If your users and/or administrators are deleting required primary data willy-nilly, you don’t ever actually have a real indication of how much storage you really need. On any one day you may appear to have plenty of storage, but that could be a mirage – the heat coming off a bunch of steaming deletes that shouldn’t have been done. This leads to over-provisioning in a particularly nasty way – approving new systems or new databases, etc., thinking there’s plenty of space, when in actual fact you may have effectively run out multiple times already.

That is, over time, we can describe storage usage and deletion occurring as follows:

Deleting with reckless abandon

This shows very clearly the problem that happens in this scenario – as multiple deletes are done over time to restore primary capacity, the amount of data that has been deleted but is known to be required later builds to the point where it’s no longer physically possible for all of it to reside on primary storage should it be required. All we do is create a new headache while implementing, at best, a crude workaround.
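To put some (entirely made-up) numbers on that, here’s a small Python sketch of how the visible usage and the real requirement diverge once “delete to free space” becomes the habit:

```python
# Illustrative only: made-up growth figures showing how "delete to free space"
# hides the real capacity requirement over time.
CAPACITY_TB = 100

used = 70.0                # data currently on primary storage
deleted_but_needed = 0.0   # data removed to free space, but still required

for month, growth in enumerate([8, 9, 10, 11, 12], start=1):
    used += growth
    if used > CAPACITY_TB:
        # The 'solution': delete enough to get back under capacity.
        freed = used - CAPACITY_TB + 5
        used -= freed
        deleted_but_needed += freed
    real_requirement = used + deleted_but_needed
    print(f"Month {month}: array shows {used:.0f} TB used, "
          f"but the real requirement is {real_requirement:.0f} TB")
```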

In fact, in this new age of thin provisioning, I’d suggest that the companies where this is practiced rather than true data lifecycle management have a very big nightmare ahead of them. Users and administrators who are taught data management on the basis of “delete when it’s full” are going to stomp all over the storage in a thin provisioning environment. Instead of being a smart idea to avoiding archive, in a thin provisioning environment this could very well leave storage administrators in a state of breathless consternation – and systems falling over left, right and centre.

And so we come to the end of our data lifecycle discussion, at which point it’s worthwhile revisiting the diagram I used to introduce the lifecycle:

Data Lifecycle

Let me know when you’re all done with it and I’ll archive 🙂

Jul 27 2010
 

I’m not a storage geek – storage to me is a means to an end, almost irrelevant to the final goal.

I’m passionate about backup though, because backup is about making people happy.

Backup is about recovery, you see.

Recovery is about making sure people can go home on time rather than re-entering lost data all night.

Recovery is about knowing someone can turn up for a flight they booked six weeks earlier and know the airline still knows they booked the ticket.

Recovery is about knowing someone’s pay deposit isn’t lost after a brief systems hiccup.

Recovery is about a student saving a 50,000 word thesis on a server and knowing it will still be there next morning.

Recovery is about being able to look at digital photos of a loved one ten years after they’re gone.

I have the best job in the world.

If you work in backup and recovery, so do you.
