
Would you buy a dangerbase?

Backup theory, Policies
Jun 07 2017

Databases. They’re expensive, aren’t they?

What if I sold you a Dangerbase instead?

What’s a dangerbase!? I’m glad you asked. A dangerbase is functionally almost exactly the same as a database, except it may be a little bit more lax when it comes to controls. Referential integrity might slip. Occasionally an insert might accidentally trigger a background delete. Nothing major, though. It costs twenty percent less, with only four times the risk of one of those pesky ‘databases’! (Oh, you might need 15% more infrastructure to run it on, but you don’t have to worry about that until implementation.)

Dangerbases. They’re the next big thing. They have a marketshare that’s doubling every two years! Two years! (Admittedly that means they’re just at 0.54% marketshare at the moment, but that’s double what it was last year!)

A dangerbase is a stupid idea. Who’d trust storing their mission critical data in a dangerbase? The idea is preposterous.

Sadly, dangerbases get considered all too often in the world of data protection.


What’s a dangerbase in the world of data protection? Here are just a few examples:

  • Relying solely on an on-platform protection mechanism. Accidents happen. Malicious activities happen. You always need to ensure you’ve got a copy of your data outside the original production platform it’s created and maintained on, regardless of what protection you’ve got in place there. And at least one instance of each copy should be in a different physical location to the original.
  • Not duplicating your backups. Whether you call it a clone or a copy or a duplication doesn’t matter to me here – it’s the effect we’re looking for, not individual product nomenclature. If your backup isn’t copied, it means your backup represents a single point of failure in the recovery process.
  • Using post-process deduplication. (That’s something I covered in detail recently.)
  • Relying solely on RAID when you’re doing deduplication. Data Invulnerability Architecture (DIA) isn’t just a buzzterm, it’s essential in a deduplication environment.
  • Turning your databases into dangerbases by doing “dump and sweep”. Plugins have existed for decades. Dump and sweep is an expensive waste of primary storage space and introduces a variety of risk into your data protection environment.
  • Not having a data lifecycle policy! Without it, you don’t have control over capacity growth within your environment. Without that, you’re escalating your primary storage costs unnecessarily, and placing strain on your data protection environment – strain that can easily break it.
  • Not having a data protection advocate, or data protection architect, within your organisation. If data is the lifeblood of a company’s operations, and information is money, then failing to have a data protection architect/advocate within the organisation is like not bothering to have finance people.
  • Not having a disaster recovery policy that integrates into a business continuity policy. DR is just one aspect of business continuity, but if it doesn’t actually slot into the business continuity process smoothly, it’s as likely to hinder as to help the company.
  • Not understanding system dependencies. I’ve been talking about system dependency maps or tables for years. Regardless of what structure you use, the net effect is the same: the only way you can properly protect your business services is to know what IT systems they rely on, and what IT systems those IT systems rely on, and so on, until you’re at the root level.
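To make that last point concrete, here’s a minimal sketch of what a machine-readable system dependency map might look like. The service names and relationships are entirely hypothetical – the point is simply that once dependencies are captured as data, working out everything a business service ultimately relies on becomes a trivial traversal rather than guesswork.

```python
# Hypothetical dependency map: each service lists the systems it directly relies on.
dependencies = {
    "online-store": ["web-cluster", "payments-api"],
    "payments-api": ["payments-db", "auth-service"],
    "web-cluster": ["product-db", "auth-service"],
    "auth-service": ["directory-server"],
    "payments-db": ["san-storage"],
    "product-db": ["san-storage"],
    "directory-server": [],
    "san-storage": [],
}

def all_dependencies(service, seen=None):
    """Return every system a business service ultimately relies on."""
    if seen is None:
        seen = set()
    for dep in dependencies.get(service, []):
        if dep not in seen:
            seen.add(dep)
            all_dependencies(dep, seen)
    return seen

# Everything in this list needs a protection (and recovery) plan before
# "online-store" can honestly be called protected.
print(sorted(all_dependencies("online-store")))
```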

That’s just a few things, but hopefully you understand where I’m coming from.

I’ve been living and breathing data protection for more than twenty years. It’s not just a job, it’s genuinely something I’m passionate about. It’s something everyone in IT needs to be passionate about, because it can literally make the difference between your company surviving and failing in a disaster situation.

In my book, I cover all sorts of considerations and details from the technical side of the equation, but the technology in any data protection solution is just one aspect of a very multi-faceted approach to ensuring data availability. If you want to take data protection within your business up to the next level – if you want to avoid having the data protection equivalent of a dangerbase in your business – check my book out. (And in the book there’s a lot more detail about integrating into IT governance and business continuity, a thorough coverage of how to work out system dependencies, and all sorts of details around data protection advocates and the groups that they should work with.)

Architecture Matters: Protection in the Cloud (Part 2)

Architecture
Jun 05 2017

(Continued from Part 1.)

Particularly when we think of IaaS style workloads in the Cloud, there are two key approaches that can be used for data protection.

The first is snapshots. Snapshots fulfil part of a data protection strategy, but we do always need to remember with snapshots that:

  • They’re an inefficient storage and retrieval model for long-term retention
  • Cloud or not, they’re still essentially on-platform

As we know – and it’s something I cover in my book quite a bit – a real data protection strategy will be multi-layered. Snapshots can undoubtedly provide options around meeting fast RTOs and minimal RPOs, but traditional backup systems will deliver sufficient recovery granularity for protection copies stretching back weeks, months or years.

Stepping back from data protection itself – public cloud is a very different operating model to traditional  in-datacentre infrastructure spending. The classic in-datacentre infrastructure procurement process is an up-front investment designed around 3- or 5-year depreciation schedules. For some businesses that may mean a literal up-front purchase to cover the entire time-frame (particularly so when infrastructure budget is only released for the initial deployment project), and for others with more fluid budget options, there’ll be an investment into infrastructure that can be expanded over the 3- or 5-year solution lifetime to meet systems growth.

Cloud – public Cloud – isn’t costed or sold that way. It’s a much smaller billing window and costing model; use a GB of RAM, pay for a GB of RAM. Use a GHz of CPU, pay for a GHz of CPU. Use a GB of storage, pay for a GB of storage. Public cloud costing models often remind me of Master of the House from Les Miserables, particularly this verse:

Charge ’em for the lice, extra for the mice
Two percent for looking in the mirror twice
Here a little slice, there a little cut
Three percent for sleeping with the window shut
When it comes to fixing prices
There are a lot of tricks I knows
How it all increases, all them bits and pieces
Jesus! It’s amazing how it grows!

Master of the House, Les Miserables.

That’s the Cloud operating model in a nutshell. Minimal (or no) up-front investment, but you pay for every scintilla of resource you use – every day or month.

If you, say, deploy a $30,000 server into your datacentre, you then get to use it as much or as little as you want, without any further costs beyond power and cooling*. With Cloud, you won’t be paying that $30,000 initial fee, but you will pay for every MHz, KB of RAM and byte of storage consumed within every billing period.
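As a very rough illustration of the difference, here’s a back-of-the-envelope comparison between an up-front purchase and a pay-as-you-go instance. Every number below is invented for the example – it’s not real cloud pricing – but it shows why utilisation, not just raw capability, drives the cloud bill.

```python
# Hypothetical figures only - substitute your own hardware quote and cloud rate card.
server_capex = 30_000            # up-front purchase, depreciated over 5 years
capex_per_month = server_capex / (5 * 12)

cloud_rate_per_hour = 0.50       # assumed hourly rate for a comparable instance
hours_per_month = 730

for utilisation in (1.00, 0.50, 0.25):
    cloud_per_month = cloud_rate_per_hour * hours_per_month * utilisation
    print(f"Utilisation {utilisation:4.0%}: "
          f"cloud ~${cloud_per_month:,.0f}/month vs capex ~${capex_per_month:,.0f}/month")

# The crossover point depends entirely on how streamlined your usage is -
# which is exactly why optimisation matters so much in public cloud.
```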

If you want Cloud to be cost-effective, you have to be able to optimise – you have to effectively game the system, so to speak. Your in-Cloud services have to be maximally streamlined. We’ve become inured to resource wastage in the datacentre because resources have been cheap for a long time. RAM size/speed grows, CPU speed grows, as does the number of cores, and storage – well, storage seems to have an infinite expansion capability. Who cares if what you’re doing generates 5 TB of logs per day? Information is money, after all.

To me, this is just the next step in the somewhat lost art of programmatic optimisation. I grew up in the days of 8-bit computing**, and we knew back then that CPU, RAM and storage weren’t infinite. This didn’t end with 8-bit computing, though. When I started in IT as a Unix system administrator, swap file sizing, layout and performance was something that formed a critical aspect of your overall configuration, because if – Jupiter forbid – your system started swapping, you needed a fighting chance that the swapping wasn’t going to kill your performance. Swap file optimisation was, to use a Bianca Del Rio line, all about the goal: “Not today, Satan.”

That’s Cloud, now. But we’re not so much talking about swap files as we are resource consumption. Optimisation is critical. A failure to optimise means you’ll pay more. The only time you want to pay more is when what you’re paying for delivers a tangible, cost-recoverable benefit to the business. (I.e., it’s something you get to charge someone else for, either immediately, or later.)


If we think about backup, it’s about getting data from location A to location B. In order to optimise it, you want to do two distinct things:

  • Minimise the number of ‘hops’ that data has to make in order to get from A to B
  • Minimise the amount of data that you need to send from A to B.

If you don’t optimise that, you end up with a ‘classic’ backup architecture of the sort we used to rely on so heavily in the 90s and early 00s, such as:

Cloud Architecture Matters 1

(In this case I’m looking just at backup services that land data into object storage. There are situations where you might want higher performance than what object offers, but let’s stick just with object storage for the time being.)

I don’t think this diagram is actually good at giving the full picture. There’s another way I like to draw the diagram, and it looks like this:

Cloud Architecture Matters 2

In the Cloud, you’re going to pay for the systems you’re running for business purposes no matter what. That’s a cost you have to accept, and the goal is to ensure that whatever services or products you’re on-selling to your customers using those services will pay for the running costs in the Cloud***.

You want to ensure you can protect data in the Cloud, but sticking to architectures designed at the time of on-premises infrastructure – and physical infrastructure at that – is significantly sub-optimal.

Think of how traditional media servers (or in NetWorker parlance, storage nodes) needed to work. A media server is designed to be a high performance system that funnels data coming from clients to protection storage. If a backup architecture still heavily relies on media servers, then the cost in the Cloud is going to be higher than you need it – or want it – to be. That gets worse if a media server needs to be some sort of highly specced system encapsulating non-optimised deduplication. For instance, one of NetWorker’s competitors publishes the hardware requirements for its deduplication media servers on its website, and I’ve taken the following specifications directly from there. To work with just 200 TB of storage allocated for deduplication, a media server for that product needs:

  • 16 CPU Cores
  • 128 GB of RAM
  • 400 GB SSD for OS and applications
  • 2 TB of SSD for deduplication databases
  • 2 TB of 800 IOPS+ disk (SSD recommended in some instances) for index cache

For every 200 TB. Think on that for a moment. If you’re deploying systems in the Cloud that generate a lot of data, you could very easily find yourself having to deploy multiple systems such as the above to protect those workloads, in addition to the backup server itself and the protection storage that underpins the deduplication system.

Or, on the other hand, you could work with an efficient architecture designed to minimise the number of data hops, and minimise the amount of data transferred:

CloudBoost Workflow

That’s NetWorker with CloudBoost. Unlike that competitor, a single CloudBoost appliance doesn’t just allow you to address 200 TB of deduplication storage, but 6 PB of logical object storage. 6 PB, not 200 TB. All that using 4 – 8 CPUs and 16 – 32 GB of RAM, and with a metadata sizing ratio of 1:2000 (i.e., every 100 GB of metadata storage allows you to address 200 TB of logical capacity). Yes, the metadata will ideally sit on SSD, but noticeably less of it than the competitor’s media server requires – and with a significantly greater addressable range.
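To put that 1:2000 ratio into perspective, here’s a quick sizing calculation. The ratio and the capacities quoted come from the figures above; treat the arithmetic as indicative sizing only, not a formal sizing exercise.

```python
# Metadata sizing at a 1:2000 ratio: 100 GB of metadata addresses 200 TB of logical capacity.
METADATA_RATIO = 1 / 2000

def metadata_required_tb(logical_capacity_tb):
    """Approximate metadata storage (TB) needed for a given logical capacity (TB)."""
    return logical_capacity_tb * METADATA_RATIO

for logical_tb in (200, 1024, 6 * 1024):    # 200 TB, 1 PB, 6 PB
    print(f"{logical_tb:>6} TB logical -> ~{metadata_required_tb(logical_tb):.2f} TB metadata")

# Even at the full 6 PB addressable range, the metadata footprint stays in the low
# terabytes - compare that with the 4 TB+ of SSD per 200 TB in the spec list above.
```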

NetWorker and CloudBoost can do that because the deduplication workflow has been optimised. In much the same way that NetWorker and Data Domain work together, within a CloudBoost environment, NetWorker clients will participate in the segmentation, deduplication, compression (and encryption!) of the data. That’s the first architectural advantage: rather than needing a big server to handle all the deduplication of the protection environment, a little bit of load is leveraged in each client being protected. The second architectural advantage is that the CloudBoost appliance does not pass the data through. Clients send their deduplicated, compressed and encrypted data directly to the object storage, minimising the data hops involved****.

To be sure, there are still going to be costs associated with running a NetWorker+CloudBoost configuration in public cloud – but that will be true of any data protection service. That’s the nature of public cloud – you use it, you pay for it. What you do get with NetWorker+CloudBoost though is one of the most streamlined and optimised public cloud backup options available. In an infrastructure model where you pay for every resource consumed, it’s imperative that the backup architecture be as resource-optimised as possible.

IaaS workloads will only continue to grow in public cloud. If your business uses NetWorker, you can take comfort in being able to still protect those workloads while they’re in public cloud, and doing it efficiently, optimised for maximum storage potential with minimised resource cost. Remember always: architecture matters, no matter where your infrastructure is.


Hey, if you found this useful, don’t forget to check out Data Protection: Ensuring Data Availability.


 


* Yes, I am aware there’ll be other costs beyond power and cooling when calculating a true system management price, but I’m not going to go into those for the purposes of this blog.

** Some readers of my blog may very well recall earlier computing models. But I started with a Vic-20, then the Commodore-64, and both taught me valuable lessons about what you can – and can’t – fit in memory.

*** Many a company has been burnt by failing to cost that simple factor, but in the style of Michael Ende, that is another story, for another time.

**** Linux 64-bit clients do this now. Windows 64-bit clients are supported in NetWorker 9.2, coming soon. (In the interim Windows clients work via a storage node.)

May 23 2017

I’m going to keep this one short and sweet. In Cloud Boost vs Cloud Tier I go through a few examples of where and when you might consider using Cloud Boost instead of Cloud Tier.

One interesting thing I’m noticing of late is a variety of people talking about “VTL in the Cloud”.


I want to be perfectly blunt here: if your vendor is talking to you about “VTL in the Cloud”, they’re talking to you about transferring your workloads rather than transforming your workloads. When moving to the Cloud, about the worst thing you can do is lift and shift. Even in Infrastructure as a Service (IaaS), you need to closely consider what you’re doing to ensure you minimise the cost of running services in the Cloud.

Is your vendor talking to you about how they can run VTL in the Cloud? That’s old hat. It means they’ve lost the capacity to innovate – or at least, lost interest in it. They’re not talking to you about a modern approach, but just repeating old ways in new locations.

Is that really the best that can be done?

In a coming blog article I’ll talk about the criticality of ensuring your architecture is streamlined for running in the Cloud; in the meantime I just want to make a simple point: talking about VTL in the Cloud isn’t a “modern” discussion – in fact, it’s quite the opposite.

May 23 2017

Introduction

What constitutes a successful backup? It’s a seemingly straight-forward question, but it may not engender the same response from everyone you ask. On the surface, you might suggest the answer is simply “a backup that completes without error”, and that’s part of the answer, but it’s not the complete answer.


Instead, I’m going to suggest there are actually at least ten factors that go into making up a successful backup, and explain why each one of them is important.

The Rules

One – It finishes without a failure

This is the simplest explanation of a successful backup: one that literally finishes successfully. It makes sense, and it should be a given. If a backup fails to transfer the data it is meant to transfer during the process, it’s obviously not successful.

Now, there’s a caveat here, something I need to cover off. Sometimes you might encounter situations where a backup completes successfully  but triggers or produces a spurious error as it finishes. I.e., you’re told it failed, but it actually succeeded. Is that a successful backup? No. Not in a useful way, because it’s encouraging you to ignore errors or demanding manual cross-checking.

Two – Any warnings produced are acceptable

Sometimes warnings will be thrown during a backup. It could be that a file had to be re-read, or a file was opened at the time of backup (e.g., on a Unix/Linux system) and could only be partially read.

Some warnings are acceptable, some aren’t. Some warnings that are acceptable on one system may not be acceptable on another. Take, for instance, log files. On a lot of systems, if a log file is being actively written to when the backup is running, it could be that a warning of an incomplete capture of the file is acceptable. If, however, the host is a security logging system and compliance/auditing requirements dictate all security logs are to be recoverable, an open-file warning won’t be acceptable.

Three – The end-state is captured and reported on

I honestly can’t count the number of times over the years I’ve heard of situations where a backup was assumed to have been running successfully, then when a recovery is required there’s a flurry of activity to determine why the recovery can’t work … only to find the backup hadn’t been completing successfully for days, weeks, or even months. I really have dealt with support cases in the past where critical data that had to be recovered was unrecoverable due to a recurring backup failure – one that had been going on, being reported in logs and completion notifications, day-in, day-out, for months.

So, a successful backup is also a backup where the end-state is captured and reported on. The logical result is that if the backup does fail, someone knows about it and is able to choose an action for it.

When I first started dealing with NetWorker, that meant checking the savegroup completion reports in the GUI. As I learnt more about the importance of automation, and systems scaled (my system administration team had a rule: “if you have to do it more than once, automate it”), I built parsers to automatically interpret savegroup completion results and provide emails that would highlight backup failures.
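To give a flavour of what that sort of automation can look like, here’s a minimal sketch of a completion-report parser. The log format below is entirely made up for the illustration – real savegroup completion output looks different – but the principle of ‘parse, summarise, alert on failures’ is the same.

```python
import re
import sys

# Hypothetical report lines, e.g. "client01:/data succeeded" or
# "client02:D:\ failed: connection refused".
RESULT = re.compile(r"^(?P<client>\S+?):(?P<saveset>\S+)\s+(?P<status>succeeded|failed)(?P<detail>.*)$")

def summarise(report_lines):
    """Count savesets and collect the failures for alerting."""
    total, failures = 0, []
    for line in report_lines:
        match = RESULT.match(line.strip())
        if not match:
            continue
        total += 1
        if match.group("status") == "failed":
            failures.append(f"{match.group('client')} ({match.group('saveset')}){match.group('detail')}")
    return total, failures

if __name__ == "__main__":
    total, failures = summarise(sys.stdin)
    print(f"{total - len(failures)}/{total} savesets succeeded")
    for failure in failures:
        print(f"FAILED: {failure}")
    sys.exit(1 if failures else 0)   # non-zero exit lets a wrapper script send the email/alert
```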

As an environment scales further, automated parsing needs to scale as well – hence the necessity of products like Data Protection Advisor, where you get everything from simple dashboards for overnight success ratios, with drill-downs and root cause analysis, all the way up to SLA adherence reports and beyond.

In short, a backup needs to be reported on to be successful.

Four – The backup method allows for a successful recovery

A backup exists for one reason alone – to allow the retrieval and reconstruction of data in the event of loss or corruption. If the way in which the backup is run doesn’t allow for a successful recovery, then the backup should not be counted as a successful backup, either.

Open files are a good example of this – particularly if we move into the realm of databases. For instance, on a regular Linux filesystem (e.g., XFS or EXT4), it would be perfectly possible to configure a filesystem backup of an Oracle server. No database plugin, no communication with RMAN, just a rolling sweep of the filesystem, writing all content encountered to the backup device(s).

But it wouldn’t be recoverable. It’s a crash-consistent backup, not an application-consistent backup. So, a successful backup must be a backup that can be successfully recovered from, too.

Five – If an off-site/redundant copy is required, it is successfully performed

Ideally, every backup should get a redundant copy – a clone. Practically, this may not always be the case. The business may decide, for instance, that ‘bronze’ tiered backups – say, of dev/test systems – do not require backup replication. Ultimately this becomes a risk decision for the business, and so long as the right role(s) have signed off against the risk, and it’s deemed to be a legally acceptable risk, then there may not be copies made of specific types of backups.

But for the vast majority of businesses, there will be backups for which there is a legal/compliance requirement for backup redundancy. As I’ve said before, your backups should not be a single point of failure within your data protection environment.

So, if a backup succeeds but its redundant copy fails, the backup should, to a degree, be considered to have failed. This doesn’t mean you have to necessarily do the backup again, but if redundancy is required, it means you do have to make sure the copy gets made. That then hearkens back to requirement three – the end state has to be captured and reported on. If you’re not capturing/reporting on end-state, it means you won’t be aware if the clone of the backup has succeeded or not.

Six – The backup completes within the required timeframe

You have a flight to catch at 9am. Because of heavy traffic, you don’t arrive at the airport until 1pm. Did you successfully make it to the airport?

It’s the same with backups. If, for compliance reasons you’re required to have backups complete within 8 hours, but they take 16 to run, have they successfully completed? They might exit without an error condition, but if SLAs have been breached, or legal requirements have not been met, it technically doesn’t matter that they finished without error. The time it took them to exit was, in fact, the error condition. Saying it’s a successful backup at this point is sophistry.

Seven – The backup does not prevent the next backup from running

This can happen in one of two different ways. The first is actually a special condition of rule six – even if there are no compliance considerations, if a backup meant to run once a day takes longer than 24 hours to complete, then by extension it’s going to prevent the next backup from running. This becomes a double failure – not only does the earlier backup overrun, but the next backup doesn’t run at all because the earlier backup is blocking it.

The second way is not necessarily related to backup timing – this is where a backup completes, but it leaves the system in a state that prevents the next backup from running. This isn’t necessarily a common thing, but I have seen situations where, for whatever reason, the way a backup finished prevented the next backup from running. Again, that becomes a double failure.

Eight – It does not require manual intervention to complete

There are effectively two categories of backups – those that are started automatically, and those that are started manually. A backup may in fact be started manually (e.g., in the case of an ad-hoc backup), but it should still be able to complete without manual intervention.

As soon as manual intervention is required in the backup process, there’s a much greater risk of the backup not completing successfully, or within the required time-frame. This is, effectively, about designing the backup environment to reduce risk by eliminating human intervention. Think of it as one step removed from the classic challenge that if your backups are required but don’t start without human intervention, they likely won’t run. (A common problem with ‘strategies’ around laptop/desktop self-backup requirements.)

There can be workarounds for this – for example, if you need to trigger a database dump as part of the backup process (e.g., for a database without a plugin), it could be that a password needs to be entered, and the dump tool only accepts passwords interactively. Rather than having someone actually enter the password manually, the dump command could instead be automated with tools such as Expect.
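As a minimal sketch of that idea, here’s what the automation might look like using Python’s pexpect module (an Expect-like library). The dump command, prompt text and password handling are all hypothetical – in practice you’d pull the credential from a proper secrets store, and you’d match the real prompt of whatever dump utility you’re using.

```python
import os
import sys
import pexpect

# Hypothetical interactive dump tool and prompt - adjust for the real utility in use.
DUMP_COMMAND = "dbdump --database sales --out /backup/staging/sales.dmp"
PASSWORD = os.environ["DBDUMP_PASSWORD"]   # better: fetch from a secrets manager

child = pexpect.spawn(DUMP_COMMAND, timeout=3600)
child.expect("Password:")        # wait for the interactive password prompt
child.sendline(PASSWORD)         # answer it, so no human needs to be present
child.expect(pexpect.EOF)        # wait for the dump to finish
child.close()

if child.exitstatus != 0:
    print("Database dump failed", file=sys.stderr)
    sys.exit(child.exitstatus or 1)
```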

Nine – It does not unduly impact access to the data it is protecting

(We’re in the home stretch now.)

A backup should be as light-touch as possible. Perhaps the best example of a ‘heavy touch’ backup is a cold database backup. That’s where the database is shut down for the duration of the backup, and it’s a perfect example of a backup directly impacting/impeding access to the data being protected. Sometimes it’s more subtle though – high performance systems may have limited IO and system resources available to handle the streaming of a backup, for instance. If system performance is degraded by the backup, then the backup should be considered unsuccessful.

I liken this to uptime vs availability. A server might be up, but if the performance of the system is so poor that users can’t practically use the service it offers, then it’s not really available. That’s where, for instance, systems like ProtectPoint can be so important – in high performance systems it’s not just about getting a high speed backup, but about limiting the load on the database server during the backup process.

Ten – It is predictably repeatable

Of course, there are ad-hoc backups that might only ever need to be run once, or backups that you may never need to run again (e.g., pre-decommissioning backup).

The vast majority of backups within an environment, though, will be repeated daily. Ideally, the result of each backup should be predictably repeatable. If the backup succeeds today, and there are absolutely no changes to the systems or environment, for instance, then it should be reasonable to expect the backup will succeed tomorrow. That doesn’t remove the requirement for end-state capturing and reporting; it does mean, though, that the backup results shouldn’t effectively be random.

In Summary

It’s easy to understand why the simplest answer (“it completes without error”) can be so easily assumed to be the whole answer to “what constitutes a successful backup?” There’s no doubt it forms part of the answer, but if we think beyond the basics, there are definitely a few other contributing factors to achieving really successful backups.

Consistency, impact, recovery usefulness and timeliness, as well as all the other rules outlined above, also come into how we define a truly successful backup. And remember, it’s not about making more work for ourselves – it’s about preventing future problems.


If you’ve thought the above was useful, I’d suggest you check out my book, Data Protection: Ensuring Data Availability. Available in paperback and Kindle formats.

Dell EMC Integrated Data Protection Appliance

Architecture
May 10 2017

Dell EMC World is currently on in Las Vegas, and one of the most exciting announcements to come out of the show (in my opinion) is the Integrated Data Protection Appliance (IDPA).

Hyperconverged is eating into the infrastructure landscape – it’s a significantly growing market for Dell EMC, as evidenced by the VxRail and VxRack product lines. These allow you to deploy fast, efficiently and with a modernised consumption approach thanks to Enterprise Hybrid Cloud.

The next step in that hyperconverged path is hyperconverged data protection, which is where the IDPA comes in.


Hyperconverged data protection works on the same rationale as hyperconverged primary production infrastructure: you can go out to market and buy a backup product, data protection storage, and the systems infrastructure to run it all on, then assemble, test and configure it when it arrives – or you can buy a single appliance with the right starting and growth capacity for you, have it delivered on-site pre-built and tested, and be up and running your first backup a few hours later.

The IDPA is an important step in the evolution of data protection, recognising the changing landscape in the IT infrastructure environment, notably:

  • Businesses want to see results realised from their investment as soon as it arrives
  • Businesses don’t want IT staff spending time doing ‘one-off’ installation activities.
  • The silo, ‘communicate via service tickets’ approach to IT is losing ground as the infrastructure administrator becomes a real role within organisations. It’s not just infrastructure becoming hyperconverged – it’s people, too.
  • The value of automation is finally being understood, since it frees IT resources to concentrate on projects and issue resolution, rather than BAU button pressing.
  • Mobile workforces and remote office environments increasingly mean you may not have an IT person present on-site to physically make a change, etc.
  • Backup administrators need to become data protection administrators, and data protection architects.

And there’s one final aspect to the IDPA that cannot be overstated in the realm of hyper-virtualised environments: the IDPA provides natural physical separation of your protection data from your operational infrastructure. Consider a traditional protection environment:

Traditional Environment

In a traditional protection environment, you’ll typically have separated protection storage (e.g., Data Domain), but it’s very typical these days, particularly in hyper-virtualised environments, to see the backup services themselves running within the same environment they’re protecting. That means if there is a significant primary systems infrastructure issue, your recovery time may take longer because you’ll have to first get the backup services up and running again.

IDPA provides complete separation though:

IDPA Environment

The backup services and configuration no longer run on your primary systems infrastructure, instead running in a separate appliance. This gives you higher levels of redundancy and protection for your protection environment, decreasing risk within your business.

Top picks for where you should consider an IDPA:

  • When deploying large-scale hyperconverged environments (e.g., VxRack)
  • For remote offices
  • For greenfields computer-rooms
  • For dealing with large new workloads
  • For modernising your approach to data protection
  • Whenever you want a single, turnkey approach to data protection with a single vendor supporting the entire stack

The IDPA can scale with your business; there are models starting as low as 34 TB usable (pre-dedupe) and scaling all the way to 1 PB usable (and that’s before you consider cloud-tiering).

If you’re wanting to read more about IDPA, check out the official Dell EMC blog post for the release here.

May 05 2017

There was a time, comparatively not that long ago, when the biggest governing factor in LAN capacity for a datacentre was not the primary production workloads, but the mechanics of getting a full backup from each host over to the backup media. If you’ve been around in the data protection industry long enough you’ll have had experience of that – for instance, the drive towards 1Gbit networks over Fast Ethernet started more often than not in datacentres I was involved in thanks to backup. Likewise, the first systems I saw being attached directly to 10Gbit backbones in datacentres were the backup infrastructure.

Well architected deduplication can eliminate that consideration. That’s not to say you won’t eventually need 10Gbit, 40Gbit or even more in your datacentre, but if deduplication is architected correctly, you won’t need to deploy that next level up of network performance to meet your backup requirements.

In this blog article I want to take you through an example of why deduplication architecture matters, and I’ll focus on something that amazingly still gets consideration from time to time: post-ingest deduplication.

Before I get started – obviously, Data Domain doesn’t use post-ingest deduplication. Its pre-ingest deduplication ensures the only data written to the appliance is already deduplicated, and it further increases efficiency by pushing deduplication segmentation and processing out to the individual clients (in a NetWorker/Avamar environment) to limit the amount of data flowing across the network.

A post-ingest deduplication architecture, though, has your protection appliance feature two distinct tiers of storage – the landing or staging tier, and the deduplication tier. That means when it’s time to do a backup, all your clients send all their data across the network to sit, at its original size, on the staging tier:

Post Process Dedupe 01

In the example above we’ve already had backups run to the post-ingest deduplication appliance; so there’s a heap of deduplicated data sitting in the deduplication tier, but our staging tier has just landed all the backups from each of the clients in the environment. (If it were NetWorker writing to the appliance, each of those backups would be the full sized savesets.)

Now, at some point after the backup completes (usually a preconfigured time), post-processing kicks in. This is effectively a data-migration window in a post-ingest appliance where all the data in the staging tier has to be read and processed for deduplication. For example, using the example above, we might start with inspecting ‘Backup01’ for commonality to data on the deduplication tier:

Post Process Dedupe 02

So the post-ingest processing engine starts by reading through all the content of Backup01, constructing a fingerprint analysis of the data that has landed.

Post Process Dedupe 03

As fingerprints are assembled, data can be compared against the data already residing in the deduplication tier. This may result in signature matches or signature misses, indicating new data that needs to be copied into the deduplication tier.

Post Process Dedupe 04

In this it’s similar to regular deduplication – signature matches result in pointers for existing data being updated and extended, and a signature miss results in needing to store new data on the deduplication tier.

Post Process Dedupe 05

Once the first backup file written to the staging tier has been dealt with, we can delete that file from the staging area and move on to the second backup file to start the process all over again. And we keep doing that, over and over, until we’re left with an empty staging tier:

Post Process Dedupe 06

Of course, that’s not the end of the process – then the deduplication tier will have to run its regular housekeeping operations to remove data that’s no longer referenced by anything.
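In deliberately simplified pseudocode (a sketch of the concept, not any vendor’s actual implementation), the whole post-ingest cycle looks something like the following – note that every byte has to land at full size and then be read back before deduplication even begins, and only afterwards can staging space be reclaimed and housekeeping run.

```python
import hashlib

def fingerprint(segment: bytes) -> str:
    return hashlib.sha256(segment).hexdigest()

def post_ingest_cycle(staging_tier, dedupe_tier, references):
    """Simplified post-process deduplication: land everything first, dedupe later."""
    for backup_name in list(staging_tier):
        for segment in staging_tier[backup_name]:        # re-read every byte that just landed
            fp = fingerprint(segment)
            if fp in dedupe_tier:
                references[fp] += 1                      # signature match: update pointers only
            else:
                dedupe_tier[fp] = segment                # signature miss: copy the data across
                references[fp] = 1
        del staging_tier[backup_name]                    # only now is staging space reclaimed
    # ...followed by the deduplication tier's own housekeeping pass.

# The staging tier holds full-sized backups (lists of segments) until post-processing runs.
staging = {"Backup01": [b"alpha", b"beta"], "Backup02": [b"alpha", b"gamma"]}
dedupe, refs = {}, {}
post_ingest_cycle(staging, dedupe, refs)
print(len(dedupe), "unique segments stored;", len(staging), "backups left on staging")
```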

Architecturally, post-ingest deduplication is a kazoo to pre-ingest deduplication’s symphony orchestra. Sure, you might technically get to hear the 1812 Overture, but it’s not really going to be the same, right?

Let’s go through where, architecturally, post-ingest deduplication fails you:

  1. The network becomes your bottleneck again. You have to send all your backup data to the appliance.
  2. The staging tier has to have at least as much capacity available as the size of your biggest backup, assuming it can execute its post-process deduplication within the window between when your previous backup finishes and your next backup starts.
  3. The deduplication process becomes entirely spindle bound. If you’re using spinning disk, that’s a nightmare. If you’re using SSD, that’s $$$.
  4. There’s no way of telling how much space will be occupied on the deduplication tier after deduplication processing completes. This can lead you into very messy situations where say, the staging tier can’t empty because the deduplication tier has filled. (Yes, capacity maintenance is a requirement still on pre-ingest deduplication systems, but it’s half the effort.)

What this means is simple: post-ingest deduplication architectures are asking you to pay for their architectural inefficiencies. That’s where:

  1. You have to pay to increase your network bandwidth to get a complete copy of your data from client to protection storage within your backup window.
  2. You have to pay for both the staging tier storage and the deduplication tier storage. (In fact, the staging tier is often a lot bigger than the size of your biggest backups in a 24-hour window so the deduplication can be handled in time.)
  3. You have to factor the additional housekeeping operations into blackout windows, outages, etc. Housekeeping almost invariably becomes a daily rather than a weekly task, too.

Compare all that to pre-ingest deduplication:

Pre-Ingest Deduplication

Using pre-ingest deduplication, especially Boost-based deduplication, the segmentation and hashing happen directly where the data is, and rather than sending all the data to be protected from the client to the Data Domain, we only send the unique data. Data that already resides on the Data Domain? All we’ll have sent is a tiny fingerprint so the Data Domain can confirm it’s already there (and update its pointers for the existing data), then moved on. After your first backup, that potentially means that on a day to day basis your network requirements for backup are reduced by 95% or more.
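A correspondingly simplified sketch of the pre-ingest approach is below – again, an illustration of the concept rather than how Boost is actually implemented. The client fingerprints its own segments and only ships the ones the protection storage doesn’t already hold.

```python
import hashlib

def fingerprint(segment: bytes) -> str:
    return hashlib.sha256(segment).hexdigest()

def client_side_backup(segments, protection_storage):
    """Simplified pre-ingest dedupe: only unique segments ever cross the network."""
    sent_bytes = 0
    for segment in segments:
        fp = fingerprint(segment)
        if fp not in protection_storage:       # known data needs only a tiny fingerprint check
            protection_storage[fp] = segment   # unique data is the only payload actually sent
            sent_bytes += len(segment)
    return sent_bytes

storage = {}
day1 = [b"alpha", b"beta", b"gamma"]
day2 = [b"alpha", b"beta", b"delta"]           # mostly unchanged from day 1
print("Day 1 sent:", client_side_backup(day1, storage), "bytes")
print("Day 2 sent:", client_side_backup(day2, storage), "bytes")   # only the changed segment
```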

That’s why architecture matters: you’re either doing it right, or you’re paying the price for someone else’s inefficiency.


If you want to see more about how a well architected backup environment looks – technology, people and processes, check out my book, Data Protection: Ensuring Data Availability.

NetWorker 9.1.1 gets out the door

NetWorker
May 02 2017

I had a fairly full-on weekend so I missed this one – NetWorker 9.1.1 is now available.

Being a minor release, this one is focused on general improvements and currency, as opposed to introducing a wealth of new features.


There are some really useful updates around NMC, such as:

  • Performance/response improvements
  • Option for NMC to retrieve a vProxy support bundle for you
  • NMC now shows whenever the NetWorker server is running in service mode
  • NMC will give you a list of virtual machines backed up and skipped
  • NMC recoveries now highlight the calendar dates that are available to select backups to recover from

Additionally, NDMP and NMDA get some updates as well:

  • Some NDMP application options can now be set at the NetWorker client resource level, rather than having to establish them as an environment variable
  • NMDA for SAP/Oracle and Oracle/RMAN get more compact debug logs
  • NMDA for Sybase can now recover log-tail backups.

Finally, there’s the version currency:

  • NetWorker Server High Availability is now supported on SuSE 12 SP2 with HAE, and RHEL 7.3 in a High Availability Cluster (with Pacemaker).
  • NVP/vProxy supports vSphere 6.0u3
  • Meditech module supports Unity 4.1 and RecoverPoint 5.0.

As always for upgrades, make sure you read the release notes before diving in.


Also, don’t forget my new book is out: Data Protection: Ensuring Data Availability. It’s the perfect resource for any data protection architect.

Apr 27 2017

Regardless of whether you’re new to NetWorker or have been using it for a long time, if there’s any change happening within the overall computing environment of your business, there’s one thing you always need to have access to: compatibility guides.

As I’ve mentioned a few times, including in my book, there won’t be anything in your environment – other than the network itself – that touches more things than your enterprise backup and recovery product. With this in mind, it’s always critical to understand the potential impact of changes to your environment on the backups.

NetWorker, as well as the majority of the rest of the Dell EMC Data Protection Suite, no longer has a static software compatibility guide. There are enough variables that a static software compatibility guide would be tedious to search and maintain. So instead of being a document, it’s now a database, and you get to access it and generate custom compatibility information for exactly what you need.


If you’ve not used the interactive compatibility guide before, you’ll find it at:

http://compatibilityguide.emc.com:8080/CompGuideApp/

My recommendation? Bookmark it. Make sure it’s part of your essential NetWorker toolkit. (And for that matter: Data Domain OS, Boost Plugins, ProtectPoint, Avamar, etc.)

When you first hit the compatibility landing page, you’ll note a panel on the left-hand side from which you can choose the product. In this case, when you expand NetWorker, you’ll get a list of versions since the interactive guide was introduced:

Part 01

Once you’ve picked the NetWorker version, the central panel will be updated to reflect the options you can check compatibility against:

Part 02

As you can see, there’s some ‘Instant’ reports for specific quick segments of information; beyond that, there’s the ability to create a specific drill-down report using the ‘Custom Reports’ option. For instance, say you’re thinking of deploying a new NetWorker server on Linux and you want to know what versions of Linux you can deploy on. To do that, under ‘NetWorker Component’ you’d select ‘Server OS’, then get a sub-prompt for broad OS type, then optionally drill down further. In this case:

Part 03

Here I’ve selected Server OS > Linux > All to get information about all the versions of Linux a NetWorker 9.1 server can run on. After you’ve made your selections, all you have to do is click ‘Generate Report’ to actually create a compatibility report. The report itself will look something like the following:

Part 04

Any area in the report that’s underlined is a hover prompt: hovering the mouse cursor over it will pop up the additional clarifying information referenced. Also note the “Print/Save Results” option – if, say, as part of a change request you need to submit documentary evidence, you can generate yourself a PDF that covers exactly what you need.

If you need to generate multiple, different compatibility reports, you may need to click the ‘Reset’ button to blank out all options. (This will avoid a situation where you end up, say, trying to find out what versions of Exchange on Linux are supported!)

As far as the instant reports are concerned – this is about quickly generating information that you want to get straight away – you click on the option in the Instant Reports, and you don’t even need to click ‘Generate Report’. For instance, the NVP option:

Part 05

Part 06

That’s really all there is to the interactive compatibility guide – it’s straightforward, and it’s a really essential tool in the arsenal of a NetWorker or Dell EMC Data Protection Suite user.

Oh, there’s one more thing – there are other compatibility guides, of course: older NetWorker and Avamar software guides, NetWorker hardware guides, and so on. You can get access to the legacy and traditional hardware compatibility guides via the right-hand area of the guide page:

Part 07

There you have it. If you need to check NetWorker or DPS compatibility, make the Interactive Compatibility Guide your first port of call.


Hey, don’t forget my book is available now in paperback and Kindle formats!

NetWorker 9.1 FLR Web Interface

NVP, Recovery, vProxy
Apr 04 2017

Hey, don’t forget, my new book is available. Jam packed with information about protecting across all types of RPOs and RTOs, as well as helping out on the procedural and governance side of things. Check it out today on Amazon! (Kindle version available, too.)


In my introductory NetWorker 9.1 post, I covered file level recovery (FLR) from VMware image level backup via NMC. I felt at the time that it was worthwhile covering FLR from within NMC as the VMware recovery integration in NMC was new with 9.1. But at the same time, the FLR Web interface for NetWorker has also had a revamp, and I want to quickly run through that now.

First, the most important thing to understand about FLR from the new NetWorker Virtual Proxy (NVP, aka “vProxy”) is that it’s not something you do by browsing to the Proxy itself. In this updated NetWorker architecture, the proxies are very much dumb appliances, completely disposable, with all the management intelligence coming from the NetWorker server itself.

Thus, to start a web based FLR session, you actually point your browser to:

https://nsrServer:9090/flr

The FLR web service now runs on the NetWorker server itself. (In this sense quite similarly to the FLR service for Hyper-V.)

The next major change is you no longer have to use the FLR interface from a system currently getting image based backups. In fact, in the example I’m providing today, I’m doing it from a laptop that isn’t even a member of the NetWorker datazone.

When you get to the service, you’ll be prompted to login:

01 Initial Login

For my test, I wanted to access via the Administration interface, so I switched to ‘Admin’ and logged on as the NetWorker owner:

02 Logging In as Administrator

After you login, you’re prompted to choose the vCenter environment you want to restore from:

03 Select vCenter

Selecting the vCenter server of course lets you then choose the protected virtual machine in that environment to be recovered:

04 Select VM and Backup

(Science fiction fans will perhaps be able to intuit my host naming convention for production systems in my home lab based on the first three virtual machine names.)

Once you’ve selected the virtual machine you want to recover from, you then get to choose the backup you want to recover – you’ll get a list of backups and clones if you’re cloning. In the above example I’ve got no clones of the specific virtual machine that’s been protected. Clicking ‘Next’ after you’ve selected the virtual machine and the specific backup will result in you being prompted to provide access credentials for the virtual machine. This is so that the FLR agent can mount the backup:

05 Provide Credentials for VM

Once you provide the login credentials (and they don’t have to be local – they can be an AD specified login by using the domain\account syntax), the backup will be mounted, then you’ll be prompted to select where you want to recover to:

06 Select Recovery Location

In this case I selected the same host, recovering back to C:\tmp.

Next you obviously need to select the file(s) and folder(s) you want to recover. In this case I just selected a single file:

07 Select Content to Recover

Once you’ve selected the file(s) and folder(s) you want to recover, click the Restore button to start the recovery. You’ll be prompted to confirm:

08 Confirm Recovery

The restore monitor is accessible via the bottom of the FLR interface – basically an upward-pointing arrow-head you click to expand it. This gives you a view of a running, or in this case, a completed restore, since it was only a single file and took very little time to complete:

09 Recovery Success

My advice generally is that if you want to recover thousands or tens of thousands of files, you’re better off using the NMC interface (particularly if the NetWorker server doesn’t have a lot of RAM allocated to it), but for smaller collections of files the FLR web interface is more than acceptable.

And Flash-free, of course.

There you have it, the NetWorker 9.1 VMware FLR interface.


Hey, don’t forget, my new book is available. Jam packed with information about protecting across all types of RPOs and RTOs, as well as helping out on the procedural and governance side of things. Check it out today on Amazon! (Kindle version available, too.)


 

What to do on world backup day

Backup theory, Best Practice, Recovery
Mar 30 2017

World backup day is approaching. (A few years ago now, someone came up with the idea of designating one day of the year to recognise backups.) Funnily enough, I’m not a fan of world backup day, simply because we don’t backup for the sake of backing up, we backup to recover.

Every day should, in fact, be world backup day.

Something that isn’t done enough – isn’t celebrated enough, isn’t tested enough – is recovery. For many organisations, recovery tests consist of actually doing a recovery when requested, and things like long term retention backups are never tested, and even more rarely recovered from.


So this Friday, March 31, I’d like to suggest you don’t treat it as World Backup Day, but as World Recovery Test Day. Use the opportunity to run a recovery test within your organisation (following proper processes, of course!) – preferably a recovery that you don’t normally run in terms of day to day operations. People only ever request file recoveries? Sounds like a good reason to run an Exchange, SQL or Oracle recovery to me. Most recoveries are Exchange mail-level recoveries? Excellent, you know they work – let’s run a recovery of a complete filesystem somewhere.

All your recoveries are done within a 30 day period of the backup being taken? That sounds like an excellent reason to do a recovery from an LTR backup written 2+ years ago, too.

Part of running a data protection environment is having routine tests to validate ongoing successful operations, and be able to confidently report back to the business that everything is OK. There’s another, personal and selfish aspect to it, too. It’s one I learnt more than a decade ago when I was still an on-call system administrator: having well-tested recoveries means that you can sleep easily at night, knowing that if the pager or mobile phone does shriek you into blurry-eyed wakefulness at 1am, you can in fact log onto the required server and run the recovery without an issue.

So this World Backup Day, do a recovery test.


The need to have an efficient and effective testing system is something I cover in more detail in Data Protection: Ensuring Data Availability. If you want to know more, feel free to check out the book on Amazon or CRC Press. Remember that it doesn’t matter how good the technology you deploy is if you don’t have the processes and training to use it.