The lazy admin

Best Practice, Policies, Scripting
Jul 11, 2015

Are you an industriously busy backup administrator, or are you lazy?

[Image: asleep at desk]

When I started in IT in 1996, it wasn’t long before I joined a Unix system administration team that had an ethos which has guided me throughout my career:

The best sysadmins are lazy.

Even more so than system administration, this applies to anyone who works in data protection. The best people in data protection are lazy.

Now, there’s two types of lazy:

  • Slothful lazy – What we normally think of when we think of ‘lazy’; people who just don’t really do much.
  • Proactively lazy – People who do as much as they can in advance in order to have more time for the unexpected (or longer term projects).

If you’d previously thought I’d gone nuts suggesting I’ve spent my career trying to be lazy (particularly when colleagues read my blog), you’ll hopefully be having that “ah…ha!” moment realising I’m talking about being proactively lazy. This was something I learnt in 1996 – and almost twenty years down the track I’m pleased to see whole slabs of the industry (particularly infrastructure and data protection) are finally following suit and allowing me to openly talk about the virtues of being lazy.

Remember that embarrassingly enthusiastic dance Steve Ballmer was recorded doing years and years ago at a Microsoft conference while he chanted "Developers! Developers! Developers!"? A proactively lazy data protection administrator chants "Automate! Automate! Automate!" in his or her head throughout the day.

Automation is the key to being operationally lazy yet proactively efficient. It's also exactly what we see being the focus of DevOps, of cloud service providers, and of massive-scale converged infrastructure. So what are the key areas for automation? There are a few:

  • Zero error policies – I've been banging the drum about zero error policies for over a decade now. If you want the TL;DR summary, a zero error policy is the process of automating the review of backup results such that the only time you get an alert is when a failure happens. (That also means treating any new "unknown" as a failure/review situation until you've included it in the review process.) There's a small sketch of what this looks like after the list below.
  • Service Catalogues and Policies – Service catalogues allow standard offerings that have been well-planned, costed and associated clearly with an architected system. Policies are the functional structures that enact the service catalogue approach and allow you to minimise the effort (and therefore the risk of human error) in configuration.
  • Visual Dashboards – Reports are OK, notifications are useful, but visual dashboards are absolutely the best at providing an “at a glance” view of a system. I may joke about Infographics from time to time, but there’s no questioning we’re a visual species – a lot of information can be pushed into a few simple glyphs or coloured charts*. There’s something to be said for a big tick to indicate everything’s OK, or an equally big X to indicate you need to dig down a little to see what’s not working.
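
To make the first of those items concrete, here's a minimal sketch of the kind of automated review a zero error policy implies, assuming a hypothetical results file with one client:saveset:status line per backup. The file path, status values and alert address are illustrative assumptions, not a feature of any particular backup product.

#!/bin/bash
# Hypothetical zero error policy review: only failures and unknowns generate noise.
# Assumes a results file containing one "client:saveset:status" line per backup.
RESULTS=/var/log/backup_results.log
ALERTS=""

while IFS=: read -r client saveset status; do
    case "$status" in
        succeeded)
            ;;          # known-good results stay silent
        failed)
            ALERTS+="FAILURE: $client:$saveset"$'\n'
            ;;
        *)
            # Anything not yet classified is treated as a failure until it has
            # been reviewed and deliberately added to the known statuses above.
            ALERTS+="UNKNOWN (treat as failure): $client:$saveset -> $status"$'\n'
            ;;
    esac
done < "$RESULTS"

if [ -n "$ALERTS" ]; then
    printf '%s' "$ALERTS" | mailx -s "Backup review: action required" backup-admins@example.com
fi
# No mail at all means everything checked out, which is the whole point.

The detail will differ from product to product, but the shape stays the same: successes are silent, failures alert, and anything new is a failure until proven otherwise.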

There’s potentially a lot of work behind achieving that – but there are shortcuts. The fastest way to achieving it is sourcing solutions that have already been built. I still see the not-built-here syndrome plaguing some IT environments, and while sometimes it may have a good rationale, it’s an indication of that perennial problem of companies thinking their use cases are unique. The combination of the business, the specific employees, their specific customers and the market may make each business potentially unique, but the core functional IT requirements (“deploy infrastructure”, “protect data”, “deploy applications”, etc.) are standard challenges. If you can spend 100% of the time building it yourself from the ground up to do exactly what you need, or you can get something that does 80% and all you have to do is extend the last 20%, which is going to be faster? Paraphrasing Isaac Newton:

If I have seen further it is by standing on the shoulders of giants.

As you can see, being lazy properly is hard work – but it's an inevitable requirement of the pressures businesses now place on IT to be adaptable, flexible and fast. The proactively lazy data protection service provider can step back out of the way of business functions and offer services that are readily deployable and that reliably work, focusing his or her time on automation and real problem solving rather than all that boring, repetitive busyness.

Be proudly lazy: it’s the best way to work.


* Although I think we have to be careful about building too many simplified reports around colour without considering the usability to the colour-blind.

10 Things Still Wrong with Data Protection Attitudes

Architecture, Backup theory, NetWorker
Mar 07, 2012

When I first started working with backup and recovery systems in 1996, one of the more frustrating statements I’d hear was “we don’t need to backup”.

These days, that sort of attitude is extremely rare – it was a hold-out from the days when computers were often considered non-essential to ongoing business operations. Now, unless you're a tradesperson who does all your work as cash-in-hand jobs, a business not relying on computers in some form or another is practically unheard of. And with that change has come the recognition that backups are, indeed, required.

Yet, there’s improvements that can be made to data protection attitudes within many organisations, and I wanted to outline things that can still be done incorrectly within organisations in relation to backup and recovery.

Backups aren’t protected

Many businesses now clone, duplicate or replicate their backups – but not all of them.

What’s more, occasionally businesses will still design backup to disk strategies around non-RAID protected drives. This may seem like an excellent means of storage capacity optimisation, but it leaves a gaping hole in the data protection process for a business, and can result in catastrophic data loss.

Assembling a data protection strategy that involves unprotected backups is like configuring primary production storage without RAID or some other form of redundancy. Sure, technically it works … but you only need one error and suddenly your life is full of chaos.

Backups not aligned to business requirements

The old superstition was that backups were a waste of money – we do them every day, sometimes more frequently, and hope that we never have to recover from them. That’s no more a waste of money than an insurance policy that doesn’t get claimed on is.

However, what is a waste of money so much of the time is a backup strategy that’s unaligned to actual business requirements. Common mistakes in this area include:

  • Assigning arbitrary backup start times to systems without discussing them with system owners, application administrators, etc.;
  • Service Level Agreements not established (including Recovery Time Objectives and Recovery Point Objectives);
  • Retention policies not aligned to business practice and legal/audit requirements.

Databases insufficiently integrated into the backup strategy

To put it bluntly, many DBAs get quite precious about the data they're tasked with administering and protecting. And that's entirely fair, too – structured data often represents a significant percentage of mission critical functionality within businesses.

However, there’s nothing special about databases any more when it comes to data protection. They should be integrated into the data protection strategy. When they’re not, bad things can happen, such as:

  • Database backups completing after filesystem backups have started, potentially resulting in database dumps not being adequately captured by the centralised backup product (one way of handling this is sketched after this list);
  • Significantly higher amounts of primary storage being utilised to hold multiple copies of database dumps that could easily be stored in the backup system instead;
  • When cold database backups are run, scheduled database restarts may result in corrupt or inconsistent backups if the filesystem backup has been slower than anticipated;
  • Human error resulting in production databases not being protected for days, weeks or even months at a time.
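
On the first of those points, the underlying issue is usually just ordering. The sketch below shows one hedged way of handling it: a pre-backup check that refuses to let the filesystem backup start until the night's database dump has left a completion marker behind. The paths, marker convention and timeout are assumptions for illustration only; whatever pre/post command mechanism your backup product provides could host the same logic.

#!/bin/bash
# Hypothetical pre-backup check: wait for the database dump to finish (signalled
# by a completion marker file) before allowing the filesystem backup to proceed.
DUMP_DIR=/backups/db_dumps
MARKER="$DUMP_DIR/dump_complete_$(date +%Y%m%d)"
WAIT_LIMIT=7200        # give the dump at most two hours past the scheduled start
WAITED=0

while [ ! -f "$MARKER" ]; do
    if [ "$WAITED" -ge "$WAIT_LIMIT" ]; then
        echo "Database dump not complete after ${WAIT_LIMIT}s - aborting filesystem backup" >&2
        exit 1         # a zero error policy would then pick this up for review
    fi
    sleep 60
    WAITED=$((WAITED + 60))
done

echo "Database dump complete - safe to back up $DUMP_DIR"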

When you think about it, practically all data within an environment is special in some way or another. Mail data is special. Filesystem data is special. Archive data is special. Yet in practically no organisation will the administrators of those specific systems get such free rein over data protection activities, keeping them siloed off from the rest of the organisation.

Growth not forecast

Backup systems are rarely static within an organisation. As primary data grows, so too does the backup system. As archive grows, the impact on the backup system can be a little more subtle, but there remains an impact.

One of the worst mistakes I've seen made in backup system planning is assuming that what is bought today for backup will be equally suitable next year, or 3-5 years from now.

Growth must not only be forecast for long-term planning within a backup environment, but regularly reassessed. It’s not possible, after all, to assume a linear growth pattern will remain constantly accurate; there will be spikes and troughs caused by new projects or business initiatives and decommissioning of systems.
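
As a trivial illustration of why linear assumptions break down, the following sketch (with entirely made-up numbers) projects capacity from a steady organic growth rate plus the step changes that projects and decommissioning introduce. The point isn't the figures; it's that the forecast is a model you re-run whenever the assumptions change.

#!/bin/bash
# Toy capacity projection: steady growth plus known step changes.
# All figures are assumptions for the sake of the example.
awk 'BEGIN {
    size = 10000            # current front-end data in GB (assumed)
    monthly_growth = 0.03   # 3% organic growth per month (assumed)
    for (m = 1; m <= 24; m++) {
        size *= (1 + monthly_growth)
        if (m == 6)  size += 2000   # new project lands in month 6
        if (m == 18) size -= 1500   # legacy system decommissioned in month 18
        printf "Month %2d: %8.0f GB front-end data\n", m, size
    }
}'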

Zero error policies aren’t implemented

If you don’t have a zero error policy in place within your organisation for backups, you don’t actually have a backup system. You’ve just got a collection of backups that may or may not have worked.

Zero error policies rigorously and reliably capture failures within the environment and maintain a structure for ensuring they are resolved, catalogued and documented for future reference.

Backups seen as a substitute for Disaster Recovery

Backups are not in themselves disaster recovery strategies; their processes without a doubt play into disaster recovery planning, and a fairly important part at that.

But having a backup system in place doesn’t mean you’ve got a disaster recovery strategy in place.

The technology side of disaster recovery – particularly when we extend to full business continuity – doesn’t even approach half of what’s involved in disaster recovery.

New systems deployment not factoring in backups

One could argue this is an extension of growth and capacity forecasting, but in reality it’s more the case that these two issues will usually have a degree of overlap.

This is typically an issue in organisations that don't have formalised procedures; the easiest way to ensure new systems deployments are included in backup strategies is to have build forms, where staff not only request storage, RAM and user access, but backup as well.

To put it quite simply – no new system should be deployed within an organisation without at least consideration for backup.

No formalised media ageing policies

Particularly in environments that still have a lot of tape (either legacy or active), a backup system will have more physical components than just about everything else in the datacentre put together – i.e., all the media.

In such scenarios, a regrettably common mistake is a lack of policies for dealing with cartridges as they age. In particular:

  • Batch tracking;
  • Periodic backup verification;
  • Migration to new media as/when required;
  • Migration to new formats of media as/when required.

These tasks aren’t particularly enjoyable – there’s no doubt about that. However, they can be reasonably automated, and failure to do so can cause headaches for administrators down the road. Sometimes I suspect these policies aren’t enacted because in many organisations they represent a timeframe beyond the service time of the backup administrator. However, even if this is the case, it’s not an excuse, and in fact should point to a requirement quite the opposite.

Failure to track media ageing is probably akin to deciding not to ever service your car. For a while, you’ll get away with it. As time goes on, you’re likely to run into bigger and bigger problems until something goes horribly wrong.

Backup is confused with archive

Backup is not archive.

Archive is not backup.

Treating the backup system as a substitute for archive is a headache for the simple reason that archive is about extending primary storage, whereas backup is about taking copies of primary storage data.

Backup is seen as an IT function

While backup is undoubtedly managed and administered by IT staff, it remains a core business function. Like corporate insurance, it belongs to the central business, not only for budgetary reasons but also for continuance and alignment. If this isn't the case yet, initial steps towards that shift can be made by establishing an information protection advisory council within the business – a grouping of IT staff and core business staff.

The 5 Golden Rules of Recovery

Backup theory, NetWorker, Policies
Nov 10, 2009

You might think, given that I wrote an article a while ago about the Procedural Obligations of Backup Administrators, that it wouldn't be necessary to explicitly spell out any recovery rules – but this isn't quite the case. It's handy to have a "must follow" list of rules for recovery as well.

In their simplest form, these rules are:

  1. How
  2. Why
  3. Where
  4. When
  5. Who

Let’s look at each one in more detail:

  1. How – Know how to do a recovery, before you need to do it. The worst forms of data loss typically occur when a backup routine is put in place that is untried on the assumption that it will work. If a new type of backup is added to an environment, it must be tested before it is relied on. In testing, it must be documented by those doing the recovery. In being documented, it must be referenced by operational procedures*.
  2. Why – Know why you are doing a recovery. This directly affects the required resources. Are you recovering a production system, or a test system? Is it for the purposes of legal discovery, or because a database collapsed?
  3. Where – Know where you are recovering from and to. If you don’t know this, don’t do the recovery. You do not make assumptions about data locality in recovery situations.
  4. When – Know when the recovery needs to be completed by. This isn’t always answered by the why factor – you actually need to know both in order to fully schedule and prioritise recoveries.
  5. Who – Know that whoever requested the recovery is authorised to do so. (In order to know this, there should be operational recovery procedures – forms and company policies – that indicate authorisation.)

If you know the how, why, where, when and who, you’re following the golden rules of recovery.


* Or to put it another way – documentation is useless if you don’t know it exists, or you can’t find it!

Aug 25, 2009

This article has now moved to Enterprise Systems Backup, and can be read here.

Feb 13, 2009

Note: It’s 2015, and I now completely disagree with what I wrote below. Feel free to read what I had to say, but then check out Virtualised Servers and Storage Nodes.

Introduction

When it comes to servers, I love virtualisation. No, not to the point where I'd want to marry virtualisation, but it is something I'm particularly keen on. I even use it at home – I've gone from three servers (one for databases, one as a fileserver, and one as an internet gateway) down to one, thanks to VMware Server.

Done right, I think the average datacentre should be able to achieve somewhere in the order of 75% to 90% virtualisation. I'm not talking about high performance computing environments – just your standard server farms. Indeed, having recently seen a demo of VMware's Site Recovery Manager (SRM), and having participated in many site failover tests, I've become an even bigger fan of the time and efficiency savings available through virtualisation.

That being said, I think backup servers fall into that special category of “servers that shouldn’t be virtualised”. In fact, I’d go so far as to say that even if every other machine in your server environment is virtual, your backup server still shouldn’t be a virtual machine.

There are two key reasons why I think having a virtualised backup server is a Really Bad Idea, and I’ll outline them below:

Dependency

In the event of a site disaster, your backup server should be at least equal first in the queue of servers to be rebuilt. That is, you may start the process of getting equipment ready for restoration of data, but the backup server needs to be up and running in order to achieve data recovery.

If the backup server is configured as a guest within a virtual machine server, it's hardly going to be the first machine to be configured, is it? The virtual machine server will need to be built and configured first, and the backup server only after that.

In this scenario, there is a dependency that results in the build of the backup server becoming a bottleneck to recovery.

I realise that we try to avoid scenarios where the entire datacentre needs to be rebuilt, but this still has to be kept in mind – what do you want to be spending time on when you need to recover everything?

Performance

Most enterprise class virtualisation systems offer the ability to set performance criteria on a per machine basis – that is, in addition to the basics you’d expect such as “this machine gets 1 CPU and 2GB of RAM”, you can also configure options such as limiting the number of MHz/GHz available to each presented CPU, or guaranteeing performance criteria.

Regardless though, when you’re a guest in a virtual environment, you’re still sharing resources. That might be memory, CPU, backplane performance, SAN paths, etc., but it’s still sharing.

That means at some point, you're sharing performance. The backup server, which is trying to write data out to the backup medium (be that tape or disk), is potentially competing for, or at least sharing, backplane throughput with the very machines it is backing up.

This may not always make a tangible impact. However, debugging such an impact when it does occur becomes much more challenging. (For instance, in my book, I cover off some of the performance implications of having a lot of machines access storage from a single SAN, and how the performance of any one machine during backup is no longer affected just by that machine. The same non-trivial performance implications come into play when the backup server is virtual.)

In Summary

One way or the other, there's a good reason why you shouldn't virtualise your backup environment. It may be that for a small environment the performance impact isn't an issue and it seems logical to virtualise. However, if you are in a small environment, your failover to another site is likely to be a very manual process, in which case you'll be far more likely to hit the dependency issue when it comes time for a full site recovery.

Equally, if you’re a large company that has a full failover site, then while the dependency issue may not be as much of a problem (due to say, replication, snapshots, etc.), there’s a very high chance that backup and recovery operations are very time critical, in which case the performance implications of having a backup server share resources with other machines will likely make a virtual backup server an unpalatable solution.

A final request

As someone who has done a lot of support, I’d make one special request if you do decide to virtualise your backup server*.

Please, please make sure that any time you log a support call with your service provider you let them know you’re running a virtual backup server. Please.


* Much as I’d like everyone to do as I suggest, I (a) recognise this would be a tad boring and (b) am unlikely at any point soon or in the future to become a world dictactor, and thus wouldn’t be able to issue such an edict anyway, not to mention (c) can occasionally be fallible.

Offsite your bootstrap reports

NetWorker, Policies, Scripting
Feb 09, 2009

I can’t stress the importance enough of getting your bootstrap reports offsite. If you don’t have a bootstrap report available and you have to rebuild your NetWorker server, then you may potentially have to scan through a lot of media to find your most recent backup.

It’s all well and good having your bootstrap report emailed to your work email address, but what happens if whatever takes out your backup server takes out your mail server as well?

There’s two things you should therefore do with your bootstrap reports:

  • Print them out and send them offsite with your media – offsite media storage companies will usually store other records for you as well (even if it is for a small extra fee), so there's a good chance yours will too. If you send your bootstraps offsite with your media, then in the event of a disaster recovery, a physical printout of your bootstrap report should also come back when you recall your media.
  • Email them to an external, secure email address – I recommend using a free, secure mail service, such as say, Google Mail. Doing so keeps electronic copies of the bootstrap available for easy access in the event of a disaster where internet access is still achievable even if local mail isn’t possible. Of course, the password for this account should be (a) kept secure and (b) changed every time someone who knows it leaves the company.

(Hint: if for some reason you need to generate a bootstrap report outside of the regular email, remember you can run mminfo -B at any time.)
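
For example, a hypothetical cron job along the following lines could capture the bootstrap information after each backup window and mail it to that external mailbox. The address and the use of mailx are assumptions; mminfo -B is the command mentioned above.

#!/bin/bash
# Hypothetical daily job: capture bootstrap information and mail it offsite.
OFFSITE_ADDR="bootstrap-offsite@example.com"        # assumed external mailbox
REPORT=$(mminfo -B 2>&1)

if [ -n "$REPORT" ]; then
    printf '%s\n' "$REPORT" | mailx -s "NetWorker bootstrap report - $(hostname) - $(date +%F)" "$OFFSITE_ADDR"
else
    echo "mminfo -B returned nothing - investigate before relying on offsite copies" >&2
    exit 1
fi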

Obviously this should be done in conjunction with local storage methods – local email, and local printouts.

Feb 05, 2009

When backup to disk is deployed, most sites just transition from their standard tape backup schedules to disk without any change. That is, daily incrementals (or differentials), with weekly fulls. This isn't necessarily the best way to make use of backup to disk, and I'll explain why in this post.

One of the traditional reasons why long incremental cycles aren't used in backup is the load and seek impact during recovery. That is, you'll certainly reduce the amount of data you back up if you do incrementals for a month, but if they're all going to tape, then the chances are that if you do a recovery towards the end of that month you may have a lot of tapes to load. Unless you're using high speed loading tapes (e.g., the StorageTek/Sun 98/99 series drives), this is going to make a significant impact on the recovery. Indeed, even with such drives, you're still going to have an impact that may be undesirable.

If you’re backing up to disk however, your options change. Disk seek times are orders of magnitude faster than tape seek times, and there’s no ‘load’ time associated with disk as opposed to tape media either.

In an average site where 'odd' things aren't happening (e.g., filesystem backups of databases, etc.), my experience is that nightly incrementals take up somewhere between 5% and 8% of a full backup. That is, if the full backups are 10TB, the incrementals sit somewhere around 512 GB – 819 GB.

We’ll use these numbers for an example – 10TB full, 820GB incremental. Over the course of an average, 4-week month then, the total data backed up using the weekly-full strategy will be:

  • 4 x 10TB fulls
  • (6 x 820GB) x 4 incrementals

For a total of 59TB of backup.

Looking at a monthly full scenario for a 31-day month, however, the sizing will instead be:

  • 1 x 10TB full
  • 30 x 820GB incrementals

This amounts to a total of 34TB of backup.

If you have to pay for a new array for disk backup units with enough space to hold a month's worth of backups, which would you rather pay for: 59TB of storage, or 34TB of storage?

(Of course, I know there’s some fudge space required in any such sizing – realistically you’d want to ensure that after you’ve fitted on everything you want to fit, there’s still enough room for another full backup. That way you’ve got sufficient space on disk to continue to backup to it while you’re staging data off.)
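
If you want to sanity-check those figures, here's a quick throwaway calculation using the worked numbers above (10TB fulls, 820GB incrementals); adjust the inputs to suit your own environment.

#!/bin/bash
# Reproduce the sizing comparison above: weekly fulls versus a single monthly full.
awk 'BEGIN {
    full = 10240; incr = 820                        # GB, per the example above
    weekly  = 4 * full + 4 * 6 * incr               # 4 fulls + 24 incrementals
    monthly = 1 * full + 30 * incr                  # 1 full + 30 incrementals
    printf "Weekly fulls : %6d GB (~%.0f TB)\n", weekly,  weekly / 1024
    printf "Monthly full : %6d GB (~%.0f TB)\n", monthly, monthly / 1024
}'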

Obviously the needs of each individual site must be evaluated, so I’m not advocating a blind switch to this method; instead, it’s a design option you should be aware of.

Jan 25, 2009

Having spent what seemed like much of 1999 coordinating the system administration efforts of a major Y2K project for an engineering company, I have fundamental problems with the plethora of journalists who claimed that the overall limited number of Y2K issues experienced meant it was never a problem and was thus a waste of money. (I invite such journalists to stop filling their cars with fuel, since they don't run out of fuel after they're filled – it's a similar logic.)

Thus I’m also aware of the difficulties posed by 2038 – that’s the point where we reach numbers that can no longer be expressed as seconds since 1 January 1970 in a 32-bit integer.

Interestingly, NetWorker doesn't seem to technically have the Y2038 problem, since that problem is meant to manifest in early 2038, yet NetWorker allows retention and browse periods to be specified for its savesets up to and including 31 December 2038 23:59:59.

However, NetWorker does still appear to have a pseudo-2038 issue in that it currently doesn’t allow you to specify a browse or retention period beyond 31 December 2038 23:59:59.

For instance:

[root@nox ~]# save -qb Yearly -e "12/31/2038 23:59:59" /etc/sysconfig
save: /etc/sysconfig  219 KB 00:00:01    115 files
[root@nox ~]# save -qb Yearly -e "01/01/2039 00:00:01" /etc/sysconfig
6890:(pid 18236): invalid expiration time: 01/01/2039 00:00:01

I have two theories for this – neither of which I'm willing to bet on without being an EMC engineer with access to the source code. Either NetWorker doesn't really store dates from 1 January 1970 (instead storing from some later point), or it has only partly surpassed the 32-bit barrier for date/time representation – e.g., the back-end supports it but the front-end doesn't, or the back-end doesn't support it and the front-end does, but knows the back-end doesn't and therefore blocks the request.

Either way, it’s something that companies have to be aware of.

Where does that leave you?

Well, being unable to set a browse/retention period beyond 2038 for now is hardly an insurmountable issue, nor is it an issue that should, for instance, discount NetWorker from active consideration at a site.

Instead, it suggests that data with long-term retention requirements – e.g., requirements exceeding 30 years, such as government archives, medical records that must be kept for the life of the patient, or academic records that must be kept for the life of the student – needs to be stored with well-established and documented policies in place for extending that retention as appropriate for backups.

Such policies aren’t difficult to enact. After all, data which is to be stored on tape for even 5+ years really should have policies to deal with recall and testing, and it goes without saying that data which is to be kept on tape for 30 years will most certainly need to be recalled for migration at some point during its lifetime. (Indeed, one could easily argue it would need to be recalled for migration to new media types at least 3 times alone.)

So, until NetWorker fully supports post-2038 dates, I'd recommend that companies with long-term data retention requirements document and establish extensions to their policies as follows:

  • Technical policies:
    • All backups that should be kept beyond 2038 should be appropriately tagged – whether that is simply by being within a particular pool, or by having a data expiration period beyond the year 2037, will depend on individual company requirements.
    • Each new release of NetWorker should be tested, or researched to confirm whether it supports post-2038 dates.
    • As soon as post-2038 dates are supported, the retention of savesets to be kept beyond 2038 should be extended.
  • Human resource policies:
    • New employee kit for system/backup administrators must make note of this requirement as an ongoing part of the job description of those who are responsible for data retention.
    • New employee kit for managers responsible for IT must make note of this requirement as an ongoing part of their job description.
    • HR policy guides should clearly state that these policies and requirements must be maintained, and be audited periodically.

With such policies in place, being unable to set a browse/retention period currently beyond 2038 should be little cause for concern.