Betting the company

Backup theory, Best Practice, Databases, General Technology
Jun 15, 2016
 

Short of networking itself, backup and recovery systems touch more of your infrastructure than anything else. So it’s pretty common for backup and recovery specialists to be asked how we can protect a ten- or sometimes even twenty-year-old operating system or application.

Sure, you can back up Windows 2012, but what about NT 4?

Sure, you can back up Solaris 11, but what about Tru64 v5?

Sure, you can back up Oracle 12, but what about Oracle 8?

These really are questions we get asked.

I even have an active Windows 2003 SMB server sitting in my home lab, running as an RDP jump-point. My home lab.

Gambling the lot

So it’s probably time for me to admit: I’m not really speaking to backup administrators with this article, but to the broader infrastructure teams and, probably more so, the risk officers within companies.

Invariably we get asked if we can back up AncientOS 1.1 or DefunctDatabase 3.2 because those systems are still in production use within the business. Sometimes they’re even running pseudo-mission-critical services, but more often than not they’re simply running essential services the business has deemed too costly to migrate to another platform.

I’m well aware of this. In 1999 I was the primary system administrator involved in a Y2K remediation project for a SAP deployment. The system as deployed was running on an early version of Oracle 8 as I recall (it might have been Oracle 7 – it was 17 years ago…), sitting on Tru64 with an old (even for then) version of SAP. The version of the operating system, the version of Oracle, the version of SAP and even things like the firmware in the DAS enclosures attached were all unsupported by the various vendors for Y2K.

The remediation process was tedious and slow because we had to do piecemeal upgrades of everything around SAP and beg for Y2K compliance exceptions from Oracle and Digital for specific components. Why? When the business had deployed SAP two years before, they’d spent $5,000,000 or so customizing it to the nth degree, and upgrading it would require a similarly horrifically expensive remediation customization project. It was, quite simply, easier and cheaper to risk periphery upgrades around the application.

It worked. (As I recall, the only system in the company that failed over the Y2K transition was the Access database put together at the last minute by some tech-boffin-project manager designed to track any Y2K incidents over the entire globe for the company. I’ve always found there to be beautiful irony in that.)

This is how these systems limp along within organisations. It costs too much to change them. It costs too much to upgrade them. It costs too much to replace them.

And so day by day, month by month, year by year, the business continues to bet that bad things won’t happen. And what’s the collateral for the bet? Well it could be the company itself. If it costs that much to change them, upgrade them or to replace them, what’s the cost going to be if they fail completely? There’s an old adage of a CEO and a CIO talking, and the CIO says: “Why are you paying all this money to train people? What if you train them and they leave?” To which the CEO responds, “What if we don’t train them and they stay?” I think this is a similar situation.

I understand. I sympathise – even empathise, but we’ve got to find a better way to resolve this problem, because it’s a lot more than just a backup problem. It’s even more than a data protection problem. It’s a data integrity problem, and that creates an operational integrity problem.

So why is the question “do you support X?” asked when the original vendor for X doesn’t even support it any more – and may not have done for a decade or more?

The question is not really whether we can supply backup agents or backup modules old enough to work with systems no longer supported by their vendor of origin, or whether you can get access to a knowledge base that stretches back far enough to include details of those systems. Supply? Yes. Officially support? How much official support do you get from the vendor of origin?

I always think in these situations there’s a broader conversation to be had. Those legacy applications and operating systems are a sea anchor to your business at a time when you increasingly have to be able to steer and move the ship faster and with greater agility. Those scenarios where you’re reliant on technology so old it’s no longer supported are exactly those sorts of scenarios that are allowing startups and younger, more agile competitors to swoop in and take customers from you. And it’s those scenarios that also leave you exposed to an old 10GB ATA drive failing, or a random upgrade elsewhere in the company finally and unexpectedly resulting in that critical or essential system no longer being able to access the network.

So how do we solve the problem?

Sometimes there’s a simple workaround – virtualisation. If it’s an old x86 based platform, particularly Windows, there’s a good chance the system can be virtualised so it can at least run on modern hardware. That doesn’t solve the ‘supported’ problem, but it does mean greater protection: image level backups regardless of whether there’s an agent for the virtual machine’s guest, and snapshots and replication to reduce the need to ever consider a BMR. Being old, those systems usually hold minimal data, so that type of protection is not an issue.

But the real solution comes from being able to modernise the workload. We talk about platforms 1, 2 and 3 – platform 1 is the old mainframe approach to the world, platform 2 is the classic server/desktop architecture we’ve been living with for so long, and platform 3 is the new, mobile and cloud approach to IT. Some systems even get classified as platform ‘2.5’ – that interim step between the current and the new. What’s the betting that old curmudgeonly system that’s holding your business back from modernising is more like platform 1.5?

One way you can modernise is to look at getting innovative with software development. Increasing requirements for agility will drive more IT departments back to software development for platform 3 environments, so why not look at this as an opportunity to grow that development environment within your business? That’s where the EMC Federation can really swing in to help: Pivotal Labs is premised on new approaches to software development. Agile may seem like a buzz-word, but if you can cut software development down from 12-24 months to 6-12 weeks (or less!), doesn’t that mitigate many of the cost reasons to avoid dealing with the legacy platforms?

The other way of course is with traditional consulting approaches. Maybe there’s a way that legacy application can be adapted, or archived, in such a way that the business functions can be continued but the risk substantially reduced and the platform modernised. That’s where EMC’s consultancy services come in, where our content management services come in, and where our broad experience across hundreds of thousands of customer environments comes in. Because I’ll be honest: your problems aren’t actually unique. You’re not the only business dealing with legacy system components, and while there may be industry-specific or even customer-specific aspects that are tricky, there’s a very, very good chance that somewhere, someone has gone through the same situation. The solution could very well be tailored specifically for your business, but the processes and tools used to get you to that solution don’t necessarily have to be bespoke.

It’s time to stop thinking about whether those ancient and unsupported operating systems and applications can be backed up, and start thinking about how they can be modernised so they stop holding the business back.

The Importance of Being Earnestly Automated

Architecture, Best Practice, General Technology
Apr 13, 2016
 

It was not long after I started in IT that I got the most important advice of my career. It came from a senior Unix system administrator in the team I’d just joined, and it has shaped my work ever since. In just eight words it stated the purpose of the system administrator and, I think, of IT as a whole:

The best system administrator is a lazy one.

On the face of it, it seems inappropriate advice: be lazy; yet that’s just the superficial reading of it. The real intent was this:

Automate everything you have to repeatedly do.

Automation

One of the reasons I was originally so blasé about Cloud was that it was old-hat. The same way that mainframe jockeys yawned and rolled their eyes when midrange people started talking about the wonders of virtualisation, I listened to people in IT extolling Cloud and found myself rolling my eyes – not just over the lack of data protection in early Cloud solutions, but also at the stories about how Cloud was agile. And there are no prizes for guessing where agility comes from: automation.

It surprises me twenty years on that the automation debate is still going on, and some people remain unconvinced.

There are three fundamental results of automation:

  • Repeatability
  • Reliability
  • Verifiability

When something is properly automated, it can be repeated easily and readily. That’s a fundamental tenet driving Cloud agility: you click a button on a portal and hey presto, a virtual machine is spun up and you receive an IP address to access it. Or you click a button on a portal and suddenly you’ve got yourself a SQL database or Exchange server or CRM system or any one of hundreds of different applications or business functions. If there’s human intervention at the back-end between when you click the button and when you get your service, it’s not agile. It’s not Cloud. And it’s certainly not automated – well, not fully or properly.

With repeatability comes reliability – accuracy. It doesn’t matter whether the portal has been up for 1 hour or 1,000 hours, it doesn’t matter whether it’s 01:00 or 13:00, and it doesn’t matter how many requests the portal has received: it’s not prone to error, and it won’t miss a check-box because it’s rushed or tired or can’t remember what the correct option is. It doesn’t matter whether the computer doing the work in the background has never done it before because it’s just been added to the resource pool, or whether it’s done the process a million times before. Automation isn’t just about repeatability, it’s about reliable repeatability.

Equally importantly, with automation – with repeatability – there comes verifiability. Not only can you reliably repeat the same activity time and time again, but every time it’s executed you can verify it was executed. You can monitor, measure and report. This can range from the simplest – verifying it was performed successfully or throwing an exception for a human to investigate – to the more complex, such as tracking and reporting trends in how long automated processes take to complete, so you can keep an eye on how the system is scaling.
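
To make that concrete, here’s a minimal sketch (Python, with entirely hypothetical task and log names) of the sort of wrapper that gives an automated activity those three properties: it can be run repeatedly, it behaves the same way every time, and every execution is timed and logged so that failures become exceptions for a human to investigate.

```python
import logging
import time

logging.basicConfig(filename="automation.log", level=logging.INFO)

def run_automated_task(name, task, *args, **kwargs):
    """Run a task, record how long it took, and verify the outcome.

    Successful runs are logged (so duration trends can be tracked over
    time); failures are logged and re-raised for a human to investigate.
    """
    start = time.time()
    try:
        result = task(*args, **kwargs)
    except Exception:
        logging.exception("%s FAILED after %.1fs", name, time.time() - start)
        raise
    logging.info("%s completed in %.1fs", name, time.time() - start)
    return result

# 'provision_vm' is a stand-in for whatever the portal button really calls.
def provision_vm(hostname):
    return {"hostname": hostname, "ip": "198.51.100.10"}

if __name__ == "__main__":
    vm = run_automated_task("provision-vm", provision_vm, "test-vm-01")
    print(vm["ip"])
```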

Once you’ve got automation in place, you’ve freed up your IT staff from boring and repetitive duties. That’s not to remove them from their jobs, but to let them do the jobs humans do best: dealing with the unexpected, and thinking of new solutions. Automated, repeatable tasks are best left to scripts and processes and even robots (when it comes to production). The purpose of being a lazy system administrator was not so that you could sit at your desk doing nothing all day, but so you could spend time handling exceptions and errors, designing new systems, working on new projects and, yes, automating new systems.

Automation is not just a Cloud thing. Automation is not just a system administration thing. Or a database/application administration thing. Or a build thing. Or a…

Automation is everything in IT, particularly in the infrastructure space. Cloud has well and truly raised the profile of automation, but the fundamental concept is not new. I’d go so far as to say that if your business isn’t focused on automation, you’re doing IT wrong.

Dec 22, 2015
 

As we approach the end of 2015 I wanted to spend a bit of time reflecting on some of the data protection enhancements we’ve seen over the year. There’s certainly been a lot!

Protection

NetWorker 9

NetWorker 9 of course was a big part of the changes in the data protection landscape in 2015, but that’s by no means the only advancement we saw. I covered some of the advances in NetWorker 9 in my initial post about it (NetWorker 9: The Future of Backup), but to summarise just a few of the key new features, we saw:

  • A policy based engine that unites backup, cloning, snapshot management and protection of virtualisation into a single, easy to understand configuration. Data protection activities in NetWorker can be fully aligned to service catalogue requirements, and the easier configuration engine actually extends the power of NetWorker by offering more complex configuration options.
  • Block based backups for Linux filesystems – speeding up backups for highly dense filesystems considerably.
  • Block based backups for Exchange, SQL Server, Hyper-V, and so on – NMM for NetWorker 9 is a block based backup engine. There’s a whole swathe of enhancements in NMM version 9, but the 3-4x backup performance improvement has to be a big win for organisations struggling against existing backup windows.
  • Enhanced snapshot management – I was speaking to a customer only a few days ago about NSM (NetWorker Snapshot Management), and his reaction to NSM was palpable. Wrapping NAS snapshots into an effective and coordinated data protection policy with the backup software orchestrating the whole process from snapshot creation, rollover to backup media and expiration just makes sense as the conventional data storage protection and backup/recovery activities continue to converge.
  • ProtectPoint Integration – I’ll get to ProtectPoint a little further below, but being able to manage ProtectPoint processes in the same way NSM manages file-based snapshots will be a big win as well for those customers who need ProtectPoint.
  • And more! – VBA enhancements (notably the native HTML5 interface and a CLI for Linux), NetWorker Virtual Edition (NVE), dynamic parallel savestreams, NMDA enhancements, restricted datazones and scaleability all got a boost in NetWorker 9.

It’s difficult to summarise everything that came in NetWorker 9 in so few words, so if you’ve not read it yet, be sure to check out my essay-length ‘summary’ of it referenced above.

ProtectPoint

In the world of mission critical databases, where minimising impact on the application host is a must yet backup performance is equally essential, ProtectPoint is an absolute game changer. To quote Alyanna Ilyadis, when it comes to those really important databases within a business,

“Ideally, you’d want the performance of a snapshot, with the functionality of a backup.”

Think about the real bottleneck in a mission critical database backup: the data gets transferred (even best case) via fibre-channel from the storage layer to the application/database layer before being passed across to the data protection storage. Even if you direct-attach data protection storage to the application server, or even if you mount a snapshot of the database at another location, you still have the fundamental requirement to:

  • Read from production storage into a server
  • Write from that server out to protection storage

ProtectPoint cuts the middle-man out of the equation. By integrating storage level snapshots with application layer control, the process effectively becomes:

  • Place database into hot backup mode
  • Trigger snapshot
  • Pull database out of hot backup mode
  • Storage system sends backup data directly to Data Domain – no server involved

That in itself is a good starting point for performance improvement – your database is only in hot backup mode for a few seconds at most. But then the real power of ProtectPoint kicks in. You see, when you first configure ProtectPoint, a block based copy from primary storage to Data Domain storage starts in the background straight away. With Change Block Tracking incorporated into ProtectPoint, the data transfer from primary to protection storage kicks into high gear – only the changes between the last copy and the current state at the time of the snapshot need to be transferred. And the Data Domain handles creation of a virtual synthetic full from each backup – full backups daily at the cost of an incremental. We’re literally seeing backup performance improvements in the order of 20x or more with ProtectPoint.
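
None of the calls below are the real ProtectPoint API – the objects and method names are hypothetical placeholders – but as a conceptual sketch of the orchestration described above, it looks something like this:

```python
def storage_integrated_backup(database, primary_storage, data_domain):
    """Conceptual sketch only: hypothetical objects and methods, not a real API.

    The application host is involved just long enough to quiesce the
    database; the data path runs from primary storage directly to the
    protection storage.
    """
    database.begin_hot_backup()                 # seconds, not hours
    try:
        snapshot = primary_storage.create_snapshot(database.volumes)
    finally:
        database.end_hot_backup()               # database is fully live again

    # Only the blocks changed since the last copy are sent, and the
    # protection storage synthesises a full backup from each increment.
    changed = primary_storage.changed_blocks_since_last_copy(snapshot)
    data_domain.ingest(changed)
    return data_domain.create_virtual_synthetic_full(snapshot)
```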

There are some great videos explaining what ProtectPoint does, the sorts of problems it solves, and even how it integrates into NetWorker 9.

Database and Application Agents

I’ve been in the data protection business for nigh on 20 years, and if there’s one thing that’s remained remarkably consistent throughout that time it’s that many DBAs are unwilling to give up control over the data protection configuration and scheduling for their babies.

It’s actually understandable for many organisations. In some places it’s entrenched habit, and in those situations you can integrate data protection for databases directly into the backup and recovery software. Other organisations, though, have complex scheduling requirements based on batch jobs, data warehousing activities and so on, which can’t possibly be controlled by a regular backup scheduler. Those organisations need to initiate the backup job for a database not at a particular time, but when it’s the right time – and based on the amount of data or the amount of processing, that could be a highly variable time.
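
As a hedged illustration only (the flag file and backup script below are invented), that kind of condition-driven scheduling tends to look less like “run at 02:00” and more like this:

```python
import subprocess
import time
from pathlib import Path

# Hypothetical marker written by the overnight batch/data-warehouse run.
BATCH_COMPLETE_FLAG = Path("/var/run/dwh/load_complete")

def wait_for_batch(poll_seconds=300):
    """Block until the batch processing signals completion."""
    while not BATCH_COMPLETE_FLAG.exists():
        time.sleep(poll_seconds)

def run_database_backup():
    # Placeholder for whatever the DBA actually invokes, e.g. an RMAN
    # script or a vendor backup utility.
    subprocess.run(["/opt/dba/bin/backup_database.sh"], check=True)

if __name__ == "__main__":
    wait_for_batch()        # the backup starts when it's the *right* time...
    run_database_backup()   # ...not at a fixed wall-clock time
```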

The traditional problem with database and application backups being handled outside of the backup product is that the backup data tends to be written to primary storage, which is expensive. It’s normally more than one copy, too: I’d hazard a guess that 3-5 copies is the norm for most database backups written to primary storage.

The Database and Application agents for Data Domain allow a business to sidestep all these problems by centralising the backups for mission critical systems onto highly protected, cost effective, deduplicated storage. The plugins work directly with each supported application (Oracle, DB2, Microsoft SQL Server, etc.) and give the DBA full control over managing the scheduling of the backups while ensuring those backups are stored under management of the data protection team. What’s more, primary storage is freed up.

Formerly known as “Data Domain Boost for Enterprise Applications” and “Data Domain Boost for Microsoft Applications”, the Database and Application Agents respectively reached version 2 this year, enabling new options and flexibility for businesses. Don’t just take my word for it though: check out some of the videos about it here and here.

CloudBoost 2.0

CloudBoost version 1 was released last year and I’ve had many conversations with customers interested in leveraging it over time to reduce their reliance on tape for long term retention. You can read my initial overview of CloudBoost here.

2015 saw the release of CloudBoost 2.0. This significantly extends the storage capabilities for CloudBoost, introduces the option for a local cache, and adds the option for a physical appliance for businesses that would prefer to keep their data protection infrastructure physical. (You can see the tech specs for CloudBoost appliances here.)

With version 2, CloudBoost can now scale to 6PB of cloud managed long term retention, and every bit of that data pushed out to a cloud is deduplicated, compressed and encrypted for maximum protection.

Spanning

Cloud is a big topic, and a big topic within that big topic is SaaS – Software as a Service. Businesses of all types are placing core services in the Cloud to be managed by providers such as Microsoft, Google and Salesforce. Office 365 Mail is proving very popular for businesses who need enterprise class email but don’t want to run the services themselves, and Salesforce is probably the most likely mission critical SaaS application you’ll find in use in a business.

So it’s absolutely terrifying to think that SaaS providers don’t really back up your data. They protect their infrastructure from physical faults, and from their own faults, but their SLAs around data deletion are pretty straightforward: if you deleted it, they can’t tell whether it was intentional or an accident. (And if it was an intentional delete, they certainly can’t tell whether it was authorised or not.)

Data corruption and data deletion in SaaS applications is far too common an occurrence, and sadly, for many businesses it’s only after it happens for the first time that people become aware of what those SLAs do and don’t cover.

Enter Spanning. Spanning integrates with the native hooks provided in Salesforce, Google Apps and Office 365 Mail/Calendar to protect the data your business relies on so heavily for day to day operations. The interface is dead simple, the pricing is straight forward, but the peace of mind is priceless. 2015 saw the introduction of Spanning for Office 365, which has already proven hugely popular, and you can see a demo of just how simple it is to use Spanning here.

Avamar 7.2

Avamar got an upgrade this year, too, jumping to version 7.2. Virtualisation got a big boost in Avamar 7.2, with new features including:

  • Support for vSphere 6
  • Scaleable up to 5,000 virtual machines and 15+ vCenters
  • Dynamic policies for automatic discovery and protection of virtual machines within subfolders
  • Automatic proxy deployment: This sees Avamar analyse the vCenter environment and recommend where to place virtual machine backup proxies for optimum efficiency. Particularly given the updated scaleability of Avamar for VMware environments, taking the hassle out of proxy placement is going to save administrators a lot of time and guess-work. You can see a demo of it here.
  • Orphan snapshot discovery and remediation
  • HTML5 FLR interface

That wasn’t all though – Avamar 7.2 also introduced:

  • Enhancements to the REST API to cover tenant level reporting
  • Scheduler enhancements – you can now define the start dates for your annual, monthly and weekly backups
  • You can browse replicated data from the source Avamar server in the replica pair
  • Support for DDOS 5.6 and higher
  • Updated platform support including SLES 12, Mac OS X 10.10, Ubuntu 12.04 and 14.04, CentOS 6.5 and 7, Windows 10, VNX2e, Isilon OneFS 7.2, plus a 10Gbe NDMP accelerator

Data Domain 9500

Already the market leader in data protection storage, EMC continued to stride forward with the Data Domain 9500, a veritable beast. Some of the quick specs of the Data Domain 9500 include:

  • Up to 58.7 TB per hour (when backing up using Boost)
  • 864TB usable capacity for the active tier, up to 1.7PB usable when an extended retention tier is added. That’s the physical amount of storage; once deduplication is factored in, the amount of protection data stored can reach well into the multiple-PB range. The spec sheet gives some details based on a mixed environment, where the protected data might be anywhere from 8.6PB to 86.4PB
  • Support for traditional ES30 shelves and the new DS60 shelves.

Actually it wasn’t just the Data Domain 9500 that was released this year from a DD perspective. We also saw the release of the Data Domain 2200 – the replacement for the SMB/ROBO DD160 appliance. The DD2200 supports more streams and more capacity than the previous entry-level DD160, being able to scale from a 4TB entry point to 24TB raw when expanded to 12 x 2TB drives. In short: it doesn’t matter whether you’re a small business or a huge enterprise: there’s a Data Domain model to suit your requirements.

Data Domain Dense Shelves

The traditional ES30 Data Domain shelves have 15 drives. 2015 also saw the introduction of the DS60 – dense shelves capable of holding sixty disks. With support for 4 TB drives, that means a single 5RU Data Domain DS60 shelf can hold as much as 240TB in drives.

The benefits of high density shelves include:

  • Better utilisation of rack space (60 drives in one 5RU shelf vs 60 drives in 4 x 3RU shelves – 12 RU total)
  • More efficient for cooling and power
  • Scale as required – each DS60 takes 4 x 15 drive packs, allowing you to start with just one or two packs and build your way up as your storage requirements expand

DDOS 5.7

Data Domain OS 5.7 was also released this year, and includes features such as:

  • Support for DS60 shelves
  • Support for 4TB drives
  • Support for ES30 shelves with 4TB drives (DD4500+)
  • Storage migration support – migrate those older ES20 style shelves to newer storage while the Data Domain stays online and in use
  • DDBoost over fibre-channel for Solaris
  • NPIV for FC, allowing up to 8 virtual FC ports per physical FC port
  • Active/Active or Active/Passive port failover modes for fibre-channel
  • Dynamic interface groups are now supported for managed file replication and NAT
  • More Secure Multi-Tenancy (SMT) support, including:
    • Tenant-units can be grouped together for a tenant
    • Replication integration:
      • Strict enforcing of replication to ensure source and destination tenant are the same
      • Capacity quota options for destination tenant in a replica context
      • Stream usage controls for replication on a per-tenant basis
    • Configuration wizards support SMT for
    • Hard limits for stream counts per Mtree
    • Physical Capacity Measurement (PCM) providing space utilisation reports for:
      • Files
      • Directories
      • Mtrees
      • Tenants
      • Tenant-units
  • Increased concurrent Mtree counts:
    • 256 Mtrees for Data Domain 9500
    • 128 Mtrees for each of the DD990, DD4200, DD4500 and DD7200
  • Stream count increases – DD9500 can now scale to 1,885 simultaneous incoming streams
  • Enhanced CIFS support
  • Open file replication – great for backups of large databases, etc. This allows the backup to start replicating before it’s even finished.
  • ProtectPoint for XtremIO

Data Protection Suite (DPS) for VMware

DPS for VMware is a new socket-based licensing model for mid-market businesses that are highly virtualized and want an effective enterprise-grade data protection solution. Providing Avamar, Data Protection Advisor and RecoverPoint for Virtual Machines, DPS for VMware is priced based on the number of CPU sockets (not cores) in the environment.

DPS for VMware is ideally suited for organisations that are either 100% virtualised or just have a few remaining machines that are physical. You get the full range of Avamar backup and recovery options, Data Protection Advisor to monitor and report on data protection status, capacity and trends within the environment, and RecoverPoint for a highly efficient journaled replication of critical virtual machines.

…And one minor thing

There was at least one other bit of data protection news this year, and that was me finally joining EMC. I know in the grand scheme of things it’s a pretty minor point, but after years of wanting to work for EMC it felt like I was coming home. I had worked in the system integrator space for almost 15 years and have a great appreciation for the contribution integrators bring to the market. That being said, getting to work from within a company that is so focused on bringing excellent data protection products to the market is an amazing feeling. It’s easy from the outside to think everything is done for profit or shareholder value, but EMC and its employees have a real passion for their products and the change they bring to IT, business and the community as a whole. So you might say that personally, me joining EMC was the biggest data protection news for the year.

In Summary

I’m willing to bet I forgot something in the list above. It’s been a big year for Data Protection at EMC. Every time I’ve turned around there’s been new releases or updates, new features or functions, and new options to ensure that no matter where the data is or how critical the data is to the organisation, EMC has an effective data protection strategy for it. I’m almost feeling a little bit exhausted having come up with the list above!

So I’ll end on a slightly different note (literally). If after a long year working with or thinking about Data Protection you want to chill for five minutes, listen to Kate Miller-Heidke’s cover of “Love is a Stranger”. She’s one of the best artists to emerge from Australia in the last decade. It’s hard to believe she did this cover over two years ago now, but it’s still great listening.

I’ll see you all in 2016! Oh, and don’t forget the survey.

Stop, Collaborate and Listen (Shareware)

General Technology, General thoughts
May 10, 2014
 

Starting today, I’m offering Stop, Collaborate and Listen in shareware format as a micromanual.

You’re encouraged to register, download and read the micromanual, but requested not to distribute it. If you find it useful, you’re requested to purchase it from Amazon, where the royalty will be like a colourful explosion of fireworks in the day of this most humble consultant.

If you find it really useful, you might even want to contact me to discuss how I could consult with your IT team to make it a reality for your business.

Click here to access the registration form and download.

Here’s a reminder of what Stop, Collaborate and Listen is about:

I’ve been an IT consultant for close to two decades. In that time I’ve worked with a large number of IT departments, ranging from those in small, privately held businesses through to departments servicing world-wide Fortune 500 companies. I’ve seen some excellent examples of how the best IT departments and workers align to their business, but I’ve also seen what doesn’t work. Stop, Collaborate and Listen succinctly provides guidance on what to do in order to get the business/IT relationship working smoothly.

Business Talks

Stop, Collaborate and Listen

General Technology, General thoughts
Apr 28, 2014
 

I’ve been an IT consultant for close to two decades. During that time I’ve worked with a large number of IT departments, ranging from those in small, privately held businesses to departments servicing world-wide Fortune 500 companies. Those businesses have been in just about all industry verticals: Telecommunications, Mining, Education (Higher and Tertiary), Government (Local, State, Federal), Finance and Banking, Manufacturing, Importation, Research, and so on.

As you can imagine, during that time I’ve seen some excellent examples of how IT departments can best align to their businesses, and I’ve also seen what doesn’t work. Stop, Collaborate and Listen is a short eBook, a micromanual, which outlines three essential steps an IT department needs to take in order to ensure it remains relevant to the parent business. Ultimately, the IT/Business relationship is just that – a relationship. And all relationships need to be built on respect, understanding and communication. Stop, Collaborate and Listen provides a starting guide to IT managers and staff on how to ensure the business relationship is at its best.

An early draft of one of the topics covered in Stop, Collaborate and Listen can be viewed here. You can buy the book from the Amazon Kindle Store ($3.99 US) using one of the links below:

Kept brief for the busy IT worker and manager, Stop, Collaborate and Listen is an essential guide to ensuring your IT department works closely with the core business.

Nov 15, 2012
 

I’m directing this at all IT vendors – your job is to enable me. Enable me to work, enable me to sell, enable me to speak with authority on your products.

It’s the same story, regardless of whether I’m a system integrator, a solutions architect, a support consultant or a customer.

Regardless of my role, I want – I need – vendors to enable me to do my work. If they want to sell their hardware, or their software, or their services, I need to be convinced, and I need to be informed.

So what happens when I go to genericvendor.com/support, and want to pull down all the documentation for a specific product?

…click to download the install guide

…click to download the admin guide

…click to download the command reference guide

…click to download the error guide

…click to download the release notes

And so on, and so forth. It’s a click-hell – and for so many vendors, it’s not even one-click per document. It’s multiple clicks. If I want to learn about product X I might have to download 10, 20, 30 documents, and go through the click-hell with each document.

Some vendors offer download managers. It’s a bit of a clueless response – “It’s a problem, we’ll introduce ANOTHER download into the equation for you!”

There’s a simple solution though: zip. Or tar.gz*. That’s your download manager.

You’re a vendor, you have more than 2 documents for a product? Give me a zip file of all the documents. It should be as simple as:

Login > Click Support > Click Product > Download all docs…

(And that’s assuming you want people to be logged in before they access your documentation.)

Of course, that may mean I’ll get more documents than I need. I may not need to know how to integrate AS400 systems with your FCoE storage product over a WAN. But here’s the thing: I’ll accept that some of what I download in that consolidated zip file is dross, and I won’t complain about it, so long as I can download it all in one hit.

Oh, and when I open that zip file and unpack all the documents? Have them named properly – not by serial number or part number or some internal variant of ISBN or Dewey Decimal, or some indecipherable 30-random-character filename dreamed up by a document management system that not only achieved sentience, but also went insane on the same day. If it’s an administration guide for product X version Y, call it “Product X Version Y Administration”, or something logical like that. That way my first act after downloading your documentation isn’t a tedious “Preview > Find Title > Close Preview > Rename File > Type new Filename”. Even on a Mac, with excellent content-based search capabilities, having a logical filename makes data so much easier to find.
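
For what it’s worth, none of this is hard to do on the vendor side. Here’s a minimal sketch – the part numbers, paths and titles below are invented for illustration – of bundling a product’s documentation into a single archive with logical filenames:

```python
import zipfile
from pathlib import Path

# Hypothetical mapping from document-management identifiers to the
# human-readable titles a customer actually wants to see.
DOCS = {
    "docs/302-001-123.pdf": "Product X Version Y Administration.pdf",
    "docs/302-001-124.pdf": "Product X Version Y Install Guide.pdf",
    "docs/302-001-125.pdf": "Product X Version Y Command Reference.pdf",
}

def build_doc_bundle(output="Product-X-vY-Documentation.zip"):
    """Write every document into one zip, renamed logically."""
    with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as bundle:
        for source, logical_name in DOCS.items():
            bundle.write(Path(source), arcname=logical_name)
    return output

if __name__ == "__main__":
    print(build_doc_bundle())
```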

It’s not much to ask for.

For goodness sakes, it’s so logical that I shouldn’t even need to ask.

Do you want me to know about your product, or not?

PS: Regrettably I’ve not had much opportunity to blog recently. My RSI has been particularly savage of late.


* If you suggest “.7z” or “.rar”, I will smack you.

 

The hard questions

Aside, General Technology
Jul 31, 2012
 

There are three hard questions that every company must be prepared to ask when it comes to data:

  1. Why do you care about your data?
  2. When do you care about your data?
  3. Who cares most about your data?

Sometimes these are not pleasant questions, and the answers may be very unpleasant. If they are, it’s time to revisit how you deal with data at your company.

Why do you care about your data?

…Do you care about your data because you’re tasked to care about it?

…Do you care about your data because you’re legally required to care about it?

…Or do you care about your data because it’s the right thing to do?

There’s no doubt that the first two reasons – being tasked, and being legally required, to care about data – are compelling and valid reasons to do so. Chances are, if you’re in IT, then at some layer being tasked with data protection, or being legally required to ensure data protection, will play some part in your job.

Yet neither reason is actually sufficiently compelling at all times. If everything we did in IT came down to job description or legal requirements, every job would be just as ‘glamorous’ as every other, and as many people would be eager to work in data protection as are in say, security, or application development.

Ultimately, people will care the most about data when they feel it’s the right thing to do. That is, when there’s an intrinsically felt moral obligation to care about it.

When do you care about your data?

…Do you care about your data when it is in transit within the network?

…Do you care about your data when it is at rest on your storage systems?

…Or do you care about your data when it’s been compromised?

The answer of course, should be always. At every part of the data lifecycle – at every location data can be found, it should have a custodian, and a custodian who cares because it’s the right thing to do. Yet, depressingly, we see clear examples time and time again where companies apparently only care about data when it’s been compromised.

(In this scenario, by compromise, I’m not referring solely to the classic security usage of the word, but to any situation where data is in some way lost or inappropriately modified.)

Who cares most about your data?

…Your management team?

…Your technical staff?

…Your users?

…Or external consultants?

For all intents and purposes, I’ve been an external consultant for the last 12+ years of my career. Ever since I left standard system administration behind, I’ve been working for system integrators, and as such when I walk into a business I’ve got that C-word title: consultant.

However, on several occasions over the course of my career, one thing has been abundantly, terrifyingly clear to me: I’ve cared more about the customer data than their own staff. Not all the staff, but typically more than two of the sub-groups mentioned above. This should not – this should never be the case. Now, I’m not saying I shouldn’t have to care about customer data: far from it. Anyone who calls themselves a consultant should have a deep and profound respect and care about the data of each customer he or she deals with. Yet, the users, management and technical staff at a company should always care more about their data than someone external to that customer.

Back to the hard questions

So let’s revisit those hard questions:

  1. Why do you care about your data?
  2. When do you care about your data?
  3. Who cares most about your data?

If your business has not asked those questions before, the key stakeholders may not like the answers, but I promise this: not asking them doesn’t change those answers. Until they’re answered, and addressed, a higher level of risk will exist in the business than there should be.

The IT organism

General Technology, General thoughts
Jul 2, 2012
 

Is an IT department like an organism?

If you were to work with that analogy, you might compare the network to the central nervous system, but after that, things will start to get a bit hazy and arbitrary.

Is the fileserver or the database server the heart?

Is the CTO the brain of the organism, or is it that crusty engineer who has been there since the department was started and is seemingly the go-to person on any complex question about how things work?

Truth be told, comparing IT to an organism generally isn’t all that suitable an analogy – but there is one aspect, unfortunately, where the comparison does work.

How many IT departments have you seen over the years where unstructured, uncontrolled organic growth overwhelms the otherwise orderly function of the department? Sometimes it’s an individual sub-group exerting too much control and refusing to work within the bounds of a cooperative budget. Other times it’s an individual project that has just got way out of control and no-one is willing to pull the plug.

Even if we struggle to keep up the analogy of IT-as-an-organism, there’s an ugly medical condition that unstructured, uncontrolled organic growth threatening to overwhelm the IT department (or a section thereof) can be compared to: cancer.

You see, it’s often easy to disregard such growth as just being about numbers – number of hours, number of dollars, but no real impact. Yet, having watched a previous employer crash and burn while two cancerous activities ate away at the engineering department, it’s something I’m acutely aware of when I’m dealing with companies. Most companies make the same mistake, too – they ignore the growth because they see it as just a numbers game. At the coal face though it’s not. You’ve potentially got people knowing that they’re working on a doomed or otherwise pointless project. Or you’ve got people who are impacted by that uncontrolled growth coming out of another section. Or worse, the overall parent business is affected because IT is no longer doing the job it was commissioned to do all those years ago.

I learnt to read at the same time as I learnt to talk, thanks to a severe speech impediment and lots – lots – of flashcards. It had a variety of profound influences on how I deal with the world, something I’ve really only come to grasp in the last 12 months. For instance, some words and phrases spark a synaesthesia response – a word is not just a word, but a picture as well. For me, “calling a spade a spade”, so to speak, can be about conveying the mental image I get when I think of a word or phrase. In this case, when I hear about “unstructured organic growth” within an organisation, the mental image of a tumour immediately appears.

Like real cancer, there’s no easy solution. An IT department in this situation has some difficult and quite possibly painful decisions to make. Terminating an overrunning project, for instance, is a classic scenario. After all, much as it’s easy to say “don’t throw good money after bad”, we’re all human, and the temptation is to let things run a little longer in case they suddenly rectify themselves.

That’s how you can get 1 year into a 3 month new system implementation project and still not be finished.

Many managers complain that backup systems are a black hole, and I’m the first to admit that if you don’t budget correctly, they can indeed become a financial sump. However, I’m also the first to challenge the blanket rule that backups just suck budget – they have CapEx, they have OpEx, and planned and amortised correctly, they are no more likely to cause a budget blow-out than any other large system within an organisation. In a well running backup environment, a financial blow-out in backup costing usually means there’s a problem elsewhere: either storage capacity is not being adequately monitored and forecast, or systems growth is not being adequately monitored and forecast.

Yet, as a consultant, once you’re embedded within an organisation – even if you’ve had to push through budgetary considerations for backups at an excruciating level of detail and precision – you’re equally likely to encounter at least one, if not more, areas of cancerous growth within the IT department. That might sound like a gripe – I don’t mean it that way. I just mean: uncontrolled, organic growth is nothing to be ashamed of, and it’s not unique to any organisation. In fact, I’d hazard a guess that pretty much every IT organisation will encounter such a situation every few years.

Like the proverbial problem of sticking your head in the sand, the lesson is not to insist they never happen – that would be nice, but it just doesn’t play well with human nature. The real challenge is to encourage an open communications strategy that allows people to freely raise concerns. It may sound trite, but an IT organisation that promotes the Toyota Way is one to be envied: a belief in continuous improvement rather than focusing on huge changes, and a preparedness to allow anyone to put their hand up and ask, “Wait. Should we keep doing this?”

Feb 1, 2012
 

Percentage Complete

I’d like to suggest that we should specify that “percentage complete” estimates – be they progress bars or sliders or any other representation, visual or textual – need a defined unit of measurement.

And we should define that unit of measurement as a maybe.

That is, if a piece of software reports that it is 98% complete at something, that’s 98 maybes out of 100.

I should perhaps mention that I’m not thinking of NetWorker when I make this case. Indeed, it actually springs from spending 4+ hours one day monitoring a backup job from one of NetWorker’s competitors – a backup job that for the entire duration was at … 99% complete.

You see, in a lot of software, progress indicators just aren’t accurate. This led to the term “Microsoft minute”, for instance, to describe the interminable, reality-bending specification of time remaining on file copies in Microsoft operating systems. Equally we can say the same thing of software installers; an installer may report that it’s 95% complete with 1 minute remaining for anywhere between 15 seconds and 2 hours – or more. It’s not just difficult to give an upper ceiling; it’s indeterminate.

I believe that software which can’t measure its progress with sufficient accuracy shouldn’t give an actual percentage complete status or time to complete status without explicitly stating it as being an estimate. To fail to do so is an act of deceit to the user.

I would also argue that no software can measure its progress with sufficient accuracy, and thus all software should provide completion status as an estimate rather than a hard fact. After all:

  • Software cannot guarantee against making a blocking IO call
  • Software cannot guarantee that the operating system will not take resources away from it
  • Software cannot guarantee that a physical fault will not take resources away from it

In a real-time and fault-tolerant system, there is a much higher degree of potential accuracy. Outside of that – in regular software (commercial or enterprise), and on regular hardware/operating systems, the potential for interruption (and therefore, inaccuracy) is too great.

I don’t personally think it’s going to hurt interface designers to clearly state whenever a completion estimate is given that it’s an estimate. Of course, some users won’t necessarily notice it, and others will ignore it – but by blatantly saying it, they’re not implicitly raising false hope by citing an indeterminate measurement as accurate.
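
As a trivial sketch of what that looks like in practice (not taken from any particular product), labelling the number honestly costs an interface almost nothing:

```python
def format_progress(done_units, total_units):
    """Report completion as an explicit estimate rather than a fact.

    done_units/total_units is whatever imperfect measure the software
    has (files copied, bytes written, steps finished).
    """
    if total_units <= 0:
        return "Progress: unknown"
    percent = 100.0 * done_units / total_units
    return f"Approximately {percent:.0f}% complete (estimate only)"

print(format_progress(98, 100))   # "Approximately 98% complete (estimate only)"
```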

Jan 27, 2012
 

Continuing on from my post about dark data last week, I want to spend a little more time on data awareness classification and distribution within an enterprise environment.

Dark data isn’t the end of the story, and it’s time to introduce the entire family of data-awareness concepts. These are:

  • Data – This is both the core data managed and protected by IT, and all other data throughout the enterprise which is:
    • Known about – The business is aware of it;
    • Managed – This data falls under the purview of a team in terms of storage administration (ILM);
    • Protected – This data falls under the purview of a team in terms of backup and recovery (ILP).
  • Dark Data – To quote the previous article, “all those bits and pieces of data you’ve got floating around in your environment that aren’t fully accounted for”.
  • Grey Data – Grey data is previously discovered dark data for which no decision has been made as yet in relation to its management or protection. That is, it’s now known about, but has not been assigned any policy or tier in either ILM or ILP.
  • Utility Data – This is data which is subsequently classified out of grey data state into a state where the data is known to have value, but is not either managed or protected, because it can be recreated. It could be that the decision is made that the cost (in time) of recreating the data is less expensive than the cost (both in literal dollars and in staff-activity time) of managing and protecting it.
  • Noise – This isn’t really data at all, but all the “bits” (no pun intended) that are left which are neither data, grey data nor utility data. In essence, this is irrelevant data which someone or some group may be keeping for unnecessary reasons, and which should in fact be considered eligible for deletion, or for archival and deletion.
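
Expressed as a simple structure (a sketch only – the state names and descriptions simply mirror the definitions above), the classification might look like this:

```python
from enum import Enum

class DataAwareness(Enum):
    DATA = "known, managed (ILM) and protected (ILP)"
    DARK_DATA = "not yet accounted for"
    GREY_DATA = "discovered, but no ILM/ILP decision made yet"
    UTILITY_DATA = "valuable but recreatable; deliberately not managed or protected"
    NOISE = "irrelevant; eligible for archival and/or deletion"

def needs_urgent_decision(state: DataAwareness) -> bool:
    """Grey data should only ever be a short-lived, transitional state."""
    return state is DataAwareness.GREY_DATA

print(needs_urgent_decision(DataAwareness.GREY_DATA))   # True
```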

The distribution of data by awareness within the enterprise may resemble something along the following lines:

Data Awareness Percentage Distribution

That is, ideally the largest percentage of data should be regular data which is known, managed and protected. In all likelihood for most organisations, the next biggest percentage of data is going to be dark data – the data that hasn’t been discovered yet. Ideally however, after regular and dark data have been removed from the distribution, there should be at most 20% of data left, and this should be broken up such that at least half of that remaining data is utility data, with the last 10% split evenly between grey data and noise.

The logical implications of this layout should be reasonably straightforward:

  1. At all times the majority of data within an organisation should be known, managed and protected.
  2. It should be expected that at least 20% of the data within an organisation is undiscovered, or decentralised.
  3. Once data is discovered, it should exist in a ‘grey’ state for a very short period of time; ideally it should be reclassified as soon as possible into data, utility data or noise. In particular, data left in a grey state for an extended period of time represents just as dangerous a potential data loss situation as dark data.

It should be noted that regular data, even in this awareness classification scheme, will still be subject to regular data lifecycle decisions (archive, tiering, deletion, etc.) In that sense, primary data eligible for deletion isn’t really noise, because it’s previously been managed and protected; noise really is ex dark-data that will end up being deleted, either as an explicit decision, or due to a failure at some future point after the decision to classify it as ‘noise’, having never been managed or protected in a centralised, coordinated manner.

Equally, utility data won’t refer to say, Q/A or test databases that replicate the content of production databases. These types of databases will again have fallen under the standard data umbrella in that there will have been information lifecycle management and protection policies established for them, regardless of what those policies actually were.

If we bring this back to roles, then it’s clear that a pivotal role of both the DPAs (Data Protection Advocates) and the IPAC (Information Protection Advisory Council) within an organisation should be the rapid coordination of classification of dark data as it is discovered into one of the data, utility data or noise states.
