Crunching NetWorker Deduplication Stats

If you use NetWorker with Data Domain, you’ve probably sometimes wanted to know which of your clients have the best deduplication – or perhaps more correctly, you’ve probably wanted to occasionally drill into clients delivering lower deduplication levels.

There are NMC and DPA reports that’ll pull this data for you, but likewise, you can get information directly from mminfo that’ll tell you details about deduplication for Data Domain Boost devices.

You’ll find the information you’re after within the output of mminfo -S. The -S option in mminfo provides a veritable treasure trove of extended details about individual savesets. To give you an example, let’s look at just one saveset in -S format. First, below, you’ll see the mminfo command to identify a saveset ID, then the -S option invoked against a single saveset.

[Sun Oct 24 13:43:02]

## ~  
## root@orilla 

$ mminfo -q "name=/nsr/scripts,savetime>=4 hours ago" -r "ssid,savetime(20)"
 ssid          date     time
3816067119    24/10/21 09:43:58
3665074337    24/10/21 10:20:00

[Sun Oct 24 13:43:10]

## ~  
## root@orilla 

$ mminfo -q "ssid=3665074337" -S
ssid=3665074337 savetime=24/10/21 10:20:00 (1635031200) orilla.turbamentis.int:/nsr/scripts
  level=manual sflags=vF       size=2453183044   files=18         insert=24/10/21
  create=24/10/21 complete=24/10/21 browse=24/11/21 10:20:00 retent=24/11/21 23:59:59
  clientid=c6fb4ece-00000004-5fdabaf1-5fdabaf0-00019ed8-a41317f3
          *backup start time: 1635031200;
*ss data domain backup cloneid: 1635031201;
*ss data domain dedup statistics: "v1:1635031201:2459622768:395458878:151768627";
                       group: NAS_PMdG;
            saveset features: CLIENT_SAVETIME;
  Clone #1: cloneid=1635031201  time=24/10/21 10:20:01    retent=24/11/21  flags=
    frag@         0 volid=3923434705 file/rec=       0/0     rn=0 last=24/10/21

There’s a support article that explains how to review this information, here. The line you’re looking for in particular though is the “*ss data domain dedup statistics” – outlined in that aforementioned support article.

I recently wanted to review some deduplication details, and while DPA and NMC are options, I needed to analyse some data over the weekend1 and as such didn’t have access to the live DPA or NMC host for a particular environment. So, since I’m not doing anything social while Melbourne’s COVID case numbers remain so high, I tasked myself with writing a script to analyse deduplication statistics from raw mminfo -S output.

The script I wrote (in Perl, of course), can either be run on a NetWorker server via a query option, or against the saved output of mminfo -S. The usage syntax is this:

Usage Options for dedupe-analysis

The script pulls the data and outputs at least three key data files, all in CSV format:

  • An “all” file that contains deduplication details for each saveset
  • A “client” file that contains summarised deduplication details for each client
  • A “type” file that contains summarised deduplication details for each client and backup type within the client (e.g., SQL, SAP, Oracle, Exchange, etc.)

Since a saveset, via cloning, can live on more than one Data Domain, the volume IDs are included in the output – and the second two files actually provide the breakdowns first by volume ID – so you can see stats on a per-Data Domain basis. There’s also an option to anonymise the host names in the output, and if you invoke that, a CSV will also be written containing the original hostname to anonymised hostname conversion2. I.e., this would allow you to send the anonymised version to someone for discussion, but privately lookup the real host in any subsequent discussion.

There’s an additional output option too, which can be handy if you’re analysing millions of savesets: it’s an option to create one output file per client. You still get the rollup data, but you’ll also get a per-client CSV file so you can deep-drill into individual client results with a better chance of avoiding Excel’s 1,000,000 row limit.

Here’s an example of an analysis run against my lab environment, with anonymisation turned on:

[Sat Oct 30 11:09:12]

## /nsr/scripts  
## root@orilla 

$ ./dedupe-analysis.pl -q "savetime>=20 years ago" -i -a -o PrestonLab

Processing saveset details
   ...processed 0 savesets
   ...processed 261 savesets total

Writing PrestonLab-all.csv
   ...written 0 saveset details
   ...wrote 261 saveset details
Written per-saveset details to PrestonLab-all.csv
Individual (per-client) file results requested. Processing.
   ...Writing noumenon.turbamentis.int data to PrestonLab/host-00000000.csv
   ...Writing orilla.turbamentis.int data to PrestonLab/host-00000001.csv

Writing PrestonLab-client.csv
Written per-client details to PrestonLab-client.csv

Writing PrestonLab-type.csv
Written per-client/type details to PrestonLab-type.csv

Writing host anonymisation mappings for private reference - do not distribute.
...anonymisation mappings written to PrestonLab-anonmap.csv.

So what sort of output does it generate? For some output examples, I’ve imported the generated CSV files into Excel and converted them to a table. Here’s an example of the “per-saveset” data:

Deduplicated Details by Individual Saveset Instance

When going by client rollup you’ll see content such as the following:

Deduplicated Details by Volume ID and Client

And the backup type output looks like the following:

Deduplicated Details by Volume ID, Client and BackupType

Now, they’re just lab environments there — and while accurate, they’re hardly edifying. What I was working towards was the analysis of a production environment. With host anonymisation turned on, here’s an example of that output:

Deduplication data by Volume ID, Client and Backup Type from a Production Environment

The only additional thing I’ve done in the example output there is to set number formatting on the Original/Post-Comp/Average Reduction columns.

If you’re interested in being able to run this against your own environment (or mminfo -S output, in general), here’s the script:

#!/usr/bin/perl -w

###########################################################################
# Modules
###########################################################################
use strict;
use File::Basename;
use Getopt::Std;
use Sys::Hostname;


###########################################################################
# Subroutines
###########################################################################

# in_list($elem,@list) returns true iff $elem appears in @list.
sub in_list {
	return 0 if (@_+0 < 2);
	
	my $element = $_[0];
	return 0 if (!defined($element));
	shift @_;
	
	my @list = @_;
	return 0 if (!@_ || @_+0 == 0);
	
	my $foundCount = 0;
	my $e = quotemeta($element);
	foreach my $item (@list) {
		my $i = quotemeta($item);
		$foundCount++ if ($e eq $i);
	}
	
	return $foundCount;
}

# show_help([@messages]) shows help, any additional messages, then exits.
sub show_help {
	my $self = basename($0);
	print <<EOF;
Usage:

$self [-h|-?] [-d] [-a] {-q query | -f file} -o file

Where:

-h | -?     Prints this help and exits.
-d          Enables debug mode for additional output.
-q query    Run nominated mminfo 'query' and analyse against results.
-f file     Run analysis against a file containing mminfo results.
-o file     File to write results data to. Do NOT include a file extension.
-a          Anonymise hostnames in output data.
-i          Write the 'all' savesets data as an individual CSV per client.

Analyses Data Domain deduplication statistics held in mminfo -S output
to build information about deduplication results on a per client and
workload type-basis.

EOF

	if (@_+0 > 0) {
		my @messages = @_;
		foreach my $message (@messages) {
			my $tmp = $message;
			chomp $tmp;
			print "$tmp\n";
		}
	}
	die "\n";
}

# get_backup_type($savesetname) returns the guessed backup type based on the saveset name.
sub get_backup_type {
	return "Other" if (@_+0 != 1);
	
	my $saveset = $_[0];
	my $backupType = "";
	
	if ($saveset =~ /^RMAN/) {
		$backupType = "Oracle";
	} elsif ($saveset =~ /^backint/) {
		$backupType = "SAP";
	} elsif ($saveset =~ /^SAPHANA/) {
		$backupType = "SAP HANA";
	} elsif ($saveset =~ /^VM/) {
		$backupType = "Virtual Machine";
	} elsif ($saveset =~ /^MSSQL/) {
		$backupType = "MSSQL";
	} elsif ($saveset =~ /^\// ||
			 $saveset =~ /^\<\d+\>\//) {
		$backupType = "Unix Filesystem";
	} elsif ($saveset =~ /^APPLICATIONS.*Exchange/) {
		$backupType = "Exchange";
	} elsif ($saveset =~ /^[A-Z]\:/ || 
			 $saveset =~ /SYSTEM/ ||
			 $saveset =~ /DISASTER/ ||
			 $saveset =~ /WINDOWS ROLES/ ||
			 $saveset =~ /\\VOLUME\{/ ||
			 $saveset =~ /\\\?\\GLOBALROOT/) { 	# Might need more test cases here
		$backupType = "Windows Filesystem";
	} elsif ($saveset =~ /^index/ || $saveset =~ /^bootstrap/ ) {
		$backupType = "NetWorker";
	} else {
		$backupType = "Other";
	}
	return $backupType;
}


###########################################################################
# Globals & main.
###########################################################################
my %opts = ();
my $version = "1.1";
my $method = "query";	# Default method is to seek a query from the command line.
my $query = "";			
my $file = "";			
my $debug = 0;			# Change to 1 for lots of messy debug output.
my $outFile = "";
my $baseOut = "";
my $byClientOut = "";
my $byTypeOut = "";
my $anonOut = "";
my $anonHosts = 0;
my %hostMap = ();	# Used for mapping hostnames to anonymised names.
my %vmMap = ();		# Used for mapping VM names to anonymised names.
my $hostCount = 0;	# Used for iterating anonymised hostnames.
my $vmCount = 0;	# Used for iterating anonymised VM names.
my $individualFile = 0;
my %clientElems = ();	# If we're doing individual file out, use this to speedily iterate through gathered data.


# Capture command line arguments.
if (getopts('h?vq:f:o:adi',\%opts)) {
	show_help() if (defined($opts{'h'}) || defined($opts{'?'}));
	show_help("This release: v$version") if (defined($opts{'v'}));
	
	if (defined($opts{'q'})) {
		$query = $opts{'q'};
	} elsif (defined($opts{'f'})) {
		$method = "file";
		$file = $opts{'f'};
		if (! -f $file) {
			show_help("Nominated file, '$file' does not exist or cannot be accessed.");
		}
	} else {
		show_help("You must specify a -q 'query' or -f 'file' option.")
	}

	$anonHosts = 1 if (defined($opts{'a'}));
	$individualFile = 1 if (defined($opts{'i'}));
	$debug = 1 if (defined($opts{'d'}));
	
	if (defined($opts{'o'})) {
		$outFile = $opts{'o'};
		if ($outFile =~ /\./) {
			show_help("Please don't give a file extension for the output base filename.");
		}
		$baseOut = $outFile . "-all.csv";
		$byClientOut = $outFile . "-client.csv";
		$byTypeOut = $outFile . "-type.csv";
		$anonOut = $outFile . "-anonmap.csv";
		
		if (-f $outFile || -f $baseOut || -f $byClientOut || -f $byTypeOut || -f $anonOut) {
			show_help("Output file exists. Pick another name.\nSelected output files are:\n" . join("\n",($baseOut,$byClientOut,$byTypeOut,$anonOut)));
		}
		if ($individualFile && -d $outFile) {
			show_help("Individual/per-client output specified but directory $outFile already exists.");
		}
	} else {
		show_help("You have not specified an output file.");
	}
}

my %dataSets = ();
my @rawData = ();
my $ssidCount = 0;
my $ddSavesets = 0;
my %volumeIDsbyCloneID = ();
if ($method eq "query") {
	my $queryCommand = "mminfo -q \'$query\' -S";
	if (open(MMI,"$queryCommand 2>&1 |")) {
		while (<MMI>) {
			my $line = $_;
			chomp $line;
			push(@rawData,$line);
			
			if ($line =~ /ss data domain dedup statistics/) {
				$ddSavesets++;
			}
			
			if ($line =~ /^ssid=\d+/) {
				$ssidCount++;
				if ($ssidCount % 100000 == 0) {
					print "...read $ssidCount saveset details ($ddSavesets on Data Domain)\n";
				}
			}
		}
	}
	close(MMI);
} elsif ($method eq "file") {
	if (open(FILE,$file)) {
		while (<FILE>) {
			my $line = $_;
			chomp $line;
			push(@rawData,$line);
			
			if ($line =~ /ss data domain dedup statistics/) {
				$ddSavesets++;
			}
			
			if ($line =~ /^ssid=\d+/) {
				$ssidCount++;
				if ($ssidCount % 100000 == 0) {
					print "...read $ssidCount saveset details ($ddSavesets on Data Domain)\n";
				}
			}
			
		}
	}
	close(FILE);
}

if ($ssidCount == 0) {
	die ("Did not find any savesets in an expected format in the $method.\n");
}
if ($ddSavesets == 0) {
	die ("Did not find any savesets on a Data Domain in the $method.\n");
}

# Now step through and discard all savesets that aren't on Data Domain devices.
my $count = 0;
my @dataSeg = ();
my $foundADDBoostSS = 0;
for (my $index = 0; $index < (@rawData+0); $index++) {
	my $line = $rawData[$index];
	
	if ($line =~ /^ssid/) {
		if ($index != 0) {
			# If we're at index 0 we're at the start of the file. We'll see
			# a line starting with 'ssid' but we won't have a prior record
			# to add to our dataSets.
			if ($foundADDBoostSS) {
				$debug && print "DEBUG: Boost saveset processing at index $index\n";

				$dataSets{$count} = join("\n",@dataSeg);
				@dataSeg = ();
				
				$foundADDBoostSS = 0;
				$count++;
				$debug && print "DEBUG: Incremented count to $count\n";
			}
		}
		@dataSeg = ($line);
	} else {
		push (@dataSeg,$line);
	}
	
	if ($line =~ /ss data domain dedup statistics/) {
		$foundADDBoostSS = 1;
	}
}
# Explicitly free up the initially read data. This is useful if you have >1M savesets
# in reducing the runtime footprint. (E.g., on sample of 1.7M savesets it uses 1/3 less
# memory over runtime.)
$debug && print "DEBUG: Deallocating raw data.\n";
undef @rawData;	

my %procData = ();
my $datum = 0;

my $procCount = 0;
print "\n\nProcessing saveset details\n";
foreach my $key (sort {$a <=> $b} keys %dataSets) {
	my $block = $dataSets{$key};
	my @block = split(/\n/,$block);
	
	my $ssid = "";
	my $savetime = "";
	my $nsavetime = "";
	my $client = "";
	my $saveset = "";
	my $level = "";
	my $ssflags = "";
	my $totalsize = 0;
	my $retention = "";
	my $dedupeStats = "";
	my $vmname = "";
	
	if ($procCount % 100000 == 0) {
		print "   ...processed $procCount savesets\n";
	}
	$procCount++;
	
	my $vm = 0;
	my $blockIndex = 0;
	foreach my $item (@block) {
		if ($item =~ /^ssid=(\d+) savetime=(.*) \((\d+)\) ([^:]*):(.*)/) {
		
			$ssid = $1;
			$savetime = $2;
			$nsavetime = $3;
			$client = $4;
			$saveset = $5;
			
			$procData{$datum}{ssid} = $ssid;
			$procData{$datum}{savetime} = $savetime;
			$procData{$datum}{client} = $client;
			$procData{$datum}{nsavetime} = $nsavetime;
			$procData{$datum}{saveset} = $saveset;
			
			if ($anonHosts) {
				if (%hostMap && defined($hostMap{$client})) {
					# Nothing to do here.
				} else {
					$hostMap{$client} = sprintf("host-%08d",$hostCount);
					$hostCount++;
				}
			}
			
			$debug && print "\nDEBUG: START $ssid ($savetime - $nsavetime): $client|$saveset\n";
		}
		
		if ($item =~ /^\s*Clone \#\d+: cloneid=(\d+)/) {
			my $cloneID = $1;
			my $volID = 0;
			# Here we assume any DD saveset only has a single frag to keep
			# things simple. This may break if someone clones something out
			# to tape.
			my $nextLine = $block[$blockIndex+1];
			if (!defined($nextLine)) {
				# Orphaned saveset/clone instance without a volume. Set VolID to Zero.
				$volID = 0;
			} else {
				if ($nextLine =~ /.*volid=\s*(\d+) /) {
					$volID = $1;
				} else {
					# Something odd going on here. Set VolID to Zero.
					$volID = 0;
				}
			}
			$volumeIDsbyCloneID{$ssid}{$cloneID} = $volID;
		}
		
		if ($item =~ /vcenter_hostname/) {
			$vm = 1;
		}
		if ($vm == 1 && $item =~ /^\s+\\"name\\": \\"(.*)\\"/) {
			$vmname = $1;
			if ($anonHosts) {
				if (%vmMap && defined($vmMap{$vmname})) {
					$procData{$datum}{vmname} = $vmMap{$vmname};
				} else {
					$vmMap{$vmname} = sprintf("vm-%09d",$vmCount);
					$vmCount++;
					$procData{$datum}{vmname} = $vmMap{$vmname};
				}
			} else {
				$procData{$datum}{vmname} = $vmname;
			}
	
			$vm = 0;
		}
		
		if ($item =~ /^\s+level=([^\s]*)\s+sflags=([^\s]*)\s+size=(\d+).*/) {
			$level = $1;
			$ssflags = $2;
			$totalsize = $3;
			
			$procData{$datum}{level} = $level;
			$procData{$datum}{ssflags} = $ssflags;
			$procData{$datum}{totalsize} = $totalsize;
			
			$debug && print "DEBUG: ----> Level = $level, ssflags = $ssflags, totalsize = $totalsize\n";
		}
		
		if ($item =~ /create=.* complete=.* browse=.* retent=(.*)$/) {
			$retention = $1;
			
			$procData{$datum}{retention} = $retention;
			
			$debug && print "DEBUG: ----> Retention = $retention\n";
		}
		
		if ($item =~ /^\*ss data domain dedup statistics: "(.*)"(\,|;)\s*$/) {
			$dedupeStats = $1;
			my $eol = $2;
			if ($eol eq ",") {
				my $stop = 0;
				my $lookahead = $blockIndex + 1;
				while (!$stop) {
					my $tmpLine = $block[$lookahead];
					if ($tmpLine =~ /^\s*"(.*)"(\,|;)\s*$/) {
						my $tmpstat = $1;
						my $tmpset = $2;
						$dedupeStats .= "\n" . $tmpstat;
						$stop = 1 if ($tmpset eq ";");
					}
					$lookahead++;
				}
			}
			
			$procData{$datum}{dedupe_stats} = $dedupeStats;
			
			$debug && print "DEBUG: ----> Dedupe stats: " . join("|||",split(/\n/,$dedupeStats)) . "\n";
		} elsif ($item =~ /^\*ss data domain dedup statistics: \\\s*$/) {
			my $stop = 0;
			my $lookahead = $blockIndex + 1;
			while (!$stop) {
				my $tmpLine = $block[$lookahead];
				if ($tmpLine =~ /^\s*"(.*)"(\,|;)\s*$/) {
					my $tmpstat = $1;
					my $tmpset = $2;
					$dedupeStats .= "\n" . $tmpstat;
					$stop = 1 if ($tmpset eq ";");
				}
				$lookahead++;
			}

			$dedupeStats =~ s/^\n(.*)/$1/s;			
			$procData{$datum}{dedupe_stats} = $dedupeStats;
			
			$debug && print "DEBUG: ----> Dedupe stats: " . join("|||",split(/\n/,$dedupeStats)) . "\n";		
		}
		
		$blockIndex++;
	}
	$datum++;
}
print ("   ...processed $procCount savesets total\n");

# Now dump $dataSets to save memory. This doesn't save as much
# as the previous @rawData drop but retaining in case it's useful
# for someone on a low-memory system.
$debug && print "DEBUG: Deallocating pre-parsed datasets.\n";
undef %dataSets;

# Now do a quick post-process through the clients if we have host anonymisation turned on
# to catch any index savesets and adjust them. It should be safe to do so here.
if ($anonHosts) {
	foreach my $elem (keys %procData) {
		if ($procData{$elem}{saveset} =~ /^index:(.*)/) {
			my $indexHostname = $1;
			if (%hostMap && defined($hostMap{$indexHostname})) {
				$procData{$elem}{saveset} = "index:" . $hostMap{$indexHostname};
			} else {
				$hostMap{$indexHostname} = sprintf("host-%08d",$hostCount);
				$hostCount++;
				$procData{$elem}{saveset} = "index:" . $hostMap{$indexHostname};
			}
		}
	}
}

my %clientData = ();

# As we write out the base file, assemble the client rollup data.
my @clients = ();
print ("\n\nWriting $baseOut\n");
my $countOut = 0;
if (open(OUTP,">$baseOut")) {
	print OUTP ("SSID,CloneID,VolumeID,Client,Savetime,Level,Name,OriginalMB,PreLCompMB,PostLCompMB,Reduction(:1)\n");
	foreach my $elem (keys %procData) {
		my $dedupeStats = $procData{$elem}{dedupe_stats};
		my @dedupeStats = split("\n",$dedupeStats);
		if ($countOut % 100000 == 0) {
			print "   ...written $countOut saveset details\n";
		}
		$countOut++;
	
		foreach my $stat (@dedupeStats) {
			if ($stat =~ /^v1:(\d+):(\d+):(\d+):(\d+)/) {
				my $clID = $1;
				my $original = $2;
				my $precomp = $3;
				my $postcomp = $4;
			
				my $client = $procData{$elem}{client};
				my $saveset = $procData{$elem}{saveset};
				my $savetime = $procData{$elem}{savetime};
				my $ssid = $procData{$elem}{ssid};
				my $level = $procData{$elem}{level};
				my $vmName = (defined($procData{$elem}{vmname})) ? $procData{$elem}{vmname} : "";
				my $volumeID = $volumeIDsbyCloneID{$ssid}{$clID};
				
				# Only build up the client<>element mappings if we have to output
				# individual clients.
				if ($individualFile) {
					if (in_list($client,@clients)) {
						push(@{$clientElems{$client}},$elem);
					} else {
						push (@clients,$client);
						@{$clientElems{$client}} = ($elem);
					}
				}
				
				if ($vmName ne "") {
					$saveset = "VM:$vmName";
				}
				$original = $original / 1024 / 1024;	#MB
				$precomp = $precomp / 1024 / 1024;		#MB
				$postcomp = $postcomp / 1024 / 1024;	#MB
				my $reduction = $original / $postcomp;	#:1
				
				my $backupType = get_backup_type($saveset);
				
				$procData{$elem}{backuptype} = $backupType;	# Store this in case we need it for individuals.
				
				if (%clientData &&
					defined($clientData{by_client}) &&
					defined($clientData{by_client}{$volumeID}) &&
					defined($clientData{by_client}{$volumeID}{$client})) {
					$clientData{by_client}{$volumeID}{$client}{original} += $original / 1024;
					$clientData{by_client}{$volumeID}{$client}{postcomp} += $postcomp / 1024;
					$clientData{by_client}{$volumeID}{$client}{count}++;
				} else {
					$clientData{by_client}{$volumeID}{$client}{original} = $original / 1024; 	# Store in GB
					$clientData{by_client}{$volumeID}{$client}{postcomp} = $postcomp / 1024;	# Store in GB			
					$clientData{by_client}{$volumeID}{$client}{count} = 1;
				}
				
				if (%clientData &&
					defined($clientData{by_type}) &&
					defined($clientData{by_type}{$volumeID}) &&
					defined($clientData{by_type}{$volumeID}{$client}) &&
					defined($clientData{by_type}{$volumeID}{$client}{$backupType})) {
					
					$clientData{by_type}{$volumeID}{$client}{$backupType}{original} += $original / 1024;	# Store in GB
					$clientData{by_type}{$volumeID}{$client}{$backupType}{postcomp} += $postcomp / 1024;	# Store in GB
					$clientData{by_type}{$volumeID}{$client}{$backupType}{count}++;
				} else {
					$clientData{by_type}{$volumeID}{$client}{$backupType}{original} = $original / 1024;	# Store in GB
					$clientData{by_type}{$volumeID}{$client}{$backupType}{postcomp} = $postcomp / 1024;	# Store in GB
					$clientData{by_type}{$volumeID}{$client}{$backupType}{count} = 1;
				}
				
				my $finalClientName = ($anonHosts == 1) ? $hostMap{$client} : $client;
				
				print OUTP ("$ssid,$clID,$volumeID,$finalClientName,$savetime,$level,$saveset,$original,$precomp,$postcomp,$reduction\n");
			
				if ($debug) {			
					print ("DEBUG: " . $client . "," . $saveset . " (ClID $clID)\n");

					printf ("DEBUG: Original Size: %.2f MB\n",$original);
					printf ("DEBUG:     Pre-LComp: %0.2f MB\n",$precomp);
					printf ("DEBUG:    Post-LComp: %0.2f MB\n",$postcomp);
			
					printf ("Debug:     Reduction: %0.2f:1\n",$reduction);
					print "DEBUG: \n";
				}
			}
		}
	}
}
close(OUTP);
print "   ...wrote $countOut saveset details\n";
print "Written per-saveset details to $baseOut\n";

# If we need to output per-client details, do that now.
if ($individualFile) {
	print "Individual (per-client) file results requested. Processing.\n";
	mkdir("$outFile");
	if (-d $outFile) {
		foreach my $client (sort {$a cmp $b} @clients) {
			my $filename = $client;
			# Override here.
			if ($anonHosts) {
				$filename = $hostMap{$client};
			}
			$filename =~ s/\./_/g;
			$filename = $outFile . "/" . $filename . ".csv";
			if (open(OUTP,">$filename")) {
				print "   ...Writing $client data to $filename\n";
				print OUTP ("SSID,CloneID,VolumeID,Client,VMName,Savetime,Level,Name,OriginalMB,PreLCompMB,PostLCompMB,Reduction(:1)\n");
				my @elemList = @{$clientElems{$client}};
				foreach my $elem (@elemList) {
					next if ($procData{$elem}{client} ne $client);
				
					# Else...
					my $dedupeStats = $procData{$elem}{dedupe_stats};
					my @dedupeStats = split("\n",$dedupeStats);
					foreach my $stat (@dedupeStats) {
						if ($stat =~ /^v1:(\d+):(\d+):(\d+):(\d+)/) {
							my $clID = $1;
							my $original = $2;
							my $precomp = $3;
							my $postcomp = $4;
		
							my $saveset = $procData{$elem}{saveset};
							my $savetime = $procData{$elem}{savetime};
							my $ssid = $procData{$elem}{ssid};
							my $level = $procData{$elem}{level};
							my $vmName = (defined($procData{$elem}{vmname})) ? $procData{$elem}{vmname} : "";
							my $volumeID = $volumeIDsbyCloneID{$ssid}{$clID};
				
							push (@clients,$client) if (!in_list($client,@clients));
				
							if ($vmName ne "") {
								$saveset = "VM:$vmName";
							}
							$original = $original / 1024 / 1024;	#MB
							$precomp = $precomp / 1024 / 1024;		#MB
							$postcomp = $postcomp / 1024 / 1024;	#MB
							my $reduction = $original / $postcomp;	#:1
							my $backupType = $procData{$elem}{backuptype};
							
							my $finalClient = ($anonHosts == 1) ? $hostMap{$client} : $client;
							my $finalSaveset = "";
							
							if ($anonHosts) {
								if ($vmName ne "") {
									$finalSaveset = $vmName;
								} else {
									$finalSaveset = $saveset;
								}
							} else {
								$finalSaveset = $saveset;
							}
										
							print OUTP ("$ssid,$clID,$volumeID,$finalClient,$vmName,$savetime,$level,$finalSaveset,$original,$precomp,$postcomp,$reduction\n");
						}
					}
				}
			}
			close(OUTP);
		}
	} else {
		die "Unable to create directory $outFile\n";
	}
}

# Write file: Summary by VolumeID and Client.
print("\n\nWriting $byClientOut\n");
if (open(OUTP,">$byClientOut")) {
	print OUTP ("VolumeID,Client,Total Original (GB),Total Post-Comp (GB),Average Reduction\n");
	foreach my $volumeID (keys %{$clientData{by_client}}) {
		foreach my $client (sort {$a cmp $b} keys %{$clientData{by_client}{$volumeID}}) {
			my $finalClientName = ($anonHosts == 1) ? $hostMap{$client} : $client;
			printf OUTP ("%s,%s,%.8f,%.8f,%.8f\n",
							$volumeID,
							$finalClientName,
							$clientData{by_client}{$volumeID}{$client}{original},
							$clientData{by_client}{$volumeID}{$client}{postcomp},
							$clientData{by_client}{$volumeID}{$client}{original} / $clientData{by_client}{$volumeID}{$client}{postcomp});
		}
	}
}
close(OUTP);
print "Written per-client details to $byClientOut\n";

# Write file: Summary by VolumeID, Client and BackupType.
print ("\n\nWriting $byTypeOut\n");
if (open(OUTP,">$byTypeOut")) {
	print OUTP ("VolumeID,Client,Backup Type,Total Original (GB),Total Post-Comp (GB),Average Reduction\n");
	foreach my $volumeID (sort {$a cmp $b} keys %{$clientData{by_type}}) {
		foreach my $client (sort {$a cmp $b} keys %{$clientData{by_type}{$volumeID}}) {
			foreach my $type (sort {$a cmp $b} keys %{$clientData{by_type}{$volumeID}{$client}}) {
				my $finalClientName = ($anonHosts == 1) ? $hostMap{$client} : $client;
				printf OUTP ("%s,%s,%s,%.8f,%.8f,%.8f\n",
								$volumeID,
								$finalClientName,
								$type,
								$clientData{by_type}{$volumeID}{$client}{$type}{original},
								$clientData{by_type}{$volumeID}{$client}{$type}{postcomp},
								$clientData{by_type}{$volumeID}{$client}{$type}{original} / $clientData{by_type}{$volumeID}{$client}{$type}{postcomp});
			}
		}
	}
}
close(OUTP);
print "Written per-client/type details to $byTypeOut\n";

# If we're anonymising hosts, write data that can be kept private to map
# anonymised hostnames to real hostnames.
if ($anonHosts) {
	print "\n\nWriting host anonymisation mappings for private reference - do not distribute.\n";
	if (open(ANONMAP,">$anonOut")) {
		print ANONMAP "Client Name,Anonymous Mapping\n";
		foreach my $client (sort {$a cmp $b} keys %hostMap) {
			print ANONMAP "$client,$hostMap{$client}\n";
		}
		
		print ANONMAP "\n";
		print ANONMAP "Virtual Machine,Anonymous Mapping\n";
		foreach my $vm (sort {$a cmp $b} keys %vmMap) {
			print ANONMAP "$vm,$vmMap{$vm}\n";
		}
		close (ANONMAP);
	}
	print "...anonymisation mappings written to $anonOut.\n";
}

One thing to note in the script — the breakdown of what types of backups there are was dependent on the saveset information that I had available to me. So it while it covers things like Windows and Unix filesystems, Oracle, SAP and MSSQL, it doesn’t include coverage for identifying Lotus Notes, DB2, and so on. (There’s a subroutine, get_backup_type, that interprets the backup type from the saveset name, that you’d need to modify if you wanted additional types.)

Footnotes

  1. There are, unfortunately, only so many hours in a busy work-day.
  2. The “all” output does not attempt to anonymise saveset names – though it will anonymise virtual machine names

1 thought on “Crunching NetWorker Deduplication Stats”

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.