Wednesday, 13 April 2016

Is Deduplication Useless on Archive Data?

One of the techniques that storage vendors use to reduce the cost of hard disk-based storage is deduplication. Deduplication is the elimination of redundant data across files. The technology is ideal for backup, since so much of a current copy of data is similar to the prior copy. The few extra seconds required to identify redundant data are worth the savings in disk capacity. Deduplication for primary storage is popular on all-flash arrays. While the level of redundancy is not as great, the premium price of flash makes any capacity savings important.

In addition, given the excess performance of all-flash arrays, the deduplication feature can often be added without a noticeable performance impact. There is one use case, though, where deduplication provides little value: archive. IT professionals need to measure costs differently when considering a storage destination for archive.

Why Deduplication Fails in Archive

When correctly implemented, an archive should be where the last known good copy of data is stored. A file stored in the archive is typically unique, and often only the final version of that file is kept. Given that the archive storage target is not going to be flash, the low archive cost per GB means that the redundancy rate would need to be fairly high to justify deduplication. While there is some potential for redundancy between the various files stored in an archive, it should be relatively low.

A second and sometimes a third copy of the archive data may be created, but these copies are designed to maintain chain of custody and to recover archive data in the case of a disaster. In both cases, these copies must be separate, standalone data copies and cannot be part of a deduplication process.
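
The break-even argument in this section can be put into numbers. The following is a minimal sketch, assuming a simple model in which deduplication carries a fixed overhead cost (licensing, extra compute) and saves capacity in direct proportion to the redundancy rate. The function name and every figure are hypothetical placeholders, not vendor pricing.

```python
def breakeven_redundancy(logical_gb, cost_per_gb, dedup_overhead):
    """Return the fraction of data that must be redundant before the
    capacity saved by deduplication covers its overhead cost.

    Assumed model (not from any vendor): savings = rate * logical_gb * cost_per_gb.
    A result above 1.0 means deduplication can never pay for itself.
    """
    if logical_gb <= 0 or cost_per_gb <= 0:
        raise ValueError("capacity and cost per GB must be positive")
    return dedup_overhead / (logical_gb * cost_per_gb)


# Hypothetical figures: the same 500 TB data set and the same overhead,
# stored on a low-cost archive tier versus a premium flash tier.
archive_rate = breakeven_redundancy(500_000, 0.02, 4_000)
flash_rate = breakeven_redundancy(500_000, 0.50, 4_000)

print(f"Archive tier break-even redundancy: {archive_rate:.1%}")
print(f"Flash tier break-even redundancy:   {flash_rate:.1%}")
```

Under these placeholder numbers, the cheaper the tier, the higher the redundancy rate needed before deduplication pays for itself, which is the problem for an archive whose files are mostly unique.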

A Different Storage Calculation

Because deduplication yields little on archive data, storage costs for archive need to be calculated very differently. The only efficiency technique that will work is compression, and only in some cases. Without an effective data-reduction technology, storage costs need to be based on raw capacity, not the effective capacity that some vendors feature in their brochures.

Basing storage costs on raw capacity is relatively easy, but it involves more than dividing the cost of the storage media by its capacity. The “system” cost needs to be calculated. For a disk-based archive, this means including the storage controller hardware and the disk shelves that surround the actual hard disk media. In some cases empty disk shelves can be more expensive than the drives that go in them. For tape libraries, the cost of the library and its drives needs to be factored into the price.
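
As a rough illustration of the difference between media cost and “system” cost per raw GB, here is a small sketch. The function name and all dollar amounts are hypothetical, chosen only to show the arithmetic.

```python
def cost_per_raw_gb(media_cost, infrastructure_cost, raw_capacity_gb):
    """Cost per raw GB once the surrounding hardware is included.

    infrastructure_cost stands for everything around the media:
    controllers and shelves for disk, or the library and drives for tape.
    """
    if raw_capacity_gb <= 0:
        raise ValueError("raw capacity must be positive")
    return (media_cost + infrastructure_cost) / raw_capacity_gb


# Hypothetical 1 PB (1,000,000 GB) disk archive.
media = 40_000        # hard disk drives
controller = 30_000   # storage controller hardware
shelves = 20_000      # empty disk shelves

media_only = cost_per_raw_gb(media, 0, 1_000_000)
system = cost_per_raw_gb(media, controller + shelves, 1_000_000)

print(f"Media-only cost per raw GB: ${media_only:.3f}")
print(f"System cost per raw GB:     ${system:.3f}")
```

With these placeholder values the system cost per raw GB is more than double the media-only figure, which is why the media price alone understates what the archive really costs.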

Comparing Disk to Tape

Without the aid of deduplication, the comparison of disk and tape for long-term archiving is more interesting (and realistic). Both disk and tape will have large upfront costs associated with them: the cost of the storage controller for disk and the empty library for tape. A hard disk-based archive will add the cost of additional shelves, and each piece of disk media has a fair amount of technology in it. A tape archive will need to factor in the cost of buying enough tape drives to sustain writes to, and restores from, the archive, but the media has almost no technology associated with it and its cost per GB is relatively low.

Assuming that the upfront costs for a disk-based and a tape-based archive are similar, the cost to scale the disk archive, thanks to more expensive shelves and media, will outpace the cost to scale a tape-based archive.
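
The scaling argument can be sketched with a simple cost model. This is an illustration under stated assumptions, not a pricing tool: the function names and all figures are hypothetical, upfront costs are set equal as the text assumes, and the tape model deliberately ignores the additional tape drives a growing archive may eventually need.

```python
import math


def disk_archive_cost(capacity_tb, upfront, shelf_cost, shelf_tb, drive_cost_per_tb):
    """Total cost of a disk archive: controller, whole shelves, and drives."""
    shelves = math.ceil(capacity_tb / shelf_tb)  # shelves are bought whole
    return upfront + shelves * shelf_cost + capacity_tb * drive_cost_per_tb


def tape_archive_cost(capacity_tb, upfront, cartridge_cost_per_tb):
    """Total cost of a tape archive: library with drives, plus cartridges."""
    return upfront + capacity_tb * cartridge_cost_per_tb


# Hypothetical figures with identical upfront costs for both systems.
UPFRONT = 50_000
for capacity_tb in (1_000, 5_000):
    disk = disk_archive_cost(capacity_tb, UPFRONT, shelf_cost=8_000,
                             shelf_tb=200, drive_cost_per_tb=25)
    tape = tape_archive_cost(capacity_tb, UPFRONT, cartridge_cost_per_tb=8)
    print(f"{capacity_tb} TB: disk ${disk:,}, tape ${tape:,}, gap ${disk - tape:,}")
```

In this model the gap between the two totals widens as capacity grows, because every added terabyte of disk carries both shelf and drive cost while tape adds only cartridges.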


The lack of redundant data makes deduplication almost useless when considering archive capacity costs. Raw capacity is more critical, and as these archives scale, a disk-only approach becomes harder to justify on cost. For optimal cost containment, a small disk front end that does not need to scale should be paired with a large, scalable tape back end.
