One of the techniques that storage vendors use to reduce the
cost of hard disk-based storage is deduplication: the elimination of redundant
data across files. The technology is ideal for backup, since so much of the
current copy of data is similar to the prior copy, and the few extra seconds
required to identify redundant data are worth the savings in disk capacity.
Deduplication of primary storage is popular on all-flash arrays (AFAs). While
the level of redundancy there is not as great, the premium price of flash makes
any capacity savings important.
In addition, given the excess performance of
AFAs, the deduplication feature can often be added without a noticeable
performance impact. There is one process, though, where deduplication provides
little value: archive. IT professionals need to measure costs differently when
considering a storage destination for archive.
Why Deduplication Fails in Archive
When correctly implemented, an archive should be where the
last known good copy of data is stored. Each file stored in the archive is
typically unique, and often only the final version of a file is retained. Given
that the archive storage target is not going to be flash, the low archive cost
per GB means that the redundancy rate would need to be fairly high to justify
deduplication. While there is some potential for redundancy between the various
files stored in an archive, it should be relatively low.
A second, and sometimes a third, copy of the data stored in the
archive is often created, but these copies are designed to maintain chain
of custody and to recover archive data in the case of a disaster. In both
cases, these copies must be separate, standalone data copies and can't be part
of a deduplication process.
A Different Storage Calculation
The lack of an effective result from deduplication means
that storage costs for archive data need to be calculated very differently. The
only efficiency technique that will work is compression, and then only in some
cases. Without an effective efficiency technology, storage costs need to be
based on raw capacities, not the effective capacities that some vendors feature
in their brochures.
Basing storage costs on raw capacity is relatively easy, but
it is more than taking the cost of the storage media and dividing it by the
amount of capacity. The "system" cost needs to be calculated as well. For a
disk-based archive, this means the storage controller hardware and disk shelves
that surround the actual hard disk media; in some cases, empty disk shelves can
be more expensive than the drives that go in them. For a tape library, the cost
of the library and its drives needs to be factored into the price.
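To make that concrete, here is a minimal sketch of a raw-capacity cost model in Python. Every price, capacity, and quantity in it is a hypothetical placeholder rather than a vendor quote; the point is only that the controller, shelf, or library cost is added to the media cost before dividing by raw (not effective) capacity.

# Minimal sketch of a raw-capacity cost model. All prices and capacities
# below are hypothetical placeholders, not vendor quotes.

def cost_per_gb(system_cost, media_unit_cost, media_unit_capacity_gb, media_units):
    """Raw cost per GB, including the 'system' cost (controller and shelves,
    or tape library and drives), not just the media itself."""
    total_cost = system_cost + media_unit_cost * media_units
    raw_capacity_gb = media_unit_capacity_gb * media_units
    return total_cost / raw_capacity_gb

# Hypothetical disk archive: controller plus shelves, populated with 10 TB drives.
disk = cost_per_gb(system_cost=50_000, media_unit_cost=300,
                   media_unit_capacity_gb=10_000, media_units=60)

# Hypothetical tape archive: library plus drives, populated with 12 TB cartridges.
tape = cost_per_gb(system_cost=50_000, media_unit_cost=60,
                   media_unit_capacity_gb=12_000, media_units=60)

print(f"disk: ${disk:.4f}/GB  tape: ${tape:.4f}/GB")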
Comparing Disk to Tape
Without the aid of deduplication, the comparison of disk and
tape for long-term archiving is more interesting (and realistic). Both disk and
tape have large upfront costs associated with them: the cost of the storage
controller and of the empty library, respectively. A hard disk-based archive
adds the cost of additional shelves, and each piece of disk media has a fair
amount of technology on it. A tape archive needs to factor in the cost of
buying enough tape drives to sustain writes to, and restores from, the archive,
but the media has almost no technology associated with it and its cost per GB
is relatively low.
Assuming that the upfront costs for a disk-based or tape-based
archive are similar, the cost to scale the disk archive, thanks to more
expensive shelves and media, will outpace the cost to scale a tape-based
archive.
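A simple model of that scaling behavior, again with purely hypothetical prices and capacities, might look like the sketch below: disk capacity grows by adding drives and shelves, while tape capacity grows by adding cartridges against a fixed library and drive cost.

# Minimal sketch comparing how disk and tape archive costs scale.
# Every number here is a hypothetical placeholder for illustration only.

def disk_archive_cost(capacity_tb):
    controller = 50_000                          # upfront controller hardware
    shelf_cost, shelf_capacity_tb = 8_000, 240   # e.g. 24 x 10 TB drives per shelf
    drive_cost, drive_capacity_tb = 300, 10
    shelves = -(-capacity_tb // shelf_capacity_tb)   # ceiling division
    drives = -(-capacity_tb // drive_capacity_tb)
    return controller + shelves * shelf_cost + drives * drive_cost

def tape_archive_cost(capacity_tb):
    library_and_drives = 50_000                  # upfront library plus tape drives
    cartridge_cost, cartridge_capacity_tb = 60, 12
    cartridges = -(-capacity_tb // cartridge_capacity_tb)
    return library_and_drives + cartridges * cartridge_cost

for tb in (500, 2_000, 10_000):
    print(tb, "TB -> disk:", disk_archive_cost(tb), " tape:", tape_archive_cost(tb))

In this toy model the two start out close together, but as capacity grows the shelf and drive additions on the disk side dominate, which is the scaling gap the article describes.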
Conclusion
The lack of redundant data makes deduplication almost
useless when considering archive capacity costs. Raw capacity is more critical,
and as these archives scale, a disk-only approach becomes harder to
cost-justify. For optimal cost containment, a small disk front end that does
not need to scale should be leveraged with a large, scalable tape back end.
Article From: storageswiss.com