To identify
duplicates, data can be examined in different ways depending on the technology
used. For example, file-level dedupe, also called single-instance
storage (SIS), will identify identical files, store a single copy and replace
subsequent identical copies with a pointer to the unique version stored.
Examples of file-level deduplication include Novell Inc.’s GroupWise and Microsoft
Corp.’s Exchange (although SIS isn’t supported in Exchange 2010) email
programs. EMC Corp. also provides file-level deduplication on its storage
arrays, including Clariion, Celerra and its new VNX series.
The disadvantage of
file-level deduplication is its lack of granularity and inability to provide
sub-file level dedupe. That means even the smallest change in a file makes it a
totally new file that will be stored. File-level dedupe is useful in email
environments where the same attachment is often sent to multiple recipients at
once or in unstructured data storage environments with low change rates.
However, it is not practical in a structured data environment where entire large
files such as databases are constantly changing.
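The file-level behavior described above can be sketched in a few lines of Python. This is a minimal illustration rather than any vendor's implementation: the class name `SingleInstanceStore`, the file names and the choice of SHA-256 as the content fingerprint are all assumptions made for the example.

```python
import hashlib


class SingleInstanceStore:
    """Toy file-level dedupe: one stored copy per unique file content."""

    def __init__(self):
        self.blobs = {}     # content digest -> file bytes (stored once)
        self.pointers = {}  # file name -> content digest

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        # Keep the bytes only if this exact content has not been seen before.
        self.blobs.setdefault(digest, data)
        # Every file name, duplicate or not, gets a pointer to the stored copy.
        self.pointers[name] = digest

    def get(self, name):
        return self.blobs[self.pointers[name]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = SingleInstanceStore()
attachment = b"quarterly report " * 1000
store.put("alice/report.doc", attachment)
store.put("bob/report.doc", attachment)           # duplicate: pointer only
store.put("carol/report.doc", attachment + b"!")  # one-byte change: full new copy
```

The third `put` shows the granularity problem: a single added byte produces a different digest, so the whole file is stored again.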
Vendors address the
lack of granularity inherent to file-level dedupe by breaking data into
smaller “chunks” such as fixed blocks or variable-length segments. Greater data
reduction can be achieved by storing only unique data segments and creating pointers
to all others that are identical. CommVault Systems Inc., FalconStor Software
Inc. and NetApp Inc. are examples of vendors that leverage block-level
dedupe, while EMC’s Data Domain and Avamar, along with products from Sepaton Inc., are
based on variable-length byte segments. The knock against fixed-block dedupe is
that inserting or removing data shifts the offset of every block that follows in a
given data set, so those blocks no longer match their stored copies and must all
be stored as new. This situation is
alleviated with variable-length segment dedupe, but the technology is more
complex and resource-intensive. Sub-file dedupe (block-level or variable-length
segments) is commonly used in backup environments where multiple backup
versions of files often contain few changes.
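The offset problem is easier to see with a small experiment. The Python sketch below compares fixed blocks against content-defined (variable-length) chunks when a single byte is inserted at the front of a data set. The function names, the 64-byte block size, the 16-byte window and the simple rolling byte sum are all illustrative assumptions; the rolling sum is a toy stand-in for the stronger rolling fingerprints a production system would need.

```python
import random


def fixed_chunks(data, size=64):
    """Cut at fixed offsets: boundaries depend only on position."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, window=16, mask=0x3F):
    """Cut wherever the sum of the last `window` bytes matches a pattern."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # slide the window forward
        # Boundaries depend on content, not on absolute offsets. On random
        # data this mask yields chunks of roughly 64 bytes on average.
        if i >= window - 1 and (rolling & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(8192))
shifted = b"X" + original  # one byte inserted at the front


def new_chunks(chunker):
    """Count chunks of the shifted data not already stored from the original."""
    known = set(chunker(original))
    return sum(1 for chunk in chunker(shifted) if chunk not in known)


fixed_new = new_chunks(fixed_chunks)        # every block has moved
variable_new = new_chunks(variable_chunks)  # only the chunk(s) near the insert
```

With fixed blocks, every block of the shifted data is new. With content-defined chunks, the boundaries move with the data, so only the chunks touching the inserted byte need to be stored again. The price is the per-byte hashing work in `variable_chunks`, which reflects the extra complexity and resource cost noted above.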
In-line vs. post-process dedupe (out-of-band)
Dedupe is referred
to as being in-line (also called in-band) when data is analyzed for
duplicates while it is being written to the storage media. In contrast,
post-process (or out-of-band) dedupe takes place after the data has been
written to disk. The benefit of post-process dedupe is that it does not affect
write performance, but it requires enough disk space to accommodate the entire
data set until deduplication (and therefore reduction) can take place during
off-peak hours. On the other hand, in-line dedupe provides the immediate space
saving benefits of data reduction but is more resource-intensive, which can
impact write performance. The decision is a trade-off between immediate storage
space savings and performance, but that performance impact is less of a factor
as the technology improves. In-line dedupe products include offerings from
FalconStor, EMC’s Data Domain, Sepaton and IBM Corp.’s ProtecTier
(formerly Diligent), while NetApp uses post-process dedupe.
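The capacity side of this trade-off can be shown with a short Python sketch. It is a simplified model, not a description of any product: the function names, the 4 KB blocks and the use of SHA-256 are assumptions, and "peak" here simply means the most data held on disk at any one time.

```python
import hashlib


def digest(block):
    return hashlib.sha256(block).hexdigest()


def inline_write(blocks):
    """Dedupe in the write path: each block is hashed before it lands."""
    store, peak = {}, 0
    for block in blocks:
        store.setdefault(digest(block), block)  # hashing cost paid per write
        peak = max(peak, sum(len(b) for b in store.values()))
    return store, peak


def post_process_write(blocks):
    """Land everything first, then dedupe in a later (off-peak) pass."""
    staging = list(blocks)               # fast write path, no hashing
    peak = sum(len(b) for b in staging)  # the full data set sits on disk
    store = {}
    for block in staging:                # the deferred dedupe pass
        store.setdefault(digest(block), block)
    return store, peak


backup = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]
inline_store, inline_peak = inline_write(backup)
post_store, post_peak = post_process_write(backup)
```

Both approaches end with the same two unique blocks, but the post-process path needs room for all four blocks before its reduction pass runs, while the in-line path never holds more than the unique data and pays for that with hashing work on every write.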
Source vs. target
Depending on the
technology implemented, deduplication can take place at the source (the sending
system) or at the target level (the receiving system). This distinction is
specific to backup environments, which are typically based on a client/server (or
sender/receiver) model. Source dedupe requires dedupe-capable software on the backup
client and a backup server that is dedupe-aware.
This means some changes to an existing backup environment will be required. On
the other hand, target dedupe usually requires no change since the
deduplication-capable target device is seen as just another disk storage array
or virtual tape library (VTL) by the backup server. Source dedupe is used in an
effort to reduce the amount of data sent over the network when remote offices
are backing up to a central office. The trade-off is that source dedupe impacts
performance on the client side, thus extending the duration of the backup, and
dedupe is limited to duplicate data at the client level only, regardless of how
many backup clients share identical data.
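A rough Python model of the two placements makes the network and storage effects concrete. It deliberately follows the scoping described above, in which a source-side client only sees its own duplicates. The function names, office names and block contents are hypothetical, and a real backup product would handle indexing quite differently.

```python
import hashlib


def digest(block):
    return hashlib.sha256(block).hexdigest()


def source_dedupe(clients):
    """Each client filters duplicates against its own index before sending."""
    sent = stored = 0
    for blocks in clients.values():
        local_index = set()      # per client: no view of other clients' data
        for block in blocks:
            key = digest(block)  # hashing work happens on the client
            if key not in local_index:
                local_index.add(key)
                sent += len(block)
                stored += len(block)
    return sent, stored


def target_dedupe(clients):
    """Clients send everything; the target keeps one copy across all clients."""
    sent, index = 0, {}
    for blocks in clients.values():
        for block in blocks:
            sent += len(block)   # the full data set crosses the network
            index.setdefault(digest(block), len(block))
    return sent, sum(index.values())


os_image = b"O" * 4096
clients = {
    "office-a": [os_image, os_image, b"a" * 4096],
    "office-b": [os_image, os_image, b"b" * 4096],
}
source_sent, source_stored = source_dedupe(clients)
target_sent, target_stored = target_dedupe(clients)
```

In this model the source approach sends less data over the network, because each office skips its own repeated block, but it stores the shared `os_image` once per office. The target approach receives everything and then keeps a single copy of the shared block.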
Appliance vs. software
Other possible considerations include picking between appliance-based and software-based dedupe. Deduplication appliances are typically integrated with an existing environment without requiring much change. This is the case when configuring a backup server to write to a dedupe-capable storage array (e.g., EMC Data Domain). On the other hand, dedupe software usually requires changes to your environment, especially when migrating from basic backup software to software that is dedupe-capable.
Competing
manufacturers claim that appliance-based dedupe creates hardware vendor lock-in
because of a dependency on proprietary storage or appliances. However,
software-based dedupe can also be considered vendor lock-in, as the
deduplication capability is dependent on the specific software platform.
Vendors like IBM
and NetApp offer gateway appliances providing the ability to store deduplicated
data to supported third-party storage, but for all intents and purposes,
hardware- and software-based dedupe offerings are proprietary.
There are many
benefits of deduplication, but choosing the right dedupe approach requires
careful consideration of your backup environment.
About this author:
Pierre Dorion is the data center practice director and a senior consultant with
Long View Systems Inc. in Phoenix, Ariz., specializing in the areas of business
continuity and DR planning services and corporate data protection.
Article From: searchdatabackup.techtarget.com

