To identify duplicates, data can be examined in different
ways depending on the technology used. For example, file-level dedupe (also
called single-instance storage, or SIS) identifies identical files, stores a
single copy and replaces subsequent identical copies with a pointer to the
unique version stored. Examples of file-level deduplication include the email
programs Novell Inc.’s GroupWise and Microsoft Corp.’s Exchange (although SIS
isn’t supported in Exchange 2010). EMC Corp. also provides file-level
deduplication on its storage arrays, including Clariion, Celerra and its new
VNX series.
The disadvantage of file-level deduplication is its lack of
granularity: it cannot dedupe below the file level. That means even the
smallest change in a file makes it a totally new file that will be stored in
full. File-level dedupe is useful in email environments, where the same
attachment is often sent to multiple recipients at once, or in unstructured
data storage environments with low change rates. However, it is not practical
in a structured data environment where entire large files, such as databases,
are constantly changing.
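As a rough illustration of the file-level approach described above, the
following Python sketch keeps one copy per unique file content and records a
pointer for every file name. The class name SingleInstanceStore, the use of
SHA-256 as the fingerprint and the sample file names are assumptions made for
this example; they do not describe how any product named in this article is
implemented.

```python
import hashlib


class SingleInstanceStore:
    """Toy file-level dedupe: one stored copy per unique file content."""

    def __init__(self):
        self.blobs = {}     # fingerprint -> file content, stored once
        self.pointers = {}  # file name -> fingerprint of its content

    def put(self, name, content):
        """Record a file; return True only if new content had to be stored."""
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = content
        self.pointers[name] = digest
        return is_new

    def get(self, name):
        """Follow the pointer back to the single stored copy."""
        return self.blobs[self.pointers[name]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = SingleInstanceStore()
attachment = b"quarterly report " * 1000

first = store.put("alice/report.doc", attachment)
second = store.put("bob/report.doc", attachment)           # identical copy
third = store.put("carol/report.doc", attachment + b"!")   # one byte added

print(first, second, third, store.stored_bytes())
```

The second copy costs only a pointer, while the one-byte change in the third
copy forces the whole file to be stored again, which is the granularity
problem described above.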
Vendors address the lack of granularity inherent to
file-level dedupe by breaking data into smaller “chunks” such as fixed blocks
or variable-length segments. Greater data reduction can be achieved by storing
only unique data segments and creating pointers to all others that are
identical. CommVault Systems Inc., FalconStor Software Inc. and NetApp Inc.
are examples of vendors that leverage block-level dedupe, while EMC’s Data
Domain, Avamar and products from Sepaton Inc. are based on variable-length byte
segments. The knock against fixed-block dedupe is that inserting or deleting
data shifts the offset of every block that follows in the data set, so all of
those blocks must be stored as new because they no longer match earlier
copies. This situation is alleviated with variable-length segment dedupe, but
the technology is more complex and resource-intensive. Sub-file dedupe
(block-level or variable-length segments) is commonly used in backup
environments where multiple backup versions of files often contain few
changes.
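The offset problem can be seen in a small Python sketch. It splits the same
data two ways: into fixed 64-byte blocks, and into variable-length segments
that end wherever a checksum of the preceding 16 bytes hits a chosen value.
The function names, the window and block sizes and the use of CRC-32 are
assumptions chosen for illustration; real products use their own chunking
algorithms and typically a rolling hash for speed.

```python
import random
import zlib


def fixed_chunks(data, size=64):
    """Split data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, window=16, divisor=64):
    """Split data wherever a checksum of the previous `window` bytes
    is a multiple of `divisor`, so boundaries depend on content only."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        if zlib.crc32(data[i - window:i]) % divisor == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks


def reused(old_chunks, new_chunks):
    """Count chunks of the new version that were already stored."""
    known = set(old_chunks)
    return sum(1 for chunk in new_chunks if chunk in known)


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(20000))
shifted = b"X" + original  # a single byte inserted at the front

fixed_old, fixed_new = fixed_chunks(original), fixed_chunks(shifted)
var_old, var_new = variable_chunks(original), variable_chunks(shifted)

fixed_reused = reused(fixed_old, fixed_new)
var_reused = reused(var_old, var_new)

print("fixed blocks reused:", fixed_reused, "of", len(fixed_new))
print("variable segments reused:", var_reused, "of", len(var_new))
```

With fixed blocks, the one-byte insert shifts every block, so nothing matches
the earlier copy. With content-defined boundaries, only the first segment is
affected and the rest line up again, at the cost of computing a checksum at
every byte position.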
In-line vs. post-process dedupe (out-of-band)
Dedupe is referred to as being in-line (also called in-band)
when data is analyzed for duplicates while it is being written to the storage
media. In contrast, post-process (or out-of-band) dedupe takes place after the
data has been written to disk. The benefit of post-process dedupe is that it
does not affect write performance, but it requires enough disk space to
accommodate the entire data set until deduplication (and therefore reduction)
can take place during off-peak hours. On the other hand, in-line dedupe
provides the immediate space-saving benefits of data reduction but is more
resource-intensive, which can impact write performance. The decision is a
trade-off between immediate storage space savings and performance, but that
performance impact is less of a factor as the technology improves. In-line
dedupe products include offerings from FalconStor, EMC’s Data Domain, Sepaton
and IBM Corp.’s ProtecTier (formerly Diligent), while NetApp uses
post-process dedupe.
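The trade-off can be made concrete with a simplified Python model. It counts
two things for each approach: how many fingerprints must be computed while
data is being written, and how much disk space is needed at the peak. The
function names, the 4 KB block size and the sample backup stream are
assumptions for this sketch, and the model ignores index overhead and any
temporary space a real post-process run would need.

```python
import hashlib


def unique_bytes(blocks):
    """Bytes left once each distinct block is stored only once."""
    store = {}
    for block in blocks:
        store.setdefault(hashlib.sha256(block).digest(), block)
    return sum(len(b) for b in store.values())


def inline_dedupe(blocks):
    """Fingerprint every block before it reaches disk."""
    store = {}
    hashes_during_write = 0
    for block in blocks:
        digest = hashlib.sha256(block).digest()  # work on the write path
        hashes_during_write += 1
        store.setdefault(digest, block)
    size = sum(len(b) for b in store.values())
    return {"peak_bytes": size, "final_bytes": size,
            "hashes_during_write": hashes_during_write}


def post_process_dedupe(blocks):
    """Write everything first, then reduce it in a later pass."""
    landing_area = list(blocks)                   # no hashing while writing
    peak = sum(len(b) for b in landing_area)      # full data set on disk
    return {"peak_bytes": peak, "final_bytes": unique_bytes(landing_area),
            "hashes_during_write": 0}


backup = [b"A" * 4096, b"B" * 4096] * 50  # 100 blocks, only 2 distinct

inline = inline_dedupe(backup)
post = post_process_dedupe(backup)
print("in-line:     ", inline)
print("post-process:", post)
```

Both approaches end with the same amount of stored data; they differ in when
the fingerprinting work happens and in how much space is needed before the
reduction takes place.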
Source vs. target
Depending on the technology implemented, deduplication can
take place at the source (the sending system) or at the target level (the
receiving system). This distinction is specific to backup environments, which
are typically based on a client/server (or sender/receiver) model. Source
dedupe requires dedupe-capable software on the backup client and a
dedupe-aware backup server. This means some changes to an existing backup
environment will be required. On the other hand, target dedupe usually requires
no change since the deduplication-capable target device is seen as just another
disk storage array or virtual tape library (VTL) by the backup server. Source
dedupe is used in an effort to reduce the amount of data sent over the network
when remote offices are backing up to a central office. The trade-off is that
source dedupe impacts performance on the client side, thus extending the
duration of the backup, and dedupe is limited to duplicate data at the client
level only, regardless of how many backup clients share identical data.
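The following Python sketch models the trade-off as described above: with
source dedupe each client removes only its own duplicates before sending,
while with target dedupe every client sends everything and the target removes
duplicates across all of them. The function names, the client names and the
1 KB block size are assumptions for illustration, and the model follows this
article's description rather than the behavior of any specific product.

```python
import hashlib


def unique_blocks(blocks):
    """Keep the first copy of each distinct block."""
    seen = {}
    for block in blocks:
        seen.setdefault(hashlib.sha256(block).digest(), block)
    return list(seen.values())


def source_dedupe(clients):
    """Each client dedupes its own data before sending it."""
    sent = stored = 0
    for blocks in clients.values():
        size = sum(len(b) for b in unique_blocks(blocks))
        sent += size      # only unique blocks cross the network
        stored += size    # duplicates between clients are not detected
    return {"sent_bytes": sent, "stored_bytes": stored}


def target_dedupe(clients):
    """Clients send everything; the target dedupes across all clients."""
    everything = [b for blocks in clients.values() for b in blocks]
    sent = sum(len(b) for b in everything)
    stored = sum(len(b) for b in unique_blocks(everything))
    return {"sent_bytes": sent, "stored_bytes": stored}


shared_files = [bytes([i]) * 1024 for i in range(8)]  # 8 distinct 1 KB blocks
clients = {
    "office-a": shared_files * 3,  # each office holds three copies
    "office-b": shared_files * 3,
}

source = source_dedupe(clients)
target = target_dedupe(clients)
print("source:", source)
print("target:", target)
```

In this model, source dedupe sends a third of the data over the network but
stores the shared blocks once per office, while target dedupe sends everything
and stores each block only once.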
Appliance vs. software
Other possible considerations include picking between
appliance-based and software-based dedupe. Deduplication appliances are
typically integrated with an existing environment without requiring much
change. This is the case when configuring a backup server to write to a
dedupe-capable storage array (e.g., EMC Data Domain). On the other hand, dedupe
software usually requires changes to your environment, especially when
migrating from basic backup software to a product that is dedupe-capable.
Competing manufacturers claim that appliance-based dedupe
creates hardware vendor lock-in because of a dependency on proprietary storage
or appliances. However, software-based dedupe can also be considered vendor
lock-in, as the deduplication capability is dependent on the specific software
platform.
Vendors like IBM and NetApp offer gateway appliances
providing the ability to store deduplicated data to supported third-party
storage, but for all intents and purposes, hardware- and software-based dedupe
offerings are proprietary.
There are many benefits of deduplication, but choosing the
right dedupe approach requires careful consideration of your backup
environment.
About this author: Pierre Dorion is the data center practice
director and a senior consultant with Long View Systems Inc. in Phoenix, Ariz.,
specializing in the areas of business continuity and DR planning services and
corporate data protection.
Article From: searchdatabackup.techtarget.com