Although deduplication is certainly not new, it seems to have become a much hotter trend recently. That being the case, I decided to write this blog post as a sort of crash course in data deduplication for those who might not be familiar with the technology.
1: Deduplication is used for a variety of purposes
Deduplication is used in any number of different products.
Compression utilities such as WinZip perform a form of deduplication, but so do many WAN optimization solutions. Most backup products on the market today also support deduplication.
2: Higher ratios produce diminishing returns
The effectiveness of data deduplication is measured as a ratio. Although higher ratios do indicate a higher degree of deduplication, they can be misleading. It is impossible to deduplicate a file in a way that shrinks it by 100%, so ever-higher ratios deliver diminishing returns: each step up saves less additional space than the one before it.
To show you what I mean, consider what happens when you deduplicate 1 TB of data. A 20:1 ratio reduces the data from 1 TB to 51.2 GB, while a 25:1 ratio reduces it to 40.96 GB. Going from 20:1 to 25:1 shrinks the data by only about 10 GB more, an extra savings of roughly 1% of the original 1 TB.
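If you want to check the math yourself, here is a quick back-of-the-envelope calculation in Python (a throwaway sketch, assuming 1 TB = 1,024 GB):

# Space left after deduplicating 1 TB (1,024 GB) at various ratios
original_gb = 1024
for ratio in (10, 20, 25, 50):
    remaining_gb = original_gb / ratio
    savings_pct = 100 * (1 - remaining_gb / original_gb)
    print(f"{ratio}:1 -> {remaining_gb:.2f} GB remaining ({savings_pct:.1f}% saved)")
# 10:1 -> 102.40 GB remaining (90.0% saved)
# 20:1 -> 51.20 GB remaining (95.0% saved)
# 25:1 -> 40.96 GB remaining (96.0% saved)
# 50:1 -> 20.48 GB remaining (98.0% saved)

Notice that going from 10:1 to 20:1 frees roughly another 51 GB, while going from 20:1 to 25:1 frees only about 10 GB more.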
3: Deduplication can be CPU intensive
Many deduplication algorithms work by hashing chunks of data
and then comparing the hashes for duplicates. This hashing process is CPU
intensive. This isn't usually a big deal if the deduplication process is
offloaded to an appliance or if it occurs on a backup target, but when source
deduplication takes place on a production server, the process can sometimes
affect the server's performance.
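To make that concrete, here is a minimal sketch of the hash-and-compare approach (deliberately simplified Python, not any particular vendor's algorithm): the data is split into fixed-size chunks, each chunk is hashed, and only chunks with previously unseen hashes are stored.

import hashlib

CHUNK_SIZE = 4096   # fixed-size chunking; many real products use variable-size chunks
store = {}          # hash -> chunk data (stands in for the deduplicated store)

def dedupe_file(path):
    """Return the list of chunk hashes (the "recipe") needed to rebuild the file."""
    recipe = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(chunk).hexdigest()  # hashing every chunk is what eats the CPU
            if digest not in store:
                store[digest] = chunk                   # new chunk: keep one copy
            recipe.append(digest)                       # duplicate chunk: store only a reference
    return recipe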
4: Post-process deduplication does not initially save any storage space
Post-process deduplication often (but not always) occurs on a secondary storage target, such as a disk that is used in disk-to-disk backups. In this type of architecture, the data is first written to the target storage in its raw, undeduplicated form, and a scheduled process performs the deduplication later on. Because of the way this process works, there are initially no space savings on the target volume. Depending on the software being used, the target volume might even temporarily require more space than the raw data consumes on its own, because of the workspace the deduplication process requires.
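In pseudocode terms, the difference from inline deduplication is simply when the hashing pass runs. A rough sketch (my own simplification, reusing the hypothetical dedupe_file() helper from the previous sketch) might look like this:

import os

def nightly_dedupe_pass(target_dir):
    """Post-process pass: the backup data already sits on disk in full,
    so no space is saved until this scheduled job has run."""
    recipes = {}
    for name in os.listdir(target_dir):
        path = os.path.join(target_dir, name)
        if os.path.isfile(path):
            # A real product would now rewrite the file as a list of chunk
            # references and reclaim the raw copy; until that happens, the
            # volume briefly holds both, which is why extra workspace can be needed.
            recipes[name] = dedupe_file(path)
    return recipes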
5: Hash collisions are a rare possibility
When I talked about the CPU-intensive nature of the
deduplication process, I explained how chunks of data are hashed and how the
hashes are compared to determine which chunks can be deduplicated.
Occasionally, two dissimilar chunks of data can result in identical hashes.
This is known as a hash collision.
The odds of a hash collision are astronomically small, but they vary depending on the strength of the hashing algorithm. Because hashing is CPU intensive, some products initially use a weak hashing algorithm to identify potentially duplicate data. This data is then rehashed using a much stronger algorithm to verify that it really is a duplicate.
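Conceptually, that two-stage check looks something like the following sketch (again my own simplification, not a specific product's code): a cheap checksum flags possible duplicates, and only those candidates pay for the expensive hash.

import hashlib
import zlib

seen = {}   # weak hash -> set of strong hashes for chunks already stored

def is_duplicate(chunk):
    weak = zlib.crc32(chunk)                  # cheap first pass, but collision-prone
    if weak not in seen:
        seen[weak] = {hashlib.sha256(chunk).digest()}
        return False                          # nothing to collide with: definitely new data
    strong = hashlib.sha256(chunk).digest()   # only now pay for the expensive hash
    if strong in seen[weak]:
        return True                           # verified duplicate
    seen[weak].add(strong)                    # the weak hashes collided, but the data differs
    return False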
6: Media files don't deduplicate very well
Deduplication products can't deduplicate unique data. This
means that certain types of files don't deduplicate well because much of the
redundancy has already been removed from the file. Media files are a prime
example. File formats such as MP3, MP4, and JPEG are already-compressed media formats and therefore tend not to deduplicate well.
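You can see the effect with a quick experiment (a Python sketch; I'm using random bytes as a stand-in for compressed media, since an MP3 or JPEG looks essentially random to a chunk-level deduplicator):

import hashlib, os

def unique_chunk_ratio(data, chunk_size=4096):
    """Fraction of fixed-size chunks that are unique; lower means better dedup potential."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return len({hashlib.sha256(c).digest() for c in chunks}) / len(chunks)

redundant = (b"quarterly report boilerplate " * 146)[:4096] * 256   # 1 MB of repeated 4 KB chunks
media_like = os.urandom(len(redundant))                             # 1 MB with no redundancy at all

print(unique_chunk_ratio(redundant))    # 0.00390625 -> only 1 unique chunk out of 256
print(unique_chunk_ratio(media_like))   # 1.0 -> every chunk is unique; nothing to deduplicate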
7: Windows Server 8 will offer native file system
deduplication
One of the new features Microsoft is including in Windows
Server 8 is file system level deduplication. This feature should increase the
amount of data that can be stored on an NTFS volume of a given size. Although
Windows Server 8 will offer source deduplication, the deduplication mechanism itself uses post-process deduplication.
8: Windows Server 8 file system deduplication will be a
good complement to Hyper-V
Windows Server 8's file system level deduplication will be a
great complement to Hyper-V. Host servers by their very nature tend to contain
a lot of duplicate data. For example, if a host server is running 10 virtual
machines and each one is running the same Windows operating system, the host
contains 10 copies of each operating system file.
Microsoft is designing Windows Server 8's deduplication
feature to work with Hyper-V. This will allow administrators to eliminate
redundancy across virtual machines.
9: File system deduplication can make the use of solid
state drives more practical
One of the benefits of performing deduplication across
virtual machines on a host server is that doing so reduces the amount of
physical disk space consumed by virtual machines. For some organizations, this
might make the use of solid state storage more practical for use with
virtualization hosts. Solid state drives have a much smaller capacity than
traditional hard drives, but they deliver better performance because there are no
moving parts.
10: Windows Server 8 guards against file system
corruption through a copy-on-write algorithm
While it's great that Windows Server 8 will offer volume
level deduplication, some have expressed concern about what will happen if a
file is modified after it has been deduplicated. Thankfully, modifying a
deduplicated file in Windows Server 8 will not cause corruption within the
modified file or within other files that may also contain the deduplicated data
chunks. Deduplicated files can be safely modified because Microsoft is using a
copy-on-write algorithm that will make a copy of the data prior to performing
the modification. That way, data chunks that may be shared by other
deduplicated files are not modified.
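In general terms (a simplified sketch of copy-on-write as a technique; Microsoft's actual implementation is certainly more involved), the rule is that a shared chunk is never updated in place; the writer's file simply gets pointed at a new chunk:

import hashlib

chunk_store = {}   # hash -> chunk data, shared by every deduplicated file
files = {}         # file name -> list of chunk hashes (the file's "recipe")

def modify_chunk(name, index, new_data):
    """Change one chunk of a file without disturbing other files that share it."""
    digest = hashlib.sha256(new_data).hexdigest()
    chunk_store.setdefault(digest, new_data)   # write the modified data as a new chunk
    files[name][index] = digest                # only this file's recipe points at the new chunk;
                                               # the original chunk stays intact for everyone else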
Article From: techrepublic.com