Data Deduplication explained

Data Deduplication

Data deduplication is a technique for reducing the amount of storage space an organization needs to save its data. In most organizations, the storage systems contain duplicate copies of many pieces of data. For example, the same file may be saved in several different places by different users, or two or more files that aren't identical may still include much of the same data. Deduplication eliminates these extra copies by saving just one copy of the data and replacing the other copies with pointers that lead back to the original copy. Companies frequently use deduplication in backup and disaster recovery applications, but it can be used to free up space in primary storage as well.

In its simplest form, deduplication takes place on the file level; that is, it eliminates duplicate copies of the same file. This kind of deduplication is sometimes called file-level deduplication or single instance storage (SIS). Deduplication can also take place on the block level, eliminating duplicated blocks of data that occur in non-identical files. Block-level deduplication frees up more space than SIS, and a particular type known as variable block or variable length deduplication has become very popular. Often the phrase "data deduplication" is used as a synonym for block-level or variable length deduplication.
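
To make the distinction concrete, here is a minimal Python sketch with invented file contents and a toy fixed-size block; real products typically use variable-length chunking and far more sophisticated metadata:

import hashlib

def file_level_dedup(files):
    # Single instance storage: keep one physical copy per identical file.
    store = {}
    for data in files.values():
        store.setdefault(hashlib.sha256(data).hexdigest(), data)
    return store

def block_level_dedup(files, block_size=4):
    # Block-level dedup: keep one copy per identical block, even across
    # files that are not identical as a whole.
    store = {}
    for data in files.values():
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            store.setdefault(hashlib.sha256(block).hexdigest(), block)
    return store

# Hypothetical file contents chosen so that b.txt and c.txt overlap with a.txt.
files = {
    "a.txt": b"AAAABBBBCCCC",
    "b.txt": b"AAAABBBBDDDD",   # shares two blocks with a.txt
    "c.txt": b"AAAAEEEECCCC",   # shares two blocks with a.txt
    "d.txt": b"AAAABBBBCCCC",   # exact copy of a.txt
}

print(len(file_level_dedup(files)))   # 3 unique files stored instead of 4
print(len(block_level_dedup(files)))  # 5 unique blocks stored instead of 12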



Deduplication Benefits

The primary benefit of data deduplication is that it reduces the amount of disk or tape that organizations need to buy, which in turn reduces costs. NetApp reports that in some cases, deduplication can reduce storage requirements up to 95 percent, but the type of data you're trying to deduplicate and the amount of file sharing your organization does will influence your own deduplication ratio. While deduplication can be applied to data stored on tape, the relatively high costs of disk storage make deduplication a very popular option for disk-based systems. Eliminating extra copies of data saves money not only on direct disk hardware costs, but also on related costs, like electricity, cooling, maintenance, floor space, etc.

Deduplication can also reduce the amount of network bandwidth required for backup processes, and in some cases, it can speed up the backup and recovery process.

Deduplication vs. Compression

Deduplication is sometimes confused with compression, another technique for reducing storage requirements. While deduplication eliminates redundant data, compression uses algorithms to save data more concisely. Some compression is lossless, meaning that no data is lost in the process, but "lossy" compression, which is frequently used with audio and video files, actually deletes some of the less-important data included in a file in order to save space. By contrast, deduplication only eliminates extra copies of data; none of the original data is lost. Also, compression doesn't get rid of duplicated data -- the storage system could still contain multiple copies of compressed files.

Deduplication often has a larger impact on backup file size than compression. In a typical enterprise backup situation, compression may reduce backup size by a ratio of 2:1 or 3:1, while deduplication can reduce backup size by up to 25:1, depending on how much duplicate data is in the systems. Often enterprises utilize deduplication and compression together in order to maximize their savings.
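
As a back-of-the-envelope illustration, using the ratios quoted above and treating them as independent (which real data rarely is), the two savings multiply:

# Hypothetical 10 TB backup set, using the ratios quoted above rather than measurements.
raw_tb = 10.0
compressed_tb = raw_tb / 3          # 3:1 compression alone -> about 3.3 TB
deduped_tb = raw_tb / 25            # 25:1 dedup alone (best case) -> 0.4 TB
combined_tb = raw_tb / (25 * 3)     # stacking both -> roughly 0.13 TB
print(compressed_tb, deduped_tb, combined_tb)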

Dedupe Implementation

The process for implementing data deduplication technology varies widely depending on the type of product and the vendor. For example, if deduplication technology is included in a backup appliance or storage solution, the implementation process will be much different than for standalone deduplication software.

In general, deduplication technology can be deployed in one of two basic ways: at the source or at the target. In source deduplication, data copies are eliminated in primary storage before the data is sent to the backup system. The advantage of source deduplication is that it reduces the bandwidth requirements and time necessary for backing up data. On the downside, source deduplication consumes more processor resources, and it can be difficult to integrate with existing systems and applications.
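
A minimal sketch of the source-side idea, assuming a hypothetical client that asks the backup target which fingerprints it already holds and ships only the missing chunks:

import hashlib

def source_side_backup(chunks, server_fingerprints, send):
    # Source deduplication sketch: fingerprint data before it leaves the
    # client and transmit only chunks the backup target has not seen.
    # server_fingerprints: set of fingerprints already held by the target
    # send: hypothetical callable that moves data over the network
    for chunk in chunks:
        fp = hashlib.sha256(chunk).hexdigest()
        if fp in server_fingerprints:
            send((fp, None))             # duplicate: only a reference crosses the wire
        else:
            send((fp, chunk))            # new data: transmit it and remember it
            server_fingerprints.add(fp)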

By contrast, target deduplication takes place within the backup system and is often much easier to deploy. Target deduplication comes in two types: in-line or post-process. In-line deduplication takes place before the backup copy is written to disk or tape. The benefit of in-line deduplication is that it requires less storage space than post-process deduplication, but it can slow down the backup process. Post-process deduplication takes place after the backup has been written, so it requires that organizations have a great deal of storage space available for the original backup. However, post-process deduplication is usually faster than in-line deduplication.

Deduplication Technology

Data deduplication is a highly proprietary technology. Deduplication methods vary widely from vendor to vendor, and many of those methods are patented. For example, Microsoft has a patent on single instance storage. In addition, Quantum owns a patent on variable length deduplication. Many other vendors also own patents related to deduplication technology.

Video: Deduplication for Dummies - What is deduplication?



How Does Data Deduplication Work?

Consider this scenario: Your organization is running a virtual desktop environment with hundreds of identical workstations all stored on an expensive storage array that was purchased specifically to support this initiative. So, you’re running hundreds of copies of Windows 8, Office 2013, your ERP software, and any other tools that your users might require. Each individual workstation image consumes, say, 25 GB of disk space. With just 200 such workstations, these images alone would consume 5 TB of capacity.
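
Rough numbers for this scenario, assuming purely for illustration that about 90 percent of every image is common operating system and application data:

images = 200
image_gb = 25
shared_fraction = 0.9    # assumption: ~90% of every image is common OS and application bits

raw_gb = images * image_gb                               # 5,000 GB without deduplication
shared_gb = image_gb * shared_fraction                   # common data stored once for everyone
unique_gb = images * image_gb * (1 - shared_fraction)    # per-desktop differences
print(raw_gb, shared_gb + unique_gb)                     # 5000 GB vs. 522.5 GB under these assumptions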

With deduplication, you can store just one copy of these virtual machines and allow the storage array to simply place pointers to the rest. Each time the deduplication engine comes across a piece of data that is already stored somewhere in the environment, rather than write that full copy of data all over again, the system saves a small pointer in the data copy's place, thus freeing up the blocks that would otherwise have been occupied. In the figure below, the graphic on the left shows what happens without deduplication, and the graphic on the right shows deduplication in action. In this example, there are four copies of the blue block and two copies of the green block stored on the array. Deduplication enables just one copy of each unique block to be written, freeing up the other four blocks.

Now, expand this example to a real-world environment. Imagine the deduplication possibilities in a VDI scenario: with hundreds of identical or nearly identical desktop images, deduplication has the potential to significantly reduce the capacity needed to store all of those virtual machines.

Deduplication works by creating a data fingerprint for each object that is written to the storage array. As new data is written to the array, if there are matching fingerprints, additional data copies beyond the first are saved as tiny pointers. If a completely new data item is written – one that the array has not seen before – the full copy of the data is stored.
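
In the simplest terms, the fingerprint is a hash of the data. The sketch below uses SHA-256, though the actual hash function and collision handling vary by vendor:

import hashlib

def fingerprint(block: bytes) -> str:
    # The fingerprint is a hash of the block's contents; two blocks with
    # the same fingerprint are treated as the same piece of data.
    return hashlib.sha256(block).hexdigest()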

As you might expect, different vendors handle deduplication in different ways. In fact, there are two primary deduplication techniques that deserve discussion: Inline deduplication and post-process deduplication.

Inline deduplication

Inline deduplication takes place at the moment data is written to the storage device. While the data is in transit, the deduplication engine fingerprints it on the fly. As you might expect, this process creates some overhead. The system has to constantly fingerprint incoming data and quickly determine whether that new fingerprint already matches something in the system. If it does, a pointer to the existing data is written; if it does not, the block is saved as-is. This demands processors that can keep up with what might be a tremendous workload, and it can introduce latency into the storage I/O stream.
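
A minimal sketch of that write path, using an in-memory stand-in for the array's fingerprint index and block store (real controllers do this in firmware with far more care):

import hashlib

class InlineDedupStore:
    # Toy model of an array doing inline dedup: fingerprint on write,
    # store each unique block once, record pointers for everything else.
    def __init__(self):
        self.blocks = {}    # fingerprint -> the single physical copy
        self.volume = []    # logical layout: one pointer (fingerprint) per write

    def write(self, block: bytes):
        fp = hashlib.sha256(block).hexdigest()
        if fp not in self.blocks:
            self.blocks[fp] = block    # new data: pay the full storage cost
        self.volume.append(fp)         # duplicates cost only a pointer

    def read(self, index: int) -> bytes:
        return self.blocks[self.volume[index]]

store = InlineDedupStore()
for block in (b"blue", b"green", b"blue", b"blue", b"green", b"blue"):
    store.write(block)
print(len(store.volume), len(store.blocks))   # 6 logical writes, 2 physical blocks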

A few years ago, these might have been showstoppers, since some storage controllers might not have been able to keep up with the workload. Today, though, processors have moved far beyond what they were just a few years ago, and these kinds of workloads no longer carry the performance penalty they once did. In fact, inline deduplication is a cornerstone feature of most new storage devices released in the past few years; while it may introduce some overhead, the impact is rarely noticeable and the benefits far outweigh the costs.

Post-process deduplication

As mentioned, inline deduplication introduces potential processing overhead and latency. The problem is that the deduplication engine has to run constantly, which means the system needs to be sized with that constant work in mind. Making matters worse, it can be difficult to predict exactly how much processing power will be needed to meet the deduplication goal, so it's not always possible to plan the overhead requirements perfectly.

This is where post-process deduplication comes into play. Whereas inline deduplication processes deduplication entries as the data flows through the storage controllers, post-process deduplication happens on a regular schedule – perhaps overnight. With post-process deduplication, all data is written in its full form – copies and all – and on that regular schedule, the system then goes through and fingerprints all new data and removes multiple copies, replacing them with pointers to the original copy of the data.
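
A sketch of that scheduled pass, reusing the fingerprint idea from above: the data has already been written in full, and the job folds duplicates into pointers after the fact.

import hashlib

def post_process_pass(stored_blocks):
    # Scheduled dedupe job: the data already sits on disk fully hydrated.
    # Fingerprint everything, keep one copy per fingerprint, and count the
    # space reclaimed from duplicates that are replaced with pointers.
    kept = {}
    layout = []
    reclaimed = 0
    for block in stored_blocks:
        fp = hashlib.sha256(block).hexdigest()
        if fp in kept:
            reclaimed += len(block)    # duplicate: free its space, keep a pointer
        else:
            kept[fp] = block
        layout.append(fp)
    return kept, layout, reclaimed

# Overnight run over the day's fully hydrated writes (hypothetical data).
kept, layout, reclaimed = post_process_pass(
    [b"blue", b"green", b"blue", b"blue", b"green", b"blue"])
print(len(kept), reclaimed)    # 2 blocks retained, 17 bytes reclaimed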

Post-process deduplication enables organizations to utilize this data reduction service without having to worry about the constant processing overhead involved with inline deduplication. This process enables organizations to schedule dedupe to take place during off hours.

The biggest downside to post-process deduplication is the fact that all data is stored fully hydrated – a technical term that means that the data has not been deduplicated – and, as such, requires all of the space that non-deduplicated data needs. It’s only after the scheduled process that the data is shrunk. For those using post-process dedupe, bear in mind that, at least temporarily, you’ll need to plan on having extra capacity.
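
In planning terms, that means sizing for the pre-dedupe peak rather than the post-dedupe steady state. A rough illustration with made-up numbers:

nightly_backup_tb = 4.0
assumed_dedup_ratio = 10                                      # assumption: 10:1 reduction after the pass
steady_state_tb = nightly_backup_tb / assumed_dedup_ratio     # 0.4 TB once deduplicated
peak_tb = steady_state_tb + nightly_backup_tb                 # landing zone must also hold the full copy
print(peak_tb)    # ~4.4 TB of capacity needed even though only ~0.4 TB persists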

Summary

Deduplication can have significant benefits when it comes to reducing overall storage costs, but it’s important to understand the two major types of deduplication so that appropriate upfront planning can take place.

