HK1173515B - Extensible pipeline for data deduplication - Google Patents
- Publication number: HK1173515B
- Application number: HK13100327.9A
- Authority: HK (Hong Kong)
- Prior art keywords: file, deduplication, files, blocks, data
Description
Technical Field
The invention relates to an extensible pipeline for data deduplication.
Background
Data deduplication (also sometimes referred to as data optimization) refers to detecting, uniquely identifying, and eliminating redundant data in a storage system to reduce the amount of physical bytes of data that need to be stored on disk or transmitted over a network, without compromising the fidelity and integrity of the original data. Data deduplication thus results in savings in hardware and power costs (for storage) as well as data management costs (e.g., lower backup costs) by reducing the resources required to store and/or transfer data. These cost savings become important as the amount of digitally stored data grows.
Data deduplication typically uses a combination of techniques to eliminate redundancy within and between persistently stored files. One such technique is to identify identical regions of data in one or more files and physically store only one unique region (chunk), while maintaining a reference to that chunk in association with each file in which the data recurs. Another technique is to combine data deduplication with compression, for example, by storing compressed chunks.
There are many difficulties, tradeoffs, and choices involved in data deduplication. For example, in some circumstances, given the available time and resources, there is too much data to deduplicate in a single operation, and thus consideration needs to be given to which data to deduplicate first and how to stage progressive deduplication over time. Furthermore, not all data that can be deduplicated yields equal savings (benefits) from deduplication, and thus there is a potential for doing a lot of work to gain little value. Other aspects of data deduplication, including file selection, data security issues, different types of chunking, different types of compression, and so forth, also need to be handled in order to accomplish data deduplication in a manner that provides the desired results.
Disclosure of Invention
This summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein relate to techniques by which a modular data deduplication pipeline performs data deduplication, the pipeline comprising several stages/modules that operate in conjunction. At each stage, the pipeline allows modules to be replaced, selected among, or extended (new modules added to the stage). The pipeline facilitates secure data processing, asynchronous processing, batch processing, and parallel processing. The pipeline is tunable based on internal and external feedback, e.g., by selecting modules to improve deduplication quality, performance, and/or throughput (where internal feedback refers to tuning based on data or file attributes discovered by the pipeline, and external feedback refers to tuning based on external information passed to the pipeline, e.g., statistics of previously deduplicated data on many machines).
In one implementation, the pipeline includes a scanning stage/module that identifies a list of files available for deduplication, and a selection stage/module that selects files that fall within the scope and policy of deduplication. The pipeline further includes a chunking stage/module that performs chunking, and a hashing module that generates a global hash that uniquely identifies each chunk. A compression stage/module may also be included that compresses the chunks (which may occur before or after hashing). A commit stage/module commits the chunks and reference data.
In one aspect, the scanning stage includes a crawler (groveler) that selects files to deduplicate via the pipeline. The crawler may access a policy to determine which files to select for deduplication. For example, this stage/module examines the namespace of the stored files and generates a stream of candidate files to be deduplicated (using one or more of various criteria). Such criteria may include maximizing savings from deduplication, minimizing the impact of deduplication on system performance, and so on. The crawler may operate on a snapshot of the file system, processing the snapshot into a log of selected files for further deduplication processing. A selection stage coupled or linked to the scanning stage may access policies to perform filtering, ranking, sorting, and grouping of the files (e.g., based on attributes and/or statistical attributes of the files) before providing the files for further deduplication processing via the pipeline.
The data deduplication pipeline includes a chunking stage that partitions the data of a file into chunks via one or more modules (chunking algorithms). In one aspect, a chunking algorithm selector selects the chunking algorithm to use from among the available chunking algorithms, e.g., based on file data and/or file metadata.
A deduplication detection stage determines, for each chunk, whether the chunk is already stored in the chunk store. A compression module may be used that compresses chunks before they are committed. More specifically, chunk compression is optional, and a compression algorithm selected from available compression algorithms based on chunk data, chunk metadata, file data, and/or file metadata may be used.
The commit stage commits chunks not detected as already stored in the chunk store to the chunk store, and commits reference data for chunks that are already stored in the chunk store. The chunking, compression, and/or committing may be performed on different subsets of the files, for example, asynchronously and/or in parallel on different (virtual and/or physical) machines. In general, the pipeline model enables asynchronous processing of the data, generally yielding performance and scalability advantages.
In one aspect, files to be chunked may be queued for batch processing, and the chunks resulting from the chunking process may similarly be queued for batch processing by one or more subsequent modules of the pipeline.
In one aspect, the pipeline is securely coupled to a hosting process configured to host a hosted module, such as a chunking module (or modules), or any other module or modules that access file data or chunked file data. The hosting process includes a data access component that securely accesses data for processing by the hosted modules. For example, the hosting process may obtain a duplicated file handle from the originating process and use the duplicated file handle to access the data.
Other advantages of the present invention will become apparent from the following detailed description when read in conjunction with the accompanying drawings.
Drawings
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or like elements, and in which:
FIG. 1 is a block diagram representing example components/stages of an extensible pipeline for data deduplication.
FIG. 2 is a block diagram and dataflow diagram representing components/stages of an extensible pipeline for data deduplication, along with additional details of example support components.
FIG. 3 is a block diagram of a data deduplication service, together with other components, representing how an extensible pipeline may be incorporated into a data deduplication environment.
FIG. 4 is a representation of data flow/invocation between various components, including data flow/invocation queued to support batch processing of files and data during data deduplication operations.
FIG. 5 is a representation of data flow/calls between various components, including data flow/calls used to obtain duplicate file handles during data deduplication operations to provide secure data handling.
FIG. 6 is a block diagram representing how crawlers (grovelers) scan files via a two-phase log-based file scan in a data deduplication environment.
FIG. 7 is a block diagram representing exemplary non-limiting networked environments in which various embodiments described herein can be implemented.
FIG. 8 is a block diagram representing an exemplary non-limiting computing system or operating environment in which one or more aspects of various embodiments described herein can be implemented.
Detailed Description
Various aspects of the technology described herein are generally directed to an extensible pipeline for data deduplication, in which the various modules/stages of the pipeline facilitate data deduplication, including by providing for module replacement, module selection, safe and efficient module hosting, asynchronous processing, and/or parallel processing. Typically, the various mechanisms used for deduplication (e.g., file selection, chunking, deduplication detection, compression, and commit of chunks) are individually modular in the pipeline, with the ability to replace, select among, and/or extend each of the individual modules.
In one aspect, the pipeline scans files using a two-phase, log-based algorithm and selects files for optimization based on attributes, by sorting, ranking, and/or grouping based on statistical analysis and feedback. The selected files may be processed asynchronously, in batches, and/or in parallel for data deduplication. Furthermore, the stages of the pipeline provide hooks for internal and external feedback.
It should be understood that any of the examples herein are non-limiting. Thus, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities, or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities, or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and data deduplication processing in general.
FIGS. 1 and 2 illustrate example concepts of a data deduplication pipeline 102, which comprises software components (or a collection of components) that perform the process of deduplicating files 104. Note that although files are used as an example herein, a deduplication target may comprise any collection of "data streams" in an unstructured, semi-structured, or structured data storage environment, e.g., files, digital documents, streams, blobs, tables, databases, and so forth; the pipeline architecture is designed to be generic and reusable across a wide variety of data stores. As described herein, the pipeline 102 includes several stages corresponding to extensible and/or selectable modules. Further, modules (other than for deduplication detection) may operate in parallel, e.g., to facilitate load balancing. The pipeline architecture also provides isolation to facilitate crash resistance, security, and resource management. These and other advantages are described below.
In general, deduplication splits each file (or other data blob) into a contiguous sequence of small data regions (called chunks), uniquely identifies each chunk using a hash, and then looks up (via a hash index) whether a duplicate chunk was previously inserted into the system. When a duplicate chunk is detected, the particular region of the file corresponding to that chunk is updated with a reference to the existing chunk, and the duplicate chunk data is discarded from the file. If no duplicate is detected, the chunk is saved to a chunk store (or other suitable location) in one implementation and indexed, and the file is updated to reference the new chunk, which may subsequently be referenced by other files. The pipeline may also perform compression of the chunks.
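To make this cycle concrete, the following Python sketch (a minimal illustration; `ChunkStore`, `deduplicate`, and the pluggable `chunker` callable are hypothetical names, not the interfaces of the claimed system) shows chunking, hashing, duplicate lookup, and reference tracking in one loop:

```python
import hashlib

class ChunkStore:
    """Minimal in-memory stand-in for the chunk store plus hash index."""
    def __init__(self):
        self.chunks = {}     # strong hash -> chunk bytes
        self.refcount = {}   # strong hash -> number of references

    def insert_or_reference(self, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.chunks:
            self.refcount[digest] += 1    # duplicate: add a reference only
        else:
            self.chunks[digest] = data    # new chunk: persist and index it
            self.refcount[digest] = 1
        return digest

def deduplicate(path, store, chunker):
    """Replace file data with a list of chunk references (a 'stream map')."""
    stream_map, offset = [], 0
    with open(path, "rb") as f:
        for chunk in chunker(f.read()):
            digest = store.insert_or_reference(chunk)
            stream_map.append((offset, len(chunk), digest))
            offset += len(chunk)
    return stream_map   # persisted via a reparse point in the real system
```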
To track chunks, each file contains references to its chunks rather than the file data itself; the references, along with each chunk's location in the current file, are stored in the system, which consumes far less storage when multiple files reference the same chunk or chunks. In one implementation, the file is replaced with a sparse file (if not already sparse) having a reparse point and/or stream that references the corresponding chunk data. The reparse point and/or stream contains information sufficient to allow reconstruction of the corresponding file data during subsequent I/O servicing. Alternative implementations for linking files to their corresponding chunks are feasible.
The pipeline 102 (labeled 102a and 102b in FIG. 2; note that the file-system-dependent module 102b may be considered independent of the pipeline) includes file-system-dependent stages, implemented as one or more modules per stage. These include a scanning stage 106 (including a scanner/crawler module 224) that scans a collection of files 104 of a storage volume 226 or the like to determine which are candidates for deduplication, typically those that have not yet been deduplicated. Other policies and/or criteria may be used, such as skipping encrypted files, for example, because such a file is unlikely to contain chunks that match those of another file. The output of the scanning stage essentially comprises a file list 228 that is dynamically consumed by the next stages of the deduplication pipeline, beginning with the selection stage 108.
The file system crawler 224 thus identifies the files (generally those not yet deduplicated) to be optimized in the current optimization session, outputting a list 228 that is dynamically consumed by other parts of the pipeline. The crawler 224 may work incrementally by quickly excluding already-optimized files, so that its scalability is not significantly affected by them. The crawler 224 is capable of providing a consistent list of files, such as by reading from a snapshot of the scanned volume, as described below. Block 102b also includes a file stream interface for file stream access (which provides secure access to the file content), for example, for use by the chunking and compression modules described below. Note that the chunking/hashing/compression modules may not have direct access to the file system (and need not be bound to file system characteristics at all); instead, such modules may gain access via a set of streaming interfaces that provide virtualized access to the file system.
Note that during a deduplication session, a snapshot of the files may be taken, the file list 228 may be written to a persistent queue (log) to ensure consistency while building the list of files to be scanned, and the snapshot discarded when no longer needed. The log allows the deduplication process to be suspended and resumed and provides crash resistance, while minimizing scan impact and providing other benefits, such as allowing progress/status reporting, estimating overall data size (useful in sizing data structures such as the index), and the like. Additional details of this two-phase process are described below with reference to FIG. 6.
In general, the selection stage 108 filters, sorts, and/or prioritizes (ranks) the candidates, such that, for example, the candidates most likely to yield high deduplication gains are processed through the pipeline first. Files may also be grouped to facilitate efficient processing and/or to enable optimal selection of the most appropriate modules in further stages of the pipeline. File attributes such as file name, type, attributes, location on disk, and so forth, and/or statistical attribute data such as the frequency of file operations over time, may be used to determine the policy for the selection stage 108. In general, the scanning stage 106 (crawler 224) and the selection stage 108 (file selector/filter, etc.) work together according to policy-driven criteria before feeding files to the other parts of the pipeline. Unlike a static filter that considers each file in isolation, the filtering can consider the files of the entire dataset and is therefore dataset-driven. For example, files that changed over the same time period and/or files with similar path/name attributes may be filtered, ranked, and/or grouped together. External data may also be used for filtering, such as local feedback from previous deduplication runs or global knowledge learned from other deduplication operations and implementations.
A ranking score may be assigned to each file (rather than simply making an include/exclude filtering decision). Such scoring facilitates ranking, to prioritize which files (or sub-files) are processed first so as to extract most of the deduplication savings as quickly as possible. The scoring may be based on machine learning techniques, such as using features weighted according to prior deduplication results. Grouping based on file attributes (which may include the assigned ranking scores) is another option that facilitates batching, parallelization, splitting, chunking, memory usage (e.g., keeping related data/modules for the same group in RAM), and so forth.
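A minimal sketch of this filter/rank/group flow follows (illustrative policy only: the attributes, thresholds, and score formula are hypothetical, and a production selector might instead use machine-learned weights as noted above):

```python
from dataclasses import dataclass
from itertools import groupby

@dataclass
class Candidate:
    path: str
    size: int
    extension: str
    days_since_modified: int

def select_files(candidates, min_age_days=30, min_size=64 * 1024):
    # Filter: policy-driven exclusion (too new or too small to pay off).
    eligible = [c for c in candidates
                if c.days_since_modified >= min_age_days and c.size >= min_size]
    # Rank: a toy score favoring large, cold files.
    eligible.sort(key=lambda c: c.size * c.days_since_modified, reverse=True)
    # Group: batch by extension so one chunking module serves each group;
    # Python's stable sort preserves the rank order within each group.
    eligible.sort(key=lambda c: c.extension)
    return {ext: list(grp)
            for ext, grp in groupby(eligible, key=lambda c: c.extension)}
```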
Also shown in FIG. 2 is a file stream access module 227, through which file reads may be performed using low-priority I/O to minimize the performance impact of file reads on the underlying system. The file stream access module 227 may use techniques such as oplocks to ensure that whenever some other external application attempts to access a file, the pipeline "backs off" from that file and closes its handle. Thus, the pipeline modules that access file content (the chunking, hashing, and compression modules) are easier to implement, because they do not have to implement the back-off logic themselves. Via this module, file stream access is implemented in a secure manner, using duplicated file handles that are securely transferred from the pipeline process to the hosting process containing modules such as those described below.
The chunking stage 110 (which may include, or follow, file decompression as needed in a given implementation) breaks the file content into chunks; these chunks may later be compressed. The chunking may be performed according to the structure of the file, by a content-driven algorithm (e.g., partitioning a media file into a media header and media body, which in turn may be hierarchically partitioned into a series of parts), or by an algorithm that chunks the file content based on fast hashing techniques computed repeatedly over a sliding window (such fast hash functions include CRC and the Rabin family of functions), where a chunk boundary is selected when the hash function and the current chunk size/content satisfy a particular heuristic.
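The sketch below illustrates one such sliding-window scheme, using a gear rolling hash (a fast rolling hash in the same spirit as the CRC/Rabin family named above; the table, mask, and size limits are illustrative parameters, not values from the specification):

```python
import random

random.seed(42)   # fixed table so chunk boundaries are reproducible
GEAR = [random.getrandbits(64) for _ in range(256)]
MASK = (1 << 13) - 1                      # average chunk size around 8 KiB
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024

def cdc_chunks(data: bytes):
    """Split data at content-defined boundaries via a gear rolling hash."""
    chunks, start, h = [], 0, 0
    for pos in range(len(data)):
        h = ((h << 1) + GEAR[data[pos]]) & 0xFFFFFFFFFFFFFFFF
        size = pos + 1 - start
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            chunks.append(data[start:pos + 1])   # heuristic satisfied
            start, h = pos + 1, 0
    if start < len(data):
        chunks.append(data[start:])              # trailing partial chunk
    return chunks
```

Because boundaries depend only on local content, an insertion near the start of a file shifts boundaries only locally, so most chunks continue to match previously stored ones.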
For each file selected for deduplication, the chunking stage 110 of the pipeline 102 may select (block 230 of FIG. 2) a chunking algorithm ChA1-ChAm depending on heuristics that may relate to file attributes such as file extension, header information, and so forth. For example, a generic chunking algorithm may be selected for one file, while another file may be given a chunking algorithm specific to its file extension (such as a chunking algorithm for ZIP parts). The chunking algorithm may also be selected based on hints from the file selection stage 108 or based on internal or external feedback 120.
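A selector along these lines might look as follows (a sketch building on the `cdc_chunks` example above; the extension map and the format-aware chunkers are hypothetical placeholders):

```python
def zip_chunker(data: bytes):
    """Placeholder for a chunker that splits at ZIP entry boundaries."""
    return cdc_chunks(data)

def media_chunker(data: bytes):
    """Placeholder for a header/body-aware media chunker."""
    return cdc_chunks(data)

def select_chunking_algorithm(path: str, header: bytes, hints=None):
    """Choose ChA1..ChAm from extension/header heuristics, selection-stage
    hints, or feedback; generic content-defined chunking is the default."""
    if hints and "chunker" in hints:
        return hints["chunker"]               # hint from the selection stage
    ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    if ext == "zip" or header[:2] == b"PK":   # ZIP local-file magic
        return zip_chunker
    if ext in ("mp4", "avi", "mkv"):
        return media_chunker
    return cdc_chunks
```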
After a particular chunking algorithm is selected for the current file, the chunking stage 110 of the pipeline performs the chunking. The chunking stage prepares a file record containing file-related metadata (such as file name, size, and so forth) that may be used by the chunking algorithm. The actual chunking algorithm may be executed in-process (if its execution is trusted) or in a separate hosting process (if there is a security risk). Separate hosting also facilitates resource monitoring and pipeline reliability and resilience; for example, if a chunking module encounters a "fatal" failure, the pipeline is not affected, and the failed module can be restarted. The pipeline can skip the file and process the next file with the restarted module.
As described below with reference to FIGS. 3-5, for separate hosting, the pipeline (via a data stream transfer initialization module coupled thereto) may initialize a "data stream transfer object" that represents a handle for the file's data stream. This handle can be used for secure file content access in the hosted process. The chunking stage 110 inserts the file records into the appropriate input queue of the hosting module associated with the selected chunking algorithm. When this queue buffer reaches a certain size, the entire buffer is sent to the hosted module for batch processing. The hosted module performs chunking for each file in the batch, using the file handle initialized above.
The result of the chunking stage includes a chunk list 232 (per file) that is communicated using a set of "chunk records," each of which contains associated metadata describing the type of data in the chunk. One example of such metadata is any rolling hash computed as part of executing the chunking algorithm. Another example is an indicator of the compression level of the actual data within the chunk (e.g., the ZIP chunking module will instruct the compression selector module not to compress chunks that are already compressed). Note that for hosted-process execution, chunks are inserted into the appropriate "output queue" and then sent back to the pipeline process in batches for processing.
The chunks, which can be processed in batches, are consumed by the next stage of the pipeline (i.e., the deduplication detection stage, which uniquely identifies each chunk by hashing and then uses the hash for deduplication detection) to produce a list of files whose chunks have been inserted into the chunk store. Note that, as represented in FIG. 2 via blocks 234 and 236, chunks may be merged if there are too many of them. More specifically, before adding a chunk to the output list, an additional "chunk merge" may be performed if there are too many chunks. This step may be based on specific policy configurations that allow no more than X chunks per MB, as a way of not degrading I/O performance (e.g., due to additional seeks beyond a specific limit), or by ensuring a specific maximum number of chunks per file, or maximum chunks per MB of data. An alternative approach is to perform a lookup before the merge is issued, which may provide finer-granularity savings (but incurs a penalty on deduplication time).
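A greedy form of the chunk-merge policy might be sketched as follows (illustrative only; the per-MB budget and the pairing rule are assumptions, with the real limits being policy configuration):

```python
def enforce_chunk_budget(chunks, max_per_mb=16):
    """Merge adjacent chunks until at most max_per_mb chunks per MB remain,
    trading some deduplication granularity for fewer seeks at read time."""
    total = sum(len(c) for c in chunks)
    budget = max(1, total * max_per_mb // (1 << 20))
    merged = list(chunks)
    while len(merged) > budget:
        # merge the smallest adjacent pair to keep chunk sizes balanced
        i = min(range(len(merged) - 1),
                key=lambda j: len(merged[j]) + len(merged[j + 1]))
        merged[i:i + 2] = [merged[i] + merged[i + 1]]
    return merged
```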
After chunking, the deduplication detection stage 112 determines whether a chunk already exists in the chunk store. A strong hash is computed for the chunk and used to invoke a lookup operation in the hash index service 242 (block 240). The hash index service indexes the hashes of some or all of the unique chunks already known to (or already stored in) the deduplication system.
Note that the hash index service may include a separate (extensible) module in the pipeline for hash computation. For example, one such module may use a cryptographically strong hash (such as SHA-256 or SHA-512) that ensures an extremely low likelihood of collisions between hashes. Inputs to such a module include a chunk "record" that contains a reference to the chunk (such as file identification/metadata and a file handle) and the offset of the chunk in the file. The hashing module may use the file stream access interface described above to securely read the chunk and hash its content, producing the strong hash needed by subsequent stages. The output of this module (the strong chunk hash) is appended to the existing chunk metadata.
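A hashing module of this kind reduces to a few lines (a sketch; the record layout and the `stream` object stand in for the chunk record and the secured file-stream interface):

```python
import hashlib

def hash_chunk(stream, chunk_record, algorithm="sha256"):
    """Read one chunk through the secured stream and append its strong hash
    (SHA-256 or SHA-512, per configuration) to the chunk's metadata."""
    stream.seek(chunk_record["offset"])
    data = stream.read(chunk_record["length"])
    chunk_record["hash"] = hashlib.new(algorithm, data).digest()
    return chunk_record
```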
If the hash index service 242 indicates that the chunk already exists in the chunk store, a chunk reference/count (block 244) is added via the chunk store module 246 to the chunk store 248. If the chunk does not already exist, the chunk is added as a new chunk to the chunk store 248. Note that the hash index service may be configured with efficiency-related tradeoffs that do not absolutely guarantee that a chunk is not already stored, so it is possible for a chunk to be duplicated more than once in the deduplication system. Thus, as used herein, when the hash service determines that a chunk is not already present in the chunk store(s), this means that it is likely not yet present, with reasonable likelihood rather than an absolute guarantee.
The chunk store module 246 maintains a persistent database of the actual chunks. The chunk store module 246 supports inserting a new chunk into the chunk store 248 (if no such chunk exists), adding reference data (block 244) to an existing chunk in the chunk store (upon detection of a previously persisted chunk), and committing a set of chunk insertions and/or chunk reference additions, as described below. The chunk store may also implement various background/maintenance jobs, including garbage collection, data/metadata checking, and the like.
The chunk store module 246 is pluggable, selectable, and extensible, similar to each of the other modules. The pipeline 102 may work with multiple chunk stores and store chunks based on their attributes. For example, popular chunks may be stored in a low-latency, lower-scale store, while the rest of the chunks may be stored in a higher-latency, high-scale (e.g., near-line) chunk store.
Chunks marked "add to chunk store" may be processed for compression. The compression algorithm selector (block 250) processes the file and chunk metadata (provided by the pipeline so far) and may attempt to determine which compression algorithm CA1-CAn works best, if any, for this type of data. After any compression is performed, the runtime (e.g., as part of the compression stage 114) can verify whether any substantial savings were achieved; for example, if the compressed chunk is larger than its uncompressed form, the uncompressed form is stored (or compression may be attempted again with a different algorithm). The compression algorithm may be selected based on policy, file type, and so forth.
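The select-then-verify behavior can be sketched as follows (the metadata keys and the two-algorithm policy are illustrative assumptions; standard zlib/LZMA stand in for whatever compressors a deployment plugs in):

```python
import zlib
import lzma

def compress_chunk(chunk: bytes, metadata: dict):
    """Pick a compressor from chunk/file metadata, then keep the result only
    if it actually shrinks the chunk; otherwise store it uncompressed."""
    if metadata.get("already_compressed"):     # e.g., hint from a ZIP chunker
        return chunk, None
    if metadata.get("cold_data"):              # rarely read: spend more CPU
        out, name = lzma.compress(chunk), "lzma"
    else:
        out, name = zlib.compress(chunk, 6), "deflate"
    if len(out) >= len(chunk):                 # no substantial saving
        return chunk, None                     # (or retry another algorithm)
    return out, name
```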
Adding new chunks to the chunk store 248 for a given file may be done in two phases by the commit module (stage) 116, which corresponds to modules 252 and 254 of FIG. 2. First, the new chunks are added (e.g., in batches) to the chunk store 248. Second, after the file's chunks have been processed, the chunk list is "committed" to the chunk store 248. The list of chunk locators is serialized into a self-referencing "stream map" structure (which may be represented as a byte array), which may then be used to create a chunk stream. The locator of this new chunk stream is stored in the reparse point associated with the file.
The file commit module/stage 116 transactionally replaces each file with a reference to its deduplicated data. To do so, the file commit module receives the list of files whose chunks have been inserted into the chunk store. In one implementation, each chunk list is encapsulated into a chunk ID stream 256 (identified by a unique stream ID), which is persistently stored in a reparse point associated with the file. During the commit operation, the file commit module transactionally replaces the file with a reparse point containing the ID and locator of the chunk stream (that is, a crash occurring during these updates does not leave the file system or the logical file contents in an inconsistent state). The chunk containing the stream map includes the list of chunks used to assemble the current file, along with their logical offsets.
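The second (commit) phase can be sketched as follows (`insert_stream_map` and the `reparse_points` mapping are hypothetical stand-ins for the chunk store interface and the file system's reparse point mechanism):

```python
import json

def commit_file(path, chunk_records, chunk_store, reparse_points):
    """Serialize the chunk locators into a 'stream map', store it as a chunk
    stream, and swap the file body for a reference to that stream."""
    stream_map = [{"offset": r["offset"], "length": r["length"],
                   "locator": r["hash"].hex()} for r in chunk_records]
    stream_id = chunk_store.insert_stream_map(json.dumps(stream_map).encode())
    # In the real system this replacement is transactional: a crash leaves
    # either the original file or a valid reparse point, never neither.
    reparse_points[path] = {"stream_id": stream_id}
    return stream_id
```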
Note that file system updates can be committed in batches. For example, replacing the reparse points of N files (N being on the order of hundreds or thousands) may be followed by a flush operation that ensures file-system-level consistency with respect to the pre-optimization state. Files that change during deduplication can be ignored, because the crawl/scan is performed on a snapshot and mismatches can be detected via timestamps, as described below.
As seen above, data deduplication sessions are fully reentrant and restartable (despite crashes, reboots, or failovers occurring in the middle of processing). The pipeline is also capable of operating with low system resources (e.g., low-priority I/O and low CPU usage), and it is designed to operate reliably and maintain data consistency regardless of the reliability of the underlying hardware. Furthermore, the data deduplication system is designed to work in an efficient asynchronous/batch mode in all of its stages. Multiple instances of a module may be created and may operate in parallel (either on one machine or on multiple machines), resulting in better overall utilization of hardware resources. For example, CPU-intensive activities (chunking, hashing, and compression) may be load-balanced across multiple CPUs by a generic job execution framework implemented as part of the pipeline. In addition, for multiple volumes, asynchronous independent execution is supported, allowing multiple data deduplication pipelines to run in parallel.
In addition, the stages of the pipeline are externally tunable and provide hooks for feedback 120 (FIG. 1); that is, the pipeline is adaptive, and its policies, selection algorithms, parameters, and so forth can be adjusted while running, or for future runs, based on knowledge gained so far. The feedback may be internal (based on the processed data) and/or external 122 (provided by an external entity, e.g., based on other volumes or data optimized on other machines).
FIGS. 3-5 provide examples of the main interactions of the pipeline components during a data optimization session on a single volume, as well as additional details of servicing data for optimized files. The pipeline operates within a deduplication service 330 coupled to a storage stack 332 that is accessible by a file server client 334 (such as a typical file server client that accesses files via SMB, NFS, or another protocol). The storage stack includes SRV (drivers implementing an SMB stack, e.g., srv2.sys), a deduplication driver handling the I/O path for optimized files, and the chunk store 248, as described above. In the examples herein, NTFS is the underlying file system holding the deduplicated and non-deduplicated files of the storage volume.
The deduplication service 330 may be accessed by an administration client 338, which includes scripts, command-line tools, UIs, or other administrative applications that manage the deduplication service 330 on the current machine, remotely or locally, for example through a public API. Also represented is a server-side branch cache service component 340 that uses the data deduplication service 330 to generate chunks/signatures for files served in a branch cache scenario.
As represented in FIG. 3, the deduplication service 330 includes a management/configuration API module 342 that exposes the management APIs of the deduplication service 330. This API also exposes functionality for managing aspects of data deduplication, such as defining directory inclusion/exclusion rules for file deduplication, workload management, and the like. The configuration manager 344 includes modules that maintain persistent configuration state, such as publicly exposed or private "tuning knobs," scheduling information, policy settings, and so forth. The pipeline API exposes an API set 346 that can be called by other external components, such as the branch cache service component 340, that consume the functionality of the pipeline 102.
The policy engine 348 includes modules for managing policy settings for a volume or the overall machine. Such policy settings may include, for example, the minimum age of a file for it to be considered for deduplication. The workload manager 350 includes a module responsible for initiating and coordinating multiple background management/maintenance jobs (some of these jobs may be invoked manually). Running the data deduplication pipeline is one of these jobs; it is typically a scheduled background job but may be run on demand. At the end of execution, the workload manager 350 may generate a report. The workload manager 350 defines a process model for the optimization workload, and in one implementation, pipeline optimization runs in its own separate worker process (one process per scan), which allows natural machine resource sharing when scans are parallelized. Other jobs may include garbage collection, defragmentation of the chunk store, data integrity checking, metadata checking, and the like.
The hosting process management component 352 is generally responsible for managing (e.g., creating, destroying) the low-priority hosting processes used for data manipulation and data parsing algorithms, such as the chunking modules. For security reasons, these modules run in separate low-priority processes. The streaming data initialization module 354 includes a utility module for preparing secure access to the actual file stream from within the low-priority hosting process.
Also shown in FIG. 3 is a compression/decompression module 356, which in one implementation includes a shared library for handling in-memory compression and decompression of chunks/buffers. The compression stage 114 is performed during data deduplication (as described above), while decompression is used in the read/write path, when deduplicated data is rehydrated.
The hosting process 358 includes a process designed to host modules in a low-priority, isolated manner, such as hosted chunking modules running in a separate process. Examples include file and chunk buffer management modules that are responsible for managing the input/output buffers of the hosted chunking modules, which are used to minimize cross-process traffic. Other examples include the hosted chunking algorithm, an in-process module that performs the actual data chunking as described above, and a streaming data access module that implements the API for secure data access from within the hosting process. Note that, for security reasons, the hosted algorithms do not have direct access to the file system.
Data deduplication sessions are typically invoked on a schedule. The same deduplication sequence (referred to as a "deduplication session") may be performed for each volume involved in deduplication. The following example is described in terms of a single volume, but it should be understood that deduplication of multiple volumes and/or across multiple machines may be performed in parallel.
Prior to actual deduplication, an initialization stage of the pipeline may be performed, including reading the latest policies and configuration settings from the configuration manager 344 and reading the latest values of per-volume persistent state, such as the locality indicator (for the hash index). Other initialization actions include instantiating the pipeline modules, including the hosted chunking algorithms. During this stage, the hosting process is started and the input/output queues are initialized. Also during this stage, each module in the pipeline is configured with the appropriate parameters, e.g., read from a configuration database. In addition, the hash index service 242 is initialized to load its in-memory data structures so as to be ready to service chunk hash lookup requests, and the crawler module is initialized to initiate a file scan of the current volume. For consistency reasons, the crawler may have its own per-module initialization phase, which may include operations such as creating a snapshot of the volume. As part of the preparation, the data stream transfer initialization module 354 initializes a "data stream transfer object" representing a handle for the file's data stream. This handle may be used for file content access in the hosted process, as described below.
During the optimization session, the crawler 224 scans the files, filters them according to policy-driven criteria, and feeds them to the pipeline 102. The files provided by the crawler may not be in actual scan order, because the selection module 108 (FIG. 1) may sort the list of files according to various criteria, such as the last-modified timestamp, the file type, and so forth.
For each file selected for optimization, the pipeline selects a chunking algorithm, e.g., depending on file properties such as file extension, header information, and so forth, and the pipeline 102 runtime (chunking stage 110) performs the chunking. The runtime prepares a file record containing file-related metadata (such as file name, size, and so forth) that can be used by the chunking algorithm. The actual chunking algorithm may be executed in-process (if its execution is trusted) or in a separate process (if there is a security risk). If the chunking algorithm is hosted directly in the same process, it is simply executed.
If the chunking algorithm is implemented in a separate host process, asynchronous/batch mode execution is typically performed. To this end, as generally represented in FIG. 4, the pipeline runtime inserts file records into the appropriate input queue 440 of a hosting module 444 (e.g., corresponding to module 358) associated with the selected/hosted chunking algorithm (e.g., hosted module 446). If this queue buffer reaches a certain size, the entire buffer is sent to the hosted module 446 for batch processing.
The hosted module 446 performs chunking for each file in the batch, using the file handle initialized above. The result of the chunking execution is a chunk list (per file) that is put into a chunk queue 442 for batched return from the hosted process. The resulting chunks are passed back using a set of "chunk records" that contain associated metadata describing the type of data in the chunks. Examples of such metadata are described above.
Thus, the pipeline 102 supports a high-performance asynchronous/batch processing model that allows efficient, asynchronous/batched exchange of files/chunks between the main workload process and the hosting process, in a manner that avoids repeated cross-process transitions for each individual file or chunk. Further, batches may be processed by different machines in parallel, thereby providing scalability.
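The batching discipline between the pipeline and a hosted module can be sketched with in-process queues (illustrative; the real system exchanges batches across process boundaries, and `record["stream"]` stands in for the secured, duplicated-handle stream access):

```python
import queue

BATCH_SIZE = 64

def pipeline_side(file_records, input_q):
    """Producer (pipeline process): accumulate file records and ship full
    batches, so there is one cross-process hop per batch, not per file."""
    batch = []
    for record in file_records:
        batch.append(record)
        if len(batch) >= BATCH_SIZE:
            input_q.put(batch)
            batch = []
    if batch:
        input_q.put(batch)
    input_q.put(None)                    # end-of-stream sentinel

def hosted_side(input_q, output_q, chunker):
    """Consumer (hosting process): chunk each file of a batch and return
    the resulting chunk records in bulk."""
    while (batch := input_q.get()) is not None:
        output_q.put([{"path": r["path"],
                       "chunks": chunker(r["stream"].read())} for r in batch])
    output_q.put(None)
```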
After chunking has been performed, the pipeline runtime determines whether the chunks already exist in the chunk store. To do so, the runtime computes the strong hash for each chunk and invokes a lookup operation in the hash index service 242, taking action based on the result, as described above.
If a chunk is marked "add to chunk store," the pipeline may attempt compression, which may include executing the compression algorithm selector 250 (FIGS. 2 and 3), as generally described above. As also described above, new chunks are added to the chunk store in two phases: first, the new chunks (or references to existing chunks) are added to the chunk store in batches; second, after the file's chunks have been processed, the chunk list is "committed" to the chunk store. In one implementation, the list of chunk locators is serialized into a "stream map" byte array, which is used to create the chunk stream; the locator of this new chunk stream is stored in the reparse point. During the commit operation, the pipeline runtime also transactionally replaces the file with a reparse point containing the ID and locator of the chunk stream, as described above.
Note that the pipeline functions the same way in the case of re-deduplication, which refers to re-deduplicating a file that was written to (and thus is no longer fully deduplicated) after initial deduplication. In this case, only the dirty extents of the file (e.g., the extents corresponding to data that changed since the last optimization) are chunked, as described in the co-pending U.S. patent application entitled "Partial Recall of Deduplicated Files" (attorney docket number 331301.01), filed concurrently herewith and incorporated herein by reference.
As previously described, the hosting architecture ensures secure access to file contents from the hosted modules, so that hosted modules have controlled, read-only access to the relevant files only. To this end, the data streaming support (block 354) provides a class in the main (pipeline) process that performs the initialization of the file handle, as generally represented in FIG. 5. This utility class implements a method to obtain a private file handle that is accessible only by the target process, e.g., GetFileHandle([in] filePath, [in] targetProcessID, [out] duplicatedHandle).
In an example implementation, a class that exposes the file handle as an IStream 550 in the hosted process is also provided. This enables an in-memory IStream wrapper around the read-only file handle passed in the per-file DDP_BATCH_ITEM through the module host interface. Its internal members include the read-only file handle, file metadata, and the current read offset.
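The duplicated-handle pattern can be approximated in Python with `multiprocessing.reduction`, which uses DuplicateHandle on Windows and descriptor passing on Unix (a sketch; `some_file.dat` and the 4 KiB read are arbitrary stand-ins, and the hosted side never opens files by path itself):

```python
import os
from multiprocessing import Pipe, Process
from multiprocessing.reduction import recv_handle, send_handle

def hosted_worker(conn):
    """Hosted-module side: receive a duplicated read-only descriptor and
    read through it, never touching the file system directly."""
    fd = recv_handle(conn)
    with os.fdopen(fd, "rb") as f:
        data = f.read(4096)          # e.g., feed the chunking algorithm
    conn.send(len(data))

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    worker = Process(target=hosted_worker, args=(child_conn,))
    worker.start()
    fd = os.open("some_file.dat", os.O_RDONLY)   # opened by the main process
    send_handle(parent_conn, fd, worker.pid)     # duplicated into the worker
    os.close(fd)
    print("bytes readable in worker:", parent_conn.recv())
    worker.join()
```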
Turning to additional details of the crawler 224, in one implementation the crawler 224 operates via a two-phase, log-based file scan, as generally represented in FIG. 6, which allows for sorting, file system isolation, statistical reporting, and so forth. In general, during a deduplication session, a (VSS) snapshot 660 of the volume may be taken by the master crawler component 662 and logged to a persistent queue (log) 664 (with the snapshot discarded thereafter). The log 664 allows suspend and resume, as well as crash resistance, while minimizing scan impact and providing other benefits, such as allowing progress/status reporting, estimating overall data size (useful in sizing data structures such as the index), ordering and prioritizing the files to be deduplicated, and so forth.
At each run, the crawler 224 enumerates a list of files that have not been optimized and that meet the current policy-specified optimization criteria (e.g., files unchanged for 30 days or more). In a typical deployment, immediately after deduplication is enabled for a particular volume, none of the files will yet have been deduplicated. In this case, the deduplication process begins to deduplicate the files incrementally (in the particular policy-driven optimization order). This execution mode is restartable, meaning that if the task is canceled or interrupted due to a machine reboot, deduplication continues the next time. Note that there is a possibility that a file is deleted or modified before its chunking completes. The system may compare the original file ID and change timestamp with the target file ID and timestamp prior to the actual replacement; if there is a mismatch, deduplication is aborted for that particular file.
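The mismatch check amounts to comparing identity and change-time before the swap (a sketch; inode number and mtime stand in for the NTFS file ID and change timestamp):

```python
import os

def commit_is_safe(path, scanned_file_id, scanned_mtime_ns):
    """Return False if the file was deleted or touched since it was scanned,
    in which case deduplication of this file is aborted."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False                              # deleted since the scan
    return (st.st_ino, st.st_mtime_ns) == (scanned_file_id, scanned_mtime_ns)
```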
Thus, the crawl/scan is performed by a crawler thread 666 over the snapshot, into the log file 664. More specifically, the crawler master 662 is created by the pipeline 102 and accepts control calls from the pipeline, including taking the snapshot of the volume during initialization. The crawler thread 666 appends entries to the log file 664, generally so as to minimize the lifetime of the snapshot (as an optimization). The pipeline is then fed from this log file 664.
The crawler thread 666 walks the snapshot's files in a manner that may depend on the underlying file system. While walking, the crawler thread 666 appends entries to the log file 664. Also shown in FIG. 6 is a pipeline feeder (driver) thread 668. This thread services the pipeline; for example, it reads the log file 664, builds the necessary information into a file object, and calls the Pipeline::OnFile callback.
As can be seen, using the log file 664 as an intermediate location for the file objects fed to the pipeline is an optimization of the overall groveling process. The log file serves to minimize the lifetime of the VSS snapshot taken for the groveling session on the volume, to allow ordering by file extension and the like (e.g., partitioning by type if needed), and to collect the number and total size of the files in the groveling session for progress reporting.
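The two phases can be sketched as follows (illustrative; `os.walk` over a snapshot mount point stands in for the file-system-specific walk, and a JSON-lines file stands in for the persistent log 664):

```python
import json
import os

def phase_one_scan(snapshot_root, log_path):
    """Phase 1: walk the snapshot quickly, appending one record per candidate
    file, so the snapshot can be released as early as possible."""
    count, total = 0, 0
    with open(log_path, "a", encoding="utf-8") as log:
        for dirpath, _dirs, names in os.walk(snapshot_root):
            for name in names:
                full = os.path.join(dirpath, name)
                size = os.path.getsize(full)
                log.write(json.dumps({"path": full, "size": size}) + "\n")
                count, total = count + 1, total + size
    return count, total   # enables progress reporting and index sizing

def phase_two_feed(log_path):
    """Phase 2: replay the persistent log, yielding file records to the
    pipeline; restartable because the log survives crashes and reboots."""
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            yield json.loads(line)
```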
Via the pipeline and supporting components described above, the deduplication process maintains essentially no persistent state (it is stateless in nature, with perhaps an exception such as "restart hints" stored persistently by the crawler to indicate to the next scan job where to restart the scan, so that incremental scans, e.g., daily ones, need not repeatedly start from the same location). The deduplication process may be canceled at any time, e.g., manually or via a "back-off" operation. From a transactional perspective, the deduplication sequence is crash-consistent, in that a crash at any point leaves the file system in a usable state without requiring recovery. In addition, the deduplication sequence is robust to graceful or surprise removal of the underlying volume; as with other services, in situations where file/volume handles become invalid due to dismount, the deduplication code backs off.
Also as described above, the pipeline is designed to operate in an efficient asynchronous/batch mode in all of its stages and to support independent parallel execution of multiple data optimization pipelines for multiple volumes. The pipeline thus provides a modular, extensible design for deduplication processing and algorithm selection, while meeting CPU and memory utilization requirements and performance and throughput requirements, and providing security via safe and efficient hosting of the optimization modules and parallel processing. In addition, the pipeline provides a way to limit the scope of optimization and to prioritize, through file filtering, ranking, and grouping via the selection module. Deduplication can thus be improved at substantially any stage (e.g., file selection, optimization, algorithm selection, and so forth).
Exemplary networked and distributed environments
Those skilled in the art will appreciate that the embodiments and methods described herein may be implemented in connection with any computer or other client or server device, which may be deployed as part of a computer network or in a distributed computing environment, and which may be connected to any type of data store or stores. In this regard, the embodiments described herein may be implemented in any computer system or environment having any number of memory or storage units, and any number of applications and processes occurring across any number of storage units. This includes, but is not limited to, an environment with server computers and client computers deployed in a network environment or distributed computing environment, having remote or local storage.
Distributed computing provides sharing of computer resources and services through the exchange of communications between computing devices and systems. These resources and services include the exchange of information, cache storage and disk storage for objects such as files. These resources and services also include processing power sharing among multiple processing units for load balancing, resource expansion, processing specialization, and so forth. Distributed computing makes use of network connections, allowing clients to leverage their collective power to benefit the entire enterprise. In this regard, various devices may have applications, objects, or resources that may participate in a resource management mechanism as described with reference to various embodiments of the invention.
FIG. 7 provides a schematic diagram of an exemplary networked or distributed computing environment. The distributed computing environment comprises computing objects 710, 712, etc. and computing objects or devices 720, 722, 724, 726, 728, etc. that may include programs, methods, data stores, programmable logic, etc. as represented by example applications 730, 732, 734, 736, 738. It is to be appreciated that computing objects 710, 712, etc. and computing objects or devices 720, 722, 724, 726, 728, etc. can comprise different devices, such as Personal Digital Assistants (PDAs), audio/video devices, mobile phones, MP3 players, personal computers, laptop computers, etc.
Each computing object 710, 712, etc. and computing objects or devices 720, 722, 724, 726, 728, etc. can communicate with one or more other computing objects 710, 712, etc. and computing objects or devices 720, 722, 724, 726, 728, etc. by way of the communications network 740, either directly or indirectly. Although illustrated as a single element in fig. 7, communications network 740 may comprise other computing objects and computing devices that provide services to the system of fig. 7, and/or may represent multiple interconnected networks, which are not shown. Each computing object 710, 712, etc. or computing object or device 720, 722, 724, 726, 728, etc. can also contain an application, such as applications 730, 732, 734, 736, 738, that can utilize an API, or other object, software, firmware, and/or hardware, suitable for communication with the application implementations provided in accordance with the various embodiments of the invention.
There are a variety of systems, components, and network configurations that support distributed computing environments. For example, computing systems may be connected together by wired or wireless systems, by local networks, or by widely distributed networks. Currently, many networks are coupled to the Internet, which provides an infrastructure for widely distributed computing and encompasses many different networks, but any network infrastructure can be used to facilitate exemplary communications with the system as described in various embodiments.
Thus, a host of network topologies and network infrastructures, such as client/server, peer-to-peer, or hybrid architectures, can be utilized. A "client" is a member of a class or group that uses the services of another class or group to which it is not related. A client may be a process, for example, roughly a set of instructions or tasks, that requests a service provided by another program or process. The client process uses the requested service without having to "know" any working details about the other program or the service itself.
In a client/server architecture, particularly a networked system, a client is typically a computer that accesses shared network resources provided by another computer (e.g., a server). In the illustration of FIG. 7, as a non-limiting example, computing objects or devices 720, 722, 724, 726, 728, etc. can be thought of as clients and computing objects 710, 712, etc. can be thought of as servers where computing objects 710, 712, etc. act as servers providing data services, such as receiving data from client computing objects or devices 720, 722, 724, 726, 728, etc., storing data, processing data, transmitting data to client computing objects or devices 720, 722, 724, 726, 728, etc., although any computer can be considered a client, a server, or both, depending on the circumstances.
A server is typically a remote computer system accessible over a remote or local network, such as the Internet or wireless network infrastructure. A client process may be active in a first computer system and a server process may be active in a second computer system, communicating with each other over a communications medium, thereby providing distributed functionality and allowing multiple clients to take advantage of the information-gathering capabilities of the server.
In a network environment in which the communications network 740 or bus is the Internet, for example, the computing objects 710, 712, etc. can be Web servers with which other computing objects or devices 720, 722, 724, 726, 728, etc. communicate via any of a number of known protocols, such as the Hypertext transfer protocol (HTTP). Computing objects 710, 712, etc. acting as servers may also serve as clients, e.g., computing objects or devices 720, 722, 724, 726, 728, etc., as may be characteristic of a distributed computing environment.
Exemplary computing device
As noted above, the techniques described herein may be advantageously applied to any device. It should be understood, therefore, that handheld, portable and other computing devices and computing objects of all kinds are contemplated for use in connection with the various embodiments. Accordingly, the general purpose remote computer described below in FIG. 8 is but one example of a computing device.
Embodiments may be implemented in part via an operating system, for use by a developer of services for a device or object, and/or included within application software that operates to perform one or more functional aspects of the embodiments described herein. Software may be described in the general context of computer-executable instructions, such as program modules, being executed by one or more computers, such as client workstations, servers or other devices. Those skilled in the art will appreciate that computer systems have a variety of configurations and protocols that can be used to communicate data, and thus no particular configuration or protocol should be considered limiting.
FIG. 8 thus illustrates an example of a suitable computing system environment 800 in which one or more aspects of the embodiments described herein can be implemented, although as made clear above, the computing system environment 800 is only one example of a suitable computing environment and is not intended to suggest any limitation as to scope of use or functionality. Moreover, the computing system environment 800 is not intended to be interpreted as having any dependency relating to any one or combination of components illustrated in the exemplary computing system environment 800.
With reference to FIG. 8, an exemplary remote device for implementing one or more embodiments includes a general purpose computing device in the form of a computer 810. Components of computer 810 may include, but are not limited to, a processing unit 820, a system memory 830, and a system bus 822 that couples various system components including the system memory to the processing unit 820.
Computer 810 typically includes a variety of computer readable media and can be any available media that can be accessed by computer 810. The system memory 830 may include computer storage media in the form of volatile and/or nonvolatile memory such as Read Only Memory (ROM) and/or Random Access Memory (RAM). By way of example, and not limitation, system memory 830 may also include an operating system, application programs, other program modules, and program data.
A user may enter commands and information into the computer 810 through input device(s) 840. A monitor or other type of display device is also connected to the system bus 822 via an interface, such as output interface 850. In addition to a monitor, computers can also include other peripheral output devices such as speakers and a printer, which may be connected through output interface 850.
The computer 810 may operate in a networked or distributed environment using logical connections to one or more other remote computers, such as a remote computer 870. The remote computer 870 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, or any other remote media use or transmission device, and may include any or all of the elements described above relative to the computer 810. The logical connections depicted in FIG. 8 include a network 872, such as a Local Area Network (LAN) or a Wide Area Network (WAN), but may also include other networks/buses. Such networking environments are commonplace in homes, offices, enterprise-wide computer networks, intranets and the Internet.
As mentioned above, while exemplary embodiments have been described in connection with various computing devices and network architectures, the underlying concepts may be applied to any network system and any computing device or system in which it is desirable to improve the efficiency of resource usage.
Moreover, there are numerous ways of implementing the same or similar functionality, e.g., an appropriate API, tool kit, driver code, operating system, control, standalone or downloadable software object, etc., which enables applications and services to use the techniques provided herein. Thus, embodiments herein are contemplated from the standpoint of an API (or other software object), as well as from a software or hardware object that implements one or more embodiments as described herein. Thus, various embodiments described herein may have aspects that are wholly in hardware, partly in hardware and partly in software, as well as in software.
The word "exemplary" is used herein to mean serving as an example, instance, or illustration. For the avoidance of doubt, the subject matter disclosed herein is not limited to these examples. Moreover, any aspect or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects or designs. Furthermore, to the extent that the terms "includes," "has," "includes," and other similar words are used, for the avoidance of doubt, such terms are intended to be inclusive in a manner similar to the term "comprising" as an open transition word without precluding any additional or other elements when employed in a claim.
As noted, the various techniques described herein may be implemented in connection with hardware or software or, where appropriate, with a combination of both. As used herein, the terms "component," "module," "system," and the like are likewise intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computer and the computer can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
The system as described above has been described with reference to interaction between several components. It will be understood that such systems and components can include those components or specified sub-components, some of the specified components or sub-components, and/or additional components, according to various permutations and combinations of the foregoing. Sub-components can also be implemented as components communicatively coupled to other components rather than included within parent (hierarchical) components. Additionally, it should be noted that one or more components may be combined into a single component providing aggregate functionality or divided into several separate sub-components, and any one or more middle layers, such as a management layer, may be provided to communicatively couple to such sub-components in order to provide integrated functionality. Any components described herein may also interact with one or more other components not specifically described herein but generally known by those of skill in the art.
In view of the exemplary systems described herein, methodologies that may be implemented in accordance with the described subject matter can also be understood from the flow charts with reference to the figures. While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of blocks, it is to be understood and appreciated that the embodiments are not limited by the order of the blocks, as some blocks may occur in different orders and/or concurrently with other blocks from what is depicted and described herein. Although non-sequential or branched flow is illustrated via flowchart, it can be appreciated that various other branches, flow paths, and orders of the blocks, may be implemented which achieve the same or similar result. Moreover, some illustrated blocks may be optional in implementing the methodologies described hereinafter.
Conclusion
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.
In addition to the embodiments described herein, it is to be understood that other similar embodiments may be used or modifications and additions may be made to the described embodiments for performing the same or equivalent function of the corresponding embodiments without deviating therefrom. Further, multiple processing chips or multiple devices may share the performance of one or more functions described herein, and similarly, storage may be effected across multiple devices. Accordingly, the present invention should not be limited to any single embodiment, but rather should be construed in breadth, spirit and scope in accordance with the appended claims.
Claims (9)
1. In a computing environment, a system includes a data deduplication pipeline (102), the data deduplication pipeline comprising:
a scan stage comprising a crawler that selects files for deduplication via the pipeline, the crawler configured to access a policy to determine which files to select for deduplication;
a chunking stage (110) configured to segment the data of a file into chunks, wherein the chunking stage comprises one or more modules each corresponding to a chunking algorithm;
a deduplication detection stage (112) configured to determine, for each chunk, whether the chunk has already been stored in the deduplication system; and
a commit stage (116) that commits, to the deduplication system, chunks that the deduplication detection stage has determined are not already stored in the deduplication system, and commits reference data for chunks that the deduplication detection stage has determined are already stored.
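By way of a non-limiting illustration (and not the claimed implementation), the staged, modular structure recited in claim 1 can be sketched in Python. All class names below are hypothetical, and the fixed-size chunking and SHA-256 index are simplifying assumptions:

```python
import hashlib

class ChunkingModule:
    """Pluggable module corresponding to one chunking algorithm (chunking stage 110)."""
    def chunk(self, data: bytes, size: int = 64 * 1024):
        # Fixed-size chunking stands in for whatever algorithm a module implements.
        return [data[i:i + size] for i in range(0, len(data), size)]

class DeduplicationDetector:
    """Deduplication detection stage 112: is this chunk already stored?"""
    def __init__(self):
        self.index = {}  # chunk hash -> position in the chunk store

    def exists(self, chunk: bytes) -> bool:
        return hashlib.sha256(chunk).hexdigest() in self.index

class CommitStage:
    """Commit stage 116: store new chunks; return reference data for known ones."""
    def __init__(self, detector: DeduplicationDetector):
        self.detector = detector
        self.chunk_store = []

    def commit(self, chunk: bytes) -> int:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in self.detector.index:
            self.chunk_store.append(chunk)
            self.detector.index[digest] = len(self.chunk_store) - 1
        return self.detector.index[digest]  # reference usable in file metadata
```

Because each stage is an ordinary object, a module can be replaced, selected, or added without touching the rest of the pipeline, which is the extensibility property the claim describes.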
2. The system of claim 1, wherein the chunking stage comprises at least two modules that perform chunking on different subsets of the files in parallel, or wherein the commit stage comprises at least two modules that store chunks in one or more chunk stores of the deduplication system in parallel, or both.
3. The system of claim 1, further comprising a compression stage including one or more modules each corresponding to a compression algorithm for compressing at least one of the chunks prior to committing the chunk to the deduplication system, wherein, if the compression stage includes a plurality of available compression algorithms, the system further comprises a compression algorithm selector configured to select a compression algorithm from among the plurality.
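As one hypothetical illustration of the compression algorithm selector of claim 3 (the concrete algorithms and the best-ratio policy below are assumptions, not the claimed method), a selector might try each available module and keep the smallest output:

```python
import lzma
import zlib

def select_compression(chunk: bytes,
                       algorithms=(("zlib", zlib.compress), ("lzma", lzma.compress))):
    # Keep the algorithm with the best ratio; fall back to storing the
    # chunk uncompressed when no algorithm actually shrinks it.
    best_name, best_out = None, chunk
    for name, compress in algorithms:
        out = compress(chunk)
        if len(out) < len(best_out):
            best_name, best_out = name, out
    return best_name, best_out  # best_name is None for "store uncompressed"
```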
4. The system of claim 1, further comprising a compression stage comprising at least two modules that perform compression on different subsets of the chunks in parallel, or a hashing stage comprising at least two modules that perform hashing on different subsets of the chunks in parallel, or both.
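The parallelism recited in claims 2 and 4, with different modules working on different subsets of files or chunks at once, might look like the following sketch; the thread pool and SHA-256 hashing are illustrative assumptions:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def hash_chunks_parallel(chunks, workers=4):
    # Each worker hashes a different subset of the chunks; the same pattern
    # applies to parallel chunking, compression, or commit modules.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: hashlib.sha256(c).hexdigest(), chunks))
```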
5. The system of claim 1, further comprising a selection stage configured to receive files identified via the scan stage or another mechanism, or both, and to access a policy to perform filtering, ranking, ordering, or grouping, or any combination thereof, on the files before providing the files for further processing via the pipeline.
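A selection stage of the kind recited in claim 5 might apply a policy such as the hypothetical one below, which filters out files unlikely to yield savings and ranks the remainder; the size threshold and extension filter are illustrative assumptions:

```python
import os

def select_files(paths, min_size=32 * 1024, skip_suffixes=(".tmp",)):
    # Filter: drop small or temporary files. Rank: largest files first,
    # on the assumption that they offer the largest potential savings.
    candidates = [p for p in paths
                  if os.path.getsize(p) >= min_size
                  and not p.endswith(skip_suffixes)]
    return sorted(candidates, key=os.path.getsize, reverse=True)
```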
6. The system of claim 1, wherein the pipeline is configured to batch multiple files into a file queue or other batched grouping of files, or to batch multiple chunks into a chunk queue or other batched grouping of chunks, or both.
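The file and chunk batching of claim 6 can be pictured as bounded queues between stages. In this sketch (the queue sizes are arbitrary assumptions), a full queue also gives natural back-pressure between stages:

```python
from queue import Queue

file_queue: Queue = Queue(maxsize=128)    # batched files awaiting chunking
chunk_queue: Queue = Queue(maxsize=1024)  # batched chunks awaiting commit

def enqueue_files(paths):
    for path in paths:
        # put() blocks while the batch is full, so a slow downstream
        # stage automatically throttles the producer.
        file_queue.put(path)
```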
7. In a computing environment, a method executed at least in part on at least one processor, comprising:
receiving a file to be deduplicated (228), the file selected for deduplication by a crawler, the crawler configured to access a policy to determine which files to select for deduplication;
processing data of the file into chunks (230) in a modular chunking stage (110) comprising one or more chunking algorithms;
providing the chunks (232) to an indexing stage that determines (240) whether each of the chunks is already present in a deduplication system;
for each chunk, committing (254) the chunk to a chunk store if the chunk is determined not to already exist in the deduplication system, and committing the reference data for the chunk if the chunk is determined to exist in the deduplication system; and
committing (116) reference information for the file, the reference information corresponding to the one or more chunks extracted from the file.
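Reusing the hypothetical classes from the sketch after claim 1, the end-to-end flow of the method of claim 7 might read as follows; this is again a simplified assumption (for example, the whole file is read into memory at once):

```python
def deduplicate_file(path, chunker, committer):
    # Chunking stage -> deduplication detection + commit -> per-file references.
    with open(path, "rb") as f:
        data = f.read()
    chunk_refs = [committer.commit(c) for c in chunker.chunk(data)]
    return {"path": path, "chunk_refs": chunk_refs}  # reference information
```

For example, `deduplicate_file("a.txt", ChunkingModule(), CommitStage(DeduplicationDetector()))` would return the reference information to be recorded for the file.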
8. The method of claim 7, further comprising: obtaining a snapshot of a set of candidate files to be deduplicated; scanning the candidate files to select files to be deduplicated; logging the files to be deduplicated; processing the files in the log, based on attributes of the files, statistical attributes of the files, statistically derived attributes of a file dataset, internal feedback, or external feedback, or any combination thereof, to perform filtering, ranking, ordering, or grouping, or any combination thereof, on the files; and outputting the files for further deduplication processing.
9. A method, comprising:
selecting a file (224, 228) for data deduplication, the selecting comprising accessing a policy to determine which files to select for deduplication;
queuing the file (440) for batch processing;
processing the file into chunks (444, 446) in a secure modular chunking stage comprising one or more chunking algorithms;
queuing the chunks (442) for batch processing;
processing the chunks to determine (240) whether each chunk already exists in a deduplication system, storing (252) each chunk that does not already exist in the deduplication system, and storing (116) reference data (256) for each chunk that already exists;
committing (254) the one or more chunks, or chunk reference data, or both the one or more chunks and chunk reference data, to the deduplication system in conjunction with updating an index for each chunk that is not already present in the deduplication system; and
updating file metadata to associate the file with a reference to the one or more chunks.
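The final metadata-update step of claim 9 replaces the file's data with pointers to its chunks. As a hypothetical stand-in for a filesystem-level mechanism (e.g., a reparse point on some systems), the sketch below records the association in a JSON catalog; the catalog format is an assumption for illustration only:

```python
import json

def update_file_metadata(catalog_path, file_path, chunk_refs):
    # Associate the file with references to its chunks instead of its bytes.
    try:
        with open(catalog_path) as f:
            catalog = json.load(f)
    except FileNotFoundError:
        catalog = {}
    catalog[file_path] = {"chunk_refs": chunk_refs}
    with open(catalog_path, "w") as f:
        json.dump(catalog, f, indent=2)
```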
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/970,839 US8380681B2 (en) | 2010-12-16 | 2010-12-16 | Extensible pipeline for data deduplication |
US12/970,839 | 2010-12-16 |
Publications (2)
Publication Number | Publication Date |
---|---|
HK1173515A1 HK1173515A1 (en) | 2013-05-16 |
HK1173515B true HK1173515B (en) | 2016-03-24 |