
HK40016528B - Method, computer-readable storage medium and electronic apparatus for reducing video data - Google Patents


Info

Publication number: HK40016528B
Application number: HK62020006057.7A
Authority: HK (Hong Kong)
Prior art keywords: basic data, data, simplified, lossless, frame
Other languages: Chinese (zh)
Other versions: HK40016528A (en)
Inventor: H·莎朗潘尼 (H. Sharangpani)
Original assignee: 阿斯卡瓦公司 (Ascava, Inc.)
Application filed by 阿斯卡瓦公司
Publication of HK40016528A (patent)
Publication of HK40016528B (patent)


Description

Method, computer-readable storage medium and electronic apparatus for reducing video data

Technical Field

The present disclosure relates to data storage, retrieval, and communication. More specifically, the present disclosure relates to performing multidimensional search and content-associative retrieval on data that has been losslessly reduced using a basic data sieve.

Background

The modern information age is marked by the creation, capture, and analysis of enormous amounts of data. New data is generated from diverse sources, examples of which include purchase transaction records, corporate and government records and communications, email, social media posts, digital pictures and videos, machine logs, signals from embedded devices, digital sensors, cellular phone global positioning satellites, space satellites, scientific computing, and grand-challenge science. Data is generated in diverse formats, and much of it is unstructured and unsuited for entry into traditional databases. Businesses, governments, and individuals generate data at an unprecedented rate and struggle to store, analyze, and communicate this data. Tens of billions of dollars are spent annually on purchases of storage systems to hold the accumulating data, and similarly large sums are spent on the computer systems needed to process the data.

In most modern computer and storage systems, data is housed and deployed across multiple tiers of storage, organized as a storage hierarchy. The data that needs to be accessed often and quickly is placed in the fastest, but also most expensive, tier, while the bulk of the data (including copies for backup) is preferably stored in the densest and cheapest storage media. The fastest and most expensive tier of data storage is the computer system's volatile random access memory, or RAM, which resides in close proximity to the microprocessor core and offers the lowest latency and highest bandwidth for random access of data. Progressively denser and cheaper, but slower, tiers (with progressively higher latency and lower bandwidth of random access) include non-volatile solid-state memory or flash storage, hard disk drives (HDDs), and finally tape drives.

In order to store and process the growing data more effectively, the computer industry has continued to make improvements to the density and speed of data storage media and to the processing power of computers. However, the volume of data is increasing far faster than improvements in the capacity and density of computing and data storage systems. Statistics from the data storage industry in 2014 reveal that the new data created and captured in the past couple of years comprises a large fraction of the data ever captured in the world. The amount of data created in the world to date is estimated to exceed multiple zettabytes (a zettabyte is 10^21 bytes). The massive increase in data places high demands on data storage, computing, and communication systems that must store, process, and communicate this data reliably. This motivates the increased use of lossless data reduction or compression techniques to compact the data so that it can be stored at lower cost, and likewise processed and communicated efficiently.

A variety of lossless data reduction or compression techniques have emerged and evolved over the years. These techniques examine the data to look for some form of redundancy in the data and exploit that redundancy to realize a reduction of the data footprint without any loss of information. For a given technique that looks to exploit a specific form of redundancy in the data, the degree of data reduction achieved depends upon how frequently that specific form of redundancy is found in the data. It is desirable for a data reduction technique to be able to flexibly discover and exploit any available redundancy in the data. Since data originates from a wide variety of sources and environments and in a variety of formats, there is great interest in the development and adoption of universal lossless data reduction techniques to handle this diverse data. A universal data reduction technique requires no prior knowledge of the input data other than its alphabet; hence, it can be applied generally to any and all data without needing to know beforehand the structure and statistical distribution characteristics of the data.

Measures of goodness that can be used to compare different implementations of data compression techniques include the degree of data reduction achieved on the target dataset, the efficiency with which the compression or reduction is achieved, and the efficiency with which the data is decompressed and retrieved for future use. The efficiency measures assess the performance and cost-effectiveness of the solution. Performance measures include the throughput or ingest rate at which new data can be consumed and reduced, the latency or time required to reduce the input data, the throughput or rate at which the data can be decompressed and retrieved, and the latency or time required to decompress and retrieve the data. Cost measures include the cost of any required dedicated hardware components, such as the microprocessor cores or the microprocessor utilization (central processing unit utilization), the amount of dedicated scratch memory and memory bandwidth, and the number of accesses and bandwidth needed from the various tiers of storage that hold the data. It should be noted that reducing the data footprint while simultaneously providing efficient and fast compression as well as decompression and retrieval has the benefit not only of lowering the overall cost of storing and communicating the data, but also of efficiently enabling subsequent processing of the data.

Many of the universal data compression techniques currently used in the industry derive from the Lempel-Ziv compression method developed in 1977 by Abraham Lempel and Jacob Ziv; see, for example, Jacob Ziv and Abraham Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, Vol. IT-23, No. 3, May 1977. This method became the basis for enabling efficient data transmission via the Internet. The Lempel-Ziv methods (namely LZ77, LZ78, and their variants) reduce the data footprint by replacing repeated occurrences of a string with a reference to a previous occurrence of that string seen within a sliding window of the sequentially presented input data stream. On consuming a fresh string from a given block of the input data stream, these techniques search through all strings previously seen within the current and preceding blocks, up to the length of the window. If the fresh string is a duplicate, it is replaced by a backward reference to the original string. If the number of bytes eliminated by the duplicated string is larger than the number of bytes required for the backward reference, a reduction of the data has been achieved. To search through all the strings seen in the window, and to provide maximal string matching, implementations of these techniques employ a variety of schemes, including iterative scanning and building a temporary bookkeeping structure that contains a dictionary of all the strings seen in the window. Upon consuming new bytes of input to assemble a fresh string, these techniques either scan through all the bytes in the existing window, or make references to the dictionary of strings (followed by some computation) to decide whether a duplicate has been found and to replace it with a backward reference (or to decide whether an addition needs to be made to the dictionary).
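The sliding-window scheme described above can be illustrated with a minimal, deliberately naive Python sketch. The token format, window size, and minimum match length below are arbitrary illustrative choices, not the format of any particular implementation; repeated strings are replaced by (offset, length) backward references and everything else passes through as literal bytes.

```python
def lz77_reduce(data: bytes, window: int = 32 * 1024, min_match: int = 3):
    """Return a list of tokens: literal bytes (int) or (offset, length) pairs."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        start = max(0, i - window)
        # Naive scan over the whole window; real implementations use
        # hash chains or dictionaries, as noted in the text.
        for j in range(start, i):
            length = 0
            while (i + length < len(data) and length < 255
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= min_match:
            tokens.append((best_off, best_len))  # backward reference
            i += best_len
        else:
            tokens.append(data[i])               # literal byte
            i += 1
    return tokens

def lz77_expand(tokens) -> bytes:
    """Reconstruct the original bytes from literals and backward references."""
    out = bytearray()
    for t in tokens:
        if isinstance(t, tuple):
            off, length = t
            for _ in range(length):   # byte-by-byte copy allows overlap
                out.append(out[-off])
        else:
            out.append(t)
    return bytes(out)

data = b"abcabcabcabcabcxyzabc"
tokens = lz77_reduce(data)
```

A reduction is achieved only when the eliminated bytes outnumber the bytes spent encoding the reference, exactly as the paragraph above states.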

The Lempel-Ziv compression method is often accompanied by a second optimization applied to the data, in which source symbols are dynamically re-encoded based upon their frequency or probability of occurrence in the block of data being compressed, often employing a variable-width encoding scheme so that shorter-length codes are used for the more frequent symbols, thus leading to a reduction of the data. For an example of this entropy-based re-encoding method, see David A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes," Proceedings of the IRE - Institute of Radio Engineers, September 1952, pp. 1098-1101. This technique is referred to as Huffman re-encoding, and it typically needs a first pass through the data to compute the frequencies and a second pass through the data to actually encode the data. Several variations around this theme are also in use.
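The two-pass structure of Huffman re-encoding can be sketched as follows: pass 1 counts symbol frequencies, then a code tree is built bottom-up by repeatedly merging the two least-frequent nodes, so that more frequent symbols receive shorter codes. This sketch computes only the code lengths and omits the actual bit-packing pass.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data: bytes) -> dict:
    """Return a {symbol: code length in bits} map for the given block."""
    freq = Counter(data)                       # pass 1: count frequencies
    if len(freq) == 1:                         # degenerate one-symbol block
        return {next(iter(freq)): 1}
    # Heap entries: (frequency, tiebreak, {symbol: depth-so-far}).
    heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)        # two least-frequent nodes
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in d1.items()}
        merged.update({s: d + 1 for s, d in d2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

lengths = huffman_code_lengths(b"aaaaaaaabbbbccd")
```

The second pass would then emit each symbol using its assigned variable-width code; the need for two passes is what makes the scheme slow on larger blocks, as noted later in the text.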

One example of the use of these techniques is a scheme known as "Deflate," which combines the Lempel-Ziv LZ77 compression method with Huffman re-encoding. Deflate provides a compressed data stream format specification that specifies a method for representing a sequence of bytes as a (usually shorter) sequence of bits, and a method for packing the latter bit sequences into bytes. The Deflate scheme was originally designed by Phillip W. Katz of PKWARE, Inc. for the PKZIP archiving utility. See, for example, U.S. Patent 5,051,745, "String searcher, and compressor using same," issued to Phillip W. Katz on September 24, 1991. U.S. Patent 5,051,745 describes a method for searching a vector of symbols (the window) for a predetermined target string (the input string). The solution employs a pointer array with a pointer to each symbol in the window, and uses a hashing method to filter the possible locations in the window at which an identical copy of the input string needs to be searched for. This is followed by scanning and string matching at those locations.

The Deflate scheme is implemented in the zlib library for data compression. zlib is a software library that is a key component of several software platforms such as Linux, Mac OS X, and iOS, as well as a variety of game consoles. The zlib library provides Deflate compression and decompression code for use by zip (file archiving), gzip (single file compression), png (Portable Network Graphics, a format for losslessly compressed images), and many other applications. zlib is now widely used for data transmission and storage. Most HTTP transactions by servers and browsers compress and decompress the data using zlib. Similar implementations are increasingly being used by data storage systems.

An article published by Intel Corp. in April 2014, titled "High Performance ZLIB Compression on Intel® Architecture Processors," describes the compression and performance of an optimized version of the zlib library running on a contemporary Intel processor (Core i7 4770 processor, 3.4 GHz, 8 MB cache) and operating upon the Calgary corpus of data. The Deflate format used in zlib sets the minimum string length for matching to 3 characters, the maximum match length to 256 characters, and the size of the window to 32 kilobytes. The implementation provides controls for 9 levels of optimization, with level 9 providing the highest compression but using the most computation and performing the most exhaustive matching of strings, and level 1 being the fastest level, employing greedy string matching. The article reports that, using a single-threaded processor, a compression ratio of 51% is achieved using zlib level 1 (the fastest level), spending an average of 17.66 clocks/byte of input data. At a clock frequency of 3.4 GHz, this implies an ingest rate of 192 MB/sec while using up a single processor core. The report also describes how the performance rapidly drops to an ingest rate of 38 MB/sec (average of 88.1 clocks/byte) using level 6 optimization for a modest gain in compression, and to an ingest rate of 16 MB/sec (average of 209.5 clocks/byte) using level 9 optimization.

Existing data compression schemes typically operate at ingest rates ranging from 10 MB/sec to 200 MB/sec using a single processor core on contemporary microprocessors. To further boost the ingest rate, multiple cores are employed, or the window size is reduced. Further improvements of the ingest rate can be achieved using custom hardware accelerators, albeit at increased cost.

The existing data compression methods described above are effective at exploiting fine-grained redundancy among short strings and symbols within a local window, typically the size of a single message or file, or a few files. However, these methods have serious limitations and drawbacks when they are used in applications that operate on large or extremely large datasets and that require high rates of data ingestion and data retrieval.

One important limitation is that practical implementations of these methods can exploit redundancy efficiently only within a local window. While these implementations can accept arbitrarily long input streams of data, efficiency dictates that a limit be placed on the size of the window within which fine-grained redundancy is to be discovered. These methods are highly compute-intensive and need frequent and speedy access to all the data in the window. String matching and lookups of the various bookkeeping structures are triggered upon consuming every fresh byte (or every few bytes) of input data that yields a fresh input string. In order to achieve the desired ingest rate, the window and the associated machinery used for string matching must reside mostly in the processor cache subsystem, which in practice places a constraint upon the window size.

For example, to achieve an ingest rate of 200 MB/sec on a single processor core, the average available time budget per ingested byte (inclusive of all data accesses and computation) is 5 ns, which means 17 clocks using a contemporary processor with an operating frequency of 3.4 GHz. This budget accommodates accesses to on-chip caches (which take a handful of cycles) followed by some string matching. Current processors have on-chip caches of a few megabytes of capacity. An access to main memory takes over 200 cycles (~70 ns), so larger windows residing mostly in memory will further slow the ingest rate. Furthermore, as the window size increases and the distance to a duplicate string increases, so does the cost of specifying the length of backward references, so that only longer strings are worth searching for as duplicates across the wider scope.
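The time-budget arithmetic in this paragraph, together with the zlib level 1 figure quoted earlier, can be checked directly. The 3.4 GHz clock frequency is the one assumed in the text.

```python
# Clock-budget arithmetic: ingest rate and clocks/byte are reciprocals
# scaled by the clock frequency (3.4 GHz per the text).
CLOCK_HZ = 3.4e9

def ns_per_byte(rate_mb_s: float) -> float:
    """Average available time per ingested byte, in nanoseconds."""
    return 1e9 / (rate_mb_s * 1e6)

def clocks_per_byte(rate_mb_s: float) -> float:
    """Average available clock cycles per ingested byte."""
    return CLOCK_HZ / (rate_mb_s * 1e6)

def rate_mb_s(clocks: float) -> float:
    """Ingest rate in MB/sec implied by a clocks/byte cost."""
    return CLOCK_HZ / clocks / 1e6
```

At 200 MB/sec this yields the 5 ns and 17-clock budget stated above, and the 17.66 clocks/byte measured for zlib level 1 yields the ~192 MB/sec ingest rate quoted earlier.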

On most contemporary data storage systems, the footprint of the data stored across the various tiers of the storage hierarchy is several orders of magnitude larger than the memory capacity of the system. For example, while a system could provide hundreds of gigabytes of memory, the data footprint of the active data residing in flash storage could be in the tens of terabytes, and the total data in the storage system could be in the range of hundreds of terabytes to multiple petabytes. Also, the achievable throughput of data accesses to each successive tier of storage drops by an order of magnitude or more. When the sliding window gets so large that it can no longer fit in memory, these techniques get throttled by the significantly lower bandwidth and higher latency of random IO (input or output operation) access to the next tier of data storage.

For example, consider a file or page of 4 kilobytes of incoming data that could be assembled from existing data by making references to, say, 100 strings of average length 40 bytes that already exist in the data and are spread across a 256-terabyte footprint. Each reference would cost 6 bytes to specify its address and 1 byte for the string length, while promising to save 40 bytes. Although the page described in this example could be compressed more than fivefold, the ingest rate for this page would be limited by the 100 or more IO accesses to the storage system needed to fetch and verify the 100 duplicate strings (even if one could perfectly and cheaply predict where these strings reside). A storage system that offers 250,000 random IO accesses/sec (which means a random access bandwidth of 1 GB/sec for pages of 4 KB) could compress only 2,500 such pages of 4 KB size per second, for an ingest rate of 10 MB/sec, while using up all the bandwidth of the storage system, rendering it unusable as a storage system.
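The arithmetic of this example can be reproduced as follows; all figures are taken from the text.

```python
# A 4 KB page rebuilt from 100 backward references, each costing
# 6 bytes of address + 1 byte of length while replacing a 40-byte string.
page_bytes = 4 * 1024
n_refs = 100
ref_cost = 6 + 1           # bytes per reference
saved_per_ref = 40         # bytes eliminated per reference

reduced_size = page_bytes - n_refs * saved_per_ref + n_refs * ref_cost
compression_factor = page_bytes / reduced_size

# IO-limited ingest: one storage access per referenced string.
iops = 250_000             # random IO accesses per second
ios_per_page = 100
pages_per_sec = iops / ios_per_page
ingest_mb_s = pages_per_sec * page_bytes / 1e6
```

The page compresses more than fivefold, yet the system ingests only about 10 MB/sec, which is the point of the example: the ingest rate is bounded by IO accesses, not by achievable compression.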

Implementations of conventional compression methods with large window sizes on the order of terabytes or petabytes would be starved by the reduced bandwidth of data access to the storage system, and would be unacceptably slow. Hence, practical implementations of these techniques can efficiently discover and exploit redundancy only if it exists locally, within window sizes that fit in the processor cache or system memory. If redundant data is separated, either spatially or temporally, from the incoming data by multiple terabytes, petabytes, or exabytes, these implementations will be unable to discover the redundancy at acceptable speeds, being limited by storage access bandwidth.

Another limitation of conventional methods is that they are not suited for random access of data. Blocks of data spanning the entire window that was compressed need to be decompressed before any chunk within any block can be accessed. This places a practical limit on the size of the window. Furthermore, operations that are traditionally performed on uncompressed data (e.g., a search operation) cannot be performed efficiently on the compressed data.

Yet another limitation of conventional methods (and, in particular, of Lempel-Ziv-based methods) is that they search for redundancy along only one dimension, namely the replacement of identical strings by backward references. A limitation of the Huffman re-encoding scheme is that it needs two passes through the data, to compute the frequencies and then to re-encode, which becomes slow on larger blocks.

Data compression methods that detect long duplicate strings across a global store of data use a combination of digital fingerprinting and hashing schemes. This compression process is referred to as data deduplication. The most basic technique of data deduplication breaks up files into fixed-size blocks and looks for duplicate blocks across the data repository. If a copy of a file is created, each block in the first file will have a duplicate in the second file, and the duplicate can be replaced with a reference to the original block. To speed up the matching of potentially duplicate blocks, a hashing method is employed. A hash function is a function that converts a string into a numeric value, referred to as its hash value. If two strings are equal, their hash values are also equal. Hash functions map multiple strings to a given hash value, whereby a long string can be reduced to a hash value of much shorter length. Matching of hash values is far faster than matching of two long strings; hence, the matching of hash values is done first, to filter out candidate strings that could be duplicates. If the hash value of an input string or block matches the hash value of a string or block that exists in the repository, the input string can then be compared with each string in the repository that has the same hash value, to confirm the existence of the duplicate.
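A minimal sketch of fixed-size-block deduplication as just described follows. The choice of SHA-256 as the hash function and 4 KB as the block size are illustrative assumptions, not prescribed by the text; the byte-for-byte comparison after a hash match mirrors the confirmation step described above.

```python
import hashlib

BLOCK_SIZE = 4096  # fixed block size (assumed)

def dedup_store(data: bytes, repo: dict):
    """Split data into fixed-size blocks, store new blocks in repo keyed by
    their hash, and return a recipe of hashes that reconstructs the data."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        h = hashlib.sha256(block).hexdigest()
        if h in repo:
            # Candidate duplicate: confirm byte-for-byte before reusing,
            # since a hash match alone is not proof of equality.
            assert repo[h] == block
        else:
            repo[h] = block
        recipe.append(h)
    return recipe

def dedup_restore(recipe, repo: dict) -> bytes:
    """Reassemble the original data from its recipe of block hashes."""
    return b"".join(repo[h] for h in recipe)

repo = {}
original = b"A" * 8192 + b"B" * 4096   # three 4 KB blocks, two identical
recipe = dedup_store(original, repo)
dedup_store(original, repo)            # a full copy adds no new blocks
```

Storing a second copy of the file adds nothing to the repository: every block is replaced by a reference to the block already stored, which is the case where this technique shines.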

Breaking up a file into fixed-size blocks is simple and convenient, and fixed-size blocks are highly desirable in a high-performance storage system. However, this technique is limited in the amount of redundancy it can uncover, which means that these techniques have low levels of compression. For example, if a copy of a first file is made to create a second file, and even if just a single byte of data is inserted into the second file, the alignment of all downstream blocks will change, the hash value of each new block will be computed afresh, and the data deduplication method will no longer find all the duplicates.

To address this limitation in data deduplication methods, the industry has adopted the use of fingerprinting to synchronize and align the data stream at locations of matching content. This latter scheme leads to variable-size blocks based on the fingerprints. Michael Rabin showed how bit strings can be fingerprinted using randomly chosen irreducible polynomials; see, for example, Michael O. Rabin, "Fingerprinting by Random Polynomials," Center for Research in Computing Technology, Harvard University, TR-15-81, 1981. In this scheme, a randomly chosen prime number p is used to fingerprint a long string of characters, by computing the residue of that string, viewed as a large integer, modulo p. This scheme requires performing integer arithmetic on k-bit integers, where k = log2(p). Alternatively, a random irreducible polynomial of degree k can be used, in which case the fingerprint is the polynomial representation of the data modulo the prime polynomial.

This method of fingerprinting is used in data deduplication systems to identify suitable locations at which to establish chunk boundaries, so that the system can look for duplicates of these chunks in a global repository. Chunk boundaries can be set upon finding a fingerprint of a specific value. As an example of this usage, by employing a polynomial of degree 32 or lower, a fingerprint can be computed for each 48-byte string of the input data (starting at the first byte of the input, and then at every successive byte thereafter). The 13 low-order bits of the 32-bit fingerprint can then be examined, and a breakpoint set whenever the value of those 13 bits is a prespecified value (e.g., the value 1). For random data, the probability of those 13 bits having that particular value would be 1 in 2^13, so that approximately one such breakpoint is likely to be encountered every 8 KB, leading to variable-size chunks of average size 8 KB. The breakpoints or chunk boundaries will effectively be aligned to fingerprints that depend upon the content of the data. When no fingerprint has been found for a long stretch, a breakpoint can be forced at some prespecified threshold, so that the system is assured of creating chunks shorter than a prespecified size for the repository. See, for example, Athicha Muthitacharoen, Benjie Chen, and David Mazières, "A Low-bandwidth Network File System," SOSP '01, Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, 10/21/2001, pp. 174-187.
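The content-defined chunking scheme just described can be sketched as follows. The 48-byte fingerprint window, the examination of the 13 low-order bits, and the forced breakpoint all follow the text; the polynomial fingerprint is replaced here by a simple multiplicative rolling hash (the BASE/MOD constants are assumptions), which serves the same content-alignment purpose for illustration.

```python
import random

WIN = 48                 # bytes covered by each fingerprint (per the text)
MASK = (1 << 13) - 1     # examine the 13 low-order bits (per the text)
TARGET = 1               # prespecified breakpoint value (per the text)
MAX_CHUNK = 16 * 1024    # forced-breakpoint threshold (assumed)
BASE = 257               # rolling-hash constants (assumptions)
MOD = (1 << 31) - 1
POW = pow(BASE, WIN - 1, MOD)

def chunk(data: bytes):
    """Return chunk end-offsets chosen by the fingerprint of a sliding
    48-byte window; a breakpoint is forced after MAX_CHUNK bytes."""
    boundaries, h, last = [], 0, 0
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD
        if i >= WIN:
            h = (h - data[i - WIN] * POW * BASE) % MOD  # slide the window
        hit = (h & MASK) == TARGET and i + 1 - last >= WIN
        if hit or i + 1 - last >= MAX_CHUNK:
            boundaries.append(i + 1)
            last = i + 1
    if last < len(data):
        boundaries.append(len(data))
    return boundaries

# Unlike fixed-size blocks, boundaries depend on content, so chunking
# decisions before an edit point are unaffected by the edit.
random.seed(0)
data = bytes(random.randrange(256) for _ in range(100_000))
edited = data[:50_000] + b"Z" + data[50_000:]
b1, b2 = chunk(data), chunk(edited)
```

With a 13-bit mask the expected chunk size is about 8 KB, and because boundaries align to content rather than to byte offsets, a single inserted byte perturbs the chunking only near the edit, which is precisely the property that fixed-size blocks lack.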

The Rabin-Karp string matching technique developed by Michael Rabin and Richard Karp provided further improvements to the efficiency of fingerprinting and string matching (see, for example, Michael O. Rabin and R. Karp, "Efficient Randomized Pattern-Matching Algorithms," IBM Jour. of Res. and Dev., vol. 31, 1987, pp. 249-260). It should be noted that a fingerprinting method that examines the fingerprint of an m-byte substring can evaluate the fingerprinting polynomial in time O(m). Since this method would need to be applied on a substring starting at every byte of, say, an n-byte input stream, the total effort required to perform fingerprinting on the entire data stream would be O(n×m). Rabin-Karp identified a hash function referred to as a rolling hash, for which it is possible to compute the hash value of the next substring from that of the previous substring by performing only a constant number of operations, independent of the length of the substring. Hence, after shifting right by one byte, the fingerprint can be computed incrementally on the new m-byte string. This reduces the effort to compute a fingerprint to O(1), and the total effort for fingerprinting the entire data stream to O(n), linear in the size of the data. This greatly speeds up the computation and identification of the fingerprints.
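The rolling-hash update can be written in a few lines. This sketch uses a simple polynomial hash with arbitrary illustrative constants rather than Rabin's irreducible-polynomial arithmetic, but it exhibits the same property: sliding the m-byte window right by one byte costs O(1) instead of O(m).

```python
BASE, MOD = 257, (1 << 31) - 1   # illustrative constants (assumptions)

def hash_window(s: bytes) -> int:
    """O(m) polynomial hash of a full window, for comparison."""
    h = 0
    for b in s:
        h = (h * BASE + b) % MOD
    return h

def roll(h: int, out_byte: int, in_byte: int, m: int) -> int:
    """O(1) update: drop the outgoing byte, append the incoming byte."""
    h = (h - out_byte * pow(BASE, m - 1, MOD)) % MOD
    return (h * BASE + in_byte) % MOD
```

Each O(1) update of `roll` produces exactly the value that recomputing `hash_window` from scratch would give, which is what reduces the total fingerprinting effort from O(n×m) to O(n).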

The typical data access and computational requirements of the data deduplication methods described above can be characterized as follows. For a given input, once fingerprinting is completed to create a chunk, and after the hash value for the chunk is computed, these methods first need one set of accesses to memory and subsequent tiers of storage to search and look up the global hash table that keeps the hash values of all the chunks in the repository. This will typically need a first IO access to storage. Upon a match in the hash table, this is followed by a second set of storage IOs (usually one, but possibly more than one, depending upon how many chunks with the same hash value exist in the repository) to fetch the actual data chunks bearing the same hash value. Lastly, byte-by-byte matching is performed to compare the input chunk with the fetched potentially matching chunks, in order to confirm and identify the duplicate. This is followed by a third storage IO access (to the metadata space) for replacing the new duplicate chunk with a reference to the original chunk. If there is no match in the global hash table (or if no duplicate is found), the system needs one IO to enter the new chunk into the repository, and another IO to update the global hash table in order to enter in the new hash value. Thus, for large datasets (where the metadata and the global hash table cannot fit in memory, so that storage IOs are needed to access them), such systems could need an average of three IOs per input chunk. Further improvements are possible by employing a variety of filters, so that a miss in the global hash table can often be detected without using the first storage IO to access the global hash table, thereby reducing the number of IOs needed to process some of the chunks to two.

A storage system that offers 250,000 random IO accesses/sec (which means 1 GB/sec of random access bandwidth on 4 KB pages) can ingest and deduplicate about 83,333 (250,000 divided by 3 IOs per input chunk) input chunks of average size 4 KB per second, enabling an ingest rate of 333 MB/sec while using up all the bandwidth of the storage system. If only half the bandwidth of the storage system is used (so that the other half remains available for accesses to the stored data), such a deduplication system can still deliver an ingest rate of 166 MB/sec. These ingest rates (which are limited by IO bandwidth) are achievable provided sufficient processing power is available in the system. Thus, given sufficient processing power, data deduplication systems are able to find large duplicates of data across a global scope of the data with an economy of IOs, and deliver data reduction at ingest rates in the hundreds of megabytes per second on contemporary storage systems.
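The throughput arithmetic in this paragraph can be reproduced directly (all figures are taken from the text; KB and MB are treated as decimal units here, to match the quoted rates):

```python
# Figures quoted in the paragraph above.
iops = 250_000          # random IO accesses per second
ios_per_chunk = 3       # average IOs needed to deduplicate one input chunk
chunk_kb = 4            # average input chunk size, in KB

chunks_per_sec = iops // ios_per_chunk                 # chunks ingested per second
ingest_mb_per_sec = chunks_per_sec * chunk_kb / 1000   # at full storage bandwidth
half_bw_mb_per_sec = (iops // 2 // ios_per_chunk) * chunk_kb / 1000

assert chunks_per_sec == 83333
assert int(ingest_mb_per_sec) == 333
assert int(half_bw_mb_per_sec) == 166
```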

It should be clear from the foregoing description that, while these deduplication methods are effective at finding duplicates of long strings across a global scope, they are effective mainly at finding large duplicates. If the data is varied or modified at a finer grain, the available redundancy will not be found using this method. This greatly reduces the breadth of the datasets across which these methods are effective. These methods have found use in certain data storage systems and applications, such as regular backups of data, where the new data being backed up has only a few files modified and the rest are all duplicates of files that were saved in the previous backup. Likewise, data-deduplication-based systems are often deployed in environments where multiple exact copies of the data or code are made, such as virtualized environments in data centers. However, as data evolves and is modified more generally or at a finer grain, data-deduplication-based techniques lose their effectiveness.

Some approaches (usually employed in data backup applications) do not perform the actual byte-by-byte comparison between the input data and the string whose hash value matches that of the input. Such solutions rely on the low probability of a collision using a strong hash function like SHA-1. However, due to the finite non-zero probability of a collision (where multiple different strings could map to the same hash value), such methods cannot be considered to provide lossless data reduction, and would therefore not meet the high data-integrity requirements of primary storage and communication.

Some approaches combine multiple existing data compression techniques. In such a setup, a global data deduplication method is typically applied to the data first. Subsequently, on the deduplicated dataset, and employing a small window, the Lempel-Ziv string compression method combined with Huffman re-encoding is applied to achieve further data reduction.

However, in spite of employing all hitherto-known techniques, there continues to be a gap of several orders of magnitude between the needs of the growing and accumulating data and what the world economy can affordably accommodate using the best available modern storage systems. Given the extraordinary storage capacity requirements demanded by the growing data, there continues to be a need for improved ways to further reduce the footprint of the data. There continues to be a need to develop methods that address the limitations of existing techniques, or that exploit available redundancy in the data along dimensions that have not been addressed by existing techniques. At the same time, it continues to be important to be able to efficiently access and retrieve the data at acceptable speeds and at acceptable costs of processing. There also continues to be a need to be able to efficiently perform search operations directly on the reduced data.

In summary, there continues to be a long-felt need for lossless data reduction solutions that can exploit redundancy across large and extremely large datasets, and that can provide high rates of data ingestion, data search, and data retrieval.

Summary of the Invention

Embodiments described herein feature techniques and systems that can perform lossless data reduction on large and extremely large datasets, while providing high rates of data ingestion and data retrieval, and that do not suffer from the drawbacks and limitations of existing data compression systems.

Specifically, some embodiments can extract compressed moving picture data and compressed audio data from video data. Next, the embodiments can extract intra-frames (I-frames) from the compressed moving picture data. The embodiments can then losslessly reduce the I-frames to obtain losslessly reduced I-frames. Losslessly reducing the I-frames can include, for each I-frame, (1) identifying a first set of basic data units by using the I-frame to perform a first content-associative lookup on a data structure that organizes basic data units based on their contents, and (2) using the first set of basic data units to losslessly reduce the I-frame. The embodiments can additionally decompress the compressed audio data to obtain a set of audio components. Next, for each audio component in the set of audio components, the embodiments can (1) identify a second set of basic data units by using the audio component to perform a second content-associative lookup on the data structure that organizes basic data units based on their contents, and (2) use the second set of basic data units to losslessly reduce the audio component.
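As a toy illustration of this control flow (the frame tagging, the dict-based stand-in for the content-organized data structure, and all names are hypothetical; real MPEG demultiplexing and the tree-organized basic data filter are far more involved), consider:

```python
def reduce_element(element: bytes, store: dict):
    """Losslessly reduce one element against a toy content-keyed store
    (exact-match lookup only; a stand-in for content-associative search)."""
    if element in store:                 # duplicate found: emit a reference
        return ("ref", store[element])
    ref = len(store)                     # otherwise install as a new basic data unit
    store[element] = ref
    return ("new", ref)

# Toy "video": frames tagged I (intra) or P (predicted); audio already
# decompressed into components. Only I-frames and audio components are reduced.
frames = [("I", b"sky"), ("P", b"delta1"), ("I", b"sky"), ("I", b"grass")]
audio_components = [b"tone-a", b"tone-b", b"tone-a"]

i_store, a_store = {}, {}
reduced_i = [reduce_element(d, i_store) for tag, d in frames if tag == "I"]
reduced_a = [reduce_element(c, a_store) for c in audio_components]

assert reduced_i == [("new", 0), ("ref", 0), ("new", 1)]
assert reduced_a == [("new", 0), ("new", 1), ("ref", 0)]
```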

Some embodiments can initialize a data structure that is stored in a first memory device and is configured to organize basic data units based on their contents. Next, the embodiments can factorize input data into a sequence of candidate units. For each candidate unit, the embodiments can (1) identify a set of basic data units by using the candidate unit to perform a content-associative lookup on the data structure, and (2) losslessly reduce the candidate unit by using the set of basic data units, wherein the candidate unit is added to the data structure as a new basic data unit if the size of the candidate unit is not sufficiently reduced. Next, the embodiments can store the losslessly reduced candidate units on a second memory device. Upon detecting that the size of one or more components of the data structure is greater than a threshold, the embodiments can (1) move the one or more components of the data structure to the second memory device, and (2) initialize the one or more components of the data structure that were moved to the second memory device. A batch of losslessly reduced data can include (1) the losslessly reduced candidate units stored on the second memory device between temporally adjacent initializations, and (2) the components of the data structure moved to the second memory device between temporally adjacent initializations. In a variation, the embodiments can create a set of parcels based on the batches of losslessly reduced data stored on the second memory device, wherein the set of parcels facilitates archiving the data and moving it from one computer to another.

Some embodiments can factorize input data into a sequence of candidate units. Next, for each candidate unit, the embodiments can (1) partition the candidate unit into one or more fields, (2) for each field, divide the field by a prime polynomial to obtain a quotient-and-remainder pair, (3) determine a name based on the one or more quotient-and-remainder pairs, (4) identify a set of basic data units by using the name to perform a content-associative lookup on a data structure that organizes basic data units based on the corresponding names of the basic data units, and (5) losslessly reduce the candidate unit by using the set of basic data units.
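A minimal sketch of this naming scheme, assuming the polynomial arithmetic is carry-less (over GF(2)) and using an illustrative irreducible polynomial and field width (both are assumptions for illustration, not values prescribed by the embodiments):

```python
def gf2_divmod(dividend: int, divisor: int):
    """Divide one GF(2) polynomial (bits = coefficients) by another via
    carry-less long division; return the (quotient, remainder) pair."""
    q = 0
    while dividend.bit_length() >= divisor.bit_length():
        shift = dividend.bit_length() - divisor.bit_length()
        q ^= 1 << shift
        dividend ^= divisor << shift
    return q, dividend

def name_of(candidate: bytes, field_size=4, prime_poly=0b100011011):
    """Toy name: the quotient/remainder pairs of each field divided by a
    prime (irreducible) polynomial; field_size and prime_poly are
    illustrative choices."""
    name = []
    for i in range(0, len(candidate), field_size):
        field = int.from_bytes(candidate[i:i + field_size], "big")
        name.append(gf2_divmod(field, prime_poly))
    return tuple(name)

# Units that share content share a name, so the name can index a store.
store = {}                      # name -> list of basic data units with that name
unit = b"exampledata"
store.setdefault(name_of(unit), []).append(unit)
assert unit in store[name_of(b"exampledata")]
# Sanity check of the division: (x^2 + x) = (x + 1) * x, remainder 0.
assert gf2_divmod(0b110, 0b11) == (0b10, 0)
```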

Some embodiments can factorize input data into a sequence of candidate units. Next, for each candidate unit, the embodiments can (1) identify a set of basic data units by using the candidate unit to perform a content-associative lookup on a data structure that organizes basic data units based on their contents, and (2) losslessly reduce the candidate unit by using the set of basic data units. The embodiments can then store the losslessly reduced candidate units in a set of distilled files. Next, the embodiments can store the basic data units in a set of basic data unit files. In some embodiments, each losslessly reduced candidate unit specifies, for each basic data unit used to reduce the candidate unit, the basic data unit file that contains the basic data unit and the offset at which the basic data unit can be found within that basic data unit file. In some embodiments, each distilled file stores a list of the basic data unit files that contain the basic data units used to losslessly reduce the candidate units stored in the distilled file.
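A toy sketch of this storage layout (the file names, tuple formats, and helper functions are all hypothetical illustrations, not the apparatus's actual on-disk encoding):

```python
# Basic data units live in flat "basic data unit files"; each losslessly
# reduced element in a "distilled file" names the unit's file and the offset
# at which the unit starts.
pde_files = {"pde_0": bytearray()}      # file name -> concatenated units

def install(unit: bytes, fname="pde_0"):
    """Append a unit to a basic data unit file; return (file, offset, length)."""
    f = pde_files[fname]
    offset = len(f)
    f.extend(unit)
    return (fname, offset, len(unit))

def fetch(fname, offset, length):
    """Fetch a basic data unit back by its (file, offset, length) locator."""
    return bytes(pde_files[fname][offset:offset + length])

loc_a = install(b"alpha")
loc_b = install(b"beta")
distilled_file = {
    "elements": [("ref", loc_a), ("ref", loc_b), ("ref", loc_a)],
    "pde_files_used": ["pde_0"],        # the per-distilled-file list of unit files
}

assert fetch(*loc_a) == b"alpha"
assert fetch(*loc_b) == b"beta"
```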

Using a set of basic data units to losslessly reduce a data unit (e.g., an I-frame, an audio component, a candidate unit, etc.) can include: (1) in response to determining that the sum of (i) the sizes of the references to the set of basic data units and (ii) the size of a description of a reconstruction procedure is less than a threshold fraction of the size of the data unit, generating a first losslessly reduced representation of the data unit, wherein the first losslessly reduced representation includes a reference to each basic data unit in the set of basic data units and the description of the reconstruction procedure; and (2) in response to determining that the sum of (i) the sizes of the references to the set of basic data units and (ii) the size of the description of the reconstruction procedure is greater than or equal to the threshold fraction of the size of the data unit, adding the data unit as a new basic data unit in the data structure, and generating a second losslessly reduced representation of the data unit, wherein the second losslessly reduced representation includes a reference to the new basic data unit. Note that the description of the reconstruction procedure can specify a sequence of transformations that, when applied to the set of basic data units (i.e., the one or more basic data units used to losslessly reduce the data unit), produces the data unit.
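The decision rule between the two representations can be sketched as follows (the reference size, the threshold fraction, and the JSON encoding used to size the reconstruction procedure are illustrative assumptions):

```python
import json

REF_SIZE = 8                 # assumed size of one reference, in bytes
THRESHOLD_FRACTION = 0.5     # assumed: derivation must cost < half the element

def encode(element: bytes, refs: list, program: list, store: list):
    """Pick between the two losslessly reduced representations described above."""
    program_size = len(json.dumps(program).encode())
    cost = REF_SIZE * len(refs) + program_size
    if cost < THRESHOLD_FRACTION * len(element):
        return ("derivative", refs, program)        # first representation
    store.append(element)                           # install as a new basic unit
    return ("prime", [len(store) - 1])              # second representation

store = []
# A long element whose derivation (one reference plus a tiny program) is cheap:
kind, *_ = encode(b"x" * 100, refs=[0],
                  program=[["replace", 3, "y"]], store=store)
assert kind == "derivative"
# A short element: even one reference plus an empty program is not worth it.
kind, *_ = encode(b"short", refs=[0], program=[], store=store)
assert kind == "prime" and store == [b"short"]
```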

Brief Description of the Drawings

Figure 1A illustrates methods and apparatuses for data reduction that factorize input data into units and derive these units from basic data units residing in a basic data filter, in accordance with some embodiments described herein.

Figures 1B-1G illustrate various variations of the methods and apparatuses shown in Figure 1A, in accordance with some embodiments described herein.

Figure 1H presents an example of a format and a specification describing the structure of the Distilled Data, in accordance with some embodiments described herein.

Figures 1I through 1P illustrate the conceptual transformation of input data into its losslessly reduced form, corresponding to the various variations of the methods and apparatuses for data reduction shown in Figures 1A through 1G.

Figure 2 illustrates a process for data reduction that factorizes input data into units and derives these units from basic data units residing in a basic data filter, in accordance with some embodiments described herein.

Figures 3A, 3B, 3C, 3D, and 3E illustrate different data organization systems that can be used to organize basic data units based on their names, in accordance with some embodiments described herein.

Figure 3F presents a self-describing tree node data structure, in accordance with some embodiments described herein.

Figure 3G presents a self-describing leaf node data structure, in accordance with some embodiments described herein.

Figure 3H presents a self-describing leaf node data structure that includes a Navigation Lookahead field, in accordance with some embodiments described herein.

Figure 4 shows an example of how 256 TB of basic data may be organized in tree form, and presents how the tree can be laid out in memory and storage, in accordance with some embodiments described herein.

Figures 5A-5C illustrate an actual example of how data can be organized using embodiments described herein.

Figures 6A-6C illustrate how tree data structures can be used for the content-associative mappers described in reference to Figures 1A-1C, respectively, in accordance with some embodiments described herein.

Figure 7A provides an example of the transformations that could be specified in a reconstruction procedure, in accordance with some embodiments described herein.

Figure 7B shows examples of the results of deriving candidate units from basic data units, in accordance with some embodiments described herein.

Figures 8A-8E illustrate how data reduction can be performed by factorizing input data into fixed-sized units and organizing the units in the tree data structures described in reference to Figures 3D and 3E, in accordance with some embodiments described herein.

Figures 9A-9C illustrate an example of the Data Distillation™ scheme based on the system shown in Figure 1C, in accordance with some embodiments described herein.

Figure 10A provides an example of how the transformations specified in a reconstruction procedure are applied to a basic data unit to yield a Derivative Element, in accordance with some embodiments described herein.

Figures 10B-10C illustrate data retrieval processes, in accordance with some embodiments described herein.

Figures 11A-11G illustrate systems that include a Data Distillation™ mechanism (which can be implemented using software, hardware, or a combination thereof), in accordance with some embodiments described herein.

Figure 11H shows how the Data Distillation™ apparatus can interface with a sample general-purpose computing platform, in accordance with some embodiments described herein.

Figure 11I illustrates how the Data Distillation™ apparatus can be used for data reduction in a block processing storage system.

Figures 12A-12B illustrate the use of the Data Distillation™ apparatus for the communication of data across a bandwidth-constrained communication medium, in accordance with some embodiments described herein.

Figures 12C-12K illustrate the various components of the reduced data generated by the Data Distillation™ apparatus for various usage models, in accordance with some embodiments described herein.

Figures 12L-12R illustrate how the distillation process can be deployed and executed on distributed systems to accommodate significantly larger datasets at very high ingest rates, in accordance with some embodiments described herein.

Figures 13-17 illustrate how multidimensional search and data retrieval can be performed on the reduced data, in accordance with some embodiments described herein.

Figures 18A-18B present block diagrams of an encoder and a decoder for the compression and decompression of audio data in accordance with the MPEG-1, Layer 3 standard (also referred to as MP3).

Figure 18C shows how the Data Distillation apparatus first shown in Figure 1A can be enhanced to perform data reduction on MP3 data.

Figure 19 shows how the data distillation apparatus first shown in Figure 1A can be enhanced to perform data reduction on video data.

Detailed Description

The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein. In this disclosure, when a phrase uses the term "and/or" with a set of entities, the phrase covers all possible combinations of the set of entities, unless otherwise stated. For example, the phrase "X, Y, and/or Z" covers the following combinations: "X only," "Y only," "Z only," "X and Y, but not Z," "X and Z, but not Y," "Y and Z, but not X," and "X, Y, and Z."

Efficient Lossless Reduction of Data Using a Basic Data Filter

In some embodiments described herein, data is organized and stored so as to efficiently discover and exploit redundancy globally across the entire dataset. An input data stream is broken up into constituent pieces or chunks called units, and redundancy among the units is detected and exploited at a grain finer than the unit itself, thereby reducing the overall footprint of the stored data. A set of units, referred to as basic data units, is identified and used as common shared building blocks for the dataset, and is stored in a structure referred to as the basic data store or basic data filter. A basic data unit is simply a sequence of bits, bytes, or digits of a certain size. Basic data units can be either fixed-sized or variable-sized, depending upon the implementation. The other constituent units of the input data are derived from basic data units and are referred to as derivative units. Thus, the input data is factorized into basic data units and derivative units.

The basic data filter orders and organizes the basic data units so that the filter can be searched and accessed in a content-associative manner. Given some input content, with some restrictions, the basic data filter can be queried to retrieve basic data units containing that content. Given an input unit, the basic data filter can be searched, using the value of the unit or the values of certain fields in the unit, to quickly provide one basic data unit or a small set of basic data units from which the input unit can be derived with minimal storage needed to specify the derivation. In some embodiments, the units in the basic data filter are organized in tree form. A derivative unit is derived from basic data units by performing transformations on them; such transformations are specified in a reconstruction procedure, which describes how to generate the derivative unit from one or more basic data units. A distance threshold specifies a limit on the size of the stored footprint of a derivative unit. This threshold effectively specifies the maximum allowable distance of a derivative unit from a basic data unit, and also places a limit on the size of the reconstruction procedure that can be used to generate the derivative unit.

Retrieval of derivative data is accomplished by executing the reconstruction procedure on the one or more basic data units specified by the derivation.

In this disclosure, the universal lossless data reduction technique described above may be referred to as the Data Distillation™ process. The process performs a function similar to that of distillation in chemistry: separating a mixture into its constituent units. The basic data filter is also referred to as the filter, or the Data Distillation™ filter, or the basic data store.

In this scheme, the input data stream is factorized into a sequence of units, each unit being either a basic data unit or a derivative unit that derives from one or more basic data units. Each unit is transformed into a losslessly reduced representation, which, in the case of a basic data unit, includes a reference to the basic data unit, and, in the case of a derivative unit, includes references to the one or more basic data units involved in the derivation, along with a description of the reconstruction procedure. Thus, the input data stream is factorized into a sequence of units that are in the losslessly reduced representation. This sequence of units (appearing in the losslessly reduced representation) is referred to as the distilled data stream or distilled data. The sequence of units in the distilled data has a one-to-one correspondence with the sequence of units in the input data, i.e., the nth unit in the sequence of units in the distilled data corresponds to the nth unit in the sequence of units in the input data.

The universal lossless data reduction technique described in this disclosure receives an input data stream and converts it into the combination of a distilled data stream and a basic data filter, such that the sum of the footprints of the distilled data stream and the basic data filter is usually smaller than the footprint of the input data stream. In this disclosure, the distilled data stream and the basic data filter are collectively called the losslessly reduced data, and will also be referred to interchangeably as the "reduced data stream" or "reduced data" or "Reduced Data." Likewise, for the sequence of units that is produced by the lossless data reduction techniques described in this disclosure and that appears in the losslessly reduced format, the following terms are used interchangeably: "reduced output data stream," "reduced output data," "distilled data stream," "distilled data," and "Distilled Data."

Figure 1A illustrates methods and apparatuses for data reduction that factorize input data into units and derive these units from basic data units residing in a basic data filter, in accordance with some embodiments described herein. This figure presents an overall block diagram of the data reduction or Data Distillation™ methods and apparatuses, and provides an overview of the functional components, structures, and operations. The components and/or operations illustrated in Figure 1A may be realized using software, hardware, or a combination thereof.

A sequence of bytes is received from the input data stream and presented as input data 102 to the data reduction apparatus 103, also referred to as the Data Distillation™ apparatus. Parser and factorizer 104 parses the incoming data and breaks it into chunks or candidate units. The factorizer decides where in the input stream to insert breaks so as to slice up the stream into candidate units. Once two consecutive breaks in the data have been identified, a candidate unit 105 is created by the parser and factorizer and presented to the basic data filter 106, also referred to as the Data Distillation™ filter.

The Data Distillation™ filter or basic data filter 106 contains all the basic data units (labeled PDE in Figure 1A) and sorts and organizes them based upon their value or content. The filter provides support for two kinds of access. First, each basic data unit can be directly accessed via a reference to the location where the basic data unit resides in the filter. Second, units can be accessed in a content-associative manner by using content-associative mapper 121, which can be implemented in software, hardware, or a combination thereof. This second form of access to the filter is an important feature that is used by the disclosed embodiments either to identify a basic data unit that exactly matches candidate unit 105, or to identify basic data units from which the candidate unit can be derived. Specifically, given a candidate unit (e.g., candidate unit 105), the basic data filter 106 can be searched (based upon the value of candidate unit 105, or upon the value of certain fields in candidate unit 105) to quickly provide one basic data unit 107, or a small set of basic data units 107, from which the candidate unit can be derived with minimal storage needed to specify the derivation.
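A toy stand-in for the basic data filter 106 and its content-associative mapper 121 is sketched below (here the "name" under which units are grouped is simply the leading four bytes of the content; the actual apparatus organizes units in a tree based on many fields of the content, and all names in this sketch are hypothetical):

```python
class ToyFilter:
    """Toy content-associative store: units are grouped under a name
    derived from their content (here, the leading 4 bytes)."""
    def __init__(self):
        self.by_name = {}          # name -> list of basic data units

    def install(self, unit: bytes):
        self.by_name.setdefault(unit[:4], []).append(unit)

    def lookup(self, candidate: bytes):
        """Return (exact_match_or_None, candidates suitable for derivation)."""
        bucket = self.by_name.get(candidate[:4], [])
        exact = candidate if candidate in bucket else None
        return exact, bucket

f = ToyFilter()
f.install(b"headline-v1")
f.install(b"headline-v2")

exact, candidates = f.lookup(b"headline-v1")     # a duplicate is found
assert exact == b"headline-v1"
exact, candidates = f.lookup(b"headline-v3")     # no duplicate, but the bucket
assert exact is None and len(candidates) == 2    # offers bases for derivation
```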

The filter or basic data filter 106 can be initialized with a set of basic data units whose values are spread across the data space. Alternatively, the filter can start out empty, and basic data units can be added to it dynamically as data is ingested, in accordance with the Data Distillation™ process described herein in reference to Figures 1A-1C and Figure 2.

导出器110接收候选单元105以及所取回的适合于导出的基本数据单元107(从基本数据滤筛106相关联地取回的内容),确定是否可以从这些基本数据单元当中的一个或多个导出候选单元105,生成简化数据分量115(由对相关的基本数据单元的引用和重建程序构成),并且向基本数据滤筛提供更新114。如果候选单元是所取回的基本数据单元的重复,则导出器在蒸馏数据108中放入对位于基本数据滤筛中的基本数据单元的引用(或指针),并且还有表明这是基本数据单元的指示。如果没有找到重复,则导出器把候选单元表达成在一个或多个所取回的基本数据单元上实施的一项或多项变换的结果,其中所述变换序列被统称作重建程序,例如重建程序119A。每一项导出可能需要由导出器构造该项导出自身所独有的程序。重建程序规定可以对基本数据单元应用的例如插入、删除、替换、串联、算术以及逻辑运算之类的变换。如果导出单元的足迹(被计算成重建程序的大小加上针对所需的基本数据单元的引用的大小)处在关于候选单元的特定的指定距离阈值之内(以便允许数据简化),则把候选单元改订成导出单元并且由重建程序与对(多个)相关基本数据单元的引用的组合替换——这些形成本例中的简化数据分量115。如果超出所述阈值,或者如果没有从基本数据滤筛取回适当的基本数据单元,则可以指示基本数据滤筛把所述候选安装成新鲜基本数据单元。在这种情况下,导出器在蒸馏数据中放入对新添加的基本数据单元的引用,并且还有表明这是基本数据单元的指示。The exporter 110 receives the candidate unit 105 and the retrieved basic data units 107 that are suitable for derivation (content retrieved associatively from the basic data filter 106), determines whether the candidate unit 105 can be derived from one or more of these basic data units, generates the simplified data component 115 (composed of references to the relevant basic data units and a reconstruction procedure), and provides updates 114 to the basic data filter. If the candidate unit is a duplicate of a retrieved basic data unit, the exporter places into the distillation data 108 a reference (or pointer) to the basic data unit located in the basic data filter, together with an indication that this is a basic data unit. If no duplicate is found, the exporter expresses the candidate unit as the result of one or more transformations performed on one or more retrieved basic data units, where the sequence of transformations is collectively referred to as a reconstruction procedure, e.g., reconstruction procedure 119A. Each derivation may require the exporter to construct a procedure unique to that derivation. The reconstruction procedure specifies transformations, such as insertion, deletion, substitution, concatenation, arithmetic, and logical operations, that can be applied to the basic data units. If the footprint of the derived unit (computed as the size of the reconstruction procedure plus the size of the references to the required basic data units) is within a certain specified distance threshold relative to the candidate unit (so as to permit data simplification), the candidate unit is reformulated into a derived unit and replaced by the combination of the reconstruction procedure and the references to the relevant basic data unit(s); these form the simplified data component 115 in this example. If the threshold is exceeded, or if no suitable basic data unit was retrieved from the basic data filter, the basic data filter can be instructed to install the candidate as a fresh basic data unit. In that case, the exporter places into the distillation data a reference to the newly added basic data unit, together with an indication that this is a basic data unit.
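The exporter's choice among the three outcomes described above (a duplicate, a derivation within the distance threshold, or installation of a fresh basic data unit) can be made concrete with a minimal illustrative sketch. This is not the specified implementation: the function names, the substitution-only reconstruction procedure, the 8-byte reference size, and the 50% threshold are all assumptions made for illustration.

```python
# Illustrative sketch only: a toy reconstruction procedure consisting of
# (offset, new_byte) substitutions, and the exporter's three-way decision.
# All names and cost constants are assumptions, not from the specification.

def make_recon_procedure(prime: bytes, candidate: bytes):
    """Return a substitution-only reconstruction procedure that transforms
    `prime` into `candidate`, or None if the lengths differ."""
    if len(prime) != len(candidate):
        return None
    return [(i, candidate[i]) for i in range(len(prime)) if prime[i] != candidate[i]]

def derive(candidate: bytes, retrieved: dict, ref_size: int = 8,
           threshold_ratio: float = 0.5):
    """`retrieved` maps a reference to a basic data unit returned by the
    content-associative lookup. Returns the entry to place in the
    distillation data."""
    # Duplicate check: candidate identical to a retrieved basic data unit.
    for ref, prime in retrieved.items():
        if prime == candidate:
            return ("prime_ref", ref)
    # Otherwise attempt a derivation and keep the one with the smallest
    # footprint (reference size plus reconstruction-procedure size).
    best = None
    for ref, prime in retrieved.items():
        rp = make_recon_procedure(prime, candidate)
        if rp is None:
            continue
        footprint = ref_size + 2 * len(rp)  # assume 2 bytes per substitution
        if best is None or footprint < best[0]:
            best = (footprint, ref, rp)
    if best is not None and best[0] <= threshold_ratio * len(candidate):
        return ("derivative", best[1], best[2])  # simplified data component
    # Footprint exceeds the distance threshold: install a fresh basic data unit.
    return ("install_new_prime", candidate)
```

Under these assumptions, a 32-byte candidate differing from a retrieved basic data unit in two bytes has a footprint of 8 + 4 = 12 bytes, below the 16-byte threshold, so the derivation is accepted, while a candidate differing everywhere is installed as a fresh basic data unit.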

对数据取回的请求(例如取回请求109)可以采取对基本数据滤筛中的包含基本数据单元的位置的引用的形式,或者在导出项的情况下可以采取对基本数据单元的此类引用与相关联的重建程序的组合的形式(或者在基于多个基本数据单元的导出项的情况下是对多个基本数据单元的引用与相关联的重建程序的组合)。通过使用对基本数据滤筛中的基本数据单元的一项或多项引用,取回器111可以对基本数据滤筛进行存取以便取回一个或多个基本数据单元,并且把所述一个或多个基本数据单元以及重建程序提供到重建器112,所述重建器112在所述一个或多个基本数据单元上执行(在重建程序中规定的)变换以便生成重建数据116(也就是所请求的数据),并且响应于数据取回请求将其递送到取回数据输出113。A request for data retrieval (e.g., retrieval request 109) can take the form of a reference to the location in the basic data filter that contains a basic data unit, or, in the case of a derivative, the form of a combination of such a reference to a basic data unit and the associated reconstruction procedure (or, in the case of a derivative based on multiple basic data units, a combination of references to multiple basic data units and the associated reconstruction procedure). Using the one or more references to basic data units in the basic data filter, the retriever 111 can access the basic data filter to fetch the one or more basic data units and provide the one or more basic data units along with the reconstruction procedure to the reconstructor 112, which executes the transformations (specified in the reconstruction procedure) on the one or more basic data units to generate the reconstructed data 116 (that is, the requested data), and delivers it to the retrieved-data output 113 in response to the data retrieval request.
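Under the same illustrative assumptions (a substitution-only reconstruction procedure encoded as (offset, byte) pairs, and hypothetical entry and function names not taken from the specification), the retriever-and-reconstructor path described above can be sketched as:

```python
# Illustrative sketch only: resolve a distillation-data entry back into the
# requested data. `sieve` stands in for the basic data filter, mapping each
# reference to the bytes of a basic data unit.

def reconstruct(entry, sieve: dict) -> bytes:
    kind = entry[0]
    if kind == "prime_ref":
        # The entry is a bare reference: the basic data unit itself is the data.
        return sieve[entry[1]]
    if kind == "derivative":
        # Fetch the basic data unit, then execute the transformations
        # specified in the reconstruction procedure.
        data = bytearray(sieve[entry[1]])
        for offset, byte in entry[2]:
            data[offset] = byte
        return bytes(data)
    raise ValueError("unknown entry type: %r" % (kind,))
```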

在该实施例的一种变型中,基本数据单元可以通过压缩形式(使用本领域内已知的技术,包括Huffman编码和Lempel Ziv方法)被存储在滤筛中,并且在需要时被解压缩。这样做的优点是简化了基本数据滤筛的总体足迹。唯一的约束在于内容关联映射器121必须像以前一样继续提供对于基本数据单元的内容关联存取。In a variation of this embodiment, the basic data units can be stored in the filter in compressed form (using techniques known in the art, including Huffman coding and Lempel-Ziv methods) and decompressed when needed. The advantage of doing so is that it reduces the overall footprint of the basic data filter. The only constraint is that the content-associative mapper 121 must continue to provide content-associative access to the basic data units as before.

图1B和1C示出了根据这里所描述的一些实施例的图1A中所示的方法和装置的变型。在图1B中,重建程序可以被存储在基本数据滤筛中,并且像基本数据单元那样被对待。对重建程序的引用或指针119B被提供在蒸馏数据108中,而不是提供重建程序119A本身。如果重建程序由其他导出项共享,并且如果对重建程序的引用或指针(加上在重建程序与对重建程序的引用之间作出区分所需的任何元数据)所需的存储空间小于重建程序本身,则实现进一步的数据简化。Figures 1B and 1C illustrate variations of the method and apparatus shown in Figure 1A according to some embodiments described herein. In Figure 1B, the reconstruction procedure can be stored in the basic data filter and treated as a basic data unit. A reference or pointer 119B to the reconstruction procedure is provided in the distillation data 108, rather than providing the reconstruction procedure 119A itself. Further data simplification is achieved if the reconstruction procedure is shared by other derived items, and if the storage space required for the reference or pointer to the reconstruction procedure (plus any metadata needed to distinguish between the reconstruction procedure and the reference to the reconstruction procedure) is less than that required for the reconstruction procedure itself.

在图1B中,重建程序可以像基本数据单元那样被对待和存取,并且作为基本数据单元被存储在基本数据滤筛中,从而允许从基本数据滤筛对重建程序进行内容关联搜索和取回。在用以创建导出单元的导出处理期间,一旦导出器110确定对于导出所需要的重建程序,其随后可以确定该候选重建程序是否已经存在于基本数据滤筛中,或者确定是否可以从已经存在于基本数据滤筛中的另一个条目导出该候选重建程序。如果候选重建程序已经存在于基本数据滤筛中,则导出器110可以确定对所述预先存在的条目的引用,并且把所述引用包括在蒸馏数据108中。如果可以从已经驻留在基本数据滤筛中的现有条目导出候选重建程序,则导出器可以把候选重建程序的导出项或改订递送到蒸馏数据,也就是说导出器在蒸馏数据中放入对预先存在于基本数据滤筛中的条目的引用连同从所述预先存在的条目导出候选重建程序的增量重建程序。如果候选重建程序既不存在于基本数据滤筛中也无法从基本数据滤筛中的条目导出,则导出器110可以把重建程序添加到基本数据滤筛中(把重建程序添加到滤筛的操作可以返回对新添加的条目的引用),并且把对重建程序的引用包括在蒸馏数据108中。In Figure 1B, reconstruction procedures can be treated and accessed like basic data units and stored as basic data units in the basic data filter, thereby allowing content-associative search and retrieval of reconstruction procedures from the basic data filter. During the derivation process used to create a derived unit, once the exporter 110 determines the reconstruction procedure needed for the derivation, it can then determine whether this candidate reconstruction procedure already exists in the basic data filter, or whether this candidate reconstruction procedure can be derived from another entry already present in the basic data filter. If the candidate reconstruction procedure already exists in the basic data filter, the exporter 110 can determine a reference to the pre-existing entry and include that reference in the distillation data 108. If the candidate reconstruction procedure can be derived from an existing entry already residing in the basic data filter, the exporter can deliver a derivation or reformulation of the candidate reconstruction procedure into the distillation data; that is, the exporter places into the distillation data a reference to the pre-existing entry in the basic data filter together with an incremental reconstruction procedure that derives the candidate reconstruction procedure from that pre-existing entry. If the candidate reconstruction procedure neither exists in the basic data filter nor can be derived from an entry in the basic data filter, the exporter 110 can add the reconstruction procedure to the basic data filter (the operation of adding a reconstruction procedure to the filter can return a reference to the newly added entry) and include the reference to the reconstruction procedure in the distillation data 108.

图1C给出了根据这里所描述的一些实施例的图1B中所示的方法和装置的一种变型。具体来说,图1C中的被用来存储和查询重建程序的机制类似于被用来存储和查询基本数据单元的机制,但是重建程序被保持在与包含基本数据单元的结构分开的结构(称为基本重建程序滤筛)中。这样的结构中的条目被称作基本重建程序(在图1C中被标记为PRP)。回想到基本数据滤筛106包括支持快速内容关联查找操作的内容关联映射器121。图1C中示出的实施例包括类似于内容关联映射器121的内容关联映射器122。在图1C中,内容关联映射器122和内容关联映射器121被显示成基本数据滤筛或基本数据存储库106的一部分。在其他实施例中,内容关联映射器122和重建程序可以在称为基本重建程序滤筛的结构中与基本数据滤筛或基本数据存储库106分开存储。Figure 1C illustrates a variation of the method and apparatus shown in Figure 1B according to some embodiments described herein. Specifically, the mechanism used to store and retrieve the reconstruction program in Figure 1C is similar to the mechanism used to store and retrieve basic data units, but the reconstruction program is maintained in a separate structure (referred to as a basic reconstruction program filter) from the structure containing the basic data units. Entries in such a structure are called basic reconstruction programs (labeled PRP in Figure 1C). Recall that the basic data filter 106 includes a content association mapper 121 that supports fast content association lookup operations. The embodiment shown in Figure 1C includes a content association mapper 122 similar to the content association mapper 121. In Figure 1C, the content association mapper 122 and the content association mapper 121 are shown as part of the basic data filter or basic data repository 106. In other embodiments, the content association mapper 122 and the reconstruction program may be stored separately from the basic data filter or basic data repository 106 in a structure called the basic reconstruction program filter.

在该实施例的一种变型中,基本数据单元可以通过压缩形式(使用本领域内已知的技术,包括Huffman编码和Lempel Ziv方法)被存储在滤筛中,并且在需要时被解压缩。同样地,基本重建程序可以通过压缩形式(使用本领域内已知的技术,包括Huffman编码和Lempel Ziv方法)被存储在基本重建程序滤筛中,并且在需要时被解压缩。这样做的优点是减缩了基本数据滤筛和基本重建程序滤筛的总体足迹。唯一的约束在于内容关联映射器121和122必须像以前一样继续提供对于基本数据单元和基本重建程序的内容关联存取。In a variation of this embodiment, basic data units can be stored in a filter in a compressed form (using techniques known in the art, including Huffman coding and the Lempel Ziv method) and decompressed when needed. Similarly, basic reconstruction procedures can be stored in a basic reconstruction procedure filter in a compressed form (using techniques known in the art, including Huffman coding and the Lempel Ziv method) and decompressed when needed. The advantage of this is that it reduces the overall footprint of the basic data filter and the basic reconstruction procedure filter. The only constraint is that content association mappers 121 and 122 must continue to provide content association access to the basic data units and the basic reconstruction procedures as before.
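As an illustration of this compressed-storage variation, the sketch below keeps each stored item in DEFLATE form (DEFLATE combines Lempel-Ziv parsing with Huffman coding, the two techniques named above) and decompresses only on access. The class and method names are assumptions, and the content-associative mappers themselves are not modeled here.

```python
import zlib

# Illustrative sketch only: a store that holds basic data units (or basic
# reconstruction procedures) in compressed form and decompresses on demand.

class CompressedStore:
    def __init__(self):
        self._blobs = {}

    def install(self, ref, item: bytes):
        # Kept in DEFLATE form to reduce the overall footprint of the filter.
        self._blobs[ref] = zlib.compress(item)

    def fetch(self, ref) -> bytes:
        # Decompressed only when the item is actually needed.
        return zlib.decompress(self._blobs[ref])

    def stored_size(self, ref) -> int:
        return len(self._blobs[ref])
```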

图1D给出了根据这里所描述的一些实施例的图1A中所示的方法和装置的一种变型。具体来说,在图1D所描述的实施例中,基本数据单元被内联存储在蒸馏数据中。基本数据滤筛或基本数据存储库106继续提供对于基本数据单元的内容关联存取,并且继续在逻辑上包含基本数据单元。其保持对内联位于蒸馏数据中的基本数据单元的引用或链接。例如在图1D中,基本数据单元130内联位于蒸馏数据108中。基本数据滤筛或基本数据存储库106保持对基本数据单元130的引用131。同样地,在这种设置中,导出单元的无损简化表示将包含对所需的基本数据单元的引用。在数据取回期间,取回器111将从所需的基本数据单元所处的位置获取所述基本数据单元。Figure 1D illustrates a variation of the method and apparatus shown in Figure 1A according to some embodiments described herein. Specifically, in the embodiment described in Figure 1D, basic data units are stored inline within the distillation data. A basic data filter or basic data repository 106 continues to provide content-associated access to the basic data units and continues to logically contain them. It maintains references or links to the basic data units inline within the distillation data. For example, in Figure 1D, basic data unit 130 is inline within distillation data 108. The basic data filter or basic data repository 106 maintains a reference 131 to basic data unit 130. Similarly, in this setup, a lossless simplified representation of the derived unit will contain references to the desired basic data unit. During data retrieval, the retriever 111 retrieves the basic data unit from the location where the desired basic data unit is located.

图1E给出了根据这里所描述的一些实施例的图1D中所示的方法和装置的一种变型。具体来说,在图1E所描述的实施例中,与图1B中所示出的设置一样,重建程序可以从其他基本重建程序导出,并且被规定为增量重建程序加上对基本重建程序的引用。这样的基本重建程序像基本数据单元一样被对待,并且在逻辑上被安装在基本数据滤筛中。此外,在这种设置中,基本数据单元和基本重建程序都被内联存储在蒸馏数据中。基本数据滤筛或基本数据存储库106继续提供对于基本数据单元和基本重建程序的内容关联存取,并且继续在逻辑上包含这些基本数据单元和基本重建程序,同时保持对这些基本数据单元和基本重建程序内联位于蒸馏数据中的位置的引用或链接。例如在图1E中,基本数据单元130内联位于蒸馏数据108中。同样地在图1E中,基本重建程序132内联位于蒸馏数据中。基本数据滤筛或基本数据存储库106保持对基本数据单元130(即PDE_i)的引用131(即Reference_to_PDE_i),以及对基本重建程序132(即Prime_Recon_Program_l)的引用133(即Reference_to_PDEj)。同样地,在这种设置中,导出单元的无损简化表示将包含对所需的基本数据单元和所需的基本重建程序的引用。在数据取回期间,取回器111将从所需的分量在相应的蒸馏数据中所处的位置获取所述分量。Figure 1E illustrates a variation of the method and apparatus shown in Figure 1D according to some embodiments described herein. Specifically, in the embodiment described in Figure 1E, as shown in Figure 1B, the reconstruction procedure can be derived from other basic reconstruction procedures and is defined as an incremental reconstruction procedure plus a reference to the basic reconstruction procedure. Such basic reconstruction procedures are treated like basic data units and are logically installed in the basic data filter. Furthermore, in this configuration, both the basic data units and the basic reconstruction procedures are stored inline in the distillation data. The basic data filter or basic data repository 106 continues to provide content-associative access to the basic data units and basic reconstruction procedures and continues to logically contain these basic data units and basic reconstruction procedures while maintaining references or links to the locations of these basic data units and basic reconstruction procedures inline in the distillation data. For example, in Figure 1E, basic data unit 130 is inline in distillation data 108. Similarly, in Figure 1E, basic reconstruction procedure 132 is inline in distillation data. 
The basic data filter or basic data repository 106 maintains a reference 131 (i.e., Reference_to_PDE_i) to the basic data unit 130 (i.e., PDE_i) and a reference 133 (i.e., Reference_to_PDEj) to the basic reconstruction procedure 132 (i.e., Prime_Recon_Program_l). Similarly, in this setup, the lossless simplified representation of the derived unit will contain references to the desired basic data unit and the desired basic reconstruction procedure. During data retrieval, the retriever 111 obtains the desired component from its position within the corresponding distillation data.

图1F给出了根据这里所描述的一些实施例的图1E中所示的方法和装置的一种变型。具体来说,在图1F所描述的实施例中,与图1C中所示出的设置一样,基本数据滤筛106包含分开的映射器——用于基本数据单元的内容关联映射器121和用于基本重建程序的内容关联映射器122。Figure 1F shows a variation of the method and apparatus shown in Figure 1E according to some embodiments described herein. Specifically, in the embodiment described in Figure 1F, as in the arrangement shown in Figure 1C, the basic data filter 106 contains separate mappers: a content-associative mapper 121 for basic data units and a content-associative mapper 122 for basic reconstruction procedures.

图1G给出了图1A到1F中所示的方法和装置的一种更加一般化的变型。具体来说,在图1G所描述的实施例中,基本数据单元可以位于基本数据滤筛中或者内联位于蒸馏数据中。一些基本数据单元可以位于基本数据滤筛中,其他的基本数据单元则内联位于蒸馏数据中。同样地,基本重建程序可以位于基本数据滤筛中或者内联位于蒸馏数据中。一些基本重建程序可以位于基本数据滤筛中,其他的基本重建程序则内联位于蒸馏数据中。基本数据滤筛在逻辑上包含所有基本数据单元和基本重建程序,并且在基本数据单元或基本重建程序内联位于蒸馏数据中的情况下,基本数据滤筛提供对其位置的引用。Figure 1G illustrates a more generalized variation of the method and apparatus shown in Figures 1A to 1F. Specifically, in the embodiment described in Figure 1G, the basic data units may be located in a basic data filter or inline within the distillation data. Some basic data units may be located in the basic data filter, while others are inline within the distillation data. Similarly, the basic reconstruction procedures may be located in a basic data filter or inline within the distillation data. Some basic reconstruction procedures may be located in the basic data filter, while others are inline within the distillation data. The basic data filter logically contains all the basic data units and basic reconstruction procedures, and when a basic data unit or basic reconstruction procedure is inline within the distillation data, the basic data filter provides a reference to its location.

前面对于把输入数据因式分解成各个单元并且从驻留在基本数据滤筛中的基本数据单元导出这些单元的用于数据简化的方法和装置的描述仅仅是出于说明和描述的目的而给出的。所述描述不意图进行穷举或者把本发明限制到所公开的形式。因此,本领域技术人员将会想到许多修改和变型。The foregoing description of methods and apparatus for data simplification, which factorize input data into individual units and derive these units from the basic data units residing in a basic data filter, is given for illustrative purposes only. The description is not intended to be exhaustive or to limit the invention to the disclosed forms. Therefore, many modifications and variations will occur to those skilled in the art.

图1H给出了根据这里所描述的一些实施例的描述用于Data DistillationTM处理的方法和装置的图1A-1G中的蒸馏数据119A的结构的格式和规范的一个实例。由于DataDistillationTM处理把输入数据因式分解成基本数据单元和导出单元,因此用于数据的无损简化表示的格式在蒸馏数据中标识这些单元并且描述这些单元的各个分量。自描述格式标识蒸馏数据中的每一个单元,表明其是基本数据单元还是导出单元,并且描述该单元的各个分量,也就是对安装在滤筛中的一个或多个基本数据单元的引用,对安装在基本数据滤筛中的重建程序的引用(如图1B的119B),或者对存储在基本重建程序(PRP)滤筛中的重建程序的引用(如图1C的119C),以及内联重建程序(RP)。基本重建程序(PRP)滤筛也被可互换地称作基本重建程序(PRP)存储库。图1H中的格式通过在多个基本数据单元上执行重建程序而规定导出,其中导出单元和每一个基本数据单元的大小是独立地可规定的。图1H中的格式还规定内联位于蒸馏数据中而不是位于基本数据滤筛内的基本数据单元。这是通过操作码编码7规定的,其规定单元的类型是内联位于蒸馏数据中的基本数据单元。蒸馏数据使用该格式被存储在数据存储系统中。该格式中的数据被数据取回器111消耗,从而可以获取并且随后重建数据的各个分量。Figure 1H shows an example of the format and specification of the structure of distillation data 119A in Figures 1A-1G, which describes the methods and apparatus for Data Distillation processing according to some embodiments described herein. Since Data Distillation processing factorizes input data into basic data units and derived units, a format for a lossless simplified representation of the data identifies these units in the distillation data and describes the individual components of these units. The self-describing format identifies each unit in the distillation data, indicating whether it is a basic data unit or a derived unit, and describes the individual components of that unit, namely, references to one or more basic data units mounted in a filter, references to reconstruction procedures mounted in a basic data filter (e.g., 119B in Figure 1B), or references to reconstruction procedures stored in a Basic Reconstruction Procedure (PRP) filter (e.g., 119C in Figure 1C), and inline reconstruction procedures (RPs). The Basic Reconstruction Procedure (PRP) filter is also interchangeably referred to as a Basic Reconstruction Procedure (PRP) repository. The format in Figure 1H specifies the derivation by performing reconstruction procedures on multiple basic data units, wherein the size of the derived units and each basic data unit is independently definable. 
The format in Figure 1H also specifies basic data units that are located inline in the distillation data rather than in the basic data filter. This is specified by opcode encoding 7, which specifies that the type of the unit is a basic data unit located inline in the distillation data. The distillation data is stored in the data storage system using this format. Data in this format is consumed by the retriever 111, which can then fetch and subsequently reconstruct the various components of the data.
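As a toy illustration of such a self-describing format, the sketch below leads every entry with an opcode byte. Only the use of 7 for an inline basic data unit follows the text above; the other opcode values, the field widths, and the substitution-only reconstruction procedure are invented for the example and do not reproduce the actual format of Figure 1H.

```python
import struct

# Illustrative sketch only: self-describing distillation-data entries, each
# led by an opcode identifying the kind of unit that follows.
OP_PRIME_REF = 1      # invented value: reference to a basic data unit
OP_DERIVATIVE = 2     # invented value: reference plus reconstruction procedure
OP_INLINE_PRIME = 7   # per the text: basic data unit inline in distillation data

def encode_entry(entry) -> bytes:
    kind = entry[0]
    if kind == "prime_ref":
        return struct.pack(">BQ", OP_PRIME_REF, entry[1])
    if kind == "inline_prime":
        return struct.pack(">BI", OP_INLINE_PRIME, len(entry[1])) + entry[1]
    if kind == "derivative":
        ref, procedure = entry[1], entry[2]
        out = struct.pack(">BQH", OP_DERIVATIVE, ref, len(procedure))
        for offset, byte in procedure:   # (offset, byte) substitutions
            out += struct.pack(">IB", offset, byte)
        return out
    raise ValueError(kind)

def decode_entry(buf: bytes):
    op = buf[0]
    if op == OP_PRIME_REF:
        return ("prime_ref", struct.unpack(">Q", buf[1:9])[0])
    if op == OP_INLINE_PRIME:
        n = struct.unpack(">I", buf[1:5])[0]
        return ("inline_prime", buf[5:5 + n])
    if op == OP_DERIVATIVE:
        ref, n = struct.unpack(">QH", buf[1:11])
        procedure = [struct.unpack(">IB", buf[11 + 5 * i:16 + 5 * i])
                     for i in range(n)]
        return ("derivative", ref, procedure)
    raise ValueError(op)
```

Because each entry carries its own opcode and sizes, a consumer such as the retriever can decode any entry without external bookkeeping, which is the point of a self-describing format.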

图1I到1P示出了对应于图1A到图1G中示出的用于数据简化的方法和装置的各种变型的输入数据到无损简化形式的概念性变换。图1I示出了输入数据流如何被因式分解成候选单元,并且随后候选单元被视为基本数据单元或导出单元。最后,数据被变换成无损简化形式。图1I到1N示出了对应于各个实施例的无损简化形式的各种变型。Figures 1I to 1P illustrate conceptual transformations of input data into lossless simplified forms, corresponding to various variations of the methods and apparatuses for data simplification shown in Figures 1A to 1G. Figure 1I shows how the input data stream is factored into candidate units, and then the candidate units are treated as basic data units or derived units. Finally, the data is transformed into a lossless simplified form. Figures 1I to 1N illustrate various variations of the lossless simplified forms corresponding to the various embodiments.

图1I和图1J示出了通过图1A中所示的方法和装置所产生的数据的无损简化形式的实例。图1I中的无损简化形式包括内容关联映射器,并且是允许连续的进一步数据摄取以及针对现有的基本数据单元简化该数据的形式,与此同时,图1J中的无损简化形式不再保留内容关联映射器,从而导致更小的数据足迹。图1K和图1L示出了通过图1C中所示的方法和装置所产生的数据的无损简化形式的实例。图1K中的无损简化形式包括内容关联映射器,并且是允许连续的进一步数据摄取以及针对现有的基本数据单元和基本重建程序简化该数据的形式,与此同时,图1L中的无损简化形式不再保留内容关联映射器,从而导致更小的数据足迹。Figures 1I and 1J illustrate examples of lossless simplified forms of data produced by the method and apparatus shown in Figure 1A. The lossless simplified form in Figure 1I includes a content association mapper and allows for continuous further data ingestion and simplification of the data for existing basic data units. Meanwhile, the lossless simplified form in Figure 1J no longer retains the content association mapper, resulting in a smaller data footprint. Figures 1K and 1L illustrate examples of lossless simplified forms of data produced by the method and apparatus shown in Figure 1C. The lossless simplified form in Figure 1K includes a content association mapper and allows for continuous further data ingestion and simplification of the data for existing basic data units and basic reconstruction procedures. Meanwhile, the lossless simplified form in Figure 1L no longer retains the content association mapper, resulting in a smaller data footprint.

图1M和图1N示出了通过图1F中所示的方法和装置所产生的数据的无损简化形式的实例,其中基本数据单元和基本重建程序内联位于蒸馏数据中。图1M中的无损简化形式包括内容关联映射器,并且是允许连续的进一步数据摄取以及针对现有的基本数据单元和基本重建程序简化该数据的形式,与此同时,图1N中的无损简化形式不再保留内容关联映射器,从而导致更小的数据足迹。图1O和图1P示出了通过图1G中所示的方法和装置所产生的数据的无损简化形式的实例,其中基本数据单元和基本重建程序可以内联位于蒸馏数据中或者位于基本数据滤筛中。图1O中的无损简化形式包括内容关联映射器,并且是允许连续的进一步数据摄取以及针对现有的基本数据单元和基本重建程序简化该数据的形式,与此同时,图1P中的无损简化形式不再保留内容关联映射器,从而导致更小的数据足迹。Figures 1M and 1N illustrate examples of lossless simplified forms of data produced by the method and apparatus shown in Figure 1F, where the basic data units and basic reconstruction procedures are inlined within the distilled data. The lossless simplified form in Figure 1M includes a content association mapper and is a form that allows for continuous further data ingestion and simplification of the data for existing basic data units and basic reconstruction procedures. Meanwhile, the lossless simplified form in Figure 1N no longer retains the content association mapper, resulting in a smaller data footprint. Figures 1O and 1P illustrate examples of lossless simplified forms of data produced by the method and apparatus shown in Figure 1G, where the basic data units and basic reconstruction procedures may be inlined within the distilled data or located within a basic data filter. The lossless simplified form in Figure 1O includes a content association mapper and is a form that allows for continuous further data ingestion and simplification of the data for existing basic data units and basic reconstruction procedures. Meanwhile, the lossless simplified form in Figure 1P no longer retains the content association mapper, resulting in a smaller data footprint.

在图1A到P所示出的实施例的变型中,简化数据的各个分量可以使用本领域内已知的技术(比如Huffman编码和Lempel Ziv方法)被进一步简化或压缩,并且通过该压缩形式被存储。这些分量可以随后在需要被使用在数据蒸馏装置中时被解压缩。这样做的好处是进一步简化了数据的总体足迹。In variations of the embodiments shown in Figures 1A through 1P, the various components of the simplified data can be further simplified or compressed using techniques known in the art (such as Huffman coding and Lempel-Ziv methods) and stored in that compressed form. These components can then be decompressed when they need to be used in the data distillation apparatus. The benefit of doing so is that the overall footprint of the data is further reduced.

图2示出了根据这里所描述的一些实施例的通过把输入数据因式分解成各个单元并且从驻留在基本数据滤筛中的基本数据单元导出这些单元而进行数据简化的处理。随着输入数据到达,其可以被解析和因式分解或者分解成一系列候选单元(操作202)。从输入消耗下一个候选单元(操作204),并且基于候选单元的内容对基本数据滤筛实施内容关联查找,以便查看是否存在可以从中导出候选单元的任何适当的单元(操作206)。如果基本数据滤筛没有找到任何这样的单元(操作208的“否”分支),则候选单元将作为新的基本数据单元被分配并且输入到滤筛中,并且在蒸馏数据中为候选单元创建的条目将是对新创建的基本数据单元的引用(操作216)。如果对基本数据滤筛的内容关联查找确实产生可以潜在地从中导出候选单元的一个或多个适当的单元(操作208的“是”分支),则在所取回的基本数据单元上实施分析和计算以便从中导出候选单元。应当提到的是,在一些实施例中,首先仅获取用于适当的基本数据单元的元数据并且在所述元数据上实施分析,并且只有在认为有用的情况下才随后获取适当的基本数据单元(在这些实施例中,用于基本数据单元的元数据提供关于基本数据单元的内容的一些信息,从而允许系统基于元数据快速地排除匹配或者评估可导出性)。在其他实施例中,基本数据滤筛直接取回基本数据单元(也就是说在取回基本数据单元之前并不首先取回元数据以便对元数据进行分析),从而在所取回的基本数据单元上实施分析和计算。Figure 2 illustrates a data simplification process according to some embodiments described herein, which involves factoring input data into individual units and deriving these units from basic data units residing in a basic data filter. As input data arrives, it can be parsed and factored or decomposed into a series of candidate units (operation 202). The next candidate unit is consumed from the input (operation 204), and a content-related lookup is performed on the basic data filter based on the content of the candidate unit to see if any suitable unit from which the candidate unit can be derived exists (operation 206). If the basic data filter does not find any such unit (the "No" branch of operation 208), the candidate unit is assigned as a new basic data unit and input into the filter, and the entry created for the candidate unit in the distilled data will be a reference to the newly created basic data unit (operation 216). If the content-related lookup on the basic data filter does indeed produce one or more suitable units from which candidate units can potentially be derived (the "Yes" branch of operation 208), analysis and computation are performed on the retrieved basic data units to derive the candidate unit. 
It should be mentioned that in some embodiments, metadata for appropriate basic data units is initially retrieved and analyzed on the metadata, and the appropriate basic data units are subsequently retrieved only if deemed useful (in these embodiments, the metadata for the basic data units provides some information about the content of the basic data units, thereby allowing the system to quickly exclude matches or evaluate derivability based on the metadata). In other embodiments, the basic data filter directly retrieves the basic data units (that is, metadata is not retrieved first for analysis before retrieving the basic data units), thereby performing analysis and computation on the retrieved basic data units.

实施第一检查以便查看候选是否是任何这些单元的重复(操作210)。可以使用任何适当的散列技术加速这一检查。如果候选与从基本数据滤筛取回的基本数据单元完全相同(操作210的"是"分支),则蒸馏数据中的为候选单元创建的条目由对该基本数据单元的引用以及表明该条目是基本数据单元的指示所替换(操作220)。如果没有找到重复(操作210的"否"分支),则基于候选单元从基本数据滤筛取回的条目被视为潜在地可以从中导出候选单元的条目。以下是基本数据滤筛的重要、新颖而且并非是显而易见的特征:当没有在基本数据滤筛中找到重复时,基本数据滤筛可以返回基本数据单元,所述基本数据单元虽然并非与候选单元完全相同,却是可以潜在地通过对(多个)基本数据单元应用一项或多项变换而导出候选单元的单元。所述处理随后可以实施分析和计算,以便从最适当的基本数据单元或者适当的基本数据单元的集合导出候选单元(操作212)。在一些实施例中,所述导出把候选单元表达成在一个或多个基本数据单元上实施的变换的结果,这样的变换被统称作重建程序。每一项导出可能需要构造其自身独有的程序。除了构造重建程序之外,所述处理还可以计算通常表明存储候选单元的改订以及从所述改订重建候选单元所需要的存储资源和/或计算资源的水平的距离量度。在一些实施例中,导出单元的足迹被用作从(多个)基本数据单元到候选的距离度量——具体来说,距离量度可以被定义成重建程序的大小加上对在导出中所涉及的一个或多个基本数据单元的引用的大小的总和。可以选择具有最短距离的导出。把对应于该导出的距离与距离阈值进行比较(操作214),如果该距离没有超出距离阈值,则接受该导出(操作214的"是"分支)。为了产生数据简化,所述距离阈值必须总是小于候选单元的大小。举例来说,距离阈值可以被设定到候选单元大小的50%,从而使得只有在导出项的足迹小于或等于候选单元足迹的一半时才接收导出项,从而对于为之存在适当导出的每一个候选单元确保2x或更大的简化。距离阈值可以是预定的百分比或比例,其或者是基于用户规定的输入或者是由系统选择。距离阈值可以由系统基于系统的静态或动态参数确定。一旦导出被接收,候选单元被改订并且被重建程序与对一个或多个基本数据单元的引用的组合所替换。蒸馏数据中的为候选单元创建的条目被所述导出所替换,也就是说被表明这是导出单元的指示连同重建程序加上对在导出中所涉及的一个或多个基本数据单元的引用所替换(操作218)。另一方面,如果对应于最佳导出的距离超出距离阈值(操作214中的"否"分支),则将不会接收任何可能的导出项。在这种情况下,候选单元可以作为新的基本数据单元被分配并且输入到滤筛中,并且在蒸馏数据中为候选单元创建的条目将是对新创建的基本数据单元的引用连同表明这是基本数据单元的指示(操作216)。A first check is performed to see whether the candidate is a duplicate of any of these units (operation 210). Any suitable hashing technique can be used to speed up this check. If the candidate is identical to a basic data unit retrieved from the basic data filter (the "Yes" branch of operation 210), the entry created for the candidate unit in the distillation data is replaced by a reference to that basic data unit together with an indication that the entry is a basic data unit (operation 220). If no duplicate is found (the "No" branch of operation 210), the entries retrieved from the basic data filter on the basis of the candidate unit are regarded as entries from which the candidate unit can potentially be derived. The following is an important, novel, and non-obvious feature of the basic data filter: when no duplicate is found in the basic data filter, the basic data filter can return basic data units that, although not identical to the candidate unit, are units from which the candidate unit can potentially be derived by applying one or more transformations to the basic data unit(s). The process can then perform analysis and computation in order to derive the candidate unit from the most suitable basic data unit, or from a suitable set of basic data units (operation 212). In some embodiments, the derivation expresses the candidate unit as the result of transformations performed on one or more basic data units; such transformations are collectively referred to as a reconstruction procedure. Each derivation may require the construction of its own unique procedure. In addition to constructing the reconstruction procedure, the process can also compute a distance metric that generally indicates the level of storage resources and/or computational resources needed to store the reformulation of the candidate unit and to reconstruct the candidate unit from that reformulation. In some embodiments, the footprint of the derived unit is used as the measure of distance from the basic data unit(s) to the candidate; specifically, the distance metric can be defined as the size of the reconstruction procedure plus the sum of the sizes of the references to the one or more basic data units involved in the derivation. The derivation with the shortest distance can be selected. The distance corresponding to that derivation is compared against a distance threshold (operation 214), and if the distance does not exceed the distance threshold, the derivation is accepted (the "Yes" branch of operation 214). In order to yield data simplification, the distance threshold must always be smaller than the size of the candidate unit. For example, the distance threshold can be set to 50% of the size of the candidate unit, so that a derivation is accepted only if its footprint is less than or equal to half the footprint of the candidate unit, thereby ensuring a simplification of 2x or greater for every candidate unit for which a suitable derivation exists. The distance threshold can be a predetermined percentage or ratio, based either on user-specified input or on a selection made by the system. The distance threshold can be determined by the system based on static or dynamic parameters of the system. Once a derivation is accepted, the candidate unit is reformulated and replaced by the combination of the reconstruction procedure and the references to the one or more basic data units. The entry created for the candidate unit in the distillation data is replaced by the derivation, that is, by an indication that this is a derived unit, together with the reconstruction procedure plus the references to the one or more basic data units involved in the derivation (operation 218). On the other hand, if the distance corresponding to the best derivation exceeds the distance threshold (the "No" branch of operation 214), none of the possible derivations will be accepted. In that case, the candidate unit can be allocated as a new basic data unit and entered into the filter, and the entry created for the candidate unit in the distillation data will be a reference to the newly created basic data unit together with an indication that this is a basic data unit (operation 216).
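The flow of operations 202 through 222 can be sketched end to end under strongly simplifying assumptions: fixed-size factorization, a content-associative lookup that naively returns every installed basic data unit of matching length, a substitution-only reconstruction procedure, a 2-byte reference, and a distance threshold of 50% of the candidate size. None of these choices come from the specification; they only make the control flow concrete.

```python
# Illustrative sketch only of the Figure 2 flow; all constants are assumptions.

def distill(data: bytes, unit: int = 8, ref_size: int = 2):
    sieve, distilled = [], []
    for i in range(0, len(data), unit):                 # 202: factorize input
        cand = data[i:i + unit]                         # 204: next candidate
        # Stand-in for the content-associative lookup (206).
        matches = [(r, p) for r, p in enumerate(sieve) if len(p) == len(cand)]
        choice = None
        for ref, prime in matches:                      # 210: duplicate check
            if prime == cand:
                choice = ("prime_ref", ref)
                break
        if choice is None:
            best = None
            for ref, prime in matches:                  # 212: derive candidate
                rp = [(j, cand[j]) for j in range(len(cand)) if prime[j] != cand[j]]
                cost = ref_size + 2 * len(rp)
                if best is None or cost < best[0]:
                    best = (cost, ref, rp)
            if best is not None and best[0] <= len(cand) // 2:  # 214: threshold
                choice = ("derivative", best[1], best[2])
        if choice is None:                              # 216: install new prime
            sieve.append(cand)
            choice = ("prime_ref", len(sieve) - 1)
        distilled.append(choice)                        # 218/220: record entry
    return sieve, distilled                             # 222: loop until done
```

Run on four 8-byte candidates where the second is an exact duplicate, the third differs in one byte, and the fourth is unrelated, this yields two installed basic data units, one duplicate reference, and one accepted derivation.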

最后,所述处理可以检查是否存在任何附加的候选单元(操作222),并且如果还有更多候选单元则返回操作204(操作222的“是”分支),或者如果没有更多候选单元则终止处理(操作222的“否”分支)。Finally, the process can check if there are any additional candidate units (operation 222), and if there are more candidate units, return to operation 204 (the "yes" branch of operation 222), or terminate the process if there are no more candidate units (the "no" branch of operation 222).

可以采用多种方法来实施图2中的操作202,也就是对传入数据进行解析并且将其分解成候选单元。因式分解算法需要决定将在字节流中的何处插入中断以便把该流切分成候选单元。可能的技术包括(而不限于)把流分解成固定大小块(比如4096字节的页面),或者应用指纹处理方法(比如对输入流的子串应用随机素多项式的技术)以便在数据流中定位变成单元边界的指纹(这种技术可以导致可变大小单元),或者对输入进行解析以便检测报头或者某种预先声明的结构并且基于该结构来界定单元。可以对输入进行解析以便检测通过图式(schema)声明的特定结构。可以对输入进行解析以便在数据中检测预先声明的模式、语法或规则表达法的存在。一旦识别出数据中的两处接连的中断,则创建候选单元(所述候选单元是位于所述两处接连的中断之间的数据)并且将其呈现到基本数据滤筛以供内容关联查找。如果创建了可变大小单元,则需要规定候选单元的长度并且作为元数据与候选单元一起携带。Several methods can be used to implement operation 202 in Figure 2, namely parsing the incoming data and breaking it into candidate units. The factorization algorithm needs to decide where to insert breaks in the byte stream in order to slice the stream into candidate units. Possible techniques include (but are not limited to) breaking the stream into fixed-size blocks (such as 4096-byte pages); applying a fingerprinting method (such as a technique that applies random prime polynomials to substrings of the input stream) to locate fingerprints in the data stream that become unit boundaries (this technique can lead to variable-size units); or parsing the input to detect headers or some pre-declared structure and delineating units based on that structure. The input can be parsed to detect specific structures declared through a schema. The input can be parsed to detect the presence of pre-declared patterns, grammars, or regular expressions in the data. Once two successive breaks in the data have been identified, a candidate unit is created (the candidate unit being the data located between the two successive breaks) and presented to the basic data filter for content-associative lookup. If variable-size units are created, the length of the candidate unit needs to be specified and carried as metadata along with the candidate unit.
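Of the techniques listed above, the fingerprinting approach can be sketched with a rolling polynomial hash over a fixed window: a break is declared wherever the low bits of the fingerprint are zero, which yields content-defined, variable-size candidate units. The base, window, modulus, and break mask below are arbitrary illustrative choices, not values from the specification.

```python
# Illustrative sketch only: content-defined factorization via a rolling hash.

def factorize(data: bytes, window: int = 16, mask: int = 0x3F,
              base: int = 257, mod: int = (1 << 31) - 1):
    if len(data) <= window:
        return [data]
    top = pow(base, window - 1, mod)       # weight of the outgoing byte
    fp = 0
    for b in data[:window]:                # fingerprint of the first window
        fp = (fp * base + b) % mod
    units, start = [], 0
    for i in range(window, len(data)):
        # Declare a break when the fingerprint's low bits are zero and the
        # current unit has reached a minimum size of one window.
        if (fp & mask) == 0 and i - start >= window:
            units.append(data[start:i])
            start = i
        # Roll the window forward by one byte.
        fp = ((fp - data[i - window] * top) * base + data[i]) % mod
    units.append(data[start:])             # final (possibly short) unit
    return units
```

Because each break depends only on the local window content, an insertion near the front of a stream shifts unit boundaries only locally rather than re-chunking the entire stream, which is why such schemes produce variable-size units.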

基本数据滤筛的一项重要功能是基于为之给出的候选单元而提供内容关联查找并且快速地提供一个基本数据单元或者较小的基本数据单元集合,从中可以导出候选单元并且只利用规定所述导出所需的最小存储。在给定较大数据集的情况下,这是一个困难的问题。在给定以太字节计的数据的情况下,即使对于千字节大小的单元,仍然要搜索数以十亿计的单元并且从中作出选择。这一问题在更大的数据集上甚至会更加严重。因此变得很重要的是使用适当的技术对单元进行组织和排序,并且随后在单元的该组织内检测相似性和可导出性,以便能够快速地提供适当的基本数据单元的较小集合。An important function of the basic data filter is to provide content-associative lookup based on a candidate unit presented to it, and to quickly provide one basic data unit, or a small set of basic data units, from which the candidate unit can be derived while requiring only the minimal storage needed to specify the derivation. Given a large dataset, this is a difficult problem. Given terabytes of data, even with kilobyte-sized units there are billions of units to search through and choose from. The problem is even more acute on larger datasets. It therefore becomes important to organize and order the units using suitable techniques, and then to detect similarity and derivability within that organization of the units, so that a small set of suitable basic data units can be provided quickly.

可以基于每一个单元(也就是基本数据单元)的值对滤筛中的条目进行排序,从而可以按照升序或降序通过值来安排所有的条目。或者可以沿着基于单元中的特定字段的值的主轴对条目进行排序,随后是使用单元的其余内容的次要轴。在本上下文中,字段是来自单元的内容的邻接字节的集合。可以通过对单元的内容应用指纹处理方法来定位字段,从而使得指纹的位置标识字段的位置。或者可以选择单元内容内部的特定的固定偏移量以便定位字段。还可以采用其他方法以定位字段,其中包括而不限于对单元进行解析以便检测所声明的特定结构并且定位该结构内的字段。Entries in the filter can be sorted based on the value of each cell (i.e., the basic data unit), allowing all entries to be arranged in ascending or descending order by value. Alternatively, entries can be sorted along a primary axis based on the value of a specific field within the cell, followed by a secondary axis using the rest of the cell's content. In this context, a field is a set of adjacent bytes from the cell's content. Fields can be located by applying a fingerprinting method to the cell's content, allowing the fingerprint's position to identify the field's location. Alternatively, a specific fixed offset within the cell's content can be selected to locate the field. Other methods can also be used to locate fields, including, but not limited to, parsing the cell to detect a specific declared structure and locate fields within that structure.
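The primary-axis/secondary-axis ordering described above can be sketched as a sort key. The fixed offset and field length below are illustrative; as the text notes, a fingerprint could locate the field instead:

```python
def order_key(element: bytes, field_off: int = 4, field_len: int = 8):
    """Sort key for sieve entries: a field at a fixed offset is the primary
    axis, and the element's remaining content is the secondary axis. For
    fixed-size elements the pair (field, rest) contains every byte of the
    element, so distinct elements always get distinct keys."""
    field = element[field_off:field_off + field_len]
    rest = element[:field_off] + element[field_off + field_len:]
    return (field, rest)

# Entries in the sieve can then be kept in ascending order of this key:
#   entries = sorted(elements, key=order_key)
```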

在另一种形式的组织中,单元内的特定字段或字段组合可以被视为维度,从而可以使用这些维度的串联以及随后的每一个单元的剩余内容对数据单元进行排序和组织。一般来说,字段与维度之间的对应性或映射可以是任意地复杂。例如在一些实施例中,确切地一个字段可以映射到确切地一个维度。在其他实施例中,多个字段的组合(例如F1、F2和F3)可以映射到一个维度。可以通过串联两个字段或者通过对其应用任何其他适当的功能而实现字段的组合。重要的要求是被用来对单元进行组织的字段、维度以及单元的剩余内容的安排必须允许通过其内容唯一地识别所有基本数据单元并且在滤筛中对其进行排序。In another form of organization, specific fields or combinations of fields within a cell can be considered dimensions, allowing data cells to be sorted and organized using the concatenation of these dimensions followed by the remaining content of each cell. Generally, the correspondence or mapping between fields and dimensions can be arbitrarily complex. For example, in some embodiments, exactly one field may map to exactly one dimension. In other embodiments, combinations of multiple fields (e.g., F1, F2, and F3) may map to a dimension. Field combinations can be achieved by concatenating two fields or by applying any other suitable function to them. An important requirement is that the arrangement of the fields, dimensions, and remaining content of the cells used to organize them must allow for the unique identification of all basic data cells by their content and their sorting in a filter.

在又一个实施例中,可以将某个合适的函数(诸如代数或算术变换)应用于单元,其中该函数具有以下属性:函数的结果唯一地识别每个单元。在一个这样的实施例中,每个单元被素多项式或某个选择的数或值除,并且除的结果(其包括商和余数对)被用作在基本数据滤筛中组织和排序单元的函数。例如,包含余数的位可以形成函数的结果的前导字节,后面跟着包含商的位。或替代地,包含商的位可以用于形成函数的结果的前导字节,后面跟着包含余数的位。对于用于除输入单元的给定除数,商和余数对将唯一地识别该单元,因此该对可以用于形成函数的结果,该结果用于在基本数据滤筛中组织和排序单元。通过将该函数应用于每个单元,可以基于函数的结果在滤筛中组织基本数据单元。该函数仍将唯一地识别每个基本数据单元,并将提供在基本数据滤筛中排序和组织基本数据单元的一种替代方法。In another embodiment, a suitable function (such as an algebraic or arithmetic transformation) can be applied to the cells, wherein the function has the property that the result of the function uniquely identifies each cell. In one such embodiment, each cell is divided by a prime polynomial or a number or value of some choice, and the result of the division (which includes a quotient and remainder pair) is used as a function to organize and sort the cells in the basic data sieve. For example, the bits containing the remainder can form the leading byte of the result of the function, followed by the bits containing the quotient. Alternatively, the bits containing the quotient can be used to form the leading byte of the result of the function, followed by the bits containing the remainder. For a given divisor used to divide the input cell, the quotient and remainder pair will uniquely identify the cell, and thus the pair can be used to form the result of the function, which is used to organize and sort the cells in the basic data sieve. By applying this function to each cell, basic data cells can be organized in the sieve based on the result of the function. The function will still uniquely identify each basic data cell and will provide an alternative method for sorting and organizing basic data cells in the basic data sieve.
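A minimal sketch of the quotient/remainder ordering function described above, with the remainder bits leading. The Mersenne-prime divisor is an illustrative choice, not one mandated by the text:

```python
def division_key(element: bytes, divisor: int = (1 << 61) - 1):
    """Alternative ordering function: treat the element as a large integer,
    divide by a chosen divisor, and use the (remainder, quotient) pair as the
    key, remainder leading. For a fixed divisor the pair uniquely identifies
    the element's value, so the sieve can be sorted and organized on it."""
    value = int.from_bytes(element, "big")
    quotient, remainder = divmod(value, divisor)
    return (remainder, quotient)
```

Swapping the tuple order yields the second variant in the text, in which the quotient bits lead.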

在又一个实施例中,可以将某些合适的函数(诸如代数或算术变换)应用于单元的每个字段,其中该函数具有以下属性:函数的结果唯一地识别该字段。例如,可以对每个单元的内容的连续字段或连续部分执行诸如除以合适的多项式或数或值之类的函数,使得可以将连续函数的结果的串联用于排序和组织基本数据滤筛中的单元。注意的是,在每个字段上,可以将不同的多项式用于除法。每个函数将根据该部分或字段的除法运算得出的商和余数提供适当排序的位的串联。通过使用应用于单元的字段的函数的这种串联,可以在滤筛中对每个基本数据单元进行排序和组织。函数的串联仍将唯一地识别每个基本数据单元,并将提供在基本数据滤筛中排序和组织基本数据单元的一种替代方法。In another embodiment, certain suitable functions (such as algebraic or arithmetic transformations) can be applied to each field of the cell, wherein the function has the property that the result of the function uniquely identifies the field. For example, a function such as division by a suitable polynomial or number or value can be performed on consecutive fields or consecutive portions of the contents of each cell, such that a concatenation of the results of consecutive functions can be used to sort and organize cells in the basic data filter. Note that different polynomials can be used for division on each field. Each function will provide a concatenation of appropriately sorted bits based on the quotient and remainder obtained from the division operation of that portion or field. By using such a concatenation of functions applied to the fields of the cell, each basic data cell can be sorted and organized in the filter. The concatenation of functions will still uniquely identify each basic data cell and will provide an alternative method for sorting and organizing basic data cells in the basic data filter.

在一些实施例中,单元的内容可以被表示成下面的表达法:单元=头部.*sig1.*sig2.*…sigI.*…sigN.*尾部,其中“头部”是包括单元的开头字节的字节序列,“尾部”是包括单元的结尾字节的字节序列,“sig1”、“sig2”、“sigI”和“sigN”是表征单元的单元内容主体内的特定长度的各个签名或模式或规则表达法或字节序列。各个签名之间的表达法“.*”是通配符表达法,也就是允许除了表达法“.*”之后的签名之外的其他任何值的任意数目的中间字节的规则表达法标记。在一些实施例中,N元组(sig1,sig2,…sigI,…sigN)被称作单元的骨架数据结构或骨架,并且可以被视为单元的简化实质子集或实质。在其他实施例中,(N+2)元组(头部,sig1,sig2,…sigI,…sigN,尾部)被称作单元的骨架数据结构或骨架。或者可以采用头部或尾部连同其余签名的N+1元组。In some embodiments, the content of a unit can be represented as follows: Unit = Header.*sig1.*sig2.*…sigI.*…sigN.*Tail, where "Header" is a sequence of bytes comprising the leading bytes of the unit, "Tail" is a sequence of bytes comprising the trailing bytes of the unit, and "sig1", "sig2", "sigI", and "sigN" are signatures, patterns, regular expressions, or byte sequences of a specific length within the body of the unit's content that characterize the unit. The expression ".*" between the signatures is a wildcard, i.e., a regular-expression token that allows any number of intermediate bytes of any value other than the signature following the expression ".*". In some embodiments, the N-tuple (sig1, sig2,…sigI,…sigN) is called the skeleton data structure or skeleton of the unit and can be regarded as a simplified subset or essence of the unit. In other embodiments, the (N+2)-tuple (Header, sig1, sig2,…sigI,…sigN, Tail) is called the skeleton data structure or skeleton of the unit. Alternatively, an N+1 tuple consisting of the header or tail along with the remaining signatures can be used.
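The shape Header.*sig1.*…*sigN.*Tail described above can be checked with an ordinary regular-expression match. The signatures used in the usage note are hypothetical byte patterns, not ones prescribed by the text:

```python
import re

def skeleton(element: bytes, signatures):
    """Test an element against the shape  head .* sig1 .* ... .* sigN .* tail
    and return its skeleton (the N-tuple of signatures) on a match, else None.
    re.DOTALL lets '.*' span newline bytes as the wildcard in the text does."""
    pattern = b".*".join(re.escape(s) for s in signatures)
    return tuple(signatures) if re.search(pattern, element, re.DOTALL) else None
```

For example, skeleton(b"HDRxxSIG1yySIG2zzTAIL", [b"SIG1", b"SIG2"]) matches, whereas reordering the signatures in the element makes the match fail.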

可以对单元的内容应用指纹处理方法,以便确定单元内容内的骨架数据结构的各个分量(或签名)的位置。或者可以选择单元内容内部的特定的固定偏移量以便定位分量。还可以采用其他方法来定位骨架数据结构的分量,其中包括而不限于对单元进行解析以便检测所声明的特定结构并且定位该结构内的分量。可以基于其骨架数据结构在滤筛中对基本数据单元进行排序。换句话说,单元的骨架数据结构的各个分量可以被视为维度,从而可以使用这些维度的串联以及随后的每一个单元的剩余内容对滤筛中的基本数据单元进行排序和组织。Fingerprinting methods can be applied to the content of a cell to determine the location of each component (or signature) of the skeleton data structure within the cell content. Alternatively, a specific fixed offset within the cell content can be selected to locate the components. Other methods can also be used to locate the components of the skeleton data structure, including, but not limited to, parsing the cell to detect a specific declared structure and locate the components within that structure. Basic data cells can be sorted in a filter based on their skeleton data structure. In other words, the components of a cell's skeleton data structure can be considered as dimensions, allowing the basic data cells in the filter to be sorted and organized using the concatenation of these dimensions followed by the remaining content of each cell.

一些实施例把输入数据因式分解成候选单元,其中每一个候选单元的大小显著大于对全局数据集中的所有此类单元进行存取所需要的引用的大小。关于被分解成此类数据组块(并且按照内容关联方式被存取)的数据的一项观察是,实际的数据关于数据组块所能规定的全部可能值是非常稀疏的。例如考虑1泽字节数据集。需要大约70个比特对数据集中的每一个字节进行寻址。在128字节(1024比特)的组块大小下,在1泽字节数据集中近似有2^63个组块,因此需要63个比特(少于8个字节)对所有组块进行寻址。应当提到的是,1024比特的单元或组块可以具有2^1024个可能值当中的一个,而数据集中的给定组块的实际值的数目最多是2^63个(如果所有组块都不同的话)。这表明实际的数据关于通过一个单元的内容所能达到或命名的值的数目是极为稀疏的。这就允许使用非常适合于组织非常稀疏的数据的树结构,从而允许高效的基于内容的查找,允许把新的单元高效地添加到树结构,并且在对于树结构本身所需要的增量存储方面是成本有效的。虽然在1泽字节数据集中仅有2^63个不同的组块,从而只需要63个区分信息比特将其区别开,但是相关的区分比特可能分散在单元的整个1024个比特上,并且对于每一个单元出现在不同的位置处。因此,为了完全区分所有单元,仅仅检查来自内容的固定的63比特是不够的,相反,整个单元内容都需要参与单元的分拣,特别在提供对于数据集中的任一个和每一个单元的真实内容关联存取的解决方案中尤其是如此。在Data Distillation™框架中,希望能够在被用来对数据进行排序和组织的框架内检测可导出性。考虑到所有前述内容,基于内容的树结构(其随着检查更多内容逐渐地区分数据)是对经过因式分解的数据集中的所有单元进行排序和区分的适当组织。这样的结构提供了可以被作为可导出单元的分组或者具有类似的可导出性属性的单元分组来对待的许多中间子树等级。这样的结构可以利用表征每一个子树的元数据或者利用表征每一个数据单元的元数据通过分级方式被加强。这样的结构可以有效地传达其所包含的整个数据的构成,包括数据中的实际值的密度、邻近性和分布。Some implementations factorize the input data into candidate units, where the size of each candidate unit is significantly larger than the size of the references required to access all such units in the global dataset. One observation regarding data decomposed into such data chunks (and accessed in a content-associative manner) is that the actual data is very sparse with respect to all possible values that can be specified by a data chunk. Consider, for example, a 1 zettabyte dataset. Approximately 70 bits are needed to address each byte in the dataset. With a chunk size of 128 bytes (1024 bits), there are approximately 2^63 chunks in a 1 zettabyte dataset, thus requiring 63 bits (less than 8 bytes) to address all chunks. It should be noted that a 1024-bit unit or chunk can have one of 2^1024 possible values, while the actual number of values for a given chunk in the dataset is at most 2^63 (if all chunks are distinct). This demonstrates that the actual data is extremely sparse with respect to the number of values that can be reached or named through the content of a single unit.
This allows for the use of tree structures, which are well-suited for organizing very sparse data, enabling efficient content-based lookups, allowing new units to be added efficiently to the tree structure, and being cost-effective in terms of the incremental storage required for the tree structure itself. While there are only 2^63 distinct chunks in a 1 zettabyte dataset, requiring only 63 distinguishing bits to differentiate them, the relevant distinguishing bits may be scattered across the entire 1024 bits of a unit and appear in different locations for each unit. Therefore, to fully distinguish all units, simply examining the fixed 63 bits from the content is insufficient; instead, the entire unit content needs to be involved in unit sorting, especially in solutions that provide true content-based access to any and every unit in the dataset. Within the Data Distillation™ framework, the goal is to detect derivability within the framework used to sort and organize data. Considering all the foregoing, a content-based tree structure (which progressively distinguishes data as more content is examined) is the appropriate organization for sorting and distinguishing all units in a factorized dataset. This structure provides numerous intermediate subtree levels that can be treated as groups of derivable units or groups of units with similar derivability properties. Such a structure can be enhanced hierarchically using metadata representing each subtree or each data unit. This structure effectively conveys the composition of the entire data it contains, including the density, proximity, and distribution of actual values within the data.
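The sparsity arithmetic in the passage above (2^63 chunks of 128 bytes in a 1 zettabyte dataset, each chunk drawn from a 2^1024-value space) can be checked directly:

```python
import math

ZETTABYTE = 2 ** 70      # bytes in 1 zettabyte (binary convention, as the text's numbers imply)
CHUNK_BYTES = 128        # one 1024-bit unit

n_chunks = ZETTABYTE // CHUNK_BYTES          # chunks in the dataset
address_bits = int(math.log2(n_chunks))      # bits needed to name every chunk
content_bits = CHUNK_BYTES * 8               # bits of content per chunk

assert n_chunks == 2 ** 63 and address_bits == 63 and content_bits == 1024
# At most 2**63 of the 2**1024 possible chunk values actually occur, so the
# data occupies the value space sparsely by a factor of 2**961:
sparsity_exponent = content_bits - address_bits
```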

一些实施例把基本数据单元按照树形式组织在滤筛中。每一个基本数据单元具有从该基本数据单元的整个内容构造的独特“名称”。该名称被设计成足以唯一地标识基本数据单元,并且将其与树中的所有其他单元作出区分。可以通过几种方式从基本数据单元的内容构造名称。名称可以简单地由基本数据单元的所有字节构成,这些字节按照其存在于基本数据单元中的相同顺序出现在名称中。在另一个实施例中,被称作维度的特定字段或字段组合(其中字段和维度在前面作了描述)被用来形成名称的开头字节,基本数据单元的其余内容则形成名称的其余部分,从而使得基本数据单元的整个内容都参与创建单元的完整并且唯一的名称。在另一个实施例中,单元的骨架数据结构的字段被选择成维度(其中字段和维度在前面作了描述),并且被用来形成名称的开头字节,基本数据单元的其余内容形成名称的其余部分,从而使得基本数据单元的整个内容都参与创建单元的完整并且唯一的名称。Some embodiments organize basic data units in a tree structure within a filter. Each basic data unit has a unique "name" constructed from its entire content. This name is designed to uniquely identify the basic data unit and distinguish it from all other units in the tree. The name can be constructed from the content of the basic data unit in several ways. The name can simply consist of all the bytes of the basic data unit, appearing in the name in the same order they are present in the basic data unit. In another embodiment, a specific field or combination of fields, referred to as a dimension (where fields and dimensions have been described previously), is used to form the first byte of the name, and the remaining content of the basic data unit forms the rest of the name, thus ensuring that the entire content of the basic data unit contributes to creating a complete and unique name for the unit. In yet another embodiment, a field from the skeleton data structure of the unit is selected as a dimension (where fields and dimensions have been described previously) and used to form the first byte of the name, with the remaining content of the basic data unit forming the rest of the name, thus ensuring that the entire content of the basic data unit contributes to creating a complete and unique name for the unit.

在一些实施例中,可以通过对单元执行代数或算术变换来计算单元的名称,同时保留每个名称唯一地识别每个单元的属性。在一个这样的实施例中,将每个单元除以素多项式或某个所选择的数字或值,并将除法的结果(即商和余数对)用于形成单元的名称。例如,包含余数的位可以形成名称的前导字节,后面跟着包含商的位。或者替代地,包含商的位可以用于形成名称的前导字节,后面跟着包含余数的位。对于用于除输入单元的给定除数,商和余数对将唯一地识别该单元,因此该对可以用于形成每个单元的名称。通过使用名称的这种表示形式,可以基于基本数据单元的名称在滤筛中组织基本数据单元。名称仍将唯一地识别每个基本数据单元,并将提供在基本数据滤筛中排序和组织基本数据单元的一种替代方法。In some embodiments, the name of a cell can be calculated by performing an algebraic or arithmetic transformation on the cell, while preserving the attribute that each name uniquely identifies each cell. In one such embodiment, each cell is divided by a prime polynomial or some chosen number or value, and the result of the division (i.e., the quotient and remainder pair) is used to form the cell's name. For example, the bits containing the remainder can form the leading bytes of the name, followed by the bits containing the quotient. Alternatively, the bits containing the quotient can be used to form the leading bytes of the name, followed by the bits containing the remainder. For a given divisor used to divide the input cell, the quotient and remainder pair will uniquely identify the cell, and thus the pair can be used to form the name of each cell. By using this representation of the name, basic data cells can be organized in a sieve based on the name of the basic data cell. The name will still uniquely identify each basic data cell and will provide an alternative method for sorting and organizing basic data cells in a basic data sieve.

在另一个实施例中,可以采用这种生成名称的方法的变体(其涉及除法和提取商/余数对),其中可以在每个单元的内容的连续字段或连续部分上执行除以合适的多项式或数或值的除法,从而产生每个单元的名称的连续部分(每个部分是根据该部分或字段的除法运算得到的商和余数的适当排序的位的串联)。注意的是,在每个字段上,可以将不同的多项式用于除法。使用名称的这种表示形式,可以根据基本数据单元的名称在滤筛中组织基本数据单元。名称仍将唯一地识别每个基本数据单元,并将提供在基本数据滤筛中排序和组织基本数据单元的一种替代方法。In another embodiment, a variation of this name generation method (which involves division and extracting quotient/remainder pairs) can be employed, wherein division by a suitable polynomial or number or value can be performed on consecutive fields or consecutive portions of the contents of each cell, thereby producing consecutive portions of the name of each cell (each portion is a concatenation of appropriately ordered bits of the quotient and remainder obtained according to the division operation of that portion or field). Note that different polynomials can be used for division on each field. Using this representation of names, basic data cells can be organized in a filter based on their names. The names will still uniquely identify each basic data cell and will provide an alternative method for sorting and organizing basic data cells in a basic data filter.

每一个基本数据单元的名称被用来在树中对基本数据单元进行排序和组织。对于大多数实际的数据集,即使是大小非常大的那些数据集(比如由2^58个4KB大小单元构成的1泽字节数据集),也预期到名称的较小字节子集将常常可以用来对树中的大部分基本数据单元进行分拣和排序。The name of each basic data unit is used to sort and organize the basic data units in the tree. For most real-world datasets, even those that are very large (such as a 1-zettabyte dataset consisting of 2^58 4KB units), it is expected that a smaller subset of the names will often be used to sort and organize the majority of the basic data units in the tree.

图3A、3B、3C、3D和3E示出了根据这里所描述的一些实施例的可以被用来基于其名称对基本数据单元进行组织的不同的数据组织系统。Figures 3A, 3B, 3C, 3D, and 3E illustrate different data organization systems that can be used to organize basic data units based on their names, according to some embodiments described herein.

图3A示出了前缀树(trie)数据结构,其中基于来自每一个基本数据单元的相继字节的值把基本数据单元组织在逐渐更小的群组中。在图3A所示的实例中,每一个基本数据单元具有从该基本数据单元的整个内容构造的独特名称,该名称简单地由基本数据单元的所有字节构成,这些字节按照其存在于基本数据单元中的相同顺序出现在名称中。前缀树的根节点表示所有基本数据单元。前缀树的其他节点表示基本数据单元的子集或群组。从前缀树的根节点或第1级(在图3A中被标记成根302)开始,基于其名称的最高有效字节(在图3A中被标记成N1)的值把基本数据单元分组到子树中。在其名称的最高有效字节中具有相同值的所有基本数据单元将被一起分组到共同的子树中,并且通过该值标示的链接将从根节点存在到表示该子树的节点。例如在图3A中,节点303表示分别在其对应名称的最高有效字节N1中具有相同的值2的基本数据单元的子树或群组。在图3A中,该群组包括基本数据单元305、306和307。Figure 3A illustrates a trie data structure, where basic data units are organized into progressively smaller groups based on the values of successive bytes from each basic data unit. In the example shown in Figure 3A, each basic data unit has a unique name constructed from its entire contents, simply consisting of all the bytes of the basic data unit appearing in the same order they are present in the basic data unit. The root node of the trie represents all basic data units. The other nodes of the trie represent subsets or groups of basic data units. Starting from the root node of the trie, or level 1 (labeled root 302 in Figure 3A), basic data units are grouped into subtrees based on the value of the most significant byte of their name (labeled N1 in Figure 3A). All basic data units with the same value in the most significant byte of their name are grouped together into a common subtree, and a link indicated by that value exists from the root node to the node representing that subtree. For example, in Figure 3A, node 303 represents a subtree or group of basic data units that have the same value 2 in the most significant byte N1 of their respective names. In Figure 3A, this group includes basic data units 305, 306, and 307.

在前缀树的第二级,每一个基本数据单元的名称的第二最高有效字节被用来把每一个基本数据单元群组进一步划分成更小的子组。例如在图3A中,使用第二最高有效字节N2把由节点303表示的基本数据单元群组进一步细分成各个子组。节点304表示在其对应名称的最高有效字节N1中具有值2并且在第二最高有效字节N2中具有值1的基本数据单元的子组。该子组包括基本数据单元305和306。At the second level of the prefix tree, the second most significant byte of the name of each basic data unit is used to further subdivide each group of basic data units into smaller subgroups. For example, in Figure 3A, the group of basic data units represented by node 303 is further subdivided into subgroups using the second most significant byte N2. Node 304 represents a subgroup of basic data units that have a value of 2 in the most significant byte N1 of their corresponding name and a value of 1 in the second most significant byte N2. This subgroup includes basic data units 305 and 306.

所述细分处理在前缀树的每一级继续,从而创建从亲代节点到每一个子代节点的链接,其中子代节点表示通过亲代节点表示的基本数据单元的一个子集。这一处理继续到在前缀树的叶子处只有单独的基本数据单元为止。叶节点表示叶子的群组。在图3A中,节点304是叶节点。由节点304表示的基本数据单元的群组包括基本数据单元305和306。在图3A中,通过使用其名称的第三最高有效字节,该群组被进一步细分成单独的基本数据单元305和306。值N3=3导致基本数据单元305,值N3=5则导致基本数据单元306。在该例中,其完整名称当中的仅仅3个有效字节就足以完全标识基本数据单元305和306。同样地,来自名称的仅仅两个有效字节就足以标识基本数据单元307。The subdivision process continues at each level of the prefix tree, creating links from parent nodes to each child node, where a child node represents a subset of the basic data units represented by the parent node. This process continues until only individual basic data units remain at the leaves of the prefix tree. Leaf nodes represent groups of leaves. In Figure 3A, node 304 is a leaf node. The group of basic data units represented by node 304 includes basic data units 305 and 306. In Figure 3A, this group is further subdivided into individual basic data units 305 and 306 by using the third most significant byte of their names. A value N3 = 3 results in basic data unit 305, and a value N3 = 5 results in basic data unit 306. In this example, only 3 significant bytes from their full names are sufficient to fully identify basic data units 305 and 306. Similarly, only two significant bytes from the name are sufficient to identify basic data unit 307.
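The subdivision of FIG. 3A can be sketched as a minimal prefix-tree insert that consumes only as many leading name bytes as needed to distinguish an element. The three-byte names below are illustrative stand-ins for units 305-307 of the figure (the trailing byte of unit 307's name is hypothetical):

```python
class Node:
    """Prefix-tree node (as in FIG. 3A): children are keyed by the next name
    byte; a leaf holds one basic data element together with its full name."""
    def __init__(self):
        self.children = {}   # next name byte -> child Node
        self.leaf = None     # (name, element) when this node is a leaf

def insert(root, name, element):
    """Install an element under its name, consuming one name byte per level
    and stopping at the first byte that distinguishes it from the elements
    already present. Assumes fixed-length names (as names built from a
    unit's entire content effectively are)."""
    node, depth = root, 0
    while True:
        if node.leaf is not None:                 # split: push the resident leaf deeper
            old_name, old_elem = node.leaf
            if old_name == name:
                return                            # duplicate
            node.leaf = None
            pushed = Node()
            pushed.leaf = (old_name, old_elem)
            node.children[old_name[depth]] = pushed
        child = node.children.get(name[depth])
        if child is None:                         # first distinguishing byte
            child = Node()
            child.leaf = (name, element)
            node.children[name[depth]] = child
            return
        node, depth = child, depth + 1
```

Inserting names 213, 215, and 230 reproduces the figure: the first two elements need three bytes and land three levels deep, while the third is distinguished after only two bytes.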

该例示出了在基本数据单元的给定混合中如何只有名称的一个字节子集用来在树中标识基本数据单元,并且不需要整个名称以到达独有的基本数据单元。此外,基本数据单元或者基本数据单元的群组可能分别需要不同数目的有效字节以便能够对其进行唯一标识。因此,从根节点到基本数据单元的前缀树深度对于不同的基本数据单元可以是不同的。此外,在前缀树中,每一个节点可能具有下降到下方的子树的不同数目的链接。This example illustrates how, in a given mix of basic data units, only a subset of the name's bytes are used to identify the basic data unit in the tree, and the entire name is not required to reach the unique basic data unit. Furthermore, basic data units or groups of basic data units may each require a different number of valid bytes to uniquely identify them. Therefore, the depth of the prefix tree from the root node to the basic data unit can be different for different basic data units. Additionally, in the prefix tree, each node may have a different number of links descending to the subtrees below.

在这样的前缀树中,每一个节点具有由规定如何到达该节点的字节序列构成的名称。举例来说,对应于节点304的名称是“21”。此外,在树中的当前单元分布中唯一地标识单元的来自单元名称的字节子集是从根节点去到该基本数据单元的“路径”。例如在图3A中,具有值213的路径301标识基本数据单元305。In such a prefix tree, each node has a name consisting of a sequence of bytes that specifies how to reach that node. For example, the name corresponding to node 304 is "21". Furthermore, the subset of bytes from a unit's name that uniquely identifies the unit within the current distribution of units in the tree is the "path" from the root node to that basic data unit. For example, in Figure 3A, path 301 with the value 213 identifies basic data unit 305.

这里所描述的前缀树结构可能会产生很深的树(也就是具有许多等级的树),这是因为树中的单元名称的每一个区分字节都为前缀树增加了一级深度。The prefix tree structure described here can produce very deep trees (i.e., trees with many levels) because each distinguishing byte of the unit name in the tree adds one level of depth to the prefix tree.

应当提到的是,图3A-3E中的树数据结构是从左向右绘制的。因此当我们从图的左侧向图的右侧移动时,我们从树的更高等级移动到树的较低等级。在给定节点的下方(也就是说朝向图3A-3E中的给定节点的右侧),对于通过来自名称的区分字节的特定值所选择的任何子代,驻留在该子代下方的各个子树中的所有单元在单元名称中的该相应字节中都将具有相同的值。It should be mentioned that the tree data structure in Figures 3A-3E is drawn from left to right. Therefore, as we move from the left side of the graph to the right side, we move from a higher level of the tree to a lower level. Below a given node (that is, to the right of a given node in Figures 3A-3E), for any child selected by a specific value from the distinguishing byte of the name, all cells in the respective subtrees residing below that child will have the same value in the corresponding byte of the cell name.

现在我们将描述一种在给定输入候选单元的情况下对前缀树进行内容关联查找的方法。此方法涉及使用候选单元的名称对前缀树结构进行导航,随后进行后续的分析和筛选以便决定作为总体内容关联查找的结构将返回什么。换句话说,前缀树导航处理返回第一结果,随后在该结果上实施分析和筛选以便确定总体内容关联查找的结果。We will now describe a method for performing content association lookup on a prefix tree given input candidate units. This method involves navigating the prefix tree structure using the names of the candidate units, followed by subsequent analysis and filtering to determine what the structure will return as the overall content association lookup. In other words, the prefix tree navigation process returns a first result, which is then analyzed and filtered to determine the outcome of the overall content association lookup.

为了开始前缀树导航处理,来自候选单元的名称的最高有效字节的值将被用来选择从根节点到一个后续节点的链接(由该值表示),所述后续节点表示在其名称的最高有效字节中具有该相同值的基本数据单元的子树。从该节点继续,检查来自候选单元的名称的第二字节并且选择由该值标示的链接,从而前进到前缀树中的更深(或更低)一级,并且选择现在与候选单元共享来自其名称的至少两个有效字节的基本数据单元的更小子组。这一处理继续到到达单一基本数据单元为止,或者继续到没有链接与来自候选单元的名称的相应字节的值相匹配为止。在这些条件当中的任一条件下,树导航处理终止。如果到达单一基本数据单元,则可以将其返回以作为前缀树导航处理的结果。如果没有,一种替换方案是报告“错失(miss)”。另一种替换方案是返回以导航终止的节点为根部的子树中的多个基本数据单元。To initiate the prefix tree navigation process, the value of the most significant byte of the candidate unit's name is used to select a link (represented by this value) from the root node to a subsequent node, which represents a subtree of basic data units that have the same value in the most significant byte of their name. Continuing from this node, the second byte of the candidate unit's name is examined, and the link indicated by this value is selected, thus advancing to a deeper (or lower) level in the prefix tree and selecting a smaller subgroup of basic data units that now share at least two significant bytes from their name with the candidate unit. This process continues until a single basic data unit is reached, or until no link matches the value of the corresponding byte in the candidate unit's name. Under either of these conditions, the tree navigation process terminates. If a single basic data unit is reached, it can be returned as the result of the prefix tree navigation process. If not, one alternative is to report a "miss". Another alternative is to return multiple basic data units in the subtree rooted at the node where navigation terminated.

一旦前缀树导航处理终止,可以使用其他标准或要求对前缀树导航处理的结果进行分析和筛选,以便确定作为内容关联查找的结果应当返回什么。举例来说,当由前缀树导航处理返回单一基本数据单元或多个基本数据单元时,在有资格作为内容关联查找的结果被返回之前,可以附加地要求其与候选单元的名称共享特定最小数目的字节(否则内容关联查找返回错失)。筛选要求的另一个实例可以是如果前缀树导航处理在没有到达单一基本数据单元的情况下终止从而返回多个基本数据单元(以前缀树导航终止的节点为根部)以作为前缀树导航处理的结果,则只有在所述多个基本数据单元的数目少于所规定的特定限制的情况下,这些单元才将有资格作为总体内容关联查找的结果被返回(否则内容关联查找返回错失)。可以采用多项要求的组合来确定内容关联查找的结果。通过这种方式,查找处理将报告“错失”或返回单一基本数据单元,或者如果不是单一基本数据单元,则是可能作为用于导出候选单元的良好起点的基本数据单元的集合。Once the prefix tree navigation process terminates, the results can be analyzed and filtered using other criteria or requirements to determine what should be returned as a result of the content-associative lookup. For example, when the prefix tree navigation process returns a single basic data unit or multiple basic data units, it can be additionally required that they share a specific minimum number of bytes with the name of the candidate unit before being eligible to be returned as a result of the content-associative lookup (otherwise, the content-associative lookup returns a miss). Another example of a filtering requirement could be that if the prefix tree navigation process terminates without reaching a single basic data unit, thus returning multiple basic data units (rooted at the node where the prefix tree navigation terminated) as a result of the prefix tree navigation process, then these units will only be eligible to be returned as a result of the overall content-associative lookup if the number of said multiple basic data units is less than a specified limit (otherwise, the content-associative lookup returns a miss). A combination of multiple requirements can be used to determine the result of the content-associative lookup. In this way, the lookup process will report a "miss" or return a single basic data unit, or, if not a single basic data unit, a set of basic data units that might serve as a good starting point for deriving the candidate unit.
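The navigate-then-filter behavior described above can be sketched over an uncompressed byte trie of names. The names, the nested-dict representation, and the group-size threshold are illustrative:

```python
def build_trie(names):
    """Uncompressed byte trie as nested dicts; a full name is recorded under
    the key None at the node its bytes lead to."""
    root = {}
    for name in names:
        node = root
        for byte in name:
            node = node.setdefault(byte, {})
        node[None] = name
    return root

def subtree_names(node):
    """All names stored in the subtree rooted at `node`."""
    found = []
    for key, value in node.items():
        found.extend([value] if key is None else subtree_names(value))
    return found

def ca_lookup(root, name, max_group=3):
    """Content-associative lookup: navigate as far as the candidate's name
    bytes match links, then filter the subtree where navigation stopped.
    Returns one name, a small group (a starting point for derivation), or
    None to signal a miss. `max_group` is an illustrative threshold."""
    node, depth = root, 0
    while depth < len(name) and name[depth] in node:
        node, depth = node[name[depth]], depth + 1
    hits = subtree_names(node)
    return hits if 0 < len(hits) <= max_group else None
```

A candidate whose name shares a long prefix with one stored name yields that single unit; a shorter shared prefix yields the small group under the matched subtree; no shared prefix leaves the whole sieve as the "subtree" and the size filter reports a miss.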

下面描述的图3B-3E涉及图3A中示出的树数据结构的变型和修改。虽然这些变型提供了优于图3A中示出的前缀树数据结构的改进和优点,但是用于对所述数据结构进行导航的处理类似于前面参照图3A所描述的处理。也就是说,在对于图3B-3E中示出的树数据结构的树导航终止之后,实施后续分析和筛选以便确定总体内容关联查找的结果,所述总体处理返回错失、单一基本数据单元或者可能作为用于导出候选单元的良好起点的基本数据单元的集合。Figures 3B-3E described below relate to variations and modifications of the tree data structure shown in Figure 3A. While these variations offer improvements and advantages over the prefix tree data structure shown in Figure 3A, the processing for navigating said data structure is similar to that described earlier with reference to Figure 3A. That is, after tree navigation of the tree data structure shown in Figures 3B-3E terminates, subsequent analysis and filtering are performed to determine the result of the overall content-associative lookup, and the overall process returns a miss, a single basic data unit, or a set of basic data units that may serve as a good starting point for deriving the candidate unit.

图3B示出了可以被用来基于其名称对基本数据单元进行组织的另一种数据组织系统。在图3B所示出的实例中,每一个基本数据单元具有从该基本数据单元的整个内容构造的独特名称,该名称简单地由基本数据单元的所有字节构成,这些字节按照其存在于基本数据单元中的相同顺序出现在名称中。图3B示出了一种更加紧凑的结构,其中单一链接采用来自下方子树中的基本数据单元的名称的多个字节(而不是使用在图3A的前缀树中的单一字节),以便产生细分或下一级的分组。从亲代节点到子代节点的链接现在由多个字节标示。此外,来自任何给定亲代节点的每一个链接可以采用不同数目的字节以便区分和标识与该链接相关联的子树。例如在图3B中,从根节点到节点308的链接通过使用来自名称的4个字节(N1N2N3N4=9845)来区分,从根节点到节点309的链接则通过使用来自名称的3个字节(N1N2N3=347)来区分。Figure 3B illustrates another data organization system that can be used to organize basic data units based on their names. In the example shown in Figure 3B, each basic data unit has a unique name constructed from its entire content, simply consisting of all the bytes of the basic data unit, which appear in the name in the same order they are present in the basic data unit. Figure 3B shows a more compact structure where a single link uses multiple bytes from the names of basic data units in the subtree below (instead of using a single byte in the prefix tree of Figure 3A) to create subdivisions or next-level groupings. Links from parent nodes to child nodes are now identified by multiple bytes. Furthermore, each link from any given parent node can use a different number of bytes to distinguish and identify the subtree associated with that link. For example, in Figure 3B, the link from the root node to node 308 is distinguished by using 4 bytes from the name (N1N2N3N4 = 9845), while the link from the root node to node 309 is distinguished by using 3 bytes from the name (N1N2N3 = 347).

应当提到的是,在(使用来自给定候选单元的)树导航期间,当到达树中的任何亲代节点时,树导航处理需要确保检查来自候选单元的名称的足够多的字节以便明确地决定将要选择哪一个链接。为了选择给定的链接,来自候选项的名称的字节必须与标示去到该特定链接的过渡的所有字节相匹配。同样地,在这样的树中,树的每一个节点具有由规定如何到达该节点的字节序列构成的名称。举例来说,节点309的名称可以是“347”,这是因为其表示一个基本数据单元的群组(例如单元311和312),其中所述基本数据单元的名称的3个开头字节是“347”。在使用其名称的开头3个字节是347的候选单元对树进行查找时,该数据模式导致树导航处理到达如图3B中示出的节点309。同样地,在树中的当前单元混合当中唯一地标识单元的来自该单元的字节子集是从根节点去到该基本数据单元的“路径”。例如在图3B中,字节序列3475导致基本数据单元312,并且在该例中示出的基本数据单元混合当中唯一地标识基本数据单元312。It should be mentioned that during tree navigation (using candidates from a given candidate unit), when reaching any parent node in the tree, the tree navigation process needs to ensure that enough bytes from the candidate unit's name are examined to definitively determine which link to choose. To select a given link, the bytes from the candidate unit's name must match all bytes indicating the transition to that particular link. Similarly, in such a tree, each node has a name consisting of a sequence of bytes specifying how to reach that node. For example, the name of node 309 could be "347" because it represents a group of basic data units (e.g., units 311 and 312) whose names begin with "347". This data pattern causes the tree navigation process to reach node 309 as shown in Figure 3B when searching the tree using candidate units whose names begin with 347. Likewise, the subset of bytes from a unit that uniquely identifies a unit within the current mix of units in the tree is the "path" from the root node to that basic data unit. For example, in Figure 3B, byte sequence 3475 results in basic data unit 312, and basic data unit 312 is uniquely identified in the basic data unit mix shown in this example.

对于多样并且稀疏的数据,可以证明图3B中的树结构比图3A的前缀树结构更加灵活和紧凑。For diverse and sparse data, it can be demonstrated that the tree structure in Figure 3B is more flexible and compact than the prefix tree structure in Figure 3A.

图3C示出了可以被用来基于其名称对基本数据单元进行组织的另一种数据组织系统。在图3C所示出的实例中,每一个基本数据单元具有从该基本数据单元的整个内容构造的独特名称,该名称简单地由基本数据单元的所有字节构成,这些字节按照其存在于基本数据单元中的相同顺序出现在名称中。图3C示出了(针对图3B中所描述的组织的)另一种变型,其使得树更加紧凑,并且(在必要和/或有用时)通过使用规则表达法来规定导致各种链接的来自基本数据单元名称的值而对子树中的单元进行分组。通过使用规则表达法允许对相同子树下的在相应字节上共享相同表达法的单元进行高效的分组;随后可以是对于子树内的不同基本数据单元的更加局部的歧义消除。此外,通过使用规则表达法允许以更加紧凑的方式来描述把单元映射到任何下方的子树所需要的字节的值。这样进一步减少了规定树所需要的字节数目。举例来说,规则表达法318规定28个接连的"F"的模式;如果在树导航期间遵循该链接,我们可以到达包括模式320的单元314,所述模式320具有依照规则表达法318的28个接连的"F"。同样地,到达单元316的路径具有使用规定具有16个接连的"0"的模式的规则表达法的链接或分支。对于这样的树,树导航处理需要检测并执行这样的规则表达法以便确定将要选择哪一个链接。Figure 3C illustrates another data organization system that can be used to organize basic data units based on their names. In the example shown in Figure 3C, each basic data unit has a unique name constructed from its entire content, simply consisting of all the bytes of the basic data unit, which appear in the name in the same order they are present in the basic data unit. Figure 3C shows another variation (of the organization described in Figure 3B) that makes the tree more compact and (where necessary and/or useful) groups units in subtrees by using regular expressions to specify the values from the basic data unit names that lead to various links. Using regular expressions allows for efficient grouping of units under the same subtree that share the same expression on the corresponding bytes; this can subsequently be followed by more localized disambiguation of different basic data units within a subtree. Furthermore, using regular expressions allows for a more compact description of the byte values required to map a unit to any of the subtrees below. This further reduces the number of bytes required to specify the tree. For example, regular expression 318 specifies a pattern of 28 consecutive "F"s; if this link is followed during tree navigation, we can reach unit 314, which includes pattern 320, having 28 consecutive "F"s in accordance with regular expression 318. Similarly, the path to unit 316 has links or branches using a regular expression specifying a pattern of 16 consecutive "0"s. For such a tree, the tree navigation process needs to detect and evaluate such regular expressions in order to determine which link to select.
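As an illustrative sketch (not the patent's implementation), the regular-expression-labeled links described above can be navigated as follows; the link table, pattern strings, and subtree labels are hypothetical stand-ins for the structures of Figure 3C:

```python
import re

# Hypothetical link table for one node of the Figure 3C tree: each link is
# (regular expression, number of name characters it consumes, child subtree).
LINKS = [
    ("F{28}", 28, "subtree_314"),  # 28 consecutive "F"s, as in pattern 320
    ("0{16}", 16, "subtree_316"),  # 16 consecutive "0"s on the path to unit 316
]

def select_link(name, pos, links):
    """Pick the outgoing link whose regular expression matches the next
    characters of the unit's name; return (child, new position), or
    (None, pos) when no link matches."""
    for pattern, length, child in links:
        if re.fullmatch(pattern, name[pos:pos + length]):
            return child, pos + length
    return None, pos
```

During navigation the matched characters are consumed from the name exactly as literal distinguishing bytes would be; a failed match against every outgoing link is treated as a miss.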

图3D示出了可以被用来基于其名称对基本数据单元进行组织的另一种数据组织系统。在图3D所示出的实例中,每一个基本数据单元具有从该基本数据单元的整个内容构造的独特名称。对每一个单元应用指纹处理方法,以便识别包含评估到所选指纹的内容的字段的位置。在单元中找到的第一指纹的位置处的字段被作为维度对待,来自该字段的特定数目的字节(例如x个字节,其中x显著小于单元中的字节数目)被提取出来并且被用作单元名称的开头字节,名称的其余字节由基本数据单元的其余字节构成,并且按照其存在于基本数据单元中的相同循环顺序出现。该名称被用来对树中的基本数据单元进行组织。在该例中,当在单元中没有检测到指纹时,通过简单地按照其存在于单元中的顺序使用单元的所有字节而制订名称。一个单独的子树(通过表明没有找到指纹的指示标示)基于其名称保持并组织所有这样的单元。Figure 3D illustrates another data organization system that can be used to organize basic data units based on their names. In the example shown in Figure 3D, each basic data unit has a unique name constructed from its entire content. A fingerprinting method is applied to each unit to identify the location of a field containing the content evaluated to a selected fingerprint. The field at the location of the first fingerprint found in the unit is treated as a dimension, and a specific number of bytes from that field (e.g., x bytes, where x is significantly smaller than the number of bytes in the unit) are extracted and used as the first byte of the unit name. The remaining bytes of the name are composed of the remaining bytes of the basic data unit and appear in the same cyclic order as they exist in the basic data unit. This name is used to organize the basic data units in the tree. In this example, when no fingerprint is detected in the unit, the name is formulated simply by using all the bytes of the unit in the order they exist in the unit. A separate subtree (by an indicator showing that no fingerprint was found) maintains and organizes all such units based on their names.

如图3D中所示,可以对单元338(其包含t个字节的数据,也就是B1B2B3…Bt)应用指纹处理技术,以便在标识将被选择成"维度1"的字节Bi+1处获得指纹位置"指纹1"。接下来,可以提取来自由"指纹1"标识的位置的x个字节以形成"维度1",并且这x个字节可以被用作图3D中的每一个单元的名称的开头字节N1N2…Nx。随后串联来自单元338的t-x个字节(从Bi+x+1开始并且随后绕回到B1B2B3…Bi)并且将其用作名称的其余字节Nx+1Nx+2…Nt。当在单元中没有找到指纹时,名称N1N2…Nt简单地是来自单元338的B1B2B3…Bt。使用其名称在树中对基本数据单元进行分拣和组织。举例来说,在使用路径13654…06经历树的两个等级之后识别出并且到达基本数据单元(PDE)330,其中字节13654…0是作为来自维度1的字节的N1N2…Nx。从根部沿着链接334(通过表明没有找到指纹的指示标示)到达的节点335处的单独子树保持并组织其内容未评估到所选指纹的所有基本数据单元。因此,在这种组织中,一些链接(例如链接336)可以使用由按照与单元中相同的顺序出现的单元的字节构成的名称对单元进行组织,其他链接(例如链接340)则可以使用利用指纹制订的名称对单元进行组织。As shown in Figure 3D, a fingerprinting technique can be applied to unit 338 (which contains t bytes of data, namely B1B2B3…Bt) to obtain the fingerprint position "fingerprint 1" at byte Bi+1, which identifies the field to be selected as "dimension 1". Next, x bytes from the position identified by "fingerprint 1" can be extracted to form "dimension 1", and these x bytes can be used as the leading bytes N1N2…Nx of the name of each unit in Figure 3D. The remaining t-x bytes from unit 338 (starting from Bi+x+1 and then wrapping back to B1B2B3…Bi) are then concatenated and used as the remaining bytes Nx+1Nx+2…Nt of the name. When no fingerprint is found in a unit, the name N1N2…Nt is simply B1B2B3…Bt from unit 338. The basic data units are sorted and organized in the tree using their names. For example, after traversing two levels of the tree using path 13654…06, basic data unit (PDE) 330 is identified and reached, where the bytes 13654…0 are N1N2…Nx, the bytes from dimension 1. A separate subtree at node 335, reached from the root along link 334 (marked by an indicator showing that no fingerprint was found), holds and organizes all basic data units whose content did not evaluate to the selected fingerprint. Thus, in this organization, some links (e.g., link 336) can organize units using names consisting of the bytes of the unit appearing in the same order as in the unit, while other links (e.g., link 340) can organize units using names formulated with fingerprints.
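A minimal sketch of this naming scheme, assuming a 0-based fingerprint field offset and illustrative inputs (the function name is not from the patent):

```python
def make_name_3d(unit, fp_pos, x):
    """Form a unit's name per the Figure 3D scheme: x bytes at the field
    located by the fingerprint lead the name, and the remaining t-x bytes
    follow in cyclic order; with no fingerprint, the name is the unit itself."""
    if fp_pos is None:
        return unit                               # no fingerprint found
    dim1 = unit[fp_pos:fp_pos + x]                # bytes of "dimension 1"
    rest = unit[fp_pos + x:] + unit[:fp_pos]      # wrap around to the start
    return dim1 + rest
```

For example, for unit b"ABCDEFGH" with the fingerprint field at offset 2 and x = 3, the name is b"CDEFGHAB": the dimension bytes "CDE" lead, followed by "FGH" and finally the wrapped-around prefix "AB".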

在接收到候选单元时,所述处理应用前面描述的相同技术来确定候选单元的名称,并且使用该名称对树进行导航以进行内容关联查找。因此,对基本数据单元(在其被安装到树中时)和候选单元(在从解析器和因式分解器接收到候选单元时)应用相同并且一致的处理,以便创建其名称。树导航处理使用候选单元的名称对树进行导航。在该实施例中,如果在候选单元中没有找到指纹,则树导航处理沿着组织并包含其内容未评估到指纹的基本数据单元的子树向下导航。Upon receiving a candidate unit, the process applies the same techniques described above to determine the name of the candidate unit and uses that name to navigate the tree for content-associative lookups. Therefore, the same and consistent processing is applied to both basic data units (when they are installed into the tree) and candidate units (when they are received from the parser and factorizer) to create their names. The tree navigation process uses the name of the candidate unit to navigate the tree. In this embodiment, if no fingerprint is found in a candidate unit, the tree navigation process navigates down the subtree that organizes and contains the basic data units whose content did not evaluate to the fingerprint.

图3E示出了可以被用来基于其名称对基本数据单元进行组织的另一种数据组织系统。在图3E所示出的实例中,每一个基本数据单元具有从该基本数据单元的整个内容构造的独特名称。对每一个单元应用指纹处理方法,以便识别包含评估到两个指纹当中的任一个的内容的字段的位置。单元中的第一指纹(图3E中的指纹1)的第一次出现位置处的字段被作为第一维度(维度1)对待,第二指纹(图3E中的指纹2)的第一次出现位置处的字段被作为第二维度(维度2)对待。使用指纹处理寻找单元上的两个不同指纹导致四种可能的情形:(1)在单元中找到全部两个指纹,(2)找到指纹1但是没有找到指纹2,(3)找到指纹2但是没有找到指纹1,以及(4)没有找到指纹。基本数据单元可以被分组到对应于每一种情形的4个子树中。在图3E中,“FP1”标示指纹1的存在,“FP2”标示指纹2的存在,“~FP1”标示指纹1的缺失,并且“~FP2”标示指纹2的缺失。Figure 3E illustrates another data organization system that can be used to organize basic data units based on their names. In the example shown in Figure 3E, each basic data unit has a unique name constructed from the entire content of that basic data unit. A fingerprinting method is applied to each unit to identify the location of a field containing content evaluated to either of two fingerprints. The field at the first occurrence of the first fingerprint in the unit (fingerprint 1 in Figure 3E) is treated as the first dimension (dimension 1), and the field at the first occurrence of the second fingerprint (fingerprint 2 in Figure 3E) is treated as the second dimension (dimension 2). Using fingerprinting to find two different fingerprints on a unit results in four possible scenarios: (1) both fingerprints are found in the unit, (2) fingerprint 1 is found but fingerprint 2 is not found, (3) fingerprint 2 is found but fingerprint 1 is not found, and (4) no fingerprint is found. Basic data units can be grouped into four subtrees corresponding to each scenario. In Figure 3E, “FP1” indicates the presence of fingerprint 1, “FP2” indicates the presence of fingerprint 2, “~FP1” indicates the absence of fingerprint 1, and “~FP2” indicates the absence of fingerprint 2.

对于所述4种情形当中的每一种,如下创建单元的名称:(1)当找到全部两个指纹时,可以提取来自由"指纹1"标识的位置的x个字节以形成"维度1",并且可以提取来自由"指纹2"标识的位置的y个字节以形成"维度2",这x+y个字节可以被用作图3E中的每一个这样的单元的名称的开头字节N1N2…Nx+y。随后通过循环方式提取来自单元348的其余的t-(x+y)个字节(在来自第一维度的字节之后开始),并且将其串联并且用作名称的其余字节Nx+y+1Nx+y+2…Nt。(2)当找到指纹1但是没有找到指纹2时,可以提取来自由"指纹1"标识的位置的x个字节以形成在前维度,并且这x个字节可以被用作每一个这样的单元的名称的开头字节N1N2…Nx。随后串联来自单元348的其余的t-x个字节(从Bi+x+1开始并且随后绕回到B1B2B3…Bi),并且将其用作名称的其余字节Nx+1Nx+2…Nt。(3)当找到指纹2但是没有找到指纹1时,可以提取来自由"指纹2"标识的位置的y个字节以形成在前维度,并且这y个字节可以被用作每一个这样的单元的名称的开头字节N1N2…Ny。随后串联来自单元348的其余的t-y个字节(从Bj+y+1开始并且随后绕回到B1B2B3…Bj),并且将其用作名称的其余字节Ny+1Ny+2…Nt。(4)当在单元中没有找到指纹时,名称N1N2…Nt简单地是来自单元348的B1B2B3…Bt。因此对于这4种情形当中的每一种存在单独的子树。对于所述四种情形可以如下概括用以提取针对单元348的名称(N1N2N3…Nt)的处理:For each of the four cases, the unit's name is created as follows: (1) When both fingerprints are found, x bytes from the location identified by "fingerprint 1" can be extracted to form "dimension 1", and y bytes from the location identified by "fingerprint 2" can be extracted to form "dimension 2". These x+y bytes can be used as the leading bytes N1N2…Nx+y of the name of each such unit in Figure 3E. Then, the remaining t-(x+y) bytes from unit 348 (starting after the bytes from the first dimension) are extracted in a cyclic manner, concatenated, and used as the remaining bytes Nx+y+1Nx+y+2…Nt of the name. (2) When fingerprint 1 is found but fingerprint 2 is not found, x bytes from the location identified by "fingerprint 1" can be extracted to form the leading dimension, and these x bytes can be used as the leading bytes N1N2…Nx of the name of each such unit. The remaining t-x bytes from unit 348 (starting from Bi+x+1 and then wrapping back to B1B2B3…Bi) are then concatenated and used as the remaining bytes Nx+1Nx+2…Nt of the name. (3) When fingerprint 2 is found but fingerprint 1 is not found, y bytes from the location identified by "fingerprint 2" can be extracted to form the leading dimension, and these y bytes can be used as the leading bytes N1N2…Ny of the name of each such unit. The remaining t-y bytes from unit 348 (starting from Bj+y+1 and then wrapping back to B1B2B3…Bj) are then concatenated and used as the remaining bytes Ny+1Ny+2…Nt of the name. (4) When no fingerprint is found in the unit, the name N1N2…Nt is simply B1B2B3…Bt from unit 348. Therefore, each of these four cases has a separate subtree. The processing for extracting the name (N1N2N3…Nt) for unit 348 can be summarized for the four cases as follows:

(1)指纹1和指纹2都被找到:(1) Both fingerprint 1 and fingerprint 2 have been found:

N1-Nx←Bi+1–Bi+x=来自维度1的x个字节N 1 -N x ←B i+1 –B i+x = x bytes from dimension 1

Nx+1–Nx+y←Bj+1–Bj+y=来自维度2的y个字节N x+1N x+y ← B j+1B j+y = y bytes from dimension 2

Nx+y+1…Nt=其余的字节(来自大小为t个字节的候选单元)=Bi+x+1Bi+x+2Bi+x+3...BjBj+y+1Bj+y+2Bj+y+3...BtB1B2B3...Bi N x+y+1 …N t = the remaining bytes (from a candidate unit of size t bytes) = B i+x+1 B i+x+2 B i+x+3 ...B j B j+y+1 B j+y+2 B j+y+3 ...B t B 1 B 2 B 3 ...B i

(2)找到指纹1但是没有找到指纹2:(2) Fingerprint 1 was found, but fingerprint 2 was not found:

N1-Nx←Bi+1–Bi+x=来自维度1的x个字节N 1 -N x ←B i+1 –B i+x = x bytes from dimension 1

Nx+1…Nt=其余的字节(来自大小为t个字节的候选单元)=Bi+x+1Bi+x+2Bi+x+3...BtB1B2B3...Bi N x+1 …N t = the remaining bytes (from a candidate unit of size t bytes) = B i+x+1 B i+x+2 B i+x+3 ...B t B 1 B 2 B 3 ...B i

(3)找到指纹2但是没有找到指纹1:(3) Fingerprint 2 was found, but fingerprint 1 was not found:

N1–Ny←Bj+1–Bj+y=来自维度2的y个字节N 1 –N y ←B j+1 –B j+y = y bytes from dimension 2

Ny+1…Nt=其余的字节(来自大小为t个字节的候选单元)=Bj+y+1Bj+y+2Bj+y+3...BtB1B2B3...Bj N y+1 …N t = the remaining bytes (from a candidate unit of size t bytes) = B j+y+1 B j+y+2 B j+y+3 ...B t B 1 B 2 B 3 ...B j

(4)没有找到指纹:(4) No fingerprints found:

N1-Nt←B1–Bt N 1 -N t ←B 1 –B t
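The four cases above can be sketched as one function. This is a hedged illustration, not the patent's implementation: positions i and j are 0-based offsets of the dimension-1 and dimension-2 fields, and case 1 assumes the dimension-1 field precedes the dimension-2 field, matching the byte order of the formula:

```python
def make_name_3e(unit, i=None, x=0, j=None, y=0):
    """Return the name N1...Nt for the four cases of Figure 3E."""
    if i is not None and j is not None:          # case 1: both fingerprints
        dims = unit[i:i + x] + unit[j:j + y]     # dimension 1 then dimension 2
        rest = unit[i + x:j] + unit[j + y:] + unit[:i]
        return dims + rest
    if i is not None:                            # case 2: only fingerprint 1
        return unit[i:i + x] + unit[i + x:] + unit[:i]
    if j is not None:                            # case 3: only fingerprint 2
        return unit[j:j + y] + unit[j + y:] + unit[:j]
    return unit                                  # case 4: no fingerprint
```

In every case the name is a permutation of the unit's t bytes, so the name and the unit have the same length.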

在接收到候选单元时,所述处理应用前面描述的相同技术以确定候选单元的名称。在该实施例中,对候选单元应用前面描述的4种名称构造方法(取决于是否找到指纹1和指纹2),正如在被输入到滤筛中时对其所应用的那样。因此,对基本数据单元(在其被安装到树中时)和候选单元(在从解析器和因式分解器接收到候选单元时)应用相同并且一致的处理,以便创建其名称。树导航处理使用候选单元的名称对树进行导航以便进行内容关联查找。Upon receiving a candidate unit, the processing applies the same techniques described previously to determine the name of the candidate unit. In this embodiment, the four name construction methods described previously (depending on whether fingerprint 1 and fingerprint 2 are found) are applied to the candidate unit, just as they are applied when it is input into the filter. Therefore, the same and consistent processing is applied to basic data units (when they are installed in the tree) and candidate units (when they are received from the parser and factorizer) to create their names. The tree navigation process uses the names of the candidate units to navigate the tree for content-related lookups.

如果内容关联查找是成功的,则将产生在特定维度的位置处与候选单元具有相同模式的基本数据单元。举例来说,如果在候选单元中找到全部两个指纹,则树导航处理将从根节点开始沿着树的链接354向下。如果候选单元具有作为"维度1"的模式"99…3"和作为"维度2"的模式"7…5",则树导航处理将到达节点334。这样就到达包含可能是导出目标的两个基本数据单元(PDE 352和PDE 353)的子树。实施附加的分析和筛选(通过首先检查元数据,并且需要的话通过随后获取和检查实际的基本数据单元)以便确定哪一个基本数据单元最适合于导出。因此,这里所描述的实施例识别出可以被使用在滤筛中的多种树结构。可以采用这样的结构或者其变型的组合以便对基本数据单元进行组织。一些实施例通过树形式来组织基本数据单元,其中单元的整个内容被用作该单元的名称。但是各个字节出现在单元名称中的序列不一定是所述字节出现在单元中的序列。单元的特定字段被作为维度提取出来,并且被用来形成名称的开头字节,单元的其余字节构成名称的其余部分。使用这些名称在滤筛中通过树形式对单元进行排序。名称的开头数位被用来区分树的更高分支(或链接),其余数位被用来逐渐地区分树的所有分支(或链接)。树的每一个节点可以具有从该节点发出的不同数目的链接。此外,来自一个节点的每一个链接可以通过不同数目的字节被区分和标示,并且通过使用规则表达法以及用以表达其规范的其他有力方式可以实现对于这些字节的描述。所有这些特征导致紧凑的树结构。对各个基本数据单元的引用驻留在树的叶节点处。If the content-associative lookup is successful, it will yield basic data units that have the same patterns as the candidate unit at the positions of the specific dimensions. For example, if both fingerprints are found in the candidate unit, the tree navigation process will start from the root node and proceed downwards along link 354 of the tree. If the candidate unit has the pattern "99…3" as "dimension 1" and the pattern "7…5" as "dimension 2", the tree navigation process will reach node 334. This reaches a subtree containing two basic data units (PDE 352 and PDE 353) that may be derivation targets. Additional analysis and screening are then performed (by first examining the metadata and, if needed, by subsequently fetching and examining the actual basic data units) to determine which basic data unit is best suited for the derivation. Thus, the embodiments described herein identify a variety of tree structures that can be used in the filter. Such structures, or combinations of their variations, can be employed to organize the basic data units. Some embodiments organize the basic data units in tree form, where the entire content of a unit is used as the name of that unit. However, the sequence in which the bytes appear in the unit's name is not necessarily the sequence in which those bytes appear in the unit.
Specific fields of a cell are extracted as dimensions and used to form the first byte of its name, with the remaining bytes forming the rest of the name. These names are used to sort cells in a tree structure during filtering. The first few digits of the name are used to distinguish higher branches (or links) of the tree, and the remaining digits are used to progressively distinguish all branches (or links) of the tree. Each node in the tree can have a different number of links emanating from it. Furthermore, each link from a node can be distinguished and identified by a different number of bytes, and these bytes can be described using regular notations and other powerful methods to express their specifications. All these features result in a compact tree structure. References to the basic data units of each cell reside at the leaf nodes of the tree.

在一个实施例中,可以对构成基本数据单元的字节应用指纹处理方法。驻留在通过指纹识别出的位置处的一定数目的字节可以被用来构成单元名称的一个分量。可以组合一个或多个分量以便提供一个维度。多个指纹可以被用来识别多个维度。这些维度被串联并且被用作单元名称的开头字节,单元的其余字节构成单元名称的其余部分。由于维度位于通过指纹识别出的位置处,因此提高了从来自每一个单元的一致的内容形成名称的可能性。在通过指纹定位的字段处具有相同内容值的单元将沿着树的相同枝干被分组在一起。通过这种方式,类似的单元将在树数据结构中被分组在一起。通过使用其名称的替换制订,可以把没有在其中找到指纹的单元一起分组在单独的子树中。In one embodiment, a fingerprinting method can be applied to the bytes that constitute the basic data unit. A certain number of bytes residing at the location identified by the fingerprint can be used to form a component of the unit name. One or more components can be combined to provide a dimension. Multiple fingerprints can be used to identify multiple dimensions. These dimensions are concatenated and used as the first byte of the unit name, with the remaining bytes of the unit constituting the rest of the unit name. Because the dimensions are located at the location identified by the fingerprint, the likelihood of forming a name from consistent content from each unit is increased. Units with the same content value at the field located by the fingerprint will be grouped together along the same branch of the tree. In this way, similar units will be grouped together in the tree data structure. By using substitution of their names, units in which no fingerprint was found can be grouped together in separate subtrees.

在一个实施例中,可以对单元的内容应用指纹处理方法,以便确定单元内容内的(前面所描述的)骨架数据结构的各个分量(或签名)的位置。或者可以选择单元内容内部的特定的固定偏移量以定位分量。还可以采用其他方法来定位单元的骨架数据结构的分量,其中包括而不限于对单元进行解析以便检测所声明的特定结构并且定位该结构内的分量。骨架数据结构的各个分量可以被视为维度,从而使用这些维度的串联以及随后的每一个单元的其余内容来创建每一个单元的名称。名称被用来对树中的基本数据单元进行排序和组织。In one embodiment, a fingerprinting method can be applied to the content of a cell to determine the location of the individual components (or signatures) of the skeleton data structure (described above) within the cell content. Alternatively, a specific fixed offset within the cell content can be selected to locate the components. Other methods can also be used to locate the components of the skeleton data structure of a cell, including, but not limited to, parsing the cell to detect a specific declared structure and locate the components within that structure. The individual components of the skeleton data structure can be viewed as dimensions, and the concatenation of these dimensions, along with the remaining content of each subsequent cell, is used to create a name for each cell. The names are used to sort and organize the basic data units in the tree.

在另一个实施例中,对单元进行解析以便检测单元中的特定结构。该结构中的特定字段被识别成维度。多个这样的维度被串联并且被用作名称的开头字节,单元的其余字节构成单元名称的其余部分。由于维度位于通过对单元进行解析并且检测其结构而识别出的位置处,因此提高了从来自每一个单元的一致的内容形成名称的可能性。在通过所述解析定位的字段处具有相同内容值的单元将沿着树的相同枝干被分组在一起。通过这种方式,同样地,类似的单元将在树数据结构中被分组在一起。In another embodiment, cells are parsed to detect specific structures within them. Specific fields within this structure are identified as dimensions. Multiple such dimensions are concatenated and used as the first byte of a name, with the remaining bytes of the cell constituting the rest of the cell name. Because the dimensions are located at positions identified by parsing the cells and detecting their structure, the likelihood of forming names from consistent content from each cell is increased. Cells with the same content value at the field located by the parsing will be grouped together along the same branch of the tree. Similarly, in this way, similar cells will be grouped together in the tree data structure.

在一些实施例中,树数据结构中的每一个节点包括自描述规范。树节点具有一个或多个子代。每一个子代条目包含关于去到该子代的链接上的区分字节的信息以及对该子代节点的引用。子代节点可以是树节点或叶节点。图3F给出了根据这里所描述的一些实施例的自描述树节点数据结构。图3F中示出的树节点数据结构规定(A)关于从根节点到该树节点的路径的信息,包括所有以下分量或者其中的一个子集:用以到达该树节点的来自名称的实际字节序列,从根节点到达该节点所消耗的名称的字节数目,关于所消耗的该字节数目是否大于某一预先规定的阈值的指示,以及描述去到该节点的路径并且对于树的内容关联搜索以及对于涉及树的构造的决定是有用的其他元数据,(B)该节点所具有的子代的数目,以及(C)对于每一个子代(其中每一个子代对应于树的一个分支)规定(1)子代ID,(2)为了沿着树的该链接向下过渡所需要的来自名称的后继字节的区分字节的数目,(3)对于沿着该链接向下的来自名称的字节的实际值的规定,以及(4)对该子代节点的引用。In some embodiments, each node in the tree data structure includes a self-describing specification. A tree node has one or more children. Each child entry contains distinguishing bytes about the links to that child and a reference to that child node. Child nodes can be tree nodes or leaf nodes. Figure 3F illustrates a self-describing tree node data structure according to some embodiments described herein. The tree node data structure shown in Figure 3F specifies (A) information about the path from the root node to the tree node, including all or a subset of the following components: the actual byte sequence from the name used to reach the tree node, the number of bytes of the name consumed to reach the node from the root node, an indication of whether the number of bytes consumed is greater than a certain predefined threshold, and other metadata describing the path to the node and useful for content association searches of the tree and for decisions involving the construction of the tree, (B) the number of children the node has, and (C) for each child (where each child corresponds to a branch of the tree), specifying (1) the child ID, (2) the number of distinguishing bytes from the name required to transition down the link of the tree, (3) the actual value of the bytes from the name for the downward transition along the link, and (4) a reference to the child node.
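One way to render the self-describing tree node of Figure 3F in code; the field names and types below are illustrative choices, not the patent's on-disk layout:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ChildEntry:
    child_id: int
    num_distinguishing_bytes: int  # name bytes consumed to take this link
    byte_values: bytes             # actual values (could also be a regex or range spec)
    node: Optional[object] = None  # reference to the child tree node or leaf node

@dataclass
class TreeNode:
    path_bytes: bytes              # name bytes consumed from the root to this node
    bytes_consumed: int            # how many name bytes that path used
    over_threshold: bool           # does the consumed count exceed the preset threshold?
    children: List[ChildEntry] = field(default_factory=list)
```

A leaf node would carry the same path information, but its child entries reference basic data units and add per-unit metadata (duplicate/derivation counts, unit size), as Figure 3G describes.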

图3G给出了根据这里所描述的一些实施例的自描述叶节点数据结构。叶节点具有一个或多个子代。每一个子代是去到一个基本数据单元的链接。每一个子代条目包含关于去到该基本数据单元的链接上的区分字节的信息,对该基本数据单元的引用,重复和导出项的计数,以及关于该基本数据单元的其他元数据。图3G中示出的叶节点数据结构规定(A)关于从根节点到该叶节点的路径的信息,包括所有以下组成部分或者其中的一个子集:用以到达该叶节点的来自名称的实际字节序列,从根节点到达该节点所消耗的名称的字节数目,关于所消耗的该字节数目是否大于某一预先规定的阈值的指示,以及描述去到该节点的路径并且对于树的内容关联搜索以及对于涉及树的构造的决定是有用的其他元数据,(B)该节点所具有的子代的数目,以及(C)对于每一个子代(其中每一个子代对应于该叶节点下方的一个基本数据单元)规定(1)子代ID,(2)为了沿着树的该链接向下过渡到一个基本数据单元所需要的来自名称的后继字节的区分字节的数目,(3)对于沿着该枝干向下的来自名称的字节的实际值的规定,(4)对在树的该路径上终止树的基本数据单元的引用,(5)关于多少重复和导出项指向该基本数据单元的计数(这被用来确定在删除存储系统中的数据时是否可以从滤筛中删除条目),以及(6)包括基本数据单元的大小等等的对应于基本数据单元的其他元数据。Figure 3G illustrates a self-describing leaf node data structure according to some embodiments described herein. A leaf node has one or more children. Each child is a link to a basic data unit. Each child entry contains information about the distinguishing bytes on the link to that basic data unit, references to that basic data unit, counts of duplicates and derivations, and other metadata about that basic data unit. The leaf node data structure shown in Figure 3G specifies (A) information about the path from the root node to the leaf node, including all or a subset of the following components: the actual byte sequence from the name used to reach the leaf node, the number of bytes of the name consumed to reach the node from the root node, an indication of whether the number of bytes consumed is greater than a predefined threshold, and other metadata describing the path to the node and useful for content association searches of the tree and for decisions involving the construction of the tree, (B) the number of children the node has, and (C) for each child (where each child corresponds to the leaf node). 
The basic data unit below the point specifies (1) the child ID, (2) the number of distinguishing bytes from the name required to transition down the tree to a basic data unit, (3) the actual value of the bytes from the name down the branch, (4) the reference to the basic data unit that terminates the tree on the path, (5) the count of how many duplicates and derived items point to the basic data unit (this is used to determine whether an entry can be removed from the filter when deleting data in the storage system), and (6) other metadata corresponding to the basic data unit, including the size of the basic data unit, etc.

为了提高新鲜的基本数据单元被安装到树中的效率,一些实施例把一个附加的字段合并到被保持在树的叶节点处的对应于每一个基本数据单元的叶节点数据结构中。应当提到的是,当必须把新鲜的单元插入到树中时,可能需要所讨论的子树中的每一个基本数据单元的名称或内容的附加字节以便决定将把所述新鲜单元插入到子树中的何处,或者是否触发子树的进一步分割。对于这些附加字节的需求可能需要获取其中几个所讨论的基本数据单元,以便对于这些单元当中的每一个提取出关于所述新鲜单元的相关区分字节。为了减少并且优化(并且在某些情况下完全消除)对于这一任务所需的IO的次数,叶节点中的数据结构包括来自该叶节点下方的每一个基本数据单元的名称的特定数目的附加字节。这些附加字节被称作导航前瞻字节,并且帮助关于新鲜的传入单元对基本数据单元进行分拣。对应于给定的基本数据单元的导航前瞻字节在该基本数据单元被安装到滤筛中时被安装到叶节点结构中。可以使用多种标准静态地或者动态地选择将为此目的保留的字节数目,其中包括所涉及的子树的深度以及该子树中的基本数据单元的密度。例如对于正被安装在树的较浅等级的基本数据单元,解决方案可以添加比驻留在非常深的树中的基本数据单元更长的导航前瞻字节。此外,当新鲜的单元正被安装到滤筛中时,并且如果在现有的目标子树中已经有许多导航前瞻字节(从而提高了即将发生再分割的可能性),则在所述新鲜的基本数据单元被安装到子树中时可以为之保留附加的导航前瞻字节。To improve the efficiency of installing fresh basic data units into the tree, some embodiments incorporate an additional field into the leaf node data structure corresponding to each basic data unit, which is maintained at the leaf nodes of the tree. It should be noted that when a fresh unit must be inserted into the tree, additional bytes of the name or content of each basic data unit in the subtree in question may be needed to determine where the fresh unit will be inserted into the subtree, or whether to trigger further subtree splitting. The need for these additional bytes may require fetching several of the basic data units in question to extract relevant distinguishing bytes about the fresh unit for each of these units. To reduce and optimize (and in some cases eliminate) the number of I/O operations required for this task, the data structure in the leaf node includes a specific number of additional bytes from the name of each basic data unit below that leaf node. These additional bytes are called navigation look-ahead bytes and help sort the basic data units about the incoming fresh units. The navigation look-ahead bytes corresponding to a given basic data unit are installed into the leaf node structure when that basic data unit is installed into the filter. 
The number of bytes to be reserved for this purpose can be selected statically or dynamically using various criteria, including the depth of the subtree involved and the density of basic data units within that subtree. For example, for basic data units being installed at a shallower level of the tree, the solution can add longer lookahead bytes than for basic data units residing in a very deep tree. Furthermore, additional lookahead bytes can be reserved for fresh basic data units being installed into the subtree if there are already many lookahead bytes in the existing target subtree (thus increasing the likelihood of an impending resegmentation).
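A small sketch of how the cached look-ahead bytes can resolve an insertion without fetching the stored units; the entry layout and names are hypothetical:

```python
def find_slot(leaf_entries, fresh_name, consumed):
    """leaf_entries: list of (lookahead_bytes, pde_ref) held in the leaf node;
    `consumed` is how many name bytes were already used to reach this leaf.
    Returns the matching unit's reference, or None when a new entry (or a
    subtree split) is needed."""
    for lookahead, pde in leaf_entries:
        if fresh_name[consumed:consumed + len(lookahead)] == lookahead:
            return pde  # cached distinguishing bytes match; no unit fetch (IO) needed
    return None
```

Only when the cached bytes are exhausted without disambiguating would the actual basic data units need to be fetched, which is the IO this field is designed to avoid.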

图3H给出了包括导航前瞻字段的对应于叶节点的叶节点数据结构。该数据结构规定(A)关于从根节点到该叶节点的路径的信息,包括所有以下组成部分或者其中的一个子集:用以到达该叶节点的来自名称的实际字节序列,从根节点到达该节点所消耗的名称的字节数目,关于所消耗的该字节数目是否大于某一预先规定的阈值的指示,以及描述去到该节点的路径并且对于树的内容关联搜索以及对于涉及树的构造的决定是有用的其他元数据,(B)该节点所具有的子代的数目,以及(C)对于每一个子代(其中每一个子代对应于该叶节点下方的一个基本数据单元)规定(1)子代ID,(2)为了沿着树的该链接向下过渡到一个基本数据单元所需要的来自名称的后继字节的区分字节的数目,(3)对于沿着该枝干向下的字节的实际值的规定,(4)对在树的该路径上终止树的基本数据单元的引用,(5)规定为所述基本数据单元保留多少导航前瞻字节的导航前瞻字段以及这些字节的实际值,(6)关于多少重复和导出项指向该基本数据单元的计数(这被用来确定在删除存储系统中的数据时是否可以从滤筛中删除条目),以及(7)包括基本数据单元的大小等等的对应于基本数据单元的其他元数据。Figure 3H shows the leaf node data structure corresponding to the leaf node, including the navigation look-ahead field. This data structure specifies (A) information about the path from the root node to the leaf node, including all or a subset of the following components: the actual byte sequence from the name used to reach the leaf node, the number of bytes of the name consumed to reach the node from the root node, an indication of whether the number of bytes consumed exceeds a predefined threshold, and other metadata describing the path to the node and useful for content-related searches of the tree and for decisions involving the construction of the tree, (B) the number of children the node has, and (C) for each child (where each child corresponds to a basic data unit below the leaf node), specifying (1) the child ID, (2) The number of distinguishing bytes from the successor bytes of the name required to transition down the link of the tree to a basic data unit, (3) The specification of the actual value of the bytes going down the branch, (4) The reference to the basic data unit of the tree terminating on the path of the tree, (5) The specification of how many navigation lookahead bytes to reserve for the basic data unit and the actual value of these bytes, (6) The count of how many duplicates and derived items point to the basic data unit (which is used to determine whether an entry can be removed from the filter when deleting data in the storage system), and (7) Other metadata 
corresponding to the basic data unit, including the size of the basic data unit, etc.

在一些实施例中,树的各个分支被用来把各个数据单元映射到各个群组或范围中,其中所述群组或范围是通过对沿着导向作为范围定界符的子代子树的链接的区分字节进行解释而形成的。该子代子树中的所有单元将使得单元中的相应字节的值小于或等于对于去到所述特定子代子树的链接所规定的区分字节的值。因此,每一个子树现在将表示其值落在特定范围内的一组单元。在给定的子树内,该树的每一个后续等级将逐渐地把单元集合划分成更小的范围。该实施例为图3F中示出的自描述树节点结构的组成部分提供了不同的解释。图3F中的N个子代通过其在树节点数据结构中的区分字节的值被排序,并且表示非重叠范围的有序序列。对于N个节点存在N+1个范围,最低的或第1个范围由小于或等于最小条目的值构成,并且第N+1个范围由大于第N个条目的值构成。第N+1个范围将被作为超出范围而对待,因此N个链接导向下方的N个子树或范围。In some embodiments, the branches of the tree are used to map individual data units to groups or ranges, which are formed by interpreting the distinguishing bytes of links along the child subtrees that act as range delimiters. All units in the child subtree will have a value in the corresponding byte less than or equal to the value of the distinguishing byte specified for the link to that particular child subtree. Thus, each subtree will now represent a group of units whose values fall within a specific range. Within a given subtree, each subsequent level of the tree progressively divides the set of units into smaller ranges. This embodiment provides a different interpretation of the components of the self-describing tree node structure shown in Figure 3F. The N children in Figure 3F are ordered by the values of their distinguishing bytes in the tree node data structure and represent an ordered sequence of non-overlapping ranges. For N nodes, there are N+1 ranges, the lowest or first range consisting of values less than or equal to the smallest entry, and the N+1th range consisting of values greater than the Nth entry. The N+1th range is treated as out of range, and thus the N links lead to the N subtrees or ranges below.

例如在图3F中,子代1定义最低范围并且使用6个字节(值为abef12d6743a)来区分其范围——对应于子代1的范围是从000000000000到abef12d6743a。如果候选单元的相应的6个字节落在该范围内(包括端值),则将选择对应于该子代的链接。如果候选单元的相应的6个开头字节大于范围定界符abef12d6743a,则子代1将不会被选择。为了检查候选单元是否落在对应于子代2的范围内必须满足两个条件——首先候选必须处于紧接着的前一个子代的范围之外(在该例中是子代1),其次其名称中的相应字节必须小于或等于对应于子代2的范围定界符。在该例中,对应于子代2的范围定界符由值为dcfa的2个字节描述。因此,对应于候选单元的2个相应字节必须小于或等于dcfa。使用这种方法,可以对照树节点中的所有子代对候选单元进行检查,以便确定候选单元落在所述N+1个范围当中的哪一个范围内。对于图3F中示出的实例,如果候选单元的名称的4个相应字节大于对应于子代N的链接的区分字节的值f3231929,则将检测到错失状况。For example, in Figure 3F, child 1 defines the lowest range and uses 6 bytes (with a value of abef12d6743a) to distinguish its range: the range corresponding to child 1 is from 000000000000 to abef12d6743a. If the corresponding 6 bytes of a candidate unit fall within this range (inclusive of the end value), the link corresponding to that child will be selected. If the corresponding 6 leading bytes of a candidate unit are greater than the range delimiter abef12d6743a, then child 1 will not be selected. To check whether a candidate unit falls within the range corresponding to child 2, two conditions must be met: first, the candidate must be outside the range of the immediately preceding child (in this case, child 1), and second, the corresponding bytes in its name must be less than or equal to the range delimiter corresponding to child 2. In this example, the range delimiter corresponding to child 2 is described by 2 bytes with a value of dcfa. Therefore, the 2 corresponding bytes of the candidate unit must be less than or equal to dcfa. Using this method, the candidate unit can be checked against all the children in the tree node to determine which of the N+1 ranges the candidate unit falls into. For the example shown in Figure 3F, a miss condition will be detected if the 4 corresponding bytes of the candidate unit's name are greater than the value f3231929 of the distinguishing bytes corresponding to the link of child N.
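The range check described above can be sketched as follows, reusing the delimiter values from the Figure 3F example; the function and the subtree labels are illustrative, not the patent's implementation:

```python
def select_range_child(name, children):
    """children: list of (delimiter_bytes, subtree), sorted by ascending
    delimiter. The name falls in the range of the first child whose delimiter
    its corresponding leading bytes are lexicographically <= to; otherwise
    the out-of-range (miss) condition applies."""
    for delimiter, subtree in children:
        if name[:len(delimiter)] <= delimiter:
            return subtree
    return None  # out-of-range condition: lookup miss

CHILDREN = [
    (bytes.fromhex("abef12d6743a"), "child1"),  # 6-byte delimiter of child 1
    (bytes.fromhex("dcfa"), "child2"),          # 2-byte delimiter of child 2
    (bytes.fromhex("f3231929"), "childN"),      # 4-byte delimiter of child N
]
```

Because the loop tests the children in ascending delimiter order, the condition "outside the previous child's range" is implied by the earlier comparisons having already failed.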

可以对树导航处理进行修改以便容纳这一新的范围节点。在到达范围节点时,为了选择从该节点发出的给定链接,来自候选的名称的字节必须落在对于该特定链接所定义的范围内。如果来自候选的名称的字节的值大于所有链接中的相应字节的值,则候选单元落在下方子树所跨越的所有范围之外——在这种情况下(其被称作"超出范围状况")检测到错失状况,并且树导航处理终止。如果候选单元的名称的开头字节落在由沿着导向子代子树的链接的相应的区分字节所确定的范围之内,树导航继续到下方的该子树。除非由于"超出范围状况"而终止,否则树导航可以逐渐地沿着树继续下到更深处,直到其到达叶节点数据结构为止。The tree navigation process can be modified to accommodate this new range node. Upon reaching a range node, in order to select a given link emanating from that node, the bytes from the candidate's name must fall within the range defined for that particular link. If the value of the bytes from the candidate's name is greater than the value of the corresponding bytes in all the links, the candidate unit falls outside all the ranges spanned by the subtrees below; in this case (called the "out-of-range condition") a miss condition is detected, and the tree navigation process terminates. If the leading bytes of the candidate unit's name fall within the range determined by the corresponding distinguishing bytes along a link leading to a child subtree, tree navigation continues down into that subtree. Unless terminated due to the "out-of-range condition", tree navigation can progressively continue deeper down the tree until it reaches a leaf node data structure.

这种范围节点可以结合在图3A-3E中描述的前缀树节点被采用在树结构中。在一些实施例中,树结构的特定数目的等级的上方节点可以是前缀树节点,其中树遍历是基于候选单元的名称的开头字节与沿着树的链接的相应字节之间的精确匹配。后续节点可以是具有通过候选的名称的相应字节落在其中的范围所决定的树遍历的范围节点。当树导航处理终止时,正如在本文献中早前所描述的那样,多种标准可以被用来决定作为总体内容关联查找的结果将返回什么。Such range nodes can be employed in the tree structure in combination with the prefix tree nodes described in Figures 3A-3E. In some embodiments, the nodes in the upper levels of the tree structure (up to a specific number of levels) can be prefix tree nodes, where tree traversal is based on an exact match between the leading bytes of the candidate unit's name and the corresponding bytes along the links of the tree. Subsequent nodes can be range nodes, where tree traversal is determined by the range within which the corresponding bytes of the candidate's name fall. When the tree navigation process terminates, as described earlier in this document, a variety of criteria can be used to decide what will be returned as the result of the overall content-associative lookup.

前面对用于表示和使用树节点和叶节点的方法和装置的描述仅仅是出于说明和描述的目的而给出的。其并不意图作出穷举或者把本发明限制到所公开的形式。因此,本领域技术人员将认识到许多修改和变型。The foregoing description of methods and apparatus for representing and using tree nodes and leaf nodes is given for illustrative purposes only. It is not intended to be exhaustive or to limit the invention to the forms disclosed. Therefore, those skilled in the art will recognize many modifications and variations.

在给出候选单元以作为输入时,可以对前面描述的树节点和叶节点结构进行遍历,并且可以基于候选单元的内容对树实施内容关联查找。将从候选单元的字节构造候选单元的名称,正如基本数据单元的名称是当基本数据单元被安装在滤筛中时从其内容构造的。在给定输入候选单元的情况下,对树进行内容关联查找的方法涉及使用候选单元的名称对树结构进行导航,随后进行分析和筛选以便决定作为总体内容关联查找的结果将返回什么。换句话说,树导航处理返回第一输出结果,随后在该结果上实施分析和筛选,以便确定总体内容关联查找的结果。When candidate units are given as input, the tree node and leaf node structure described above can be traversed, and a content-related lookup can be performed on the tree based on the content of the candidate units. The name of the candidate unit is constructed from its bytes, just as the name of the basic data unit is constructed from its content when the basic data unit is placed in a filter. Given input candidate units, the method of performing a content-related lookup on the tree involves navigating the tree structure using the names of the candidate units, followed by analysis and filtering to determine what will be returned as the result of the overall content-related lookup. In other words, the tree navigation process returns a first output result, which is then analyzed and filtered to determine the result of the overall content-related lookup.

如果存在具有与候选相同的名称开头字节(或者落在相同范围内的字节)的任何基本数据单元,树将通过由链接标示的单元子树的形式来标识基本数据单元的该子集。一般来说,每一个树节点或叶节点可以存储允许树导航处理决定将选择哪一个外出链接(如果存在的话)以便导航到树中的下一个更低等级的信息,这是基于输入单元的名称的相应字节以及在沿着所选链接对树进行导航时所到达的节点的身份。如果每一个节点都包含该信息,则树导航处理可以通过递归方式向下导航树中的每一个等级,直到没有找到匹配(此时树导航处理可以返回存在于以当前节点为根部的子树中的一个基本数据单元集合)或者到达一个基本数据单元(此时树导航处理可以返回该基本数据单元以及任何相关联的元数据)为止。If any basic data unit exists with the same name-starting bytes as the candidate (or bytes falling within the same range), the tree identifies that subset of basic data units as a subtree of units marked by links. Generally, each tree node or leaf node can store information that allows the tree navigation process to decide which outgoing link (if any) to choose to navigate to the next lower level in the tree. This information is based on the corresponding bytes of the input unit's name and the identity of the node reached while navigating the tree along the selected link. If every node contains this information, the tree navigation process can recursively navigate down each level of the tree until no match is found (at which point the tree navigation process can return a set of basic data units existing in the subtree rooted at the current node) or until a basic data unit is reached (at which point the tree navigation process can return that basic data unit along with any associated metadata).

一旦树导航处理终止,可以使用其他标准和要求对树导航处理的结果进行分析和筛选,以便确定作为总体内容关联查找的结果应当返回什么。首先,可以挑选在其名称中具有最多数目的与候选相同的开头字节的基本数据单元。其次,当由树导航处理返回单一基本数据单元或多个基本数据单元时,在有资格作为内容关联查找的结果被返回之前,可以附加地要求其与候选单元的名称共享特定的最少数目的字节(否则内容关联查找返回错失)。筛选要求的另一个实例可以是,如果树导航处理在没有到达单一基本数据单元的情况下终止并且从而作为树导航处理的结果返回多个基本数据单元(以树导航终止的节点为根部),则只有在这些单元的数目小于所规定的特定限制(比如4-16个单元)的情况下,所述多个基本数据单元才将有资格作为总体内容关联查找的结果被返回(否则内容关联查找返回错失)。可以采用多项要求的组合来确定内容关联查找的结果。Once the tree navigation process terminates, other criteria and requirements can be used to analyze and screen the result of the tree navigation process, in order to determine what should be returned as the result of the overall content-associative lookup. First, the basic data unit with the largest number of leading bytes in its name in common with the candidate can be picked. Second, when a single basic data unit or multiple basic data units are returned by the tree navigation process, they can additionally be required to share a certain minimum number of bytes with the candidate unit's name before being eligible to be returned as the result of the content-associative lookup (otherwise the content-associative lookup returns a miss). Another example of a screening requirement can be that, if the tree navigation process terminates without reaching a single basic data unit and thus returns multiple basic data units (rooted at the node where tree navigation terminated) as the result of the tree navigation process, then these multiple basic data units will be eligible to be returned as the result of the overall content-associative lookup only if their number is less than a certain specified limit (such as 4-16 units) (otherwise the content-associative lookup returns a miss). A combination of multiple requirements can be used to determine the result of the content-associative lookup.
If multiple candidates remain, the navigation lookahead bytes and the associated metadata can be examined to determine which basic data units are most appropriate. If it is still not possible to narrow the selection down to a single basic data unit, multiple basic data units can be provided to the derivation function. In this way, the lookup process will either report a "miss", or return a single basic data unit, or, if not a single basic data unit, a set of basic data units likely to serve as a good starting point for deriving the candidate unit.
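As an illustrative sketch only (the concrete thresholds are examples, not claimed values), the post-navigation screening described above can be expressed as:

```python
def screen_lookup_result(returned, cand_name, min_shared=4, max_group=16):
    """Screen the basic data elements returned by tree navigation.

    returned: list of (name, ref) pairs from the subtree where navigation
    terminated. Returns a (possibly empty) list of refs; an empty list
    represents a lookup miss. min_shared/max_group are illustrative.
    """
    def shared(a, b):                       # length of common leading bytes
        n = 0
        for x, y in zip(a, b):
            if x != y:
                return n
            n += 1
        return n

    # a multi-element result is usable only if the group is small enough
    if len(returned) > max_group:
        return []
    # require a minimum number of leading bytes in common with the candidate
    ok = [(name, ref) for name, ref in returned
          if shared(name, cand_name) >= min_shared]
    if not ok:
        return []
    # prefer the element(s) sharing the most leading bytes with the candidate
    best = max(shared(name, cand_name) for name, _ in ok)
    return [ref for name, ref in ok if shared(name, cand_name) == best]
```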

树需要被设计成用于高效的内容关联存取。具有良好平衡的树对于大部分数据将提供可比的存取深度。预期树的几个上方等级将常常驻留在处理器高速缓存中,接下来的几个等级驻留在快速存储器中,并且后续等级驻留在闪存存储装置中。对于非常大的数据集,可能有一个或多个等级需要驻留在闪存存储装置或者甚至是盘中。Trees need to be designed for efficient content-associative access. A well-balanced tree will provide comparable access depth for most data. It is expected that the top few levels of the tree will often reside in the processor cache, the next few levels in fast memory, and subsequent levels in flash storage. For very large datasets, one or more levels may need to reside in flash storage or even on disk.

图4示出了根据这里所描述的一些实施例的如何可以把256TB的基本数据组织成树形式的一个实例,并且呈现出如何可以把树布置在存储器和存储装置中。假设每个节点平均展开64(2^6)个子代,则可以通过到达(平均)驻留在树的第6级(也就是说在5次链接遍历或跳跃之后)的叶节点数据结构(例如在图3H中描述)而存取对某一基本数据单元的引用。因此,在5次跳跃之后,树的第6级处的此类结构将与另外的2^30个此类节点并列驻留,每一个节点平均具有64个子代(这些子代是对基本数据单元的引用),从而容纳近似640亿个基本数据单元。在4KB的单元大小下,这样就容纳256TB的基本数据。Figure 4 illustrates an example of how 256TB of basic data can be organized into a tree structure according to some embodiments described herein, and demonstrates how the tree can be arranged in memory and storage devices. Assuming each node expands to an average of 64 (2^6) children, a reference to a basic data unit can be accessed by reaching a leaf node data structure (e.g., depicted in Figure 3H) residing at (on average) level 6 of the tree (i.e., after 5 link traversals or jumps). Therefore, after 5 jumps, such a structure at level 6 of the tree will reside alongside another 2^30 such nodes, each with an average of 64 children (which are references to basic data units), thus accommodating approximately 64 billion basic data units. At a unit size of 4KB, this accommodates 256TB of basic data.
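The geometry described above can be checked with a few lines of arithmetic; the fanout, depth, and unit size are the averages assumed in the text:

```python
fanout, levels, unit_size = 64, 6, 4096   # averages assumed in Figure 4

leaf_nodes = fanout ** (levels - 1)       # leaf node structures at level 6
assert leaf_nodes == 2 ** 30              # ~1 billion leaf node structures

units = fanout ** levels                  # references to basic data units
assert units == 2 ** 36                   # ~64 billion basic data units

assert units * unit_size == 2 ** 48       # 2**48 bytes = 256 TB of basic data
```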

树可以被布置成使得可以如下遍历树的6个等级:3个等级驻留在芯片上高速缓存中(其中包含规定对应于去到近似256K个节点的链接的过渡的近似四千个“上方等级”树节点数据结构),存储器中的2个等级(其中包含规定对应于去到近似10亿个叶节点的链接的过渡的1600万个“中间等级”树节点数据结构),以及闪存存储装置中的第6级(容纳10亿个叶节点数据结构)。驻留在闪存存储装置中的树的该第6级的10亿个叶节点数据结构提供对640亿个基本数据单元的引用(每个叶节点平均64个单元)。The tree can be arranged such that its six levels can be traversed as follows: three levels reside in the on-chip cache (containing approximately four thousand "upper-level" tree node data structures that define the transitions to links leading to approximately 256K nodes), two levels in memory (containing 16 million "intermediate-level" tree node data structures that define the transitions to links leading to approximately one billion leaf nodes), and a sixth level in the flash memory device (containing one billion leaf node data structures). The one billion leaf node data structures of this sixth level of the tree residing in the flash memory device provide references to 64 billion basic data units (an average of 64 units per leaf node).

在图4所示的实例中,在第4和第5级,每一个节点对于每个单元专用平均16个字节(对应于子代ID的1个字节,例如对PDE的6字节引用,加上用于字节计数的一个字节,加上用以规定实际过渡字节以及一些元数据的平均8个字节)。在第6级,每一个叶节点对于每个单元专用平均48个字节(对应于子代ID的1个字节,用于字节计数的1个字节,用以规定实际过渡字节的8个字节,对基本数据单元的6字节引用,用于来自该基本数据单元的导出项计数的1个字节,16个字节的导航前瞻,对应于基本数据单元的大小的2个字节,以及13个字节的其他元数据),因此对于树所需的闪存存储装置中的总容量(包括对基本数据单元的引用并且包括任何元数据)是大约3个太字节。对于树的上方节点所需的总容量是这一大小的一小部分(这是因为节点更少,规定对子代节点的更严格的引用所需的字节更少,并且每个节点所需的元数据更少)。在该例中,上方树节点对于每个单元专用平均8个字节(对应于子代ID的1个字节,用于字节计数的1个字节,加上用以规定实际过渡字节的平均3-4个字节,以及对子代节点的2-3字节引用)。在该例中,总体上使用3TB(或者256TB的1.17%)的附加装置把具有256TB的基本数据的合成数据集分拣到10亿个群组中。In the example shown in Figure 4, at levels 4 and 5, each node has an average of 16 bytes dedicated to each cell (corresponding to 1 byte for the child ID, 6 bytes for the PDE reference, plus 1 byte for byte counting, plus an average of 8 bytes for specifying the actual transition bytes and some metadata). At level 6, each leaf node has an average of 48 bytes dedicated to each cell (corresponding to 1 byte for the child ID, 1 byte for byte counting, 8 bytes for specifying the actual transition bytes, 6 bytes for the basic data unit reference, 1 byte for counting the derived items from that basic data unit, 16 bytes for navigation lookahead, 2 bytes corresponding to the size of the basic data unit, and 13 bytes for other metadata). Therefore, the total capacity required in flash storage for the tree (including references to basic data units and any metadata) is approximately 3 terabytes. The total capacity required for nodes at the top of the tree is a fraction of this size (because there are fewer nodes, fewer bytes are needed to specify more stringent references to child nodes, and less metadata is required per node). In this example, the upper tree node has an average of 8 bytes dedicated to each unit (1 byte for the child ID, 1 byte for byte counting, plus an average of 3-4 bytes to specify the actual transition bytes, and 2-3 bytes for the reference to the child node). 
In this example, a total of 3TB (or 1.17% of 256TB) of additional storage is used to sort the synthetic dataset of 256TB of basic data into 1 billion groups.
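The roughly 3TB and 1.17% figures follow directly from the 48 bytes of leaf metadata per element; a quick check:

```python
units = 2 ** 36            # ~64 billion basic data units (Figure 4)
leaf_overhead = 48         # bytes of leaf-node metadata per unit
basic_data = units * 4096  # 256 TB of basic data

tree_bytes = units * leaf_overhead
assert tree_bytes == 3 * 2 ** 40                         # 3 TB of metadata
assert round(tree_bytes / basic_data * 100, 2) == 1.17   # 1.17% of 256 TB
```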

在图4所示出的实例中,256TB的基本数据包含640亿个4KB基本数据单元,为了完全区分所述640亿个基本数据单元需要少于5字节(或36比特)的地址。从内容关联角度来看,如果数据的混合使得在前3个等级当中的每一个等级处消耗平均4字节的渐进式名称,并且在接下来的3个等级当中的每一个等级处消耗8个字节,从而总共(平均)36字节(288比特)的名称将区分所有640亿个基本数据单元。这36个字节将少于构成每一个单元的4KB的1%。如果可以通过其字节的1%(或者甚至5-10%)来标识4KB的基本数据单元,则(构成大部分字节的)其余字节可以容许微扰,并且具有此类微扰的候选仍然可以到达该基本数据单元并且可以考虑从该基本数据单元导出。In the example shown in Figure 4, the 256TB of basic data contains 64 billion 4KB basic data units; fewer than 5 bytes (or 36 bits) of address are needed to fully distinguish the 64 billion basic data units. From a content-associative standpoint, if the mix of data is such that an average of 4 bytes of the progressive name is consumed at each of the first 3 levels, and 8 bytes at each of the next 3 levels, then a total of (on average) 36 bytes (288 bits) of name will distinguish all 64 billion basic data units. These 36 bytes will be less than 1% of the 4KB that makes up each unit. If a 4KB basic data unit can be identified by 1% (or even 5-10%) of its bytes, then the remaining bytes (which make up the majority of the bytes) can tolerate perturbations, and candidates with such perturbations can still reach this basic data unit and can be considered for derivation from it.
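The 36-byte name figure is simply the sum of the per-level consumptions stated above:

```python
# bytes of progressive name consumed per level, as assumed in the text
per_level = [4, 4, 4, 8, 8, 8]
name_bytes = sum(per_level)
assert name_bytes == 36             # 36 bytes = 288 bits of name
assert name_bytes * 8 == 288
assert name_bytes / 4096 < 0.01     # under 1% of a 4KB basic data unit
```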

应当提到的是,(为了区分下方的各个子树)在任何给定的链接上所需的字节数目将由构成数据集的单元混合中的实际数据决定。同样地,从给定节点发出的链接数目也将随着数据而改变。自描述树节点和叶节点数据结构将声明对于每一个链接所需的字节的实际数目和值,以及从任何节点发出的链接的数目。It should be mentioned that the number of bytes required on any given link (in order to distinguish the various subtrees below) will be determined by the actual data in the mix of units that make up the dataset. Likewise, the number of links emanating from a given node will also vary with the data. The self-describing tree node and leaf node data structures will declare the actual number and values of the bytes needed for each link, as well as the number of links emanating from any node.

可以施加进一步的控制以便限制在树的各个等级处专用的高速缓存、存储器和存储装置的数量,以便在增量存储的已分配预算内把输入分拣到尽可能多的已区分群组中。为了应对其中存在需要非常深的子树来完全区分单元的数据密度和口袋(pocket)的情况,可以通过以下步骤高效地应对这样的密度:把相关单元的更大集合分组到树的特定深度(例如第6级)处的平坦群组中,并且在其上实施流线式搜索和导出(这是通过首先检查导航前瞻和元数据以确定最佳基本数据单元,或者(作为回退)对于其余的数据仅仅寻找重复而不是由所述方法提供的完全导出)。这样将避免产生非常深的树。另一种替换方案是允许(具有许多等级的)很深的树,只要这些等级能够容纳在可用的存储器中。当更深的等级溢出到闪存或盘时,可以采取一些步骤以使得树从该等级往后变平坦,从而最小化原本将由于针对存储在闪存或盘中的更深等级的树节点的多次相继存取而招致的等待时间。Further controls can be imposed to limit the amount of cache, memory, and storage dedicated at each level of the tree, so that the input is sorted into as many distinguished groups as possible within the allocated budget of incremental storage. To address situations where there are data densities and pockets requiring very deep subtrees to fully distinguish the units, such densities can be handled efficiently by grouping larger sets of related units into flat groups at a specific depth of the tree (e.g., level 6) and implementing a streamlined search and derivation on them (this is done by first examining the navigation lookaheads and metadata to determine the best basic data unit, or (as a fallback) simply looking for duplicates for the remaining data instead of the full derivation provided by the method). This avoids creating very deep trees. An alternative is to allow very deep trees (with many levels), as long as these levels can be accommodated in available memory. When deeper levels overflow to flash memory or disk, steps can be taken to flatten the tree from that level onwards, minimizing the latency that would otherwise result from multiple successive accesses to tree nodes at deeper levels stored in flash memory or disk.

预期来自单元名称的全部字节当中的相对较小的一部分将常常足以标识每一个基本数据单元。使用这里所描述的实施例在多种真实世界数据集上实施的研究证实,基本数据单元的一个较小的字节子集可以用来对大部分单元进行排序从而允许所述解决方案。因此,这样的解决方案在对于其操作所需的存储的数量方面是高效的。It is anticipated that a relatively small subset of the total bytes from the cell name will often be sufficient to identify each basic data unit. Studies implemented on various real-world datasets using the embodiments described herein confirm that a small subset of bytes from the basic data units can be used to sort most of the units, thus allowing the solution described. Therefore, such a solution is efficient in terms of the amount of storage required for its operation.

在对于来自图4的实例所需的存取方面,对于每一个传入的4KB输入组块(或候选单元),所述方法将需要实施一次以下存取以便对树结构进行查询并且到达叶节点:三个高速缓存引用、两个存储器引用(或者可能有多个存储器引用)加上来自闪存存储装置的单一IO,以便对叶节点数据结构进行存取。来自存储装置的该单一IO将获取一个4KB页面,其将保有对应于一组近似64个单元的叶节点数据结构的信息,从而将包括专用于所讨论的基本数据单元的48个字节。这48个字节将包括关于所讨论的基本数据单元的元数据。这将结束树查找处理。随后所需要的IO的次数将取决于候选单元结果是重复、导出项还是将被安装在滤筛中的新鲜基本数据单元。Regarding the accesses required for the example from Figure 4, for each incoming 4KB input chunk (or candidate unit), the method will need to perform the following accesses once in order to query the tree structure and reach the leaf node: three cache references, two memory references (or possibly multiple memory references), plus a single IO from flash storage to access the leaf node data structure. This single IO from the storage device will fetch a 4KB page, which will hold the information for the leaf node data structures of a group of approximately 64 units, and will thus include the 48 bytes dedicated to the basic data unit in question. These 48 bytes will include metadata about the basic data unit in question. This will conclude the tree lookup process. The number of IOs needed subsequently will depend on whether the candidate unit turns out to be a duplicate, a derivative, or a fresh basic data unit to be installed in the filter.

作为某一基本数据单元的重复的候选单元将需要1次IO来获取该基本数据单元,以便验证重复。一旦验证了重复,将再需要一次IO以更新树中的元数据。因此,重复单元的摄取在树查找之后将需要两次IO,从而一共是3次IO。A candidate cell that is a duplicate of a given basic data unit requires one I/O operation to retrieve that basic data unit in order to verify the duplicate. Once the duplicate is verified, another I/O operation is required to update the metadata in the tree. Therefore, the ingestion of a duplicate cell will require two I/O operations after the tree lookup, for a total of three I/O operations.

没有通过树查找并且既不是重复也不是导出项的候选单元需要另外的1次IO以把该单元作为新的基本数据单元存储在滤筛中,并且需要另一次IO以更新树中的元数据。因此,没有通过树查找的候选单元的摄取在树查找之后将需要2次IO,从而导致一共3次IO。但是对于其中树查找处理在不需要存储IO的情况下终止的候选单元,对于摄取这样的候选单元一共只需要2次IO。Candidate cells that do not pass a tree search and are neither duplicates nor derived items require an additional IO to store them as new basic data units in the filter, and another IO to update the metadata in the tree. Therefore, ingesting candidate cells that do not pass a tree search will require 2 IOs after the tree search, resulting in a total of 3 IOs. However, for candidate cells where the tree search process terminates without requiring storage IO, ingesting such candidate cells only requires 2 IOs.

作为导出项(而非重复)的候选单元将首先需要1次IO以获取计算导出所需的基本数据单元。由于预期最经常的导出将是源自单一基本数据单元(而不是多个基本数据单元),因此对于获取该基本数据单元将只需要单一IO。在成功完成导出之后,将再需要1次IO以把重建程序和导出细节存储在为该单元在存储装置中创建的条目中,并且需要另一次IO以更新树中的元数据(比如计数等等)以便反映出新的导出项。因此,变成导出项的候选单元的摄取在第一树查找之后需要3次附加的IO,从而一共是4次IO。A candidate unit that turns out to be a derivative (rather than a duplicate) will first require 1 IO to fetch the basic data unit needed to compute the derivation. Since the most frequent derivations are expected to be from a single basic data unit (rather than from multiple basic data units), only a single IO will be needed to fetch that basic data unit. After the derivation is successfully completed, 1 more IO will be needed to store the reconstruction program and the details of the derivation in the entry created for the unit in the storage device, and another IO will be needed to update the metadata in the tree (such as counts, etc.) to reflect the new derivative. Therefore, the ingestion of a candidate unit that becomes a derivative requires 3 additional IOs after the first tree lookup, for a total of 4 IOs.

总而言之,为了摄取候选单元并且对其应用Data DistillationTM方法(同时在非常大的数据集上全局利用冗余)需要大致3到4次IO。与传统的数据去重复技术所需要的情况相比,对于每个候选单元通常只有另外的一次IO,其回报是能够以比所述单元本身更细的粒度在数据集上全局地利用冗余。In summary, approximately 3 to 4 I/O operations are required to ingest candidate cells and apply the Data Distillation method to them (while globally utilizing redundancy on very large datasets). Compared to traditional data deduplication techniques, this typically requires only one additional I/O operation per candidate cell, with the added benefit of being able to globally utilize redundancy on the dataset at a finer granularity than the cell itself.

给出250000次随机IO存取/秒(这意味着对于4KB页面的1GB/秒的随机存取带宽)的存储系统每秒可以摄取大约62500(250000除以平均大小分别为4KB的每个输入组块4次IO)个输入组块并且对其实施Data DistillationTM方法。这在用尽存储系统的所有带宽的情况下允许250MB/秒的摄取速率。如果仅使用存储系统的一半带宽(从而使得另一半可用于对所存储的数据进行存取),这样的Data DistillationTM系统仍然可以给出125MB/秒的摄取速率。因此,在给定足够处理能力的情况下,Data DistillationTM系统能够以经济的IO(在比所述单元本身更细的粒度上)在数据集上全局地利用冗余,并且在当代存储系统上以每秒数百兆字节的摄取速率给出数据简化。A storage system offering 250,000 random IO accesses per second (meaning 1 GB/s of random access bandwidth for 4KB pages) can ingest approximately 62,500 input chunks per second (250,000 divided by 4 IOs per input chunk of average size 4KB) and apply the Data Distillation method to them. This allows an ingestion rate of 250 MB/s while using up all of the storage system's bandwidth. If only half of the storage system's bandwidth is used (so that the other half is available for accessing the stored data), such a Data Distillation system can still deliver an ingestion rate of 125 MB/s. Therefore, given sufficient processing power, a Data Distillation system can globally exploit redundancy on a dataset with economical IO (at a finer granularity than the units themselves) and deliver data reduction at ingestion rates of hundreds of megabytes per second on contemporary storage systems.
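The throughput arithmetic above can be replayed directly (note that the text's 250 MB/s figure rounds 62,500 × 4096 bytes down to a decimal megabyte rate):

```python
iops = 250_000          # random IO accesses per second (4KB pages)
ios_per_ingest = 4      # worst-case IOs per candidate (lookup + derivation)

chunks_per_second = iops // ios_per_ingest
assert chunks_per_second == 62_500

ingest_bytes = chunks_per_second * 4096
assert ingest_bytes == 256_000_000   # ~250 MB/s at full bandwidth

# using only half the bandwidth still supports ~125 MB/s of ingestion
assert ingest_bytes // 2 == 128_000_000
```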

因此,正如测试结果所证实的那样,这里所描述的实施例实现了以经济的IO存取并且利用对于装置所需的最小增量存储从大容量数据存储库搜索单元(从中可以导出输入单元并且只利用规定所述导出所需的最小存储)的复杂任务。如此构造的这一框架使得使用全部单元字节的更小百分比找到适合于导出的单元成为可行,从而留下大部分字节可用于微扰和导出。解释这一方案为何对于大部分数据能够有效工作的一项重要洞察在于,树提供了易于使用的细粒度结构,从而允许定位在滤筛中标识单元的区分和区别字节,并且尽管这些字节分别处于数据中的不同深度和位置,在树结构中可以高效地对其进行隔离和存储。Therefore, as confirmed by test results, the embodiments described herein accomplish, with economical IO accesses and with the minimum incremental storage required for the apparatus, the complex task of searching a large data repository for units from which an input unit can be derived, while using only the minimum storage needed to specify the derivation. The framework thus constructed makes it feasible to find units suitable for derivation using a smaller percentage of the total bytes of a unit, leaving the majority of the bytes available for perturbation and derivation. An important insight explaining why this scheme works effectively for most data is that the tree provides an easy-to-use, fine-grained structure that allows the distinguishing and differentiating bytes that identify units in the filter to be located, and although these bytes sit at different depths and positions in the data, they can be isolated and stored efficiently in the tree structure.

图5A-5C示出了关于如何可以使用这里所描述的实施例组织数据的一个实际的实例。图5A示出了512字节的输入数据以及因式分解的结果(例如实施图2中的操作202的结果)。在该例中应用指纹处理以确定数据中的中断,从而使得接连的中断标识候选单元。使用粗体和常规字体示出了交替的候选单元。举例来说,第一候选单元是“b8ac83d9dc7caf18f2f2e3f783a0ec69774bb50bbe1d3ef1ef8a82436ec43283bc1c0f6a82e19c224b22f9b2”,下一个候选单元是“ac83d9619ae5571ad2bbcc15d3e493eef62054b05b2dbccce933483a6d3daab3cb19567dedbe33e952a966c49f3297191cf22aa31b98b9dcd0fb54a7f761415e”,后面以此类推。如图所示,图5A中的输入被因式分解成12个可变大小候选单元。每一个组块的开头字节被用来在滤筛中对单元进行排序和组织。图5B示出了如何可以使用其名称并且使用图3B中所描述的树结构按照树形式把图5A中示出的12个候选单元组织成滤筛中的基本数据单元。每一个单元具有从该单元的整个内容构造的独特名称。在该例中,由于应用了指纹处理以确定12个候选单元之间的中断,因此每一个候选单元的开头字节将已经被对准到锚指纹:因此,每一个名称的开头字节将已经是从锚定在该指纹处的内容的第一维度构造的。名称的开头字节组织各个单元。举例来说,如果单元名称中的第一字节等于“0x22”,则取得顶部链接以选择基本数据单元#1。应当提到的是,使用不同数目的字节对图5B中的各个链接进行区分,正如参照图3B中示出的树数据结构所解释的那样。Figures 5A-5C illustrate a practical example of how data can be organized using the embodiments described herein. Figure 5A shows 512 bytes of input data and the result of factorization (e.g., the result of implementing operation 202 in Figure 2). In this example, fingerprinting is applied to identify breaks in the data, thus making successive breaks identify candidate units. Alternating candidate units are shown using bold and regular font. For example, the first candidate unit is "b8ac83d9dc7caf18f2f2e3f783a0ec69774bb50bbe1d3ef1ef8a82436ec43283bc1c0f6a82e19c224b22f9b2", the next candidate unit is "ac83d9619ae5571ad2bbcc15d3e493eef62054b05b2dbccce933483a6d3daab3cb19567dedbe33e952a966c49f3297191cf22aa31b98b9dcd0fb54a7f761415e", and so on. As shown in the figure, the input in Figure 5A is factored into 12 variable-size candidate units. The first byte of each block is used to sort and organize the cells in the filter. Figure 5B shows how the 12 candidate cells shown in Figure 5A can be organized into basic data cells in the filter using their names and the tree structure described in Figure 3B in a tree-like manner. Each cell has a unique name constructed from the entire content of that cell. 
In this example, because fingerprinting is applied to determine the breaks between the 12 candidate cells, the first byte of each candidate cell will have been aligned to the anchor fingerprint: therefore, the first byte of each name will have been constructed from the first dimension of the content anchored at that fingerprint. The first bytes of the names organize the individual cells. For example, if the first byte in the cell name is equal to "0x22", the top link is taken to select basic data cell #1. It should be mentioned that different numbers of bytes are used to distinguish the individual links in Figure 5B, as explained with reference to the tree data structure shown in Figure 3B.
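The fingerprint-based factorization described above can be illustrated with a Rabin-Karp-style rolling hash; the window size, hash base, break mask, and minimum chunk size below are illustrative stand-ins, not the embodiment's actual fingerprint function or thresholds:

```python
def factorize(data: bytes, window=8, base=257, mask=0x1F, min_size=12):
    """Factorize input into variable-size candidate units at
    content-defined breaks: a break is declared wherever a rolling
    fingerprint over the last `window` bytes matches a pattern
    (low bits all zero). All parameters are illustrative.
    """
    M = 1 << 64
    power = pow(base, window - 1, M)
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i >= window:                       # slide the window forward
            h = (h - data[i - window] * power) % M
        h = (h * base + b) % M
        if i + 1 - start >= min_size and (h & mask) == 0:
            chunks.append(data[start:i + 1])  # break after the fingerprint
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])           # trailing remainder
    return chunks
```

Because the breaks depend only on content, identical regions of data produce identical candidate units, which is what lets recurring content land on the same basic data units in the filter.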

图5C示出了如何可以使用参照图3D描述的树数据结构对图5A中示出的12个候选单元进行组织。进一步对每一个单元的内容应用指纹处理,以便识别单元内容内的第二指纹。把从第一指纹(其已经存在于每一个单元的边界处)的位置处提取出的内容字节与第二指纹串联,从而形成被用来对单元进行组织的名称的开头字节。换句话说,单元名称被如下构造:来自(分别通过锚指纹和第二指纹定位的)两个维度或字段的数据字节被串联形成名称的开头字节,随后是其余的字节。作为针对名称构造的这一选择的结果,(相比于图5B)不同的字节序列导向图5C中的各个基本数据单元。例如为了到达基本数据单元#4,树导航处理首先取得对应于作为第一维度(即第一指纹)处的字段的开头字节的“46093f9d”的链接,并且随后取得对应于作为位于第二维度(即第二指纹)处的字段的开头字节的“c4”的链接。Figure 5C illustrates how the 12 candidate cells shown in Figure 5A can be organized using the tree data structure described with reference to Figure 3D. Fingerprinting is further applied to the content of each cell to identify a second fingerprint within the cell content. The content bytes extracted from the location of the first fingerprint (which already exists at the boundary of each cell) are concatenated with the second fingerprint to form the first byte of the name used to organize the cell. In other words, the cell name is constructed as follows: data bytes from two dimensions or fields (located respectively by the anchor fingerprint and the second fingerprint) are concatenated to form the first byte of the name, followed by the remaining bytes. As a result of this choice in name construction, (compared to Figure 5B) different byte sequences guide the various basic data cells in Figure 5C. For example, to reach basic data cell #4, the tree navigation process first obtains the link corresponding to "46093f9d" which is the first byte of the field located in the first dimension (i.e., the first fingerprint), and then obtains the link corresponding to "c4" which is the first byte of the field located in the second dimension (i.e., the second fingerprint).
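The two-dimensional name construction of Figure 5C can be sketched as follows; the offsets and field lengths here are illustrative (in the embodiment the second offset comes from where the second fingerprint is found in the content):

```python
def element_name(element: bytes, anchor_off: int, second_off: int,
                 n1: int = 4, n2: int = 1) -> bytes:
    """Construct a unit's name by concatenating bytes from two
    content-located dimensions: n1 bytes at the anchor fingerprint
    (units in Fig. 5C begin at the anchor, so anchor_off is 0) and
    n2 bytes at the second fingerprint, followed by the remaining
    bytes. Assumes the first field precedes and does not overlap
    the second; lengths are illustrative only.
    """
    dim1 = element[anchor_off:anchor_off + n1]
    dim2 = element[second_off:second_off + n2]
    rest = (element[:anchor_off]
            + element[anchor_off + n1:second_off]
            + element[second_off + n2:])
    return dim1 + dim2 + rest
```

With this construction, a unit whose anchor-dimension bytes are 46093f9d and whose second-dimension byte is c4 yields a name beginning 46093f9dc4, matching the navigation path to basic data unit #4 in the example.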

图6A-6C分别示出了根据这里所描述的一些实施例如何可以把树数据结构用于参照图1A-1C描述的内容关联映射器121和122。Figures 6A-6C respectively illustrate how a tree data structure can be used for content association mappers 121 and 122 as described with reference to some embodiments herein.

一旦解决了找到(从中尝试导出候选单元的)适当的基本数据单元的困难问题,所述问题就被收窄到检查一个基本数据单元或者基本数据单元的较小子集并且从中最优地导出候选单元,其中只利用规定所述导出所需的最小存储。其他目标包括把对于存储系统的存取次数保持到最低程度,并且把导出时间和重建时间保持到可以接受。Once the difficult problem of finding the appropriate basic data units (from which to attempt to derive the candidate unit) is solved, the problem is narrowed to examining one basic data unit or a small subset of basic data units and optimally deriving the candidate unit from them, using only the minimum storage needed to specify the derivation. Other objectives include keeping the number of accesses to the storage system to a minimum, and keeping the derivation time and the reconstruction time acceptable.

导出器必须把候选单元表达成在一个或多个基本数据单元上实施的变换的结果,并且必须把这些变换规定成将被用来在取回数据时重新生成导出项的重建程序。每一项导出可能需要构造其自身所独有的程序。导出器的功能是识别这些变换,并且创建具有最小足迹的重建程序。可以采用多种变换,其中包括在一个或多个基本数据单元上或者在每一个单元的特定字段上实施的算术、代数或逻辑运算。此外还可以使用字节操纵变换,比如串联、插入、替换和删除一个或多个基本数据单元中的字节。The deriver must express a candidate unit as the result of transformations performed on one or more basic data units, and must specify these transformations as the reconstruction program that will be used to regenerate the derivative when the data is retrieved. Each derivation may require its own unique program to be constructed. The function of the deriver is to identify these transformations and create the reconstruction program with the smallest footprint. A variety of transformations can be employed, including arithmetic, algebraic, or logical operations performed on one or more basic data units or on specific fields of each unit. In addition, byte-manipulation transformations can be used, such as concatenating, inserting, replacing, and deleting bytes in one or more basic data units.

图7A提供了根据这里所描述的一些实施例的可以在重建程序中规定的变换的一个实例。在该例中规定的变换的词汇表包括在单元中的规定长度的字段上实施的算术运算,以及在基本数据单元中的规定偏移量处插入、删除、附加和替换所声明的长度的字节。导出器可以采用多种技术和操作来检测候选单元与一个或多个基本数据单元之间的相似性和差异并且用来构造重建程序。导出器可以利用在底层硬件中可用的词汇表来实施其功能。所述工作的最终结果是在对于重建程序所规定的词汇表中规定变换,并且在这样做时使用最小数量的增量存储并且采取还允许快速数据取回的方式。Figure 7A provides an example of the transformations that can be specified in a reconstruction program according to some embodiments described herein. The vocabulary of transformations specified in this example includes arithmetic operations performed on fields of specified length in a unit, as well as insertions, deletions, appends, and replacements of bytes of a declared length at specified offsets in a basic data unit. The deriver can employ a variety of techniques and operations to detect the similarities and differences between a candidate unit and one or more basic data units, and to construct the reconstruction program. The deriver can implement its function using the vocabulary available in the underlying hardware. The end result of the work is to specify the transformations in the vocabulary specified for the reconstruction program, and to do so using a minimal amount of incremental storage and in a manner that also allows fast data retrieval.
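As a sketch only, replaying such a reconstruction program could look like the following; the op encoding (tuples of kind, offset, payload/length) is a hypothetical subset of the Figure 7A vocabulary, not the embodiment's actual wire format:

```python
def apply_reconstruction(base: bytes, program) -> bytes:
    """Replay a reconstruction program against a basic data unit.

    Each op is a tuple: ('replace', off, payload), ('insert', off, payload),
    ('delete', off, length), or ('append', payload).
    """
    out = bytearray(base)
    for op in program:
        kind = op[0]
        if kind == 'replace':
            _, off, payload = op
            out[off:off + len(payload)] = payload   # overwrite in place
        elif kind == 'insert':
            _, off, payload = op
            out[off:off] = payload                  # splice bytes in
        elif kind == 'delete':
            _, off, length = op
            del out[off:off + length]               # remove a span
        elif kind == 'append':
            out += op[1]                            # extend at the end
    return bytes(out)
```

For instance, applying `[('replace', 0, b'H'), ('insert', 5, b','), ('append', b'!')]` to `b"hello world"` yields `b"Hello, world!"`.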

导出器可以利用底层机器的处理能力并且在为之分配的处理预算内工作,以便在系统的成本-性能约束内提供尽可能最佳的性能。鉴于微处理器核心更容易获得,并且鉴于针对存储装置的IO存取较为昂贵,因此Data DistillationTM解决方案被设计成利用当代微处理器的处理能力,以便高效地实施本地分析以及从少数几个基本数据单元导出候选单元的内容。预期Data DistillationTM解决方案(在非常大的数据上)的性能将不受计算处理的速率限制(rate-limited),而是受到典型存储系统的IO带宽的速率限制。举例来说,预期几个微处理器核心就将足以实施所需的计算和分析,从而在支持250000次IO/秒的典型的基于闪存的存储系统上支持每秒几百兆字节的摄取速率。应当提到的是,来自当代微处理器,比如Intel Xeon处理器E5-2687W(10核,3.1GHz,25MB高速缓存)的两个这样的微处理器核心是可以从处理器获得的全部计算能力的一部分(十分之二)。The deriver can leverage the processing power of the underlying machine and work within the processing budget allocated to it, to deliver the best possible performance within the system's cost-performance constraints. Given that microprocessor cores are more readily available, and given that IO accesses to storage devices are relatively expensive, the Data Distillation solution is designed to utilize the processing power of contemporary microprocessors to efficiently perform the local analysis and derivation of a candidate unit's content from a few basic data units. The performance of the Data Distillation solution (on very large data) is expected to be rate-limited not by the computational processing, but by the IO bandwidth of a typical storage system. For example, a few microprocessor cores are expected to suffice to perform the required computation and analysis, supporting ingestion rates of several hundred megabytes per second on a typical flash-based storage system supporting 250,000 IOs/second. It should be noted that two such microprocessor cores from a contemporary microprocessor, such as the Intel Xeon processor E5-2687W (10 cores, 3.1GHz, 25MB cache), are a fraction (two-tenths) of the total computational power available from the processor.

图7B示出了根据这里所描述的一些实施例的从基本数据单元导出候选单元的结果的实例。具体来说,数据模式“Elem”是存储在基本数据滤筛中的基本数据单元,并且数据模式“Cand”是将从基本数据单元导出的候选单元。已经突出显示出“Cand”与“Elem”之间的18个共同的字节。重建程序702规定如何可以从数据模式“Elem”导出数据模式“Cand”。如图7B中所示,重建程序702示出了如何从“Elem”导出“Cand”,这是通过使用1字节替换、6字节插入、3字节删除、7字节批量替换。用以规定导出项的成本是20字节+3字节引用=23字节,从而是原始大小的65.71%。应当提到的是,所示出的重建程序702是所述程序的人类可读表示,并且可能不是所述程序被这里所描述的实施例实际存储的方式。同样地,在图7B中还示出了基于算术运算(比如乘法和加法)的其他重建程序。举例来说,如果“Elem”是bc1c0f6a790c82e19c224b22f900ac83d9619ae5571ad2bbec152054ffffff83并且“Cand”是bc1c0f6a790c82e19c224b22f91c4da1aa0369a0461ad2bbec152054ffffff83,则如图所示可以使用乘法(00ac83d9619ae557)*2a=[00]1c4da1aa0369a046导出8字节差异。用以规定导出项的成本是4字节+3字节引用=7字节,从而是原始大小的20.00%。或者如果“Elem”是bc1c0f6a790c82e19c224b22f9b2ac83ffffffffffffffffffffffffffffb283并且“Cand”是bc1c0f6a790c82e19c224b22f9b2ac8300000000000000000000000000002426,则如图所示可以使用加法导出16字节差异,例如通过把0x71a3加到开始于偏移量16的16字节区段并且丢弃进位。用以规定导出项的成本是5字节+3字节引用=8字节,从而是原始大小的22.85%。应当提到的是,图7A中的范例编码仅仅是出于说明的目的而选择的。图7B中的实例具有32字节的数据大小,因此对于单元内的长度和偏移量字段有5比特就足够了。对于较大的单元(例如4KB单元),这些字段的大小将需要被增加到12比特。同样地,所述范例编码容许3字节或24比特的引用大小。这应当允许引用1600万个基本数据单元。如果引用需要能够对例如256TB的数据中的任何位置进行寻址,则引用的大小将需要是6个字节。当这样的数据集被因式分解成4KB单元时,规定引用所需要的6个字节将是4KB单元的大小的一小部分。Figure 7B illustrates an example of the results of deriving candidate units from basic data units according to some embodiments described herein. Specifically, the data pattern "Elem" is a basic data unit stored in the basic data filter, and the data pattern "Cand" is a candidate unit to be derived from the basic data unit. The 18 common bytes between "Cand" and "Elem" have been highlighted. Reconstruction procedure 702 specifies how the data pattern "Cand" can be derived from the data pattern "Elem". As shown in Figure 7B, reconstruction procedure 702 shows how "Cand" is derived from "Elem" by using 1-byte replacement, 6-byte insertion, 3-byte deletion, and 7-byte batch replacement. The cost of specifying the derived item is 20 bytes + 3-byte reference = 23 bytes, which is 65.71% of the original size. 
It should be mentioned that the reconstruction procedure 702 shown is a human-readable representation of the procedure and may not be the way the procedure is actually stored in the embodiments described herein. Similarly, other reconstruction procedures based on arithmetic operations (such as multiplication and addition) are also shown in Figure 7B. For example, if "Elem" is bc1c0f6a790c82e19c224b22f900ac83d9619ae5571ad2bbec152054ffffff83 and "Cand" is bc1c0f6a790c82e19c224b22f91c4da1aa0369a0461ad2bbec152054ffffff83, then as shown in the figure, the multiplication (00ac83d9619ae557)*2a=[00]1c4da1aa0369a046 can be used to derive the 8-byte difference. The cost to specify the derivation is 4 bytes + a 3-byte reference = 7 bytes, which is 20.00% of the original size. Alternatively, if "Elem" is bc1c0f6a790c82e19c224b22f9b2ac83ffffffffffffffffffffffffffffb283 and "Cand" is bc1c0f6a790c82e19c224b22f9b2ac8300000000000000000000000000002426, then as shown in the figure, the 16-byte difference can be derived using addition, for example by adding 0x71a3 to the 16-byte section starting at offset 16 and discarding the carry. The cost to specify the derivation is 5 bytes + a 3-byte reference = 8 bytes, which is 22.85% of the original size. It should be mentioned that the example encoding in Figure 7A is chosen for illustrative purposes only. The example in Figure 7B has a data size of 32 bytes, so 5 bits are sufficient for the length and offset fields within a unit. For larger units (e.g., 4KB units), the size of these fields would need to be increased to 12 bits. Similarly, the example encoding allows a reference size of 3 bytes or 24 bits. This should allow 16 million basic data units to be referenced. If the reference needs to be able to address any location in, for example, 256TB of data, then the reference size would need to be 6 bytes.
When such a dataset is factored into 4KB units, the 6 bytes required for the reference would be a fraction of the size of a 4KB unit.
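The addition-based derivation described above can be sketched in Python. This is a minimal illustration, not the patent's actual encoding; the function name and interface are invented for this sketch. It adds a value to a fixed-length big-endian region of the basic data unit and discards the carry out of that region, reproducing the 16-byte difference between "Elem" and "Cand" from the second arithmetic example:

```python
def derive_by_addition(elem: bytes, offset: int, length: int, addend: int) -> bytes:
    """Apply an addition-based reconstruction step: add `addend` to the
    `length`-byte big-endian integer at `offset`, discard any carry out of
    the region, and leave the rest of the unit unchanged."""
    region = int.from_bytes(elem[offset:offset + length], "big")
    region = (region + addend) % (1 << (8 * length))  # discard the carry
    return elem[:offset] + region.to_bytes(length, "big") + elem[offset + length:]

# The "Elem" and "Cand" patterns from the addition example above:
elem = bytes.fromhex(
    "bc1c0f6a790c82e19c224b22f9b2ac83ffffffffffffffffffffffffffffb283")
cand = bytes.fromhex(
    "bc1c0f6a790c82e19c224b22f9b2ac8300000000000000000000000000002426")
```

Adding 0x71a3 to the 16-byte region at offset 16 of `elem` rolls the run of ff bytes over to zeros, yielding `cand` exactly.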

The size of the information needed to specify a derivative unit (derived from one or more basic data units) is the sum of the size of the reconstruction procedure and the size of the references needed to specify the required basic data unit(s). The size of the information needed to specify a candidate unit as a derivative unit is referred to as the distance of the candidate from the basic data unit(s). When the candidate can feasibly be derived from any one of multiple sets of basic data units, the set of basic data units with the shortest distance is chosen as the target.
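The distance metric and the shortest-distance selection can be sketched as follows (a hypothetical illustration; the 3-byte reference size and the sizes of the two options are taken from the examples elsewhere in this description):

```python
REF_SIZE = 3  # bytes per basic-data-unit reference, as in the examples

def distance(rp_size: int, num_base_units: int) -> int:
    """Size of the information needed to specify the candidate as a
    derivative: reconstruction-procedure bytes plus reference bytes."""
    return rp_size + num_base_units * REF_SIZE

# A candidate that could feasibly be derived from either of two sets of
# basic data units; the set with the shortest distance is chosen.
options = [
    {"base_units": [187125], "rp_size": 39},  # distance 42 bytes
    {"base_units": [187126], "rp_size": 12},  # distance 15 bytes
]
best = min(options, key=lambda o: distance(o["rp_size"], len(o["base_units"])))
```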

When a candidate unit needs to be derived from more than one basic data unit (by assembling extracts derived from each of these basic data units), the deriver needs to factor in the cost of the additional accesses to the storage system, and weigh that cost against the benefit of a smaller reconstruction procedure and a smaller distance. Once an optimal reconstruction procedure has been created for a candidate, its distance is compared with the distance threshold; if the threshold is not exceeded, the derivation is accepted. Once a derivation is accepted, the candidate unit is reformulated as a derivative unit and replaced by the combination of the basic data unit(s) and the reconstruction procedure. The entry created in the distilled data for the candidate unit is replaced by the reconstruction procedure plus one or more references to the relevant basic data unit(s). If the distance for the best derivation exceeds the distance threshold, the derivative is not accepted.

In order to yield data reduction, the distance threshold must always be less than the size of the candidate unit. For example, the distance threshold may be set to 50% of the size of the candidate unit, so that a derivative is accepted only if its footprint is less than or equal to half the footprint of the candidate unit, thereby ensuring a reduction of 2x or more for every candidate unit for which a suitable derivation exists. The distance threshold can be a predetermined percentage or fraction, based either on user-specified input or chosen by the system. The distance threshold may be determined by the system based on static or dynamic parameters of the system.
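The acceptance test implied by the distance threshold is a one-line comparison; the sketch below (function name invented, 50% threshold as in the example above) accepts a derivation only when it guarantees the stated minimum reduction:

```python
def accept_derivation(candidate_size: int, deriv_distance: int,
                      threshold_fraction: float = 0.5) -> bool:
    """Accept the derivative only if its footprint (the distance) is within
    the threshold fraction of the candidate's size, guaranteeing at least a
    1/threshold_fraction reduction for every accepted derivative."""
    return deriv_distance <= threshold_fraction * candidate_size
```

With a 35-byte candidate and a 50% threshold, a 15-byte derivative (42.85%) is accepted while a 42-byte derivative (120.00%) is rejected, matching the worked examples in Figs. 8D-8E.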

Figures 8A-8E illustrate how data reduction can be performed, in accordance with some embodiments described herein, by factorizing input data into fixed-size units and organizing the units in the tree data structures described with reference to Figs. 3D and 3E. Figure 8A shows how the input data can be simply factorized into 32-byte chunks. Specifically, Fig. 8A shows the first 10 chunks, and then a few more chunks that appear, say, 42 million chunks later. Figure 8B shows the organization of basic data units in the filter using names constructed such that the leading bytes of a name are made up of content from 3 dimensions within the unit's content (corresponding to the locations of an anchor fingerprint, a second fingerprint, and a third fingerprint). Specifically, in Fig. 8B, each 32-byte chunk becomes a candidate unit of 32 bytes (a fixed-size block). A fingerprinting method is applied to the content of the unit. Each unit has a name constructed as follows: bytes of data from three dimensions or fields of the unit (located by the anchor fingerprint, the second fingerprint, and the third fingerprint, respectively) are concatenated to form the leading bytes of the name, followed by the remaining bytes of the unit. The name is used to organize the units in the filter. As shown in Fig. 8B, the first 10 chunks contain no duplicates or derivatives, and are successively installed in the filter as units. Figure 8B shows the filter after the 10th chunk has been consumed. Figure 8C shows the contents of the filter at a subsequent point in time, after an additional several million units of data input have been consumed (e.g., after the next 42 million chunks have been presented). The filter is examined for duplicates or derivatives. Chunks that cannot be derived from units get installed in the filter. Figure 8C shows the filter after 42 million chunks have been consumed, containing, say, 16,000,010 units (logically addressable with a 3-byte reference address), with the remaining 26,000,000 chunks having become derivatives. Figure 8D shows an example of fresh input that is subsequently presented to the filter and identified as a duplicate of an entry in the filter (shown as unit number 24789). In this example, the filter identifies unit 24789 (chunk 9) as the most suitable unit for chunk 42000011. The derive function determines that the new chunk is an exact duplicate, and replaces it with a reference to unit 24789. The cost of representing the derivative is the 3-byte reference versus the 35-byte original size, i.e., 8.57% of the original size. Figure 8D shows a second example of an input (chunk 42000012) that is converted into a derivative of an entry in the filter (shown as unit number 187126). In this example, the filter determines that there is no exact match. It identifies units 187125 and 187126 (chunks 8 and 1) as the most suitable units. The new unit is derived from the most suitable unit. Derivation versus unit 187125 and derivation versus unit 187126 are illustrated in Fig. 8D. The cost of representing the derivative versus unit 187125 is 39 bytes + a 3-byte reference = 42 bytes, i.e., 120.00% of the original size. The cost of representing the derivative versus unit 187126 is 12 bytes + a 3-byte reference = 15 bytes, i.e., 42.85% of the original size. The best derivation (versus unit 187126) is chosen. The reconstruction size is compared with the threshold; for example, if the threshold is 50%, the derivative (42.85%) is accepted. Figure 8E provides two additional examples of data chunks derived from basic data units, including one example where the derivative is actually created by deriving from two basic data units. In the first example, chunk 42000013 is presented. The filter identifies unit 9299998 (chunk 10) as the most suitable unit. Derivation versus unit 9299998 is shown in Fig. 8E. The cost of representing the derivative is 4 bytes + a 3-byte reference = 7 bytes, i.e., 20.00% of the original size. The reconstruction size is compared with the threshold; for example, if the threshold is 50%, the derivative (20.00%) is accepted. In the second example, chunk 42000014 is presented. In this example, chunk 42000014 is such that one half of the chunk can best be derived from unit 9299997, while the other half of the chunk can best be derived from unit 9299998. Hence, a multi-unit derivative is created to yield further data reduction. The multi-unit derivation is shown in Fig. 8E. The cost of representing this multi-unit derivative is a 3-byte reference + 3 bytes + a 3-byte reference = 9 bytes, i.e., 25.71% of the original size. The reconstruction size is compared with the threshold; e.g., if the threshold is 50%, the derivative (25.71%) is accepted. Note that the best outcome from a single-unit derivation would have been 45.71%.
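The naming scheme of Fig. 8B (leading bytes drawn from the fingerprint-located dimensions, followed by the unit's remaining bytes) can be sketched as below. The function name, the dimension width, and the treatment of the remaining bytes are assumptions for illustration; the patent does not prescribe this exact layout:

```python
def unit_name(unit: bytes, dim_offsets, dim_len: int = 4) -> bytes:
    """Concatenate the bytes at the fingerprint-located dimensions to form
    the leading bytes of the name, followed by the unit's remaining bytes,
    so that ordering by name groups units with matching dimensions."""
    leading = b"".join(unit[o:o + dim_len] for o in dim_offsets)
    used = set()
    for o in dim_offsets:
        used.update(range(o, o + dim_len))
    rest = bytes(b for i, b in enumerate(unit) if i not in used)
    return leading + rest
```

Sorting candidate units by such names is what lets the tree structure of Figs. 3D and 3E navigate to the most suitable basic data units by content.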

Figures 8A-8E illustrate an important advantage of the Data Distillation™ system: that data reduction can be performed effectively while consuming and producing fixed-size blocks. Note that fixed-size blocks are highly desirable in a high-performance storage system. Using the Data Distillation™ apparatus, a large incoming input file comprised of numerous fixed-size blocks can be factorized into numerous fixed-size units, so that all basic data units are of a fixed size. The potentially variable-size reconstruction procedures for the derivative units are packed together and kept inline in the distilled data file, which can subsequently be chunked into fixed-size blocks. Thus, for all practical purposes, powerful data reduction can be performed while consuming and producing fixed-size blocks in the storage system.

Figures 9A-9C illustrate an example of the Data Distillation™ scheme first shown in Fig. 1C: a scheme that employs a separate basic reconstruction procedure filter that can be accessed in a content-associative manner. Such a structure enables the detection of a derivation that constructs a reconstruction procedure already present in the basic reconstruction procedure filter. Such a derivative can be reformulated to reference the existing reconstruction procedure. This enables the detection of redundancy among reconstruction procedures. In Fig. 9A, input data is ingested. A fingerprinting method is applied to the data, and chunk boundaries are set at the fingerprint positions. The input is factorized into 8 candidate units as shown (alternating chunks are shown in bold and regular font in Fig. 9A). In Fig. 9B, the 8 candidate units are shown organized in the filter. Each unit has a distinct name constructed from the entire content of the unit. In this example, the unit name is constructed as follows: bytes of data from two dimensions or fields (located by the anchor fingerprint and the second fingerprint, respectively) are concatenated to form the leading bytes of the name, followed by the remaining bytes. The name is used to order the units in the filter, and also provides content-associative access to the filter through the tree structure. Figure 9B also shows a second content-associative structure that contains the basic reconstruction procedures. Figure 9C illustrates duplicate reconstructions. Suppose a 55-byte candidate unit arrives (shown in Fig. 9C) that is not a duplicate of any basic data unit. Unit 3 is selected as the most suitable unit: the first 2 dimensions are the same for PDEs 2 and 3, but the remaining bytes starting with 88a7 match unit 3. The new input is derived from unit 3 with a 12-byte reconstruction procedure (RP). The encodings are as shown in Fig. 7A. Note for this example that the maximum unit size is 64 bytes, and all offsets and lengths are encoded as 6-bit values, rather than the 5-bit lengths and offsets shown in Fig. 7A. The basic reconstruction procedure filter is searched and this new RP is not found. The RP is inserted into the basic reconstruction procedure filter and ordered based on its value. The new unit is reformulated as a reference to basic data unit 3 and a reference to the newly created basic reconstruction procedure at reference 4 in the basic reconstruction procedure filter. The total storage size for this derivative unit is: a 3-byte PDE reference, a 3-byte RP reference, and the 12-byte RP = 18 bytes, i.e., 31.0% of the size of storing it as a PDE. Suppose later that a copy of the 55-byte candidate unit arrives. As before, a 12-byte RP is created based on unit 3. The basic reconstruction procedure filter is searched, and the RP with basic RP ID = 3 and RP reference = 4 is found. This candidate unit is represented in the system as a reference to basic data unit 3 and a reference to reconstruction procedure 4. The total storage size added for this derivative unit is now: a 3-byte PDE reference and a 3-byte RP reference = 6 bytes, i.e., 10.3% of the size of storing it as a PDE.
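The deduplication of reconstruction procedures described above can be sketched as follows. This is a hypothetical illustration: a plain dictionary stands in for the content-associative tree structure of the basic reconstruction procedure filter, and the class and method names are invented:

```python
class RPFilter:
    """Toy stand-in for the basic reconstruction procedure filter:
    a content-addressed store that keeps each distinct RP exactly once."""

    def __init__(self):
        self._by_value = {}  # RP bytes -> RP reference
        self._by_ref = []    # RP reference -> RP bytes

    def install(self, rp: bytes) -> int:
        """Return the reference for `rp`, inserting it only if it is new,
        so identical RPs arising from different derivatives share storage."""
        if rp in self._by_value:
            return self._by_value[rp]
        ref = len(self._by_ref)
        self._by_ref.append(rp)
        self._by_value[rp] = ref
        return ref
```

As in the 55-byte example, the first arrival of an RP pays for its full storage, while each later duplicate costs only the RP reference.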

Figure 10A provides an example, in accordance with some embodiments described herein, of how the transformations specified in a reconstruction procedure are applied to a basic data unit to yield a derivative unit. The example shows a derivative unit specified to be derived from basic data unit number 187126 (this basic data unit is also shown in the filter of Fig. 8C) by applying to that basic data unit the four transformations (insertion, replacement, deletion, and append) prescribed by the reconstruction procedure shown. As shown in Fig. 10A, unit 187126 is loaded from the filter, and the reconstruction procedure is executed to derive chunk 42000012 from unit 187126. Figures 10B-10C illustrate the data retrieval process in accordance with some embodiments described herein. Each data retrieval request essentially takes the form of a unit of the distilled data, presented to the retrieval engine in the losslessly reduced format. The losslessly reduced format for each unit contains references to the associated basic data unit(s) and the reconstruction procedure. The retriever of the Data Distillation™ apparatus fetches the basic data unit(s) and the reconstruction procedure, and furnishes them to the reconstructor for reconstruction. After the relevant basic data unit(s) and reconstruction procedure for a unit of distilled data have been fetched, the reconstructor executes the reconstruction procedure to generate the unit in its original unreduced form. The effort required by the data retrieval process to execute the reconstruction is linear in the size of the reconstruction procedure and the size of the basic data unit(s). Hence, high data retrieval rates can be achieved by the system.
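The execution of a reconstruction procedure against a basic data unit can be sketched as below. The tuple-based opcode encoding is invented for this sketch (the patent's on-disk encoding is the bit-packed format of Fig. 7A); the opcodes mirror the four transformations named above:

```python
def reconstruct(base: bytes, program) -> bytes:
    """Execute a reconstruction procedure, expressed here as a list of
    (opcode, ...) tuples, against a basic data unit to regenerate the
    derivative unit in its original unreduced form."""
    out = bytearray(base)
    for op in program:
        kind = op[0]
        if kind == "replace":      # ("replace", offset, new_bytes)
            _, off, data = op
            out[off:off + len(data)] = data
        elif kind == "insert":     # ("insert", offset, new_bytes)
            _, off, data = op
            out[off:off] = data
        elif kind == "delete":     # ("delete", offset, length)
            _, off, length = op
            del out[off:off + length]
        elif kind == "append":     # ("append", new_bytes)
            out += op[1]
    return bytes(out)
```

Each pass over the program touches the output once per transformation, which is why the reconstruction effort is linear in the sizes of the procedure and the basic data unit.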

It is evident that, in order to reconstruct a unit from its losslessly reduced form in the distilled data to its original unreduced form, only the basic data unit(s) and the reconstruction procedure specified for that unit need to be fetched. Thus, to reconstruct a given unit, no other units need to be accessed or reconstructed. This keeps the Data Distillation™ apparatus efficient even when serving a random sequence of requests for reconstruction and retrieval. Note that traditional compression methods, such as the Lempel-Ziv method, need to fetch and decompress the entire window of data containing the desired block. For example, if a storage system employs the Lempel-Ziv method to compress 4 KB blocks of data using a 32 KB window, then to fetch and decompress a given 4 KB block, the entire 32 KB window needs to be fetched and decompressed. This constitutes a performance penalty, since more bandwidth must be consumed and more data must be decompressed in order to deliver the desired data. The Data Distillation™ apparatus does not incur such a penalty.

The Data Distillation™ apparatus can be integrated into computer systems in a variety of ways, in order to organize and store data in a manner that efficiently discovers and exploits redundancy globally across the entire data in the system. Figures 11A-11G illustrate systems that include the Data Distillation™ mechanism (which can be implemented using software, hardware, or a combination thereof) in accordance with some embodiments described herein. Figure 11A presents a general-purpose computing platform, with software applications running on system software that executes on a hardware platform comprised of processors, memory, and data storage components. Figure 11B shows the Data Distillation™ apparatus integrated into the application layer of the platform, with each specific application using the apparatus to exploit redundancy within the dataset for that application. Figure 11C shows the Data Distillation™ apparatus employed to provide a data virtualization layer or service for all applications running above it. Figures 11D and 11E show two different forms of integration of the Data Distillation™ apparatus with the operating system, file system, and data management services of the sample computing platform. Other methods of integration include (but are not limited to) integration with an embedded computing stack in the hardware platform, such as the embedded computing stack in a flash-based data storage subsystem as shown in Fig. 11F.

Figure 11G presents additional details of the integration of the Data Distillation™ apparatus with the sample computing platform shown in Fig. 11D. Figure 11G shows the components of the Data Distillation™ apparatus, with the parser and factorizer, the deriver, the retriever, and the reconstructor executing as software on the general-purpose processor, and with the content-associative mapping structures residing across several levels of the storage hierarchy. The basic data filter can reside in storage media such as flash-based storage drives.

Figure 11H shows how the Data Distillation™ apparatus may interface with the sample general-purpose computing platform.

A file system (or filesystem) associates files (e.g., text documents, spreadsheets, executables, multimedia files, etc.) with identifiers (e.g., file names, file handles, etc.), and enables operations (e.g., read, write, insert, append, delete, etc.) to be performed on a file by using the identifier associated with the file. The namespace implemented by a file system can be flat or hierarchical. Moreover, the namespace can be layered, e.g., a top-level identifier may be resolved into one or more identifiers at successively lower levels until the top-level identifier is fully resolved. In this manner, a file system provides an abstraction of the physical data storage device(s) and/or storage media (e.g., computer memories, flash drives, disk drives, network storage devices, CD-ROMs, DVDs, etc.) that physically store the contents of the files.

The physical storage devices and/or storage media used for storing information in a file system may use one or more storage technologies, and may be located at the same network location or distributed across different network locations. Given an identifier associated with a file and one or more operations requested to be performed on the file, a file system can (1) identify one or more physical storage devices and/or storage media, and (2) cause the physical storage devices and/or storage media identified by the file system to effectuate the operations requested to be performed on the file associated with the identifier.

Different software and/or hardware components may be involved whenever a read or write operation is performed in a system. The term "reader" can refer to the collection of software and/or hardware components of a system that are involved when a given read operation is performed in the system, and the term "writer" can refer to the collection of software and/or hardware components of a system that are involved when a given write operation is performed in the system. Some embodiments of the data reduction methods and apparatuses described herein can be utilized by, or incorporated into, one or more of the software and/or hardware components of a system that are involved when a given read or write operation is performed. Different readers and writers may utilize or incorporate different data reduction implementations. However, each writer that utilizes or incorporates a particular data reduction implementation will correspond to a reader that also utilizes or incorporates the same data reduction implementation. Note that some read and write operations performed in the system may not utilize or incorporate the data reduction apparatus. For example, when the Data Distillation™ apparatus or data reduction apparatus 103 retrieves basic data units or adds new basic data units to the basic data filter, it can perform the read and write operations directly, without data reduction.

Specifically, in Fig. 11H, writer 150W can refer generally to the software and/or hardware components of the system that are involved when a given write operation is performed, and reader 150R can refer generally to the software and/or hardware components of the system that are involved when a given read operation is performed. As shown in Fig. 11H, writer 150W provides input data to the Data Distillation™ apparatus or data reduction apparatus 103, and receives distilled data 108 from the Data Distillation™ apparatus or data reduction apparatus 103. Reader 150R provides retrieval requests 109 to the Data Distillation™ apparatus or data reduction apparatus 103, and receives retrieved data output 113 from the Data Distillation™ apparatus or data reduction apparatus 103.

Implementation examples corresponding to Fig. 11H include, but are not limited to, incorporating or utilizing the Data Distillation™ apparatus or data reduction apparatus 103 in an application, an operating system kernel, a file system, a data management module, a device driver, or the firmware of a flash or disk drive. This spans the variety of configurations and usages described in Figs. 11B-11F.

Figure 11I illustrates how the Data Distillation™ apparatus can be used for data reduction in a block-processing storage system. In such a block-processing system, data is stored in blocks, each block identified by a logical block address, or LBA. Blocks are constantly modified and overwritten, so that new data may be rewritten into the block identified by a particular LBA. Every block in the system is treated as a candidate unit, and the Data Distillation™ apparatus can be used to reduce each candidate unit into a losslessly reduced form comprised of a reference to a basic data unit (stored in a particular basic data unit block) and, in the case of a derivative unit, a reference to a reconstruction procedure (stored in a particular reconstruction procedure block). Figure 11I introduces data structure 1151, which maps the content of the block identified by an LBA to the corresponding unit in the losslessly reduced form, so that for each LBA there resides a specification of the associated unit. For a system that employs fixed-size blocks, it is convenient for the incoming blocks, the basic data unit blocks 1152, and the reconstruction procedure blocks 1153 to all be of a fixed size. In this system, each basic data unit can be stored as a separate block. Multiple reconstruction procedures can be packed into a reconstruction procedure block that is also of the same fixed size. For each basic data unit and reconstruction procedure, the data structure also contains references to the count fields and associated metadata residing in the leaf node data structures, so that when a block is overwritten with new data, the previous data residing at the LBA can be managed efficiently: the count fields of the existing basic data unit and reconstruction procedure (which are being overwritten) must be decremented, and likewise the counts of the basic data units referenced by the data incoming to the LBA must be incremented. By keeping references to the count fields in data structure 1151, overwrites can be managed quickly, enabling a high-performance block-processing storage system that takes full advantage of the data reduction offered by the Data Distillation™ apparatus.
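The count-field bookkeeping on an overwrite can be sketched as follows. This is a minimal illustration with invented names; in the actual system the counts reside in the leaf node data structures and data structure 1151 holds references to them:

```python
class LBAMap:
    """Toy stand-in for data structure 1151: maps each LBA to the units
    referenced by its losslessly reduced form, and maintains per-unit
    reference counts so overwrites can be managed quickly."""

    def __init__(self):
        self.map = {}     # LBA -> list of referenced unit ids
        self.counts = {}  # unit id -> reference count

    def write(self, lba: int, unit_refs) -> None:
        for ref in self.map.get(lba, []):  # units being overwritten
            self.counts[ref] -= 1
        for ref in unit_refs:              # units referenced by the new data
            self.counts[ref] = self.counts.get(ref, 0) + 1
        self.map[lba] = list(unit_refs)
```

A unit whose count drops to zero is a candidate for reclamation from its basic data unit or reconstruction procedure block.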

Figure 12A illustrates the use of the Data Distillation™ apparatus, in accordance with some embodiments described herein, for communicating data across a bandwidth-constrained communication medium. In the setup shown, communication node A creates a set of files to be sent over to communication node B. Node A employs the Data Distillation™ apparatus to transform the input files into distilled data or distilled files, containing references to basic data units installed in a basic data filter, as well as reconstruction procedures for derivative units. Node A then sends the distilled files along with the basic data filter to node B (the basic data filter can be sent prior to, concurrently with, or after sending the distilled files; moreover, the basic data filter may be sent over the same communication channel as, or over a different communication channel than, the one used to send the distilled files). Node B installs the basic data filter at its end of the corresponding structure, and subsequently feeds the distilled files through the retriever and reconstructor residing in its Data Distillation™ apparatus to yield the original set of files created by node A. Thus, by employing the Data Distillation™ apparatus at both ends of the bandwidth-constrained communication medium so that only reduced data is sent, the medium is used more efficiently. Note that using Data Distillation™ allows redundancy to be exploited across a larger scope (beyond what is feasible with conventional techniques such as Lempel-Ziv), so that very large files or groups of files can be communicated efficiently.

We now discuss the use of the Data Distillation™ apparatus in a wide-area-network setup, where workgroups collaboratively share data that is spread across multiple nodes. When data is first created, it can be reduced and communicated as shown in Fig. 12A. Wide area networks keep copies of the data at each site to enable fast local access to the data. Use of the Data Distillation™ apparatus can reduce the footprint at each site. Furthermore, upon subsequent ingestion of fresh data at any site, any redundancy between the fresh data and the contents of the pre-existing basic data filter can be exploited to reduce the fresh data.

In such a setting, any modification of the data at any given site needs to be communicated to all other sites, so that the prime data sieve at each site remains consistent. Hence, as shown in FIG. 12B, in accordance with some embodiments described herein, updates such as installations and deletions of prime data elements, as well as metadata updates, can be communicated to the prime data sieve at each site. For example, upon installing a fresh prime data element into the sieve at a given site, the prime data element needs to be communicated to all other sites. Each site can use the value of the prime data element to access its sieve in a content-associative manner and determine where the new entry needs to be added in the sieve. Likewise, upon deleting a prime data element from the sieve at a given site, all other sites need to be updated to reflect the deletion. One way this can be accomplished is to communicate the prime data element to all sites, so that each site can use the prime data element to perform a content-associative access of its sieve in order to determine which entry in which leaf node needs to be deleted, make the necessary updates to the associated links in the tree, and delete the prime data element from the sieve. An alternative method is to communicate to all sites a reference to the entry for the prime data element in the leaf node in which it resides.
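As a loose illustration of the first approach, the following sketch broadcasts installs and deletes to every site. A plain dictionary keyed by element content stands in for the content-associative, tree-structured sieve described above; all names here are hypothetical and are not drawn from the embodiments.

```python
# Sketch: propagate prime-data-element installs and deletes so that
# the per-site sieves stay consistent. A dict keyed by element bytes
# plays the role of the content-associative tree lookup.
class Site:
    def __init__(self):
        self.sieve = {}  # element bytes -> entry metadata

    def install(self, element):
        # Content-associative placement: the element's own value
        # determines where its entry belongs in the sieve.
        self.sieve[element] = {"refcount": 0}

    def delete(self, element):
        self.sieve.pop(element, None)

def broadcast_install(sites, element):
    # An install at one site is communicated to every site.
    for site in sites:
        site.install(element)

def broadcast_delete(sites, element):
    # A deletion likewise must reach every site.
    for site in sites:
        site.delete(element)

sites = [Site() for _ in range(3)]
broadcast_install(sites, b"prime-element-A")
broadcast_delete(sites, b"prime-element-A")
```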

Thus, the Data Distillation™ apparatus can be used to reduce the footprint of the data stored at the various sites of a wide area network, as well as to make efficient use of the communication links of the network.

FIGS. 12C-12K illustrate the various components of the reduced data produced by the Data Distillation™ apparatus for various usage models, in accordance with some embodiments described herein.

FIG. 12C illustrates how the Data Distillation™ apparatus 1203 ingests input files 1201 and, upon completion of the distillation process, generates a set of distilled files 1205 and a prime data sieve or prime data store 1206. The prime data sieve or prime data store 1206 of FIG. 12C is itself composed of two components, namely the mapper 1207 and the prime data elements (or PDEs) 1208, as shown in FIG. 12D.

The mapper 1207 itself contains two components, namely the set of tree node data structures that define the overall tree, and the set of leaf node data structures. The set of tree node data structures can be placed into one or more files. Likewise, the set of leaf node data structures can be placed into one or more files. In some embodiments, a single file referred to as the tree node file holds the entire set of tree node data structures for the tree created for the prime data elements of a given dataset (input files 1201), and another single file referred to as the leaf node file holds the entire set of leaf node data structures for the tree created for the prime data elements of that dataset.

In FIG. 12D, the prime data elements 1208 comprise the set of prime data elements created for the given dataset (input files 1201). This set of prime data elements can be placed into one or more files. In some embodiments, a single file referred to as the PDE file holds the entire set of prime data elements created for the given dataset.

Tree nodes in the tree node file contain references to other tree nodes within the tree node file. Tree nodes at the deepest (or lowest) level of the tree node file contain references to entries in the leaf node data structures in the leaf node file. Entries in the leaf node data structures in the leaf node file contain references to prime data elements in the PDE file.
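This chain of references can be sketched as follows; the record layouts and field names are hypothetical stand-ins for the on-disk structures, not definitions taken from the embodiments above.

```python
# Sketch of the reference chain: tree nodes reference other tree nodes,
# the deepest tree nodes reference leaf-node entries, and leaf-node
# entries reference prime data elements by offset into the PDE file.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class LeafEntry:
    pde_offset: int   # byte offset of the prime data element in the PDE file
    pde_size: int     # size in bytes of the prime data element

@dataclass
class TreeNode:
    children: Dict[bytes, "TreeNode"] = field(default_factory=dict)
    leaf_entries: Dict[bytes, LeafEntry] = field(default_factory=dict)

def lookup(root: TreeNode, name: bytes) -> LeafEntry:
    """Walk the tree one leading byte at a time until a leaf entry is found."""
    node = root
    for i in range(len(name)):
        key = name[i:i + 1]
        if key in node.leaf_entries:
            return node.leaf_entries[key]
        node = node.children[key]
    raise KeyError(name)

# Build a tiny two-level tree for names beginning with the bytes "a", "b".
root = TreeNode()
root.children[b"a"] = TreeNode(leaf_entries={b"b": LeafEntry(0, 4096)})
```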

The tree node file, leaf node file, and PDE file are illustrated in FIG. 12E, which shows details of all the components created by the apparatus. FIG. 12E shows a set of input files 1201 comprising N files named file1, file2, file3, ... fileN, which are reduced by the Data Distillation™ apparatus to produce the set of distilled files 1205 and the various components of the prime data sieve, namely the tree node file 1209, the leaf node file 1210, and the PDE file 1211. The distilled files 1205 comprise N files named file1.dist, file2.dist, file3.dist, ... fileN.dist. The Data Distillation™ apparatus factorizes the input data into its constituent elements and creates two categories of data elements: prime data elements and derivative elements. The distilled files contain descriptions of the data elements in the losslessly reduced format, and contain references to prime data elements in the PDE file. Each file in input files 1201 has a corresponding distilled file in distilled files 1205. For example, file1 1212 in input files 1201 corresponds to the distilled file named file1.dist 1213 in distilled files 1205. FIG. 12R shows an alternative representation of the input dataset, specified as a set of input files and directories or folders.

It should be noted that FIG. 12E shows the various components created by the data distillation apparatus based on the organization of the distilled data and prime data sieve according to FIG. 1A, in which reconstitution programs are placed in the losslessly reduced representation of elements in the distilled file. It should be noted that some embodiments (according to FIG. 1B) may place the reconstitution programs in the prime data sieve and treat them like prime data elements. The losslessly reduced representation of an element in the distilled file will then contain a reference to the reconstitution program in the prime data sieve (rather than containing the reconstitution program itself). In these embodiments, reconstitution programs will be treated like prime data elements and produced in the PDE file 1211. In yet another embodiment, according to FIG. 1C, reconstitution programs are stored separately from the prime data elements, in a structure referred to as the reconstitution program store. In such embodiments, the losslessly reduced representation of an element in the distilled file will contain a reference to the reconstitution program in the reconstitution program store. In such embodiments, as shown in FIG. 12F, in addition to producing the tree node file 1209, leaf node file 1210, and PDE file 1211 corresponding to the tree organization of the prime data elements, the apparatus will also produce a second set of tree and leaf node files, referred to as the reconstitution tree node file 1219 and the reconstitution leaf node file 1220, along with a file containing all the reconstitution programs, referred to as the RP file 1221.

The Data Distillation™ apparatus shown in FIG. 12E also stores configuration and control information governing its operation in one or more of the tree node file 1209, the leaf node file 1210, the PDE file 1211, and the distilled files 1205. Alternatively, a fifth component containing this information may be generated. Similarly, for the apparatus shown in FIG. 12F, the configuration and control information may be stored in one or more of the components shown in FIG. 12F, or may be stored in another component generated for this purpose.

FIG. 12G illustrates an overview of the usage of the Data Distillation™ apparatus, in which a given dataset (input files 1221) is fed to the Data Distillation™ apparatus 1203 and processed to produce a losslessly reduced dataset (losslessly reduced dataset 1224). The input dataset 1221 may be comprised of a collection of files, objects, blocks, chunks, or extracts from a data stream. It should be noted that FIG. 12E illustrates the example in which the dataset is comprised of files. The input dataset 1221 of FIG. 12G corresponds to the input files 1201 of FIG. 12E, while the losslessly reduced dataset 1224 of FIG. 12G includes the four components shown in FIG. 12E, namely the distilled files 1205, the tree node file 1209, the leaf node file 1210, and the PDE file 1211 of FIG. 12E. In FIG. 12G, the Data Distillation™ apparatus exploits redundancy among data elements across the entire scope of the input dataset that is presented to it.

The Data Distillation™ apparatus can be configured to exploit redundancy within a subset of the input dataset, and deliver lossless reduction for each subset of data presented to it. For example, as shown in FIG. 12H, the input dataset 1221 can be partitioned into numerous smaller collections of data, each collection referred to in this disclosure as a "batch" or a "batch of data" or a "data batch". FIG. 12H illustrates the Data Distillation™ apparatus configured to ingest input data batch 1224 and produce losslessly reduced data batch 1225. FIG. 12H shows the input dataset 1221 comprised of a number of collections of data, namely data batch 1 ... data batch i ... data batch n. The data is presented to the Data Distillation™ apparatus one data batch at a time, and redundancy is exploited within the scope of each data batch to generate a losslessly reduced data batch. For example, data batch i 1226 from the input dataset 1221 is fed to the apparatus, and losslessly reduced data batch i 1228 is delivered to the losslessly reduced dataset 1227. Each data batch from the input dataset 1221 is fed to the apparatus, and the corresponding losslessly reduced data batch is delivered to the losslessly reduced dataset 1227. Upon consuming and reducing all of data batch 1 ... data batch i ... data batch n, the input dataset 1221 is reduced to the losslessly reduced dataset 1227.

While the design of the Data Distillation™ apparatus is already efficient at exploiting redundancy across a global scope of data, the foregoing technique can be used to further speed up the data reduction process and further improve its efficiency. The throughput of the data reduction process can be increased by limiting the size of a data batch to what can fit into the available memory of the system. For example, an input dataset whose size is many terabytes or even petabytes can be broken up into numerous data batches, each of size, say, 256 GB, and each data batch can be swiftly reduced. Using a single processor core (Intel Xeon E5-1650 V3, Haswell 3.5 GHz processor) with 256 GB of memory, such a solution exploiting redundancy within a scope of 256 GB has been implemented in our labs, yielding ingest rates of several hundred megabytes per second while delivering reduction levels of 2-3x on various datasets. It should be noted that a scope of 256 GB is many millions of times larger than 32 KB, which is the window size at which the Lempel-Ziv method delivers ingest performance of between 10 MB/s and 200 MB/s on contemporary processors. Thus, by suitably limiting the scope of the redundancy, improvements in the speed of the data distillation process can be achieved, at the cost of potentially sacrificing some reduction.
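The batch-at-a-time mode of operation can be sketched as follows. Both `split_into_batches` and `reduce_batch` are hypothetical stand-ins, and the sizes are scaled down from the 256 GB example purely for illustration.

```python
# Sketch: split an input dataset into batches no larger than the
# available memory, then reduce each batch independently so that the
# sieve state for one batch always fits in memory.
def split_into_batches(items, batch_limit):
    """items: (name, size) pairs; returns lists of names per batch."""
    batches, current, size = [], [], 0
    for name, item_size in items:
        if current and size + item_size > batch_limit:
            batches.append(current)   # close the batch at the limit
            current, size = [], 0
        current.append(name)
        size += item_size
    if current:
        batches.append(current)
    return batches

def reduce_batch(batch):
    # Placeholder for the per-batch distillation; here it only reports
    # which files were reduced together (i.e., share one sieve scope).
    return {"files": batch}

files = [("f1", 100), ("f2", 150), ("f3", 60), ("f4", 200)]
batches = split_into_batches(files, batch_limit=256)
reduced = [reduce_batch(b) for b in batches]
```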

FIG. 12I illustrates a variation of the setup in FIG. 12H, and shows multiple data distillation processes running on multiple processors to significantly boost the throughput of data reduction (and also of data reconstitution/retrieval) for the input dataset. FIG. 12I shows the input dataset 1201 partitioned into x data batches; the x independent data batches are fed into j independent processes running on independent processor cores (each process being allocated sufficient memory to hold any data batch that will be fed to it) for execution in parallel, yielding an approximately j-fold speedup for both data reduction and reconstitution/retrieval.
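The j-way parallel arrangement can be sketched as follows. A thread pool serves here as a simple, portable stand-in for j distillation processes pinned to separate processor cores, and `reduce_batch` is again a hypothetical placeholder.

```python
# Sketch: fan the x data batches out over j workers so that independent
# batches are reduced in parallel, approximating the j-fold speedup.
from concurrent.futures import ThreadPoolExecutor

def reduce_batch(batch):
    # Stand-in for the per-batch distillation process; this "reduction"
    # merely deduplicates the batch contents.
    return sorted(set(batch))

def parallel_reduce(batches, j):
    with ThreadPoolExecutor(max_workers=j) as pool:
        return list(pool.map(reduce_batch, batches))
```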

FIG. 12H illustrates the Data Distillation™ apparatus configured to ingest input data batch 1224 and produce losslessly reduced data batch 1225. FIG. 12H shows the input dataset 1221 comprised of a number of collections of data, namely data batch 1 ... data batch i ... data batch n. In some embodiments, an alternative partitioning scheme for dividing the input dataset into data batches can be employed, in which the data batch boundaries are determined dynamically so as to best utilize the available memory. The available memory can be used to hold, first, all the tree nodes; or it can be used to hold all the tree nodes as well as all the leaf nodes for the data batch; or, lastly, it can be used to hold all the tree nodes, the leaf nodes, and all the prime data elements. These three different choices give the apparatus alternative operating points. For example, dedicating the available memory to tree nodes allows a larger scope of data to be accommodated in a data batch, but it requires the apparatus to fetch leaf nodes and the associated prime data elements from storage when needed, incurring additional latency. Alternatively, dedicating the available memory to hold both tree nodes and leaf nodes speeds up distillation, but reduces the effective size of the tree and hence the scope of data that can be accommodated in a data batch. Finally, using the available memory to hold all tree nodes, leaf nodes, and prime data elements will achieve the fastest distillation, but the size of the data batch that can be supported as a single scope will be the smallest. In all these embodiments, a data batch is dynamically closed when the memory limit is reached, and subsequent files from the input dataset become part of a new data batch.
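The dynamic closing of batches under the three operating points can be sketched as follows. The per-element memory costs are invented solely to make the trade-off visible; they are not figures from the embodiments above.

```python
# Sketch: close a data batch dynamically when the in-memory footprint
# of the chosen structures reaches the memory limit. The three policies
# correspond to the three operating points described above.
COST = {  # hypothetical memory cost per element under each policy
    "tree_only": 1,       # tree nodes only: largest scope per batch
    "tree_and_leaf": 3,   # tree + leaf nodes: faster, smaller scope
    "tree_leaf_pde": 10,  # everything in memory: fastest, smallest scope
}

def batch_boundaries(element_counts, memory_limit, policy):
    """element_counts[i] = elements contributed by input file i.
    Returns lists of file indices; a batch closes when the limit is hit."""
    per_element = COST[policy]
    batches, current, used = [], [], 0
    for i, n in enumerate(element_counts):
        cost = n * per_element
        if current and used + cost > memory_limit:
            batches.append(current)   # memory limit reached: close batch
            current, used = [], 0
        current.append(i)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Running the three policies over the same files shows the trade-off: "tree_only" packs the most files into one batch, while "tree_leaf_pde" yields the smallest batches.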

Further refinements exist that can increase the efficiency of the apparatus and speed up the reconstitution process. In some embodiments, a single unified mapper is used for distillation, but rather than holding the prime data elements in a single PDE file, the prime data elements are held across n PDE files. Thus, the former single PDE file is partitioned into n PDE files, each smaller than a certain threshold size, with each new partition being created during distillation whenever the current PDE file exceeds that threshold size (as it grows due to the installation of prime data elements). Each input file is distilled by consulting the mapper to content-associatively select the suitable prime data elements for derivation, and then performing the derivation off the suitable prime data elements fetched from the particular PDE files in which they reside. Each distilled file is further augmented to list all the PDE files (among the n PDE files) that contain prime data elements referenced by that particular distilled file. To reconstitute a particular distilled file, only those listed PDE files need to be loaded or opened for access during reconstitution. The benefit of this is that, for the reconstitution of a single distilled file or a few distilled files, only those PDE files containing the prime data elements needed by the particular distilled file need to be accessed or kept active, while the other PDE files need not be retained or loaded into the faster tiers of memory or storage. Reconstitution can therefore be faster and more efficient.
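The selective loading enabled by the per-distilled-file list of PDE files can be sketched as follows; the dictionaries are illustrative stand-ins for the actual file formats.

```python
# Sketch: prime data elements live across n PDE files; each distilled
# file lists the PDE files it references, so reconstitution opens only
# those and leaves the rest untouched.
pde_files = {                       # pde_file_name -> {element_id: bytes}
    "pde_0": {1: b"AAAA", 2: b"BBBB"},
    "pde_1": {3: b"CCCC"},
    "pde_2": {4: b"DDDD"},
}

distilled = {
    "file1.dist": {
        "pde_files": ["pde_0"],     # the augmentation: PDE files referenced
        "elements": [("pde_0", 1), ("pde_0", 2)],
    },
    "file2.dist": {
        "pde_files": ["pde_0", "pde_2"],
        "elements": [("pde_2", 4), ("pde_0", 1)],
    },
}

def reconstitute(name):
    entry = distilled[name]
    # Load only the listed PDE files; e.g. file1.dist never touches
    # pde_1 or pde_2.
    loaded = {f: pde_files[f] for f in entry["pde_files"]}
    return b"".join(loaded[f][eid] for f, eid in entry["elements"])
```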

The partitioning of the PDE file into n PDE files can also be guided by a criterion that localizes the pattern of references made to prime data during the reduction of any given file in the dataset. The apparatus can be augmented with a counter that counts and estimates the density of references made to elements in the current PDE file. If this density is high, the PDE file is not partitioned or split, and continues to grow as subsequent elements are installed. Once the density of references in a given distilled file tapers off, the PDE file can be allowed to be split and partitioned when its subsequent growth exceeds a certain threshold. Once partitioned, a new PDE file is opened, and subsequent installations from subsequent distillation proceed into the new PDE file. This arrangement further speeds up reconstitution if only a subset of the files in a data batch needs to be reconstituted.
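The reference-density criterion can be sketched as a simple predicate; the counters and thresholds below are illustrative choices, not values prescribed by the embodiments above.

```python
# Sketch of the reference-density heuristic: keep growing the current
# PDE file while recent distillation still references it densely; allow
# a split once the density tapers off and the file exceeds a size
# threshold.
def should_split(recent_refs_to_current, recent_refs_total,
                 current_pde_size, size_threshold, density_threshold=0.5):
    if recent_refs_total == 0:
        density = 0.0
    else:
        density = recent_refs_to_current / recent_refs_total
    return density < density_threshold and current_pde_size > size_threshold
```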

FIG. 12J illustrates the components of the reduced data produced by the Data Distillation™ apparatus for a usage model in which the mapper no longer needs to be retained after reduction of the input dataset. Examples of such usage models are certain kinds of data backup and data archiving applications. In such a usage model, the only subsequent use of the reduced data is the reconstitution and retrieval of the input dataset from the reduced dataset. In such a scenario, the footprint of the reduced data can be further reduced by not storing the mapper after the data reduction is completed. FIG. 12J shows the input files 1201 fed to the apparatus, producing the distilled files 1205 and the PDE file 1211; these components comprise the reduced data in this scenario. It should be noted that the input files 1201 can be completely regenerated and recovered using only the distilled files 1205 and the PDE file 1211. Recall that the losslessly reduced representation of each element in a distilled file contains the reconstitution program (where needed) as well as references to prime data elements in the PDE file. Coupled with the PDE file, this is all the information needed to execute reconstitution. It should also be noted that this arrangement has an important benefit for the performance efficiency of reconstituting and retrieving the input dataset. In this embodiment, the apparatus decomposes the input dataset into distilled files and prime data elements contained in a separate PDE file. During reconstitution, the PDE file can first be loaded from storage into available memory, and the distilled files can then be read sequentially from storage for reconstitution. During the reconstitution of each distilled file, any prime data elements needed to reconstitute that distilled file are fetched rapidly from memory, without incurring any additional storage access latency for reading prime data elements. Each reconstituted file can be written out to storage as it is completed. This arrangement eliminates the need to perform random storage accesses, which would otherwise have a detrimental impact on performance. In this solution, the load of the PDE file from storage is a set of accesses to sequential, contiguous blocks of bytes; the read of each distilled file is likewise a set of accesses to sequential, contiguous blocks of bytes; and finally, each reconstituted input file is written out to storage as a set of accesses to sequential, contiguous blocks of bytes. The storage performance of this arrangement more closely tracks the performance of sequentially reading and writing contiguous blocks of bytes, rather than the performance of a solution that incurs numerous random storage accesses.
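The sequential reconstitution flow for this usage model can be sketched as follows; the blob-and-index layout of the PDE file is a hypothetical simplification, and all names are illustrative.

```python
# Sketch of the backup/archive reconstitution flow: load the PDE file
# into memory once, then stream each distilled file, resolving element
# references from memory, and write each reconstituted file out
# sequentially (no random storage accesses).
def load_pde_file(pde_blob, index):
    """index: element_id -> (offset, size) into the PDE blob."""
    return {eid: pde_blob[off:off + size]
            for eid, (off, size) in index.items()}

def reconstitute_all(distilled_files, pde_memory):
    out = {}
    for name, refs in distilled_files.items():  # sequential .dist reads
        out[name] = b"".join(pde_memory[eid] for eid in refs)
    return out                                   # written out sequentially

pde_blob = b"HELLOWORLD"
index = {1: (0, 5), 2: (5, 5)}
pde_memory = load_pde_file(pde_blob, index)      # one sequential load
restored = reconstitute_all({"f1": [1, 2], "f2": [2]}, pde_memory)
```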

It should be noted that FIG. 12J shows the various components created by the data distillation apparatus based on the organization of the distilled data and prime data sieve according to FIG. 1A, in which reconstitution programs are placed in the losslessly reduced representation of elements in the distilled file. It should be noted that some embodiments (according to FIG. 1B) may place the reconstitution programs in the prime data sieve and treat them like prime data elements. The losslessly reduced representation of an element in the distilled file will then contain a reference to the reconstitution program in the prime data sieve (rather than containing the reconstitution program itself). In these embodiments, reconstitution programs will be treated like prime data elements and produced in the PDE file 1211. In yet another embodiment, according to FIG. 1C, reconstitution programs are stored separately from the prime data elements, in a structure referred to as the reconstitution program store. In such embodiments, the losslessly reduced representation of an element in the distilled file will contain a reference to the reconstitution program in the reconstitution program store. In such embodiments, in addition to producing the PDE file for the prime data elements, the apparatus will also produce a file containing all the reconstitution programs, referred to as the RP file. This is shown in FIG. 12K, which illustrates the components of the reduced data for usage models in which the mapper no longer needs to be retained. FIG. 12K shows the reduced data components comprising the distilled files 1205, the PDE file 1211, and the RP file 1221.

FIGS. 12L-12P illustrate how the distillation process can be deployed and executed on distributed systems to accommodate very large datasets at very high ingest rates, in accordance with some embodiments described herein.

Distributed computing paradigms entail the distributed processing of large datasets by programs running on multiple computers. FIG. 12L shows a number of computers networked together in an organization referred to as a distributed computing cluster. FIG. 12L shows point-to-point links between the computers, but it will be understood that any communication topology, e.g., a hub-and-spoke topology or a mesh topology, can be used in place of the topology shown in FIG. 12L. In a given cluster, one node is designated as the master node, which distributes tasks to the slave nodes and controls and coordinates the overall operation. The slave nodes execute tasks as directed by the master node.

The data distillation process can be executed in a distributed fashion across the multiple nodes of a distributed computing cluster to harness the aggregate compute, memory, and storage capacity of the numerous computers in the cluster. In this setup, a master distillation module on the master node interacts with slave distillation modules running on the slave nodes to achieve data distillation in a distributed fashion. To facilitate this distribution, the prime data sieve of the apparatus can be partitioned into multiple independent subsets or subtrees that can be distributed across the multiple slave modules running on the slave nodes. Recall that in the data distillation apparatus, the prime data elements are organized in tree form based upon their names, and their names are derived from their content. The prime data sieve can be partitioned into multiple independent subsets or sub-sieves based upon the leading bytes of the names of the elements in the prime data sieve. There can be multiple ways to partition the name space across the multiple subtrees. For example, the values of the leading bytes of the element names can be partitioned into subranges, with each subrange assigned to a sub-sieve. There can be as many subsets or partitions created as there are slave modules in the cluster, so that each independent partition is deployed on a particular slave module. Using its deployed sub-sieve, each slave module is designed to execute the data distillation process on the candidate elements it receives.

FIG. 12M illustrates the partitioning of the prime data sieve into 4 prime data sieves or sub-sieves, labeled PDS_1, PDS_2, PDS_3, and PDS_4, to be deployed on 4 slave modules running on 4 nodes. The partitioning is based upon the leading byte of the names of the prime data elements. In the example shown, the leading byte of the names of all elements in PDS_1 will lie in the range A through I, and the sieve PDS_1 will bear the name A_I, labeled by the range of values that steer to it. Likewise, the leading byte of the names of all elements in PDS_2 will lie in the range J through O, and the sub-sieve PDS_2 will bear the name J_O, labeled by the range of values that steer to it. Likewise, the leading byte of the names of all elements in PDS_3 will lie in the range P through S, and the sub-sieve PDS_3 will bear the name P_S, labeled by the range of values that steer to it. Finally, the leading byte of the names of all elements in PDS_4 will lie in the range T through Z, and the sub-sieve PDS_4 will bear the name T_Z, labeled by the range of values that steer to it.
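The steering of candidate elements by leading byte, for the four sub-sieves of FIG. 12M, can be sketched as:

```python
# Sketch: route an element name to one of the four sub-sieves of
# FIG. 12M based on its leading byte (A-I, J-O, P-S, T-Z).
SUBSIEVES = [("A", "I", "A_I"), ("J", "O", "J_O"),
             ("P", "S", "P_S"), ("T", "Z", "T_Z")]

def route(name: str) -> str:
    lead = name[0].upper()
    for lo, hi, label in SUBSIEVES:
        if lo <= lead <= hi:
            return label
    raise ValueError("leading byte outside the partitioned name space")
```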

In this setup, the master module running on the master node receives the input file and performs a lightweight parsing and factorization of the input file to break it into a sequence of candidate elements, and subsequently steers each candidate element to a suitable slave module for further processing. The lightweight parsing may include parsing each candidate element against a schema, or may include applying fingerprinting to the candidate element, in order to determine the dimensions that constitute the leading bytes of the name of the candidate element. The parsing at the master is limited to identifying only as many bytes as is sufficient to determine which slave module should receive the candidate element. Based upon the value in the leading bytes of the name of the candidate element, the candidate is forwarded to the slave module at the slave node that holds the sub-sieve corresponding to that particular value.

As data accumulates in the sieve, the partitioning can be revisited and rebalanced intermittently. The partitioning and rebalancing functions can be executed by the master module.

Upon receiving a candidate element, each slave module executes the data distillation process, starting with the complete parsing and examination of the candidate element to create its name. Using this name, the slave module executes a content-associative lookup of its sub-sieve, and executes the distillation process to convert the candidate element into an element in the losslessly reduced representation with respect to that sub-sieve. The losslessly reduced representation of the element in the distilled file is augmented with a field referred to as the SlaveNumber, which identifies the slave module and the corresponding sub-sieve with respect to which the element was reduced. The losslessly reduced representation of the element is sent back to the master module. If the candidate element is not found in the sub-sieve, or cannot be derived from prime data elements in the sub-sieve, a fresh prime data element is identified to be installed into the sub-sieve.
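The slave-side step can be sketched as follows; a dictionary keyed by element name stands in for the content-associative sub-sieve, and the four-byte name derivation is an arbitrary placeholder rather than the name derivation of the embodiments.

```python
# Sketch of the slave-side step: a content-associative lookup of the
# sub-sieve; on a hit, emit a reduced representation tagged with the
# SlaveNumber; on a miss, install a fresh prime data element.
def distill_candidate(slave_number, sub_sieve, candidate):
    name = candidate[:4]   # placeholder stand-in for full name derivation
    if name in sub_sieve:
        # Hit: losslessly reduced representation, tagged with SlaveNumber.
        return {"slave": slave_number, "ref": name}
    # Miss: install the candidate as a fresh prime data element.
    sub_sieve[name] = candidate
    return {"slave": slave_number, "ref": name, "fresh": True}

sieve = {}
first = distill_candidate(2, sieve, b"WXYZ-payload")
second = distill_candidate(2, sieve, b"WXYZ-payload")
```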

主模块继续将来自输入文件的所有候选单元引导至适当的从模块并且累积进入的单元描述(以无损简化表示),直到其已经接收到输入文件的所有单元为止。此时,可以向所有的从模块发出全局提交通信,以便根据其各自的蒸馏处理的结果来更新其各自的子筛。针对输入的蒸馏文件存储在主模块中。The master module continues to guide all candidate cells from the input file to the appropriate slave modules and accumulates the incoming cell descriptions (in a lossless simplified representation) until it has received all cells from the input file. At this point, a global commit communication can be sent to all slave modules to update their respective subsieves based on the results of their respective distillation processes. The distillation file for the input is stored in the master module.

在一些实施例中,不是在任何从设备可以用新的基本数据单元或元数据来更新其子筛之前等待整个蒸馏文件被准备,而是可以在从模块处处理候选单元时完成对子筛的更新。In some embodiments, instead of waiting for the entire distillation file to be prepared before any slave device can update its subsieve with new basic data units or metadata, the update of the subsieve can be completed while the candidate units are being processed at the slave module.

在一些实施例中,根据图1B和1C的描述,每个子筛包含基本数据单元以及重建程序。在这样的实施例中,重建程序被存储在子筛中,并且无损简化表示包含对子筛中的基本数据单元以及重建程序(如果需要)的引用。这进一步减小了单元的大小,并因此减小了需要存储在主模块中的蒸馏文件的大小。在一些实施例中,每个子筛中的基本重建程序滤筛包含用于从驻留在该子筛中的基本数据单元创建导出项的那些重建程序。在这种情况下,基本重建程序在从节点本地可用,并且使得能够进行快速导出和重建,而不会产生以其它方式会从远程节点获取基本重建程序带来的任何延迟。在其它实施例中,基本重建程序滤筛跨所有节点全局分布,以利用分布式系统的总容量。通过第二字段增强无损简化表示,该第二字段识别包含基本重建程序的从节点或子筛。在这样的实施例中,该解决方案导致从远程节点获取基本重建程序的附加延迟,以便通过导出生成最终重建程序或者重建单元。整个方法利用所有从节点的组合存储容量来基于每个文件中的每个块或候选单元的内容在所有节点上分发文件。In some embodiments, as described in Figures 1B and 1C, each sub-sieve contains basic data units and reconstruction procedures. In such embodiments, reconstruction procedures are stored in the sub-sieve, and the lossless simplified representation contains references to the basic data units and reconstruction procedures (if needed) in the sub-sieve. This further reduces the unit size and thus the size of the distillation file that needs to be stored in the main module. In some embodiments, the basic reconstruction procedure filter in each sub-sieve contains those reconstruction procedures for creating derived items from the basic data units residing in that sub-sieve. In this case, the basic reconstruction procedures are locally available on the slave node, enabling fast export and reconstruction without any latency that would otherwise be incurred from obtaining the basic reconstruction procedures from remote nodes. In other embodiments, the basic reconstruction procedure filters are globally distributed across all nodes to utilize the total capacity of the distributed system. The lossless simplified representation is enhanced by a second field that identifies the slave node or sub-sieve containing the basic reconstruction procedures. In such embodiments, this solution results in additional latency in obtaining the basic reconstruction procedures from remote nodes to generate the final reconstruction procedure or reconstruction unit through export. 
The entire method utilizes the combined storage capacity of all slave nodes to distribute files across all nodes based on the content of each block or candidate unit in each file.

数据取回由主模块类似地协调。主模块接收蒸馏文件并检查蒸馏文件中每个单元的无损简化规格。其提取字段“SlaveNumber”,指示哪个从模块将重建该单元。该单元然后被发送到适当的从模块以进行重建。该重建的单元然后被发送回主模块。主模块汇编从所有从模块重建的单元,并将重建的文件转发给要求该文件的消费者。Data retrieval is likewise coordinated by the master module. The master module receives the distillation file and examines the lossless simplified specification of each unit in the distillation file. It extracts the field "SlaveNumber", which indicates which slave module will reconstruct that unit. The unit is then sent to the appropriate slave module for reconstruction. The reconstructed unit is then sent back to the master module. The master module assembles the units reconstructed by all the slave modules and forwards the reconstructed file to the consumer that requested it.
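The retrieval coordination above amounts to a dispatch-and-reassemble loop; a sketch under the simplifying assumption that every unit is a basic data unit directly addressable by its reference:

```python
def reconstruct_file(distillation_file, slaves):
    """distillation_file: ordered list of (slave_number, pde_ref) specs;
    slaves: dict mapping slave_number -> sub-sieve (list of byte strings).
    The master dispatches each spec to its slave and reassembles in order."""
    pieces = []
    for slave_number, pde_ref in distillation_file:
        sub_sieve = slaves[slave_number]   # the slave that reconstructs this unit
        pieces.append(sub_sieve[pde_ref])  # reconstructed unit returned to master
    return b"".join(pieces)
```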

图12N图示了可以如何在分布式系统中部署和执行数据蒸馏装置。输入文件1251被馈送到主模块,该模块解析并识别文件中每个候选单元的名称的前导字节。主模块将候选单元引导到4个从模块之一。持有具有名称A_I的包含具有在范围A至I中的名称承载值(Name bearing value)的前导字节的基本数据单元的PDS_1或子筛的从节点1处的从模块1接收带有名称BCD...的候选单元1252,其被确定为已经存在于名为A_I的子筛中的单元的副本。从模块1返回无损简化表示1253,该表示包含该单元为基本的指示符并且驻留在地址为refPDE1的Slave1处。如图12N所示,主模块将所有候选单元发送到相关的从模块并汇编和收集并最后存储蒸馏文件。Figure 12N illustrates how the data distillation apparatus can be deployed and executed in a distributed system. Input file 1251 is fed to the master module, which parses the file and identifies the leading bytes of the name of each candidate unit in it. The master module directs the candidate units to one of 4 slave modules. Slave module 1 at slave node 1, which holds PDS_1 or the sub-sieve named A_I containing basic data units whose leading bytes carry name-bearing values in the range A through I, receives candidate unit 1252 with the name BCD..., which is determined to be a duplicate of a unit already present in the sub-sieve named A_I. Slave module 1 returns lossless simplified representation 1253, which contains an indicator that the unit is a basic data unit residing at Slave1 at address refPDE1. As shown in Figure 12N, the master module sends all candidate units to the relevant slave modules, and assembles, collects, and finally stores the distillation file.

图12O示出了图12N所示的图示的变体。在该变体中,在蒸馏文件中每个单元的无损简化表示中,标识该单元已经相对其简化的特定Child_Sieve的字段包含该Child_Sieve的名称,而不是Child_Sieve驻留在其上的模块或节点的编号。因此,字段SlaveNumber被字段Child_Sieve_Name取代。这具有通过其虚拟地址而不是Child_Sieve所驻留的模块或物理节点的编号引用相关的Child_Sieve的益处。因此,如图12O所示,持有具有名称A_I的包含具有在范围A至I中的名称承载值的前导字节的基本数据单元的PDS_1或子筛的从节点1处的从模块1接收带有名称BCD...的候选单元1252,其被确定为已经存在于名为A_I的子筛中的单元的副本。从模块1返回无损简化表示1254,该表示包含该单元为基本的指示符并且驻留在地址为refPDE1的Slave1处。Figure 12O shows a variation of the illustration shown in Figure 12N. In this variation, in the lossless simplified representation of each unit in the distillation file, the field that identifies the specific Child_Sieve with respect to which the unit has been simplified contains the name of that Child_Sieve, rather than the number of the module or node on which the Child_Sieve resides. Hence, the SlaveNumber field is replaced by the Child_Sieve_Name field. This has the benefit of referencing the relevant Child_Sieve by its virtual address rather than by the number of the module or physical node on which the Child_Sieve resides. Thus, as shown in Figure 12O, slave module 1 at slave node 1, which holds PDS_1 or the sub-sieve named A_I containing basic data units whose leading bytes carry name-bearing values in the range A through I, receives candidate unit 1252 with the name BCD..., which is determined to be a duplicate of a unit already present in the sub-sieve named A_I. Slave module 1 returns lossless simplified representation 1254, which contains an indicator that the unit is a basic data unit residing at Slave1 at address refPDE1.

注意,通过采用图12L至图12O中描述的布置,可以提高数据蒸馏处理的整体吞吐率。主模块的吞吐量将受到轻量级解析和来自主模块的候选单元的调度的限制。对许多候选单元的蒸馏将并行执行,只要它们的内容将它们引导到不同的从模块。Note that by employing the arrangement described in Figures 12L to 12O, the overall throughput of data distillation processing can be improved. The throughput of the main module will be limited by lightweight parsing and the scheduling of candidate units from the main module. Distillation of many candidate units will be performed in parallel, as long as their contents direct them to different slave modules.

为了进一步提高整体吞吐量,可以并行化输入流的轻量级解析和因式分解以识别哪个Child_Sieve应当接收候选单元的任务。这个任务可以被主模块划分成多个并行任务,由多个从节点上运行的从模块并行执行。这可以通过预测数据流并将数据流分成多个部分重叠的段来完成。这些段由主模块发送到并行执行轻量级解析和因式分解的每个从模块,并将因式分解的结果发送给主模块。主模块分解跨越每个段的边界的因式分解,然后将候选单元路由到适当的从模块。To further improve overall throughput, the task of lightweight parsing and factorization of the input stream to identify which Child_Sieve should receive each candidate unit can be parallelized. This task can be divided by the master module into multiple parallel tasks, executed in parallel by slave modules running on multiple slave nodes. This can be accomplished by reading ahead in the data stream and dividing it into multiple partially overlapping segments. These segments are sent by the master module to each of the slave modules, which perform the lightweight parsing and factorization in parallel and send the results of the factorization back to the master module. The master module resolves the factorization across the boundary of each segment and then routes the candidate units to the appropriate slave modules.
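The segmentation for parallel parsing can be sketched as follows; the segment length and overlap are free parameters, with the overlap giving the master enough context to resolve factorizations that straddle a boundary:

```python
def split_into_segments(data: bytes, seg_len: int, overlap: int):
    """Split the stream into partially overlapping segments, returning
    (start_offset, segment) pairs that together cover the whole stream."""
    assert 0 < overlap < seg_len
    step = seg_len - overlap
    segments = []
    for start in range(0, len(data), step):
        segments.append((start, data[start:start + seg_len]))
        if start + seg_len >= len(data):   # last segment reaches the end
            break
    return segments
```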

图12L至图12O描述了其中数据蒸馏装置以分布式方式运行的布置,其中主蒸馏模块在主节点上运行,并且多个从蒸馏模块在从节点上运行。主模块负责执行基本数据单元跨各种子筛的分区。在所示的安排中,所有要摄取的输入文件都由主模块摄取,并且无损简化的蒸馏文件保留在主模块处,而所有基本数据单元(以及任何主要重建程序)驻留在各个从模块处的子筛中。对文件的数据取回请求也由主文件处理,并且对相应的蒸馏文件的重建由主文件协调。图12P示出了输入文件可以被任何从蒸馏模块(以及保留在那些模块中的相应蒸馏文件)摄取的变型,并且数据取回请求可以由任何从蒸馏模块处理。主模块继续以相同的方式在子筛上执行基本数据单元的分区,以便基本数据单元在各个子筛上的分布将与图12L到12O中所示的布置相同。但是,在图12P所示的新配置中,每个从模块都知道该分区,因为每个从模块都可以摄取和取回数据。另外,所有的模块都知道在由这些模块摄取数据时在每个模块上创建和存储的蒸馏文件的存在和位置。这允许任何从模块满足针对存储在整个系统中的任何文件的数据取回请求。Figures 12L to 12O depict an arrangement in which the data distillation apparatus operates in a distributed manner, with a master distillation module running on a master node and multiple slave distillation modules running on slave nodes. The master module is responsible for performing partitioning of basic data units across various subsieves. In the arrangement shown, all input files to be ingested are ingested by the master module, and lossless simplified distillation files remain at the master module, while all basic data units (and any major reconstruction procedures) reside in the subsieves at the respective slave modules. Data retrieval requests for files are also handled by the master module, and the reconstruction of the corresponding distillation files is coordinated by the master module. Figure 12P shows a variation where input files can be ingested by any slave distillation module (and the corresponding distillation files retained in those modules), and data retrieval requests can be handled by any slave distillation module. The master module continues to perform partitioning of basic data units on the subsieves in the same manner, so that the distribution of basic data units across the subsieves will be the same as the arrangement shown in Figures 12L to 12O. However, in the new configuration shown in Figure 12P, each slave module is aware of the partition because each slave module can ingest and retrieve data. 
Furthermore, all modules are aware of the existence and location of distillation files created and stored on each module when data is ingested by these modules. This allows any module to fulfill data retrieval requests for any file stored throughout the system.

如图12P所示,每个从模块可以从分布式存储系统摄取和取回数据。例如,从蒸馏模块1 1270摄入输入文件I 1271并且执行轻量级解析以将输入文件I因式分解并且将候选单元路由到包含与输入文件I中的每个候选单元的名称对应的子筛的模块。例如,来自输入文件I的候选单元1275被发送到从蒸馏模块2 1279。同样,从蒸馏模块2 1279摄取输入文件II并执行轻量级解析以将输入文件II因式分解并将候选单元路由到包含与输入文件II中的每个候选单元的名称对应的子筛的模块。例如,来自输入文件II的候选单元1277被发送到从蒸馏模块1 1270。每个从蒸馏模块处理它们接收的候选单元,完成关于其子筛的蒸馏过程,并将候选单元的无损简化表示返回到摄入数据的发起模块。例如,响应于从从蒸馏模块1 1270接收来自输入文件I的候选单元1275,从蒸馏模块2 1279将无损简化单元1276返回至从蒸馏模块1 1270。同样,响应于从从蒸馏模块2 1279接收来自输入文件II的候选单元1277,从蒸馏模块1 1270将无损简化单元1278返回到从蒸馏模块2 1279。As shown in Figure 12P, each slave module can ingest and retrieve data from the distributed storage system. For example, slave distillation module 1 1270 ingests input file I 1271 and performs lightweight parsing to factorize input file I and route the candidate units to the modules containing the sub-sieves corresponding to the names of the candidate units in input file I. For example, candidate unit 1275 from input file I is sent to slave distillation module 2 1279. Likewise, slave distillation module 2 1279 ingests input file II and performs lightweight parsing to factorize input file II and route the candidate units to the modules containing the sub-sieves corresponding to the names of the candidate units in input file II. For example, candidate unit 1277 from input file II is sent to slave distillation module 1 1270. Each slave distillation module processes the candidate units it receives, completes the distillation process with respect to its sub-sieve, and returns the lossless simplified representations of the candidate units to the initiating module that ingested the data. For example, in response to receiving candidate unit 1275 from input file I from slave distillation module 1 1270, slave distillation module 2 1279 returns losslessly simplified unit 1276 to slave distillation module 1 1270. Likewise, in response to receiving candidate unit 1277 from input file II from slave distillation module 2 1279, slave distillation module 1 1270 returns losslessly simplified unit 1278 to slave distillation module 2 1279.

在该布置中,可以在任何从模块处满足数据的取回。接收取回请求的模块需要首先确定该请求的文件的蒸馏文件所驻留的位置,并从相应的从模块获取蒸馏文件。随后,发起的从模块需要协调该蒸馏文件中各个单元的分布式重建,以产生原始文件并将其传送给请求的应用。In this setup, data retrieval can be satisfied at any slave module. The module receiving the retrieval request first needs to determine the location where the distillation file of the requested file resides and obtain the distillation file from the corresponding slave module. Subsequently, the initiating slave module needs to coordinate the distributed reconstruction of the individual units within the distillation file to generate the original file and deliver it to the requesting application.

以这种方式,可以在分布式系统的多个节点上以分布式方式执行数据蒸馏处理,以更有效地利用集群中的多个计算机的总计算、存储器和存储容量。系统中的所有节点都可以用来摄取和取回数据。这应该能够在充分利用系统中的节点的总的组合存储容量的同时,实现非常高的数据摄取和取回速率。这还允许在系统中的任何节点上运行的应用在本地节点上查询存储在系统中任何位置的任何数据,并且使该查询高效无缝地满足。In this way, data distillation can be performed in a distributed manner across multiple nodes in a distributed system to more effectively utilize the total computing, memory, and storage capacity of multiple computers in the cluster. All nodes in the system can be used to ingest and retrieve data. This should enable very high data ingestion and retrieval rates while fully utilizing the combined total storage capacity of the nodes in the system. It also allows applications running on any node in the system to query any data stored anywhere in the system on their local node, and to satisfy that query efficiently and seamlessly.

在图12M到12P所描述的布置中,跨驻留在系统的各个节点中的子筛的数据划分基于全局可见名称空间中的单元名称,其中各单元通过因式分解输入文件来提取。在另一种安排中,共享某些元数据的数据批次或整个文件组可以被分配并存储在特定的节点上。因此,整体数据的主分区基于数据批次,并且由主数据执行和管理。所有的从模块都保持知晓数据批次到模块的分配。数据批次将完全驻留在给定的从节点上。在该从节点上运行的蒸馏从模块上的子筛将包含属于该数据批次的所有基本数据单元。换句话说,给定数据批次的所有基本数据单元的整个树将完全驻留在单个从蒸馏模块内的单个子筛上。给定数据批次的所有蒸馏文件也将驻留在同一从蒸馏模块中。使用这种布置,输入文件仍然可以被任何从蒸馏模块摄取,并且数据取回请求仍然可以被任何从蒸馏模块处理。但是,给定数据批次的整个数据蒸馏处理在包含该数据批次的模块上完全执行。In the arrangements described in Figures 12M to 12P, the partitioning of data across the sub-sieves residing in the various nodes of the system is based on unit names in a globally visible namespace, where the units are extracted by factorizing the input files. In another arrangement, batches of data or entire groups of files that share certain metadata can be allocated to and stored on specific nodes. Thus, the primary partitioning of the overall data is based on data batches and is performed and managed by the master module. All slave modules remain aware of the assignment of data batches to modules. A data batch will reside entirely on a given slave node. The sub-sieve on the distillation slave module running on that slave node will contain all the basic data units belonging to that data batch. In other words, the entire tree of all basic data units of a given data batch will reside entirely on a single sub-sieve within a single slave distillation module. All distillation files of a given data batch will also reside in the same slave distillation module. Using this arrangement, input files can still be ingested by any slave distillation module, and data retrieval requests can still be handled by any slave distillation module. However, the entire data distillation process for a given data batch is performed entirely on the module containing that data batch.
Data ingestion and retrieval requests are routed from the initiating module to a specific slave module designated to hold a particular batch of data. This approach offers the advantage of reduced communication overhead in a distributed environment when factoring and distilling data batches. Redundancy is no longer employed across the entire global data footprint; instead, it is utilized very efficiently locally within the data batch. The approach still utilizes the combined storage capacity of the distributed system and provides the seamless capability to query, ingest, and retrieve any data from any node in the system.
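Under the data-batch arrangement described above, routing reduces to a lookup in a batch-to-module assignment that every module keeps a copy of. A toy sketch (the batch names and the map are invented for illustration):

```python
# Every module holds the same batch -> slave-module assignment, maintained by
# the master; any module can therefore forward a request to the right place.
BATCH_ASSIGNMENT = {
    "sales-2016": 1,   # this batch's sub-sieve, basic data units and
    "logs-2016": 2,    # distillation files live entirely on the named module
}

def module_for_request(batch: str) -> int:
    """Return the slave module designated to hold the given data batch."""
    return BATCH_ASSIGNMENT[batch]
```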

因此,采用上述众多技术,高效利用分布式系统中的资源来以非常高的速度在非常大的数据集上执行数据蒸馏。Therefore, by employing the aforementioned techniques, resources in a distributed system can be efficiently utilized to perform data distillation on very large datasets at very high speeds.

可以进一步增强Data DistillationTM方法和装置,以促进数据的高效移动和迁移。在一些实施例中,可以以多个容器或包裹的形式递送无损简化的数据集,以促进数据移动。在一些实施例中,一个或多个简化的数据批次可以装配到单个容器或包裹中,并且替代地,可以将单个简化的数据批次转换成多个包裹。在一些实施例中,单个简化的数据批次作为单个自描述包裹被递送。图12Q图示了这种包裹的示例结构。可以将图12Q中的包裹1280视为单个文件或字节的连续集合,其包含依次彼此串联的以下组成部分:(1)头部1281,它是包裹头部,包裹头部首先包含指定包裹的长度的包裹长度1282,其次包含识别蒸馏文件、PDE文件和各种清单在包裹中的位置的偏移量的偏移量标识符;(2)蒸馏文件1283,它是一个接一个地串联的数据批次的蒸馏文件,其中首先指定每个蒸馏文件的长度,然后是组成该蒸馏文件的所有字节;(3)PDE文件1284,它是PDE文件,从PDE文件的长度标识符开始,然后是包含所有基本数据单元的PDE文件的主体;(4)源清单1285,它是源清单,描述输入数据集的结构并识别包裹中每个文件的唯一目录结构、路径名和文件名。源清单还包含输入数据批次中的每个节点的列表(其已被简化并变成包裹)以及与每个节点相关联的元数据;(5)目的地清单和映射器1286,它是目的地清单和映射器。目的地映射器提供了每个输入节点和文件到目的地目录和文件结构或云中的目标存储桶/容器和对象/二进制大对象(bolb)结构的预期映射。该清单有助于在数据移动后将包裹中的各个组成部分移动、重建和重新放置到最终目的地。注意的是,可以单独更改该目的地映射器部分,以重新定位包裹中的数据要传输到和被重建的目的地。The Data Distillation method and apparatus can be further enhanced to facilitate efficient data movement and migration. In some embodiments, lossless simplified datasets can be delivered in the form of multiple containers or packages to facilitate data movement. In some embodiments, one or more simplified data batches can be assembled into a single container or package, and alternatively, a single simplified data batch can be converted into multiple packages. In some embodiments, a single simplified data batch is delivered as a single self-describing package. Figure 12Q illustrates an example structure of such a package. 
Package 1280 in Figure 12Q can be viewed as a single file or a contiguous collection of bytes, containing the following components concatenated sequentially: (1) Header 1281, which is the package header, first containing a package length 1282 specifying the length of the package, and then containing offset identifiers identifying the positions of distillation files, PDE files, and the various manifests within the package; (2) Distillation files 1283, which are distillation files of data batches concatenated one after another, where the length of each distillation file is first specified, followed by all the bytes that make up that distillation file; (3) PDE files 1284, which are PDE files, starting with the length identifier of the PDE file, followed by the body of the PDE file containing all the basic data units; (4) Source manifest 1285, which is the source manifest describing the structure of the input dataset and identifying the unique directory structure, pathname, and filename of each file in the package. The source manifest also contains a list of each node in the input data batch (which has been simplified and transformed into a package) and metadata associated with each node; (5) Destination manifest and mapper 1286, which is the destination manifest and mapper. The destination mapper provides the expected mapping from each input node and file to the destination directory and file structure, or to the target bucket/container and object/binary large object (blob) structure in the cloud. This manifest helps to move, rebuild, and reposition the individual components of the package to the final destination after data movement. Note that this destination mapper section can be modified on its own to redirect where the data in the package is to be transferred and reconstructed.
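The parcel layout lends itself to a straightforward length-plus-offsets encoding. The sketch below serializes the four sections behind a header holding the total parcel length followed by one offset per section; the 8-byte little-endian field width is an assumption, not something the text specifies:

```python
import struct

SECTION_COUNT = 4  # distillation files, PDE file, source manifest, destination manifest

def build_parcel(distillation: bytes, pde: bytes,
                 source_manifest: bytes, destination_manifest: bytes) -> bytes:
    """Serialize a self-describing parcel: header (total length + per-section
    offsets), then the sections concatenated in order."""
    sections = [distillation, pde, source_manifest, destination_manifest]
    header_len = 8 * (1 + SECTION_COUNT)
    offsets, pos = [], header_len
    for s in sections:
        offsets.append(pos)
        pos += len(s)
    header = struct.pack("<Q", pos) + b"".join(struct.pack("<Q", o) for o in offsets)
    return header + b"".join(sections)
```

Because the destination manifest is the last section and its offset is recorded in the header, it can be located and rewritten on its own to retarget the parcel, as the text notes.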

以这种方式,数据批次的无损简化表示作为包裹以自描述的并且适合于数据的移动和重定位的格式递送。In this way, the lossless simplified representation of data batches is delivered as packages in a self-describing format suitable for data movement and relocation.

使用本文描述的实施例在各种真实世界数据集上执行数据简化以确定这些实施例的有效性。所研究的真实世界数据集包括公司电子邮件的Enron语料库、美国政府的各种记录和文件、进入MongoDB NOSQL数据库的美国运输部记录以及向公众公开的公司PowerPoint演示文稿。使用这里描述的实施例并且将输入数据因式分解成平均4KB的可变大小的单元(边界由指纹确定),在这些数据集上实现了3.23x的平均数据简化。3.23x的简化意味着简化的数据的大小等于原始数据的大小除以3.23x,导致压缩比为31%的简化足迹。传统的重复数据删除技术被发现使用等效的参数在这些数据集上传递1.487x的数据简化。使用本文描述的实施例并将输入数据因式分解成4KB的固定大小的单元,在这些数据集上实现了1.86x的平均数据简化。传统的重复数据删除技术被发现使用等效的参数在这些数据集上传递1.08x的数据简化。因此,Data DistillationTM方案被发现比传统的重复数据删除方案提供了更好的数据简化。Data simplification was performed on various real-world datasets using the embodiments described herein to determine the effectiveness of these embodiments. The real-world datasets studied included the Enron corpus of corporate emails, various records and documents from the U.S. government, U.S. Department of Transportation records ingested into a MongoDB NoSQL database, and publicly available corporate PowerPoint presentations. Using the embodiments described herein and factorizing the input data into variable-size units averaging 4KB (with boundaries determined by fingerprints), an average data simplification of 3.23x was achieved on these datasets. A simplification of 3.23x means that the size of the simplified data equals the size of the original data divided by 3.23, resulting in a reduced footprint that is 31% of the original size. Conventional deduplication techniques were found to deliver 1.487x data simplification on these datasets using equivalent parameters. Using the embodiments described herein and factorizing the input data into fixed-size units of 4KB, an average data simplification of 1.86x was achieved on these datasets. Conventional deduplication techniques were found to deliver 1.08x data simplification on these datasets using equivalent parameters. Therefore, the Data Distillation scheme was found to provide better data simplification than conventional deduplication schemes.

测试运行还确认了基本数据单元的字节的小子集用于对滤筛中的大部分单元进行排序,从而实现对于其操作需要最小增量存储的方案。The test run also confirmed that a small subset of bytes of the basic data unit was used to sort most of the units in the filter, thus enabling a scheme that requires minimal incremental storage for its operations.

结果证实,Data DistillationTM装置有效地实现了以比该单元自身更精细的粒度在整个数据集上的全局数据单元之间利用冗余。这种方法实现的无损数据简化是通过数据访问和IO的经济实现的,采用的数据结构本身只需要最小的增量存储,并且使用现代多核微处理器上可用的总计算处理能力的一小部分。前面部分中描述的实施例的特征在于在大的和极大的数据集上执行无损数据简化,同时提供高速率的数据摄取和数据取回的系统和技术,并且不具有传统技术的缺点和局限性。The results confirm that the Data Distillation device effectively leverages redundancy across global data units throughout the dataset at a finer granularity than the unit itself. This lossless data simplification is achieved through economical data access and I/O, requiring minimal incremental storage for the data structures themselves and utilizing only a fraction of the total computational power available on modern multi-core microprocessors. The embodiments described in the preceding sections are characterized by systems and techniques for performing lossless data simplification on large and extremely large datasets, while providing high-rate data ingestion and retrieval, without the drawbacks and limitations of conventional techniques.

通过从驻留在基本数据滤筛中的基本数据单元导出数据，对已经无损简化的数据执行内容关联搜索和取回 Performing content-associative search and retrieval on data that has been losslessly simplified by deriving it from the basic data units residing in the basic data sieve

可以利用某些特征来增强在前面的文本中描述并且在图1A至图12P中示出的数据蒸馏装置,以便有效地对来自以无损简化格式存储的数据的信息执行多维搜索和内容关联取回。这种多维搜索和数据取回是分析或数据仓储应用的关键构件。现在将描述这些增强。Certain features can be leveraged to enhance the data distillation apparatus described in the preceding text and shown in Figures 1A to 12P, enabling efficient multidimensional search and content-related retrieval of information from data stored in a lossless, simplified format. This multidimensional search and data retrieval is a key component of analytics or data warehousing applications. These enhancements will now be described.

图13示出了类似于图3H所示的结构的叶节点数据结构。但是,如图13所示,每个基本数据单元的叶节点数据结构中的条目被增强,以包含对蒸馏数据中的所有单元的引用(其也被称为反向引用或反向链接),包括对该特定基本数据单元的引用。回想一下,数据蒸馏图式将来自输入文件的数据因式分解成使用诸如图1H中描述的规范以简化格式置于蒸馏文件中的单元序列。蒸馏文件中有两种单元-基本数据单元和导出单元。蒸馏文件中每个单元的规范都将包含对驻留在基本数据滤筛中的基本数据单元的引用。对于这些引用中的每一个(从蒸馏文件中的单元到基本数据滤筛中的基本数据单元),将会有相应的反向链接或反向引用(从叶节点数据结构中的基本数据单元的条目到蒸馏文件中的单元)安置在叶节点数据结构中。反向引用确定蒸馏文件中的偏移量,该偏移量标记单元的无损简化表示的开始。在一些实施例中,反向引用包括蒸馏文件的名称和定位该单元的开始的该文件内的偏移量。如图13中所示,除了对蒸馏文件中的每个单元的反向引用,叶节点数据结构还保持标识在蒸馏文件中被引用的单元是基本数据单元(基本)还是导出单元(导出项deriv)的指示符。在蒸馏过程中,在将单元放入蒸馏文件时,将反向链接安置到叶节点数据结构中。Figure 13 illustrates a leaf node data structure similar to that shown in Figure 3H. However, as shown in Figure 13, the entries in the leaf node data structure for each basic data unit are enhanced to include references to all units in the distillation data (also referred to as backreferences or backlinks), including references to that particular basic data unit. Recall that the data distillation schema factorizes data from the input file into a sequence of units placed in the distillation file in a simplified format using specifications such as those described in Figure 1H. There are two types of units in the distillation file—basic data units and derived units. The specification for each unit in the distillation file will contain references to the basic data units residing in the basic data filter. For each of these references (from a unit in the distillation file to a basic data unit in the basic data filter), a corresponding backlink or backreference (from an entry of the basic data unit in the leaf node data structure to a unit in the distillation file) will be placed in the leaf node data structure. The backreference determines an offset in the distillation file that marks the start of the lossless simplified representation of the unit. In some embodiments, the backreference includes the name of the distillation file and the offset within that file that positions the start of the unit. 
As shown in Figure 13, in addition to backreferences to each cell in the distillation file, the leaf node data structure also maintains an indicator that identifies whether the cell referenced in the distillation file is a basic data cell (basic) or a derived cell (deriv). During distillation, backlinks are placed into the leaf node data structure when cells are added to the distillation file.

反向参考或反向链接被设计为通用句柄,其可以到达共享基本数据滤筛的所有蒸馏文件中的所有单元。Reverse references or reverse links are designed as universal handles that can reach all cells in all distillation files that share a basic data filter.

预期反向引用的添加不会显著影响所达到的数据简化,因为预期数据单元的大小被选择为使得每个引用是数据单元的大小的一部分。例如,考虑一个系统,其中导出单元被约束为每个导出项不超过1个基本数据单元(因此不允许多单元导出项)。跨所有叶节点数据结构的反向引用的总数将等于所有蒸馏文件中单元的总数。假设32GB大小的样本输入数据集简化到8GB的无损简化数据,采用1KB的平均单元大小,并产生4倍的简化率。输入数据中有32M个单元。如果每个反向引用大小为8B,则反向引用占用的总空间为256MB或0.25GB。这对8GB足迹的简化数据是小的增加。新的足迹将达到8.25GB,有效降幅将达到3.88倍,相当于3%的简化损失。对于简化数据的强大的内容关联数据取回的好处,这是一个小的代价。The addition of expected backreferences will not significantly affect the data simplification achieved because the expected data unit size is chosen such that each reference is a portion of the data unit size. For example, consider a system where derived units are constrained to no more than one basic data unit per derived item (thus disallowing multi-unit derived items). The total number of backreferences across all leaf node data structures will equal the total number of units in all distilled files. Suppose a 32GB sample input dataset is simplified to 8GB of lossless simplified data with an average unit size of 1KB, resulting in a 4x simplification rate. There are 32M units in the input data. If each backreference is 8B in size, the total space occupied by backreferences is 256MB, or 0.25GB. This is a small increase for simplified data with an 8GB footprint. The new footprint will reach 8.25GB, resulting in an effective reduction of 3.88x, equivalent to a 3% simplification loss. This is a small cost for the powerful content-related data retrieval benefits of simplified data.
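The overhead arithmetic in the paragraph above can be replayed directly:

```python
# Back-reference overhead estimate from the paragraph above.
input_gb = 32
reduced_gb = 8                                  # 4x reduction
avg_unit_bytes = 1024                           # 1 KB average unit size
units = input_gb * 2**30 // avg_unit_bytes      # 32M units in the input
backref_gb = units * 8 / 2**30                  # 8 B per back-reference -> 0.25 GB
new_footprint_gb = reduced_gb + backref_gb      # 8.25 GB
effective_reduction = input_gb / new_footprint_gb  # ~3.88x instead of 4x
```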

如本文前面所述,蒸馏装置可采用多种方法来确定候选单元的内容内的骨骼数据结构的各个分量的位置。单元的骨架数据结构的各个分量可以被认为是维度,所以后面紧跟着每个单元的其余内容的这些维度的级联被用来创建每个单元的名称。该名称用于排序和组织树中的基本数据单元。As described earlier in this document, the distillation apparatus can employ a variety of methods to determine the locations of the various components of the skeletal data structure within the content of a candidate unit. The various components of a unit's skeletal data structure can be considered dimensions, so a concatenation of these dimensions, followed by the rest of the content of each unit, is used to create the name of each unit. This name is used to sort and organize the basic data units in the tree.

在已知输入数据的结构的使用模型中,图式定义了各个字段或维度。这种图式由正在使用该内容关联数据取回装置的分析应用提供,并通过该应用的接口提供给装置。基于图式中的声明,蒸馏装置的解析器能够解析候选单元的内容以检测和定位各个维度并创建候选单元的名称。如前所述,与维度对应的字段中具有相同内容的单元将沿着树的同一分支(leg)分组在一起。对于安置到滤筛中的每个基本数据单元,可以将维度上的信息作为元数据存储在叶节点数据结构中的基本数据单元的条目中。该信息可以包括在每个所声明的维度上的内容的位置、大小和值,并且可以存储在图13中被称为“基本数据单元的其它元数据”的字段中。In the usage model of the known input data structure, a schema defines the various fields or dimensions. This schema is provided by the analytics application that is using the content-related data retrieval device and is provided to the device through the application's interface. Based on the declarations in the schema, the parser of the distillation device is able to parse the content of candidate cells to detect and locate each dimension and create names for the candidate cells. As previously mentioned, cells with the same content in the fields corresponding to dimensions will be grouped together along the same branch (leg) of the tree. For each basic data cell placed in the filter, information on the dimension can be stored as metadata in the entry of the basic data cell in the leaf node data structure. This information may include the position, size, and value of the content on each declared dimension and can be stored in a field referred to as "Other Metadata of Basic Data Cells" in Figure 13.

图14A示出根据本文描述的一些实施例的提供输入数据集的结构的描述以及输入数据集的结构与维度之间的对应关系的描述的示例图式。结构描述1402是描述输入数据的完整结构的更完整图式的摘录或部分。结构描述1402包括键(key)列表(例如,“PROD_ID”、“MFG”、“MONTH”、“CUS_LOC”、“CATEGORY”和“PRICE”),后跟对应于该键的值的类型。冒号字符“:”被用作分隔符来分隔键和值的类型,分号字符“;”被用作分隔符来分隔不同的键对和值的相应类型。请注意,完整图式(结构1402是其中的一部分)可以指定附加字段来标识每个输入的开始和结束,也可能指定维度之外的其他字段。维度映射描述1404描述了用于组织基本数据单元的维度如何映射到结构化输入数据集中的键值。Figure 14A illustrates, in accordance with some embodiments described herein, an example schema that provides a description of the structure of an input dataset and a description of the correspondence between that structure and the dimensions. Structure description 1402 is an excerpt or partial portion of a more complete schema describing the full structure of the input data. Structure description 1402 includes a list of keys (e.g., "PROD_ID", "MFG", "MONTH", "CUS_LOC", "CATEGORY", and "PRICE"), each followed by the type of the value corresponding to that key. The colon character ":" is used as a separator between a key and the type of its value, and the semicolon character ";" is used as a separator between the different pairs of keys and their corresponding value types. Note that the complete schema (of which structure description 1402 is a part) may specify additional fields to identify the beginning and end of each input, and may also specify fields other than the dimensions. Dimension mapping description 1404 describes how the dimensions used to organize the basic data units map onto the keys and values of the structured input dataset.
For example, the first line in dimension mapping description 1404 specifies that the first four bytes corresponding to the value of the key "MFG" in the input dataset (because the first line ends with the text "prefix=4") are used to create dimension 1. The remaining lines in dimension mapping description 1404 describe how to create the other three dimensions based on the structured input data. In this key-to-dimension mapping, the order in which the keys appear in the input does not necessarily match the order of the dimensions. Using the provided graphical description, the parser can identify these dimensions in the input data to create names for candidate cells. For the example in Figure 14A, and using dimension mapping description 1404, the names of candidate cells will be created as follows: (1) The first 4 bytes of the name will be the first 4 bytes of the value corresponding to the key “MFG” declared as dimension 1; (2) The next 4 bytes of the name will be the first 4 bytes of the value corresponding to the key “CATEGORY” declared as dimension 2; (3) The next 3 bytes of the name will be the first 3 bytes of the value corresponding to the key “CUS_LOC” declared as dimension 3; (4) The next 3 bytes of the name will be the first 3 bytes of the value corresponding to the key “MONTH” declared as dimension 4; (5) The next set of bytes of the name will be a concatenation of the remaining bytes of the dimension; (6) And finally, after all the bytes of the dimension have been used up, the remaining bytes of the name will be created from the concatenation of the remaining bytes of the candidate cells.
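Following the schema of Figure 14A, the name-construction rule can be sketched like this; the record values below are invented for illustration, and the trailing content of the unit is abbreviated:

```python
# Declared dimensions in order, with the prefix length taken from each value
# (per dimension mapping description 1404).
DIMENSIONS = [("MFG", 4), ("CATEGORY", 4), ("CUS_LOC", 3), ("MONTH", 3)]

def make_name(record: dict, rest_of_content: bytes = b"") -> bytes:
    """Name = dimension prefixes, then remaining dimension bytes, then the
    remaining bytes of the candidate unit's content."""
    name = b"".join(record[key][:n] for key, n in DIMENSIONS)
    name += b"".join(record[key][n:] for key, n in DIMENSIONS)
    return name + rest_of_content
```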

由驱动该装置的应用提供的图式可以指定多个主维度以及多个第二维度。所有这些主维度和第二维度的信息都可以保留在叶节点数据结构的元数据中。主维度用于形成主轴,沿该主轴分类和组织滤筛中的单元。如果主维度耗尽且具有大量成员的子树仍然存在,则可以更深入树中使用第二维度,以将单元进一步细分为更小的组。有关第二维度的信息可以保留为元数据,也可以用作区分叶节点内单元的第二标准。在提供内容关联多维搜索和取回的一些实施例中,可以要求所有传入数据必须包含该图式所声明的每个维度的键和有效值。这允许系统确保只有有效数据进入滤筛中所需的子树。不包含被指定为维度的所有字段或者在与维度的字段对应的值中包含无效值的候选单元将被发送到如前面在图3E中图示的不同的子树中。The schema provided by the application driving the device can specify multiple primary dimensions and multiple secondary dimensions. Information about all these primary and secondary dimensions can be preserved in the metadata of the leaf node data structure. Primary dimensions are used to form the main axis along which cells in the filter are categorized and organized. If the primary dimensions are exhausted and a subtree with a large number of members still exists, secondary dimensions can be used deeper into the tree to further subdivide the cells into smaller groups. Information about the secondary dimensions can be preserved as metadata or used as a second criterion to distinguish cells within a leaf node. In some embodiments that provide content-related multidimensional search and retrieval, all incoming data can be required to contain the key and valid value for each dimension declared by the schema. This allows the system to ensure that only valid data enters the required subtree in the filter. Candidate cells that do not contain all fields specified as dimensions or contain invalid values in the values corresponding to the fields of a dimension will be sent to different subtrees as illustrated earlier in Figure 3E.

数据蒸馏装置以一种附加的方式被约束,以基于维度中的内容来全面地支持数据的内容关联搜索和取回。当从基本数据单元创建导出单元时,导出器被约束为确保基本数据单元和导出项在每个相应维度的值字段中具有完全相同的内容。因此,在创建导出项时,不允许重建程序干扰或修改与基本数据单元的任何维度相对应的值字段中的内容,以构造导出项。给定候选单元,在查找滤筛期间,如果候选单元与目标基本数据单元的相应维度相比在任何维度中具有不同的内容,则需要安置新的基本数据单元,而不接受导出项。例如,如果主维度的子集将单元充分分类到树中的不同组中,使得候选单元到达叶节点以找到在主维度的这个子集中具有相同内容但在剩余主维度或第二维度中具有不同内容的基本数据单元,那么,需要安置新的基本数据单元而不是创建导出项。该功能确保通过简单地查询基本数据滤筛可以使用维度搜索所有数据。The data distillation apparatus is constrained in one additional way so as to comprehensively support content-associative search and retrieval of data based on the content in the dimensions. When a derived unit is created from a basic data unit, the exporter is constrained to ensure that the basic data unit and the derived item have exactly the same content in the value fields of every corresponding dimension. Hence, when creating a derived item, the reconstruction program is not allowed to perturb or modify the content of the value fields corresponding to any dimension of the basic data unit in order to construct the derived item. Given a candidate unit, if during lookup in the filter the candidate unit has different content in any dimension compared with the corresponding dimension of the target basic data unit, a new basic data unit needs to be installed instead of accepting a derived item. For example, if a subset of the primary dimensions sufficiently sorts the units into distinct groups in the tree, so that a candidate unit arrives at a leaf node and finds a basic data unit that has identical content in this subset of the primary dimensions but different content in the remaining primary or secondary dimensions, then a new basic data unit needs to be installed rather than creating a derived item. This property ensures that all of the data can be searched using the dimensions by simply querying the basic data filter.

导出器可以采用各种实现技术来实行如下约束:候选单元和基本数据单元必须在每个对应维度的值字段中具有完全相同的内容。导出器可以从基本数据单元的骨架数据结构中提取包括与维度相对应的字段的位置、长度和内容的信息。类似地,该信息是从解析器/分解器接收的,或针对候选单元计算出的。接下来,可以比较候选单元和基本数据单元的维度的对应字段是否相等。一旦确认是相等的,导出器就可以继续进行其余导出。如果不相等,则候选单元将作为新的基本数据单元被安装在滤筛中。The exporter can employ various implementation techniques to enforce the constraint that candidate cells and basic data cells must have identical content in the value fields of each corresponding dimension. The exporter can extract information from the skeleton data structure of the basic data cell, including the position, length, and content of the fields corresponding to the dimensions. This information is similarly received from the parser/decomposer or computed for the candidate cell. Next, the corresponding fields of the dimensions of the candidate cell and the basic data cell can be compared for equality. Once equality is confirmed, the exporter can continue with the remaining export. If they are not equal, the candidate cell will be added to the filter as a new basic data cell.
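The equality check described above can be sketched minimally. This is an assumed interface, not the apparatus's actual code: in the real implementation the (offset, length) pairs of the dimension fields would come from the skeletal data structure and the parser/factorizer.

```python
# Minimal sketch of the exporter's gate: before attempting a derivation, the
# candidate unit and the basic data unit must agree byte-for-byte in every
# field declared as a dimension.

def dimensions_match(candidate: bytes, basic: bytes, dim_fields) -> bool:
    """dim_fields: iterable of (offset, length) for each declared dimension."""
    return all(candidate[o:o + n] == basic[o:o + n] for o, n in dim_fields)

def place_candidate(candidate, basic, dim_fields):
    # Equal in all dimensions: the exporter may proceed with the derivation.
    # Unequal in any dimension: install the candidate as a new basic data unit.
    return "derive" if dimensions_match(candidate, basic, dim_fields) else "install_new"

dims = [(0, 4), (4, 4)]                                          # two 4-byte dimension fields
print(place_candidate(b"NIKESHOE-red", b"NIKESHOE-blu", dims))   # derive
print(place_candidate(b"NIKEJERS-red", b"NIKESHOE-red", dims))   # install_new
```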

上述限制预计不会显著妨碍大多数使用模型的数据简化程度。例如,如果输入数据由一组单元组成,这组单元是数据仓储事务,每个数据仓储事务的大小为1000个字节,并且如果由图式指定了一组6个主维度和14个第二维度,每个维度8个字节的数据,内容在维度上占用的总字节数为160个字节。在创建导出项时,这些160个字节不允许有扰动。这仍然将留出候选单元数据的剩余的840个字节可用于扰动以创建导出项,从而为利用冗余留下充足的机会,同时使得能够使用维度以内容关联的方式来搜索和取回来自数据仓储的数据。The aforementioned limitations are not expected to significantly hinder the data simplification of most models. For example, if the input data consists of a set of units, which are data warehouse transactions, each 1000 bytes in size, and if the diagram specifies a set of 6 primary dimensions and 14 secondary dimensions, each with 8 bytes of data, the total number of bytes occupied by the content across the dimensions is 160 bytes. These 160 bytes are not allowed to be perturbed when creating derived items. This still leaves 840 bytes of candidate unit data available for perturbing to create derived items, thus providing ample opportunity to utilize redundancy while enabling the searching and retrieval of data from the data warehouse using dimensions in a content-related manner.

为了执行对包含维度中的字段的特定值的数据的搜索查询,装置可以遍历树并到达与指定的维度匹配的树中的节点,并且可以返回该节点下面的所有叶节点数据结构作为查找的结果。对存在叶节点处的基本数据单元的引用可用于在需要时获取所需的基本数据单元。如果需要,反向链接可以从蒸馏文件中取回输入单元(以无损简化格式)。该单元可以随后被重建以产生原始输入数据。因此,增强的装置允许对基本数据存储库(其是整个数据的较小子集)中的数据完成所有搜索,同时能够根据需要到达并取回所有的导出单元。To perform a search query on data containing specific values of fields in a dimension, the apparatus can traverse the tree and reach the node in the tree that matches the specified dimension, returning all leaf node data structures below that node as the search result. References to the basic data units present at leaf nodes can be used to retrieve the required basic data units when needed. If necessary, backlinks can retrieve input units (in a lossless simplified format) from a distillation file. This unit can then be reconstructed to produce the original input data. Therefore, the enhanced apparatus allows for complete searches of data in a basic data repository (which is a smaller subset of the entire data) while being able to reach and retrieve all derived units as needed.

增强的装置可以用于执行搜索和查找查询以基于查询所指定的维度中的内容来强大地搜索和取回数据的相关子集。内容关联数据取回查询将具有“获取(维度1,维度1的值;维度2,维度2的值;...)”的形式。查询将指定搜索中涉及的维度以及用于内容关联搜索和查找的每个指定维度的值。查询可以指定所有维度,也可以只指定维度的子集。查询可以指定基于多个维度的复合条件作为搜索和取回的标准。具有指定维度的指定值的滤筛中的所有数据都将被取回。The enhanced apparatus can be used to execute search and lookup queries that powerfully search for and retrieve relevant subsets of the data based on the content in the dimensions specified by the query. A content-associative data retrieval query will have the form "Fetch (Dimension 1, value of Dimension 1; Dimension 2, value of Dimension 2; ...)". The query will specify the dimensions involved in the search, along with the value of each specified dimension to be used for the content-associative search and lookup. A query may specify all the dimensions or only a subset of the dimensions. A query may specify compound conditions based on multiple dimensions as the criteria for search and retrieval. All data in the filter that has the specified values for the specified dimensions will be retrieved.

可以支持多种获取查询并使其可用于正在使用该内容关联数据取回装置的分析应用。这些查询将通过应用的接口提供给装置。接口向装置提供来自应用的查询,并将查询结果从装置返回给应用。首先,可以使用查询FetchRefs来为与查询相匹配的每个基本数据单元获取对图13中的叶节点数据结构(连同子ID或条目的索引)的引用或句柄。查询FetchMetaData的第二种形式可用于为匹配查询的每个基本数据单元从图13中的叶节点数据结构中的条目获取元数据(包括骨架数据结构、关于维度的信息和对基本数据单元的引用)。查询FetchPDEs的第三种形式将获取匹配搜索条件的所有基本数据单元。另一种形式的查询FetchDistilledElements将获取与搜索条件匹配的蒸馏文件中的所有单元。另一种形式的查询FetchElements将获取输入数据中与搜索条件匹配的所有单元。请注意,对于FetchElements查询,装置将首先获取蒸馏单元,然后将相关的蒸馏单元重建到来自输入数据的单元中,并将其作为查询的结果返回。A variety of Fetch queries can be supported and made available to analytics applications that are using this content-associative data retrieval apparatus. These queries are supplied to the apparatus through the application's interface. The interface provides queries from the application to the apparatus, and returns the results of a query from the apparatus back to the application. First, the query FetchRefs can be used to fetch, for each basic data unit matching the query, a reference or handle to the leaf node data structure of Figure 13 (along with the child ID, or index of the entry). A second form of query, FetchMetaData, can be used to fetch, for each basic data unit matching the query, the metadata from the entry in the leaf node data structure of Figure 13 (including the skeletal data structure, information on the dimensions, and the reference to the basic data unit). A third form of query, FetchPDEs, will fetch all basic data units matching the search criteria. Another form of query, FetchDistilledElements, will fetch all units in the distillation files that match the search criteria. Yet another form of query, FetchElements, will fetch all units in the input data that match the search criteria. Note that for a FetchElements query, the apparatus will first fetch the distilled units, then reconstruct the relevant distilled units into the units from the input data, and return these as the results of the query.
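The Fetch query forms can be mocked with a toy in-memory structure. The class, data layout, and dimension byte positions below are illustrative stand-ins (only the query names come from the text); the names of basic data units carry the dimensions at fixed byte positions as in the running example (dimension 1 = bytes 0..4, 2 = 4..8, 3 = 8..11, 4 = 11..14).

```python
# Mock of the Fetch query interface; a real apparatus would walk the tree
# rather than filter a flat dict of leaf-node entries.

DIM_SLICES = {1: slice(0, 4), 2: slice(4, 8), 3: slice(8, 11), 4: slice(11, 14)}

class MockSieve:
    def __init__(self, leaves, distilled):
        self.leaves = leaves        # name -> {"meta": ..., "backrefs": [unit ids]}
        self.distilled = distilled  # unit id -> reconstructed input unit

    def _match(self, query):        # query: {dimension number: value}
        return [n for n in self.leaves
                if all(n[DIM_SLICES[d]] == v for d, v in query.items())]

    def fetch_pdes(self, query):               # basic data units
        return self._match(query)

    def fetch_metadata(self, query):           # leaf-node entry metadata
        return {n: self.leaves[n]["meta"] for n in self._match(query)}

    def fetch_distilled_elements(self, query): # units in distillation files
        return sorted(i for n in self._match(query)
                      for i in self.leaves[n]["backrefs"])

    def fetch_elements(self, query):           # reconstructed input units
        return [self.distilled[i] for i in self.fetch_distilled_elements(query)]

sieve = MockSieve(
    {"NIKESHOELAHJUN": {"meta": "m1", "backrefs": [2, 3, 58]},
     "NIKEJERSLAHOCT": {"meta": "m2", "backrefs": [59]}},
    {2: "txn2", 3: "txn3", 58: "txn58", 59: "txn59"})

print(sieve.fetch_pdes({1: "NIKE"}))                 # both NIKE units
print(sieve.fetch_elements({1: "NIKE", 2: "SHOE"}))  # ['txn2', 'txn3', 'txn58']
```

Note how `fetch_elements` composes `fetch_distilled_elements` with reconstruction, mirroring the two-step behavior described for the FetchElements query.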

除了这样的多维内容关联获取原语之外,接口还可以向应用提供直接访问基本数据单元(使用对基本数据单元的引用)和蒸馏文件中的单元(使用对单元的反向引用)的能力。另外,接口可以向应用提供重建蒸馏文件中的蒸馏单元的能力(给定对蒸馏单元的引用),并且按照其存在于输入数据中的原样传递该单元。In addition to such multidimensional content-associative Fetch primitives, the interface can also provide the application with the ability to directly access basic data units (using a reference to the basic data unit) and units in a distillation file (using a reverse reference to the unit). Additionally, the interface can provide the application with the ability to reconstruct a distilled unit in a distillation file (given a reference to the distilled unit) and deliver that unit as it existed in the input data.

这些查询的明智组合可以被分析应用用来执行搜索,确定相关的联合和交叉点,并且收集重要的见解。The intelligent combination of these queries can be used by analytics applications to perform searches, identify relevant unions and intersections, and gather important insights.

下面解释的图14B示出了具有在结构描述1402中描述的结构的输入数据集的示例。在该示例中,包含在文件1405中的输入数据包含电子商务交易。该输入数据由数据蒸馏装置中的分析器使用图14A中的图式和维度声明转换成候选单元序列1406。请注意,每个候选单元的名称的前导字节是如何由维度的内容组成的。例如,候选单元1的名称1407的前导字节是PRINRACQNYCFEB。这些名称用于以树形式组织候选单元。数据简化完成后,蒸馏数据被放置在蒸馏文件1408中。Figure 14B, explained below, illustrates an example of an input dataset with the structure described in structure description 1402. In this example, the input data contained in file 1405 comprises e-commerce transactions. This input data is transformed into a sequence of candidate cells 1406 by an analyzer in the data distillation apparatus using the schema and dimension claims in Figure 14A. Note how the leading byte of each candidate cell's name is composed of the dimensions' contents. For example, the leading byte of the name 1407 for candidate cell 1 is PRINRACQNYCFEB. These names are used to organize the candidate cells in a tree structure. After data simplification, the distilled data is placed in distillation file 1408.

以下解释的图14C示出了如何使用维度映射描述1404来根据结构描述1402解析图14A中所示的输入数据集,根据维度映射描述1404来确定维度,并且基于所确定的维度在树中组织基本数据单元。在图14C中,使用来自4个维度的总共14个字符将基本数据单元组织在主树中。主树中示出的是各种基本数据单元的叶节点数据结构的一部分。注意,为了容易查看的目的,未示出图13的完整的叶节点数据结构。但是,图14C示出了叶节点数据结构中的每个条目的路径信息或名称,子ID,从基本数据单元到蒸馏文件中的单元的所有反向引用或反向链接,以及指示蒸馏文件中的单元是否是“基本(或素数:prime)“(用P表示)或“导出项”(用D表示)的指示符以及对基本数据单元的引用。图14C示出了映射到主树中的5个基本数据单元的蒸馏文件中的7个单元。在图14C中,具有名称PRINRACQNYCFEB的用于基本数据单元的反向链接A返回到蒸馏文件中的单元1。同时,名为NIKESHOELAHJUN的基本数据单元具有分别到单元2、单元3和单元58的3个反向链接B、C和E。请注意,单元3和单元58是单元2的导出项。Figure 14C, explained below, illustrates how the dimension mapping description 1404 is used to parse the input dataset shown in Figure 14A according to the structure description 1402, determine the dimensions according to the dimension mapping description 1404, and organize the basic data units in the tree based on the determined dimensions. In Figure 14C, the basic data units are organized in the main tree using a total of 14 characters from the 4 dimensions. The main tree shows a portion of the leaf node data structure for the various basic data units. Note that the complete leaf node data structure of Figure 13 is not shown for ease of viewing. However, Figure 14C shows the path information or name of each entry in the leaf node data structure, the sub-ID, all backreferences or backlinks from the basic data unit to the unit in the distillation file, and an indicator indicating whether the unit in the distillation file is a "prime" (denoted by P) or a "derived item" (denoted by D), as well as references to the basic data unit. Figure 14C shows 7 units in the distillation file mapped to 5 basic data units in the main tree. In Figure 14C, the backlink A for the basic data cell, named PRINRACQNYCFEB, returns to cell 1 in the distillation file. Meanwhile, the basic data cell named NIKESHOELAHJUN has three backlinks B, C, and E to cells 2, 3, and 58, respectively. Note that cells 3 and 58 are derivations of cell 2.

图14D示出了从维度创建的辅助索引或辅助树,以提高搜索的效率。在此示例中,创建的辅助映射树基于维度2(即CATEGORY)。通过直接遍历该辅助树,可以找到输入数据中给定CATEGORY的所有单元,而不需要否则可能会发生的更昂贵的主树遍历。例如,向下穿过用“SHOE”表示的腿直接导向鞋的两个基本数据单元ADIDSHOESJCSEP和NIKESHOELAHJUN。Figure 14D illustrates an auxiliary index or auxiliary tree created from a dimension to improve the efficiency of searches. In this example, the auxiliary mapping tree created is based on dimension 2 (i.e., CATEGORY). By directly traversing this auxiliary tree, all units in the input data with a given CATEGORY can be found without the more expensive traversal of the main tree that might otherwise be incurred. For example, descending down the leg labeled "SHOE" leads directly to the two basic data units for shoes, ADIDSHOESJCSEP and NIKESHOELAHJUN.

或者,这样的辅助树可以基于第二维度并且用于利用维度来帮助搜索的快速收敛。Alternatively, such an auxiliary tree can be based on a second dimension and used to leverage the dimension to help the search converge quickly.
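Such an auxiliary index can be sketched as a simple map from a dimension value to the basic data units carrying it. This is a hypothetical flat stand-in for the auxiliary tree of FIG. 14D, keyed on dimension 2 (CATEGORY, bytes 4..8 of a unit's name); the helper name is an assumption.

```python
# Build a secondary index on CATEGORY so a category lookup avoids the
# costlier main-tree traversal.
from collections import defaultdict

def build_category_index(basic_unit_names):
    idx = defaultdict(list)
    for name in basic_unit_names:
        idx[name[4:8]].append(name)   # dimension 2 occupies bytes 4..8
    return idx

idx = build_category_index(
    ["PRINRACQNYCFEB", "ADIDSHOESJCSEP", "NIKESHOELAHJUN", "NIKEJERSLAHOCT"])
print(idx["SHOE"])   # ['ADIDSHOESJCSEP', 'NIKESHOELAHJUN']
```

As in the figure, following the "SHOE" key leads directly to the two shoe basic data units.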

现在将提供在图14D中所示的装置上执行的查询的示例。查询FetchPDEs(维度1,NIKE;)将返回两个名为NIKESHOELAHJUN和NIKEJERSLAHOCT的基本数据单元。查询FetchDistilledElements(维度1,NIKE;)将返回单元2、单元3、单元58和单元59,其为无损简化格式的蒸馏单元。查询FetchElements(维度1,NIKE;维度2,SHOE)将从输入数据文件1405返回交易2、交易3和交易58。查询FetchMetadata(维度2,SHOES)将为名为ADIDSHOESJCSEP和NIKESHOELAHJUN的两个基本数据单元中的每一个返回存储在叶节点数据结构条目中的元数据。An example of a query executed on the apparatus shown in Figure 14D will now be provided. The query `FetchPDEs(Dimension 1, NIKE;)` will return two basic data units named `NIKESHOELAHJUN` and `NIKEJERSLAHOCT`. The query `FetchDistilledElements(Dimension 1, NIKE;)` will return units 2, 3, 58, and 59, which are distillation units in a lossless simplified format. The query `FetchElements(Dimension 1, NIKE; Dimension 2, SHOE)` will return transactions 2, 3, and 58 from input data file 1405. The query `FetchMetadata(Dimension 2, SHOES)` will return metadata stored in the leaf node data structure entry for each of the two basic data units named `ADIDSHOESJCSEP` and `NIKESHOELAHJUN`.

到目前为止所描述的装置可以用于支持基于在称为维度的字段中指定的内容的搜索。另外,该装置可以用于支持基于未包括在维度列表中的关键字列表的搜索。这样的关键字可以由诸如驱动该装置的搜索引擎之类的应用提供给该装置。关键字可以通过图式声明指定给装置,或者通过包含所有关键字的关键字列表传递,每个关键字由已声明的分隔符(如空格、逗号或换行)分隔。或者,可以使用图式以及关键字列表来共同指定所有关键字。可以指定非常多的关键字–装置对关键字的数量没有限制。这些搜索关键字将被称为Keywords(关键字)。该装置可以使用这些关键字来保持搜索的倒置索引。对于每个关键字,倒置索引包含对包含此关键字的蒸馏文件中的单元的反向引用列表。The apparatus described so far can be used to support searches based on content specified in the fields referred to as dimensions. In addition, the apparatus can be used to support searches based on a list of keywords that are not included in the list of dimensions. Such keywords can be supplied to the apparatus by an application, such as a search engine, that drives the apparatus. The keywords can be specified to the apparatus through a schema declaration, or passed in through a keyword list containing all the keywords, each separated by a declared delimiter (such as a space, comma, or newline). Alternatively, a schema as well as a keyword list can be used jointly to specify all the keywords. A very large number of keywords may be specified; the apparatus places no limit on the number of keywords. These search keywords will be referred to as Keywords. The apparatus can maintain an inverted index for search using these keywords. For each keyword, the inverted index contains a list of reverse references to units in the distillation files that contain this keyword.

基于图式或关键字列表中的关键字声明,蒸馏装置的解析器可以解析候选单元的内容,以检测和定位传入候选单元中的各个关键字(是否找到以及在哪儿找到)。随后,通过数据蒸馏装置将候选单元转换为基本数据单元或导出单元,并将其作为单元置于蒸馏文件中。在此单元中找到的关键字的倒置索引可以通过对蒸馏文件中的单元的反向引用来更新。对于在单元中找到的每个关键字,倒置索引被更新以包括对蒸馏文件中的该单元的反向引用。回想一下,蒸馏文件中的单元为无损简化表示。Based on keyword declarations in a schema or keyword list, the parser of the distillation apparatus can parse the contents of candidate cells to detect and locate individual keywords within the incoming candidate cells (whether they are found and where they are found). The candidate cells are then converted into basic data cells or derived cells by the data distillation apparatus and placed as cells in the distillation file. The inverted index of the keywords found in this cell can be updated using backreferences to the cells in the distillation file. For each keyword found in a cell, the inverted index is updated to include a backreference to that cell in the distillation file. Recall that the cells in the distillation file are represented using a lossless simplified representation.

在使用关键字对数据进行搜索查询之后,查询倒置索引以找到并提取对包含该关键字的蒸馏文件中的单元的反向引用。使用对这样的单元的反向引用,可以取回对该单元的无损简化表示,并且可以重建该单元。然后可以提供重建的单元作为搜索查询的结果。After performing a search query on the data using a keyword, the inverted index is queried to find and extract back references to cells in distillation files containing that keyword. Using these back references, a lossless simplified representation of the cell can be retrieved, and the cell can be reconstructed. The reconstructed cell can then be provided as a result of the search query.

可以增强倒置索引以包含定位关键字在重建单元中的偏移量的信息。请注意,候选单元中检测到的每个关键字的偏移量或位置可由解析器确定,因此,当将对蒸馏文件中的单元的反向引用置入倒置索引中时,也可将此信息记录在倒置索引中。在搜索查询时,在查询倒置索引以取回对包含相关关键字的蒸馏文件中的单元的反向引用之后,并且在该单元被重建之后,记录的关键字在重建单元中的偏移量或位置(与原始输入候选单元相同)可用于确定输入数据或输入文件中关键字的存在位置。The inverted index can be enhanced to include information about the offset of the locating keyword within the reconstructed cell. Note that the offset or position of each keyword detected in the candidate cell can be determined by the parser; therefore, this information can also be recorded in the inverted index when a backreference to a cell in the distillation file is placed there. During a search query, after querying the inverted index to retrieve a backreference to a cell in the distillation file containing the relevant keyword, and after that cell is reconstructed, the recorded offset or position of the keyword within the reconstructed cell (the same as the original input candidate cell) can be used to determine the location of the keyword in the input data or input file.

图15示出了便于根据关键字进行搜索的倒置索引。对于每个关键字,倒置索引包含成对的值-第一个是对包含关键字的蒸馏文件中的无损简化单元的反向引用,第二个值是关键字在重建单元中的偏移量。Figure 15 illustrates an inverted index that facilitates searching by keyword. For each keyword, the inverted index contains a pair of values—the first is a backreference to the lossless simplification cell in the distillation file containing the keyword, and the second value is the offset of the keyword in the reconstructed cell.
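The structure of FIG. 15 can be sketched as follows. The keyword list is the one from the running example; the text payloads and the `index_unit` helper are made-up illustrations, not the apparatus's code.

```python
# Sketch of the keyword inverted index: for each keyword found in a unit,
# store a pair (reverse reference to the unit in the distillation file,
# offset of the keyword within the reconstructed unit).

KEYWORDS = ["FEDERER", "LAVER", "SHARAPOVA"]

def index_unit(inv_index, unit_ref, text):
    for kw in KEYWORDS:
        off = text.find(kw)
        while off != -1:                      # record every occurrence
            inv_index.setdefault(kw, []).append((unit_ref, off))
            off = text.find(kw, off + 1)

inv = {}
index_unit(inv, 2, "NIKESHOE ... FEDERER edition")
index_unit(inv, 3, "NIKESHOE ... FEDERER edition, red")
print(inv["FEDERER"])   # [(2, 13), (3, 13)]
```

A keyword query then reads the list for that keyword, follows each reverse reference to retrieve and reconstruct the unit, and uses the stored offset to locate the keyword inside it.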

维度和关键字对数据蒸馏装置中的基本数据滤筛具有不同的含义。请注意,维度用作在滤筛中沿其组织基本数据单元的主轴。维度形成数据中每个单元的骨架数据结构。维度是根据传入数据结构的知识来声明的。导出器受到限制,使得创建的任何导出单元必须与每个相应维度的字段的值中的基本数据单元具有完全相同的内容。Dimensions and keywords have different meanings for the basic data filters in a data distillation apparatus. Note that dimensions serve as the main axis along which the basic data units are organized in the filter. Dimensions form the skeleton data structure of each unit in the data. Dimensions are declared based on knowledge of the incoming data structure. The derivator is restricted such that any derived unit created must have exactly the same content as the basic data unit in the value of each corresponding dimension's field.

不需要为关键字保持这些属性。在一些实施例中,既不存在关键字甚至存在于数据中的先验要求,也不是要求基于关键字组织基本数据滤筛,也不关于涉及包含关键字的导出项约束导出器。如果需要,导出器可以通过修改关键字的值来从基本数据单元自由创建导出项。关键字的位置仅记录在扫描输入数据时发现的地方,并且倒置索引被更新。在基于关键词的内容关联搜索时,查询倒置索引并获得关键词的所有位置。These properties do not need to be maintained for keywords. In some embodiments, there are no prior requirements for keywords to even exist in the data, nor is there a requirement to organize basic data filtering based on keywords, nor is there any constraint on the exporter involving derivations containing keywords. If needed, the exporter can freely create derivations from the basic data unit by modifying the value of the keyword. The position of the keyword is only recorded where it is found when scanning the input data, and the inverted index is updated. During keyword-based content association searches, the inverted index is queried and all positions of the keyword are obtained.

在其它实施例中,不需要关键字存在于数据中(数据中不存在关键字不会使数据无效),但是要求基本数据滤筛包含所有包含关键字的单元,并且导出器在涉及包含关键字的内容的导出方面受到约束–除简化重复项外,不允许任何导出。这些实施例的目的在于,包含任何关键字的所有不同单元必须存在于基本数据滤筛中。这是其中控制基本数据的选择的规则以关键字为条件的示例。在这些实施例中,可以创建修改的倒置索引,该倒置索引针对每个关键字包含对每个包含该关键字的基本数据单元的反向引用。在这些实施例中,实现了强大的基于关键字的搜索能力,其中仅搜索基本数据滤筛与搜索整个数据一样有效。In other embodiments, the keyword is not required to exist in the data (the absence of the keyword in the data does not invalidate the data), but the basic data filter is required to include all units containing the keyword, and the exporter is constrained in its export of content containing the keyword – no export is allowed except to reduce duplicates. The aim of these embodiments is that all distinct units containing any keyword must exist in the basic data filter. This is an example where the rules controlling the selection of basic data are conditional on keywords. In these embodiments, a modified inverted index can be created that contains a backreference to each basic data unit containing that keyword. In these embodiments, powerful keyword-based search capabilities are implemented, where searching only the basic data filter is as effective as searching the entire data.

可能存在其它实施例,其中导出器受到约束,使得不允许重建程序干扰或修改基本数据单元中找到的任何关键字的内容,以便将候选单元制定为该基本数据单元的导出单元。关键字需要从基本数据单元不变地传播到导出项。如果导出器需要修改在基本数据单元中找到的任何关键字的字节,以便成功地将候选项制定为该基本数据单元的导出项,那么导出项可以不被接受,并且该候选项必须作为新的基本数据单元被安装在滤筛中。Other embodiments may exist in which the exporter is constrained such that the reconstruction process is not allowed to interfere with or modify the content of any keywords found in the basic data unit in order to designate a candidate unit as a derived unit of that basic data unit. Keywords need to be propagated unchanged from the basic data unit to the derived item. If the exporter needs to modify the bytes of any keywords found in the basic data unit in order to successfully designate a candidate as a derived item of that basic data unit, then the derived item may be rejected, and the candidate must be installed in the filter as a new basic data unit.

关于涉及关键字的导出项,可以以各种方式来约束导出器,使得管理基本数据的选择的规则以关键字为条件。Regarding exported items involving keywords, the exporter can be constrained in various ways so that the rules for managing the selection of basic data are conditional on the keywords.

使用关键字搜索数据的装置可以接受对关键字的列表的更新。可以添加关键字而不会对以无损简化形式存储的数据进行任何更改。当添加新的关键字时,可以针对更新的关键字列表对新的输入数据进行解析,并且随着输入数据更新的倒置索引随后以无损简化形式被存储。如果现有数据(已经以无损简化形式存储)需要针对新的关键字进行索引,则装置可逐渐地读取蒸馏文件(一次一个或多个蒸馏文件,或者一次一个无损简化数据批次)、重建原始文件(但不会干扰无损简化存储的数据),并解析重建的文件以更新倒置索引。在这期间,整个数据存储库可以继续保持以无损简化形式存储。The apparatus for searching data using keywords can accept updates to the list of keywords. Keywords may be added without any changes to the data stored in the lossless simplified form. When new keywords are added, fresh input data can be parsed against the updated keyword list, and the inverted index, updated with this input data, is subsequently stored, while the input data is stored in the lossless simplified form. If existing data (already stored in the lossless simplified form) needs to be indexed against the new keywords, the apparatus can progressively read the distillation files (one or more distillation files at a time, or one batch of losslessly simplified data at a time), reconstruct the original files (without disturbing the losslessly simplified stored data), and parse the reconstructed files to update the inverted index. All this while, the entire data repository can continue to remain stored in the lossless simplified form.
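The incremental re-indexing flow just described can be sketched as a loop over distillation files. This is an assumed shape, not the apparatus's actual code: `reconstruct` stands in for reconstitution of a distillation file, and the toy "files" below are plain strings.

```python
# Index a newly added keyword over data already kept in lossless simplified
# form: reconstruct one distillation file at a time, scan the transient
# reconstruction, and update the inverted index; the stored form is untouched.

def reindex_new_keyword(distilled_files, reconstruct, keyword, inv_index):
    for file_id, dfile in enumerate(distilled_files):
        text = reconstruct(dfile)             # transient; stored data untouched
        off = text.find(keyword)
        while off != -1:
            inv_index.setdefault(keyword, []).append((file_id, off))
            off = text.find(keyword, off + 1)

inv = {}
# stand-in: reconstruction is the identity on these toy "files"
reindex_new_keyword(["alpha NADAL", "no match", "NADAL NADAL"],
                    lambda f: f, "NADAL", inv)
print(inv["NADAL"])   # [(0, 6), (2, 0), (2, 6)]
```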

图16A图示了图14A中所示的图式的变型的图式声明。图16A中的图式包括第二维度1609的声明和关键字列表1610。图16B示出了具有在结构描述1602中描述的结构的输入数据集1611的示例,其被解析并且被转换成具有基于所声明的主维度的名称的一组候选单元。候选单元被转换成蒸馏文件1613中的单元。第二维度“PROD_ID”的声明在导出器上设置约束,使得不能从基本数据单元“具有PROD_ID=348的NIKESHOELAHJUN”导出候选单元58,并且因此在基本数据滤筛中创建另外一个基本数据单元“具有PROD_ID=349的NIKESHOELAHJUN”。尽管输入数据集与图14B中所示的输入数据集相同,但是蒸馏的结果是仍产生7个蒸馏单元,但产生6个基本数据单元。图16C示出了作为蒸馏过程的结果创建的蒸馏文件、主树和基本数据单元。Figure 16A illustrates a schema declaration that is a variation of the schema shown in Figure 14A. The schema in Figure 16A includes the declaration of a secondary dimension 1609 and a keyword list 1610. Figure 16B shows an example of an input dataset 1611 with the structure described in structure description 1602, which is parsed and converted into a set of candidate units with names based on the declared primary dimensions. The candidate units are converted into units in the distillation file 1613. The declaration of the secondary dimension "PROD_ID" places a constraint on the exporter, so that candidate unit 58 may not be derived from the basic data unit "NIKESHOELAHJUN with PROD_ID=348", and hence one additional basic data unit, "NIKESHOELAHJUN with PROD_ID=349", is created in the basic data filter. Although the input dataset is identical to the input dataset shown in Figure 14B, the result of distillation is still 7 distilled units, but now 6 basic data units. Figure 16C shows the distillation file, the main tree, and the basic data units created as a result of the distillation process.

图16D示出了为第二维度“PROD_ID”创建的辅助树。使用特定的PROD_ID值遍历此树会导致具有该特定PROD_ID的基本数据单元。例如,查询FetchPDEs(维度5,251)或者请求PROD_ID=251的基本数据单元的查询FetchPDEs(PROD_ID,251)产生基本数据单元WILSBALLLAHNOV。Figure 16D illustrates the auxiliary tree created for the second dimension, “PROD_ID”. Traversing this tree using a specific PROD_ID value results in a basic data unit with that specific PROD_ID. For example, the query FetchPDEs(dimension 5, 251) or the query FetchPDEs(PROD_ID, 251) requesting a basic data unit with PROD_ID = 251 produces the basic data unit WILSBALLLAHNOV.

图16E示出了针对在图16A结构1610中声明的3个关键字创建的倒置索引(标记为关键字的倒置索引1631)。这些关键字是FEDERER、LAVER和SHARAPOVA。在解析和消费输入数据集1611之后更新倒置索引。查询FetchDistilledElements(关键字,Federer)将利用倒置索引(而不是主树或辅助树)来返回单元2、单元3和单元58。Figure 16E shows the inverted index (labeled as inverted index 1631) created for the three keywords declared in structure 1610 of Figure 16A. These keywords are FEDERER, LAVER, and SHARAPOVA. The inverted index is updated after parsing and consuming the input dataset 1611. The query FetchDistilledElements(keyword, Federer) will utilize the inverted index (instead of the main tree or auxiliary tree) to return cells 2, 3, and 58.

图17示出了针对内容关联数据取回增强的整个装置的框图。内容关联数据取回引擎1701向数据蒸馏装置提供图式1704或包括数据的维度的结构定义。其还向装置提供关键字列表1705。其发出查询1702用于从蒸馏装置搜索和取回数据,并接收查询的结果作为结果1703。导出器110被增强以知晓维度的声明从而在创建导出项时禁止在维度的位置修改内容。注意,从叶节点数据结构中的条目到蒸馏文件中的单元的反向引用被存储在基本数据滤筛106中的叶节点数据结构中。同样,辅助索引也被存储在基本数据滤筛106中。还示出了当单元被写入蒸馏数据时,由导出器110通过反向引用1709更新的倒置索引1707。该内容关联数据取回引擎与其他应用(例如分析,数据仓储和数据分析应用)进行交互,向他们提供执行的查询的结果。Figure 17 shows a block diagram of the overall apparatus enhanced for content-associative data retrieval. The content-associative data retrieval engine 1701 provides the data distillation apparatus with a schema 1704, or a structure definition that includes the dimensions of the data. It also provides the apparatus with a keyword list 1705. It issues queries 1702 to search for and retrieve data from the distillation apparatus, and receives the results of the queries as results 1703. The exporter 110 is enhanced to be aware of the dimension declarations, so that modification of content at the locations of the dimensions is prohibited when creating derived items. Note that the reverse references from entries in the leaf node data structures to units in the distillation files are stored in the leaf node data structures in the basic data filter 106. Likewise, the auxiliary indexes are also stored in the basic data filter 106. Also shown is the inverted index 1707, which is updated by the exporter 110 with reverse references 1709 as units are written to the distilled data. The content-associative data retrieval engine interacts with other applications (e.g., analytics, data warehousing, and data analytics applications), providing them with the results of the executed queries.

总而言之,增强的数据蒸馏装置能够对以无损简化形式存储的数据进行强大的多维内容关联搜索和取回。In summary, the enhanced data distillation device enables powerful multidimensional content-related searches and retrievals of data stored in a lossless, simplified form.

Data Distillation™装置可用于音频和视频数据的无损简化的用途。利用该方法完成的数据简化是通过从驻留在内容关联滤筛中的基本数据单元中导出音频和视频数据的分量实现的。下面说明所述方法对于这类用途的应用。The Data Distillation™ apparatus can be used for lossless simplification of audio and video data. Data simplification achieved using this method is accomplished by deriving components of the audio and video data from basic data units residing in a content-associative filter. The application of this method to such uses is described below.

图18A-18B表示用于按照MPEG 1,层3标准(也被称为MP3)的音频数据的压缩和解压缩的编码器和解码器的方框图。MP3是利用有损和无损数据简化技术的组合来压缩传入音频的数字音频的音频编码格式。它设法把紧致盘(CD)音频从1.4Mbps压缩到128Kbps。MP3利用人耳的局限性来抑制大多数人的人耳无法感知的音频的分量。为了实现这一点,采用统称为感知编码技术的一组技术,感知编码技术有损但察觉不到地减小一段音频数据的大小。感知编码技术是有损的,并且在这些步骤期间丢失的信息无法恢复。这些感知编码技术由霍夫曼编码补充,霍夫曼编码是本文中前面描述的无损数据简化技术。Figures 18A-18B show block diagrams of encoders and decoders used for compressing and decompressing audio data according to the MPEG 1, Layer 3 standard (also known as MP3). MP3 is an audio encoding format for digital audio that compresses incoming audio using a combination of lossy and lossless data reduction techniques. It manages to compress CD audio from 1.4 Mbps to 128 Kbps. MP3 takes advantage of the limitations of the human ear to suppress audio components that are imperceptible to most people. To achieve this, a set of techniques collectively known as perceptual coding techniques are used, which lossily but imperceptibly reduce the size of a segment of audio data. Perceptual coding techniques are lossy, and information lost during these steps cannot be recovered. These perceptual coding techniques are complemented by Huffman coding, a lossless data reduction technique described earlier in this paper.

在MP3中,传入的音频流被压缩成一系列的几个较小的数据帧,每个数据帧包含帧头和压缩音频数据。原始音频流被定期采样,以产生一系列的音频片段,随后采用感知编码和霍夫曼编码压缩所述一系列的音频片段,从而产生一系列的MP3数据帧。感知编码和霍夫曼编码技术都是在音频数据的每个片段中局部应用的。霍夫曼编码技术在音频片段内局部地,但不横跨音频流全局地利用冗余。从而,MP3技术既不横跨单个音频流,又不在多个音频流之间全局地利用冗余。这代表超越MP3所能达到的进一步数据简化的机会。In MP3, the incoming audio stream is compressed into a series of smaller data frames, each containing a frame header and compressed audio data. The original audio stream is periodically sampled to produce a series of audio segments, which are then compressed using perceptual coding and Huffman coding to produce a series of MP3 data frames. Both perceptual coding and Huffman coding techniques are applied locally within each segment of the audio data. Huffman coding utilizes redundancy locally within an audio segment, but not globally across the audio stream. Thus, MP3 technology does not utilize redundancy globally across a single audio stream, nor across multiple audio streams. This represents an opportunity for further data simplification beyond what MP3 can achieve.

每个MP3数据帧表示26ms的音频片段。每个帧保存1152个样本,并被细分成均包含576个样本的2个颗粒(granule)。在图18A中的编码器方框图中,可以看出在数字音频信号的编码期间,通过滤波处理,并通过应用改进的离散余弦变换(MDCT),获得时域样本,并变换成576个频域样本。应用感知编码技术,以减少包含在样本中的信息的量。感知编码的输出是非均匀量化颗粒1810,非均匀量化颗粒1810每个频率线包含减少的信息。随后利用霍夫曼编码进一步减小颗粒的大小。每个颗粒的576个频率线可把多个霍夫曼表用于它们的编码。霍夫曼编码的输出是包含缩放因子、霍夫曼编码比特和辅助数据的帧的主要数据分量。(用于表征和定位各个字段的)边信息被放入MP3头部中。编码的输出是MP3编码音频信号。在128Kbps的比特率下,MP3帧的大小为417或418字节。Each MP3 data frame represents a 26ms audio segment. Each frame stores 1152 samples, which are subdivided into two granules, each containing 576 samples. In the encoder block diagram in Figure 18A, it can be seen that during the encoding of the digital audio signal, filtering is performed, and a modified Discrete Cosine Transform (MDCT) is applied to obtain time-domain samples, which are then transformed into 576 frequency-domain samples. Perceptual coding is applied to reduce the amount of information contained in the samples. The output of perceptual coding is a non-uniform quantization granule 1810, where each frequency line contains the reduced information. Huffman coding is then used to further reduce the size of the granules. Multiple Huffman tables can be used for encoding the 576 frequency lines of each granule. The output of Huffman coding is the main data component of the frame, containing the scaling factor, Huffman-coded bits, and auxiliary data. Side information (used to characterize and locate individual fields) is placed in the MP3 header. The encoded output is the MP3 encoded audio signal. At a bitrate of 128Kbps, the size of an MP3 frame is 417 or 418 bytes.
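The 417/418-byte frame size quoted above follows from the standard MPEG-1 Layer III frame-size relation for a 1152-sample frame: size = floor(144 × bitrate / sample rate) + padding slot. The 44.1 kHz CD sample rate is assumed here, as is usual for 128 Kbps MP3.

```python
# Compute the MPEG-1 Layer III frame size in bytes for a given bitrate and
# sample rate; padding is 0 or 1 depending on the frame's padding bit.

def mp3_frame_bytes(bitrate_bps: int, sample_rate_hz: int, padding: int = 0) -> int:
    return 144 * bitrate_bps // sample_rate_hz + padding

print(mp3_frame_bytes(128_000, 44_100))     # 417
print(mp3_frame_bytes(128_000, 44_100, 1))  # 418 (padded frame)
```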

Figure 18C shows how the data distillation apparatus first shown in Figure 1A can be enhanced to perform data simplification on MP3 data. The method illustrated in Figure 18C factorizes the MP3 data into candidate units and exploits redundancy among the units at a grain finer than the units themselves. For MP3 data, granules are chosen as the units. In one embodiment, the non-uniformly quantized granule 1810 (shown in Figure 18A) can be treated as the unit. In another embodiment, the unit can be composed of the concatenation of the quantized frequency lines 1854 and the scale factors 1855.

In Figure 18C, an MP3-encoded data stream 1862 is received by the data distillation apparatus 1863 and is simplified into a distilled MP3 data stream 1868 that is stored in losslessly simplified form. The incoming MP3-encoded data stream 1862 consists of a sequence of pairs of MP3 header and MP3 data. The MP3 data includes the CRC, the side information, the main data, and the ancillary data. The outgoing distilled MP3 data created by the apparatus consists of a similar sequence of pairs (each pair being a DistMP3 header followed by a specification of a unit in losslessly simplified form). The DistMP3 header contains all components of the original frame except the main data; that is, it contains the MP3 header, the CRC, the side information, and the ancillary data. The unit field in the distilled MP3 data contains the granule specified in losslessly simplified form.
The parser/factorizer 1864 performs a first decoding of the incoming MP3-encoded stream (including performing Huffman decoding) to extract the quantized frequency lines 1854 and the scale factors 1855 (shown in Figure 18B) and to generate audio granules 1865 as candidate units. The first decoding steps performed by the parser/factorizer are the same as the sync and error checking 1851, Huffman decoding 1852, and scale factor decoding 1853 steps of Figure 18B; these steps are performed in any standard MP3 decoder and are well known in the art. The basic data filter 1866 contains granules organized as basic data units for access in a content-associative manner. During installation of a granule into the basic data filter, the contents of the granule are used to determine where the granule should be installed in the filter and to update the skeleton data structure and metadata in the appropriate leaf node of the filter. The granule is then Huffman encoded and compressed, so that it can be stored in the filter with a footprint no larger than the footprint it occupied when residing in the MP3 data. Whenever the deriver needs a granule from the filter as a basic data unit, the granule is decompressed before being provided to the deriver. Using the data distillation apparatus, incoming audio granules are derived by the deriver 1870 from basic data units residing in the filter (these basic data units also being audio granules), thereby creating a losslessly simplified or distilled representation of each granule, which is placed into the distilled MP3 data 1868. This distilled representation of the granule is placed into the unit field, replacing the Huffman-coded information that originally resided in the main data field of the MP3 frame.
The distilled representation of each unit or granule is encoded using the format shown in Figure 1H: each unit in the distilled data is either a basic data unit (accompanied by a reference to the basic data unit or basic granule in the filter) or a derived unit (accompanied by a reference to a basic data unit or basic granule in the filter, plus a reconstruction program that generates the derived unit from the referenced basic data unit). During the derivation step, the threshold for accepting a derivation can be set to a fraction of the size of the original Huffman-coded information residing in the main data field of the frame being simplified. Thus, a derivation will not be accepted unless the sum of the reconstruction program and the reference to the basic data unit is smaller than that fraction of the size of the corresponding main data field of the MP3-encoded frame (which contains the Huffman-coded data). If the sum of the reconstruction program and the reference to the basic data unit is smaller than that fraction of the size of the existing main data field of the encoded MP3 frame (which contains the Huffman-coded data), a decision can be made to accept the derivation.
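The acceptance test described above can be sketched as follows. The 0.5 default fraction and the byte counts in the example are illustrative assumptions, not values stated in the disclosure:

```python
# A derivation is kept only when the reference to the basic data unit plus the
# reconstruction program is smaller than a chosen fraction of the original
# Huffman-coded main data it would replace.
def accept_derivation(reference_bytes: int,
                      reconstruction_program_bytes: int,
                      original_main_data_bytes: int,
                      threshold_fraction: float = 0.5) -> bool:
    derived_size = reference_bytes + reconstruction_program_bytes
    return derived_size < threshold_fraction * original_main_data_bytes

# A 4-byte reference plus a 120-byte reconstruction program against a 417-byte
# main data field passes a 0.5 threshold; a 300-byte program does not.
```

When the test fails, the candidate granule is instead installed into the filter as a new basic data unit, so the distilled form is never larger than the threshold allows.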

The method described above makes it possible to exploit redundancy globally across the many audio granules stored in the apparatus. An MP3-encoded data file can be transformed into distilled MP3 data and stored in losslessly simplified form. When it needs to be retrieved, the data retrieval process (employing the retriever 1871 and the reconstructor 1872) can be invoked to reconstruct the MP3-encoded data 1873. In the apparatus shown in Figure 18C, the reconstructor is responsible for executing the reconstruction programs to generate the desired granules. It is additionally enhanced to perform the Huffman encoding step needed to generate the MP3-encoded data (shown as Huffman coding 1811 in Figure 18A). This data can then be fed to a standard MP3 decoder to play the audio.
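The reconstructor's role can be sketched with a toy in-memory filter, treating a reconstruction program as an opaque function over the referenced basic data unit. The `Unit` type and the patch function are hypothetical illustrations, not the apparatus's actual encoding:

```python
# A unit in the distilled data is either a basic data unit (a bare reference
# into the filter) or a derived unit (a reference plus a reconstruction
# program applied to the referenced basic data unit).
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Unit:
    reference: int                                       # index of basic data unit in the filter
    program: Optional[Callable[[bytes], bytes]] = None   # None => unit IS the basic data unit

def reconstruct(unit: Unit, filter_store: list) -> bytes:
    basic = filter_store[unit.reference]   # fetch (and, in practice, decompress)
    return basic if unit.program is None else unit.program(basic)

# A toy reconstruction program that patches two bytes of the basic granule:
store = [b"granule-ABCD"]
patched = Unit(0, lambda b: b[:8] + b"XY" + b[10:])
assert reconstruct(Unit(0), store) == b"granule-ABCD"
assert reconstruct(patched, store) == b"granule-XYCD"
```

In the apparatus, the reconstructed granule would then be Huffman encoded again before being reassembled into an MP3 frame.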

In this manner, the data distillation apparatus can be modified and employed to further reduce the size of MP3 audio files.

In another variation of this scheme, when an MP3-encoded stream is received, the parser/factorizer treats the entire main data field either as a candidate unit for derivation or as a basic data unit to be installed into the basic data filter. In this variation, all units remain Huffman encoded, and the reconstruction programs operate on units that have already been Huffman encoded. This variation of the data distillation apparatus can likewise be employed to further reduce the size of MP3 audio files.

In a manner similar to that described in the preceding sections and illustrated in Figures 18A-18C, the Data Distillation™ apparatus can be employed for the purpose of losslessly simplifying video data. The data simplification accomplished by this method is achieved by deriving components of the video data from basic data units residing in the content-associative filter. A video data stream contains audio and moving picture components. The method for distilling the audio component has already been described. The moving picture component will now be addressed. The moving picture component is typically organized as a series of groups of pictures. A group of pictures begins with an I-frame and is typically followed by a number of predicted frames (referred to as P-frames and B-frames). I-frames are typically large and contain a complete snapshot of the picture, while predicted frames are derived after employing techniques such as motion estimation relative to an I-frame or relative to other derived frames. Some embodiments of the Data Distillation™ apparatus extract the I-frames from the video data as units and execute the data distillation process upon them, so that certain I-frames are retained as basic data units residing in the content-associative filter while the remaining I-frames are derived from those basic data units. The described method makes it possible to exploit redundancy globally across multiple I-frames within a video file and across multiple video files.
Since I-frames are typically the bulky component of the moving picture data, this method will reduce the footprint of the moving picture component. Applying the distillation technique to the audio component as well as the moving picture component will help losslessly simplify the overall size of the video data.
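The factorization step above, which offers only the I-frames of each group of pictures to the distillation process while P- and B-frames pass through unchanged, can be sketched as follows (the `Frame` type is an illustrative stand-in for a parsed video frame):

```python
# Split a parsed moving-picture stream into I-frame candidate units and
# passthrough predicted frames, as the parser does for the distillation process.
from dataclasses import dataclass

@dataclass
class Frame:
    kind: str        # "I", "P", or "B"
    payload: bytes

def split_candidates(frames: list) -> tuple:
    """Return (I-frames offered as candidate units, passthrough predicted frames)."""
    i_frames = [f for f in frames if f.kind == "I"]
    predicted = [f for f in frames if f.kind != "I"]
    return i_frames, predicted

gop = [Frame("I", b"\x00" * 8), Frame("B", b"\x01"), Frame("P", b"\x02")]
candidates, passthrough = split_candidates(gop)
assert [f.kind for f in candidates] == ["I"]
assert [f.kind for f in passthrough] == ["B", "P"]
```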

Figure 19 shows how the data distillation apparatus first shown in Figure 1A can be enhanced to perform data simplification on video data. A video data stream 1902 is received by the data distillation apparatus 1903 and is simplified into a distilled video data stream 1908 that is stored in losslessly simplified form. The incoming video data stream 1902 comprises two components: compressed moving picture data and compressed audio data. The outgoing distilled video data created by the apparatus also comprises two components, namely compressed moving picture data and compressed audio data; however, the sizes of these components are further reduced by the data distillation apparatus 1903. The parser/decomposer 1904 extracts the compressed moving picture data and the compressed audio data from the video data stream 1902, and extracts (including performing any required Huffman decoding) the intra frames (I-frames) and the predicted frames from the compressed moving picture data. The I-frames are used as candidate units 1905 to perform content-associative lookups in the basic data filter 1906. The deriver 1910 uses the set of basic data units (that is, I-frames) returned by the basic data filter 1906 to generate losslessly simplified or distilled representations of the I-frames, and the losslessly simplified I-frames are placed into the distilled video data 1908.
The distilled representations are encoded using the format shown in Figure 1H: each unit in the distilled data is either a basic data unit (accompanied by a reference to the basic data unit in the filter) or a derived unit (accompanied by a reference to a basic data unit in the filter plus a reconstruction program that generates the derived unit from the referenced basic data unit). During the derivation step, the threshold for accepting a derivation can be set to a fraction of the size of the original I-frame. Thus, a derivation will not be accepted unless the sum of the reconstruction program and the reference to the basic data unit is smaller than this fraction of the size of the corresponding I-frame. If the sum of the reconstruction program and the reference to the basic data unit is smaller than this fraction of the size of the original I-frame, a decision can be made to accept the derivation.
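The per-I-frame decision can be sketched end to end. The delta encoder, the choice of which basic data unit to derive from, the 4-byte reference cost, and the 0.5 threshold are toy stand-ins (hypothetical, not the apparatus's actual content-associative machinery); the acceptance test mirrors the fraction-of-original-size rule in the text:

```python
def make_delta(base: bytes, target: bytes) -> bytes:
    """Toy reconstruction program: (offset, new_byte) patch pairs (offsets < 256)."""
    assert len(base) == len(target) and len(base) <= 256
    patches = bytearray()
    for i, (a, b) in enumerate(zip(base, target)):
        if a != b:
            patches += bytes([i, b])
    return bytes(patches)

def simplify_i_frame(i_frame: bytes, filter_store: list, threshold_fraction: float = 0.5):
    REF_BYTES = 4                                    # assumed cost of one reference
    if i_frame in filter_store:                      # duplicate: emit a bare reference
        return ("ref", filter_store.index(i_frame))
    if filter_store:                                 # try deriving from one basic data unit
        ref = 0                                      # toy choice; really a content lookup
        program = make_delta(filter_store[ref], i_frame)
        if REF_BYTES + len(program) < threshold_fraction * len(i_frame):
            return ("derived", ref, program)
    filter_store.append(i_frame)                     # install as a new basic data unit
    return ("ref", len(filter_store) - 1)
```

A frame identical to a stored basic data unit becomes a bare reference; a near-duplicate frame becomes a small derivation; a frame whose derivation would exceed the threshold is installed as a new basic data unit.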

The method described above makes it possible to exploit redundancy globally across the many I-frames of the multiple video data sets stored in the apparatus. When retrieval is needed, the data retrieval process (employing the retriever 1911 and the reconstructor 1912) can be invoked to reconstruct the video data 1913. In the apparatus shown in Figure 19, the reconstructor is responsible for executing the reconstruction programs to generate the desired I-frames. It is additionally enhanced to combine the compressed audio data with the compressed moving picture data (essentially the inverse of the extraction operation performed by the parser & decomposer 1904) to generate the video data 1913. This data can then be fed into a standard video decoder to play the video.
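The retrieval side can be sketched symmetrically, assuming a hypothetical reconstruction program encoded as (offset, new_byte) patch pairs and a simplified unit encoding of `("ref", index)` or `("derived", index, program)`; neither is the apparatus's actual format:

```python
# Execute a toy reconstruction program to regenerate an I-frame, then recombine
# the moving picture component with the audio component (the inverse of the
# parser/decomposer's extraction).
def apply_patches(basic_unit: bytes, program: bytes) -> bytes:
    frame = bytearray(basic_unit)
    for k in range(0, len(program), 2):
        frame[program[k]] = program[k + 1]
    return bytes(frame)

def retrieve_video(moving_picture_units: list, audio: bytes, filter_store: list) -> dict:
    rebuilt = []
    for u in moving_picture_units:
        if u[0] == "ref":                            # unit IS a basic data unit
            rebuilt.append(filter_store[u[1]])
        else:                                        # ("derived", ref, program)
            rebuilt.append(apply_patches(filter_store[u[1]], u[2]))
    return {"moving_picture": rebuilt, "audio": audio}
```

The rebuilt I-frames would then be Huffman encoded and interleaved with the predicted frames before being handed to a standard video decoder.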

In this manner, the data distillation apparatus can be adapted and employed to further reduce the size of video files.

The preceding description is presented to enable any person skilled in the art to make and use the embodiments. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein are applicable to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

The data structures and code described in this disclosure can be partially or fully stored on a computer-readable storage medium and/or a hardware module and/or hardware apparatus. A computer-readable storage medium includes, but is not limited to, volatile memory, non-volatile memory, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), or other media, now known or later developed, that are capable of storing code and/or data. Hardware modules or apparatuses described in this disclosure include, but are not limited to, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), dedicated or shared processors, and/or other hardware modules or apparatuses now known or later developed.

The methods and processes described in this disclosure can be partially or fully embodied as code and/or data stored in a computer-readable storage medium or device, so that when a computer system reads and executes the code and/or data, the computer system performs the associated methods and processes. The methods and processes can also be partially or fully embodied in hardware modules or apparatuses, so that when the hardware modules or apparatuses are activated, they perform the associated methods and processes. It should be noted that the methods and processes can be embodied using a combination of code, data, and hardware modules or apparatuses.

The foregoing descriptions of embodiments of the present invention have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention.

Claims (30)

1. A method for simplifying video data to obtain simplified video data, the method comprising:
extracting compressed moving picture data and compressed audio data from the video data;
extracting intra frames (I-frames) from the compressed moving picture data;
extracting predicted frames from the compressed moving picture data;
maintaining a data structure that organizes basic data units, each basic data unit comprising a sequence of bytes;
losslessly simplifying the I-frames to obtain losslessly simplified I-frames, wherein losslessly simplifying the I-frames comprises, for each I-frame:
identifying a first set of basic data units by using the I-frame to perform a first content-associative lookup on the data structure, which organizes the basic data units based on the contents of the basic data units, and
using the first set of basic data units to losslessly simplify the I-frame so as to obtain (i) a reference to a basic data unit in the first set of basic data units if the I-frame is a duplicate of that basic data unit, or (ii) a reference to one or more basic data units in the first set of basic data units and a sequence of transformations that derives the I-frame from the one or more basic data units if the I-frame is not a duplicate of any basic data unit in the first set of basic data units;
wherein the simplified video data comprises the losslessly simplified I-frames, the basic data units referenced by the losslessly simplified I-frames, the predicted frames, and the compressed audio data.

2. The method of claim 1, wherein using the first set of basic data units to losslessly simplify the I-frame comprises:
in response to determining that the sum of (i) sizes of references to the first set of basic data units and (ii) a size of a description of a reconstruction program is less than a threshold fraction of a size of the I-frame, generating a first losslessly simplified representation of the I-frame, wherein the first losslessly simplified representation comprises a reference to each basic data unit in the first set of basic data units and the description of the reconstruction program; and
in response to determining that the sum of (i) the sizes of the references to the first set of basic data units and (ii) the size of the description of the reconstruction program is greater than or equal to the threshold fraction of the size of the I-frame,
adding the I-frame to the data structure as a new basic data unit, and
generating a second losslessly simplified representation of the I-frame, wherein the second losslessly simplified representation comprises a reference to the new basic data unit.

3. The method of claim 2, wherein the description of the reconstruction program specifies a sequence of transformations which, when applied to the first set of basic data units, produces the I-frame.

4. The method of claim 1, wherein the method further comprises:
decompressing the compressed audio data to obtain a set of audio components; and
for each audio component in the set of audio components,
identifying a second set of basic data units by using the audio component to perform a second content-associative lookup on the data structure, which organizes the basic data units based on the contents of the basic data units, and
using the second set of basic data units to losslessly simplify the audio component;
wherein the simplified video data comprises the losslessly simplified I-frames, the basic data units referenced by the losslessly simplified I-frames, the predicted frames, the losslessly simplified audio components, and the basic data units referenced by the losslessly simplified audio components.

5. The method of claim 4, wherein using the second set of basic data units to losslessly simplify the audio component comprises:
in response to determining that the sum of (i) sizes of references to the second set of basic data units and (ii) a size of a description of a reconstruction program is less than a threshold fraction of a size of the audio component, generating a first losslessly simplified representation of the audio component, wherein the first losslessly simplified representation comprises a reference to each basic data unit in the second set of basic data units and the description of the reconstruction program; and
in response to determining that the sum of (i) the sizes of the references to the second set of basic data units and (ii) the size of the description of the reconstruction program is greater than or equal to the threshold fraction of the size of the audio component,
adding the audio component to the data structure as a new basic data unit, and
generating a second losslessly simplified representation of the audio component, wherein the second losslessly simplified representation comprises a reference to the new basic data unit.

6. The method of claim 5, wherein the description of the reconstruction program specifies a sequence of transformations which, when applied to the second set of basic data units, produces the audio component.

7. The method of claim 1, further comprising:
in response to receiving a request to retrieve the video data,
re-creating the compressed moving picture data, which comprises reconstructing the I-frames from the losslessly simplified I-frames and the basic data units referenced by the losslessly simplified I-frames, and combining the I-frames with the predicted frames, and
combining the compressed moving picture data with the compressed audio data to obtain the video data.

8. The method of claim 1, wherein extracting the I-frames from the compressed moving picture data comprises performing Huffman decoding.

9. The method of claim 8, further comprising performing Huffman encoding on the losslessly simplified I-frames and the basic data units referenced by the losslessly simplified I-frames to obtain Huffman-encoded losslessly simplified I-frames and Huffman-encoded basic data units, respectively, wherein the simplified video data comprises the Huffman-encoded losslessly simplified I-frames, the Huffman-encoded basic data units, the predicted frames, and the compressed audio data.

10. The method of claim 9, further comprising:
in response to receiving a request to retrieve the video data,
re-creating the compressed moving picture data by (1) performing Huffman decoding on the Huffman-encoded losslessly simplified I-frames to obtain the losslessly simplified I-frames, and performing Huffman decoding on the Huffman-encoded basic data units to obtain the basic data units, (2) reconstructing the I-frames from the losslessly simplified I-frames and the basic data units, and (3) performing Huffman encoding on the I-frames and combining them with the predicted frames, and
combining the compressed moving picture data with the compressed audio data to obtain the video data.

11. A computer-readable storage medium storing instructions that, when executed by a computer, cause the computer to perform a method for simplifying video data to obtain simplified video data, the method comprising:
extracting compressed moving picture data and compressed audio data from the video data;
extracting intra frames (I-frames) from the compressed moving picture data;
extracting predicted frames from the compressed moving picture data;
maintaining a data structure that organizes basic data units, each basic data unit comprising a sequence of bytes;
losslessly simplifying the I-frames to obtain losslessly simplified I-frames, wherein losslessly simplifying the I-frames comprises, for each I-frame:
identifying a first set of basic data units by using the I-frame to perform a first content-associative lookup on the data structure, which organizes the basic data units based on the contents of the basic data units, and
using the first set of basic data units to losslessly simplify the I-frame so as to obtain (i) a reference to a basic data unit in the first set of basic data units if the I-frame is a duplicate of that basic data unit, or (ii) a reference to one or more basic data units in the first set of basic data units and a sequence of transformations that derives the I-frame from the one or more basic data units if the I-frame is not a duplicate of any basic data unit in the first set of basic data units;
wherein the simplified video data comprises the losslessly simplified I-frames, the basic data units referenced by the losslessly simplified I-frames, the predicted frames, and the compressed audio data.

12. The computer-readable storage medium of claim 11, wherein using the first set of basic data units to losslessly simplify the I-frame comprises:
in response to determining that the sum of (i) sizes of references to the first set of basic data units and (ii) a size of a description of a reconstruction program is less than a threshold fraction of a size of the I-frame, generating a first losslessly simplified representation of the I-frame, wherein the first losslessly simplified representation comprises a reference to each basic data unit in the first set of basic data units and the description of the reconstruction program; and
in response to determining that the sum of (i) the sizes of the references to the first set of basic data units and (ii) the size of the description of the reconstruction program is greater than or equal to the threshold fraction of the size of the I-frame,
adding the I-frame to the data structure as a new basic data unit, and
generating a second losslessly simplified representation of the I-frame, wherein the second losslessly simplified representation comprises a reference to the new basic data unit.

13. The computer-readable storage medium of claim 12, wherein the description of the reconstruction program specifies a sequence of transformations which, when applied to the first set of basic data units, produces the I-frame.

14. The computer-readable storage medium of claim 11, wherein the method further comprises:
decompressing the compressed audio data to obtain a set of audio components; and
for each audio component in the set of audio components,
identifying a second set of basic data units by using the audio component to perform a second content-associative lookup on the data structure, which organizes the basic data units based on the contents of the basic data units, and
using the second set of basic data units to losslessly simplify the audio component;
wherein the simplified video data comprises the losslessly simplified I-frames, the basic data units referenced by the losslessly simplified I-frames, the predicted frames, the losslessly simplified audio components, and the basic data units referenced by the losslessly simplified audio components.

15. The computer-readable storage medium of claim 14, wherein using the second set of basic data units to losslessly simplify the audio component comprises:
in response to determining that the sum of (i) sizes of references to the second set of basic data units and (ii) a size of a description of a reconstruction program is less than a threshold fraction of a size of the audio component, generating a first losslessly simplified representation of the audio component, wherein the first losslessly simplified representation comprises a reference to each basic data unit in the second set of basic data units and the description of the reconstruction program; and
in response to determining that the sum of (i) the sizes of the references to the second set of basic data units and (ii) the size of the description of the reconstruction program is greater than or equal to the threshold fraction of the size of the audio component,
adding the audio component to the data structure as a new basic data unit, and
generating a second losslessly simplified representation of the audio component, wherein the second losslessly simplified representation comprises a reference to the new basic data unit.

16. The computer-readable storage medium of claim 15, wherein the description of the reconstruction program specifies a sequence of transformations which, when applied to the second set of basic data units, produces the audio component.

17. The computer-readable storage medium of claim 11, further comprising:
in response to receiving a request to retrieve the video data,
re-creating the compressed moving picture data, which comprises reconstructing the I-frames from the losslessly simplified I-frames and the basic data units referenced by the losslessly simplified I-frames, and combining the I-frames with the predicted frames, and
combining the compressed moving picture data with the compressed audio data to obtain the video data.

18. The computer-readable storage medium of claim 11, wherein extracting the I-frames from the compressed moving picture data comprises performing Huffman decoding.

19. The computer-readable storage medium of claim 18, further comprising performing Huffman encoding on the losslessly simplified I-frames and the basic data units referenced by the losslessly simplified I-frames to obtain Huffman-encoded losslessly simplified I-frames and Huffman-encoded basic data units, respectively, wherein the simplified video data comprises the Huffman-encoded losslessly simplified I-frames, the Huffman-encoded basic data units, the predicted frames, and the compressed audio data.

20. The computer-readable storage medium of claim 19, further comprising:
in response to receiving a request to retrieve the video data,
re-creating the compressed moving picture data by (1) performing Huffman decoding on the Huffman-encoded losslessly simplified I-frames to obtain the losslessly simplified I-frames, and performing Huffman decoding on the Huffman-encoded basic data units to obtain the basic data units, (2) reconstructing the I-frames from the losslessly simplified I-frames and the basic data units, and (3) performing Huffman encoding on the I-frames and combining them with the predicted frames, and
combining the compressed moving picture data with the compressed audio data to obtain the video data.

21. An electronic apparatus, comprising:
a processor; and
a memory storing instructions that, when executed by the processor, cause the processor to perform a method for simplifying video data to obtain simplified video data, the method comprising:
extracting compressed moving picture data and compressed audio data from the video data;
extracting intra frames (I-frames) from the compressed moving picture data;
extracting predicted frames from the compressed moving picture data;
maintaining a data structure that organizes basic data units, each basic data unit comprising a sequence of bytes;
losslessly simplifying the I-frames to obtain losslessly simplified I-frames, wherein losslessly simplifying the I-frames comprises, for each I-frame:
identifying a first set of basic data units by using the I-frame to perform a first content-associative lookup on the data
structure of basic data units organized based on basic data units, and... 使用所述第一组基本数据单元来无损简化I帧,以便(i)如果所述I帧是第一组基本数据单元中的基本数据单元的副本,则获得对所述基本数据单元的引用,或者(ii)如果所述I帧不是第一组基本数据单元中的任何基本数据单元的副本,则获得对第一组基本数据单元中的一个或多个基本数据单元的引用以及从所述一个或多个基本数据单元导出所述I帧的变换序列;以及The first set of basic data units is used to losslessly simplify an I-frame so that (i) if the I-frame is a copy of a basic data unit in the first set of basic data units, a reference to that basic data unit is obtained, or (ii) if the I-frame is not a copy of any basic data unit in the first set of basic data units, a reference to one or more basic data units in the first set of basic data units and a transform sequence from said one or more basic data units is obtained; and 其中所述简化的视频数据包括无损简化的I帧、所述无损简化的I帧引用的基本数据单元、预测帧和压缩的音频数据。The simplified video data mentioned therein includes lossless simplified I-frames, basic data units referenced by the lossless simplified I-frames, predicted frames, and compressed audio data. 22.如权利要求21所述的电子装置,其中使用所述第一组基本数据单元来无损简化I帧包括:22. 
The electronic device of claim 21, wherein using the first set of basic data units to losslessly simplify an I-frame comprises: 响应于确定(i)对所述第一组基本数据单元的引用的大小和(ii)重建程序的描述的大小的总和小于I帧的大小的阈值比例,生成I帧的第一无损简化表示,其中所述第一无损简化表示包括对所述第一组基本数据单元中的每个基本数据单元的引用以及所述重建程序的描述;以及In response to determining that the sum of (i) the size of the references to the first set of basic data units and (ii) the size of the description of the reconstruction procedure is less than a threshold proportion of the size of the I-frame, a first lossless simplified representation of the I-frame is generated, wherein the first lossless simplified representation includes a reference to each basic data unit in the first set of basic data units and a description of the reconstruction procedure; and 响应于确定(i)对所述第一组基本数据单元的引用的大小和(ii)所述重建程序的描述的大小的总和大于或等于I帧的大小的阈值比例,In response to determining that the sum of (i) the size of the reference to the first set of basic data units and (ii) the size of the description of the reconstruction procedure is greater than or equal to a threshold proportion of the size of the I-frame, 在所述数据结构中将I帧添加为新的基本数据单元,以及In the data structure, I-frames are added as new basic data units, and 生成I帧的第二无损简化表示,其中所述第二无损简化表示包括对所述新的基本数据单元的引用。A second lossless simplified representation of an I-frame is generated, wherein the second lossless simplified representation includes a reference to the new basic data unit. 23.如权利要求22所述的电子装置,其中所述重建程序的描述指定变换序列,所述变换序列在应用于所述第一组基本数据单元时产生I帧。23. The electronic device of claim 22, wherein the description of the reconstruction procedure specifies a transformation sequence that generates an I-frame when applied to the first set of basic data units. 24.如权利要求21所述的电子装置,其中所述方法还包括:24. 
The electronic device of claim 21, wherein the method further comprises: 对所述压缩的音频数据进行解压缩以获得一组音频分量;以及The compressed audio data is decompressed to obtain a set of audio components; and 对于所述一组音频分量中的每个音频分量执行以下操作:For each audio component in the set of audio components, perform the following operations: 通过使用该音频分量对基于基本数据单元的内容组织基本数据单元的数据结构执行第二内容关联查找来识别第二组基本数据单元,以及The second set of basic data units is identified by performing a second content association lookup on the data structure of the basic data units based on the content organization using this audio component, and 使用所述第二组基本数据单元来无损简化该音频分量;以及The second set of basic data units is used to losslessly simplify the audio component; and 其中所述简化的视频数据包括无损简化的I帧、所述无损简化的I帧引用的基本数据单元、预测帧、无损简化的音频分量和所述无损简化的音频分量引用的基本数据单元。The simplified video data includes lossless simplified I-frames, basic data units referenced by the lossless simplified I-frames, prediction frames, lossless simplified audio components, and basic data units referenced by the lossless simplified audio components. 25.如权利要求24所述的电子装置,其中使用所述第二组基本数据单元来无损简化该音频分量包括:25. 
The electronic device of claim 24, wherein using the second set of basic data units to losslessly simplify the audio component comprises: 响应于确定(i)对所述第二组基本数据单元的引用的大小和(ii)重建程序的描述的大小的总和小于所述音频分量的大小的阈值比例,生成所述音频分量的第一无损简化表示,其中所述第一无损简化表示包括对所述第二组基本数据单元中的每个基本数据单元的引用以及所述重建程序的描述;以及In response to determining that the sum of (i) the size of the references to the second set of basic data units and (ii) the size of the description of the reconstruction procedure is less than a threshold proportion of the size of the audio component, a first lossless simplified representation of the audio component is generated, wherein the first lossless simplified representation includes a reference to each basic data unit in the second set of basic data units and a description of the reconstruction procedure; and 响应于确定(i)对所述第二组基本数据单元的引用的大小和(ii)所述重建程序的描述的大小的总和大于或等于所述音频分量的大小的阈值比例,In response to determining that the sum of (i) the size of the reference to the second set of basic data units and (ii) the size of the description of the reconstruction procedure is greater than or equal to a threshold proportion of the size of the audio component, 将所述音频分量作为新的基本数据单元添加到所述数据结构中,以及The audio components are added as new basic data units to the data structure, and 生成所述音频分量的第二无损简化表示,其中所述第二无损简化表示包括对所述新的基本数据单元的引用。A second lossless simplified representation of the audio component is generated, wherein the second lossless simplified representation includes a reference to the new basic data unit. 26.如权利要求25所述的电子装置,其中所述重建程序的描述指定变换序列,所述变换序列在应用于所述第二组基本数据单元时,产生所述音频分量。26. The electronic device of claim 25, wherein the description of the reconstruction procedure specifies a transformation sequence that, when applied to the second set of basic data units, generates the audio component. 27.如权利要求21所述的电子装置,还包括:27. 
The electronic device of claim 21, further comprising: 响应于接收到取回视频数据的请求,In response to receiving a request to retrieve video data, 重新创建压缩的移动图片数据,这包括从所述无损简化的I帧和所述无损简化的I帧引用的基本数据单元重建所述I帧并且将所述I帧与所述预测帧组合,以及Recreating compressed moving image data includes reconstructing the I-frame from the lossless simplified I-frame and the basic data units referenced by the lossless simplified I-frame, and combining the I-frame with the predicted frame. 将压缩的移动图片数据与压缩的音频数据组合以获得所述视频数据。The compressed moving image data is combined with the compressed audio data to obtain the video data. 28.如权利要求21所述的电子装置,其中,从所述压缩的移动图片数据中提取I帧包括执行霍夫曼解码。28. The electronic device of claim 21, wherein extracting an I-frame from the compressed moving image data includes performing Huffman decoding. 29.如权利要求28所述的电子装置,还包括对无损简化的I帧和所述无损简化的I帧引用的基本数据单元执行霍夫曼编码以分别获得霍夫曼编码的无损简化的I帧和霍夫曼编码的基本数据单元,其中简化的视频数据包括霍夫曼编码的无损简化的I帧、霍夫曼编码的基本数据单元、所述预测帧和压缩的音频数据。29. The electronic device of claim 28, further comprising performing Huffman coding on the lossless simplified I-frame and the basic data unit referenced by the lossless simplified I-frame to obtain a Huffman-coded lossless simplified I-frame and a Huffman-coded basic data unit, respectively, wherein the simplified video data includes the Huffman-coded lossless simplified I-frame, the Huffman-coded basic data unit, the predicted frame, and compressed audio data. 30.如权利要求29所述的电子装置,还包括:30. 
The electronic device of claim 29, further comprising: 响应于接收到取回视频数据的请求,In response to receiving a request to retrieve video data, 通过(1)对霍夫曼编码的无损简化的I帧执行霍夫曼解码以获得无损简化的I帧,以及对霍夫曼编码的基本数据单元执行霍夫曼解码以获得基本数据单元,(2)从无损简化的I帧和所述基本数据单元重建I帧,以及(3)对I帧执行霍夫曼编码,并与所述预测帧组合,来重新创建压缩的移动图片数据,以及The compressed motion picture data is recreated by (1) performing Huffman decoding on a lossless simplified I-frame encoded with Huffman to obtain a lossless simplified I-frame, and performing Huffman decoding on a Huffman-coded basic data unit to obtain a basic data unit; (2) reconstructing an I-frame from the lossless simplified I-frame and the basic data unit; and (3) performing Huffman coding on the I-frame and combining it with the predicted frame. 将压缩的移动图片数据与压缩的音频数据组合以获得所述视频数据。The compressed moving image data is combined with the compressed audio data to obtain the video data.
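Read as a whole, the claims apply one mechanism to both I-frames and audio components: look up an element in a store of basic data units, and either encode it as references plus a reconstruction procedure or install it as a new basic data unit. The sketch below is a deliberately simplified, hypothetical model of that mechanism; the names (`content_lookup`, `losslessly_simplify`, `reconstitute`), the 8-byte reference size, and the byte-patch form of the transformations are illustrative assumptions, not the patented implementation.

```python
def content_lookup(element: bytes, sieve: dict):
    """Toy stand-in for the content-associative lookup: find an exact
    duplicate among the basic data units. A real sieve organizes units in a
    structure keyed on their content and can also return near matches."""
    for uid, unit in sieve.items():
        if unit == element:
            return [uid], []            # duplicate found: no transforms needed
    return [], [(0, element)]           # no match: one patch rewriting all bytes

def losslessly_simplify(element: bytes, sieve: dict, threshold: float = 0.5):
    """Threshold decision of claims 12/15/22/25: emit references plus a
    reconstruction description if that encoding is small enough, otherwise
    install the element itself as a new basic data unit."""
    ref_ids, transforms = content_lookup(element, sieve)
    desc_size = sum(len(patch) for _, patch in transforms)
    if 8 * len(ref_ids) + desc_size < threshold * len(element):  # assume 8-byte refs
        return ("refs", ref_ids, transforms)     # first simplified representation
    new_id = len(sieve)
    sieve[new_id] = bytes(element)               # element becomes a basic data unit
    return ("unit", new_id)                      # second simplified representation

def reconstitute(ref_ids, transforms, sieve: dict) -> bytes:
    """Reconstruction procedure of claims 13/16/23/26: concatenate the
    referenced basic data units, then apply the (offset, bytes) patches."""
    data = bytearray(b"".join(sieve[uid] for uid in ref_ids))
    for offset, patch in transforms:
        data[offset:offset + len(patch)] = patch
    return bytes(data)
```

With this model, the first occurrence of an I-frame is installed as a basic data unit, and a later identical I-frame reduces to a single reference, from which `reconstitute` recovers the original bytes losslessly.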
HK62020006057.7A 2017-04-28 2018-04-26 Method, computer-readable storage medium and electronic apparatus for reducing video data HK40016528B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US62/491,864 2017-04-28

Publications (2)

Publication Number Publication Date
HK40016528A HK40016528A (en) 2020-09-11
HK40016528B true HK40016528B (en) 2024-05-24


Similar Documents

Publication Publication Date Title
US11947494B2 (en) Organizing prime data elements using a tree data structure
JP6596102B2 (en) Lossless data loss by deriving data from basic data elements present in content-associative sheaves
TWI789392B (en) Lossless reduction of data by using a prime data sieve and performing multidimensional search and content-associative retrieval on data that has been losslessly reduced using a prime data sieve
JP7758686B2 (en) Exploiting Locality of Primary Data for Efficient Retrieval of Lossless Reduced Data Using Primary Data Sieves
CN107852173B (en) Method and apparatus for performing search and retrieval on losslessly reduced data
KR102705306B1 (en) Reduction of data and audio data stored on block processing storage systems
TWI816954B (en) Method and apparatus for reconstituting a sequence of losslessly-reduced data chunks, method and apparatus for determining metadata for prime data elements, and storage medium
HK40016528B (en) Method, computer-readable storage medium and electronic apparatus for reducing video data
HK1257688B (en) Reduction of audio data and data stored on a block processing storage system
HK40016528A (en) Method, computer-readable storage medium and electronic apparatus for reducing video data