CN106095971B - A kind of method and server for searching data flow cut-point based on server - Google Patents
- Publication number: CN106095971B
- Application number: CN201610439783.2A
- Authority: CN (China)
- Prior art keywords: data, point, predetermined condition, window, bytes
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An embodiment of the present invention provides a server-based method for finding data stream cut points. The method finds a cut point by judging whether at least part of the data in each of M windows satisfies a predetermined condition. When at least part of the data in one of the windows does not satisfy the predetermined condition, the search skips a length of N*U to obtain the next potential cut point, which improves the efficiency of finding data stream cut points.
Description
Technical field
The present invention relates to the field of information technology, and in particular to a server-based method for finding data stream cut points and to a server.
Background
The continuous growth of data volumes makes providing sufficient storage a serious challenge for the storage field. One way to address this challenge is to exploit the redundancy of the data to be stored and use deduplication technology to reduce the amount of data stored.
In the prior art, a deduplication algorithm based on content-defined chunking (Content Defined Chunk, CDC) first divides the data stream to be stored into many data chunks. Dividing the stream into chunks requires finding suitable cut points in the stream; the data between two adjacent cut points forms one chunk. A feature value is computed for each chunk and used to look up whether a chunk with the same feature value already exists; if one is found, the data is considered duplicate. Specifically, content-defined deduplication applies the sliding window technique to find chunk cut points based on the content of the file, that is, it determines cut points by computing the Rabin fingerprint of the data in the window. Assume cut points are searched from the left of the data stream to the right. At each position, the fingerprint of the data in the sliding window is computed, reduced modulo a given integer K, and compared with a given remainder R. If they are equal, the right end of the window is a cut point; otherwise the window slides one byte further to the right, and the computation and comparison are repeated until the end of the data stream is reached.
In content-defined deduplication, finding data stream cut points consumes a large amount of computing resources and therefore becomes the bottleneck for improving deduplication performance.
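The prior-art search described above can be sketched as follows. This is only an illustration under stated assumptions: it uses a simple polynomial rolling hash as a stand-in for a true Rabin fingerprint, and the function name, window size, base, prime, K and R are all hypothetical values chosen for the example, not values taken from the patent.

```python
from typing import List


def find_cut_points(data: bytes, window: int = 16, modulus_k: int = 64,
                    remainder_r: int = 0, base: int = 257,
                    prime: int = 1_000_000_007) -> List[int]:
    """Return every offset `end` where hash(data[end-window:end]) % K == R."""
    if len(data) < window:
        return []
    # Weight of the byte that leaves the window when it slides by one.
    top = pow(base, window - 1, prime)
    fp = 0
    for byte in data[:window]:
        fp = (fp * base + byte) % prime
    cuts = []
    end = window
    while True:
        if fp % modulus_k == remainder_r:
            cuts.append(end)          # right end of the window is a cut point
        if end == len(data):
            break
        # Slide one byte to the right: drop data[end - window], append data[end].
        fp = ((fp - data[end - window] * top) * base + data[end]) % prime
        end += 1
    return cuts
```

Every byte position triggers one fingerprint update and one comparison, which is the per-byte cost the invention aims to reduce by skipping.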
Summary of the invention
In a first aspect, an embodiment of the present invention provides a server-based method for finding data stream cut points. A rule is preset on the server: for a potential cut point $k$, determine $M$ points $p_x$, the window $W_x[p_x - A_x, p_x + B_x]$ corresponding to each point $p_x$, and the predetermined condition $C_x$ corresponding to each window $W_x[p_x - A_x, p_x + B_x]$, where $x$ takes the consecutive natural numbers from 1 to $M$, $M \ge 2$, and $A_x$ and $B_x$ are integers. The method includes:
a) According to the rule, determine for the current potential cut point $k_i$ a point $p_{iz}$ and the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ corresponding to the point $p_{iz}$, where $i$ and $z$ are integers and $1 \le z \le M$;
b) Judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$;
When at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ does not satisfy the predetermined condition $C_z$, jump $N$ minimum cut-point search units $U$ from the point $p_{iz}$ along the cut-point search direction, where $N \cdot U$ is not greater than $\lVert B_z \rVert + \max_x(\lVert A_x \rVert + \lVert k_i - p_{ix} \rVert)$, to obtain a new potential cut point, and perform step a);
c) When at least part of the data in every window $W_{ix}[p_{ix} - A_x, p_{ix} + B_x]$ of the $M$ windows of the current potential cut point $k_i$ satisfies the predetermined condition $C_x$, the current potential cut point $k_i$ is a data stream cut point.
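Steps a) to c) of the first aspect can be sketched as below. This is a minimal illustration, not the patented implementation: it assumes U is one byte, places each point at $p_x = k - \text{offset}_x$ (the `offset` field is a hypothetical way to encode the rule), treats each window as the half-open byte range $[p - A, p + B)$, and always jumps by the largest length the bound allows. The names `PointRule` and `next_cut_point` are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class PointRule:
    """One (p_x, W_x, C_x) entry of the preset rule; field names are hypothetical."""
    offset: int                    # assumed placement: p_x = k - offset
    a: int                         # A_x: window starts a bytes before p_x
    b: int                         # B_x: window ends b bytes after p_x
    cond: Callable[[bytes], bool]  # C_x, evaluated on the bytes of the window


def next_cut_point(data: bytes, start_k: int,
                   rules: List[PointRule]) -> Optional[int]:
    """Return the first accepted cut point at or after start_k, else None."""
    if any(r.a + r.b <= 0 for r in rules):
        raise ValueError("every window must be non-empty so each jump moves forward")
    # max_x(|A_x| + |k - p_x|) is independent of k under the offset assumption.
    reach = max(abs(r.a) + abs(r.offset) for r in rules)
    k = start_k
    while True:
        failed = None
        for r in rules:                    # step a): point p_iz and its window
            p = k - r.offset
            lo, hi = p - r.a, p + r.b      # half-open byte range [lo, hi)
            if lo < 0 or hi > len(data) or k > len(data):
                return None                # window or candidate lies outside the data
            if not r.cond(data[lo:hi]):    # step b): test condition C_z
                failed = (p, r)
                break
        if failed is None:
            return k                       # step c): all M windows passed
        p, r = failed
        k = p + abs(r.b) + reach           # jump N*U (U = 1) from p_iz, at the bound
```

For example, with three points at offsets 0, 1 and 2, windows of four bytes ending at each point, and a toy condition "the window contains no 0xFF byte", a 0xFF at index 5 makes the candidate at 6 fail and the search resumes directly at 12, where all three windows pass.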
With reference to the first aspect, in a first possible implementation, the rule further includes: at least two points $p_e$ and $p_f$ satisfy $A_e = A_f$, $B_e = B_f$, and $C_e = C_f$.
With reference to the first possible implementation of the first aspect, in a second possible implementation, the rule further includes: relative to the potential cut point $k$, the at least two points $p_e$ and $p_f$ lie in the direction opposite to the cut-point search direction.
With reference to the first or second possible implementation of the first aspect, in a third possible implementation, the rule further includes: the distance between the at least two points $p_e$ and $p_f$ is 1 $U$.
With reference to the first aspect, or any one of the first to third possible implementations of the first aspect, in a fourth possible implementation, judging whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$ specifically includes:
using a random function to judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$.
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation, using a random function to judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$ is specifically: using a hash function to judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$.
With reference to the first aspect, or any one of the first to fifth possible implementations of the first aspect, in a sixth possible implementation, when at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ does not satisfy the predetermined condition $C_z$, $N$ minimum cut-point search units $U$ are jumped from the point $p_{iz}$ along the cut-point search direction to obtain the new potential cut point. According to the rule, the left boundary of the window $W_{ic}[p_{ic} - A_c, p_{ic} + B_c]$ corresponding to the point $p_{ic}$ determined for the new potential cut point either coincides with the right boundary of the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ or lies within the range of the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$. Here, the point $p_{ic}$ is the point ranked first when the $M$ points determined for the new potential cut point according to the rule are ordered along the data stream search direction.
With reference to the fourth possible implementation of the first aspect, in a seventh possible implementation, using a random function to judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$ specifically includes:
Select $F$ bytes in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ and reuse these $F$ bytes $H$ times, obtaining $F \cdot H$ bytes in total. Each byte consists of 8 bits, denoted $a_{m,1}, \ldots, a_{m,8}$, which are the 1st to 8th bits of the $m$-th of the $F \cdot H$ bytes, so the bits of the $F \cdot H$ bytes can be written as the matrix
$$\begin{pmatrix} a_{1,1} & \cdots & a_{1,8} \\ \vdots & & \vdots \\ a_{F \cdot H,1} & \cdots & a_{F \cdot H,8} \end{pmatrix}.$$
When $a_{m,n} = 1$, $V_{am,n} = 1$; when $a_{m,n} = 0$, $V_{am,n} = -1$, where $a_{m,n}$ denotes any one of $a_{m,1}, \ldots, a_{m,8}$. Converting the bits of the $F \cdot H$ bytes in this way gives the matrix
$$V_a = \begin{pmatrix} V_{a1,1} & \cdots & V_{a1,8} \\ \vdots & & \vdots \\ V_{aF \cdot H,1} & \cdots & V_{aF \cdot H,8} \end{pmatrix}.$$
Select $F \cdot H \cdot 8$ random numbers from random numbers that follow a normal distribution to form the matrix
$$R = \begin{pmatrix} h_{1,1} & \cdots & h_{1,8} \\ \vdots & & \vdots \\ h_{F \cdot H,1} & \cdots & h_{F \cdot H,8} \end{pmatrix}.$$
Multiply the entries of the $m$-th row of $V_a$ by the corresponding random numbers of the $m$-th row of $R$ and sum the products to obtain one value: $S_{am} = V_{am,1} h_{m,1} + V_{am,2} h_{m,2} + \cdots + V_{am,8} h_{m,8}$. In the same way, obtain $S_{a1}, S_{a2}, \ldots, S_{aF \cdot H}$, and count the number $K$ of these values that are greater than 0. When $K$ is even, at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$.
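The parity test described in this implementation can be sketched as follows. Several details are assumptions made only for the example, because the text does not fix them: the $F$ selected bytes are taken to be the last $F$ bytes of the window, bit 1 is taken to be the most significant bit, and the matrix $R$ is drawn once from a seeded generator so that the same data always gives the same answer. The function names are hypothetical.

```python
import random
from typing import List


def make_matrix_r(f: int, h: int, seed: int = 0) -> List[List[float]]:
    """Build R: F*H rows of 8 normally distributed numbers (seed is an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(f * h)]


def window_satisfies(window: bytes, r: List[List[float]], f: int, h: int) -> bool:
    """Return True when the count K of positive row sums S_am is even."""
    if len(window) < f or len(r) < f * h:
        raise ValueError("window needs at least F bytes and R needs F*H rows")
    selected = window[-f:]        # assumption: use the last F bytes of the window
    repeated = selected * h       # reuse the F bytes H times -> F*H bytes
    k = 0
    for m, byte in enumerate(repeated):
        s = 0.0
        for n in range(8):
            bit = (byte >> (7 - n)) & 1   # assumption: bit 1 is the MSB
            v = 1 if bit else -1          # a = 1 -> V = 1, a = 0 -> V = -1
            s += v * r[m][n]              # accumulates V_am,n * h_m,n
        if s > 0:
            k += 1
    return k % 2 == 0
```

With a hand-made $R$ whose entries are all 1.0, a byte 0xFF gives a row sum of 8 and a byte 0x00 gives -8, so the two-byte selection 0xFF 0x00 yields K = 1 (condition not satisfied) while 0xFF 0xFF yields K = 2 (condition satisfied).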
In a second aspect, an embodiment of the present invention provides a server-based method for finding data stream cut points. A rule is preset on the server: for a potential cut point $k$, determine $M$ windows $W_x[k - A_x, k + B_x]$ and the predetermined condition $C_x$ corresponding to each window $W_x[k - A_x, k + B_x]$, where $x$ takes the consecutive natural numbers from 1 to $M$, $M \ge 2$, and $A_x$ and $B_x$ are integers.
The method includes:
a) According to the rule, determine for the current potential cut point $k_i$ the corresponding window $W_{iz}[k_i - A_z, k_i + B_z]$, where $i$ and $z$ are integers and $1 \le z \le M$;
b) Judge whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$;
When at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ does not satisfy the predetermined condition $C_z$, jump $N$ minimum cut-point search units $U$ from the current potential cut point $k_i$ along the cut-point search direction, where $N \cdot U$ is not greater than $\lVert B_z \rVert + \max_x(\lVert A_x \rVert)$, to obtain a new potential cut point, and perform step a);
c) When at least part of the data in every window $W_{ix}[k_i - A_x, k_i + B_x]$ of the $M$ windows of the current potential cut point $k_i$ satisfies the predetermined condition $C_x$, the current potential cut point $k_i$ is a data stream cut point.
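A compact sketch of the second aspect, where every window is positioned relative to the potential cut point itself, might look like this. It assumes U is one byte, represents each rule entry as a hypothetical tuple `(A_x, B_x, C_x)`, treats windows as half-open byte ranges $[k - A, k + B)$, and picks the jump $\max_x A_x + B_z$, which is one admissible choice because it never exceeds the bound $\lVert B_z \rVert + \max_x \lVert A_x \rVert$; the text itself only states the upper bound.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical encoding of one rule entry: (A_x, B_x, C_x).
WindowRule = Tuple[int, int, Callable[[bytes], bool]]


def next_cut_point_windows(data: bytes, start_k: int,
                           rules: Sequence[WindowRule]) -> Optional[int]:
    """Return the first accepted cut point at or after start_k, else None."""
    max_a = max(abs(a) for a, _, _ in rules)
    k = start_k
    while True:
        for a, b, cond in rules:            # step a): window W_iz of k_i
            lo, hi = k - a, k + b           # half-open byte range [lo, hi)
            if lo < 0 or hi > len(data) or k > len(data):
                return None                 # ran past the available data
            if not cond(data[lo:hi]):       # step b): condition C_z fails
                jump = max_a + b            # chosen N*U with U = 1
                if not 0 < jump <= abs(b) + max_a:
                    raise ValueError("jump must move forward and respect the bound")
                k += jump
                break
        else:
            return k                        # step c): every window passed
```

With the rule entries (4, 0), (5, -1) and (6, -2), the three windows have the same width and are shifted by one byte each. If the second window fails at candidate 6, the jump is 6 + (-1) = 5 and the search resumes at 11.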
With reference to the second aspect, in a first possible implementation, the rule further includes: at least two windows $W_{ie}[k_i - A_e, k_i + B_e]$ and $W_{if}[k_i - A_f, k_i + B_f]$ satisfy $|A_e + B_e| = |A_f + B_f|$ and $C_e = C_f$.
With reference to the first possible implementation of the second aspect, in a second possible implementation, the rule further includes: $A_e$ and $A_f$ are positive integers.
With reference to the first or second possible implementation of the second aspect, in a third possible implementation, the rule further includes: $A_e - 1 = A_f$ and $B_e + 1 = B_f$.
With reference to the second aspect, or any one of the first to third possible implementations of the second aspect, in a fourth possible implementation, judging whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$ specifically includes:
using a random function to judge whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation, using a random function to judge whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$ is specifically: using a hash function to judge whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$.
With reference to the second aspect, or any one of the first to fifth possible implementations of the second aspect, in a sixth possible implementation, when at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ does not satisfy the predetermined condition $C_z$, $N$ minimum cut-point search units $U$ are jumped from the current potential cut point $k_i$ along the cut-point search direction to obtain the new potential cut point. According to the rule, the left boundary of the window $W_{ic}[k_i - A_c, k_i + B_c]$ determined for the new potential cut point either coincides with the right boundary of the window $W_{iz}[k_i - A_z, k_i + B_z]$ or lies within the range of the window $W_{iz}[k_i - A_z, k_i + B_z]$. Here, the window $W_{ic}[k_i - A_c, k_i + B_c]$ is the window ranked first when the $M$ windows determined for the new potential cut point according to the rule are ordered along the data stream search direction.
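The relationship between this boundary condition and the jump bound of the second aspect can be checked with a short derivation. It is a sketch under assumptions: all $A_x$ are positive, the window ranked first along the search direction is the one with the largest $A_x$ (written $A_c$), and $k_j$ is a label introduced here for the new potential cut point. The numbers in the example are hypothetical.

```latex
% New potential cut point after jumping N units of length U:
k_j = k_i + N \cdot U
% Left boundary of the first window of k_j equals the right boundary of the failed window:
k_j - A_c = k_i + B_z
\;\Longrightarrow\;
N \cdot U = A_c + B_z \;\le\; \lVert B_z \rVert + \max_x \lVert A_x \rVert
% Hypothetical example: A = (4, 5, 6), B = (0, -1, -2), failed window z = 2
N \cdot U = 6 + (-1) = 5, \qquad \text{bound} = \lVert -1 \rVert + 6 = 7
```

A jump that makes the two boundaries coincide therefore always stays within the stated bound; a shorter jump places the left boundary of the new first window inside the failed window.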
With reference to the fourth possible implementation of the second aspect, in a seventh possible implementation, using a random function to judge whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$ specifically includes:
Select $F$ bytes in the window $W_{iz}[k_i - A_z, k_i + B_z]$ and reuse these $F$ bytes $H$ times, obtaining $F \cdot H$ bytes in total. Each byte consists of 8 bits, denoted $a_{m,1}, \ldots, a_{m,8}$, which are the 1st to 8th bits of the $m$-th of the $F \cdot H$ bytes, so the bits of the $F \cdot H$ bytes can be written as the matrix
$$\begin{pmatrix} a_{1,1} & \cdots & a_{1,8} \\ \vdots & & \vdots \\ a_{F \cdot H,1} & \cdots & a_{F \cdot H,8} \end{pmatrix}.$$
When $a_{m,n} = 1$, $V_{am,n} = 1$; when $a_{m,n} = 0$, $V_{am,n} = -1$, where $a_{m,n}$ denotes any one of $a_{m,1}, \ldots, a_{m,8}$. Converting the bits of the $F \cdot H$ bytes in this way gives the matrix
$$V_a = \begin{pmatrix} V_{a1,1} & \cdots & V_{a1,8} \\ \vdots & & \vdots \\ V_{aF \cdot H,1} & \cdots & V_{aF \cdot H,8} \end{pmatrix}.$$
Select $F \cdot H \cdot 8$ random numbers from random numbers that follow a normal distribution to form the matrix
$$R = \begin{pmatrix} h_{1,1} & \cdots & h_{1,8} \\ \vdots & & \vdots \\ h_{F \cdot H,1} & \cdots & h_{F \cdot H,8} \end{pmatrix}.$$
Multiply the entries of the $m$-th row of $V_a$ by the corresponding random numbers of the $m$-th row of $R$ and sum the products to obtain one value: $S_{am} = V_{am,1} h_{m,1} + V_{am,2} h_{m,2} + \cdots + V_{am,8} h_{m,8}$. In the same way, obtain $S_{a1}, S_{a2}, \ldots, S_{aF \cdot H}$, and count the number $K$ of these values that are greater than 0. When $K$ is even, at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$.
In a third aspect, an embodiment of the present invention provides a server for finding data stream cut points. The server includes a central processing unit and a main memory, and the central processing unit communicates with the main memory. A rule is preset on the server: for a potential cut point $k$, determine $M$ points $p_x$, the window $W_x[p_x - A_x, p_x + B_x]$ corresponding to each point $p_x$, and the predetermined condition $C_x$ corresponding to each window $W_x[p_x - A_x, p_x + B_x]$, where $x$ takes the consecutive natural numbers from 1 to $M$, $M \ge 2$, and $A_x$ and $B_x$ are integers.
The main memory is configured to store executable instructions, and the central processing unit executes the executable instructions to perform the following steps:
a) According to the rule, determine for the current potential cut point $k_i$ a point $p_{iz}$ and the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ corresponding to the point $p_{iz}$, where $i$ and $z$ are integers and $1 \le z \le M$;
b) Judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$;
When at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ does not satisfy the predetermined condition $C_z$, jump $N$ minimum cut-point search units $U$ from the point $p_{iz}$ along the cut-point search direction, where $N \cdot U$ is not greater than $\lVert B_z \rVert + \max_x(\lVert A_x \rVert + \lVert k_i - p_{ix} \rVert)$, to obtain a new potential cut point, and perform step a);
c) When at least part of the data in every window $W_{ix}[p_{ix} - A_x, p_{ix} + B_x]$ of the $M$ windows of the current potential cut point $k_i$ satisfies the predetermined condition $C_x$, the current potential cut point $k_i$ is a data stream cut point.
With reference to the third aspect, in a first possible implementation, the rule further includes: at least two points $p_e$ and $p_f$ satisfy $A_e = A_f$, $B_e = B_f$, and $C_e = C_f$.
With reference to the first possible implementation of the third aspect, in a second possible implementation, the rule further includes: relative to the potential cut point $k$, the at least two points $p_e$ and $p_f$ lie in the direction opposite to the cut-point search direction.
With reference to the first or second possible implementation of the third aspect, in a third possible implementation, the rule further includes: the distance between the at least two points $p_e$ and $p_f$ is 1 $U$.
With reference to the third aspect, or any one of the first to third possible implementations of the third aspect, in a fourth possible implementation, the central processing unit is specifically configured to use a random function to judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$.
With reference to the fourth possible implementation of the third aspect, in a fifth possible implementation, the central processing unit is specifically configured to use a hash function to judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$.
With reference to the third aspect, or any one of the first to fifth possible implementations of the third aspect, in a sixth possible implementation, when at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ does not satisfy the predetermined condition $C_z$, $N$ minimum cut-point search units $U$ are jumped from the point $p_{iz}$ along the cut-point search direction to obtain the new potential cut point. According to the rule, the left boundary of the window $W_{ic}[p_{ic} - A_c, p_{ic} + B_c]$ corresponding to the point $p_{ic}$ determined for the new potential cut point either coincides with the right boundary of the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ or lies within the range of the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$. Here, the point $p_{ic}$ is the point ranked first when the $M$ points determined for the new potential cut point according to the rule are ordered along the data stream search direction.
With reference to the fourth possible implementation of the third aspect, in a seventh possible implementation, the central processing unit using a random function to judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$ specifically includes:
Select $F$ bytes in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ and reuse these $F$ bytes $H$ times, obtaining $F \cdot H$ bytes in total. Each byte consists of 8 bits, denoted $a_{m,1}, \ldots, a_{m,8}$, which are the 1st to 8th bits of the $m$-th of the $F \cdot H$ bytes, so the bits of the $F \cdot H$ bytes can be written as the matrix
$$\begin{pmatrix} a_{1,1} & \cdots & a_{1,8} \\ \vdots & & \vdots \\ a_{F \cdot H,1} & \cdots & a_{F \cdot H,8} \end{pmatrix}.$$
When $a_{m,n} = 1$, $V_{am,n} = 1$; when $a_{m,n} = 0$, $V_{am,n} = -1$, where $a_{m,n}$ denotes any one of $a_{m,1}, \ldots, a_{m,8}$. Converting the bits of the $F \cdot H$ bytes in this way gives the matrix
$$V_a = \begin{pmatrix} V_{a1,1} & \cdots & V_{a1,8} \\ \vdots & & \vdots \\ V_{aF \cdot H,1} & \cdots & V_{aF \cdot H,8} \end{pmatrix}.$$
Select $F \cdot H \cdot 8$ random numbers from random numbers that follow a normal distribution to form the matrix
$$R = \begin{pmatrix} h_{1,1} & \cdots & h_{1,8} \\ \vdots & & \vdots \\ h_{F \cdot H,1} & \cdots & h_{F \cdot H,8} \end{pmatrix}.$$
Multiply the entries of the $m$-th row of $V_a$ by the corresponding random numbers of the $m$-th row of $R$ and sum the products to obtain one value: $S_{am} = V_{am,1} h_{m,1} + V_{am,2} h_{m,2} + \cdots + V_{am,8} h_{m,8}$. In the same way, obtain $S_{a1}, S_{a2}, \ldots, S_{aF \cdot H}$, and count the number $K$ of these values that are greater than 0. When $K$ is even, at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$.
In a fourth aspect, an embodiment of the present invention provides a server for finding data stream cut points. The server includes a central processing unit and a main memory, and the central processing unit communicates with the main memory. A rule is preset on the server: for a potential cut point $k$, determine $M$ windows $W_x[k - A_x, k + B_x]$ and the predetermined condition $C_x$ corresponding to each window $W_x[k - A_x, k + B_x]$, where $x$ takes the consecutive natural numbers from 1 to $M$, $M \ge 2$, and $A_x$ and $B_x$ are integers.
The main memory is configured to store executable instructions, and the central processing unit executes the executable instructions to perform the following steps:
a) According to the rule, determine for the current potential cut point $k_i$ the corresponding window $W_{iz}[k_i - A_z, k_i + B_z]$, where $i$ and $z$ are integers and $1 \le z \le M$;
b) Judge whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$;
When at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ does not satisfy the predetermined condition $C_z$, jump $N$ minimum cut-point search units $U$ from the current potential cut point $k_i$ along the cut-point search direction, where $N \cdot U$ is not greater than $\lVert B_z \rVert + \max_x(\lVert A_x \rVert)$, to obtain a new potential cut point, and perform step a);
c) When at least part of the data in every window $W_{ix}[k_i - A_x, k_i + B_x]$ of the $M$ windows of the current potential cut point $k_i$ satisfies the predetermined condition $C_x$, the current potential cut point $k_i$ is a data stream cut point.
With reference to the fourth aspect, in a first possible implementation, the rule further includes: at least two windows $W_{ie}[k_i - A_e, k_i + B_e]$ and $W_{if}[k_i - A_f, k_i + B_f]$ satisfy $|A_e + B_e| = |A_f + B_f|$ and $C_e = C_f$.
With reference to the first possible implementation of the fourth aspect, in a second possible implementation, the rule further includes: $A_e$ and $A_f$ are positive integers.
With reference to the first or second possible implementation of the fourth aspect, in a third possible implementation, the rule further includes: $A_e - 1 = A_f$ and $B_e + 1 = B_f$.
With reference to the fourth aspect, or any one of the first to third possible implementations of the fourth aspect, in a fourth possible implementation, the central processing unit is specifically configured to use a random function to judge whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$.
With reference to the fourth possible implementation of the fourth aspect, in a fifth possible implementation, the central processing unit is specifically configured to use a hash function to judge whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$.
With reference to the fourth aspect, or any one of the first to fifth possible implementations, in a sixth possible implementation, when at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ does not satisfy the predetermined condition $C_z$, N minimum search units U of the data stream segmentation point are skipped from the current potential segmentation point $k_i$ along the search direction of the data stream segmentation point to obtain the new potential segmentation point. According to the rule, the left boundary of the window $W_{ic}[k_i - A_c, k_i + B_c]$ determined for the new potential segmentation point coincides with the right boundary of the window $W_{iz}[k_i - A_z, k_i + B_z]$, or lies within the range of the window $W_{iz}[k_i - A_z, k_i + B_z]$. Here, the window $W_{ic}[k_i - A_c, k_i + B_c]$ is the window ranked first in the sequence obtained by ordering, along the data stream search direction, the M windows determined for the new potential segmentation point according to the rule.
With reference to the fourth possible implementation of the fourth aspect, in a seventh possible implementation, the central processing unit using a random function to judge whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$ specifically includes:
Select $F$ bytes in the window $W_{iz}[k_i - A_z, k_i + B_z]$ and reuse the $F$ bytes $H$ times to obtain $F*H$ bytes in total. Each byte consists of 8 bits, denoted $a_{m,1}, \ldots, a_{m,8}$, which are the 1st to 8th bits of the $m$-th byte of the $F*H$ bytes, so the bits of the $F*H$ bytes can be expressed as the $(F*H) \times 8$ matrix $(a_{m,n})$ with $1 \le m \le F*H$ and $1 \le n \le 8$. When $a_{m,n} = 1$, $V_{a_{m,n}} = 1$; when $a_{m,n} = 0$, $V_{a_{m,n}} = -1$, where $a_{m,n}$ denotes any one of $a_{m,1}, \ldots, a_{m,8}$. Applying this conversion between $a_{m,n}$ and $V_{a_{m,n}}$ to the bits of the $F*H$ bytes gives the matrix $V_a = (V_{a_{m,n}})$ of the same size. $F*H*8$ random numbers are selected from random numbers that follow a normal distribution to form the matrix $R = (h_{m,n})$, also of size $(F*H) \times 8$. The entries of the $m$-th row of $V_a$ are multiplied by the random numbers in the $m$-th row of $R$ and the products are summed to obtain one value, $S_{a_m} = V_{a_{m,1}} h_{m,1} + V_{a_{m,2}} h_{m,2} + \cdots + V_{a_{m,8}} h_{m,8}$. $S_{a_1}, S_{a_2}, \ldots, S_{a_{F*H}}$ are obtained in the same way, and the number $K$ of values among them that are greater than 0 is counted. When $K$ is even, at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$.
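The random-function test in this implementation can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the claimed implementation: the function names, the fixed seed used to build the matrix R, the most-significant-bit-first bit order, and the way the F bytes are picked from the window are hypothetical choices that the text leaves open.

```python
import random
from typing import List


def make_random_matrix(rows: int, seed: int = 0) -> List[List[float]]:
    """Build R: `rows` x 8 normally distributed numbers h[m][n].

    A fixed seed (an assumption) keeps R identical for every window tested.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(rows)]


def count_positive_sums(selected: bytes, h_repeats: int,
                        r: List[List[float]]) -> int:
    """Return K, the number of row sums S_am that are greater than 0."""
    reused = selected * h_repeats  # the F bytes reused H times: F*H bytes
    k = 0
    for m, byte in enumerate(reused):
        s = 0.0
        for n in range(8):
            bit = (byte >> (7 - n)) & 1   # a[m][n], MSB first (assumed order)
            v = 1 if bit == 1 else -1     # V: bit 1 -> +1, bit 0 -> -1
            s += v * r[m][n]
        if s > 0:
            k += 1
    return k


def meets_condition(selected: bytes, h_repeats: int,
                    r: List[List[float]]) -> bool:
    """The window is taken to satisfy C_z when K is even."""
    return count_positive_sums(selected, h_repeats, r) % 2 == 0
```

A caller would build R once with `make_random_matrix(F * H)` and pass the same matrix for every window, so that identical window contents always give the same verdict.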
In a fifth aspect, an embodiment of the present invention provides a server for searching for a data stream segmentation point. A rule is preset on the server, the rule being: for a potential segmentation point $k$, determine M points $p_x$, a window $W_x[p_x - A_x, p_x + B_x]$ corresponding to each point $p_x$, and a predetermined condition $C_x$ corresponding to the window $W_x[p_x - A_x, p_x + B_x]$, where $x$ takes the consecutive natural numbers from 1 to M, $M \ge 2$, and $A_x$ and $B_x$ are integers;
The server includes: a determining unit, configured to perform step a):
a) according to the rule, determine for the current potential segmentation point $k_i$ a point $p_{iz}$ and the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ corresponding to the point $p_{iz}$, where $i$ and $z$ are integers and $1 \le z \le M$;
and a judging processing unit, configured to judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$;
when at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ does not satisfy the predetermined condition $C_z$, N minimum search units U of the data stream segmentation point are skipped from the point $p_{iz}$ along the search direction of the data stream segmentation point, where $N*U$ is not greater than $\|B_z\| + \max_x(\|A_x\| + \|k_i - p_{ix}\|)$, to obtain a new potential segmentation point, and the determining unit performs step a) for the new potential segmentation point;
when at least part of the data in each window $W_{ix}[p_{ix} - A_x, p_{ix} + B_x]$ of the M windows of the current potential segmentation point $k_i$ satisfies the predetermined condition $C_x$, the current potential segmentation point $k_i$ is a data stream segmentation point.
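The steps of the fifth aspect can be sketched as a search loop in Python. This is only an illustration under stated assumptions: the tuple layout `(offset, A, B, condition)` with $p_x = k + \text{offset}$, the function name, the inclusive window bounds, and the choice of jump are hypothetical. The jump chosen here makes the leftmost window of the new point start where the failed window ends, which never exceeds the bound $\|B_z\| + \max_x(\|A_x\| + \|k_i - p_{ix}\|)$ stated above.

```python
from typing import Callable, List, Optional, Tuple

# One rule entry per point p_x: (offset, A, B, condition), with p_x = k + offset.
PointRule = Tuple[int, int, int, Callable[[bytes], bool]]


def find_segmentation_point(data: bytes, rules: List[PointRule],
                            start: int, unit: int = 1) -> Optional[int]:
    """Return the first accepted potential segmentation point k, or None."""
    # Largest distance from k back to the left edge of any window.
    reach = max(a - off for off, a, _, _ in rules)
    k = start
    while True:
        failed = None
        for off, a, b, cond in rules:
            p = k + off
            lo, hi = p - a, p + b          # window W[p - A, p + B], inclusive
            if lo < 0 or hi >= len(data):
                return None                # window falls outside the data
            if not cond(data[lo:hi + 1]):
                failed = (p, b)
                break
        if failed is None:
            return k                       # every window met its condition
        p, b = failed
        bound = abs(b) + max(abs(a) + abs(off) for off, a, _, _ in rules)
        n = min(b + reach, bound) // unit  # number of units U to skip
        new_k = p + n * unit               # the jump starts from the point p
        if new_k <= k:
            raise ValueError("rule does not advance the search")
        k = new_k
```

The caller is expected to pick `start` at least `reach` bytes into the data so that the first set of windows fits.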
With reference to the fifth aspect, in a first possible implementation, the rule further includes: at least two points $p_e$ and $p_f$ satisfy the conditions $A_e = A_f$, $B_e = B_f$, and $C_e = C_f$.
With reference to the first possible implementation of the fifth aspect, in a second possible implementation, the rule further includes: the at least two points $p_e$ and $p_f$ lie, relative to the potential segmentation point $k$, in the direction opposite to the search direction of the data stream segmentation point.
With reference to the first or second possible implementation of the fifth aspect, in a third possible implementation, the rule further includes: the distance between the at least two points $p_e$ and $p_f$ is 1 U.
With reference to the fifth aspect, or any one of the first to third possible implementations, in a fourth possible implementation, the judging processing unit specifically uses a random function to judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$.
With reference to the fourth possible implementation of the fifth aspect, in a fifth possible implementation, the judging processing unit is specifically configured to use a hash function to judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$.
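A hash-based test of a window could look like the following Python sketch. The text does not name a hash function or say what form the condition $C_z$ takes, so the use of CRC-32 and the "low bits are all zero" criterion are assumptions made purely for illustration.

```python
import zlib


def hash_condition(window: bytes, mask_bits: int = 2) -> bool:
    """Hypothetical C_z: the low `mask_bits` bits of CRC-32(window) are zero."""
    mask = (1 << mask_bits) - 1
    return (zlib.crc32(window) & mask) == 0
```

Under this assumed criterion, a larger `mask_bits` makes the condition harder to satisfy, which would make accepted segmentation points rarer.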
With reference to the fifth aspect, or any one of the first to fifth possible implementations, in a sixth possible implementation, the judging processing unit is configured to: when at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ does not satisfy the predetermined condition $C_z$, skip N minimum search units U of the data stream segmentation point from the point $p_{iz}$ along the search direction of the data stream segmentation point to obtain the new potential segmentation point, and the determining unit performs step a) for the new potential segmentation point. According to the rule, the left boundary of the window $W_{ic}[p_{ic} - A_c, p_{ic} + B_c]$ corresponding to the point $p_{ic}$ determined for the new potential segmentation point coincides with the right boundary of the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$, or lies within the range of the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$. Here, the point $p_{ic}$ corresponding to the window $W_{ic}[p_{ic} - A_c, p_{ic} + B_c]$ is the point ranked first in the sequence obtained by ordering, along the data stream search direction, the M points determined for the new potential segmentation point according to the rule.
With reference to the fourth possible implementation of the fifth aspect, in a seventh possible implementation, the judging processing unit being specifically configured to use a random function to judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$ specifically includes:
Select $F$ bytes in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ and reuse the $F$ bytes $H$ times to obtain $F*H$ bytes in total. Each byte consists of 8 bits, denoted $a_{m,1}, \ldots, a_{m,8}$, which are the 1st to 8th bits of the $m$-th byte of the $F*H$ bytes, so the bits of the $F*H$ bytes can be expressed as the $(F*H) \times 8$ matrix $(a_{m,n})$ with $1 \le m \le F*H$ and $1 \le n \le 8$. When $a_{m,n} = 1$, $V_{a_{m,n}} = 1$; when $a_{m,n} = 0$, $V_{a_{m,n}} = -1$, where $a_{m,n}$ denotes any one of $a_{m,1}, \ldots, a_{m,8}$. Applying this conversion between $a_{m,n}$ and $V_{a_{m,n}}$ to the bits of the $F*H$ bytes gives the matrix $V_a = (V_{a_{m,n}})$ of the same size. $F*H*8$ random numbers are selected from random numbers that follow a normal distribution to form the matrix $R = (h_{m,n})$, also of size $(F*H) \times 8$. The entries of the $m$-th row of $V_a$ are multiplied by the random numbers in the $m$-th row of $R$ and the products are summed to obtain one value, $S_{a_m} = V_{a_{m,1}} h_{m,1} + V_{a_{m,2}} h_{m,2} + \cdots + V_{a_{m,8}} h_{m,8}$. $S_{a_1}, S_{a_2}, \ldots, S_{a_{F*H}}$ are obtained in the same way, and the number $K$ of values among them that are greater than 0 is counted. When $K$ is even, at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$.
In a sixth aspect, an embodiment of the present invention provides a server for searching for a data stream segmentation point. A rule is preset on the server, the rule being: for a potential segmentation point $k$, determine M windows $W_x[k - A_x, k + B_x]$ and a predetermined condition $C_x$ corresponding to each window $W_x[k - A_x, k + B_x]$, where $x$ takes the consecutive natural numbers from 1 to M, $M \ge 2$, and $A_x$ and $B_x$ are integers;
The server includes: a determining unit, configured to perform step a):
a) according to the rule, determine for the current potential segmentation point $k_i$ the corresponding window $W_{iz}[k_i - A_z, k_i + B_z]$, where $i$ and $z$ are integers and $1 \le z \le M$;
and a judging processing unit, configured to judge whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$;
when at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ does not satisfy the predetermined condition $C_z$, N minimum search units U of the data stream segmentation point are skipped from the current potential segmentation point $k_i$ along the search direction of the data stream segmentation point, where $N*U$ is not greater than $\|B_z\| + \max_x(\|A_x\|)$, to obtain a new potential segmentation point, and step a) is performed for the new potential segmentation point;
c) when at least part of the data in each window $W_{ix}[k_i - A_x, k_i + B_x]$ of the M windows of the current potential segmentation point $k_i$ satisfies the predetermined condition $C_x$, the current potential segmentation point $k_i$ is a data stream segmentation point.
With reference to the sixth aspect, in a first possible implementation, the rule further includes: at least two windows $W_{ie}[k_i - A_e, k_i + B_e]$ and $W_{if}[k_i - A_f, k_i + B_f]$ satisfy the conditions $|A_e + B_e| = |A_f + B_f|$ and $C_e = C_f$.
With reference to the first possible implementation of the sixth aspect, in a second possible implementation, the rule further includes: $A_e$ and $A_f$ are positive integers.
With reference to the first or second possible implementation of the sixth aspect, in a third possible implementation, the rule further includes: $A_e - 1 = A_f$ and $B_e + 1 = B_f$.
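One way to generate a family of windows that satisfies the relations in the first to third implementations is sketched below in Python. The function name and the starting values `a1` and `b1` are hypothetical; the sketch only shows that shifting each successive window one unit against the search direction keeps $|A_x + B_x|$ constant while giving $A_e - 1 = A_f$ and $B_e + 1 = B_f$ for neighbouring windows.

```python
from typing import List, Tuple


def sliding_windows(a1: int, b1: int, m: int) -> List[Tuple[int, int]]:
    """Return (A_x, B_x) for x = 1..m, each window one unit left of the previous."""
    return [(a1 + x, b1 - x) for x in range(m)]
```

For instance, `sliding_windows(5, 0, 3)` yields the pairs `(5, 0)`, `(6, -1)`, `(7, -2)`, that is, the windows $[k-5, k]$, $[k-6, k-1]$ and $[k-7, k-2]$.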
With reference to the sixth aspect, or any one of the first to third possible implementations, in a fourth possible implementation, the judging processing unit is specifically configured to use a random function to judge whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$.
With reference to the fourth possible implementation of the sixth aspect, in a fifth possible implementation, the judging processing unit is specifically configured to use a hash function to judge whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$.
With reference to the sixth aspect, or any one of the first to fifth possible implementations, in a sixth possible implementation, the judging processing unit is configured to: when at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ does not satisfy the predetermined condition $C_z$, skip N minimum search units U of the data stream segmentation point from the current potential segmentation point $k_i$ along the search direction of the data stream segmentation point to obtain the new potential segmentation point, and the determining unit performs step a) for the new potential segmentation point. According to the rule, the left boundary of the window $W_{ic}[k_i - A_c, k_i + B_c]$ determined for the new potential segmentation point coincides with the right boundary of the window $W_{iz}[k_i - A_z, k_i + B_z]$, or lies within the range of the window $W_{iz}[k_i - A_z, k_i + B_z]$. Here, the window $W_{ic}[k_i - A_c, k_i + B_c]$ is the window ranked first in the sequence obtained by ordering, along the data stream search direction, the M windows determined for the new potential segmentation point according to the rule.
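The boundary relation in this implementation fixes how far to jump. The Python sketch below computes the jump for which the left boundary of the first window of the new point coincides with the right boundary of the failed window. It assumes, as an interpretation not spelled out in the text, that the "first" window in search order is the one with the leftmost boundary (the largest $A_x$), and that U is one byte; the function name and the example values are hypothetical.

```python
from typing import List, Tuple


def coinciding_jump(windows: List[Tuple[int, int]], z: int) -> int:
    """Jump length from k_i after the window at 0-based index z fails.

    `windows` holds the (A_x, B_x) pairs of the rule.
    """
    a_first = max(a for a, _ in windows)  # window with the leftmost boundary
    b_z = windows[z][1]
    jump = a_first + b_z                  # (k_i + jump) - a_first == k_i + b_z
    bound = abs(b_z) + max(abs(a) for a, _ in windows)
    if not 0 < jump <= bound:
        raise ValueError("no valid jump for this rule")
    return jump
```

With the windows `[(5, 0), (6, -1), (7, -2)]` and a failure in the second window, the jump is 6: the failed window ends at $k_i - 1$, and the leftmost window of the new point $k_i + 6$ starts at $k_i + 6 - 7 = k_i - 1$.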
With reference to the fourth possible implementation of the sixth aspect, in a seventh possible implementation, the judging processing unit using a random function to judge whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$ specifically includes:
Select $F$ bytes in the window $W_{iz}[k_i - A_z, k_i + B_z]$ and reuse the $F$ bytes $H$ times to obtain $F*H$ bytes in total. Each byte consists of 8 bits, denoted $a_{m,1}, \ldots, a_{m,8}$, which are the 1st to 8th bits of the $m$-th byte of the $F*H$ bytes, so the bits of the $F*H$ bytes can be expressed as the $(F*H) \times 8$ matrix $(a_{m,n})$ with $1 \le m \le F*H$ and $1 \le n \le 8$. When $a_{m,n} = 1$, $V_{a_{m,n}} = 1$; when $a_{m,n} = 0$, $V_{a_{m,n}} = -1$, where $a_{m,n}$ denotes any one of $a_{m,1}, \ldots, a_{m,8}$. Applying this conversion between $a_{m,n}$ and $V_{a_{m,n}}$ to the bits of the $F*H$ bytes gives the matrix $V_a = (V_{a_{m,n}})$ of the same size. $F*H*8$ random numbers are selected from random numbers that follow a normal distribution to form the matrix $R = (h_{m,n})$, also of size $(F*H) \times 8$. The entries of the $m$-th row of $V_a$ are multiplied by the random numbers in the $m$-th row of $R$ and the products are summed to obtain one value, $S_{a_m} = V_{a_{m,1}} h_{m,1} + V_{a_{m,2}} h_{m,2} + \cdots + V_{a_{m,8}} h_{m,8}$. $S_{a_1}, S_{a_2}, \ldots, S_{a_{F*H}}$ are obtained in the same way, and the number $K$ of values among them that are greater than 0 is counted. When $K$ is even, at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$.
In a seventh aspect, an embodiment of the present invention provides a computer-readable storage medium configured to store executable instructions, which a server executes to search for a data stream segmentation point. A rule is preset on the server, the rule being: for a potential segmentation point $k$, determine M points $p_x$, a window $W_x[p_x - A_x, p_x + B_x]$ corresponding to each point $p_x$, and a predetermined condition $C_x$ corresponding to the window $W_x[p_x - A_x, p_x + B_x]$, where $x$ takes the consecutive natural numbers from 1 to M, $M \ge 2$, and $A_x$ and $B_x$ are integers;
The server executes the executable instructions to perform the following steps:
a) according to the rule, determine for the current potential segmentation point $k_i$ a point $p_{iz}$ and the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ corresponding to the point $p_{iz}$, where $i$ and $z$ are integers and $1 \le z \le M$;
b) judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$;
when at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ does not satisfy the predetermined condition $C_z$, skip N minimum search units U of the data stream segmentation point from the point $p_{iz}$ along the search direction of the data stream segmentation point, where $N*U$ is not greater than $\|B_z\| + \max_x(\|A_x\| + \|k_i - p_{ix}\|)$, to obtain a new potential segmentation point, and perform step a);
c) when at least part of the data in each window $W_{ix}[p_{ix} - A_x, p_{ix} + B_x]$ of the M windows of the current potential segmentation point $k_i$ satisfies the predetermined condition $C_x$, the current potential segmentation point $k_i$ is a data stream segmentation point.
With reference to the seventh aspect, in a first possible implementation, the rule further includes: at least two points $p_e$ and $p_f$ satisfy the conditions $A_e = A_f$, $B_e = B_f$, and $C_e = C_f$.
With reference to the first possible implementation of the seventh aspect, in a second possible implementation, the rule further includes: the at least two points $p_e$ and $p_f$ lie, relative to the potential segmentation point $k$, in the direction opposite to the search direction of the data stream segmentation point.
With reference to the first or second possible implementation of the seventh aspect, in a third possible implementation, the rule further includes: the distance between the at least two points $p_e$ and $p_f$ is 1 U.
With reference to the seventh aspect, or any one of the first to third possible implementations of the seventh aspect, in a fourth possible implementation, the server judging whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$ specifically includes:
the server uses a random function to judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$.
With reference to the fourth possible implementation of the seventh aspect, in a fifth possible implementation, the server using a random function to judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$ specifically includes:
the server uses a hash function to judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$.
With reference to the seventh aspect, or any one of the first to fifth possible implementations of the seventh aspect, in a sixth possible implementation, when at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ does not satisfy the predetermined condition $C_z$, N minimum search units U of the data stream segmentation point are skipped from the point $p_{iz}$ along the search direction of the data stream segmentation point to obtain the new potential segmentation point. According to the rule, the left boundary of the window $W_{ic}[p_{ic} - A_c, p_{ic} + B_c]$ corresponding to the point $p_{ic}$ determined for the new potential segmentation point coincides with the right boundary of the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$, or lies within the range of the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$. Here, the point $p_{ic}$ determined for the new potential segmentation point is the point ranked first in the sequence obtained by ordering, along the data stream search direction, the M points determined for the new potential segmentation point according to the rule.
With reference to the fourth possible implementation of the seventh aspect, in a seventh possible implementation, using a random function to judge whether at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$ specifically includes:
Select $F$ bytes in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ and reuse the $F$ bytes $H$ times to obtain $F*H$ bytes in total. Each byte consists of 8 bits, denoted $a_{m,1}, \ldots, a_{m,8}$, which are the 1st to 8th bits of the $m$-th byte of the $F*H$ bytes, so the bits of the $F*H$ bytes can be expressed as the $(F*H) \times 8$ matrix $(a_{m,n})$ with $1 \le m \le F*H$ and $1 \le n \le 8$. When $a_{m,n} = 1$, $V_{a_{m,n}} = 1$; when $a_{m,n} = 0$, $V_{a_{m,n}} = -1$, where $a_{m,n}$ denotes any one of $a_{m,1}, \ldots, a_{m,8}$. Applying this conversion between $a_{m,n}$ and $V_{a_{m,n}}$ to the bits of the $F*H$ bytes gives the matrix $V_a = (V_{a_{m,n}})$ of the same size. $F*H*8$ random numbers are selected from random numbers that follow a normal distribution to form the matrix $R = (h_{m,n})$, also of size $(F*H) \times 8$. The entries of the $m$-th row of $V_a$ are multiplied by the random numbers in the $m$-th row of $R$ and the products are summed to obtain one value, $S_{a_m} = V_{a_{m,1}} h_{m,1} + V_{a_{m,2}} h_{m,2} + \cdots + V_{a_{m,8}} h_{m,8}$. $S_{a_1}, S_{a_2}, \ldots, S_{a_{F*H}}$ are obtained in the same way, and the number $K$ of values among them that are greater than 0 is counted. When $K$ is even, at least part of the data in the window $W_{iz}[p_{iz} - A_z, p_{iz} + B_z]$ satisfies the predetermined condition $C_z$.
In an eighth aspect, an embodiment of the present invention provides a computer-readable storage medium configured to store executable instructions, which a server executes to search for a data stream segmentation point. A rule is preset on the server, the rule being: for a potential segmentation point $k$, determine M windows $W_x[k - A_x, k + B_x]$ and a predetermined condition $C_x$ corresponding to each window $W_x[k - A_x, k + B_x]$, where $x$ takes the consecutive natural numbers from 1 to M, $M \ge 2$, and $A_x$ and $B_x$ are integers. The server executes the executable instructions to perform the following steps:
a) according to the rule, determine for the current potential segmentation point $k_i$ the corresponding window $W_{iz}[k_i - A_z, k_i + B_z]$, where $i$ and $z$ are integers and $1 \le z \le M$;
b) judge whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$;
when at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ does not satisfy the predetermined condition $C_z$, skip N minimum search units U of the data stream segmentation point from the current potential segmentation point $k_i$ along the search direction of the data stream segmentation point, where $N*U$ is not greater than $\|B_z\| + \max_x(\|A_x\|)$, to obtain a new potential segmentation point, and perform step a);
c) when at least part of the data in each window $W_{ix}[k_i - A_x, k_i + B_x]$ of the M windows of the current potential segmentation point $k_i$ satisfies the predetermined condition $C_x$, the current potential segmentation point $k_i$ is a data stream segmentation point.
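Steps a) to c) of the eighth aspect, where every window is positioned relative to $k$ itself, can be sketched in Python as below. The names, the `(A, B, condition)` tuple layout, the inclusive window bounds, and the caller-supplied jump count `n` are assumptions for illustration only. The sketch checks the stated constraint that $N*U$ must not exceed $\|B_z\| + \max_x(\|A_x\|)$ for the window that failed.

```python
from typing import Callable, List, Optional, Tuple

WindowRule = Tuple[int, int, Callable[[bytes], bool]]  # (A_x, B_x, C_x)


def find_segmentation_point(data: bytes, windows: List[WindowRule],
                            start: int, n: int, unit: int = 1) -> Optional[int]:
    """Return the first accepted potential segmentation point k, or None."""
    max_a = max(abs(a) for a, _, _ in windows)
    k = start
    while True:
        failed_b = None
        for a, b, cond in windows:          # step a): window W[k - A, k + B]
            lo, hi = k - a, k + b
            if lo < 0 or hi >= len(data):
                return None                 # window falls outside the data
            if not cond(data[lo:hi + 1]):   # step b): test the condition
                failed_b = b
                break
        if failed_b is None:
            return k                        # step c): all M windows passed
        bound = abs(failed_b) + max_a
        if n < 1 or n * unit > bound:
            raise ValueError("N*U exceeds the permitted bound")
        k += n * unit                       # jump N units from k, then repeat
```

In the toy rule used to exercise this sketch (two equal-width windows offset by one byte, sharing one condition), choosing `n = 2` skips only the candidate whose second window is exactly the window that just failed.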
With reference to the eighth aspect, in a first possible implementation, the rule further includes: at least two windows $W_{ie}[k_i - A_e, k_i + B_e]$ and $W_{if}[k_i - A_f, k_i + B_f]$ satisfy the conditions $|A_e + B_e| = |A_f + B_f|$ and $C_e = C_f$.
With reference to the first possible implementation of the eighth aspect, in a second possible implementation, the rule further includes: $A_e$ and $A_f$ are positive integers.
With reference to the first or second possible implementation of the eighth aspect, in a third possible implementation, the rule further includes: $A_e - 1 = A_f$ and $B_e + 1 = B_f$.
With reference to the eighth aspect, or any one of the first to third possible implementations of the eighth aspect, in a fourth possible implementation, the server judging whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$ specifically includes:
using a random function to judge whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$.
With reference to the fourth possible implementation of the eighth aspect, in a fifth possible implementation, the server using a random function to judge whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$ is specifically: the server uses a hash function to judge whether at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ satisfies the predetermined condition $C_z$.
With reference to the eighth aspect, or any one of the first to fifth possible implementations of the eighth aspect, in a sixth possible implementation, when at least part of the data in the window $W_{iz}[k_i - A_z, k_i + B_z]$ does not satisfy the predetermined condition $C_z$, N minimum search units U of the data stream segmentation point are skipped from the current potential segmentation point $k_i$ along the search direction of the data stream segmentation point to obtain the new potential segmentation point. According to the rule, the left boundary of the window $W_{ic}[k_i - A_c, k_i + B_c]$ determined for the new potential segmentation point coincides with the right boundary of the window $W_{iz}[k_i - A_z, k_i + B_z]$, or lies within the range of the window $W_{iz}[k_i - A_z, k_i + B_z]$. Here, the window $W_{ic}[k_i - A_c, k_i + B_c]$ is the window ranked first in the sequence obtained by ordering, along the data stream search direction, the M windows determined for the new potential segmentation point according to the rule.
With reference to the fourth possible implementation of the eighth aspect, in a seventh possible implementation, using a random function to determine whether at least part of the data in the window $W_{iz}[k_i-A_z, k_i+B_z]$ satisfies the predetermined condition $C_z$ specifically includes:
Select F bytes in the window $W_{iz}[k_i-A_z, k_i+B_z]$ and reuse the F bytes H times to obtain F*H bytes in total. Each byte consists of 8 bits, denoted $a_{m,1}, \dots, a_{m,8}$, which are bits 1 to 8 of the m-th of the F*H bytes. The bits corresponding to the F*H bytes can be expressed as:

$$\begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,8} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,8} \\ \vdots & \vdots & & \vdots \\ a_{F*H,1} & a_{F*H,2} & \cdots & a_{F*H,8} \end{pmatrix}$$

When $a_{m,n}=1$, $V_{a_{m,n}}=1$; when $a_{m,n}=0$, $V_{a_{m,n}}=-1$, where $a_{m,n}$ denotes any one of $a_{m,1}, \dots, a_{m,8}$. Applying this conversion from $a_{m,n}$ to $V_{a_{m,n}}$ to the bits of the F*H bytes gives the matrix $V_a$:

$$V_a = \begin{pmatrix} V_{a_{1,1}} & V_{a_{1,2}} & \cdots & V_{a_{1,8}} \\ V_{a_{2,1}} & V_{a_{2,2}} & \cdots & V_{a_{2,8}} \\ \vdots & \vdots & & \vdots \\ V_{a_{F*H,1}} & V_{a_{F*H,2}} & \cdots & V_{a_{F*H,8}} \end{pmatrix}$$

Select F*H*8 random numbers from random numbers that follow a normal distribution to form the matrix R:

$$R = \begin{pmatrix} h_{1,1} & h_{1,2} & \cdots & h_{1,8} \\ h_{2,1} & h_{2,2} & \cdots & h_{2,8} \\ \vdots & \vdots & & \vdots \\ h_{F*H,1} & h_{F*H,2} & \cdots & h_{F*H,8} \end{pmatrix}$$

Multiply the elements of the m-th row of $V_a$ by the corresponding random numbers in the m-th row of R and sum the products to obtain one value, $S_{a_m} = V_{a_{m,1}} h_{m,1} + V_{a_{m,2}} h_{m,2} + \dots + V_{a_{m,8}} h_{m,8}$. In the same way, obtain $S_{a_1}, S_{a_2}, \dots, S_{a_{F*H}}$, and count the number K of these values that are greater than 0. When K is even, at least part of the data in the window $W_{iz}[k_i-A_z, k_i+B_z]$ satisfies the predetermined condition $C_z$.
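The random-function test described above can be sketched in Python. This is only an illustration under stated assumptions: the text does not say which F bytes are selected, how the bits are numbered, or how the matrix R is generated and kept, so the choices below (the last F bytes of the window, bit 1 taken as the most significant bit, R built once from a seeded generator) and the names `make_matrix` and `window_satisfies` are hypothetical.

```python
import random


def make_matrix(F, H, seed=0):
    """Build an (F*H) x 8 matrix R of normally distributed numbers h[m][n].

    Assumption: R is generated once from a fixed seed and then reused.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(F * H)]


def window_satisfies(window, F, H, R):
    """Return True when the number K of positive row sums S_am is even."""
    if len(window) < F:
        raise ValueError("window must contain at least F bytes")
    # Assumption: select the last F bytes, then reuse them H times (F*H bytes).
    repeated = bytes(window[-F:]) * H
    K = 0
    for m, byte in enumerate(repeated):
        s = 0.0
        for n in range(8):
            bit = (byte >> (7 - n)) & 1   # assumption: bit 1 is the MSB
            v = 1.0 if bit else -1.0      # a = 1 -> V = 1, a = 0 -> V = -1
            s += v * R[m][n]              # V_am,n * h_m,n
        if s > 0:
            K += 1
    return K % 2 == 0
```

In this sketch the same R is passed to every call, so identical window contents always give the same result, which a deduplication scheme needs in order to cut identical data at identical places.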
In the embodiments of the present invention, a data stream split point is found by determining whether at least part of the data in one of the M windows satisfies a predetermined condition. When at least part of the data in a window does not satisfy the predetermined condition, a length of N*U is skipped to obtain the next potential split point, which improves the efficiency of finding data stream split points.
Description of drawings
FIG. 1 is a schematic diagram of an application scenario of an embodiment of the present invention;
FIG. 2 is a schematic diagram of data stream split points;
FIG. 3 is a schematic diagram of searching for a data stream split point;
FIG. 4 is a schematic diagram of a method of an embodiment of the present invention;
FIG. 5 and FIG. 6 are schematic diagrams of an implementation of searching for a data stream split point;
FIG. 7 and FIG. 8 are schematic diagrams of an implementation of searching for a data stream split point;
FIG. 9 and FIG. 10 are schematic diagrams of an implementation of searching for a data stream split point;
FIG. 11, FIG. 12 and FIG. 13 are schematic diagrams of an implementation of searching for a data stream split point;
FIG. 14 and FIG. 15 are schematic diagrams of an implementation of searching for a data stream split point;
FIG. 16 and FIG. 17 are schematic diagrams of determining whether at least part of the data in a window satisfies a predetermined condition;
FIG. 18 is a structural diagram of a deduplication server;
FIG. 19 is a structural diagram of a deduplication server;
FIG. 20 is a schematic diagram of a method of an embodiment of the present invention;
FIG. 21 and FIG. 22 are schematic diagrams of an implementation of searching for a data stream split point;
FIG. 23 and FIG. 24 are schematic diagrams of an implementation of searching for a data stream split point;
FIG. 25 and FIG. 26 are schematic diagrams of an implementation of searching for a data stream split point;
FIG. 27, FIG. 28 and FIG. 29 are schematic diagrams of an implementation of searching for a data stream split point;
FIG. 30 and FIG. 31 are schematic diagrams of an implementation of searching for a data stream split point;
FIG. 32 and FIG. 33 are schematic diagrams of determining whether at least part of the data in a window satisfies a predetermined condition;
FIG. 34 and FIG. 35 are schematic diagrams of the probability of determining whether a predetermined condition is satisfied.
Specific embodiments
With the continuous advancement of storage technology, the amount of data generated keeps growing, and this large volume of data places ever higher demands on storage capacity. Increasing storage capacity also increases the procurement cost of IT equipment. To ease the tension between data volume and storage capacity and to reduce IT equipment procurement costs, data deduplication technology has been introduced in the data storage field.
One usage scenario of the embodiments of the present invention is data backup. Data backup is the process of backing up data to other storage media through a backup server in order to prevent data loss from various causes. FIG. 1 shows the architecture of a data backup system. The data backup system includes clients (101a, 101b…101n), a backup server 102, a data deduplication server 103 (deduplication server for short), and storage devices (104a, 104b…104n). The clients (101a, 101b…101n) may be application servers, workstations, and the like; the backup server 102 backs up the data generated by the clients; the deduplication server 103 performs the deduplication task on the backup data; and the storage devices (104a, 104b…104n), which store the deduplicated data, may be storage media such as disk arrays or tape libraries. The clients (101a, 101b…101n), the backup server 102, the deduplication server 103, and the storage devices (104a, 104b…104n) may be connected through switches, local area networks, the Internet, optical fiber, and so on, and may be located at the same site or at different sites. The backup server 102, the deduplication server 103, and the storage devices (104a, 104b…104n) may be independent physical devices; alternatively, in a specific implementation, they may be physically integrated into one device, or the backup server 102 may be integrated with the deduplication server 103, or the deduplication server 103 may be integrated with the storage devices (104a, 104b…104n), and so on.
The deduplication server 103 performs a deduplication operation on the data stream of the backup data, which generally includes the following steps:
1) Data stream split point search: find data stream split points in the data stream according to a specific algorithm;
2) Divide the data stream into data blocks at the split points found;
3) Compute the characteristic value of each data block: the characteristic value is computed as the feature that identifies the data block, and is added to the data block feature list of the file corresponding to the data stream; the characteristic value is generally computed with the SHA-1 or MD5 algorithm;
4) Identical data block detection: compare the computed characteristic value of the data block with the characteristic values already present in the data block feature list to determine whether an identical data block exists;
5) Delete duplicate data blocks: if identical data block detection finds that the feature list already contains the same characteristic value as the data block, the data block does not need to be stored again, or whether to store it is decided according to the number of duplicate copies specified by the backup policy.
As the steps by which the deduplication server 103 deduplicates the data stream of the backup data show, data stream split point search is a key step of the deduplication operation and directly determines deduplication performance.
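Steps 2) to 5) can be sketched in Python once the split points are known. This is a minimal illustration, assuming split points are given as byte offsets and that a plain dictionary stands in for the data block feature list and the storage device; the function name `deduplicate` and its parameters are hypothetical, and SHA-1 is used only because the text names it as one common choice.

```python
import hashlib


def deduplicate(stream, split_points, store):
    """Split `stream` at `split_points`, fingerprint each block with SHA-1,
    and keep only blocks whose fingerprint is not yet in `store`.

    Returns the list of fingerprints that describes the stream.
    """
    fingerprints = []
    start = 0
    for end in list(split_points) + [len(stream)]:
        if end <= start:                 # ignore empty or out-of-order pieces
            continue
        block = stream[start:end]                 # step 2: cut the block
        fp = hashlib.sha1(block).hexdigest()      # step 3: characteristic value
        fingerprints.append(fp)
        if fp not in store:                       # step 4: identical block check
            store[fp] = block                     # step 5: store unique blocks only
        start = end
    return fingerprints
```

With `store = {}`, calling `deduplicate(b"abcabcxyz", [3, 6], store)` records three fingerprints for the stream but stores only two blocks, because the first two blocks are identical.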
In the embodiments of the present invention, the deduplication server 103 receives a backup file sent by the backup server 102 and performs deduplication on it. The backup file to be processed is usually presented in the deduplication server 103 as a data stream. When searching for split points in the data stream, the deduplication server 103 usually needs to fix a minimum search unit for data stream split points. As shown in FIG. 2, a potential split point $k_1$ lies between the two consecutive minimum search units numbered 1 and 2. A potential split point is a point that must be examined to decide whether it can serve as a data stream split point. Suppose point $k_1$ is a data stream split point and the search direction is as indicated by the arrow in FIG. 2; the next potential split point found is $k_7$, which lies between the two consecutive minimum search units numbered 7 and 8. If the potential split point $k_7$ is a data stream split point, the data between the two adjacent split points $k_1$ and $k_7$ forms one data block. The minimum search unit can be chosen according to actual needs; here 1 byte is used as an example, that is, the minimum search units numbered 1, 2, 7 and 8 are each 1 byte in size. The search direction shown in FIG. 2 usually runs from the file header to the file tail, or from the file tail to the file header; this embodiment uses the header-to-tail direction as an example.
In deduplication, smaller data blocks generally give a higher deduplication rate because duplicate blocks are easier to find, but they also generate more metadata. Once data blocks are smaller than a certain size, the deduplication rate stops increasing while the amount of metadata grows sharply. The data block size must therefore be controlled. In practice a minimum data block size is usually set, for example 4 KB (4096 bytes); with the deduplication rate in mind, a maximum data block size is also set, for example 12 KB (12288 bytes), which a data block may not exceed. One specific implementation is shown in FIG. 3. The deduplication server 103 searches for data stream split points along the direction of the arrow; $k_a$ is the split point most recently found, and the next potential split point is sought from $k_a$ along the search direction. To satisfy the minimum data block requirement, the search usually skips the minimum data block size from the split point along the search direction and starts from the end of the minimum data block, that is, the end position of the minimum data block is taken as the next potential split point $k_i$. In the embodiments of the present invention, the minimum data block of 4 KB, that is, 4*1024 = 4096 bytes, is first skipped from $k_a$ along the search direction, and the point $k_i$ at the end of the 4096th byte is taken as a potential split point; for example, $k_i$ lies between the two consecutive minimum search units numbered 4096 and 4097. Still taking FIG. 3 as an example, with $k_a$ the split point most recently found and the search proceeding in the direction shown, if no split point has been found by the time the maximum data block size is reached, the point $k_z$ at which the maximum data block size is reached from $k_a$ along the search direction is taken as the next data stream split point, and a forced split is performed.
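The minimum and maximum block size handling described above can be sketched as follows. This is an illustration only: the predicate `is_split_point` is a hypothetical stand-in for whatever test decides that a potential split point is a data stream split point, the function name `bounded_split_points` is invented for this sketch, and the loop checks every byte position in turn, so it does not include the skipping that the embodiments go on to describe.

```python
MIN_BLOCK = 4 * 1024    # minimum data block size from the text (4096 bytes)
MAX_BLOCK = 12 * 1024   # maximum data block size from the text (12288 bytes)


def bounded_split_points(length, is_split_point,
                         min_block=MIN_BLOCK, max_block=MAX_BLOCK):
    """Return split point offsets for a stream of `length` bytes.

    After each split point k_a the search skips `min_block` bytes, then
    advances one byte (U = 1) at a time. If the maximum block size is
    reached first, a forced split is made at k_z = k_a + max_block.
    """
    points = []
    k_a = 0
    while length - k_a > min_block:
        k = k_a + min_block                    # first potential split point k_i
        limit = min(k_a + max_block, length)
        while k < limit and not is_split_point(k):
            k += 1
        if k >= length:                        # ran off the end of the stream
            break
        points.append(k)                       # found, or forced at k_z
        k_a = k
    return points
```

For a 30000-byte stream in which the predicate never succeeds, this yields forced splits at offsets 12288 and 24576.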
An embodiment of the present invention provides a method for finding data stream split points based on a deduplication server. As shown in FIG. 4, the method includes:
A rule is preset on the deduplication server 103. The rule is: for a potential split point $k$, determine M points $p_x$, the window $W_x[p_x-A_x, p_x+B_x]$ corresponding to each point $p_x$, and the predetermined condition $C_x$ corresponding to the window $W_x[p_x-A_x, p_x+B_x]$, where x takes the consecutive natural numbers 1 to M, $M \ge 2$, and $A_x$ and $B_x$ are integers. The distance between $p_x$ and the potential split point $k$ is $d_x$ minimum search units; the minimum search unit is denoted U, and in this embodiment U = 1 byte. In the implementation shown in FIG. 3, one option for the value of M is that M*U does not exceed the preset maximum distance between two adjacent data stream split points, that is, the preset maximum data block length. It is determined whether at least part of the data in the window $W_z[p_z-A_z, p_z+B_z]$ corresponding to point $p_z$ satisfies the predetermined condition $C_z$, where z is an integer, $1 \le z \le M$, and $(p_z-A_z)$ and $(p_z+B_z)$ are the two boundaries of the window $W_z$. When at least part of the data in the window $W_z[p_z-A_z, p_z+B_z]$ of any point $p_z$ does not satisfy the predetermined condition $C_z$, N bytes are skipped along the search direction from the point $p_z$ whose window failed, where $N \le \|B_z\| + \max_x(\|A_x\| + \|k-p_x\|)$. Here $\|k-p_x\|$ is the distance between any one of the M points $p_x$ and the potential split point $k$; $\max_x(\|A_x\| + \|k-p_x\|)$ is the maximum, over the M points $p_x$, of the sum of that distance and the absolute value of the corresponding $A_x$; and $\|B_z\|$ is the absolute value of $B_z$ in $W_z[p_z-A_z, p_z+B_z]$. The principle behind the choice of N is explained in the embodiments below. When at least part of the data in every one of the M windows $W_x[p_x-A_x, p_x+B_x]$ satisfies the corresponding predetermined condition $C_x$, the potential split point $k$ is a data stream split point.
Specifically, for the current potential split point $k_i$, the following steps are performed according to the rule:
Step 401: According to the rule, determine for the current potential split point $k_i$ a point $p_{iz}$ and the window $W_{iz}[p_{iz}-A_z, p_{iz}+B_z]$ corresponding to the point $p_{iz}$, where i and z are integers and $1 \le z \le M$;
Step 402: Determine whether at least part of the data in the window $W_{iz}[p_{iz}-A_z, p_{iz}+B_z]$ satisfies the predetermined condition $C_z$;
When at least part of the data in the window $W_{iz}[p_{iz}-A_z, p_{iz}+B_z]$ does not satisfy the predetermined condition $C_z$, skip N minimum search units U from the point $p_{iz}$ along the data stream split point search direction, where N*U is not greater than $\|B_z\| + \max_x(\|A_x\| + \|k_i-p_{ix}\|)$, to obtain a new potential split point, and perform step 401;
When at least part of the data in every one of the M windows $W_{ix}[p_{ix}-A_x, p_{ix}+B_x]$ of the current potential split point $k_i$ satisfies the predetermined condition $C_x$, the current potential split point $k_i$ is a data stream split point.
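Steps 401 and 402 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the method as claimed: the function name `find_split_point`, the representation of the rule as `(d, A, B, cond)` tuples with $p_x = k - d_x$, the order in which the windows are checked, and the decision to stop when a window falls outside the stream are all hypothetical choices that the text does not fix.

```python
def find_split_point(data, k_start, rule, n_skip, k_limit):
    """Search for a data stream split point in [k_start, k_limit).

    rule    : list of (d, A, B, cond); point p = k - d, window = data[p-A : p+B],
              cond(window_bytes) -> bool is the predetermined condition.
    n_skip  : number N of search units (U = 1 byte) skipped from a failing point.
    Returns the split point offset, or None if none is found.
    """
    k = k_start
    while k < k_limit:
        failed_at = None
        for d, A, B, cond in rule:              # step 401: point and window
            p = k - d
            lo, hi = p - A, p + B
            if lo < 0 or hi > len(data):
                return None                     # window outside the stream
            if not cond(data[lo:hi]):           # step 402: test the condition
                failed_at = p
                break
        if failed_at is None:
            return k                            # all M windows satisfied
        k = failed_at + n_skip                  # skip N*U from the failing point
    return None
```

As a usage example, a rule with 11 points, $A_x = 169$ and $B_x = 0$ could be written `[(d, 169, 0, cond) for d in range(11)]` with `n_skip=11`, where `cond` might test one bit of a hash of the window; that particular `cond` is an assumption for illustration.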
Further, the rule also includes: at least two points $p_e$ and $p_f$ satisfy $A_e = A_f$, $B_e = B_f$, and $C_e = C_f$;
The rule also includes: the at least two points $p_e$ and $p_f$ lie, relative to the potential split point $k$, in the direction opposite to the data stream split point search direction.
The rule also includes: the distance between the at least two points $p_e$ and $p_f$ is 1 U.
Determining whether at least part of the data in the window $W_{iz}[p_{iz}-A_z, p_{iz}+B_z]$ satisfies the predetermined condition $C_z$ specifically includes:
Using a random function to determine whether at least part of the data in the window $W_{iz}[p_{iz}-A_z, p_{iz}+B_z]$ satisfies the predetermined condition $C_z$.
Using a random function to determine whether at least part of the data in the window $W_{iz}[p_{iz}-A_z, p_{iz}+B_z]$ satisfies the predetermined condition $C_z$ is specifically: using a hash function to determine whether at least part of the data in the window $W_{iz}[p_{iz}-A_z, p_{iz}+B_z]$ satisfies the predetermined condition $C_z$.
When at least part of the data in the window $W_{iz}[p_{iz}-A_z, p_{iz}+B_z]$ does not satisfy the predetermined condition $C_z$, N minimum search units U are skipped from the point $p_{iz}$ along the data stream split point search direction to obtain the new potential split point. According to the rule, the left boundary of the window $W_{ic}[p_{ic}-A_c, p_{ic}+B_c]$ corresponding to the point $p_{ic}$ determined for the new potential split point either coincides with the right boundary of the window $W_{iz}[p_{iz}-A_z, p_{iz}+B_z]$ or lies within the range of the window $W_{iz}[p_{iz}-A_z, p_{iz}+B_z]$. Here, the point $p_{ic}$ determined for the new potential split point is the first point in the sequence obtained by ordering, along the data stream search direction, the M points determined for the new potential split point according to the rule.
In the embodiments of the present invention, a data stream split point is found by determining whether at least part of the data in one of the M windows satisfies a predetermined condition. When at least part of the data in a window does not satisfy the predetermined condition, a length of N*U is skipped, where N*U is not greater than $\|B_z\| + \max_x(\|A_x\| + \|k_i-p_{ix}\|)$, to obtain the next potential split point, which improves the efficiency of finding data stream split points.
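The upper bound on the skip length can be evaluated directly from a rule. The sketch below assumes a rule is written as a list of `(d_x, A_x, B_x)` triples, where `d_x` stands for the distance $\|k - p_x\|$ in units of U; the function name `max_skip` is hypothetical.

```python
def max_skip(rule, z):
    """Return ||B_z|| + max_x(||A_x|| + ||k - p_x||) in units of U.

    rule : list of (d_x, A_x, B_x) triples, one per point p_x
    z    : zero-based index of the point whose window failed
    """
    _, _, B_z = rule[z]
    return abs(B_z) + max(abs(A) + abs(d) for d, A, _ in rule)


# Rule with 11 points at distances 0..10 from k, A_x = 169 and B_x = 0.
rule = [(d, 169, 0) for d in range(11)]
print(max_skip(rule, z=4))   # 179
```

For this rule the bound is 0 + (169 + 10) = 179 bytes, which matches the limit of 179 bytes stated for the implementation with 11 points and windows of 169 bytes.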
During deduplication, to keep data block sizes uniform, the average data block size (also called the average chunk size) is taken into account: in addition to the minimum and maximum data block size limits, an average data block size is fixed so that the resulting data blocks are uniform in size. Two factors determine the probability of finding a data stream split point (expressed through P(n)): the number M of points $p_x$, and the probability that at least part of the data in the window $W_x[p_x-A_x, p_x+B_x]$ corresponding to a point $p_x$ satisfies the predetermined condition $C_x$. The former affects the skip length and the latter affects the skip probability; together they determine the average chunk size. In general, for a fixed average chunk size, as the number M of points $p_x$ increases, the probability that at least part of the data in the window $W_x[p_x-A_x, p_x+B_x]$ of a single point $p_x$ satisfies $C_x$ also increases. For example, one rule preset on the deduplication server 103 is: determine 11 points $p_x$ for the potential split point $k$, with x taking the consecutive natural numbers 1 to 11, where the probability that at least part of the data in the window $W_x[p_x-A_x, p_x+B_x]$ of any one of the 11 points satisfies $C_x$ is 1/2. Another rule preset on the deduplication server 103 is: select 24 points $p_x$ for the potential split point $k$, with x taking the consecutive natural numbers 1 to 24, where the probability that at least part of the data in the window $W_x[p_x-A_x, p_x+B_x]$ of any one of the 24 points satisfies $C_x$ is 3/4. For how this probability is set, see the description of determining whether at least part of the data in the window $W_x[p_x-A_x, p_x+B_x]$ satisfies the predetermined condition $C_x$. These two factors determine P(n), where P(n) is the probability that no data stream split point has been found after searching n minimum search units from the start of the data stream or from the previous data stream split point. The computation of P(n) from these two factors is in effect a multi-step Fibonacci sequence and is described in detail later. Once P(n) is known, 1-P(n) is the distribution function of the data stream split point, and (1-P(n))-(1-P(n-1)) = P(n-1)-P(n) is the probability of finding the data stream split point at the n-th point, that is, the density function of the data stream split point. The density function can be integrated to obtain the expected length of a data stream split point, that is, the average chunk size; in this integration, 4*1024 (bytes) is the minimum data block length and 12*1024 (bytes) is the maximum data block length.
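The integration formula itself is not legible in this text. A reconstruction that is consistent with the density function $P(n-1)-P(n)$ and with the two stated limits, offered here as an assumption rather than as the original formula, is:

$$E[L] \approx \sum_{n=4\times 1024}^{12\times 1024} n\,\bigl(P(n-1)-P(n)\bigr)$$

Here $E[L]$ is a label introduced for the expected data block length. If blocks that reach the maximum length without a split point being found are counted as well, a forced-split term $12\times 1024 \cdot P(12\times 1024)$ would presumably be added. As a rough comparison of the two example rules, and assuming the window conditions were independent (which overlapping windows need not be), the chance that all M windows at one potential split point are satisfied would be $(1/2)^{11} = 1/2048 \approx 4.9\times 10^{-4}$ for the first rule and $(3/4)^{24} \approx 1.0\times 10^{-3}$ for the second.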
如图3所示的数据流分割点查找的基础上,在图5所示的实施方式中,在去重服务器103上预设有规则,所述规则为:为潜在分割点k确定11个点px、点px对应的窗口Wx[px-Ax,px+Bx](简称窗口Wx)和窗口Wx[px-Ax,px+Bx]对应的预定条件Cx,其中,A1=A2=A3=A4=A5=A6=A7 =A8=A9=A10=A11=169,B1=B2=B3=B4=B5=B6=B7=B8=B9=B10=B11=0,并且C1=C2=C3=C4=C5=C6=C7=C8=C9=C10=C11。其中,点px与潜在分割点k之间距离dx个字节,具体的,点p1与潜在分割点k之间距离0个字节,点p2与潜在分割点k之间距离1个字节,点p3与潜在分割点k之间距离2个字节,点p4与潜在分割点k之间距离3个字节,点p5与潜在分割点k之间距离4个字节,点p6与潜在分割点k之间距离5个字节,点p7与潜在分割点k之间距离6个字节,点p8与潜在分割点k之间距离7个字节,点p9与潜在分割点k之间距离8个字节,点p10与潜在分割点k之间距离9个字节,点p11与潜在分割点k之间距离10个字节,并且点p2、p3、p4、p5、p6、p7、p8、p9、p10和p11相对于潜在分割点k均位于数据流分割点查找反方向。ka为数据流分割点,图5中所示数据流分割点查找方向为从左向右,从数据流分割点ka跳过最小数据块4KB后,最小数据块4KB结束位置作为下一个潜在分割点ki,为潜在分割点ki确定点pix,在本实施例中,根据在去重服务器103上预设的规则,x分别为1到11连续的自然数。在图5所示的实施方式中,为潜在分割点ki确定的点为11个,分别为pi1、pi2、pi3、pi4、pi5、pi6、pi7、pi8、pi9、pi10和pi11,点pi1、pi2、pi3、pi4、pi5、pi6、pi7、pi8、pi9、pi10和pi11对应的窗口分别为Wi1[pi1-169,pi1]、Wi2[pi2-169,pi2]、Wi3[pi3-169,pi3]、Wi4[pi4-169,pi4]、Wi5[pi5-169,pi5]、Wi6[pi6-169,pi6]、Wi7[pi7-169,pi7]、Wi8[pi8-169,pi8]、Wi9[pi9-169,pi9]、Wi10[pi10-169,pi10]和Wi11[pi11-169,pi11]。上述窗口分别简称为Wi1、Wi2、Wi3、Wi4、Wi5、Wi6、Wi7、Wi8、Wi9、Wi10和Wi11。其中,点pix与潜在分割点ki之间距离dx个字节,具体的,pi1与ki间距0个字节、pi2与ki间距1个字节、pi3与ki间距2个字节、pi4与ki间距3个字节、pi5与ki间距4个字节、pi6与ki间距5个字节、pi7与 
ki间距6个字节、pi8与ki间距7个字节、pi9与ki间距8个字节、pi10与ki间距9个字节,pi11与ki间距10个字节,并且pi2、pi3、pi4、pi5、pi6、pi7、pi8、pi9、pi10和pi11相对于潜在分割点ki均位于数据流分割点查找反方向。判断Wi1[pi1-169,pi1]中至少部分数据是否满足预定条件C1、判断Wi2[pi2-169,pi2]中至少部分数据是否满足预定条件C2、判断Wi3[pi3-169,pi3]中至少部分数据是否满足预定条件C3、判断Wi4[pi4-169,pi4]中至少部分数据是否满足预定条件C4、判断Wi5[pi5-169,pi5]中至少部分数据是否满足预定条件C5、判断Wi6[pi6-169,pi6]中至少部分数据是否满足预定条件C6、判断Wi7[pi7-169,pi7]中至少部分数据是否满足预定条件C7、判断Wi8[pi8-169,pi8]中至少部分数据是否满足预定条件C8、判断Wi9[pi9-169,pi9]中至少部分数据是否满足预定条件C9、判断Wi10[pi10-169,pi10]中至少部分数据是否满足预定条件C10和判断Wi11[pi11-169,pi11]中至少部分数据是否满足预定条件C11。当判断窗口Wi1中至少部分数据满足预定条件C1、窗口Wi2中至少部分数据满足预定条件C2、窗口Wi3中至少部分数据满足预定条件C3、窗口Wi4中至少部分数据满足预定条件C4、窗口Wi5中至少部分数据满足预定条件C5、窗口Wi6中至少部分数据满足预定条件C6、窗口Wi7中至少部分数据满足预定条件C7、窗口Wi8中至少部分数据满足预定条件C8、窗口Wi9中至少部分数据满足预定条件C9、窗口Wi10中至少部分数据满足预定条件C10和窗口Wi11中至少部分数据满足预定条件C11时,则当前潜在分割点ki为数据流分割点。当11个窗口中任一个窗口中至少部分数据不满足对应的预定条件时,如图6所示,Wi5[pi5-169,pi5]中至少部分数据不满足对应的预定条件C5,则从点pi5沿着数据流分割点查找方向跳跃N个字节,其中N个字节不大于‖B5‖+maxx(‖Ax‖+‖(ki-pix)‖),在图6所示的实 施方式中,跳跃N个字节不大于179字节,在本实施例中,N=11,得到下一个潜在分割点,为与潜在分割点ki区别,这里将新的潜在分割点表示为kj。根据图5所示的实施方式中在去重服务器103上预设的规则,为潜在分割点kj确定的点为11个,分别为pj1、pj2、pj3、pj4、pj5、pj6、pj7、pj8、pj9、pj10和pj11,确定点pj1、pj2、pj3、pj4、pj5、pj6、pj7、pj8、pj9、pj10和pj11对应的窗口分别为Wj1[pj1-169,pj1]、Wj2[pj2-169,pj2]、Wj3[pj3-169,pj3]、Wj4[pj4-169,pj4]、Wj5[pj5-169,pj5]、Wj6[pj6-169,pj6]、Wj7[pj7-169,pj7]、Wj8[pj8-169,pj8]、Wj9[pj9-169,pj9]、Wj10[pj10-169,pj10]和Wj11[pj11-169,pj11]。其中,pjx与潜在分割点kj之间距离dx个字节,具体的,pj1与kj间距0个字节、pj2与kj间距1个字节、pj3与kj间距2个字节、pj4与kj间距3个字节、pj5与kj间距4个字节、pj6与kj间距5个字节、pj7与kj间距6个字节、pj8与kj间距7个字节、pj9与kj间距8个字节、pj10与kj间距9个字节,pj11与kj间距10个字节,并且pj1、pj2、pj3、pj4、pj5、pj6、pj7、pj8、pj9、pj10和pj11相对于潜在分割点kj均位于数据流分割点查找反方向。如图6所示实施方式中,当为潜在分割点kj确定的第11个窗口Wj11[pj11-169,pj11],在保证潜在分割点ki与潜在分割点kj之间的范围都在判断范围之内,则在本实施方式中,必须保证窗口Wj11[pj11-169,pj11]的左边界与Wi5[pi5-169,pi5]的右边界pi5重合或者位于Wi5[pi5-169,pi5]范围之内,其中,所述潜在分割点kj确定的点pj11是根据所述规则,为所述潜在分割点kj确定的M个点按照数据流查找方向获得的序列中排序第一的点。因此,在这一限定内,当Wi5[pi5-169,pi5]中至少部分数据不满足预定条件C5,从pi5沿着数据流分割点查找方向跳跃的距离为不大于‖B5‖+maxx(‖Ax‖+‖(ki-pix)‖),其中,M=11,11*U不大于maxx(‖Ax‖+‖(ki-pix)‖),因此,从pi5沿着数据 
流分割点查找方向跳跃的距离为不大于179。判断Wj1[pj1-169,pj1]中至少部分数据是否满足预定条件C1、判断Wj2[pj2-169,pj2]中至少部分数据是否满足预定条件C2、判断Wj3[pj3-169,pj3]中至少部分数据是否满足预定条件C3、判断Wj4[pj4-169,pj4]中至少部分数据是否满足预定条件C4、判断Wj5[pj5-169,pj5]中至少部分数据是否满足预定条件C5、判断Wj6[pj6-169,pj6]中至少部分数据是否满足预定条件C6、判断Wj7[pj7-169,pj7]中至少部分数据是否满足预定条件C7、判断Wj8[pj8-169,pj8]中至少部分数据是否满足预定条件C8、判断Wj9[pj9-169,pj9]中至少部分数据是否满足预定条件C9、判断Wj10[pj10-169,pj10]中至少部分数据是否满足预定条件C10和判断Wj11[pj11-169,pj11]中至少部分数据是否满足预定条件C11。当然在本发明实施例中,判断潜在分割点ka是否为数据流分割点时也遵循该规则,具体实现不再描述,可以参照判断潜在分割点ki的描述。当判断窗口Wj1中至少部分数据满足预定条件C1、窗口Wj2中至少部分数据满足预定条件C2、窗口Wj3中至少部分数据满足预定条件C3、窗口Wj4中至少部分数据满足预定条件C4、窗口Wj5中至少部分数据满足预定条件C5、窗口Wj6中至少部分数据满足预定条件C6、窗口Wj7中至少部分数据满足预定条件C7、窗口Wj8中至少部分数据满足预定条件C8、窗口Wj9中至少部分数据满足预定条件C9、窗口Wj10中至少部分数据满足预定条件C10和窗口Wj11中至少部分数据满足预定条件C11时,则当前潜在分割点kj为数据流分割点,kj与ka之间的数据构成1个数据块,同时按照与ka相同的方式跳过最小分块大小4KB,获得下一个潜在分割点,并按照在去重服务器103上预设的规则,判断下一个潜在分割点是否为数据流分割点。当判断潜 在分割点kj不是数据流分割点时,按照与ki相同的方式跳跃11个字节获得下一个潜在分割点,并按照在去重服务器103上预设的规则及上述方法判断下一个潜在分割点是否为数据流分割点。当超过设定的最大数据块仍然没有找到数据流分割点时,则从最大数据块的结束位置作为强制分割点。On the basis of the data stream segmentation point search as shown in Figure 3, in the embodiment shown in Figure 5, a rule is preset on the deduplication server 103, and the rule is: determine 11 points for the potential segmentation point k p x , window W x [p x -A x ,p x +B x ] (referred to as window W x ) corresponding to point p x and reservation corresponding to window W x [p x -A x ,p x +B x ] Condition C x , where A 1 =A 2 =A 3 =A 4 =A 5 =A 6 =A 7 =A 8 =A 9 =A 10 =A 11 =169, B 1 =B 2 =B 3 = B 4 =B 5 =B 6 =B 7 =B 8 =B 9 =B 10 =B 11 =0, and C 1 =C 2 =C 3 =C 4 =C 5 =C 6 =C 7 =C 8 =C 9 =C 10 =C 11 . 
The distance between point p_x and the potential segmentation point k is d_x bytes. Specifically, p_1 is 0 bytes from k, p_2 is 1 byte from k, p_3 is 2 bytes, p_4 is 3 bytes, p_5 is 4 bytes, p_6 is 5 bytes, p_7 is 6 bytes, p_8 is 7 bytes, p_9 is 8 bytes, p_10 is 9 bytes, and p_11 is 10 bytes from k, and p_2, p_3, p_4, p_5, p_6, p_7, p_8, p_9, p_10 and p_11 all lie, relative to k, in the direction opposite to the data stream segmentation point search. k_a is a data stream segmentation point, and the search direction shown in Figure 5 is from left to right. After the minimum data block of 4 KB is skipped from k_a, the end position of that 4 KB block is taken as the next potential segmentation point k_i, and points p_ix are determined for k_i. In this embodiment, according to the rule preset on the deduplication server 103, x takes the consecutive natural numbers from 1 to 11. In the embodiment shown in Figure 5, the 11 points determined for the potential segmentation point k_i are p_i1, p_i2, p_i3, p_i4, p_i5, p_i6, p_i7, p_i8, p_i9, p_i10 and p_i11, and their corresponding windows are W_i1[p_i1-169, p_i1], W_i2[p_i2-169, p_i2], W_i3[p_i3-169, p_i3], W_i4[p_i4-169, p_i4], W_i5[p_i5-169, p_i5], W_i6[p_i6-169, p_i6], W_i7[p_i7-169, p_i7], W_i8[p_i8-169, p_i8], W_i9[p_i9-169, p_i9], W_i10[p_i10-169, p_i10] and W_i11[p_i11-169, p_i11], referred to as W_i1 through W_i11 for short. The distance between p_ix and k_i is d_x bytes. Specifically, p_i1 is 0 bytes from k_i, p_i2 is 1 byte, p_i3 is 2 bytes, p_i4 is 3 bytes, p_i5 is 4 bytes, p_i6 is 5 bytes, p_i7 is 6 bytes, p_i8 is 7 bytes, p_i9 is 8 bytes, p_i10 is 9 bytes, and p_i11 is 10 bytes from k_i, and p_i2 through p_i11 all lie, relative to k_i, in the direction opposite to the data stream segmentation point search.
It is then judged whether at least part of the data in W_i1[p_i1-169, p_i1] satisfies the predetermined condition C_1, whether at least part of the data in W_i2[p_i2-169, p_i2] satisfies C_2, and so on for each window in turn (W_i3 against C_3, W_i4 against C_4, W_i5 against C_5, W_i6 against C_6, W_i7 against C_7, W_i8 against C_8, W_i9 against C_9, W_i10 against C_10), up to whether at least part of the data in W_i11[p_i11-169, p_i11] satisfies C_11.
When at least part of the data in every one of the windows W_i1, W_i2, W_i3, W_i4, W_i5, W_i6, W_i7, W_i8, W_i9, W_i10 and W_i11 satisfies its predetermined condition C_1 through C_11 respectively, the current potential segmentation point k_i is a data stream segmentation point. When at least part of the data in any one of the 11 windows does not satisfy the corresponding predetermined condition, for example when, as shown in Figure 6, at least part of the data in W_i5[p_i5-169, p_i5] does not satisfy C_5, N bytes are jumped from point p_i5 along the segmentation point search direction, where N is not greater than ‖B_5‖ + max_x(‖A_x‖ + ‖(k_i - p_ix)‖). In the embodiment shown in Figure 6, N is therefore not greater than 179 bytes, and in this embodiment N = 11. The jump yields the next potential segmentation point, denoted k_j to distinguish it from k_i. According to the rule preset on the deduplication server 103 in the embodiment shown in Figure 5, 11 points are determined for k_j, namely p_j1, p_j2, p_j3, p_j4, p_j5, p_j6, p_j7, p_j8, p_j9, p_j10 and p_j11, and their corresponding windows are W_j1[p_j1-169, p_j1], W_j2[p_j2-169, p_j2], W_j3[p_j3-169, p_j3], W_j4[p_j4-169, p_j4], W_j5[p_j5-169, p_j5], W_j6[p_j6-169, p_j6], W_j7[p_j7-169, p_j7], W_j8[p_j8-169, p_j8], W_j9[p_j9-169, p_j9], W_j10[p_j10-169, p_j10] and W_j11[p_j11-169, p_j11]. The distance between p_jx and k_j is d_x bytes. Specifically, p_j1 is 0 bytes from k_j, p_j2 is 1 byte, p_j3 is 2 bytes, p_j4 is 3 bytes, p_j5 is 4 bytes, p_j6 is 5 bytes, p_j7 is 6 bytes, p_j8 is 7 bytes, p_j9 is 8 bytes, p_j10 is 9 bytes, and p_j11 is 10 bytes from k_j, and p_j2 through p_j11 all lie, relative to k_j, in the direction opposite to the data stream segmentation point search.
In the embodiment shown in Figure 6, to guarantee that the whole range between the potential segmentation points k_i and k_j is covered by the judgment, the left boundary of the 11th window W_j11[p_j11-169, p_j11] determined for k_j must either coincide with the right boundary p_i5 of W_i5[p_i5-169, p_i5] or fall within W_i5[p_i5-169, p_i5]. Here p_j11 is the first point, in the order given by the data stream segmentation point search direction, among the M points determined for k_j according to the rule. Under this constraint, when at least part of the data in W_i5[p_i5-169, p_i5] does not satisfy the predetermined condition C_5, the distance jumped from p_i5 along the search direction is not greater than ‖B_5‖ + max_x(‖A_x‖ + ‖(k_i - p_ix)‖), where M = 11 and 11*U is not greater than max_x(‖A_x‖ + ‖(k_i - p_ix)‖). The distance jumped from p_i5 along the search direction is therefore not greater than 179.
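As a worked instance of this bound, using only the values stated for this embodiment (A_x = 169, B_x = 0, and a largest distance of 10 bytes between k_i and any p_ix):

```latex
\[
N \;\le\; \|B_5\| + \max_x\bigl(\|A_x\| + \|k_i - p_{ix}\|\bigr)
  \;=\; 0 + (169 + 10) \;=\; 179 .
\]
```

The value N = 11 chosen in this embodiment lies well inside that limit.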
It is then judged whether at least part of the data in W_j1[p_j1-169, p_j1] satisfies the predetermined condition C_1, whether at least part of the data in W_j2[p_j2-169, p_j2] satisfies C_2, and so on for each window in turn (W_j3 against C_3, W_j4 against C_4, W_j5 against C_5, W_j6 against C_6, W_j7 against C_7, W_j8 against C_8, W_j9 against C_9, W_j10 against C_10), up to whether at least part of the data in W_j11[p_j11-169, p_j11] satisfies C_11. In this embodiment of the invention the same rule is of course followed when judging whether the potential segmentation point k_a is a data stream segmentation point; that implementation is not described again, and the description of judging k_i applies.
When at least part of the data in every one of the windows W_j1, W_j2, W_j3, W_j4, W_j5, W_j6, W_j7, W_j8, W_j9, W_j10 and W_j11 satisfies its predetermined condition C_1 through C_11 respectively, the current potential segmentation point k_j is a data stream segmentation point, and the data between k_j and k_a forms one data block. The minimum block size of 4 KB is then skipped in the same way as for k_a to obtain the next potential segmentation point, and that point is judged according to the rule preset on the deduplication server 103. When k_j is judged not to be a data stream segmentation point, 11 bytes are jumped in the same way as for k_i to obtain the next potential segmentation point, which is judged according to the rule preset on the deduplication server 103 and the method described above. When no data stream segmentation point has been found by the time the set maximum data block size is exceeded, the end position of the maximum data block is used as a forced segmentation point.
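The procedure described above can be sketched as a short loop. This is only an illustration under stated assumptions: `window_ok` is a hypothetical stand-in for the predetermined condition C (a one-bit hash test that holds for roughly half of all windows), points are treated as byte offsets, the window W_x[p_x-169, p_x] is taken to be the 169 bytes ending at p_x, and the 12 KB maximum block is borrowed from the probability discussion later in the text. None of the function names come from the patent.

```python
import hashlib

A, B = 169, 0            # window extents: W_x[p_x - A, p_x + B]
M = 11                   # points per potential segmentation point
MIN_CHUNK = 4 * 1024     # minimum block skipped after each segmentation point
MAX_CHUNK = 12 * 1024    # assumed maximum block (forced segmentation point)
JUMP = 11                # N; must not exceed B + max_x(A + d_x) = 179

def window_ok(data, p):
    """Hypothetical condition C: true for about half of all windows."""
    digest = hashlib.blake2b(data[p - A:p + B], digest_size=1).digest()
    return digest[0] & 1 == 1

def next_split(data, start, ok=window_ok):
    """Return the offset of the next segmentation point after `start`."""
    limit = min(start + MAX_CHUNK, len(data))
    k = start + MIN_CHUNK
    while k < limit:
        for d in range(M):           # p_1 = k, p_2 = k - 1, ..., p_11 = k - 10
            p = k - d
            if not ok(data, p):
                k = p + JUMP         # jump N bytes from the failing point
                break
        else:
            return k                 # all 11 windows satisfied the condition
    return limit                     # forced segmentation point or end of data

def chunks(data, ok=window_ok):
    """Split `data` into (start, end) blocks."""
    spans, start = [], 0
    while start < len(data):
        end = next_split(data, start, ok)
        spans.append((start, end))
        start = end
    return spans
```

Calling `chunks(data)` on a byte string returns contiguous blocks; every block except possibly the last is between 4 KB and 12 KB long.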
In the embodiment shown in Figure 5, according to the rule preset on the deduplication server 103, the judgment starts with whether at least part of the data in W_i1[p_i1-169, p_i1] satisfies the predetermined condition C_1. Suppose that at least part of the data in W_i1[p_i1-169, p_i1] satisfies C_1, at least part of the data in W_i2[p_i2-169, p_i2] satisfies C_2, at least part of the data in W_i3[p_i3-169, p_i3] satisfies C_3 and at least part of the data in W_i4[p_i4-169, p_i4] satisfies C_4, but at least part of the data in W_i5[p_i5-169, p_i5] does not satisfy C_5, and that 10 bytes are then jumped from point p_i5 along the segmentation point search direction. A new potential segmentation point is obtained at the end position of the 10th byte; to distinguish it from the other potential segmentation points it is denoted k_g. According to the rule preset on the deduplication server 103, 11 points p_gx are determined for k_g, where x takes the consecutive natural numbers from 1 to 11, namely p_g1, p_g2, p_g3, p_g4, p_g5, p_g6, p_g7, p_g8, p_g9, p_g10 and p_g11, and their corresponding windows are W_g1[p_g1-169, p_g1], W_g2[p_g2-169, p_g2], W_g3[p_g3-169, p_g3], W_g4[p_g4-169, p_g4], W_g5[p_g5-169, p_g5], W_g6[p_g6-169, p_g6], W_g7[p_g7-169, p_g7], W_g8[p_g8-169, p_g8], W_g9[p_g9-169, p_g9], W_g10[p_g10-169, p_g10] and W_g11[p_g11-169, p_g11].
The distance between p_gx and the potential segmentation point k_g is d_x bytes. Specifically, p_g1 is 0 bytes from k_g, p_g2 is 1 byte, p_g3 is 2 bytes, p_g4 is 3 bytes, p_g5 is 4 bytes, p_g6 is 5 bytes, p_g7 is 6 bytes, p_g8 is 7 bytes, p_g9 is 8 bytes, p_g10 is 9 bytes, and p_g11 is 10 bytes from k_g, and p_g2 through p_g11 all lie, relative to k_g, in the direction opposite to the data stream segmentation point search. It is judged whether at least part of the data in W_g1[p_g1-169, p_g1] satisfies the predetermined condition C_1, whether at least part of the data in W_g2[p_g2-169, p_g2] satisfies C_2, and so on for each window in turn (W_g3 against C_3, W_g4 against C_4, W_g5 against C_5, W_g6 against C_6, W_g7 against C_7, W_g8 against C_8, W_g9 against C_9, W_g10 against C_10), up to whether at least part of the data in W_g11[p_g11-169, p_g11] satisfies C_11.
In this situation the point p_g11 determined for k_g coincides with the point p_i5 determined for k_i, the window W_g11[p_g11-169, p_g11] coincides with the window W_i5[p_i5-169, p_i5], and C_5 = C_11. Consequently, when at least part of the data in W_i5[p_i5-169, p_i5] has been judged not to satisfy C_5 for k_i, the potential segmentation point k_g obtained by jumping 10 bytes from p_i5 along the search direction still cannot qualify as a data stream segmentation point. Jumping 10 bytes from p_i5 therefore repeats a calculation, whereas jumping 11 bytes from p_i5 avoids the repeated calculation and is more efficient, which increases the speed of finding data stream segmentation points. When the probability that at least part of the data in the window W_x[p_x-A_x, p_x+B_x] corresponding to a point p_x of the preset rule satisfies the predetermined condition C_x is 1/2, a jump is performed with probability 1/2, and each jump can cover at most 179 bytes.
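The overlap argument can be checked with plain offset arithmetic. The sketch below assumes byte offsets and the Figure 5 distances d_1..d_11 = 0..10; the helper name `points` is illustrative only.

```python
DISTANCES = list(range(11))      # d_1..d_11: 0..10 bytes behind k

def points(k):
    """Offsets p_1..p_11 for a potential segmentation point k."""
    return [k - d for d in DISTANCES]

k_i = 4096                       # first candidate after the 4 KB minimum block
p_i5 = points(k_i)[4]            # the point whose window failed C_5

k_g = p_i5 + 10                  # candidate reached by a 10-byte jump
k_j = p_i5 + 11                  # candidate reached by an 11-byte jump

print(points(k_g)[10] == p_i5)   # True: W_g11 is the window that already failed
print(p_i5 in points(k_j))       # False: k_j shares no point with the failure
```

With an 11-byte jump the earliest point of k_j sits one byte past p_i5, so no window is evaluated twice and no candidate that could still succeed is skipped.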
In this embodiment, the predetermined rule is: for a potential segmentation point k, determine 11 points p_x, the window W_x[p_x-A_x, p_x+B_x] corresponding to each point p_x, and the predetermined condition C_x corresponding to each window W_x[p_x-A_x, p_x+B_x], where x takes the consecutive natural numbers from 1 to 11 and the probability that at least part of the data in the window W_x[p_x-A_x, p_x+B_x] corresponding to point p_x satisfies the predetermined condition is 1/2. From these two factors, the number of points and this probability, P(n) can be calculated.
In addition, A_1 = A_2 = A_3 = A_4 = A_5 = A_6 = A_7 = A_8 = A_9 = A_10 = A_11 = 169, B_1 = B_2 = B_3 = B_4 = B_5 = B_6 = B_7 = B_8 = B_9 = B_10 = B_11 = 0, and C_1 = C_2 = C_3 = C_4 = C_5 = C_6 = C_7 = C_8 = C_9 = C_10 = C_11. The distance between p_x and the potential segmentation point k is d_x bytes. Specifically, p_1 is 0 bytes from k, p_2 is 1 byte, p_3 is 2 bytes, p_4 is 3 bytes, p_5 is 4 bytes, p_6 is 5 bytes, p_7 is 6 bytes, p_8 is 7 bytes, p_9 is 8 bytes, p_10 is 9 bytes, and p_11 is 10 bytes from k, and p_2 through p_11 all lie, relative to k, in the direction opposite to the data stream segmentation point search. Whether the potential segmentation point k is a data stream segmentation point is therefore determined by whether there are 11 consecutive points such that at least part of the data in each of their windows satisfies the predetermined condition C_x. After the minimum block length of 4096 bytes is skipped from the start of the data stream or from the previous data stream segmentation point, stepping back 10 bytes against the search direction gives the 4086th point. No data stream segmentation point exists at that point, so P(4086) = 1, and likewise P(4087) = 1, ..., P(4095) = 1. At the 4096th point, that is, at the minimum block size, at least part of the data in every one of the windows corresponding to these 11 points satisfies C_x with probability (1/2)^11. A data stream segmentation point therefore exists with probability (1/2)^11 and does not exist with probability 1-(1/2)^11, so P(4096) = 1-(1/2)^11.
As shown in Figure 34, P(n) at the nth point can be derived recursively by distinguishing 12 cases.
Case 1: with probability 1/2, at least part of the data in the window corresponding to the nth point does not satisfy the predetermined condition, and with probability P(n-1) the n-1 points before the nth point contain no 11 consecutive points whose windows each have at least part of their data satisfying the predetermined condition. P(n) therefore includes the term 1/2*P(n-1). The situation in which the window of the nth point does not satisfy the predetermined condition but the n-1 preceding points do contain 11 consecutive points whose windows each satisfy it does not contribute to P(n).
Case 2: with probability 1/2, at least part of the data in the window corresponding to the nth point satisfies the predetermined condition; with probability 1/2, at least part of the data in the window corresponding to the (n-1)th point does not; and with probability P(n-2) the n-2 points before the (n-1)th point contain no 11 consecutive points whose windows each have at least part of their data satisfying the predetermined condition. P(n) therefore includes the term 1/2*1/2*P(n-2). The situation in which the window of the nth point satisfies the predetermined condition, the window of the (n-1)th point does not, but the n-2 preceding points do contain 11 consecutive points whose windows each satisfy it does not contribute to P(n).
Following the same pattern, Case 11: with probability (1/2)^10, at least part of the data in each of the windows corresponding to the nth through (n-9)th points satisfies the predetermined condition; with probability 1/2, at least part of the data in the window corresponding to the (n-10)th point does not; and with probability P(n-11) the n-11 points before the (n-10)th point contain no 11 consecutive points whose windows each have at least part of their data satisfying the predetermined condition. P(n) therefore includes the term (1/2)^10*1/2*P(n-11). The situation in which the windows of the nth through (n-9)th points all satisfy the predetermined condition, the window of the (n-10)th point does not, but the n-11 preceding points do contain 11 consecutive points whose windows each satisfy it does not contribute to P(n).
Case 12: with probability (1/2)^11, at least part of the data in each of the windows corresponding to the nth through (n-10)th points satisfies the predetermined condition. This case does not contribute to P(n).
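As a quick consistency check, not part of the original text, the twelve cases can be confirmed to cover every outcome exactly once. The sketch assumes that the cases not spelled out above (3 through 10) follow the same pattern as Cases 1, 2 and 11, so that case m has probability (1/2)^(m-1) * 1/2.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Case m (m = 1..11): the m-1 most recent windows satisfy the condition
# and the window of point n-m+1 does not.
case_probs = [half ** (m - 1) * half for m in range(1, 12)]

# Case 12: the windows of points n through n-10 all satisfy the condition.
case_probs.append(half ** 11)

total = sum(case_probs)
print(total)   # 1
```

Because the probabilities sum to exactly 1, dropping Case 12 and weighting Cases 1 to 11 by P(n-m) accounts for every way of reaching the nth point without a segmentation point.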
Therefore, P(n) = 1/2*P(n-1) + (1/2)^2*P(n-2) + ... + (1/2)^11*P(n-11). Another preset rule is: for a potential segmentation point k, determine 24 points p_x, the window W_x[p_x-A_x, p_x+B_x] corresponding to each point p_x, and the predetermined condition C_x corresponding to each window W_x[p_x-A_x, p_x+B_x], where x takes the consecutive natural numbers from 1 to 24 and the probability that at least part of the data in the window W_x[p_x-A_x, p_x+B_x] corresponding to point p_x satisfies the predetermined condition C_x is 3/4. From these two factors P(n) can be calculated. In addition, A_1 = A_2 = A_3 = A_4 = A_5 = A_6 = A_7 = A_8 = A_9 = A_10 = A_11 = 169, B_1 = B_2 = B_3 = B_4 = B_5 = B_6 = B_7 = B_8 = B_9 = B_10 = B_11 = 0, and C_1 = C_2 = C_3 = C_4 = C_5 = C_6 = C_7 = C_8 = C_9 = ... = C_22 = C_23 = C_24. The distance between p_x and the potential segmentation point k is d_x bytes. Specifically, p_1 is 0 bytes from k, p_2 is 1 byte, p_3 is 2 bytes, p_4 is 3 bytes, p_5 is 4 bytes, p_6 is 5 bytes, p_7 is 6 bytes, p_8 is 7 bytes, p_9 is 8 bytes, ..., p_22 is 21 bytes, p_23 is 22 bytes, and p_24 is 23 bytes from k, and p_2, p_3, p_4, p_5, p_6, p_7, p_8, p_9, ..., p_22, p_23 and p_24 all lie, relative to k, in the direction opposite to the data stream segmentation point search. Whether the potential segmentation point k is a data stream segmentation point is therefore determined by whether there are 24 consecutive points such that at least part of the data in each of their windows satisfies the predetermined condition C_x, and the probability can be calculated with the following formulas:
P(4073) = 1, P(4074) = 1, ..., P(4095) = 1, P(4096) = 1-(3/4)^24,
P(n) = 1/4*P(n-1) + 1/4*(3/4)*P(n-2) + ... + 1/4*(3/4)^23*P(n-24).
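The two recurrences share one shape. As a sketch, with symbols introduced here only for illustration (M for the number of points, q for the probability that a window satisfies its condition, and n_0 = 4096 for the minimum block length), they can be written as:

```latex
\[
P(n) = \sum_{m=1}^{M} (1-q)\,q^{\,m-1}\,P(n-m) \quad (n > n_0), \qquad
P(n_0) = 1 - q^{M}, \qquad
P(n) = 1 \ \text{ for } n_0 - M < n < n_0 .
\]
```

Setting (M, q) = (11, 1/2) gives the 11-point recurrence with P(4086) = ... = P(4095) = 1, and (M, q) = (24, 3/4) gives the 24-point recurrence with P(4073) = ... = P(4095) = 1.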
Calculation gives P(5*1024) = 0.78, P(11*1024) = 0.17 and P(12*1024) = 0.13. That is, with a probability of 13% no data stream segmentation point has been found after searching 12 KB from the start of the data stream or from the previous data stream segmentation point, and a split is forced. From this probability the density function of the data stream segmentation point is obtained, and integration shows that a segmentation point is found, on average, after searching about 7.6 KB from the start of the data stream or from the previous segmentation point; the average block length is thus about 7.6 KB. Whereas here at least part of the data in the windows of 11 consecutive points must each satisfy the predetermined condition with probability 1/2, a traditional CDC algorithm achieves an average block length of 7.6 KB only when a single window satisfies its condition with probability 1/2^12.
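The recurrence is easy to evaluate numerically. The sketch below is an illustration rather than the patent's own computation: the function names are hypothetical, the forced split is assumed to occur at 12 KB, and the mean block length is taken as the sum of the probabilities that the block is longer than n bytes. Run with 11 points and probability 1/2, it should give values close to the figures quoted above.

```python
def survival(m_points, q, n_min=4096, n_max=12 * 1024):
    """Return {n: P(n)}, the probability that no segmentation point
    has been found up to the nth point.

    m_points: number of consecutive points required (11 or 24 above)
    q:        probability that one window satisfies its condition
    """
    first = n_min - m_points + 1
    P = {n: 1.0 for n in range(first, n_min)}      # e.g. P(4086..4095) = 1
    P[n_min] = 1.0 - q ** m_points                 # e.g. P(4096) = 1 - (1/2)^11
    for n in range(n_min + 1, n_max + 1):
        P[n] = sum((1 - q) * q ** (m - 1) * P[n - m]
                   for m in range(1, m_points + 1))
    return P

def mean_block_length(P, n_min=4096, n_max=12 * 1024):
    """Expected block length in bytes, with a forced split at n_max."""
    return n_min + sum(P[n] for n in range(n_min, n_max))

P11 = survival(11, 0.5)
for kb in (5, 11, 12):
    print(kb, "KB:", round(P11[kb * 1024], 3))
print("mean bytes:", round(mean_block_length(P11)))
```

Calling `survival(24, 0.75)` evaluates the 24-point rule in the same way.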
Building on the data stream segmentation point search shown in Figure 3, in the embodiment shown in Figure 7 a rule is preset on the deduplication server 103. The rule is: for a potential segmentation point k, determine 11 points p_x, the window W_x[p_x-A_x, p_x+B_x] corresponding to each point p_x, and the predetermined condition C_x corresponding to each window W_x[p_x-A_x, p_x+B_x], where x takes the consecutive natural numbers from 1 to 11, the probability that at least part of the data in the window W_x[p_x-A_x, p_x+B_x] corresponding to point p_x satisfies the predetermined condition C_x is 1/2, A_1 = A_2 = A_3 = A_4 = A_5 = A_6 = A_7 = A_8 = A_9 = A_10 = A_11 = 169, B_1 = B_2 = B_3 = B_4 = B_5 = B_6 = B_7 = B_8 = B_9 = B_10 = B_11 = 0, and C_1 = C_2 = C_3 = C_4 = C_5 = C_6 = C_7 = C_8 = C_9 = C_10 = C_11. The distance between p_x and the potential segmentation point k is d_x bytes. Specifically, p_1 is 2 bytes from k, p_2 is 3 bytes, p_3 is 4 bytes, p_4 is 5 bytes, p_5 is 6 bytes, p_6 is 7 bytes, p_7 is 8 bytes, p_8 is 9 bytes, p_9 is 10 bytes, p_10 is 1 byte, and p_11 is 0 bytes from k, and p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8, p_9 and p_10 all lie, relative to k, in the direction opposite to the data stream segmentation point search. k_a is a data stream segmentation point, and the search direction shown in Figure 7 is from left to right. After the minimum data block of 4 KB is skipped from k_a, the end position of that 4 KB block is taken as the next potential segmentation point k_i, and points p_ix are determined for k_i.
In this embodiment, according to the rule preset on the deduplication server 103, x takes the consecutive natural numbers 1 to 11. In the embodiment shown in FIG. 7, according to the preset rule, 11 points are determined for the potential split point k_i, namely p_i1, p_i2, p_i3, p_i4, p_i5, p_i6, p_i7, p_i8, p_i9, p_i10 and p_i11, and the windows corresponding to these points are W_ix[p_ix - 169, p_ix] for x = 1 to 11, that is, W_i1[p_i1 - 169, p_i1], W_i2[p_i2 - 169, p_i2], and so on through W_i11[p_i11 - 169, p_i11]. The distance between point p_ix and the potential split point k_i is d_ix bytes. Specifically, p_i1 is 2 bytes from k_i, p_i2 is 3 bytes, p_i3 is 4 bytes, p_i4 is 5 bytes, p_i5 is 6 bytes, p_i6 is 7 bytes, p_i7 is 8 bytes, p_i8 is 9 bytes, p_i9 is 10 bytes, p_i10 is 1 byte, and p_i11 is 0 bytes from k_i, and p_i1, p_i2, p_i3, p_i4, p_i5, p_i6, p_i7, p_i8, p_i9 and p_i10 all lie, relative to the potential split point k_i, in the direction opposite to the split point search direction.
It is then judged, for each x from 1 to 11, whether at least part of the data in the window W_ix[p_ix - 169, p_ix] satisfies the corresponding predetermined condition C_x: whether at least part of the data in W_i1[p_i1 - 169, p_i1] satisfies C_1, whether at least part of the data in W_i2[p_i2 - 169, p_i2] satisfies C_2, and so on through whether at least part of the data in W_i11[p_i11 - 169, p_i11] satisfies C_11.
When it is judged that at least part of the data in every one of the 11 windows satisfies its corresponding predetermined condition, that is, window W_i1 satisfies C_1, window W_i2 satisfies C_2, and so on through window W_i11 satisfying C_11, the current potential split point k_i is a data stream split point. When at least part of the data in any one of the 11 windows does not satisfy the corresponding predetermined condition, the handling is described using the example shown in FIG. 8, in which at least part of the data in W_i3[p_i3 - 169, p_i3] does not satisfy the predetermined condition C_3 and a jump of 11 bytes is made from point p_i3 along the split point search direction. As shown in FIG. 8, when it is judged that W_3 does not satisfy the predetermined condition, N bytes are skipped along the split point search direction with p_3 as the starting point, where N is not greater than ‖B_3‖ + max_x(‖A_x‖ + ‖(k_i - p_ix)‖). In the embodiment shown in FIG. 6, N bytes are skipped, where N is specifically not greater than 179 bytes; in this embodiment, N = 11. The next potential split point is obtained at the end position of the 11th byte.
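As a quick check of the jump bound quoted above, the following Python sketch evaluates ‖B_3‖ + max_x(‖A_x‖ + ‖(k_i - p_ix)‖) for the FIG. 7 parameters. The names `A`, `B`, `D` and `max_jump` are illustrative and do not appear in the patent; the arithmetic only restates values given in the text.

```python
# Parameters of the FIG. 7 rule as stated in the text.
A = [169] * 11                           # A_1 .. A_11
B = [0] * 11                             # B_1 .. B_11
D = [2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 0]   # |k_i - p_ix| for x = 1 .. 11


def max_jump(failed_x: int) -> int:
    """Upper bound on N when window W_x (1-based index) fails its condition."""
    return abs(B[failed_x - 1]) + max(a + d for a, d in zip(A, D))


bound = max_jump(3)   # W_i3 is the failing window in the FIG. 8 example
N = 11                # jump length chosen in this embodiment
print(bound, N <= bound)
```

With these values the bound is 0 + (169 + 10) = 179 bytes, so the chosen N = 11 lies within it.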
To distinguish it from the potential split point k_i, the new potential split point is denoted k_j. According to the rule preset on the deduplication server 103, 11 points are determined for the potential split point k_j, namely p_j1, p_j2, p_j3, p_j4, p_j5, p_j6, p_j7, p_j8, p_j9, p_j10 and p_j11, and the windows corresponding to these points are determined to be W_jx[p_jx - 169, p_jx] for x = 1 to 11, that is, W_j1[p_j1 - 169, p_j1], W_j2[p_j2 - 169, p_j2], and so on through W_j11[p_j11 - 169, p_j11]. The distance between p_jx and the potential split point k_j is d_x bytes. Specifically, p_j1 is 2 bytes from k_j, p_j2 is 3 bytes, p_j3 is 4 bytes, p_j4 is 5 bytes, p_j5 is 6 bytes, p_j6 is 7 bytes, p_j7 is 8 bytes, p_j8 is 9 bytes, p_j9 is 10 bytes, p_j10 is 1 byte, and p_j11 is 0 bytes from k_j, and p_j1, p_j2, p_j3, p_j4, p_j5, p_j6, p_j7, p_j8, p_j9 and p_j10 all lie, relative to the potential split point k_j, in the direction opposite to the split point search direction.
It is then judged, for each x from 1 to 11, whether at least part of the data in the window W_jx[p_jx - 169, p_jx] satisfies the corresponding predetermined condition C_x: whether at least part of the data in W_j1[p_j1 - 169, p_j1] satisfies C_1, whether at least part of the data in W_j2[p_j2 - 169, p_j2] satisfies C_2, and so on through whether at least part of the data in W_j11[p_j11 - 169, p_j11] satisfies C_11. Of course, in this embodiment of the present invention, the same principle is followed when judging whether the potential split point k_a is a data stream split point; the specific implementation is not described again, and reference may be made to the description of judging the potential split point k_i.
When it is judged that at least part of the data in every one of the 11 windows W_j1 through W_j11 satisfies its corresponding predetermined condition C_1 through C_11, the current potential split point k_j is a data stream split point, and the data between k_j and k_a constitutes one data block. The minimum chunk size of 4 KB is then skipped in the same way as for k_a to obtain the next potential split point, and whether that next potential split point is a data stream split point is judged according to the rule preset on the deduplication server 103. When it is judged that the potential split point k_j is not a data stream split point, 11 bytes are skipped in the same way as for k_i to obtain the next potential split point, and whether that next potential split point is a data stream split point is judged according to the rule preset on the deduplication server 103 and the method described above. When no data stream split point has been found by the time the set maximum data block size is exceeded, the end position of the maximum data block is used as a forced split point.
Of course, the implementation of this method is limited by the maximum data block length and the size of the files constituting the data stream, so details will not be repeated here.
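The search procedure of the FIG. 7 embodiment can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the predetermined condition is replaced by a hypothetical CRC-32 parity test (chosen only because it passes with probability of roughly 1/2), `MAX_CHUNK` is a made-up value because the text only mentions "a set maximum data block", windows are treated as the 169 bytes ending at p_x, and the windows are checked in index order x = 1 to 11, which the text does not prescribe.

```python
import random
import zlib

MIN_CHUNK = 4096            # minimum data block skipped after a split point
MAX_CHUNK = 64 * 1024       # hypothetical maximum data block size
WINDOW_BACK = 169           # A_x = 169, B_x = 0 for every window
DISTANCES = [2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 0]  # p_x lies d_x bytes behind k
JUMP = 11                   # N in the FIG. 8 example


def window_ok(data: bytes, p: int) -> bool:
    """Hypothetical stand-in for condition C_x on window [p - 169, p]."""
    return (zlib.crc32(data[p - WINDOW_BACK:p]) & 1) == 1


def next_split(data: bytes, k_a: int) -> int:
    """Return the split point that follows the split point k_a."""
    limit = min(k_a + MAX_CHUNK, len(data))
    k = k_a + MIN_CHUNK                  # first potential split point k_i
    while k < limit:
        for d in DISTANCES:
            p = k - d
            if not window_ok(data, p):
                k = p + JUMP             # jump N bytes from the failing point
                break
        else:
            return k                     # all 11 windows passed
    return limit                         # forced split point (or end of data)


def split_points(data: bytes) -> list:
    points, k = [], 0
    while k < len(data):
        k = next_split(data, k)
        points.append(k)
    return points


demo = bytes(random.Random(0).getrandbits(8) for _ in range(200_000))
print(len(split_points(demo)), "split points found")
```

Because every p_x is at most 10 bytes behind k and the jump is 11 bytes, each failed check moves the potential split point forward by at least one byte, so the loop always terminates.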
On the basis of the data stream split point search shown in FIG. 3, in the embodiment shown in FIG. 9, a rule is preset on the deduplication server 103. The rule is: for a potential split point k, determine 11 points p_x, the window W_x[p_x - A_x, p_x + B_x] corresponding to each point p_x, and the predetermined condition C_x corresponding to each window W_x[p_x - A_x, p_x + B_x], where A_1 = A_2 = ... = A_11 = 169, B_1 = B_2 = ... = B_11 = 0, and C_1 = C_2 = ... = C_11. The distance between p_x and the potential split point k is d_x bytes. Specifically, p_1 is 3 bytes from k, p_2 is 2 bytes, p_3 is 1 byte, p_4 is 0 bytes, p_5 is 1 byte, p_6 is 2 bytes, p_7 is 3 bytes, p_8 is 4 bytes, p_9 is 5 bytes, p_10 is 6 bytes, and p_11 is 7 bytes from k. Relative to the potential split point k, p_5, p_6, p_7, p_8, p_9, p_10 and p_11 all lie in the direction opposite to the split point search direction, and p_1, p_2 and p_3 all lie in the split point search direction. k_a is a data stream split point, and the split point search direction shown in FIG. 9 is from left to right. After the minimum data block of 4 KB is skipped from the data stream split point k_a, the end position of that 4 KB minimum data block is taken as the next potential split point k_i, and points p_ix are determined for the potential split point k_i. In this embodiment, according to the rule preset on the deduplication server 103, x takes the consecutive natural numbers 1 to 11. In the embodiment shown in FIG. 9, 11 points are determined for the potential split point k_i, namely p_i1, p_i2, p_i3, p_i4, p_i5, p_i6, p_i7, p_i8, p_i9, p_i10 and p_i11, and the windows corresponding to these points are W_ix[p_ix - 169, p_ix] for x = 1 to 11, that is, W_i1[p_i1 - 169, p_i1], W_i2[p_i2 - 169, p_i2], and so on through W_i11[p_i11 - 169, p_i11]. The distance between p_ix and the potential split point k_i is d_x bytes. Specifically, p_i1 is 3 bytes from k_i, p_i2 is 2 bytes, p_i3 is 1 byte, p_i4 is 0 bytes, p_i5 is 1 byte, p_i6 is 2 bytes, p_i7 is 3 bytes, p_i8 is 4 bytes, p_i9 is 5 bytes, p_i10 is 6 bytes, and p_i11 is 7 bytes from k_i. Relative to the potential split point k_i, p_i5, p_i6, p_i7, p_i8, p_i9, p_i10 and p_i11 all lie in the direction opposite to the split point search direction, and p_i1, p_i2 and p_i3 all lie in the split point search direction.
It is then judged, for each x from 1 to 11, whether at least part of the data in the window W_ix[p_ix - 169, p_ix] satisfies the corresponding predetermined condition C_x: whether at least part of the data in W_i1[p_i1 - 169, p_i1] satisfies C_1, whether at least part of the data in W_i2[p_i2 - 169, p_i2] satisfies C_2, and so on through whether at least part of the data in W_i11[p_i11 - 169, p_i11] satisfies C_11.
When it is judged that at least part of the data in every one of the 11 windows satisfies its corresponding predetermined condition, that is, window W_i1 satisfies C_1, window W_i2 satisfies C_2, and so on through window W_i11 satisfying C_11, the current potential split point k_i is a data stream split point. When at least part of the data in any one of the 11 windows does not satisfy the corresponding predetermined condition, for example, as shown in FIG. 10, at least part of the data in W_i7[p_i7 - 169, p_i7] does not satisfy the corresponding predetermined condition, N bytes are skipped from point p_i7 along the split point search direction, where N is not greater than ‖B_4‖ + max_x(‖A_x‖ + ‖(k_i - p_ix)‖). In the embodiment shown in FIG. 10, N bytes are skipped, where N is specifically not greater than 179 bytes; in this embodiment, N = 8 is taken, which yields a new potential split point. To distinguish it from the potential split point k_i, the new potential split point is denoted k_j. According to the rule preset on the deduplication server 103 in the embodiment shown in FIG. 9, 11 points are determined for the potential split point k_j, namely p_j1, p_j2, p_j3, p_j4, p_j5, p_j6, p_j7, p_j8, p_j9, p_j10 and p_j11, and the windows corresponding to these points are determined to be W_jx[p_jx - 169, p_jx] for x = 1 to 11, that is, W_j1[p_j1 - 169, p_j1], W_j2[p_j2 - 169, p_j2], and so on through W_j11[p_j11 - 169, p_j11]. The distance between p_jx and the potential split point k_j is d_x bytes. Specifically, p_j1 is 3 bytes from k_j, p_j2 is 2 bytes, p_j3 is 1 byte, p_j4 is 0 bytes, p_j5 is 1 byte, p_j6 is 2 bytes, p_j7 is 3 bytes, p_j8 is 4 bytes, p_j9 is 5 bytes, p_j10 is 6 bytes, and p_j11 is 7 bytes from k_j. Relative to the potential split point k_j, p_j5, p_j6, p_j7, p_j8, p_j9, p_j10 and p_j11 all lie in the direction opposite to the split point search direction, and p_j1, p_j2 and p_j3 all lie in the split point search direction.
It is then judged, for each x from 1 to 11, whether at least part of the data in the window W_jx[p_jx - 169, p_jx] satisfies the corresponding predetermined condition C_x: whether at least part of the data in W_j1[p_j1 - 169, p_j1] satisfies C_1, whether at least part of the data in W_j2[p_j2 - 169, p_j2] satisfies C_2, and so on through whether at least part of the data in W_j11[p_j11 - 169, p_j11] satisfies C_11. Of course, in this embodiment of the present invention, the same principle is followed when judging whether the potential split point k_a is a data stream split point; the specific implementation is not described again, and reference may be made to the description of judging the potential split point k_i.
When it is judged that at least part of the data in every one of the 11 windows W_j1 through W_j11 satisfies its corresponding predetermined condition C_1 through C_11, the current potential split point k_j is a data stream split point, and the data between k_j and k_a constitutes one data block. The minimum chunk size of 4 KB is then skipped in the same way as for k_a to obtain the next potential split point, and whether that next potential split point is a data stream split point is judged according to the rule preset on the deduplication server 103. When it is judged that the potential split point k_j is not a data stream split point, 8 bytes are skipped in the same way as for k_i to obtain the next potential split point, and whether that next potential split point is a data stream split point is judged according to the rule preset on the deduplication server 103 and the method described above. When no data stream split point has been found by the time the set maximum data block size is exceeded, the end position of the maximum data block is used as a forced split point.
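The point layout of the FIG. 9 embodiment differs from that of FIG. 7 in that some points lie ahead of the potential split point. The Python sketch below illustrates that layout and the jump from a failing point; the signed-offset representation and the names `OFFSETS`, `points_for`, `window_for` and `next_candidate` are assumptions made for illustration, and positions are treated as byte offsets into the stream.

```python
WINDOW_BACK = 169   # A_x = 169, B_x = 0 for every window
N = 8               # jump length chosen in the FIG. 10 example

# Signed offset of p_1 .. p_11 relative to k.
# Positive values lie in the search direction, negative values opposite to it.
OFFSETS = [3, 2, 1, 0, -1, -2, -3, -4, -5, -6, -7]


def points_for(k: int) -> list:
    """Positions of p_1 .. p_11 for the potential split point k."""
    return [k + off for off in OFFSETS]


def window_for(p: int) -> tuple:
    """Bounds (start, end) of the window W[p - 169, p]."""
    return (p - WINDOW_BACK, p)


def next_candidate(k: int, failed_x: int) -> int:
    """New potential split point after window W_x (1-based) fails at k."""
    return points_for(k)[failed_x - 1] + N


k_i = 5000
print(points_for(k_i))
print(window_for(points_for(k_i)[6]))    # window of p_i7
print(next_candidate(k_i, failed_x=7))   # k_j when W_i7 fails
```

In this layout a failure of W_i7 moves the potential split point forward by 5 bytes (p_i7 = k_i - 3, plus the 8-byte jump), and a failure of any other window also moves it forward, since the farthest point behind k_i is only 7 bytes away.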
On the basis of the data stream split point search shown in FIG. 3, in the embodiment shown in FIG. 11, a rule is preset on the deduplication server 103. The rule is: for a potential split point k, determine 11 points p_x, the window W_x[p_x - A_x, p_x + B_x] corresponding to each point p_x, and the predetermined condition C_x corresponding to each window W_x[p_x - A_x, p_x + B_x], where A_1 = A_2 = ... = A_10 = 169, A_11 = 182, B_1 = B_2 = ... = B_11 = 0, and C_1 = C_2 = ... = C_10 ≠ C_11.
The distance between p_x and the potential split point k is d_x bytes. Specifically, p_1 is 0 bytes from k, p_2 is 1 byte, p_3 is 2 bytes, p_4 is 3 bytes, p_5 is 4 bytes, p_6 is 5 bytes, p_7 is 6 bytes, p_8 is 7 bytes, p_9 is 8 bytes, p_10 is 1 byte, and p_11 is 3 bytes from k. Relative to the potential split point k, p_2, p_3, p_4, p_5, p_6, p_7, p_8 and p_9 all lie in the direction opposite to the split point search direction, and p_10 and p_11 both lie in the split point search direction. k_a is a data stream split point, and the split point search direction shown in FIG. 11 is from left to right. After the minimum data block of 4 KB is skipped from the data stream split point k_a, the end position of that 4 KB minimum data block is taken as the next potential split point k_i, and points p_ix are determined for the potential split point k_i. In this embodiment, according to the rule preset on the deduplication server 103, x takes the consecutive natural numbers 1 to 11. In the embodiment shown in FIG. 11, 11 points are determined for the potential split point k_i, namely p_i1, p_i2, p_i3, p_i4, p_i5, p_i6, p_i7, p_i8, p_i9, p_i10 and p_i11, and the windows corresponding to these points are W_ix[p_ix - 169, p_ix] for x = 1 to 10, that is, W_i1[p_i1 - 169, p_i1] through W_i10[p_i10 - 169, p_i10], together with W_i11[p_i11 - 182, p_i11].
Here, the distance between p_ix and the potential split point k_i is d_x bytes. Specifically, p_i1 is 0 bytes from k_i, p_i2 is 1 byte, p_i3 is 2 bytes, p_i4 is 3 bytes, p_i5 is 4 bytes, p_i6 is 5 bytes, p_i7 is 6 bytes, p_i8 is 7 bytes, p_i9 is 8 bytes, p_i10 is 1 byte and p_i11 is 3 bytes from k_i. Relative to the potential split point k_i, points p_i2, p_i3, p_i4, p_i5, p_i6, p_i7, p_i8 and p_i9 all lie in the direction opposite to the data stream split point search direction, while p_i10 and p_i11 lie in the search direction. It is then judged, for each x from 1 to 10, whether at least part of the data in window W_ix[p_ix-169, p_ix] satisfies the predetermined condition C_x, and whether at least part of the data in W_i11[p_i11-182, p_i11] satisfies the predetermined condition C_11. When at least part of the data in every one of the windows W_i1, W_i2, W_i3, W_i4, W_i5, W_i6, W_i7, W_i8, W_i9, W_i10 and W_i11 satisfies its corresponding predetermined condition C_1 through C_11, the current potential split point k_i is a data stream split point. When at least part of the data in window W_i11 does not satisfy the predetermined condition C_11, 1 byte is jumped from the potential split point k_i along the data stream split point search direction to obtain a new potential split point, denoted k_j here to distinguish it from k_i.
When at least part of the data in any one of the 10 windows W_i1, W_i2, W_i3, W_i4, W_i5, W_i6, W_i7, W_i8, W_i9 and W_i10 does not satisfy its corresponding predetermined condition, for example W_i4[p_i4-169, p_i4] as shown in Figure 12, N bytes are jumped from point p_i4 along the data stream split point search direction, where N is not greater than ‖B_4‖ + max_x(‖A_x‖ + ‖k_i − p_ix‖). In the embodiment shown in Figure 12, N is specifically not greater than 179, and N = 9 is used in this embodiment. The jump yields a new potential split point, denoted k_j here to distinguish it from k_i. According to the rule preset on the deduplication server 103 in the embodiment shown in Figure 11, 11 points are determined for the potential split point k_j, namely p_j1, p_j2, p_j3, p_j4, p_j5, p_j6, p_j7, p_j8, p_j9, p_j10 and p_j11, and the windows corresponding to these points are, respectively, W_j1[p_j1-169, p_j1], W_j2[p_j2-169, p_j2], W_j3[p_j3-169, p_j3], W_j4[p_j4-169, p_j4], W_j5[p_j5-169, p_j5], W_j6[p_j6-169, p_j6], W_j7[p_j7-169, p_j7], W_j8[p_j8-169, p_j8], W_j9[p_j9-169, p_j9], W_j10[p_j10-169, p_j10] and W_j11[p_j11-182, p_j11].
Here, the distance between p_jx and the potential split point k_j is d_x bytes. Specifically, p_j1 is 0 bytes from k_j, p_j2 is 1 byte, p_j3 is 2 bytes, p_j4 is 3 bytes, p_j5 is 4 bytes, p_j6 is 5 bytes, p_j7 is 6 bytes, p_j8 is 7 bytes, p_j9 is 8 bytes, p_j10 is 1 byte and p_j11 is 3 bytes from k_j. Relative to the potential split point k_j, points p_j2, p_j3, p_j4, p_j5, p_j6, p_j7, p_j8 and p_j9 all lie in the direction opposite to the data stream split point search direction, while p_j10 and p_j11 lie in the search direction. It is then judged, for each x from 1 to 10, whether at least part of the data in window W_jx[p_jx-169, p_jx] satisfies the predetermined condition C_x, and whether at least part of the data in W_j11[p_j11-182, p_j11] satisfies the predetermined condition C_11. Of course, in the embodiment of the present invention, the same principle is followed when judging whether the potential split point k_a is a data stream split point; the specific implementation is not described again, and reference may be made to the description of judging the potential split point k_i. When at least part of the data in every one of the windows W_j1, W_j2, W_j3, W_j4, W_j5, W_j6, W_j7, W_j8, W_j9, W_j10 and W_j11 satisfies its corresponding predetermined condition C_1 through C_11, the current potential split point k_j is a data stream split point, and the data between k_j and k_a constitutes one data block. The minimum block size of 4 KB is then skipped in the same way as for k_a to obtain the next potential split point, and whether that point is a data stream split point is judged according to the rule preset on the deduplication server 103.
When the potential split point k_j is judged not to be a data stream split point, the next potential split point is obtained in the same manner as for k_i, and whether that next potential split point is a data stream split point is judged according to the rule preset on the deduplication server 103 and the method described above. When no data stream split point has been found by the time the set maximum data block size is exceeded, the end position of the maximum data block is used as a forced split point.
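The search loop of the Figure 11 embodiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the point offsets, window sizes, the 4 KB minimum block and N = 9 come from the description above, while the function names, the `window_ok` predicate (a stand-in for the predetermined conditions C_1 to C_11), the order in which windows are tested, and the 16 KB maximum block size are hypothetical.

```python
from typing import Callable, List, Optional

# Parameters taken from the Figure 11 description.
# Signed distance of p_x from the candidate k: negative means opposite to the
# search direction (left), positive means along it (right).
OFFSETS = [0, -1, -2, -3, -4, -5, -6, -7, -8, 1, 3]
A = [169] * 10 + [182]      # window extent before p_x
B = [0] * 11                # window extent after p_x
MIN_CHUNK = 4 * 1024        # minimum data block size
N_JUMP = 9                  # N used in the Figure 12 example

# Hypothetical predicate: (data, window_start, window_end, x) -> bool,
# with inclusive window bounds and x in 1..11.
WindowOk = Callable[[bytes, int, int, int], bool]


def first_failed_window(data: bytes, k: int, window_ok: WindowOk) -> Optional[int]:
    """Return the 0-based index of the first failing window, or None."""
    for x in range(11):     # assumption: windows are tested in order 1..11
        p = k + OFFSETS[x]
        if not window_ok(data, p - A[x], p + B[x], x + 1):
            return x
    return None


def find_split_points(data: bytes, window_ok: WindowOk,
                      max_chunk: int = 16 * 1024) -> List[int]:
    """Return split positions; max_chunk is an assumed value."""
    splits: List[int] = []
    start, n = 0, len(data)
    reach = max(OFFSETS)    # p_11 = k + 3 is the furthest byte examined
    while True:
        k = start + MIN_CHUNK
        limit = start + max_chunk
        split = None
        while k <= limit and k + reach < n:
            failed = first_failed_window(data, k, window_ok)
            if failed is None:
                split = k                       # all 11 conditions hold
                break
            if failed == 10:
                k += 1                          # W_11 failed: advance 1 byte from k
            else:
                k += OFFSETS[failed] + N_JUMP   # jump N bytes from the failing p_x
        if split is None:
            if k > limit and limit < n:
                split = limit                   # forced split at the maximum block
            else:
                return splits                   # data exhausted; the tail is the last block
        splits.append(split)
        start = split
```

Because every jump advances the candidate by at least 1 byte (the leftmost point p_9 is 8 bytes behind k and N = 9), the inner loop always makes progress. A real predicate would evaluate the window contents; the sketch leaves that to the caller.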
在图3所示的数据流分割点查找的基础上,在图13所示的实施方式中,在去重服务器103上预设有规则为:为潜在分割点k确定11个点 px、点px对应的窗口Wx[px-Ax,px+Bx]和窗口Wx[px-Ax,px+Bx]对应的预定条件Cx,x分别为1到11连续的自然数,其中,点px对应的窗口Wx[px-Ax,px+Bx]中至少部分数据满足预定条件的概率为1/2,并且A1=A2=A3=A4=A5=A6=A7=A8=A9=A10=A11=169,B1=B2=B3=B4=B5=B6=B7=B8=B9=B10=B11=0,并且C1=C2=C3=C4=C5=C6=C7=C8=C9=C10=C11,其中,px与潜在分割点k之间距离dx个字节,具体的,p1与潜在分割点k之间距离0个字节,p2与k之间距离2个字节,p3与k之间距离4个字节,p4与k之间距离6个字节,p5与k之间距离8个字节,p6与k之间距离10个字节,p7与k之间距离12个字节,p8与k之间距离14个字节,p9与k之间距离16个字节,p10与k之间距离18个字节,p11与k之间距离20个字节,并且p2、p3、p4、p5、p6、p7、p8、p9、p10和p11相对于潜在分割点k均位于数据流分割点查找反方向。ka为数据流分割点,图13中所示数据流分割点查找方向为从左向右,从数据流分割点ka跳过最小数据块4KB后,在最小数据块4KB结束位置作为下一个潜在分割点ki,为潜在分割点ki确定点pix,在本实施例中,根据在去重服务器103上预设的规则,x分别为1到11连续的自然数。在图13所示的实施方式中,依据预定规则,为潜在分割点ki确定的点为11个,分别为pi1、pi2、pi3、pi4、pi5、pi6、pi7、pi8、pi9、pi10和pi11,点pi1、pi2、pi3、pi4、pi5、pi6、pi7、pi8、pi9、pi10和pi11对应的窗口分别为Wi1[pi1-169,pi1]、Wi2[pi2-169,pi2]、Wi3[pi3-169,pi3]、Wi4[pi4-169,pi4]、Wi5[pi5-169,pi5]、Wi6[pi6-169,pi6]、Wi7[pi7-169,pi7]、Wi8[pi8-169,pi8]、Wi9[pi9-169,pi9]、Wi10[pi10-169,pi10]和Wi11[pi11-169,pi11]。其中,pix与潜在分割点ki之间距离dx个字节,具体的,pi1与ki间距0个字节、pi2与ki间距2个字节、pi3与ki间距4个字节、pi4与ki间距6个字节、pi5与ki间距8个字节、pi6与ki间 
距10个字节、pi7与ki间距12个字节、pi8与ki间距14个字节、pi9与ki间距16个字节、pi10与ki间距18个字节,pi11与ki间距20个字节,并且pi2、pi3、pi4、pi5、pi6、pi7、pi8、pi9、pi10和pi11相对于潜在分割点ki均位于数据流分割点查找反方向。判断Wi1[pi1-169,pi1]中至少部分数据是否满足预定条件C1、判断Wi2[pi2-169,pi2]中至少部分数据是否满足预定条件C2、判断Wi3[pi3-169,pi3]中至少部分数据是否满足预定条件C3、判断Wi4[pi4-169,pi4]中至少部分数据是否满足预定条件C4、判断Wi5[pi5-169,pi5]中至少部分数据是否满足预定条件C5、判断Wi6[pi6-169,pi6]中至少部分数据是否满足预定条件C6、判断Wi7[pi7-169,pi7]中至少部分数据是否满足预定条件C7、判断Wi8[pi8-169,pi8]中至少部分数据是否满足预定条件C8、判断Wi9[pi9-169,pi9]中至少部分数据是否满足预定条件C9、判断Wi10[pi10-169,pi10]中至少部分数据是否满足预定条件C10和判断Wi11[pi11-169,pi11]中至少部分数据是否满足预定条件C11。当判断窗口Wi1中至少部分数据满足预定条件C1、窗口Wi2中至少部分数据满足预定条件C2、窗口Wi3中至少部分数据满足预定条件C3、窗口Wi4中至少部分数据满足预定条件C4、窗口Wi5中至少部分数据满足预定条件C5、窗口Wi6中至少部分数据满足预定条件C6、窗口Wi7中至少部分数据满足预定条件C7、窗口Wi8中至少部分数据满足预定条件C8、窗口Wi9中至少部分数据满足预定条件C9、窗口Wi10中至少部分数据满足预定条件C10和窗口Wi11中至少部分数据满足预定条件C11时,则当前潜在分割点ki为数据流分割点。当11个窗口中任一个窗口中至少部分数据不满足对应的预定条件时,如图14所示,Wi4[pi4-169,pi4]中至少部分数据不满足预定条件C4,则选择下一个潜在分割点,为与潜在分割点ki区别,这里表示为kj,kj位于ki右边,并且kj与ki间距1 个字节。如图14所示,依据在去重服务器103上预设的规则,为潜在分割点kj确定11个点,分别为pj1、pj2、pj3、pj4、pj5、pj6、pj7、pj8、pj9、pj10和pj11,确定点pj1、pj2、pj3、pj4、pj5、pj6、pj7、pj8、pj9、pj10和pj11对应的窗口分别为Wj1[pj1-169,pj1]、Wj2[pj2-169,pj2]、Wj3[pj3-169,pj3]、Wj4[pj4-169,pj4]、Wj5[pj5-169,pj5]、Wj6[pj6-169,pj6]、Wj7[pj7-169,pj7]、Wj8[pj8-169,pj8]、Wj9[pj9-169,pj9]、Wj10[pj10-169,pj10]和Wj11[pj11-169,pj11],其中,A1=A2=A3=A4=A5=A6=A7=A8=A9=A10=A11=169,B1=B2=B3=B4=B5=B6=B7=B8=B9=B10=B11=0,并且C1=C2=C3=C4=C5=C6=C7=C8=C9=C10=C11。其中,pjx与潜在分割点kj之间距离dx个字节,具体的,pj1与kj间距0个字节、pj2与kj间距2个字节、pj3与kj间距4个字节、pj4与kj间距6个字节、pj5与kj间距8个字节、pj6与kj间距10个字节、pj7与kj间距12个字节、pj8与kj间距14个字节、pj9与kj间距16个字节、pj10与kj间距18个字节,pj11与kj间距20个字节,并且pj2、pj3、pj4、pj5、pj6、pj7、pj8、pj9、pj10和pj11相对于潜在分割点kj均位于数据流分割点查找反方向。判断Wj1[pj1-169,pj1]中至少部分数据是否满足预定条件C1、判断Wj2[pj2-169,pj2]中至少部分数据是否满足预定条件C2、判断Wj3[pj3-169,pj3]中至少部分数据是否满足预定条件C3、判断Wj4[pj4-169,pj4]中至少部分数据是否满足预定条件C4、判断Wj5[pj5-169,pj5]中至少部分数据是否满足预定条件C5、判断Wj6[pj6-169,pj6]中至少部分数据是否满足预定条件C6、判断Wj7[pj7-169,pj7]中至少部分数据是否满足预定条件C7、判断Wj8[pj8-169,pj8]中至少部分数据是否满足预定条件C8、判断Wj9[pj9-169,pj9]中至少部分数据是否满足预定条件C9、判断Wj10[pj10-169,pj10]中至少部分数据是否满足 
预定条件C10和判断Wj11[pj11-169,pj11]中至少部分数据是否满足预定条件C11。当判断窗口Wj1中至少部分数据满足预定条件C1、窗口Wj2中至少部分数据满足预定条件C2、窗口Wj3中至少部分数据满足预定条件C3、窗口Wj4中至少部分数据满足预定条件C4、窗口Wj5中至少部分数据满足预定条件C5、窗口Wj6中至少部分数据满足预定条件C6、窗口Wj7中至少部分数据满足预定条件C7、窗口Wj8中至少部分数据满足预定条件C8、窗口Wi9中至少部分数据满足预定条件C9、窗口Wj10中至少部分数据满足预定条件C10和窗口Wj11中至少部分数据满足预定条件C11时,则当前潜在分割点kj为数据流分割点。当判断窗口Wj1、Wj2、Wj3、Wj4、Wj5、Wj6、Wj7、Wj8、Wj9、Wj10和Wj11中任一个窗口中至少部分数据不满足预定条件时,如图15所示,Wj3[pj3-169,pj3]中至少部分数据不满足预定条件C3时,点pi4相对于数据流分割点查找方向位于点pj3左边,从点pi4沿着数据流分割点查找方向跳跃21个字节,获得下一个潜在分割点,为与潜在分割点ki、kj相区别,表示为kl。根据图13所实施方式中在去重服务器103上预设的规则,为潜在分割点kl确定的点为11个,分别为pl1、pl2、pl3、pl4、pl5、pl6、pl7、pl8、pl9、pl10和pl11,点pl1、pl2、pl3、pl4、pl5、pl6、pl7、pl8、pl9、pl10和pl11对应的窗口分别为Wl1[pl1-169,pl1]、Wl2[pl2-169,pl2]、Wl3[pl3-169,pl3]、Wl4[pl4-169,pl4]、Wl5[pl5-169,pl5]、Wl6[pl6-169,pl6]、Wl7[pl7-169,pl7]、Wl8[pl8-169,pl8]、Wl9[pl9-169,pl9]、Wl10[pl10-169,pl10]和Wl11[pl11-169,pl11],其中,plx与潜在分割点kl之间距离dx个字节,具体的,pl1与潜在分割点kl之间距离0个字节,pl2与kl之间距离2个字节,pl3与kl之间距离4个字节,pl4与kl之间距离6个字节,pl5与kl之间距离8个字节,pl6与kl之间距离10个字节,pl7与kl之间距离12个字 节,pl8与kl之间距离14个字节,pl9与kl之间距离16个字节,pl10与kl之间距离18个字节,pl11与kl之间距离20个字节,并且pl2、pl3、pl4、pl5、pl6、pl7、pl8、pl9、pl10和pl11相对于潜在分割点kl均位于数据流分割点查找反方向。判断Wl1[pl1-169,pl1]中至少部分数据是否满足预定条件C1、判断Wl2[pl2-169,pl2]中至少部分数据是否满足预定条件C2、判断Wl3[pl3-169,pl3]中至少部分数据是否满足预定条件C3、判断Wl4[pl4-169,pl4]中至少部分数据是否满足预定条件C4、判断Wl5[pl5-169,pl5]中至少部分数据是否满足预定条件C5、判断Wl6[pl6-169,pl6]中至少部分数据是否满足预定条件C6、判断Wl7[pl7-169,pl7]中至少部分数据是否满足预定条件C7、判断Wl8[pl8-169,pl8]中至少部分数据是否满足预定条件C8、判断Wl9[pl9-169,pl9]中至少部分数据是否满足预定条件C9、判断Wl10[pl10-169,pl10]中至少部分数据是否满足预定条件C10和判断Wl11[pl11-169,pl11]中至少部分数据是否满足预定条件C11。当判断窗口Wl1中至少部分数据满足预定条件C1、窗口Wl2中至少部分数据满足预定条件C2、窗口Wl3中至少部分数据满足预定条件C3、窗口Wl4中至少部分数据满足预定条件C4、窗口Wl5中至少部分数据满足预定条件C5、窗口Wl6中至少部分数据满足预定条件C6、窗口Wl7中至少部分数据满足预定条件C7、窗口Wl8中至少部分数据满足预定条件C8、窗口Wl9中至少部分数据满足预定条件C9、窗口Wl10中至少部分数据满足预定条件C10和窗口Wl11中至少部分数据满足预定条件C11时,则当前潜在分割点kl为数据流分割点。当窗口Wl1、Wl2、Wl3、Wl4、Wl5、Wl6、Wl7、Wl8、Wl9、Wl10和Wl11中任一窗口中至少部分数据不满足预定条件时,选择下一个潜在分割点,为与潜在分割点ki、kj和kl区别,表示为km,km位于kl右边,并且km与kl间距1个字节。根据图13所示实施 
例在去重服务器103上预设的规则,为潜在分割点km确定的点为11个,分别为pm1、pm2、pm3、pm4、pm5、pm6、pm7、pm8、pm9、pm10和pm11,点pm1、pm2、pm3、pm4、pm5、pm6、pm7、pm8、pm9、pm10和pm11对应的窗口分别为Wm1[pm1-169,pm1]、Wm2[pm2-169,pm2]、Wm3[pm3-169,pm3]、Wm4[pm4-169,pm4]、Wm5[pm5-169,pm5]、Wm6[pm6-169,pm6]、Wm7[pm7-169,pm7]、Wm8[pm8-169,pm8]、Wm9[pm9-169,pm9]、Wm10[pm10-169,pm10]和Wm11[pm11-169,pm11],其中,pmx与潜在分割点km之间距离dx个字节,具体的,pm1与潜在分割点km之间距离0个字节,pm2与km之间距离2个字节,pm3与km之间距离4个字节,pm4与km之间距离6个字节,pm5与km之间距离8个字节,pm6与km之间距离10个字节,pm7与km之间距离12个字节,pm8与km之间距离14个字节,pm9与km之间距离16个字节,pm10与km之间距离18个字节,pm11与km之间距离20个字节,并且pm2、pm3、pm4、pm5、pm6、pm7、pm8、pm9、pm10和pm11相对于潜在分割点km均位于数据流分割点查找反方向。判断Wm1[pm1-169,pm1]中至少部分数据是否满足预定条件C1、判断Wm2[pm2-169,pm2]中至少部分数据是否满足预定条件C2、判断Wm3[pm3-169,pm3]中至少部分数据是否满足预定条件C3、判断Wm4[pm4-169,pm4]中至少部分数据是否满足预定条件C4、判断Wm5[pm5-169,pm5]中至少部分数据是否满足预定条件C5、判断Wm6[pm6-169,pm6]中至少部分数据是否满足预定条件C6、判断Wm7[pm7-169,pm7]中至少部分数据是否满足预定条件C7、判断Wm8[pm8-169,pm8]中至少部分数据是否满足预定条件C8、判断Wm9[pm9-169,pm9]中至少部分数据是否满足预定条件C9、判断Wm10[pm10-169,pm10]中至少部分数据是否满足预定条件C10和判断Wm11[pm11-169,pm11]中至少部分数据是否满足预定条件C11。当判断窗口Wm1中至少部分数据满 足预定条件C1、窗口Wm2中至少部分数据满足预定条件C2、窗口Wm3中至少部分数据满足预定条件C3、窗口Wm4中至少部分数据满足预定条件C4、窗口Wm5中至少部分数据满足预定条件C5、窗口Wm6中至少部分数据满足预定条件C6、窗口Wm7中至少部分数据满足预定条件C7、窗口Wm8中至少部分数据满足预定条件C8、窗口Wm9中至少部分数据满足预定条件C9、窗口Wm10中至少部分数据满足预定条件C10和窗口Wm11中至少部分数据满足预定条件C11时,则当前潜在分割点km为数据流分割点。当任一个窗口中至少部分数据不满足预定条件时,则按照前面描述的方案执行跳跃,以获得下一个潜在分割点并判断是否为数据流分割点。On the basis of the data stream segmentation point search shown in FIG. 3 , in the embodiment shown in FIG. 
13, a rule is preset on the deduplication server 103: for a potential split point k, determine 11 points p_x, the window W_x[p_x-A_x, p_x+B_x] corresponding to each point p_x, and the predetermined condition C_x corresponding to each window W_x[p_x-A_x, p_x+B_x], where x takes the consecutive natural numbers from 1 to 11. The probability that at least part of the data in the window W_x[p_x-A_x, p_x+B_x] corresponding to point p_x satisfies the predetermined condition is 1/2, and A_1 = A_2 = A_3 = A_4 = A_5 = A_6 = A_7 = A_8 = A_9 = A_10 = A_11 = 169, B_1 = B_2 = B_3 = B_4 = B_5 = B_6 = B_7 = B_8 = B_9 = B_10 = B_11 = 0, and C_1 = C_2 = C_3 = C_4 = C_5 = C_6 = C_7 = C_8 = C_9 = C_10 = C_11. The distance between p_x and the potential split point k is d_x bytes. Specifically, p_1 is 0 bytes from k, p_2 is 2 bytes, p_3 is 4 bytes, p_4 is 6 bytes, p_5 is 8 bytes, p_6 is 10 bytes, p_7 is 12 bytes, p_8 is 14 bytes, p_9 is 16 bytes, p_10 is 18 bytes and p_11 is 20 bytes from k, and p_2, p_3, p_4, p_5, p_6, p_7, p_8, p_9, p_10 and p_11 all lie, relative to k, in the direction opposite to the data stream split point search direction. k_a is a data stream split point, and the search direction shown in Figure 13 is from left to right. After the minimum data block of 4 KB is skipped from the data stream split point k_a, the end position of that 4 KB minimum data block is taken as the next potential split point k_i, and points p_ix are determined for k_i. In this embodiment, according to the rule preset on the deduplication server 103, x takes the consecutive natural numbers from 1 to 11. In the embodiment shown in Figure 13, according to the preset rule, 11 points are determined for the potential split point k_i, namely p_i1, p_i2, p_i3, p_i4, p_i5, p_i6, p_i7, p_i8, p_i9, p_i10 and p_i11, and the windows corresponding to these points are, respectively, W_i1[p_i1-169, p_i1], W_i2[p_i2-169, p_i2], W_i3[p_i3-169, p_i3], W_i4[p_i4-169, p_i4], W_i5[p_i5-169, p_i5], W_i6[p_i6-169, p_i6], W_i7[p_i7-169, p_i7], W_i8[p_i8-169, p_i8], W_i9[p_i9-169, p_i9], W_i10[p_i10-169, p_i10] and W_i11[p_i11-169, p_i11]. Here, the distance between p_ix and the potential split point k_i is d_x bytes. Specifically, p_i1 is 0 bytes from k_i, p_i2 is 2 bytes, p_i3 is 4 bytes, p_i4 is 6 bytes, p_i5 is 8 bytes, p_i6 is 10 bytes, p_i7 is 12 bytes, p_i8 is 14 bytes, p_i9 is 16 bytes, p_i10 is 18 bytes and p_i11 is 20 bytes from k_i, and p_i2, p_i3, p_i4, p_i5, p_i6, p_i7, p_i8, p_i9, p_i10 and p_i11 all lie, relative to k_i, in the direction opposite to the data stream split point search direction.
It is then judged, for each x from 1 to 11, whether at least part of the data in window W_ix[p_ix-169, p_ix] satisfies the predetermined condition C_x. When at least part of the data in every one of the windows W_i1, W_i2, W_i3, W_i4, W_i5, W_i6, W_i7, W_i8, W_i9, W_i10 and W_i11 satisfies its corresponding predetermined condition C_1 through C_11, the current potential split point k_i is a data stream split point. When at least part of the data in any one of the 11 windows does not satisfy its corresponding predetermined condition, for example when, as shown in Figure 14, at least part of the data in W_i4[p_i4-169, p_i4] does not satisfy the predetermined condition C_4, the next potential split point is selected. To distinguish it from the potential split point k_i, it is denoted k_j here; k_j lies to the right of k_i, and the distance between k_j and k_i is 1 byte.
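The rule that all 11 windows must satisfy their conditions, combined with the stated per-window probability of 1/2, gives a rough estimate of how often a candidate position is accepted. The calculation below assumes the 11 window outcomes are statistically independent, which the text here does not state; it is an illustrative estimate only.

```latex
P(k \text{ is a split point})
  = \prod_{x=1}^{11} P\bigl(W_x \text{ satisfies } C_x\bigr)
  = \left(\tfrac{1}{2}\right)^{11}
  = \tfrac{1}{2048}
```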
As shown in Figure 14, according to the rule preset on the deduplication server 103, 11 points are determined for the potential split point k_j, namely p_j1, p_j2, p_j3, p_j4, p_j5, p_j6, p_j7, p_j8, p_j9, p_j10 and p_j11, and the windows corresponding to these points are, respectively, W_j1[p_j1-169, p_j1], W_j2[p_j2-169, p_j2], W_j3[p_j3-169, p_j3], W_j4[p_j4-169, p_j4], W_j5[p_j5-169, p_j5], W_j6[p_j6-169, p_j6], W_j7[p_j7-169, p_j7], W_j8[p_j8-169, p_j8], W_j9[p_j9-169, p_j9], W_j10[p_j10-169, p_j10] and W_j11[p_j11-169, p_j11], where A_1 = A_2 = A_3 = A_4 = A_5 = A_6 = A_7 = A_8 = A_9 = A_10 = A_11 = 169, B_1 = B_2 = B_3 = B_4 = B_5 = B_6 = B_7 = B_8 = B_9 = B_10 = B_11 = 0, and C_1 = C_2 = C_3 = C_4 = C_5 = C_6 = C_7 = C_8 = C_9 = C_10 = C_11. Here, the distance between p_jx and the potential split point k_j is d_x bytes. Specifically, p_j1 is 0 bytes from k_j, p_j2 is 2 bytes, p_j3 is 4 bytes, p_j4 is 6 bytes, p_j5 is 8 bytes, p_j6 is 10 bytes, p_j7 is 12 bytes, p_j8 is 14 bytes, p_j9 is 16 bytes, p_j10 is 18 bytes and p_j11 is 20 bytes from k_j, and p_j2, p_j3, p_j4, p_j5, p_j6, p_j7, p_j8, p_j9, p_j10 and p_j11 all lie, relative to k_j, in the direction opposite to the data stream split point search direction. It is then judged, for each x from 1 to 11, whether at least part of the data in window W_jx[p_jx-169, p_jx] satisfies the predetermined condition C_x.
When at least part of the data in every one of the windows W_j1, W_j2, W_j3, W_j4, W_j5, W_j6, W_j7, W_j8, W_j9, W_j10 and W_j11 satisfies its corresponding predetermined condition C_1 through C_11, the current potential split point k_j is a data stream split point. When at least part of the data in any one of the windows W_j1, W_j2, W_j3, W_j4, W_j5, W_j6, W_j7, W_j8, W_j9, W_j10 and W_j11 does not satisfy the predetermined condition, for example when, as shown in Figure 15, at least part of the data in W_j3[p_j3-169, p_j3] does not satisfy the predetermined condition C_3, point p_i4 lies to the left of point p_j3 with respect to the data stream split point search direction, and 21 bytes are jumped from point p_i4 along the search direction to obtain the next potential split point, denoted k_l to distinguish it from the potential split points k_i and k_j. According to the rule preset on the deduplication server 103 in the embodiment shown in Figure 13, 11 points are determined for the potential split point k_l, namely p_l1, p_l2, p_l3, p_l4, p_l5, p_l6, p_l7, p_l8, p_l9, p_l10 and p_l11, and the windows corresponding to these points are, respectively, W_l1[p_l1-169, p_l1], W_l2[p_l2-169, p_l2], W_l3[p_l3-169, p_l3], W_l4[p_l4-169, p_l4], W_l5[p_l5-169, p_l5], W_l6[p_l6-169, p_l6], W_l7[p_l7-169, p_l7], W_l8[p_l8-169, p_l8], W_l9[p_l9-169, p_l9], W_l10[p_l10-169, p_l10] and W_l11[p_l11-169, p_l11]. The distance between p_lx and the potential split point k_l is d_x bytes. Specifically, p_l1 is 0 bytes from k_l, p_l2 is 2 bytes, p_l3 is 4 bytes, p_l4 is 6 bytes, p_l5 is 8 bytes, p_l6 is 10 bytes, p_l7 is 12 bytes, p_l8 is 14 bytes, p_l9 is 16 bytes, p_l10 is 18 bytes and p_l11 is 20 bytes from k_l, and p_l2, p_l3, p_l4, p_l5, p_l6, p_l7, p_l8, p_l9, p_l10 and p_l11 all lie, relative to k_l, in the direction opposite to the data stream split point search direction.
It is then judged, for each x from 1 to 11, whether at least part of the data in window W_lx[p_lx-169, p_lx] satisfies the predetermined condition C_x. When at least part of the data in every one of the windows W_l1, W_l2, W_l3, W_l4, W_l5, W_l6, W_l7, W_l8, W_l9, W_l10 and W_l11 satisfies its corresponding predetermined condition C_1 through C_11, the current potential split point k_l is a data stream split point. When at least part of the data in any one of the windows W_l1, W_l2, W_l3, W_l4, W_l5, W_l6, W_l7, W_l8, W_l9, W_l10 and W_l11 does not satisfy the predetermined condition, the next potential split point is selected. To distinguish it from the potential split points k_i, k_j and k_l, it is denoted k_m; k_m lies to the right of k_l, and the distance between k_m and k_l is 1 byte. According to the rule preset on the deduplication server 103 in the embodiment shown in Figure 13, 11 points are determined for the potential split point k_m, namely p_m1, p_m2, p_m3, p_m4, p_m5, p_m6, p_m7, p_m8, p_m9, p_m10 and p_m11, and the windows corresponding to these points are, respectively, W_m1[p_m1-169, p_m1], W_m2[p_m2-169, p_m2], W_m3[p_m3-169, p_m3], W_m4[p_m4-169, p_m4], W_m5[p_m5-169, p_m5], W_m6[p_m6-169, p_m6], W_m7[p_m7-169, p_m7], W_m8[p_m8-169, p_m8], W_m9[p_m9-169, p_m9], W_m10[p_m10-169, p_m10] and W_m11[p_m11-169, p_m11]. The distance between p_mx and the potential split point k_m is d_x bytes. Specifically, p_m1 is 0 bytes from k_m, p_m2 is 2 bytes, p_m3 is 4 bytes, p_m4 is 6 bytes, p_m5 is 8 bytes, p_m6 is 10 bytes, p_m7 is 12 bytes, p_m8 is 14 bytes, p_m9 is 16 bytes, p_m10 is 18 bytes and p_m11 is 20 bytes from k_m, and p_m2, p_m3, p_m4, p_m5, p_m6, p_m7, p_m8, p_m9, p_m10 and p_m11 all lie, relative to k_m, in the direction opposite to the data stream split point search direction.
Judging whether at least part of the data in W m1 [p m1 -169, p m1 ] satisfies the predetermined condition C 1 , judging whether at least part of the data in W m2 [p m2 -169, p m2 ] meets the predetermined condition C 2 , judging whether W m3 [ p m3 -169, p m3 ] whether at least part of the data satisfies the predetermined condition C 3 , judge whether at least part of the data in W m4 [p m4 -169, p m4 ] meet the predetermined condition C 4 , and judge whether W m5 [p m5 -169 ,p m5 ] whether at least part of the data satisfies the predetermined condition C 5 , judging whether at least part of the data in W m6 [ pm6 -169, pm6 ] meets the predetermined condition C 6 , judging whether W m7 [p m7 -169,p m7 ] Whether at least part of the data in W m8 [p m8 -169 ,p m8 ] meets the predetermined condition C 8 , whether at least part of the data in W m8 [ pm Whether to meet the predetermined condition C 9 , judge whether at least part of the data in W m10 [p m10 -169, p m10 ] meet the predetermined condition C 10 and judge whether at least part of the data in W m11 [p m11 -169, p m11 ] meet the predetermined condition C11 . 
When it is judged that at least part of the data in each of the windows W_m1 through W_m11 satisfies the corresponding predetermined condition C_1 through C_11, the current potential split point k_m is a data stream split point. When at least part of the data in any window does not satisfy its predetermined condition, a jump is performed according to the scheme described above to obtain the next potential split point, and it is then judged whether that point is a data stream split point.
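The all-windows check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `points_for`, `first_failing_window` and `is_split_point` are hypothetical, each condition C_x is modelled as a caller-supplied function, and checking the windows in order from x = 1 and stopping at the first failure is an assumption.

```python
from typing import Callable, List, Optional, Sequence

WINDOW_BACK = 169  # each window is W[p - 169, p]
NUM_POINTS = 11    # points p_1 .. p_11 per potential split point
STEP = 2           # spacing between consecutive points in this embodiment (bytes)

# A condition C_x takes (data, window_start, window_end) and returns True/False.
Condition = Callable[[bytes, int, int], bool]


def points_for(k: int) -> List[int]:
    """Points p_1..p_11 for potential split point k: 0, 2, ..., 20 bytes behind k."""
    return [k - STEP * x for x in range(NUM_POINTS)]


def first_failing_window(data: bytes, k: int,
                         conditions: Sequence[Condition]) -> Optional[int]:
    """Return the 1-based index x of the first window that fails C_x, or None."""
    for x, (p, cond) in enumerate(zip(points_for(k), conditions), start=1):
        if not cond(data, p - WINDOW_BACK, p):
            return x
    return None


def is_split_point(data: bytes, k: int, conditions: Sequence[Condition]) -> bool:
    """k is a split point only if every one of the 11 windows passes."""
    return first_failing_window(data, k, conditions) is None
```

A caller would pass 11 condition functions, one per window; the index returned by `first_failing_window` identifies the point from which the jump is made.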
An embodiment of the present invention provides a method for judging whether at least part of the data in a window W_iz[p_iz-A_z, p_iz+B_z] satisfies the predetermined condition C_z. In this embodiment, a random function is used for the judgment. Taking the implementation shown in FIG. 5 as an example, according to the rule preset on the deduplication server 103, the point p_i1 and its corresponding window W_i1[p_i1-169, p_i1] are determined for the potential split point k_i, and it is judged whether at least part of the data in W_i1[p_i1-169, p_i1] satisfies the predetermined condition C_1. As shown in FIG. 16, W_i1 denotes the window W_i1[p_i1-169, p_i1]. To make the judgment, 5 bytes are selected; each "■" in FIG. 16 denotes one selected byte, and adjacent selected bytes are 42 bytes apart. The selected 5 bytes are reused 51 times to obtain 255 bytes in total, which increases randomness. Each byte consists of 8 bits, denoted a_{m,1}, …, a_{m,8}, the 1st to 8th bits of the m-th of the 255 bytes. The bits of the 255 bytes can therefore be expressed as:
$$\begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,8} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,8} \\ \vdots & \vdots & & \vdots \\ a_{255,1} & a_{255,2} & \cdots & a_{255,8} \end{pmatrix}$$
When a_{m,n} = 1, V_{am,n} = 1; when a_{m,n} = 0, V_{am,n} = -1, where a_{m,n} denotes any one of a_{m,1}, …, a_{m,8}. Applying this conversion to the bits of the 255 bytes gives the matrix V_a:
$$V_a = \begin{pmatrix} V_{a1,1} & V_{a1,2} & \cdots & V_{a1,8} \\ V_{a2,1} & V_{a2,2} & \cdots & V_{a2,8} \\ \vdots & \vdots & & \vdots \\ V_{a255,1} & V_{a255,2} & \cdots & V_{a255,8} \end{pmatrix}$$
A large number of random numbers are selected to form a matrix; once formed, this matrix of random data remains unchanged. For example, 255*8 random numbers are selected from random numbers that obey a specific distribution (a normal distribution is taken as the example here) to form the matrix R:
$$R = \begin{pmatrix} h_{1,1} & h_{1,2} & \cdots & h_{1,8} \\ h_{2,1} & h_{2,2} & \cdots & h_{2,8} \\ \vdots & \vdots & & \vdots \\ h_{255,1} & h_{255,2} & \cdots & h_{255,8} \end{pmatrix}$$
The entries of the m-th row of V_a are multiplied by the random numbers in the m-th row of R and the products are summed to give one value: S_{am} = V_{am,1}*h_{m,1} + V_{am,2}*h_{m,2} + … + V_{am,8}*h_{m,8}. In this way S_{a1}, S_{a2}, …, S_{a255} are obtained, and the number K of these values that satisfy a specific condition (greater than 0 is taken as the example here) is counted. Since the matrix R obeys a normal distribution, S_{am} also obeys a normal distribution. According to probability theory, the probability that a normally distributed random number is greater than 0 is 1/2, so each of S_{a1}, S_{a2}, …, S_{a255} is greater than 0 with probability 1/2, and K obeys the binomial distribution:
$$P(K = k) = \binom{255}{k}\left(\tfrac{1}{2}\right)^{k}\left(\tfrac{1}{2}\right)^{255-k}, \quad k = 0, 1, \ldots, 255$$
It is then judged whether the number K of values greater than 0 among S_{a1}, S_{a2}, …, S_{a255} is even. The probability that a binomially distributed random number is even is 1/2, so K satisfies the condition with probability 1/2. When K is even, at least part of the data in W_i1[p_i1-169, p_i1] satisfies the predetermined condition C_1; when K is odd, at least part of the data in W_i1[p_i1-169, p_i1] does not satisfy the predetermined condition C_1. Here C_1 means that the number K of values greater than 0 among S_{a1}, S_{a2}, …, S_{a255}, obtained as described above, is even.
In the implementation shown in FIG. 5, the windows W_i1[p_i1-169, p_i1], W_i2[p_i2-169, p_i2], W_i3[p_i3-169, p_i3], W_i4[p_i4-169, p_i4], W_i5[p_i5-169, p_i5], W_i6[p_i6-169, p_i6], W_i7[p_i7-169, p_i7], W_i8[p_i8-169, p_i8], W_i9[p_i9-169, p_i9], W_i10[p_i10-169, p_i10] and W_i11[p_i11-169, p_i11] all have the same size, 169 bytes, and the way of judging whether at least part of the data in a window satisfies the predetermined condition is also the same for each; see the above description for W_i1[p_i1-169, p_i1] and C_1. Therefore, as shown in FIG. 16, the corresponding marker denotes one byte selected when judging whether at least part of the data in window W_i2[p_i2-169, p_i2] satisfies the predetermined condition C_2, and adjacent selected bytes are 42 bytes apart. The selected 5 bytes are reused 51 times to obtain 255 bytes in total, which increases randomness. Each byte consists of 8 bits, denoted b_{m,1}, …, b_{m,8}, the 1st to 8th bits of the m-th of the 255 bytes, so the bits of the 255 bytes form the 255*8 matrix (b_{m,n}), m = 1, …, 255, n = 1, …, 8. When b_{m,n} = 1, V_{bm,n} = 1; when b_{m,n} = 0, V_{bm,n} = -1, where b_{m,n} denotes any one of b_{m,1}, …, b_{m,8}. Applying this conversion gives the 255*8 matrix V_b = (V_{bm,n}).
Because W_i1[p_i1-169, p_i1] and W_i2[p_i2-169, p_i2] are judged in the same way, the same matrix R = (h_{m,n}) is used. The entries of the m-th row of V_b are multiplied by the random numbers in the m-th row of R and the products are summed: S_{bm} = V_{bm,1}*h_{m,1} + V_{bm,2}*h_{m,2} + … + V_{bm,8}*h_{m,8}. In this way S_{b1}, S_{b2}, …, S_{b255} are obtained, and the number K of these values that satisfy the specific condition (greater than 0 here) is counted. Since the matrix R obeys a normal distribution, S_{bm} also obeys a normal distribution; each of S_{b1}, S_{b2}, …, S_{b255} is greater than 0 with probability 1/2, so K obeys the same binomial distribution as above, and K is even with probability 1/2. When K is even, at least part of the data in W_i2[p_i2-169, p_i2] satisfies the predetermined condition C_2; when K is odd, at least part of the data in W_i2[p_i2-169, p_i2] does not satisfy the predetermined condition C_2. Here C_2 means that the number K of values greater than 0 among S_{b1}, S_{b2}, …, S_{b255}, obtained as described above, is even. In the embodiment shown in FIG. 3, at least part of the data in W_i2[p_i2-169, p_i2] satisfies the predetermined condition C_2.
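As a rough illustration of the random-matrix test described above, the following sketch selects 5 bytes from a 169-byte window, expands them to 255 bytes, projects each byte's bits onto a fixed normal random matrix, and checks whether the count of positive sums is even. Several details are assumptions made only for this sketch: the function names, the seed, the zero-mean unit-variance normal distribution, the bit order (most significant bit treated as the 1st bit), and reusing the 5 bytes by simple repetition.

```python
import random
from typing import List

WINDOW_LEN = 169    # window size stated in the text
STRIDE = 42         # spacing between selected bytes
REPEATS = 51        # 5 bytes reused 51 times -> 255 bytes
ROWS, COLS = 255, 8


def make_matrix_r(seed: int = 0) -> List[List[float]]:
    """Fixed 255x8 matrix R of normal random numbers h[m][n] (zero mean assumed)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(COLS)] for _ in range(ROWS)]


def select_bytes(window: bytes) -> bytes:
    """Pick the 5 bytes at sequence numbers 1, 43, 85, 127, 169 of the window."""
    return bytes(window[i] for i in range(0, WINDOW_LEN, STRIDE))


def count_positive(window: bytes, r: List[List[float]]) -> int:
    """K: how many of the 255 row sums S_m are greater than 0."""
    expanded = select_bytes(window) * REPEATS  # 255 bytes
    k = 0
    for m, byte in enumerate(expanded):
        # Bit value 1 maps to +1 and 0 maps to -1, then dot with row m of R.
        s = sum((1 if (byte >> (7 - n)) & 1 else -1) * r[m][n]
                for n in range(COLS))
        if s > 0:
            k += 1
    return k


def satisfies_condition(window: bytes, r: List[List[float]]) -> bool:
    """The condition holds when K is even."""
    return count_positive(window, r) % 2 == 0
```

Because R is generated once from a fixed seed and then reused, the same window content always yields the same verdict, which is what lets identical data produce identical split points.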
Therefore, as shown in FIG. 16, the corresponding marker denotes one byte selected when judging whether at least part of the data in window W_i3[p_i3-169, p_i3] satisfies the predetermined condition C_3, and adjacent selected bytes are 42 bytes apart. The selected 5 bytes are reused 51 times to obtain 255 bytes in total, which increases randomness. The method used for windows W_i1[p_i1-169, p_i1] and W_i2[p_i2-169, p_i2] is then applied to judge whether at least part of the data in W_i3[p_i3-169, p_i3] satisfies the predetermined condition C_3. In the implementation shown in FIG. 5, at least part of the data in W_i3[p_i3-169, p_i3] satisfies the predetermined condition C_3.
As shown in FIG. 16, the corresponding marker denotes one byte selected when judging whether at least part of the data in window W_i4[p_i4-169, p_i4] satisfies the predetermined condition C_4, and adjacent selected bytes are 42 bytes apart. The selected 5 bytes are reused 51 times to obtain 255 bytes in total, which increases randomness. The method used for windows W_i1[p_i1-169, p_i1], W_i2[p_i2-169, p_i2] and W_i3[p_i3-169, p_i3] is then applied to judge whether at least part of the data in W_i4[p_i4-169, p_i4] satisfies the predetermined condition C_4. In the implementation shown in FIG. 5, at least part of the data in W_i4[p_i4-169, p_i4] satisfies the predetermined condition C_4.
As shown in FIG. 16, the corresponding marker denotes one byte selected when judging whether at least part of the data in window W_i5[p_i5-169, p_i5] satisfies the predetermined condition C_5, and adjacent selected bytes are 42 bytes apart. The selected 5 bytes are reused 51 times to obtain 255 bytes in total, which increases randomness. The method used for windows W_i1[p_i1-169, p_i1], W_i2[p_i2-169, p_i2], W_i3[p_i3-169, p_i3] and W_i4[p_i4-169, p_i4] is then applied to judge whether at least part of the data in W_i5[p_i5-169, p_i5] satisfies the predetermined condition C_5. In the implementation shown in FIG. 5, at least part of the data in W_i5[p_i5-169, p_i5] does not satisfy the predetermined condition C_5.
When at least part of the data in W_i5[p_i5-169, p_i5] does not satisfy the predetermined condition C_5, a jump of 11 bytes is made from point p_i5 along the data stream split point search direction, and the next potential split point k_j is obtained at the end position of the 11th byte. As shown in FIG. 6, according to the rule preset on the deduplication server 103, the point p_j1 and its corresponding window W_j1[p_j1-169, p_j1] are determined for the potential split point k_j. Whether at least part of the data in window W_j1[p_j1-169, p_j1] satisfies the predetermined condition C_1 is judged in the same way as for window W_i1[p_i1-169, p_i1]. Therefore, as shown in FIG. 17, W_j1 denotes the window W_j1[p_j1-169, p_j1]; to judge whether at least part of its data satisfies the predetermined condition C_1, 5 bytes are selected, each "■" in FIG. 17 denotes one selected byte, and adjacent selected bytes are 42 bytes apart. The selected 5 bytes are reused 51 times to obtain 255 bytes in total, which increases randomness. Each byte consists of 8 bits, denoted a_{m,1}', …, a_{m,8}', the 1st to 8th bits of the m-th of the 255 bytes, so the bits of the 255 bytes form the 255*8 matrix (a_{m,n}'), m = 1, …, 255, n = 1, …, 8. When a_{m,n}' = 1, V_{am,n}' = 1; when a_{m,n}' = 0, V_{am,n}' = -1, where a_{m,n}' denotes any one of a_{m,1}', …, a_{m,8}'. Applying this conversion gives the 255*8 matrix V_a' = (V_{am,n}').
Because window W_j1[p_j1-169, p_j1] is judged in the same way as window W_i1[p_i1-169, p_i1], the same matrix R = (h_{m,n}) is used. The entries of the m-th row of V_a' are multiplied by the random numbers in the m-th row of R and the products are summed to give one value: S_{am}' = V_{am,1}'*h_{m,1} + V_{am,2}'*h_{m,2} + … + V_{am,8}'*h_{m,8}. In this way S_{a1}', S_{a2}', …, S_{a255}' are obtained, and the number K of these values that satisfy the specific condition (greater than 0 is taken as the example here) is counted. Since the matrix R obeys a normal distribution, S_{am}' also obeys a normal distribution. According to probability theory, the probability that a normally distributed random number is greater than 0 is 1/2, so each of S_{a1}', S_{a2}', …, S_{a255}' is greater than 0 with probability 1/2, and K obeys the binomial distribution:
$$P(K = k) = \binom{255}{k}\left(\tfrac{1}{2}\right)^{k}\left(\tfrac{1}{2}\right)^{255-k}, \quad k = 0, 1, \ldots, 255$$
It is then judged whether the number K of values greater than 0 among S_{a1}', S_{a2}', …, S_{a255}' is even. The probability that a binomially distributed random number is even is 1/2, so K satisfies the condition with probability 1/2. When K is even, at least part of the data in W_j1[p_j1-169, p_j1] satisfies the predetermined condition C_1; when K is odd, at least part of the data in W_j1[p_j1-169, p_j1] does not satisfy the predetermined condition C_1.
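The jump step can be sketched as a simple scan loop. This is only an illustration under stated assumptions: `first_failing_point` is a hypothetical callback that returns the point p whose window failed its condition (or `None` when every window at candidate k passes), positions grow in the search direction, and the guard against a non-advancing jump is added here for safety rather than taken from the text.

```python
from typing import Callable, Optional

JUMP = 11  # bytes skipped from the failing point in this embodiment


def next_candidate(failing_point: int, jump: int = JUMP) -> int:
    """Next potential split point: `jump` bytes past the failing point p."""
    return failing_point + jump


def find_split_point(data_len: int, start: int,
                     first_failing_point: Callable[[int], Optional[int]],
                     jump: int = JUMP) -> Optional[int]:
    """Scan forward from `start`; return the first split point found, or None."""
    k = start
    while k <= data_len:
        p = first_failing_point(k)
        if p is None:
            return k  # every window at k satisfied its condition
        nxt = next_candidate(p, jump)
        if nxt <= k:
            raise ValueError("jump does not advance past the current candidate")
        k = nxt
    return None
```

Note that the jump is measured from the failing point (p_i5 in the example above), not from the potential split point k_i itself.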
Whether at least part of the data in W_j2[p_j2-169, p_j2] satisfies the predetermined condition C_2 is judged in the same way as for W_i2[p_i2-169, p_i2]. Therefore, as shown in FIG. 17, the corresponding marker denotes one byte selected when judging whether at least part of the data in window W_j2[p_j2-169, p_j2] satisfies the predetermined condition C_2, and adjacent selected bytes are 42 bytes apart. The selected 5 bytes are reused 51 times to obtain 255 bytes in total, which increases randomness. Each byte consists of 8 bits, denoted b_{m,1}', …, b_{m,8}', the 1st to 8th bits of the m-th of the 255 bytes, so the bits of the 255 bytes form the 255*8 matrix (b_{m,n}'), m = 1, …, 255, n = 1, …, 8. When b_{m,n}' = 1, V_{bm,n}' = 1; when b_{m,n}' = 0, V_{bm,n}' = -1, where b_{m,n}' denotes any one of b_{m,1}', …, b_{m,8}'. Applying this conversion gives the 255*8 matrix V_b' = (V_{bm,n}').
Because windows W_i2[p_i2-169, p_i2] and W_j2[p_j2-169, p_j2] are judged in the same way, the same matrix R = (h_{m,n}) is still used. The entries of the m-th row of V_b' are multiplied by the random numbers in the m-th row of R and the products are summed to give one value: S_{bm}' = V_{bm,1}'*h_{m,1} + V_{bm,2}'*h_{m,2} + … + V_{bm,8}'*h_{m,8}. In this way S_{b1}', S_{b2}', …, S_{b255}' are obtained, and the number K of these values that satisfy the specific condition (greater than 0 is taken as the example here) is counted. Since the matrix R obeys a normal distribution, S_{bm}' also obeys a normal distribution. According to probability theory, the probability that a normally distributed random number is greater than 0 is 1/2, so each of S_{b1}', S_{b2}', …, S_{b255}' is greater than 0 with probability 1/2, and K obeys the binomial distribution:
$$P(K = k) = \binom{255}{k}\left(\tfrac{1}{2}\right)^{k}\left(\tfrac{1}{2}\right)^{255-k}, \quad k = 0, 1, \ldots, 255$$
It is then judged whether the number K of values greater than 0 among S_{b1}', S_{b2}', …, S_{b255}' is even. The probability that a binomially distributed random number is even is 1/2, so K satisfies the condition with probability 1/2. When K is even, at least part of the data in W_j2[p_j2-169, p_j2] satisfies the predetermined condition C_2; when K is odd, at least part of the data in W_j2[p_j2-169, p_j2] does not satisfy the predetermined condition C_2.
Similarly, whether at least part of the data in W_j3[p_j3-169, p_j3] satisfies the predetermined condition C_3 is judged in the same way as for W_i3[p_i3-169, p_i3]. The same applies to judging whether at least part of the data in W_j4[p_j4-169, p_j4] satisfies C_4, in W_j5[p_j5-169, p_j5] satisfies C_5, in W_j6[p_j6-169, p_j6] satisfies C_6, in W_j7[p_j7-169, p_j7] satisfies C_7, in W_j8[p_j8-169, p_j8] satisfies C_8, in W_j9[p_j9-169, p_j9] satisfies C_9, in W_j10[p_j10-169, p_j10] satisfies C_10, and in W_j11[p_j11-169, p_j11] satisfies C_11; details are not repeated here.
Still taking the implementation shown in FIG. 5 as an example, a method is provided for judging whether at least part of the data in a window W_iz[p_iz-A_z, p_iz+B_z] satisfies the predetermined condition C_z. In this embodiment, a random function is used for the judgment. According to the rule preset on the deduplication server 103, the point p_i1 and its corresponding window W_i1[p_i1-169, p_i1] are determined for the potential split point k_i, and it is judged whether at least part of the data in W_i1[p_i1-169, p_i1] satisfies the predetermined condition C_1. As shown in FIG. 16, W_i1 denotes the window W_i1[p_i1-169, p_i1]. To make the judgment, 5 bytes are selected; each "■" in FIG. 16 denotes one selected byte, and adjacent selected "■" bytes are 42 bytes apart. In one implementation, a hash function is computed over the selected 5 bytes; the values produced by the hash function follow a fixed uniform distribution. If the computed hash value is even, at least part of the data in W_i1[p_i1-169, p_i1] is judged to satisfy the predetermined condition C_1; that is, C_1 means that the value computed with the hash function as described above is even. The probability that at least part of the data in W_i1[p_i1-169, p_i1] satisfies the predetermined condition is therefore 1/2. In the implementation shown in FIG. 5, the hash function is likewise used to judge whether at least part of the data in W_i2[p_i2-169, p_i2] satisfies the predetermined condition C_2, whether at least part of the data in W_i3[p_i3-169, p_i3] satisfies C_3, whether at least part of the data in W_i4[p_i4-169, p_i4] satisfies C_4, and whether at least part of the data in W_i5[p_i5-169, p_i5] satisfies C_5. For the specific implementation, refer to the above description of using the hash function to judge whether at least part of the data in W_i1[p_i1-169, p_i1] satisfies the predetermined condition C_1; details are not repeated here.
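The hash-based variant might look like the sketch below. The text does not name a particular hash function, so SHA-256 from Python's standard `hashlib` is used purely as a stand-in; the function names and the choice of byte offsets within the window are likewise assumptions for illustration.

```python
import hashlib

WINDOW_LEN = 169  # window size stated in the text
STRIDE = 42       # spacing between selected bytes


def select_bytes(window: bytes) -> bytes:
    """Pick 5 bytes spaced 42 bytes apart across the 169-byte window."""
    return bytes(window[i] for i in range(0, WINDOW_LEN, STRIDE))


def hash_is_even(window: bytes) -> bool:
    """Condition holds when the hash of the 5 selected bytes is an even number."""
    digest = hashlib.sha256(select_bytes(window)).digest()
    # Reading the digest as a big-endian integer, its parity is the parity
    # of the last byte, so only that byte needs to be inspected.
    return digest[-1] % 2 == 0
```

Since only 5 bytes per window are hashed, the cost per window stays small and constant regardless of the window size.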
When at least part of the data in W_i5[p_i5-169, p_i5] does not satisfy the predetermined condition C_5, a jump of 11 bytes is made from point p_i5 along the data stream split point search direction, and the current potential split point k_j is obtained at the end position of the 11th byte. As shown in FIG. 6, according to the rule preset on the deduplication server 103, the point p_j1 and its corresponding window W_j1[p_j1-169, p_j1] are determined for the potential split point k_j. Whether at least part of the data in window W_j1[p_j1-169, p_j1] satisfies the predetermined condition C_1 is judged in the same way as for window W_i1[p_i1-169, p_i1]. Therefore, as shown in FIG. 17, W_j1 denotes the window W_j1[p_j1-169, p_j1]; to judge whether at least part of its data satisfies the predetermined condition C_1, 5 bytes are selected, each "■" in FIG. 17 denotes one selected byte, and adjacent selected "■" bytes are 42 bytes apart. The hash function is computed over the 5 bytes selected from window W_j1[p_j1-169, p_j1]; if the resulting value is even, at least part of the data in W_j1[p_j1-169, p_j1] satisfies the predetermined condition C_1.
In FIG. 17, whether at least part of the data in W_j2[p_j2-169, p_j2] satisfies the predetermined condition C_2 is judged in the same way as for W_i2[p_i2-169, p_i2]. Therefore, as shown in FIG. 17, the corresponding marker denotes one byte selected when judging whether at least part of the data in window W_j2[p_j2-169, p_j2] satisfies the predetermined condition C_2, and adjacent selected bytes are 42 bytes apart. The hash function is computed over the selected 5 bytes; if the resulting value is even, at least part of the data in W_j2[p_j2-169, p_j2] satisfies the predetermined condition C_2.
In FIG. 17, whether at least part of the data in W_j3[p_j3-169, p_j3] satisfies the predetermined condition C_3 is judged in the same way as for W_i3[p_i3-169, p_i3]. Therefore, as shown in FIG. 17, the corresponding marker denotes one byte selected when judging whether at least part of the data in window W_j3[p_j3-169, p_j3] satisfies the predetermined condition C_3, and adjacent selected bytes are 42 bytes apart. The hash function is computed over the selected 5 bytes; when the resulting value is even, at least part of the data in W_j3[p_j3-169, p_j3] satisfies the predetermined condition C_3.
In FIG. 17, whether at least part of the data in W_j4[p_j4-169, p_j4] satisfies the predetermined condition C_4 is judged in the same way as for window W_i4[p_i4-169, p_i4]. Therefore, as shown in FIG. 17, the corresponding marker denotes one byte selected when judging whether at least part of the data in window W_j4[p_j4-169, p_j4] satisfies the predetermined condition C_4, and adjacent selected bytes are 42 bytes apart. The hash function is computed over the selected 5 bytes; when the resulting value is even, at least part of the data in W_j4[p_j4-169, p_j4] satisfies the predetermined condition C_4.
The same method is used to judge whether at least part of the data in W_j5[p_j5-169, p_j5] satisfies the predetermined condition C_5, in W_j6[p_j6-169, p_j6] satisfies C_6, in W_j7[p_j7-169, p_j7] satisfies C_7, in W_j8[p_j8-169, p_j8] satisfies C_8, in W_j9[p_j9-169, p_j9] satisfies C_9, in W_j10[p_j10-169, p_j10] satisfies C_10, and in W_j11[p_j11-169, p_j11] satisfies C_11; details are not repeated here.
Taking the implementation shown in FIG. 5 as an example, a method is provided for judging whether at least part of the data in a window W_iz[p_iz-A_z, p_iz+B_z] satisfies the predetermined condition C_z. In this embodiment, a random function is used for the judgment. According to the rule preset on the deduplication server 103, the point p_i1 and its corresponding window W_i1[p_i1-169, p_i1] are determined for the potential split point k_i, and it is judged whether at least part of the data in W_i1[p_i1-169, p_i1] satisfies the predetermined condition C_1. As shown in FIG. 16, W_i1 denotes the window W_i1[p_i1-169, p_i1]. To make the judgment, 5 bytes are selected: the bytes "■" with sequence numbers 169, 127, 85, 43 and 1 in FIG. 16 each denote one selected byte, and adjacent selected bytes are 42 bytes apart. The bytes "■" with sequence numbers 169, 127, 85, 43 and 1 are each converted to a decimal value, denoted a_1, a_2, a_3, a_4 and a_5 respectively. Because one byte consists of 8 bits, each byte "■" taken as a value gives 0 ≤ a_r ≤ 255 for every a_r among a_1, a_2, a_3, a_4 and a_5. The values a_1, a_2, a_3, a_4 and a_5 form a 1*5 matrix. 256*5 random numbers are selected from random numbers that obey a binomial distribution to form the matrix R, with one row for each possible byte value and one column for each selected byte:
$$R = \begin{pmatrix} h_{0,1} & h_{0,2} & \cdots & h_{0,5} \\ h_{1,1} & h_{1,2} & \cdots & h_{1,5} \\ \vdots & \vdots & & \vdots \\ h_{255,1} & h_{255,2} & \cdots & h_{255,5} \end{pmatrix}$$
The matrix R is searched for the value corresponding to each a_r according to the value of a_r and the column in which it lies. For example, if a_1 = 36 and a_1 lies in column 1, the value h_{36,1} is looked up; if a_2 = 48 and a_2 lies in column 2, h_{48,2} is looked up; if a_3 = 26 and a_3 lies in column 3, h_{26,3} is looked up; if a_4 = 26 and a_4 lies in column 4, h_{26,4} is looked up; and if a_5 = 88 and a_5 lies in column 5, h_{88,5} is looked up. Then S_1 = h_{36,1} + h_{48,2} + h_{26,3} + h_{26,4} + h_{88,5}. Because the matrix R follows a binomial distribution, S_1 also follows a binomial distribution. When S_1 is even, at least part of the data in W_i1[p_i1-169, p_i1] satisfies the predetermined condition C_1; when S_1 is odd, at least part of the data in W_i1[p_i1-169, p_i1] does not satisfy C_1. The probability that S_1 is even is 1/2, and C_1 denotes that S_1 computed in the above manner is even. In the embodiment shown in Figure 5, at least part of the data in W_i1[p_i1-169, p_i1] satisfies the predetermined condition C_1.

As shown in Figure 16, the bytes selected when judging whether at least part of the data in the window W_i2[p_i2-169, p_i2] satisfies the predetermined condition C_2 are those with sequence numbers 170, 128, 86, 44 and 2, and adjacent selected bytes are 42 bytes apart. Each of these bytes is converted into a decimal value, denoted b_1, b_2, b_3, b_4 and b_5 respectively. Because one byte consists of 8 bits, each byte taken as a value satisfies 0 ≤ b_r ≤ 255 for every b_r among b_1, b_2, b_3, b_4 and b_5. The values b_1, b_2, b_3, b_4 and b_5 form a 1*5 matrix. In this implementation, W_i1 and W_i2 are judged in the same manner, so the matrix R is still used. If b_1 = 66 and b_1 lies in column 1, h_{66,1} is looked up; if b_2 = 48 and b_2 lies in column 2, h_{48,2} is looked up; if b_3 = 99 and b_3 lies in column 3, h_{99,3} is looked up; if b_4 = 26 and b_4 lies in column 4, h_{26,4} is looked up; and if b_5 = 90 and b_5 lies in column 5, h_{90,5} is looked up. Then S_2 = h_{66,1} + h_{48,2} + h_{99,3} + h_{26,4} + h_{90,5}. Because the matrix R follows a binomial distribution, S_2 also follows a binomial distribution. When S_2 is even, at least part of the data in W_i2[p_i2-169, p_i2] satisfies the predetermined condition C_2; when S_2 is odd, it does not. The probability that S_2 is even is 1/2. In the embodiment shown in Figure 5, at least part of the data in W_i2[p_i2-169, p_i2] satisfies the predetermined condition C_2.

Using the same rule, for each z from 3 to 11, it is judged whether at least part of the data in W_iz[p_iz-169, p_iz] satisfies the predetermined condition C_z. In the implementation shown in Figure 5, at least part of the data in W_i5[p_i5-169, p_i5] does not satisfy the predetermined condition C_5, so 11 bytes are skipped from the point p_i5 along the search direction for data stream split points, and the current potential split point k_j is obtained at the end position of the 11th byte, as shown in Figure 6. According to the rule preset on the deduplication server 103, the point p_j1 and its corresponding window W_j1[p_j1-169, p_j1] are determined for the potential split point k_j. Whether at least part of the data in the window W_j1[p_j1-169, p_j1] satisfies the predetermined condition C_1 is judged in the same manner as for the window W_i1[p_i1-169, p_i1]. Therefore, as shown in Figure 17, W_j1 denotes the window W_j1[p_j1-169, p_j1], and the bytes "■" with sequence numbers 169, 127, 85, 43 and 1 in Figure 17 each denote one selected byte, with adjacent selected bytes 42 bytes apart. Each of these bytes is converted into a decimal value, denoted a_1', a_2', a_3', a_4' and a_5' respectively. Because one byte consists of 8 bits, each byte "■" taken as a value satisfies 0 ≤ a_r' ≤ 255 for every a_r' among a_1', a_2', a_3', a_4' and a_5'. The values a_1', a_2', a_3', a_4' and a_5' form a 1*5 matrix. Because the manner of judging is the same as for the window W_i1[p_i1-169, p_i1], the same 256*5 matrix R is still used.
The matrix R is searched for the value corresponding to each a_r' according to the value of a_r' and the column in which it lies. For example, if a_1' = 16 and a_1' lies in column 1, h_{16,1} is looked up; if a_2' = 98 and a_2' lies in column 2, h_{98,2} is looked up; if a_3' = 56 and a_3' lies in column 3, h_{56,3} is looked up; if a_4' = 36 and a_4' lies in column 4, h_{36,4} is looked up; and if a_5' = 99 and a_5' lies in column 5, h_{99,5} is looked up. Then S_1' = h_{16,1} + h_{98,2} + h_{56,3} + h_{36,4} + h_{99,5}. Because the matrix R follows a binomial distribution, S_1' also follows a binomial distribution. When S_1' is even, at least part of the data in W_j1[p_j1-169, p_j1] satisfies the predetermined condition C_1; when S_1' is odd, at least part of the data in W_j1[p_j1-169, p_j1] does not satisfy C_1. The probability that S_1' is even is 1/2.
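The single-matrix check described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the seed, and the binomial parameter `trials` are assumptions introduced here, while the window size (169 bytes), the byte spacing (42) and the even-sum rule come from the text.

```python
import random

WINDOW = 169  # bytes per window W[p-169, p]
STRIDE = 42   # spacing between adjacent selected bytes
COLUMNS = 5   # number of selected bytes, one per column of R


def make_table(seed=0, trials=8):
    """Build a 256 x 5 table of Binomial(trials, 1/2) integers.

    `seed` and `trials` are illustrative choices, not values from the text.
    """
    rng = random.Random(seed)
    return [[sum(rng.random() < 0.5 for _ in range(trials)) for _ in range(COLUMNS)]
            for _ in range(256)]


def select_bytes(window):
    """Return the bytes numbered 169, 127, 85, 43 and 1 (numbering starts at 1)."""
    if len(window) != WINDOW:
        raise ValueError("window must hold exactly 169 bytes")
    return [window[n - 1] for n in range(WINDOW, 0, -STRIDE)]


def window_sum(window, table):
    """S = h[a_1][column 1] + ... + h[a_5][column 5]; rows are byte values."""
    return sum(table[value][col] for col, value in enumerate(select_bytes(window)))


def satisfies(window, table):
    """The predetermined condition holds when the sum S is even."""
    return window_sum(window, table) % 2 == 0
```

Calling `satisfies(window, make_table())` on a 169-byte `bytes` or `bytearray` object returns `True` when the window passes. Because the table is fixed once and reused for every window, the same data always produces the same decision.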
Whether at least part of the data in W_j2[p_j2-169, p_j2] satisfies the predetermined condition C_2 is judged in the same manner as for W_i2[p_i2-169, p_i2]. Therefore, as shown in Figure 17, the bytes selected when judging whether at least part of the data in the window W_j2[p_j2-169, p_j2] satisfies the predetermined condition C_2 are those with sequence numbers 170, 128, 86, 44 and 2, and adjacent selected bytes are 42 bytes apart. Each of these bytes is converted into a decimal value, denoted b_1', b_2', b_3', b_4' and b_5' respectively. Because one byte consists of 8 bits, each byte taken as a value satisfies 0 ≤ b_r' ≤ 255 for every b_r' among b_1', b_2', b_3', b_4' and b_5'. The values b_1', b_2', b_3', b_4' and b_5' form a 1*5 matrix. The same matrix R is used as for the window W_i2[p_i2-169, p_i2]. If b_1' = 210 and b_1' lies in column 1, h_{210,1} is looked up; if b_2' = 156 and b_2' lies in column 2, h_{156,2} is looked up; if b_3' = 144 and b_3' lies in column 3, h_{144,3} is looked up; if b_4' = 60 and b_4' lies in column 4, h_{60,4} is looked up; and if b_5' = 90 and b_5' lies in column 5, h_{90,5} is looked up. Then S_2' = h_{210,1} + h_{156,2} + h_{144,3} + h_{60,4} + h_{90,5}. The judgment condition is the same as for S_2: when S_2' is even, at least part of the data in W_j2[p_j2-169, p_j2] satisfies the predetermined condition C_2; when S_2' is odd, it does not. The probability that S_2' is even is 1/2.
Similarly, whether at least part of the data in W_j3[p_j3-169, p_j3] satisfies the predetermined condition C_3 is judged in the same manner as for W_i3[p_i3-169, p_i3]. In the same way, for each z from 4 to 11, it is judged whether at least part of the data in W_jz[p_jz-169, p_jz] satisfies the predetermined condition C_z; details are not repeated here.
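The overall search, in which 11 windows are checked for each potential split point and 11 bytes are skipped from the failing point, might be sketched as follows. The placement of the points is an assumption: the text only says they follow a rule preset on the deduplication server, so this sketch places p_z at (11 - z) bytes before k, which shifts successive windows by one byte as the numbering in Figure 16 suggests. The function name and the `window_ok` callback are hypothetical.

```python
M = 11        # windows checked per potential split point (conditions C_1..C_11)
WINDOW = 169  # bytes per window W[p-169, p]
SKIP = 11     # bytes skipped from the failing point, as in the Figure 5 example


def find_split_point(data, start, window_ok):
    """Return the first potential split point k whose M windows all pass, else None.

    Assumption (not from the text): point p_z lies (M - z) bytes before k.
    `window_ok` is any predicate on a 169-byte window, for example a parity
    check on a table-lookup sum.
    """
    if start < WINDOW + M - 1:
        raise ValueError("start must leave room for the first window")
    k = start
    while k <= len(data):
        failed_at = None
        for z in range(1, M + 1):
            p = k - (M - z)
            if not window_ok(data[p - WINDOW:p]):
                failed_at = p
                break
        if failed_at is None:
            return k  # every window passed: k is a data stream split point
        k = failed_at + SKIP  # skip ahead; remaining windows for the old k are not checked
    return None
```

The point of the skip is that a failure at window z ends the work for the current candidate at once, so the remaining windows for that candidate are never evaluated.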
Taking the implementation shown in Figure 5 as an example, a method is provided for judging whether at least part of the data in a window W_iz[p_iz-A_z, p_iz+B_z] satisfies the predetermined condition C_z. In this embodiment, a random function is used for the judgment. According to the rule preset on the deduplication server 103, the point p_i1 and its corresponding window W_i1[p_i1-169, p_i1] are determined for the potential split point k_i, and it is judged whether at least part of the data in W_i1[p_i1-169, p_i1] satisfies the predetermined condition C_1. As shown in Figure 16, W_i1 denotes the window W_i1[p_i1-169, p_i1]. To make the judgment, 5 bytes are selected: the bytes "■" with sequence numbers 169, 127, 85, 43 and 1 in Figure 16 each denote one selected byte, and adjacent selected bytes are 42 bytes apart. Each of these bytes is converted into a decimal value, denoted a_1, a_2, a_3, a_4 and a_5 respectively. Because one byte consists of 8 bits, each byte "■" taken as a value satisfies 0 ≤ a_s ≤ 255 for every a_s among a_1, a_2, a_3, a_4 and a_5. The values a_1, a_2, a_3, a_4 and a_5 form a 1*5 matrix. From random numbers that follow a binomial distribution, 256*5 random numbers are selected to form a matrix R, and a further 256*5 random numbers are selected to form a matrix G, expressed as:

$$
R = \begin{pmatrix}
h_{0,1} & h_{0,2} & h_{0,3} & h_{0,4} & h_{0,5} \\
h_{1,1} & h_{1,2} & h_{1,3} & h_{1,4} & h_{1,5} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
h_{255,1} & h_{255,2} & h_{255,3} & h_{255,4} & h_{255,5}
\end{pmatrix},
\qquad
G = \begin{pmatrix}
g_{0,1} & g_{0,2} & g_{0,3} & g_{0,4} & g_{0,5} \\
g_{1,1} & g_{1,2} & g_{1,3} & g_{1,4} & g_{1,5} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
g_{255,1} & g_{255,2} & g_{255,3} & g_{255,4} & g_{255,5}
\end{pmatrix}
$$
The matrices R and G are searched according to the value of each a_s and the column in which it lies. For example, if a_1 = 36 and a_1 lies in column 1, h_{36,1} is looked up in R and g_{36,1} in G; if a_2 = 48 and a_2 lies in column 2, h_{48,2} is looked up in R and g_{48,2} in G; if a_3 = 26 and a_3 lies in column 3, h_{26,3} is looked up in R and g_{26,3} in G; if a_4 = 26 and a_4 lies in column 4, h_{26,4} is looked up in R and g_{26,4} in G; and if a_5 = 88 and a_5 lies in column 5, h_{88,5} is looked up in R and g_{88,5} in G. Then S_1h = h_{36,1} + h_{48,2} + h_{26,3} + h_{26,4} + h_{88,5}; because the matrix R follows a binomial distribution, S_1h also follows a binomial distribution. Likewise S_1g = g_{36,1} + g_{48,2} + g_{26,3} + g_{26,4} + g_{88,5}; because the matrix G follows a binomial distribution, S_1g also follows a binomial distribution. When at least one of S_1h and S_1g is even, at least part of the data in W_i1[p_i1-169, p_i1] satisfies the predetermined condition C_1; when both S_1h and S_1g are odd, it does not. C_1 denotes that at least one of S_1h and S_1g obtained in the above manner is even. Because S_1h and S_1g both follow a binomial distribution, the probability that S_1h is even is 1/2, the probability that S_1g is even is 1/2, and the probability that at least one of them is even is 1 - 1/4 = 3/4. The probability that at least part of the data in W_i1[p_i1-169, p_i1] satisfies the predetermined condition C_1 is therefore 3/4. In the embodiment shown in Figure 5, at least part of the data in W_i1[p_i1-169, p_i1] satisfies the predetermined condition C_1.

In the implementation shown in Figure 5, the windows W_i1[p_i1-169, p_i1] through W_i11[p_i11-169, p_i11] all have the same size, namely 169 bytes, and the manner of judging whether at least part of the data in a window satisfies the predetermined condition is also the same for all of them; see the description above for W_i1[p_i1-169, p_i1] and C_1. Accordingly, as shown in Figure 16, the bytes selected when judging whether at least part of the data in the window W_i2[p_i2-169, p_i2] satisfies the predetermined condition C_2 are those with sequence numbers 170, 128, 86, 44 and 2, and adjacent selected bytes are 42 bytes apart. Each of these bytes is converted into a decimal value, denoted b_1, b_2, b_3, b_4 and b_5 respectively. Because one byte consists of 8 bits, each byte taken as a value satisfies 0 ≤ b_s ≤ 255 for every b_s among b_1, b_2, b_3, b_4 and b_5. The values b_1, b_2, b_3, b_4 and b_5 form a 1*5 matrix. Because every window is judged in the same manner, the same matrices R and G are still used. If b_1 = 66 and b_1 lies in column 1, h_{66,1} is looked up in R and g_{66,1} in G; if b_2 = 48 and b_2 lies in column 2, h_{48,2} is looked up in R and g_{48,2} in G; if b_3 = 99 and b_3 lies in column 3, h_{99,3} is looked up in R and g_{99,3} in G; if b_4 = 26 and b_4 lies in column 4, h_{26,4} is looked up in R and g_{26,4} in G; and if b_5 = 90 and b_5 lies in column 5, h_{90,5} is looked up in R and g_{90,5} in G. Then S_2h = h_{66,1} + h_{48,2} + h_{99,3} + h_{26,4} + h_{90,5}; because the matrix R follows a binomial distribution, S_2h also follows a binomial distribution. Likewise S_2g = g_{66,1} + g_{48,2} + g_{99,3} + g_{26,4} + g_{90,5}; because the matrix G follows a binomial distribution, S_2g also follows a binomial distribution. When at least one of S_2h and S_2g is even, at least part of the data in W_i2[p_i2-169, p_i2] satisfies the predetermined condition C_2; when both S_2h and S_2g are odd, it does not. The probability that at least one of S_2h and S_2g is even is 3/4. In the embodiment shown in Figure 5, at least part of the data in W_i2[p_i2-169, p_i2] satisfies the predetermined condition C_2.

Using the same rule, for each z from 3 to 11, it is judged whether at least part of the data in W_iz[p_iz-169, p_iz] satisfies the predetermined condition C_z. In the implementation shown in Figure 5, at least part of the data in W_i5[p_i5-169, p_i5] does not satisfy the predetermined condition C_5, so 11 bytes are skipped from the point p_i5 along the search direction for data stream split points, and the current potential split point k_j is obtained at the end position of the 11th byte, as shown in Figure 6. According to the rule preset on the deduplication server 103, the point p_j1 and its corresponding window W_j1[p_j1-169, p_j1] are determined for the potential split point k_j. Whether at least part of the data in the window W_j1[p_j1-169, p_j1] satisfies the predetermined condition C_1 is judged in the same manner as for the window W_i1[p_i1-169, p_i1]. Therefore, as shown in Figure 17, W_j1 denotes the window W_j1[p_j1-169, p_j1], and the bytes "■" with sequence numbers 169, 127, 85, 43 and 1 in Figure 17 each denote one selected byte, with adjacent selected bytes 42 bytes apart. Each of these bytes is converted into a decimal value, denoted a_1', a_2', a_3', a_4' and a_5' respectively. Because one byte consists of 8 bits, each byte "■" taken as a value satisfies 0 ≤ a_s' ≤ 255 for every a_s' among a_1', a_2', a_3', a_4' and a_5'. The values a_1', a_2', a_3', a_4' and a_5' form a 1*5 matrix. The same matrices R and G are used as for judging whether at least part of the data in the window W_i1[p_i1-169, p_i1] satisfies the predetermined condition C_1.
The matrices R and G are searched according to the value of each a_s' and the column in which it lies. For example, if a_1' = 16 and a_1' lies in column 1, h_{16,1} is looked up in R and g_{16,1} in G; if a_2' = 98 and a_2' lies in column 2, h_{98,2} is looked up in R and g_{98,2} in G; if a_3' = 56 and a_3' lies in column 3, h_{56,3} is looked up in R and g_{56,3} in G; if a_4' = 36 and a_4' lies in column 4, h_{36,4} is looked up in R and g_{36,4} in G; and if a_5' = 99 and a_5' lies in column 5, h_{99,5} is looked up in R and g_{99,5} in G. Then S_1h' = h_{16,1} + h_{98,2} + h_{56,3} + h_{36,4} + h_{99,5}; because the matrix R follows a binomial distribution, S_1h' also follows a binomial distribution. Likewise S_1g' = g_{16,1} + g_{98,2} + g_{56,3} + g_{36,4} + g_{99,5}; because the matrix G follows a binomial distribution, S_1g' also follows a binomial distribution. When at least one of S_1h' and S_1g' is even, at least part of the data in W_j1[p_j1-169, p_j1] satisfies the predetermined condition C_1; when both S_1h' and S_1g' are odd, it does not. The probability that at least one of S_1h' and S_1g' is even is 3/4.
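The 3/4 figure can be written out as a short derivation. It relies on an assumption the text leaves implicit, namely that the two sums are statistically independent (for instance because R and G are drawn independently):

```latex
\begin{aligned}
P(C_1) &= P(S_{1h}\ \text{even} \ \lor\ S_{1g}\ \text{even}) \\
       &= 1 - P(S_{1h}\ \text{odd}) \cdot P(S_{1g}\ \text{odd}) \\
       &= 1 - \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{3}{4}.
\end{aligned}
```

With a single matrix the pass probability per window is 1/2, so adding the second matrix G raises the per-window pass probability from 1/2 to 3/4 under this assumption.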
The manner of judging whether at least part of the data in $W_{i2}[p_{i2}-169,\,p_{i2}]$ satisfies the predetermined condition $C_2$ is the same as the manner of judging whether at least part of the data in $W_{j2}[p_{j2}-169,\,p_{j2}]$ satisfies $C_2$. Therefore, as shown in FIG. 17, each marked byte represents one byte selected when judging whether at least part of the data in window $W_{j2}[p_{j2}-169,\,p_{j2}]$ satisfies $C_2$. In FIG. 17 the selected bytes carry the sequence numbers 170, 128, 86, 44 and 2, and two adjacent selected bytes are 42 bytes apart. The bytes with sequence numbers 170, 128, 86, 44 and 2 are each converted into a decimal value, denoted $b_1'$, $b_2'$, $b_3'$, $b_4'$ and $b_5'$ respectively. Because one byte consists of 8 bits and each byte is taken as one value, every $b_s'$ among $b_1'$, $b_2'$, $b_3'$, $b_4'$ and $b_5'$ satisfies $0 \le b_s' \le 255$. $b_1'$, $b_2'$, $b_3'$, $b_4'$ and $b_5'$ form a 1×5 matrix. The same matrices R and G are used as when judging whether at least part of the data in window $W_{i2}[p_{i2}-169,\,p_{i2}]$ satisfies $C_2$. Each value is looked up according to its magnitude and column: if $b_1'=210$ and $b_1'$ is in column 1, $h_{210,1}$ is looked up in R and $g_{210,1}$ in G; if $b_2'=156$ and $b_2'$ is in column 2, $h_{156,2}$ is looked up in R and $g_{156,2}$ in G; if $b_3'=144$ and $b_3'$ is in column 3, $h_{144,3}$ is looked up in R and $g_{144,3}$ in G; if $b_4'=60$ and $b_4'$ is in column 4, $h_{60,4}$ is looked up in R and $g_{60,4}$ in G; if $b_5'=90$ and $b_5'$ is in column 5, $h_{90,5}$ is looked up in R and $g_{90,5}$ in G. Then $S_{2h}' = h_{210,1}+h_{156,2}+h_{144,3}+h_{60,4}+h_{90,5}$ and $S_{2g}' = g_{210,1}+g_{156,2}+g_{144,3}+g_{60,4}+g_{90,5}$. When at least one of $S_{2h}'$ and $S_{2g}'$ is even, at least part of the data in $W_{j2}[p_{j2}-169,\,p_{j2}]$ satisfies $C_2$; when both $S_{2h}'$ and $S_{2g}'$ are odd, at least part of the data in $W_{j2}[p_{j2}-169,\,p_{j2}]$ does not satisfy $C_2$. The probability that at least one of $S_{2h}'$ and $S_{2g}'$ is even is 3/4.
Similarly, the manner of judging whether at least part of the data in $W_{i3}[p_{i3}-169,\,p_{i3}]$ satisfies the predetermined condition $C_3$ is the same as the manner of judging whether at least part of the data in $W_{j3}[p_{j3}-169,\,p_{j3}]$ satisfies $C_3$. In the same way, it is judged whether at least part of the data in each window $W_{jx}[p_{jx}-169,\,p_{jx}]$ satisfies the corresponding predetermined condition $C_x$, for $x = 4, 5, \ldots, 11$; the details are not repeated here.
Taking the implementation shown in FIG. 5 as an example, a method is provided for judging whether at least part of the data in window $W_{iz}[p_{iz}-A_z,\,p_{iz}+B_z]$ satisfies the predetermined condition $C_z$. In this embodiment, a random function is used for the judgment. According to the rule preset on the deduplication server 103, point $p_{i1}$ and its corresponding window $W_{i1}[p_{i1}-169,\,p_{i1}]$ are determined for the potential split point $k_i$, and it is judged whether at least part of the data in $W_{i1}[p_{i1}-169,\,p_{i1}]$ satisfies the predetermined condition $C_1$. As shown in FIG. 16, $W_{i1}$ denotes the window $W_{i1}[p_{i1}-169,\,p_{i1}]$, and 5 bytes are selected to make the judgment. In FIG. 16, the bytes "■" with sequence numbers 169, 127, 85, 43 and 1 each represent one selected byte, and two adjacent selected bytes are 42 bytes apart. These five bytes are regarded in order as 40 bits, denoted $a_1, a_2, a_3, a_4, \ldots, a_{40}$. For any $a_t$ among them, $V_{a_t}=-1$ when $a_t=0$ and $V_{a_t}=1$ when $a_t=1$; this correspondence yields $V_{a_1}, V_{a_2}, V_{a_3}, V_{a_4}, \ldots, V_{a_{40}}$. Forty random numbers, denoted $h_1, h_2, h_3, h_4, \ldots, h_{40}$, are selected from random numbers that follow a normal distribution, and $S_a = V_{a_1}h_1 + V_{a_2}h_2 + V_{a_3}h_3 + V_{a_4}h_4 + \cdots + V_{a_{40}}h_{40}$. Because $h_1, h_2, h_3, h_4, \ldots, h_{40}$ follow a normal distribution, $S_a$ also follows a normal distribution. When $S_a$ is positive, at least part of the data in $W_{i1}[p_{i1}-169,\,p_{i1}]$ satisfies $C_1$; when $S_a$ is negative or 0, at least part of the data in $W_{i1}[p_{i1}-169,\,p_{i1}]$ does not satisfy $C_1$. The probability that $S_a$ is positive is 1/2. In the embodiment shown in FIG. 5, at least part of the data in $W_{i1}[p_{i1}-169,\,p_{i1}]$ satisfies $C_1$.

As shown in FIG. 16, the marked bytes with sequence numbers 170, 128, 86, 44 and 2 each represent one byte selected when judging whether at least part of the data in window $W_{i2}[p_{i2}-169,\,p_{i2}]$ satisfies the predetermined condition $C_2$, and two adjacent selected bytes are 42 bytes apart. These five bytes are regarded in order as 40 bits, denoted $b_1, b_2, b_3, b_4, \ldots, b_{40}$. For any $b_t$ among them, $V_{b_t}=-1$ when $b_t=0$ and $V_{b_t}=1$ when $b_t=1$; this correspondence yields $V_{b_1}, V_{b_2}, V_{b_3}, V_{b_4}, \ldots, V_{b_{40}}$. The manner of judging window $W_{i1}[p_{i1}-169,\,p_{i1}]$ against $C_1$ is the same as the manner of judging window $W_{i2}[p_{i2}-169,\,p_{i2}]$ against $C_2$, so the same random numbers $h_1, h_2, h_3, h_4, \ldots, h_{40}$ are used: $S_b = V_{b_1}h_1 + V_{b_2}h_2 + V_{b_3}h_3 + V_{b_4}h_4 + \cdots + V_{b_{40}}h_{40}$. Because $h_1, h_2, h_3, h_4, \ldots, h_{40}$ follow a normal distribution, $S_b$ also follows a normal distribution. When $S_b$ is positive, at least part of the data in $W_{i2}[p_{i2}-169,\,p_{i2}]$ satisfies $C_2$; when $S_b$ is negative or 0, at least part of the data in $W_{i2}[p_{i2}-169,\,p_{i2}]$ does not satisfy $C_2$. The probability that $S_b$ is positive is 1/2. In the embodiment shown in FIG. 5, at least part of the data in $W_{i2}[p_{i2}-169,\,p_{i2}]$ satisfies $C_2$. Using the same rule, it is judged whether at least part of the data in each window $W_{ix}[p_{ix}-169,\,p_{ix}]$ satisfies the corresponding predetermined condition $C_x$, for $x = 3, 4, \ldots, 11$.

In the implementation shown in FIG. 5, at least part of the data in $W_{i5}[p_{i5}-169,\,p_{i5}]$ does not satisfy the predetermined condition $C_5$. Eleven bytes are skipped from point $p_{i5}$ along the data stream split point search direction, and the current potential split point $k_j$ is obtained at the end position of the 11th byte. As shown in FIG. 6, according to the rule preset on the deduplication server 103, point $p_{j1}$ and its corresponding window $W_{j1}[p_{j1}-169,\,p_{j1}]$ are determined for the potential split point $k_j$. The manner of judging whether at least part of the data in window $W_{j1}[p_{j1}-169,\,p_{j1}]$ satisfies $C_1$ is the same as for window $W_{i1}[p_{i1}-169,\,p_{i1}]$. Therefore, as shown in FIG. 17, $W_{j1}$ denotes the window $W_{j1}[p_{j1}-169,\,p_{j1}]$, and 5 bytes are selected to make the judgment. In FIG. 17, the bytes "■" with sequence numbers 169, 127, 85, 43 and 1 each represent one selected byte, and two adjacent selected bytes are 42 bytes apart. These five bytes are regarded in order as 40 bits, denoted $a_1', a_2', a_3', a_4', \ldots, a_{40}'$. For any $a_t'$ among them, $V_{a_t}'=-1$ when $a_t'=0$ and $V_{a_t}'=1$ when $a_t'=1$; this correspondence yields $V_{a_1}', V_{a_2}', V_{a_3}', V_{a_4}', \ldots, V_{a_{40}}'$. Because the manner of judgment is the same as for window $W_{i1}[p_{i1}-169,\,p_{i1}]$, the same random numbers $h_1, h_2, h_3, h_4, \ldots, h_{40}$ are used: $S_a' = V_{a_1}'h_1 + V_{a_2}'h_2 + V_{a_3}'h_3 + V_{a_4}'h_4 + \cdots + V_{a_{40}}'h_{40}$. Because $h_1, h_2, h_3, h_4, \ldots, h_{40}$ follow a normal distribution, $S_a'$ also follows a normal distribution. When $S_a'$ is positive, at least part of the data in $W_{j1}[p_{j1}-169,\,p_{j1}]$ satisfies $C_1$; when $S_a'$ is negative or 0, at least part of the data in $W_{j1}[p_{j1}-169,\,p_{j1}]$ does not satisfy $C_1$. The probability that $S_a'$ is positive is 1/2.
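The signed-sum test over 40 bits can be illustrated with the short sketch below. The function names, the seed, the standard-normal parameters, and the most-significant-bit-first ordering are assumptions made for the illustration; the text only fixes the mapping of bit 0 to −1 and bit 1 to +1, the 40 normally distributed numbers, and the sign test.

```python
import random

def make_weights(seed, n_bits=40):
    """Draw n_bits numbers from a standard normal distribution (seed is hypothetical)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_bits)]

# The same weights h_1..h_40 must be reused for every window that is judged.
H = make_weights(seed=7)

def bytes_to_bits(selected):
    """Expand bytes into bits, most significant bit first (bit order is an assumption)."""
    return [(b >> (7 - k)) & 1 for b in selected for k in range(8)]

def signed_sum(selected, weights=H):
    """Map bit 0 to -1 and bit 1 to +1, then take the weighted sum S."""
    bits = bytes_to_bits(selected)
    return sum((1 if bit else -1) * h for bit, h in zip(bits, weights))

def satisfies_condition(selected, weights=H):
    """The condition holds only when S is strictly positive."""
    return signed_sum(selected, weights) > 0
```

Flipping every bit of the five selected bytes negates the sum, which is one way to see why the condition holds for about half of all inputs.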
The manner of judging whether at least part of the data in $W_{i2}[p_{i2}-169,\,p_{i2}]$ satisfies the predetermined condition $C_2$ is the same as the manner of judging whether at least part of the data in $W_{j2}[p_{j2}-169,\,p_{j2}]$ satisfies $C_2$. Therefore, as shown in FIG. 17, each marked byte represents one byte selected when judging whether at least part of the data in window $W_{j2}[p_{j2}-169,\,p_{j2}]$ satisfies $C_2$. In FIG. 17 the selected bytes carry the sequence numbers 170, 128, 86, 44 and 2, and two adjacent selected bytes are 42 bytes apart. These five bytes are regarded in order as 40 bits, denoted $b_1', b_2', b_3', b_4', \ldots, b_{40}'$. For any $b_t'$ among them, $V_{b_t}'=-1$ when $b_t'=0$ and $V_{b_t}'=1$ when $b_t'=1$; this correspondence yields $V_{b_1}', V_{b_2}', V_{b_3}', V_{b_4}', \ldots, V_{b_{40}}'$. Because the manner of judgment is the same for $W_{i2}[p_{i2}-169,\,p_{i2}]$ and $W_{j2}[p_{j2}-169,\,p_{j2}]$, the same random numbers $h_1, h_2, h_3, h_4, \ldots, h_{40}$ are used: $S_b' = V_{b_1}'h_1 + V_{b_2}'h_2 + V_{b_3}'h_3 + V_{b_4}'h_4 + \cdots + V_{b_{40}}'h_{40}$. Because $h_1, h_2, h_3, h_4, \ldots, h_{40}$ follow a normal distribution, $S_b'$ also follows a normal distribution. When $S_b'$ is positive, at least part of the data in $W_{j2}[p_{j2}-169,\,p_{j2}]$ satisfies $C_2$; when $S_b'$ is negative or 0, at least part of the data in $W_{j2}[p_{j2}-169,\,p_{j2}]$ does not satisfy $C_2$. The probability that $S_b'$ is positive is 1/2.
Similarly, the manner of judging whether at least part of the data in $W_{i3}[p_{i3}-169,\,p_{i3}]$ satisfies the predetermined condition $C_3$ is the same as the manner of judging whether at least part of the data in $W_{j3}[p_{j3}-169,\,p_{j3}]$ satisfies $C_3$. In the same way, it is judged whether at least part of the data in each window $W_{jx}[p_{jx}-169,\,p_{jx}]$ satisfies the corresponding predetermined condition $C_x$, for $x = 4, 5, \ldots, 11$; the details are not repeated here.
Still taking the implementation shown in FIG. 5 as an example, another method is provided for judging whether at least part of the data in window $W_{iz}[p_{iz}-A_z,\,p_{iz}+B_z]$ satisfies the predetermined condition $C_z$. In this embodiment, a random function is used for the judgment. According to the rule preset on the deduplication server 103, point $p_{i1}$ and its corresponding window $W_{i1}[p_{i1}-169,\,p_{i1}]$ are determined for the potential split point $k_i$, and it is judged whether at least part of the data in $W_{i1}[p_{i1}-169,\,p_{i1}]$ satisfies the predetermined condition $C_1$. As shown in FIG. 16, $W_{i1}$ denotes the window $W_{i1}[p_{i1}-169,\,p_{i1}]$, and 5 bytes are selected to make the judgment. In FIG. 16, the bytes "■" with sequence numbers 169, 127, 85, 43 and 1 each represent one selected byte, and two adjacent selected bytes are 42 bytes apart. The five selected bytes are converted into one decimal number in the range 0 to $2^{40}-1$. A uniformly distributed random number generator is used to generate one assigned value for each decimal number in the range 0 to $2^{40}-1$, and the correspondence R between each decimal number and its assigned value is recorded. Once assigned, the value corresponding to a given decimal number never changes, and the assigned values follow a uniform distribution. If the assigned value is even, at least part of the data in $W_{i1}[p_{i1}-169,\,p_{i1}]$ satisfies $C_1$; if the assigned value is odd, at least part of the data in $W_{i1}[p_{i1}-169,\,p_{i1}]$ does not satisfy $C_1$. $C_1$ thus denotes that the assigned value obtained by the above method is even. Because a uniformly distributed random number is even with probability 1/2, the probability that at least part of the data in $W_{i1}[p_{i1}-169,\,p_{i1}]$ satisfies $C_1$ is 1/2. In the implementation shown in FIG. 5, the same rule is used to judge whether at least part of the data in each window $W_{ix}[p_{ix}-169,\,p_{ix}]$ satisfies the corresponding predetermined condition $C_x$, for $x = 2, 3, 4, 5$; the details are not repeated here.
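A minimal sketch of the fixed value assignment follows. A full table over $2^{40}$ keys is impractical to materialize, so this illustration fills the correspondence lazily; the class name, the seed, the value range, and the big-endian byte order are all assumptions rather than details from the text.

```python
import random

class AssignedValues:
    """Lazily built stand-in for the correspondence R: each key gets one fixed value."""

    def __init__(self, seed=0, upper=2**32):
        self._rng = random.Random(seed)   # uniform generator (seed is hypothetical)
        self._upper = upper               # assumed range of the assigned values
        self._table = {}

    def lookup(self, key):
        # Assign a value the first time a key is seen; never change it afterwards.
        if key not in self._table:
            self._table[key] = self._rng.randrange(self._upper)
        return self._table[key]

def bytes_to_number(selected):
    """Interpret the five selected bytes as one integer in 0..2**40-1 (big-endian assumed)."""
    return int.from_bytes(bytes(selected), "big")

def satisfies_condition(selected, mapping):
    """The condition holds when the value assigned to the 40-bit number is even."""
    return mapping.lookup(bytes_to_number(selected)) % 2 == 0
```

In this lazy form the assigned values depend on the order in which keys are first seen, so two runs over different streams would disagree. A deployment that needs identical split points across runs would presumably need a mapping that is fixed in advance, for example a deterministic function of the key.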
When at least part of the data in $W_{i5}[p_{i5}-169,\,p_{i5}]$ does not satisfy the predetermined condition $C_5$, 11 bytes are skipped from point $p_{i5}$ along the data stream split point search direction, and the current potential split point $k_j$ is obtained at the end position of the 11th byte. As shown in FIG. 6, according to the rule preset on the deduplication server 103, point $p_{j1}$ and its corresponding window $W_{j1}[p_{j1}-169,\,p_{j1}]$ are determined for the potential split point $k_j$. The manner of judging whether at least part of the data in window $W_{j1}[p_{j1}-169,\,p_{j1}]$ satisfies $C_1$ is the same as for window $W_{i1}[p_{i1}-169,\,p_{i1}]$, so the same correspondence R between each decimal number in the range 0 to $2^{40}-1$ and its assigned value is used. As shown in FIG. 17, $W_{j1}$ denotes the window $W_{j1}[p_{j1}-169,\,p_{j1}]$, and 5 bytes are selected to make the judgment. In FIG. 17, each "■" represents one selected byte, and two adjacent selected bytes "■" are 42 bytes apart. The bytes "■" with sequence numbers 169, 127, 85, 43 and 1 are converted into one decimal number, and the assigned value corresponding to that decimal number is looked up in R. If the assigned value is even, at least part of the data in $W_{j1}[p_{j1}-169,\,p_{j1}]$ satisfies $C_1$; if the assigned value is odd, at least part of the data in $W_{j1}[p_{j1}-169,\,p_{j1}]$ does not satisfy $C_1$. Because a uniformly distributed random number is even with probability 1/2, the probability that at least part of the data in $W_{j1}[p_{j1}-169,\,p_{j1}]$ satisfies $C_1$ is 1/2.

Similarly, the manner of judging whether at least part of the data in $W_{i2}[p_{i2}-169,\,p_{i2}]$ satisfies the predetermined condition $C_2$ is the same as for $W_{j2}[p_{j2}-169,\,p_{j2}]$, and the manner of judging whether at least part of the data in $W_{i3}[p_{i3}-169,\,p_{i3}]$ satisfies $C_3$ is the same as for $W_{j3}[p_{j3}-169,\,p_{j3}]$. In the same way, it is judged whether at least part of the data in each window $W_{jx}[p_{jx}-169,\,p_{jx}]$ satisfies the corresponding predetermined condition $C_x$, for $x = 4, 5, \ldots, 11$; the details are not repeated here.
The deduplication server 103 in the embodiment of the present invention shown in FIG. 1 is an apparatus capable of implementing the technical solutions described in the embodiments of the present invention. As shown in FIG. 18, it generally includes a central processing unit, a main memory, and an input/output interface. The central processing unit, the main memory, and the input/output interface communicate with one another; the main memory stores executable instructions, and the central processing unit executes the executable instructions stored in the main memory to perform specific functions, such as searching for data stream split points as described with reference to FIG. 4 to FIG. 17. Accordingly, as shown in FIG. 19 and in line with the embodiments shown in FIG. 4 to FIG. 17, a rule is preset on the deduplication server 103. The rule is: for a potential split point $k$, determine M points $p_x$, the window $W_x[p_x-A_x,\,p_x+B_x]$ corresponding to each point $p_x$, and the predetermined condition $C_x$ corresponding to the window $W_x[p_x-A_x,\,p_x+B_x]$, where $x$ takes the consecutive natural numbers from 1 to M, $M \ge 2$, and $A_x$ and $B_x$ are integers. The deduplication server 103 includes a determining unit 1901 and a judgment processing unit 1902. The determining unit 1901 is configured to perform step a): a) according to the rule, determine, for the current potential split point $k_i$, a point $p_{iz}$ and the window $W_{iz}[p_{iz}-A_z,\,p_{iz}+B_z]$ corresponding to the point $p_{iz}$, where $i$ and $z$ are integers and $1 \le z \le M$. The judgment processing unit 1902 is configured to judge whether at least part of the data in the window $W_{iz}[p_{iz}-A_z,\,p_{iz}+B_z]$ satisfies the predetermined condition $C_z$;

when at least part of the data in the window $W_{iz}[p_{iz}-A_z,\,p_{iz}+B_z]$ does not satisfy the predetermined condition $C_z$, N minimum search units U of the data stream split point are skipped from the point $p_{iz}$ along the data stream split point search direction, where $N \cdot U$ is not greater than $\lVert B_z \rVert + \max_x(\lVert A_x \rVert + \lVert k_i - p_{ix} \rVert)$, to obtain a new potential split point, and the determining unit performs step a) for the new potential split point; when at least part of the data in each window $W_{ix}[p_{ix}-A_x,\,p_{ix}+B_x]$ of the M windows of the current potential split point $k_i$ satisfies the predetermined condition $C_x$, the current potential split point $k_i$ is a data stream split point.
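The interplay of step a), the per-window judgment, and the skip of $N \cdot U$ can be sketched as a loop. This is a simplified illustration under stated assumptions: U is one byte, every window has $B_x = 0$ and the same $A_x$ (`a`), each point sits `offsets[z]` bytes before the potential split point, and `check` is a hypothetical callback standing in for the predetermined conditions. The lower bound on `skip` is an added assumption to guarantee forward progress; only the upper bound comes from the text.

```python
def find_split_point(data, offsets, a, skip, check):
    """Return the first potential split point whose windows all pass `check`, else None.

    data    -- the byte stream being searched
    offsets -- distance of each point p_z behind the potential split point k
    a       -- window reach A, so each window is data[p - a .. p] with B = 0
    skip    -- N*U in bytes, applied from the failing point p_z
    check   -- check(z, window) -> bool, the stand-in for condition C_z
    """
    max_d = max(offsets)
    # Upper bound from the text: N*U <= |B_z| + max_x(|A_x| + |k - p_x|), with B = 0.
    if not max_d < skip <= a + max_d:
        raise ValueError("skip must exceed every offset and respect the N*U bound")
    k = a + max_d                      # first position where every window fits
    while k < len(data):
        failed_at = None
        for z, d in enumerate(offsets):
            p = k - d                  # point p_z for this potential split point
            window = data[p - a : p + 1]
            if not check(z, window):
                failed_at = p
                break
        if failed_at is None:
            return k                   # all M windows satisfied their conditions
        k = failed_at + skip           # jump N*U from the failing point
    return None
```

With 11 points, `a = 169` and `skip = 11`, the loop would mirror the worked example in which a failure at the fifth window moves the search 11 bytes beyond that window's point.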
Further, the rule also includes: at least two points $p_e$ and $p_f$ satisfy the conditions $A_e = A_f$, $B_e = B_f$ and $C_e = C_f$. Further, the rule also includes: relative to the potential split point $k$, the at least two points $p_e$ and $p_f$ lie in the direction opposite to the data stream split point search direction.
Further, the rule also includes: the distance between the at least two points $p_e$ and $p_f$ is 1 U.
Further, the judgment processing unit 1902 is specifically configured to use a random function to judge whether at least part of the data in the window $W_{iz}[p_{iz}-A_z,\,p_{iz}+B_z]$ satisfies the predetermined condition $C_z$. Specifically, the judgment processing unit 1902 is configured to use a hash function to judge whether at least part of the data in the window $W_{iz}[p_{iz}-A_z,\,p_{iz}+B_z]$ satisfies the predetermined condition $C_z$. Specifically, using a random function to judge whether at least part of the data in the window $W_{iz}[p_{iz}-A_z,\,p_{iz}+B_z]$ satisfies the predetermined condition $C_z$ includes:
Select F bytes in the window $W_{iz}[p_{iz}-A_z,\,p_{iz}+B_z]$ and use the F bytes repeatedly H times to obtain F*H bytes in total. Each byte consists of 8 bits, denoted $a_{m,1}, \dots, a_{m,8}$, which are bits 1 to 8 of the $m$-th byte of the F*H bytes, so the bits of the F*H bytes can be represented as $$\begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,8} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,8} \\ \vdots & \vdots & & \vdots \\ a_{F*H,1} & a_{F*H,2} & \cdots & a_{F*H,8} \end{pmatrix}$$ When $a_{m,n}=1$, $V_{am,n}=1$; when $a_{m,n}=0$, $V_{am,n}=-1$, where $a_{m,n}$ denotes any one of $a_{m,1}, \dots, a_{m,8}$. Converting the bits of the F*H bytes according to this relationship between $a_{m,n}$ and $V_{am,n}$ gives the matrix $V_a$: $$V_a = \begin{pmatrix} V_{a1,1} & V_{a1,2} & \cdots & V_{a1,8} \\ V_{a2,1} & V_{a2,2} & \cdots & V_{a2,8} \\ \vdots & \vdots & & \vdots \\ V_{aF*H,1} & V_{aF*H,2} & \cdots & V_{aF*H,8} \end{pmatrix}$$ Select F*H*8 random numbers from random numbers that follow a normal distribution to form the matrix $R$: $$R = \begin{pmatrix} h_{1,1} & h_{1,2} & \cdots & h_{1,8} \\ h_{2,1} & h_{2,2} & \cdots & h_{2,8} \\ \vdots & \vdots & & \vdots \\ h_{F*H,1} & h_{F*H,2} & \cdots & h_{F*H,8} \end{pmatrix}$$ Multiply the entries of the $m$-th row of $V_a$ by the corresponding random numbers in the $m$-th row of $R$ and sum the products to obtain one value, $S_{am} = V_{am,1} h_{m,1} + V_{am,2} h_{m,2} + \dots + V_{am,8} h_{m,8}$. In the same way obtain $S_{a1}, S_{a2}, \dots, S_{aF*H}$, and count the number $K$ of these values that are greater than 0. When $K$ is even, at least part of the data in the window $W_{iz}[p_{iz}-A_z,\,p_{iz}+B_z]$ satisfies the predetermined condition $C_z$.
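The random-function test above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the description does not say which F bytes are selected or how bits are numbered, so the sketch takes the first F bytes of the window and treats bit 1 as the most significant bit. The names `make_random_matrix` and `window_satisfies` and the fixed seed are hypothetical.

```python
import random
from typing import List


def make_random_matrix(rows: int, seed: int = 0) -> List[List[float]]:
    """Build R: `rows` x 8 numbers drawn from a standard normal distribution.

    The fixed seed is an assumption, used so that every window is judged
    against the same matrix R.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(rows)]


def window_satisfies(window: bytes, r: List[List[float]], f: int, h: int) -> bool:
    """Return True when the count K of positive row sums S_am is even."""
    if len(window) < f or len(r) != f * h:
        raise ValueError("window needs at least F bytes and R needs F*H rows")
    selected = window[:f]    # assumption: the F bytes are the first F bytes
    repeated = selected * h  # reuse the F bytes H times -> F*H bytes
    k = 0
    for m, byte in enumerate(repeated):
        s = 0.0
        for n in range(8):
            bit = (byte >> (7 - n)) & 1        # assumption: bit 1 is the MSB
            v = 1.0 if bit else -1.0           # a = 1 -> V = 1, a = 0 -> V = -1
            s += v * r[m][n]                   # accumulate S_am
        if s > 0:
            k += 1
    return k % 2 == 0


if __name__ == "__main__":
    r = make_random_matrix(rows=42 * 5)
    print(window_satisfies(bytes(range(169)), r, f=42, h=5))
```

Because each row sum is positive with probability of about one half for typical data, a parity test on K gives a condition that holds roughly half of the time; other acceptance rules on K would give other probabilities.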
Further, the judgment processing unit 1902 is configured so that, when at least part of the data in the window $W_{iz}[p_{iz}-A_z,\,p_{iz}+B_z]$ does not satisfy the predetermined condition $C_z$, the search jumps from the point $p_{iz}$ along the data stream split point search direction by $N$ minimum search units $U$ to obtain the new potential split point, and the determining unit 1901 performs step a) for the new potential split point. According to the rule, the left boundary of the window $W_{ic}[p_{ic}-A_c,\,p_{ic}+B_c]$ corresponding to the point $p_{ic}$ determined for the new potential split point coincides with the right boundary of the window $W_{iz}[p_{iz}-A_z,\,p_{iz}+B_z]$, or lies within the range of the window $W_{iz}[p_{iz}-A_z,\,p_{iz}+B_z]$. Here, the window $W_{ic}[p_{ic}-A_c,\,p_{ic}+B_c]$ corresponds to the first-ranked point in the sequence obtained by ordering, along the data stream search direction, the M points determined for the new potential split point according to the rule.
In the server-based method for searching for a data stream split point provided by the embodiments of the present invention shown in Figures 4 to 17, a point $p_{ix}$ and its window $W_{ix}[p_{ix}-A_x,\,p_{ix}+B_x]$ are determined for the potential split point $k_i$, where $x$ takes the consecutive natural numbers from 1 to M and M ≥ 2. Whether at least part of the data in each of the M windows satisfies the predetermined condition $C_x$ may be judged in parallel, or the windows may be judged one after another. Alternatively, the window $W_{i1}[p_{i1}-A_1,\,p_{i1}+B_1]$ may be judged first; only when at least part of its data satisfies the predetermined condition $C_1$ is $W_{i2}[p_{i2}-A_2,\,p_{i2}+B_2]$ judged against $C_2$, and so on until $W_{iM}[p_{iM}-A_M,\,p_{iM}+B_M]$ is judged against $C_M$. Other windows in the embodiments are judged in the same way, and details are not repeated here.
In addition, according to the embodiments of the present invention shown in Figures 4 to 17, in practical applications a rule is preset on the deduplication server 103: for a potential split point $k$, determine M points $p_x$, the window $W_x[p_x-A_x,\,p_x+B_x]$ corresponding to each point $p_x$, and the predetermined condition $C_x$ corresponding to each window, where $x$ takes the consecutive natural numbers from 1 to M and M ≥ 2. In this preset rule, $A_1, A_2, A_3, \dots, A_M$ need not all be equal, $B_1, B_2, B_3, \dots, B_M$ need not all be equal, and $C_1, C_2, C_3, \dots, C_M$ need not all be the same.
In the implementation shown in Figure 5, the windows $W_{i1}[p_{i1}-169,\,p_{i1}]$, $W_{i2}[p_{i2}-169,\,p_{i2}]$, $W_{i3}[p_{i3}-169,\,p_{i3}]$, $W_{i4}[p_{i4}-169,\,p_{i4}]$, $W_{i5}[p_{i5}-169,\,p_{i5}]$, $W_{i6}[p_{i6}-169,\,p_{i6}]$, $W_{i7}[p_{i7}-169,\,p_{i7}]$, $W_{i8}[p_{i8}-169,\,p_{i8}]$, $W_{i9}[p_{i9}-169,\,p_{i9}]$, $W_{i10}[p_{i10}-169,\,p_{i10}]$ and $W_{i11}[p_{i11}-169,\,p_{i11}]$ all have the same size, 169 bytes, and the way of judging whether at least part of the data in a window satisfies the predetermined condition is also the same for every window; see the earlier description of judging whether at least part of the data in $W_{i1}[p_{i1}-169,\,p_{i1}]$ satisfies the predetermined condition $C_1$. In the implementation shown in Figure 11, however, the windows $W_{i1}[p_{i1}-169,\,p_{i1}]$ through $W_{i10}[p_{i10}-169,\,p_{i10}]$ and the window $W_{i11}[p_{i11}-182,\,p_{i11}]$ may differ in size, and the way of judging whether at least part of the data in a window satisfies the predetermined condition may also differ from window to window.
In all embodiments, according to the rule preset on the deduplication server 103, the way of judging whether at least part of the data in the window $W_{i1}$ satisfies the predetermined condition $C_1$ is necessarily the same as the way of judging whether at least part of the data in the window $W_{j1}$ satisfies $C_1$; the way of judging $W_{i2}$ against $C_2$ is necessarily the same as the way of judging $W_{j2}$ against $C_2$; and so on, up to the way of judging $W_{iM}$ against $C_M$, which is necessarily the same as the way of judging $W_{jM}$ against $C_M$. Details are not repeated here. Although the embodiments shown in Figures 4 to 17 all take M = 11 as an example, the value of M is not limited to 11; a person skilled in the art determines the value of M according to actual needs and the description in the embodiments of the present invention.
According to the embodiments of the present invention shown in Figures 4 to 17, a rule is preset on the deduplication server 103. $k_a$, $k_i$, $k_j$, $k_l$ and $k_m$ are potential split points obtained while searching for a split point along the data stream split point search direction, and all of them follow this rule. The window $W_x[p_x-A_x,\,p_x+B_x]$ in the embodiments of the present invention denotes a specific range from which data is selected in order to judge whether that data satisfies the predetermined condition $C_x$. Specifically, either part of the data or all of the data in the range may be selected for this judgment. Wherever the embodiments of the present invention use a window, the concept is the same as for the window $W_x[p_x-A_x,\,p_x+B_x]$, and details are not repeated here.
According to the embodiments of the present invention shown in Figures 4 to 17, in the window $W_x[p_x-A_x,\,p_x+B_x]$, $(p_x-A_x)$ and $(p_x+B_x)$ denote the two boundaries of the window: $(p_x-A_x)$ is the boundary that lies, relative to the point $p_x$, in the direction opposite to the data stream split point search direction, and $(p_x+B_x)$ is the boundary that lies, relative to the point $p_x$, in the search direction. Specifically, in the embodiments of the present invention the search direction shown in Figures 3 to 15 is from left to right, so $(p_x-A_x)$ is the left boundary of the window $W_x[p_x-A_x,\,p_x+B_x]$ and $(p_x+B_x)$ is its right boundary. If the search direction in Figures 3 to 15 were from right to left, $(p_x-A_x)$ would be the right boundary of the window and $(p_x+B_x)$ would be its left boundary.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in the embodiments of the present invention may be combined with other technologies and presented in more complex forms that still contain the key features of the present invention. In a real environment a backup split point may be used. For example, in one implementation, according to the rule preset on the deduplication server 103, 11 points $p_x$ are determined for the potential split point $k_i$, where $x$ takes the consecutive natural numbers from 1 to 11, together with the window $W_x[p_x-A_x,\,p_x+B_x]$ corresponding to each $p_x$ and the predetermined condition $C_x$ corresponding to each window. When at least part of the data in each of the 11 windows $W_x[p_x-A_x,\,p_x+B_x]$ satisfies the predetermined condition $C_x$, the potential split point $k_i$ is a data stream split point. When the set maximum data block is exceeded and no split point has been found, a backup preset rule may be used. The backup preset rule is similar to the rule preset on the deduplication server 103. For example, the backup rule is: determine 10 points $p_x$ for the potential split point $k_i$, where $x$ takes the consecutive natural numbers from 1 to 10, together with the window $W_x[p_x-A_x,\,p_x+B_x]$ corresponding to each $p_x$ and the predetermined condition $C_x$ corresponding to each window; when at least part of the data in each of the 10 windows satisfies the predetermined condition $C_x$, the potential split point $k_i$ is a data stream split point. When the set maximum data block is exceeded and a data stream split point still has not been found, the end position of the maximum data block is used as a forced split point.
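The fallback order described above (primary rule, then backup rule, then a forced split at the end of the maximum data block) can be sketched as below. This is a minimal illustration under assumptions: `primary` and `backup` are hypothetical callables that each search the range from `start` to `start + max_block` and return a split position or `None`; how each one searches is outside the sketch.

```python
from typing import Callable, Optional

# A finder searches [start, end] and returns a split position, or None.
Finder = Callable[[int, int], Optional[int]]


def select_split_point(start: int, max_block: int,
                       primary: Finder, backup: Finder) -> int:
    """Pick a split point, falling back from primary to backup to forced."""
    end = start + max_block
    cut = primary(start, end)      # e.g. the rule with 11 windows
    if cut is None:
        cut = backup(start, end)   # e.g. the backup rule with 10 windows
    if cut is None:
        cut = end                  # forced split at the end of the maximum block
    return cut


if __name__ == "__main__":
    never = lambda s, e: None
    print(select_split_point(0, 12 * 1024, never, never))
```

A backup rule with fewer windows is easier to satisfy, so it is more likely to yield a content-defined split point before the forced split has to be used.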
A rule is preset on the deduplication server 103 in which M points are determined for a potential split point $k$. This does not necessarily require that a potential split point $k$ exist first; the potential split point $k$ may instead be determined from the M points.
An embodiment of the present invention provides a method for searching for a data stream split point based on a deduplication server. As shown in Figure 20, the method includes:
A rule is preset on the deduplication server 103: for a potential split point $k$, determine M windows $W_x[k-A_x,\,k+B_x]$ and the predetermined condition $C_x$ corresponding to each window, where $x$ takes the consecutive natural numbers from 1 to M, M ≥ 2, and $A_x$ and $B_x$ are integers. Regarding the value of M in the implementation shown in Figure 3, in one implementation M*U is not greater than the preset maximum distance between two adjacent data stream split points, that is, the preset maximum data block length. It is judged whether at least part of the data in a window $W_z[k-A_z,\,k+B_z]$ satisfies the predetermined condition $C_z$, where $z$ is an integer, $1 \le z \le M$, and $(k-A_z)$ and $(k+B_z)$ denote the two boundaries of the window $W_z$. When at least part of the data in any one window $W_z[k-A_z,\,k+B_z]$ does not satisfy the predetermined condition $C_z$, the search jumps N bytes from the potential split point $k$ along the data stream split point search direction, where $N \le \|B_z\| + \max_x(\|A_x\|)$. Here $\|B_z\|$ denotes the absolute value of $B_z$ in $W_z[k-A_z,\,k+B_z]$, and $\max_x(\|A_x\|)$ denotes the largest absolute value of $A_x$ over the M windows; the principle behind the value of N is described in the embodiments below. When at least part of the data in each of the M windows $W_x[k-A_x,\,k+B_x]$ satisfies the predetermined condition $C_x$, the potential split point $k$ is a data stream split point.
Specifically, for the current potential split point $k_i$, the following steps are performed according to the rule:
Step 2001: According to the rule, determine the corresponding window $W_{iz}[k_i-A_z,\,k_i+B_z]$ for the current potential split point $k_i$, where $i$ and $z$ are integers and $1 \le z \le M$.
Step 2002: Judge whether at least part of the data in the window $W_{iz}[k_i-A_z,\,k_i+B_z]$ satisfies the predetermined condition $C_z$.
When at least part of the data in the window $W_{iz}[k_i-A_z,\,k_i+B_z]$ does not satisfy the predetermined condition $C_z$, jump from the current potential split point $k_i$ along the data stream split point search direction by $N$ minimum search units $U$, where $N \cdot U$ is not greater than $\|B_z\| + \max_x(\|A_x\|)$, to obtain a new potential split point, and perform step 2001 for it.
When at least part of the data in each window $W_{ix}[k_i-A_x,\,k_i+B_x]$ of the M windows of the current potential split point $k_i$ satisfies the predetermined condition $C_x$, the current potential split point $k_i$ is a data stream split point.
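Steps 2001 and 2002 can be sketched as the loop below. It is a minimal sketch, assuming a left-to-right search over a byte string, windows read as the half-open slice `data[k - A_x : k + B_x]`, windows judged one after another, and conditions supplied as arbitrary callables; the names `find_split_point`, `jump_bound` and `jump_units` are hypothetical. When `jump_units` is not given, the sketch jumps by the largest length the bound allows, which is only one possible choice of N.

```python
from typing import Callable, Optional, Sequence, Tuple

Window = Tuple[int, int]             # (A_x, B_x) for the window W_x[k - A_x, k + B_x]
Condition = Callable[[bytes], bool]  # stands in for the predetermined condition C_x


def jump_bound(windows: Sequence[Window], z: int) -> int:
    """Upper bound ||B_z|| + max_x(||A_x||) on the jump length N*U."""
    return abs(windows[z][1]) + max(abs(a) for a, _ in windows)


def find_split_point(data: bytes, start: int, windows: Sequence[Window],
                     conditions: Sequence[Condition],
                     jump_units: Optional[int] = None,
                     unit: int = 1) -> Optional[int]:
    """Return the first potential split point whose M windows all pass."""
    k = start
    while True:
        failed = None
        # Step 2001: determine each window for the current potential split point.
        for z, ((a, b), cond) in enumerate(zip(windows, conditions)):
            lo, hi = k - a, k + b
            if lo < 0 or hi > len(data) or lo > hi:
                return None          # window leaves the data: nothing found
            # Step 2002: judge the window against its condition.
            if not cond(data[lo:hi]):
                failed = z
                break
        if failed is None:
            return k                 # every window passed: k is a split point
        bound = jump_bound(windows, failed)
        n = bound // unit if jump_units is None else jump_units
        if n < 1 or n * unit > bound:
            raise ValueError("N*U must be at least U and at most the bound")
        k += n * unit                # new potential split point, back to step 2001


if __name__ == "__main__":
    no_zero = lambda w: 0 not in w   # toy condition, not from the description
    data = bytes([1, 0, 1, 1, 1, 1, 1])
    print(find_split_point(data, 3, [(2, 0), (3, -1)], [no_zero, no_zero], jump_units=1))
```

A smaller N re-examines more positions but cannot step over a position that might qualify, while a larger N does less work per byte; the description only fixes the upper bound.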
Further, the rule further includes: at least two windows $W_{ie}[k_i-A_e,\,k_i+B_e]$ and $W_{if}[k_i-A_f,\,k_i+B_f]$ satisfy the conditions $|A_e+B_e| = |A_f+B_f|$ and $C_e = C_f$. Further, the rule further includes: $A_e$ and $A_f$ are positive integers. Still further, the rule further includes: $A_e - 1 = A_f$ and $B_e + 1 = B_f$. Here $|A_e+B_e|$ denotes the size of the window $W_{ie}$, and $|A_f+B_f|$ denotes the size of the window $W_{if}$.
Further, judging whether at least part of the data in the window $W_{iz}[k_i-A_z,\,k_i+B_z]$ satisfies the predetermined condition $C_z$ specifically includes: using a random function to judge whether at least part of the data in the window $W_{iz}[k_i-A_z,\,k_i+B_z]$ satisfies the predetermined condition $C_z$. Still further, the random function used for this judgment is specifically a hash function.
When at least part of the data in the window $W_{iz}[k_i-A_z,\,k_i+B_z]$ does not satisfy the predetermined condition $C_z$, the search jumps from the current potential split point $k_i$ along the data stream split point search direction by $N$ minimum search units $U$ to obtain the new potential split point. According to the rule, the left boundary of the window $W_{ic}[k_i-A_c,\,k_i+B_c]$ determined for the new potential split point coincides with the right boundary of the window $W_{iz}[k_i-A_z,\,k_i+B_z]$, or lies within the range of the window $W_{iz}[k_i-A_z,\,k_i+B_z]$. Here, the window $W_{ic}[k_i-A_c,\,k_i+B_c]$ is the first-ranked window in the sequence obtained by ordering, along the data stream search direction, the M windows determined for the new potential split point according to the rule.
In the embodiments of the present invention, a data stream split point is searched for by judging whether at least part of the data in a window of the M windows satisfies the predetermined condition. When at least part of the data in a window does not satisfy the predetermined condition, a length of N*U is skipped, where N*U is not greater than $\|B_z\| + \max_x(\|A_x\|)$, to obtain the next potential split point, which improves the efficiency of searching for data stream split points.
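The boundary relationship between the failed window and the first window of the new potential split point can be checked with a few lines of Python. This is a sketch under assumptions: left-to-right search, U = 1 byte, and the "first-ranked" window taken to be the one with the smallest left boundary. The example values use the window layout A_x = 169 … 179, B_x = 0 … −10 that the description gives for Figure 21; the function names and the position 5000 are hypothetical.

```python
from typing import List, Tuple


def windows_at(k: int, rule: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Boundaries (k - A_x, k + B_x) of every window of potential split point k."""
    return [(k - a, k + b) for a, b in rule]


def first_window_touches_failed(k_i: int, z: int, n_units: int,
                                rule: List[Tuple[int, int]], unit: int = 1) -> bool:
    """True when the left boundary of the new point's first window coincides
    with the right boundary of the failed window or lies inside that window."""
    failed_left, failed_right = windows_at(k_i, rule)[z]
    k_new = k_i + n_units * unit
    # Assumption: for left-to-right search, the first window is the leftmost one.
    first_left = min(left for left, _ in windows_at(k_new, rule))
    return failed_left <= first_left <= failed_right


if __name__ == "__main__":
    rule = [(169 + i, -i) for i in range(11)]   # A_1..A_11 and B_1..B_11
    # Window 5 (index 4) fails at k_i = 5000 and the search jumps N = 7 bytes.
    print(first_window_touches_failed(5000, 4, 7, rule))
```

With these values the failed window is [4827, 4996] and the first window of the new point starts at 4828, so no data between the two potential split points is left unexamined.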
In the deduplication process, to keep data block sizes uniform, the average data block size (also called the average chunk size) is taken into account: in addition to satisfying the minimum and maximum data block size limits, an average data block size is determined so that the resulting data blocks are of uniform size. Two factors, the number M of windows $W_x[k-A_x,\,k+B_x]$ and the probability that at least part of the data in a window $W_x[k-A_x,\,k+B_x]$ satisfies the preset condition, determine the probability of finding a data stream split point (characterized by $P(n)$, defined below). The former affects the jump length and the latter affects the jump probability; together they determine the average chunk size. In general, for a fixed average chunk size, as the number of windows $W_x[k-A_x,\,k+B_x]$ increases, the probability that at least part of the data in a single window satisfies the predetermined condition also increases. For example, one rule preset on the deduplication server 103 is: determine 11 windows $W_x[k-A_x,\,k+B_x]$ for the potential split point $k$, where $x$ takes the consecutive natural numbers from 1 to 11, and the probability that at least part of the data in any one of the 11 windows satisfies the preset condition is 1/2. Another rule preset on the deduplication server 103 is: determine 24 windows $W_x[k-A_x,\,k+B_x]$ for the potential split point $k$, where $x$ takes the consecutive natural numbers from 1 to 24, and the probability that at least part of the data in any one of the 24 windows satisfies the preset condition is 3/4. For how this probability is set, see the description of judging whether at least part of the data in the window $W_x[k-A_x,\,k+B_x]$ satisfies the preset condition.
These two factors determine $P(n)$, where $P(n)$ denotes the probability that no data stream split point has been found after searching $n$ minimum search units from the start of the data stream or from the previous data stream split point. The calculation of $P(n)$ from these two factors is in effect a multi-step Fibonacci sequence, which is described in detail later. Once $P(n)$ is obtained, $1-P(n)$ is the distribution function of the data stream split point, and $(1-P(n))-(1-P(n-1)) = P(n-1)-P(n)$ is the probability of finding a data stream split point at the $n$-th minimum search unit, that is, the density function of the data stream split point. Integrating this density function gives the expected length between data stream split points, that is, the average chunk size; in that integral, 4*1024 (bytes) is the minimum data block length and 12*1024 (bytes) is the maximum data block length.
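As a rough illustration of why more windows call for a higher per-window probability, the short calculation below compares the two example rules. It assumes, purely for illustration, that the M window judgments at one potential split point are independent; the description does not state this, and the overlapping windows may not satisfy it. The function name is hypothetical, and the result is the chance that one given position passes every window, not the average chunk size, which also depends on the jumps through $P(n)$.

```python
from fractions import Fraction


def all_windows_pass(m: int, p: Fraction) -> Fraction:
    """Probability that all m windows pass, assuming independent judgments."""
    return p ** m


rule_11 = all_windows_pass(11, Fraction(1, 2))   # 11 windows, probability 1/2 each
rule_24 = all_windows_pass(24, Fraction(3, 4))   # 24 windows, probability 3/4 each

if __name__ == "__main__":
    print(rule_11, float(rule_11))
    print(rule_24, float(rule_24))
```

Under this assumption both rules accept a given position with a probability of the same order of magnitude, which is consistent with the statement that the per-window probability rises as the window count rises.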
在图3所示的数据流分割点查找的基础上,在图21所示的实施方式中,在去重服务器103上预设有规则,所述规则为:为潜在分割点k确定11个窗口Wx[k-Ax,k+Bx]和窗口Wx[k-Ax,k+Bx]对应的预定条件Cx,其中,x为1到11连续的自然数,Ax和Bx为整数。其中,A1=169,B1=0;A2=170,B2=-1;A3=171,B3=-2;A4=172,B4=-3;A5=173,B5=-4;A6=174,B6=-5;A7=175,B7=-6;A8=176,B8=-7;A9=177,B9=-8;A10=178,B10=-9;A11=179,B11=-10,并且C1=C2=C3=C4=C5=C6=C7=C8=C9=C10=C11,则11个窗口分别为W1[k-169,k]、W2[k-170,k-1]、W3[k-171,k-2]、W4[k-172,k-3]、 W5[k-173,k-4]、W6[k-174,k-5]、W7[k-175,k-6]、W8[k-176,k-7]、W9[k-177,k-8]、W10[k-178,k-9]和W11[k-179,k-10]。ka为数据流分割点,图21中所示数据流分割点查找方向为从左向右,从数据流分割点ka跳过最小数据块4KB后,最小数据块4KB结束位置作为下一个潜在分割点ki,根据为去重服务器103预设的规则,为潜在分割点ki确定窗口Wix[ki-Ax,ki+Bx],在本实施例中,x分别为1到11连续的自然数。在图21所示的实施方式中,为潜在分割点ki确定的窗口为11个,分别为Wi1[ki-169,ki]、Wi2[ki-170,ki-1]、Wi3[ki-171,ki-2]、Wi4[ki-172,ki-3]、Wi5[ki-173,ki-4]、Wi6[ki-174,ki-5]、Wi7[ki-175,ki-6]、Wi8[ki-176,ki-7]、Wi9[ki-177,ki-8]、Wi10[ki-178,ki-9]和Wi11[ki-179,ki-10]。判断Wi1[ki-169,ki]中至少部分数据是否满足预定条件C1、判断Wi2[ki-170,ki-1]中至少部分数据是否满足预定条件C2、判断Wi3[ki-171,ki-2]中至少部分数据是否满足预定条件C3、判断Wi4[ki-172,ki-3]中至少部分数据是否满足预定条件C4、判断Wi5[ki-173,ki-4]中至少部分数据是否满足预定条件C5、判断Wi6[ki-174,ki-5]中至少部分数据是否满足预定条件C6、判断Wi7[ki-175,ki-6]中至少部分数据是否满足预定条件C7、判断Wi8[ki-176,ki-7]中至少部分数据是否满足预定条件C8、判断Wi9[ki-177,ki-8]中至少部分数据是否满足预定条件C9、判断Wi10[ki-178,ki-9]中至少部分数据是否满足预定条件C10和判断Wi11[ki-179,ki-10]中至少部分数据是否满足预定条件C11。当判断窗口Wi1中至少部分数据满足预定条件C1、窗口Wi2中至少部分数据满足预定条件C2、窗口Wi3中至少部分数据满足预定条件C3、窗口Wi4中至少部分数据满足预定条件C4、窗口Wi5中至少部分数据满足预定条件C5、窗口Wi6中至少部分数据满足预定条件C6、窗口Wi7中至少部分数据满足预定条件C7、窗口Wi8中至少部分数 
据满足预定条件C8、窗口Wi9中至少部分数据满足预定条件C9、窗口Wi10中至少部分数据满足预定条件C10和窗口Wi11中至少部分数据满足预定条件C11时,则当前潜在分割点ki为数据流分割点。当11个窗口中任一个窗口中至少部分数据不满足对应的预定条件时,如图22所示,Wi5[ki-173,ki-4],则从潜在分割点ki沿着数据流分割点查找方向跳跃N个字节,其中N个字节不大于‖B5‖+maxx(‖Ax‖),在图22所示的实施方式中,跳跃N个字节不大于183个字节,在本实施例中,N=7,得到新的潜在分割点,为与潜在分割点ki区别,这里将新的潜在分割点表示为kj。根据图21所示的实施方式中,在去重服务器103上预设有规则,所述规则为:为潜在分割点kj确定窗口Wjx[kj-Ax,kj+Bx],在本实施例中,x分别为1到11连续的自然数。为潜在分割点kj确定的窗口为11个,分别为Wj1[kj-169,kj]、Wj2[kj-170,kj-1]、Wj3[kj-171,kj-2]、Wj4[kj-172,kj-3]、Wj5[kj-173,kj-4]、Wj6[kj-174,kj-5]、Wj7[kj-175,kj-6]、Wj8[kj-176,kj-7]、Wj9[kj-177,kj-8]、Wj10[kj-178,kj-9]和Wj11[kj-179,kj-10]。如图22所示,为潜在分割点确定的第11个窗口Wj11[kj-179,kj-10],在保证潜在分割点ki与潜在分割点kj之间的范围都在判断范围之内,则在本实施方式中,必须保证窗口Wj11[kj-179,kj-10]的左边界与窗口Wi5[ki-173,ki-4]的右边界(ki-4)重合,或者位于窗口Wi5[ki-173,ki-4]范围之内,所述窗口Wj11[kj-179,kj-10]是根据所述规则,为所述潜在分割点kj确定的M个窗口按照数据流查找方向获得的序列中排序第一的窗口。因此,在这一限定内,当窗口Wi5[ki-173,ki-4]中至少部分数据不满足预定条件C5,从潜在分割点ki沿数据流分割点查找方向跳跃的距离不大于‖B5‖+maxx(‖Ax‖)。判断Wj1[kj-169,kj]中至少 部分数据是否满足预定条件C1、判断Wj2[kj-170,kj-1]中至少部分数据是否满足预定条件C2、判断Wj3[kj-171,kj-2]中至少部分数据是否满足预定条件C3、判断Wj4[kj-172,kj-3]中至少部分数据是否满足预定条件C4、判断Wj5[kj-173,kj-4]中至少部分数据是否满足预定条件C5、判断Wj6[kj-174,kj-5]中至少部分数据是否满足预定条件C6、判断Wj7[kj-175,kj-6]中至少部分数据是否满足预定条件C7、判断Wj8[kj-176,kj-7]中至少部分数据是否满足预定条件C8、判断Wj9[kj-177,kj-8]中至少部分数据是否满足预定条件C9、判断Wj10[kj-178,kj-9]中至少部分数据是否满足预定条件C10和判断Wj11[kj-179,kj-10]中至少部分数据是否满足预定条件C11。当判断窗口Wj1中至少部分数据满足预定条件C1、窗口Wj2中至少部分数据满足预定条件C2、窗口Wj3中至少部分数据满足预定条件C3、窗口Wj4中至少部分数据满足预定条件C4、窗口Wj5中至少部分数据满足预定条件C5、窗口Wj6中至少部分数据满足预定条件C6、窗口Wj7中至少部分数据满足预定条件C7、窗口Wj8中至少部分数据满足预定条件C8、窗口Wj9中至少部分数据满足预定条件C9、窗口Wj10中至少部分数据满足预定条件C10和窗口Wj11中至少部分数据满足预定条件C11时,则当前潜在分割点ki为数据流分割点,kj与ka之间的数据构成1个数据块,同时按照与ka相同的方式跳过最小分块大小4KB,获得下一个潜在分割点,并按照在去重服务器103上预设的规则,判断下一个潜在分割点是否为数据流分割点。当判断潜在分割点kj不是数据流分割点时,按照与ki相同的方式获得下一个潜在分割点,并按照在去重服务器103上预设的规则及上述方法判断下一个潜在分割点是否为数据流分割点。当超过设定的最大数据块仍然没有找到数据流分割点时,则从最大数据块的结束位置作为强制分割点。On the basis of the data stream segmentation point search shown in Figure 3, in the embodiment shown in Figure 21, a rule is preset on the deduplication server 103, and the rule is: determine 11 windows for the 
potential segmentation point k W x [kA x , k+B x ] and the predetermined condition C x corresponding to the window W x [kA x , k+B x ], wherein x is a continuous natural number from 1 to 11, and A x and B x are integers. Among them, A 1 =169, B 1 =0; A 2 =170, B 2 =-1; A 3 =171, B 3 =-2; A 4 =172, B 4 =-3; A 5 =173, B 5 =-4; A 6 =174, B 6 =-5; A 7 =175, B 7 =-6; A 8 =176, B 8 =-7; A 9 =177, B 9 =-8; A 10 =178, B 10 =-9; A 11 =179, B 11 =-10, and C 1 =C 2 =C 3 =C 4 =C 5 =C 6 =C 7 =C 8 =C 9 = C 10 =C 11 , then the 11 windows are W 1 [k-169,k], W 2 [k-170,k-1], W 3 [k-171,k-2], W 4 [k -172,k-3], W 5 [k-173,k-4], W 6 [k-174,k-5], W 7 [k-175,k-6], W 8 [k-176 ,k-7], W 9 [k-177,k-8], W 10 [k-178,k-9] and W 11 [k-179,k-10]. k a is the data stream split point. The search direction of the data stream split point shown in Figure 21 is from left to right. After skipping the minimum data block 4KB from the data stream split point k a , the minimum data block 4KB end position is taken as the next potential For split point k i , according to the preset rules for the deduplication server 103, determine the window W ix [k i -A x , k i +B x ] for the potential split point k i , in this embodiment, x is 1 respectively to 11 consecutive natural numbers. In the embodiment shown in FIG. 21 , there are 11 windows determined for the potential segmentation point ki, which are respectively W i1 [k i -169, ki ], W i2 [ k i -170, ki -1] , W i3 [k i -171,k i -2], W i4 [k i -172,k i -3], W i5 [k i -173,k i -4], W i6 [k i -174 ,k i -5], W i7 [k i -175,k i -6], W i8 [k i -176,k i -7], W i9 [k i -177,k i -8], W i10 [k i -178, ki -9] and W i11 [k i -179, ki -10]. 
Judging whether at least part of the data in W i1 [k i -169, ki ] satisfies the predetermined condition C 1 , judging whether at least part of the data in W i2 [k i -170, ki -1] meets the predetermined condition C 2 , judging W Whether at least part of the data in i3 [k i -171, ki -2] meets the predetermined condition C 3 , judge whether at least part of the data in W i4 [k i -172, ki -3] meet the predetermined condition C 4 , and judge W Whether at least part of the data in i5 [k i -173, ki -4] meets the predetermined condition C 5 , judge whether at least part of the data in W i6 [k i -174, ki -5] meet the predetermined condition C 6 , and judge W Whether at least part of the data in i7 [k i -175, ki -6] meets the predetermined condition C 7 , judge whether at least part of the data in W i8 [k i -176, ki -7] meet the predetermined condition C 8 , and judge W Whether at least part of the data in i9 [k i -177, ki -8] meets the predetermined condition C 9 , judge whether at least part of the data in W i10 [k i -178, ki -9] meet the predetermined condition C 10 and judge W Whether at least part of the data in i11 [k i -179, ki -10] satisfies the predetermined condition C 11 . 
When judging that at least part of the data in window W i1 meets the predetermined condition C 1 , at least part of the data in window W i2 meets the predetermined condition C 2 , at least part of the data in window W i3 meets the predetermined condition C 3 , and at least part of the data in window W i4 meets the predetermined condition Condition C 4 , at least part of the data in window W i5 meets the predetermined condition C 5 , at least part of the data in window W i6 meets the predetermined condition C 6 , at least part of the data in window W i7 meets the predetermined condition C 7 , and at least part of the data in window W i8 When the predetermined condition C8 is met, at least part of the data in window W i9 meets the predetermined condition C9 , at least part of the data in window W i10 meets the predetermined condition C10 , and at least part of the data in window W i11 meets the predetermined condition C11 , then the current potential segmentation Point ki is the data flow splitting point. When at least part of the data in any of the 11 windows does not meet the corresponding predetermined conditions, as shown in Figure 22, W i5 [k i -173, ki -4], then from the potential segmentation point ki along the data The search direction of the stream split point jumps N bytes, wherein N bytes are not greater than ‖B 5 ‖+max x (‖A x ‖), and in the embodiment shown in Figure 22, the jump N bytes are not greater than 183 bytes. In this embodiment, N=7 to obtain a new potential segmentation point. In order to distinguish it from the potential segmentation point ki, the new potential segmentation point is denoted as k j here . According to the embodiment shown in FIG. 21 , a rule is preset on the deduplication server 103, and the rule is: determine a window W jx [k j -A x , k j +B x ] for a potential segmentation point k j , In this embodiment, x are consecutive natural numbers from 1 to 11, respectively. 
Eleven windows are determined for the potential split point kj: Wj1[kj-169, kj], Wj2[kj-170, kj-1], Wj3[kj-171, kj-2], Wj4[kj-172, kj-3], Wj5[kj-173, kj-4], Wj6[kj-174, kj-5], Wj7[kj-175, kj-6], Wj8[kj-176, kj-7], Wj9[kj-177, kj-8], Wj10[kj-178, kj-9] and Wj11[kj-179, kj-10]. As shown in Figure 22, consider the 11th window Wj11[kj-179, kj-10] determined for the potential split point kj. To ensure that the whole range between the potential split point ki and the potential split point kj is covered by the judgment, in this embodiment the left boundary of window Wj11[kj-179, kj-10] must either coincide with the right boundary (ki-4) of window Wi5[ki-173, ki-4] or lie within window Wi5[ki-173, ki-4]. Here window Wj11[kj-179, kj-10] is the window that ranks first, in the search direction of the data stream, among the M windows determined for the potential split point kj according to the rule. Within this limit, when at least part of the data in window Wi5[ki-173, ki-4] does not satisfy the predetermined condition C5, the distance skipped from the potential split point ki along the search direction of the data stream split point is not greater than ‖B5‖ + max_x(‖Ax‖).
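The window layout and the jump bound described above can be reproduced in a few lines of Python. This is only an illustrative sketch: the names `RULE_FIG21`, `windows_for` and `max_jump` are hypothetical and do not come from the patent, and the rule is simply encoded as the (Ax, Bx) pairs of the Figure 21 embodiment.

```python
# Hypothetical encoding of the Figure 21 rule: (A_x, B_x) for x = 1..11,
# i.e. (169, 0), (170, -1), ..., (179, -10).
RULE_FIG21 = [(169 + t, -t) for t in range(11)]


def windows_for(k, rule):
    """Return the byte ranges [k - A_x, k + B_x] for a potential split point k."""
    return [(k - a, k + b) for a, b in rule]


def max_jump(rule, failed_x):
    """Upper bound ||B_x|| + max_x(||A_x||) on the jump after window x fails.

    failed_x is 1-based, matching the numbering used in the text.
    """
    _, b = rule[failed_x - 1]
    return abs(b) + max(abs(a) for a, _ in rule)


if __name__ == "__main__":
    k_i = 4096  # end of the 4 KB minimum block, counted from the previous split point
    for x, (left, right) in enumerate(windows_for(k_i, RULE_FIG21), start=1):
        print(f"W_i{x}: [{left}, {right}]")
    print("bound after window 5 fails:", max_jump(RULE_FIG21, 5))
```

With this encoding, a failure of window 5 gives a bound of 183 bytes, which agrees with the value stated for Figure 22.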
It is judged whether at least part of the data in each of these windows satisfies its predetermined condition: Wj1[kj-169, kj] against C1, Wj2[kj-170, kj-1] against C2, Wj3[kj-171, kj-2] against C3, Wj4[kj-172, kj-3] against C4, Wj5[kj-173, kj-4] against C5, Wj6[kj-174, kj-5] against C6, Wj7[kj-175, kj-6] against C7, Wj8[kj-176, kj-7] against C8, Wj9[kj-177, kj-8] against C9, Wj10[kj-178, kj-9] against C10, and Wj11[kj-179, kj-10] against C11.
When at least part of the data in each of the windows Wj1 to Wj11 satisfies the corresponding predetermined condition C1 to C11, the current potential split point kj is a data stream split point, and the data between kj and ka forms one data block. The minimum block size of 4 KB is then skipped in the same way as for ka to obtain the next potential split point, and it is judged, according to the rule preset on the deduplication server 103, whether that point is a data stream split point. When the potential split point kj is judged not to be a data stream split point, the next potential split point is obtained in the same way as for ki, and it is judged, according to the rule preset on the deduplication server 103 and the method above, whether that point is a data stream split point. When no data stream split point has been found by the time the set maximum data block size is exceeded, the end position of the maximum data block is used as a forced split point.
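The search procedure described above (skip the minimum block, test the windows in order, jump on the first failure, force a split at the maximum block) can be sketched as follows. Several parts are assumptions made for illustration only: `satisfies` is a hypothetical placeholder for the predetermined condition Cx, the 12 KB maximum block is borrowed from the probability discussion in this description, and the jump length `12 - x` is a choice that reproduces N = 7 when window 5 fails; the text itself only requires N to stay within ‖Bx‖ + max_x(‖Ax‖).

```python
# Hypothetical encoding of the Figure 21 rule: (A_x, B_x) for x = 1..11.
RULE_FIG21 = [(169 + t, -t) for t in range(11)]


def find_split_point(data, prev_split, rule, satisfies,
                     min_block=4096, max_block=12 * 1024,
                     jump_after_failure=lambda x: 12 - x):
    """Return the next split point after prev_split.

    satisfies(data, left, right) is an assumed predicate standing in for the
    predetermined condition applied to the window [left, right].
    """
    limit = min(prev_split + max_block, len(data))
    k = prev_split + min_block          # first potential split point
    while k < limit:
        # Test the windows in rule order and stop at the first failure.
        failed = next((x for x, (a, b) in enumerate(rule, start=1)
                       if not satisfies(data, k - a, k + b)), None)
        if failed is None:
            return k                    # every window met its condition
        k += jump_after_failure(failed) # skip N bytes in the search direction
    return limit                        # forced split (or end of data)


if __name__ == "__main__":
    data = bytes(20000)
    # Toy predicate: a window passes only if its right boundary is >= 4100.
    print(find_split_point(data, 0, RULE_FIG21, lambda d, l, r: r >= 4100))
```

In the toy run, the first position at which all eleven windows pass is 4110, and the loop reaches it through two jumps (11 bytes, then 3 bytes) without testing any position in between.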
In the embodiment shown in Figure 21, according to the rule preset on the deduplication server 103, the judgment starts with whether at least part of the data in Wi1[ki-169, ki] satisfies the predetermined condition C1. Suppose that at least part of the data in Wi1[ki-169, ki], Wi2[ki-170, ki-1], Wi3[ki-171, ki-2] and Wi4[ki-172, ki-3] satisfies the predetermined conditions C1, C2, C3 and C4 respectively, and that at least part of the data in Wi5[ki-173, ki-4] does not satisfy the predetermined condition C5. If 6 bytes are skipped from the potential split point ki along the search direction of the data stream split point, a new potential split point is obtained at the end position of the 6th byte; to distinguish it from the other potential split points it is denoted kg. According to the rule preset on the deduplication server 103, 11 windows are determined for the potential split point kg: Wg1[kg-169, kg], Wg2[kg-170, kg-1], Wg3[kg-171, kg-2], Wg4[kg-172, kg-3], Wg5[kg-173, kg-4], Wg6[kg-174, kg-5], Wg7[kg-175, kg-6], Wg8[kg-176, kg-7], Wg9[kg-177, kg-8], Wg10[kg-178, kg-9] and Wg11[kg-179, kg-10]. It is judged whether at least part of the data in each of these windows satisfies its predetermined condition: Wg1[kg-169, kg] against C1, Wg2[kg-170, kg-1] against C2, Wg3[kg-171, kg-2] against C3, Wg4[kg-172, kg-3] against C4, Wg5[kg-173, kg-4] against C5, Wg6[kg-174, kg-5] against C6, Wg7[kg-175, kg-6] against C7, Wg8[kg-176, kg-7] against C8, Wg9[kg-177, kg-8] against C9, Wg10[kg-178, kg-9] against C10, and Wg11[kg-179, kg-10] against C11. Window Wg11[kg-179, kg-10] coincides with window Wi5[ki-173, ki-4], and C5 = C11. Therefore, when at least part of the data in Wi5[ki-173, ki-4] does not satisfy the predetermined condition C5, the potential split point kg obtained by skipping 6 bytes from ki along the search direction still cannot qualify as a data stream split point. Skipping 6 bytes from ki would therefore repeat a computation, whereas skipping 7 bytes from ki along the search direction reduces repeated computation and is more efficient, which increases the speed of finding data stream split points. When the probability that at least part of the data in a window Wx[k-Ax, k+Bx] of the preset rule satisfies the predetermined condition Cx is 1/2, a jump is performed with probability 1/2, and each jump can cover at most ‖B11‖ + ‖A11‖ = 189 bytes.
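The coincidence between Wg11 and Wi5 can be checked by computing the window boundaries directly. The sketch below is illustrative; `RULE_FIG21`, `window` and `windows_hitting` are hypothetical names, and the value 4096 for ki is just an example position.

```python
# Hypothetical encoding of the Figure 21 rule: (A_x, B_x) for x = 1..11.
RULE_FIG21 = [(169 + t, -t) for t in range(11)]


def window(k, a, b):
    """Byte range [k - a, k + b] of one window for potential split point k."""
    return (k - a, k + b)


def windows_hitting(k_i, failed_x, jump, rule=RULE_FIG21):
    """1-based indices of the windows at k_i + jump that coincide with the
    window that failed at k_i."""
    a, b = rule[failed_x - 1]
    failed = window(k_i, a, b)
    return [x for x, (ax, bx) in enumerate(rule, start=1)
            if window(k_i + jump, ax, bx) == failed]


if __name__ == "__main__":
    k_i = 4096
    for jump in range(1, 9):
        print(jump, windows_hitting(k_i, failed_x=5, jump=jump))
```

For a failed window 5, every jump of 1 to 6 bytes lands on a point one of whose windows is exactly the failed byte range (window 11 for a jump of 6), while a jump of 7 bytes is the first that does not, which is why the text prefers 7.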
In this embodiment, the preset rule is: determine 11 windows Wx[k-Ax, k+Bx] for a potential split point k, and at least part of the data in each window Wx[k-Ax, k+Bx] satisfies the predetermined condition Cx, where the probability that at least part of the data in Wx[k-Ax, k+Bx] satisfies Cx is 1/2, x takes the consecutive natural numbers 1 to 11, and Ax and Bx are integers. Here A1=169, B1=0; A2=170, B2=-1; A3=171, B3=-2; A4=172, B4=-3; A5=173, B5=-4; A6=174, B6=-5; A7=175, B7=-6; A8=176, B8=-7; A9=177, B9=-8; A10=178, B10=-9; A11=179, B11=-10, and C1=C2=C3=C4=C5=C6=C7=C8=C9=C10=C11. In other words, 11 windows are selected for the potential split point k, and they are 11 consecutive windows; P(n) can be calculated from these two factors. The way the 11 windows are selected, and the judgment of whether at least part of the data in each of them satisfies the predetermined condition Cx, follow the rule preset on the deduplication server 103. Whether at least part of the data in each of 11 consecutive windows satisfies Cx therefore determines whether the potential split point k is a data stream split point. The gap between two bytes is called a point. P(n) denotes the probability that, among n consecutive windows, there are no 11 consecutive windows that satisfy the condition, that is, the probability that no data stream split point exists. After the minimum block size of 4 KB is skipped from the file header or the previous split point, the search moves back 10 bytes against the search direction of the data stream split point and reaches the 4086th point. No data stream split point exists at that point, so P(4086)=1, and likewise P(4087)=1, ..., P(4095)=1. At the 4096th point, that is, at the minimum block size, at least part of the data in each of the 11 windows satisfies the predetermined condition Cx with probability (1/2)^11. A data stream split point therefore exists with probability (1/2)^11 and does not exist with probability 1-(1/2)^11, so P(4096)=1-(1/2)^11.
As shown in Figure 35, at the nth window, P(n) can be derived recursively by distinguishing 12 cases.
Case 1: With probability 1/2, at least part of the data in the nth window does not satisfy the predetermined condition. In that case, with probability P(n-1), the n-1 windows preceding the nth window contain no 11 consecutive windows in each of which at least part of the data satisfies the predetermined condition. P(n) therefore includes the term 1/2*P(n-1). The situation in which the nth window fails the predetermined condition but the n-1 preceding windows do contain 11 such consecutive windows does not contribute to P(n).
Case 2: With probability 1/2, at least part of the data in the nth window satisfies the predetermined condition, and with probability 1/2, at least part of the data in the (n-1)th window does not. In that case, with probability P(n-2), the n-2 windows preceding the (n-1)th window contain no 11 consecutive windows in each of which at least part of the data satisfies the predetermined condition. P(n) therefore includes the term 1/2*1/2*P(n-2). The situation in which the nth window satisfies the condition, the (n-1)th window does not, and the n-2 preceding windows do contain 11 such consecutive windows does not contribute to P(n).
Following the same pattern, Case 11: With probability (1/2)^10, at least part of the data in each of the nth to (n-9)th windows satisfies the predetermined condition, and with probability 1/2, at least part of the data in the (n-10)th window does not. In that case, with probability P(n-11), the n-11 windows preceding the (n-10)th window contain no 11 consecutive windows in each of which at least part of the data satisfies the predetermined condition. P(n) therefore includes the term (1/2)^10*1/2*P(n-11). The situation in which the nth to (n-9)th windows all satisfy the condition, the (n-10)th window does not, and the n-11 preceding windows do contain 11 such consecutive windows does not contribute to P(n).
Case 12: With probability (1/2)^11, at least part of the data in each of the nth to (n-10)th windows satisfies the predetermined condition. This case does not contribute to P(n).
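The case analysis can be checked by brute force on a scaled-down instance. Summing the contributions of the non-excluded cases gives a recurrence in which P(n-j) is weighted by (1/2)^j; the sketch below compares that recurrence with exhaustive enumeration for a run length of 3 instead of 11, which keeps the enumeration small. The function names and the choice of run length 3 are assumptions made for illustration, and the windows are treated as independent with success probability 1/2.

```python
from itertools import product


def has_run(bits, k):
    """True if bits contains k consecutive 1s (k consecutive satisfied windows)."""
    run = 0
    for b in bits:
        run = run + 1 if b else 0
        if run >= k:
            return True
    return False


def brute_force_no_run(n, k):
    """Probability that n fair windows contain no run of k successes,
    obtained by enumerating all 2**n outcomes."""
    without = sum(1 for bits in product((0, 1), repeat=n) if not has_run(bits, k))
    return without / 2 ** n


def recurrence_no_run(n, k):
    """Same probability from P(m) = sum_{j=1..k} (1/2)^j * P(m-j),
    with P(m) = 1 for m < k."""
    P = [1.0] * (n + 1)
    for m in range(k, n + 1):
        P[m] = sum(0.5 ** j * P[m - j] for j in range(1, k + 1))
    return P[n]


if __name__ == "__main__":
    for n in range(0, 13):
        print(n, brute_force_no_run(n, 3), recurrence_no_run(n, 3))
```

The two columns agree for every n tried, which supports the way the cases are combined; the same structure applies with run length 11.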
Therefore, P(n)=1/2*P(n-1)+(1/2)^2*P(n-2)+...+(1/2)^11*P(n-11). Another preset rule is: determine 24 windows Wx[k-Ax, k+Bx] for a potential split point k, together with the predetermined condition Cx corresponding to each window Wx[k-Ax, k+Bx], where x takes the consecutive natural numbers 1 to 24, A1=169, B1=0; A2=170, B2=-1; A3=171, B3=-2; A4=172, B4=-3; A5=173, B5=-4; A6=174, B6=-5; A7=175, B7=-6; A8=176, B8=-7; A9=177, B9=-8; A10=178, B10=-9; A11=179, B11=-10; ...; A24=192, B24=-23, and C1=C2=C3=C4=C5=C6=C7=C8=C9=...=C24. The probability that at least part of the data in a window Wx[k-Ax, k+Bx] satisfies the predetermined condition Cx is 3/4, and P(n) can be calculated from these two factors.
Whether at least part of the data in each of 24 consecutive windows satisfies the predetermined condition Cx therefore determines whether the potential split point k is a data stream split point, and P(n) can be calculated with the following formulas:
P(1)=1, P(2)=1, ..., P(23)=1, P(24)=1-(3/4)^24,
P(n)=1/4*P(n-1)+1/4*(3/4)*P(n-2)+...+1/4*(3/4)^23*P(n-24).
After calculation, P(5*1024)=0.78, P(11*1024)=0.17 and P(12*1024)=0.13. That is, with a probability of 13% no data stream split point has been found after searching 12 KB from the start of the data stream or the previous data stream split point, and a split is forced. From this probability the density function of the data stream split point is obtained, and integration shows that a split point is found on average after searching about 7.6 KB from the start of the data stream or the previous split point, that is, the average block length is about 7.6 KB. In contrast to the approach in which at least part of the data in each of 11 consecutive windows satisfies the predetermined condition with probability 1/2, a traditional CDC algorithm that uses a single window reaches an average block length of 7.6 KB only when that window satisfies the condition with probability 1/2^12.
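The recurrences can be evaluated numerically. The sketch below is a minimal illustration under stated assumptions: P(n) is taken to be 1 for every point before `first_point`, the recurrence is applied from `first_point` onward, and the windows are treated as independent; the function name and parameters are hypothetical. For the 11-window rule `first_point` is the 4096th point, and for the 24-window rule it is the 24th point, following the indexing used in the respective formulas.

```python
def no_split_probability(n_max, windows, p, first_point):
    """Return a list P with P[n] = probability that no split point exists up to point n.

    p is the probability that one window satisfies its condition; the recurrence is
    P(n) = sum_{j=1..windows} (1-p) * p**(j-1) * P(n-j), which reduces to
    (1/2)**j * P(n-j) for p = 1/2. Requires first_point >= windows.
    """
    q = 1.0 - p
    P = [1.0] * (n_max + 1)
    for n in range(first_point, n_max + 1):
        P[n] = sum(q * p ** (j - 1) * P[n - j] for j in range(1, windows + 1))
    return P


# 11 windows, probability 1/2, recurrence starting at the 4096th point.
P11 = no_split_probability(12 * 1024, windows=11, p=0.5, first_point=4096)
# 24 windows, probability 3/4, indexed from 1 as in the formulas for that rule.
P24 = no_split_probability(24, windows=24, p=0.75, first_point=24)

if __name__ == "__main__":
    for kb in (5, 11, 12):
        print(f"P({kb}*1024) = {P11[kb * 1024]:.3f}")
    print(f"P24(24) = {P24[24]:.6f}")
```

Under these assumptions the 11-window values should land close to the figures of 0.78, 0.17 and 0.13 quoted above, and the first computed value of each recurrence reproduces the stated initial values 1-(1/2)^11 and 1-(3/4)^24.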
On the basis of the data stream split point search shown in Figure 3, in the embodiment shown in Figure 23 a rule is preset on the deduplication server 103. The rule is: determine 11 windows Wx[k-Ax, k+Bx] for a potential split point k, together with the predetermined condition Cx corresponding to each window Wx[k-Ax, k+Bx], where x takes the consecutive natural numbers 1 to 11 and Ax and Bx are integers. The probability that at least part of the data in a window Wx[k-Ax, k+Bx] satisfies the predetermined condition Cx is 1/2, with A1=171, B1=-2; A2=172, B2=-3; A3=173, B3=-4; A4=174, B4=-5; A5=175, B5=-6; A6=176, B6=-7; A7=177, B7=-8; A8=178, B8=-9; A9=179, B9=-10; A10=170, B10=-1; A11=169, B11=0, and C1=C2=C3=C4=C5=C6=C7=C8=C9=C10=C11. ka is a data stream split point, and the search direction of the data stream split point shown in Figure 23 is from left to right. After the minimum data block of 4 KB is skipped from the data stream split point ka, the end position of that minimum data block is taken as the next potential split point ki. According to the rule preset on the deduplication server 103, the windows Wix[ki-Ax, ki+Bx] and the corresponding predetermined conditions Cx are determined for the potential split point ki, where x takes the consecutive natural numbers 1 to 11. The 11 windows determined are Wi1[ki-171, ki-2], Wi2[ki-172, ki-3], Wi3[ki-173, ki-4], Wi4[ki-174, ki-5], Wi5[ki-175, ki-6], Wi6[ki-176, ki-7], Wi7[ki-177, ki-8], Wi8[ki-178, ki-9], Wi9[ki-179, ki-10], Wi10[ki-170, ki-1] and Wi11[ki-169, ki]. It is judged whether at least part of the data in each of these windows satisfies its predetermined condition: Wi1[ki-171, ki-2] against C1, Wi2[ki-172, ki-3] against C2, Wi3[ki-173, ki-4] against C3, Wi4[ki-174, ki-5] against C4, Wi5[ki-175, ki-6] against C5, Wi6[ki-176, ki-7] against C6, Wi7[ki-177, ki-8] against C7, Wi8[ki-178, ki-9] against C8, Wi9[ki-179, ki-10] against C9, Wi10[ki-170, ki-1] against C10, and Wi11[ki-169, ki] against C11. When at least part of the data in each of the windows Wi1 to Wi11 satisfies the corresponding predetermined condition C1 to C11, the current potential split point ki is a data stream split point. When at least part of the data in any one of the 11 windows does not satisfy the corresponding predetermined condition, the description takes as an example the case shown in Figure 24, in which at least part of the data in Wi3[pi3-169, pi3] does not satisfy the predetermined condition C3 and the point pi3 jumps 11 bytes along the search direction of the data stream split point. As shown in Figure 24, when W3 is judged not to satisfy the predetermined condition C3, N bytes are skipped, starting from ki, along the search direction of the data stream split point, where N is not greater than ‖B3‖ + max_x(‖Ax‖). In this embodiment N = 7, and the next potential split point is obtained at the end position of the 7th byte. To distinguish it from the potential split point ki, the new potential split point is denoted kj. According to the rule preset on the deduplication server 103, 11 windows Wjx[kj-Ax, kj+Bx] are determined for the potential split point kj: Wj1[kj-171, kj-2], Wj2[kj-172, kj-3], Wj3[kj-173, kj-4], Wj4[kj-174, kj-5], Wj5[kj-175, kj-6], Wj6[kj-176, kj-7], Wj7[kj-177, kj-8], Wj8[kj-178, kj-9], Wj9[kj-179, kj-10], Wj10[kj-170, kj-1] and Wj11[kj-169, kj]. It is judged whether at least part of the data in each of these windows satisfies its predetermined condition: Wj1[kj-171, kj-2] against C1, Wj2[kj-172, kj-3] against C2, Wj3[kj-173, kj-4] against C3, Wj4[kj-174, kj-5] against C4, Wj5[kj-175, kj-6] against C5, Wj6[kj-176, kj-7] against C6, Wj7[kj-177, kj-8] against C7, Wj8[kj-178, kj-9] against C8, Wj9[kj-179, kj-10] against C9, Wj10[kj-170, kj-1] against C10, and Wj11[kj-169, kj] against C11. Of course, in the embodiments of the present invention the same principle is followed when judging whether the potential split point ka is a data stream split point; the specific implementation is not described again, and reference may be made to the description for the potential split point ki. When at least part of the data in each of the windows Wj1 to Wj11 satisfies the corresponding predetermined condition C1 to C11, the current potential split point kj is a data stream split point, and the data between kj and ka forms one data block. The minimum block size of 4 KB is then skipped in the same way as for ka to obtain the next potential split point, and it is judged, according to the rule preset on the deduplication server 103, whether that point is a data stream split point. When the potential split point kj is judged not to be a data stream split point, the next potential split point is obtained in the same way as for ki, and it is judged, according to the rule preset on the deduplication server 103 and the method above, whether that point is a data stream split point. When no data stream split point has been found by the time the set maximum data block size is exceeded, the end position of the maximum data block is used as a forced split point. Of course, the implementation of this method is constrained by the maximum data block length and by the size of the file that forms the data stream, and details are not repeated here.
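Comparing the parameter lists shows that the Figure 23 rule covers the same eleven byte ranges as the Figure 21 rule and differs only in the order in which the windows are numbered and tested. The following sketch makes that comparison explicit; the names `RULE_FIG21`, `RULE_FIG23` and `fig21_index` are hypothetical and are introduced only for this illustration.

```python
# (A_x, B_x) pairs in evaluation order, x = 1..11.
RULE_FIG21 = [(169 + t, -t) for t in range(11)]
RULE_FIG23 = [(171 + t, -2 - t) for t in range(9)] + [(170, -1), (169, 0)]

# For each Figure 23 window, the 1-based index of the identical Figure 21 window.
fig21_index = [RULE_FIG21.index(w) + 1 for w in RULE_FIG23]


def max_jump(rule, failed_x):
    """Upper bound ||B_x|| + max_x(||A_x||) on the jump after window x fails."""
    _, b = rule[failed_x - 1]
    return abs(b) + max(abs(a) for a, _ in rule)


if __name__ == "__main__":
    print("same byte ranges:", sorted(RULE_FIG21) == sorted(RULE_FIG23))
    print("Figure 23 window x -> Figure 21 window:", fig21_index)
    print("bound after Figure 23 window 3 fails:", max_jump(RULE_FIG23, 3))
```

Window 3 of Figure 23 is the byte range [k-173, k-4], the same range as window 5 of Figure 21, which is consistent with both embodiments using N = 7 in their examples.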
On the basis of the search for a data stream split point shown in FIG. 3, in the embodiment shown in FIG. 25, a rule is preset on the deduplication server 103. The rule is: for a potential split point k, determine 11 windows Wx[k-Ax, k+Bx] and a predetermined condition Cx corresponding to each window Wx[k-Ax, k+Bx], where x is each of the consecutive natural numbers from 1 to 11, and A1=166, B1=3; A2=167, B2=2; A3=168, B3=1; A4=169, B4=0; A5=170, B5=-1; A6=171, B6=-2; A7=172, B7=-3; A8=173, B8=-4; A9=174, B9=-5; A10=175, B10=-6; A11=176, B11=-7; and C1=C2=C3=C4=C5=C6=C7=C8=C9=C10=C11. The 11 windows are therefore W1[k-166, k+3], W2[k-167, k+2], W3[k-168, k+1], W4[k-169, k], W5[k-170, k-1], W6[k-171, k-2], W7[k-172, k-3], W8[k-173, k-4], W9[k-174, k-5], W10[k-175, k-6] and W11[k-176, k-7].
ka is a data stream split point, and the search direction for a data stream split point shown in FIG. 25 is from left to right. After the minimum data block of 4 KB is skipped from the data stream split point ka, the end position of that minimum 4 KB data block is used as the next potential split point ki. In this embodiment, according to the rule preset on the deduplication server 103, 11 windows Wix[ki-Ax, ki+Bx] and the corresponding predetermined conditions Cx are determined for the potential split point ki, where x is each of the consecutive natural numbers from 1 to 11. In the embodiment shown in FIG. 25, the 11 windows determined for the potential split point ki are Wi1[ki-166, ki+3], Wi2[ki-167, ki+2], Wi3[ki-168, ki+1], Wi4[ki-169, ki], Wi5[ki-170, ki-1], Wi6[ki-171, ki-2], Wi7[ki-172, ki-3], Wi8[ki-173, ki-4], Wi9[ki-174, ki-5], Wi10[ki-175, ki-6] and Wi11[ki-176, ki-7].
It is judged whether at least part of the data in Wi1[ki-166, ki+3] satisfies the predetermined condition C1, whether at least part of the data in Wi2[ki-167, ki+2] satisfies the predetermined condition C2, and so on in the same way for each of the 11 windows, up to whether at least part of the data in Wi11[ki-176, ki-7] satisfies the predetermined condition C11.
When at least part of the data in every one of the windows Wi1 to Wi11 satisfies its corresponding predetermined condition C1 to C11, the current potential split point ki is a data stream split point.
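The window construction and the all-windows check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `FIG25_RULE`, `windows_for`, `is_split_point` and `example_condition` are hypothetical, and because the text does not specify the predetermined condition, a simple byte-sum parity test is used purely as a stand-in.

```python
from typing import Callable, List, Tuple

# (Ax, Bx) pairs of the FIG. 25 rule: (166, 3), (167, 2), ..., (176, -7).
FIG25_RULE: List[Tuple[int, int]] = [(166 + n, 3 - n) for n in range(11)]


def windows_for(k: int, rule: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return the inclusive byte ranges [k-Ax, k+Bx] for potential split point k."""
    return [(k - a, k + b) for a, b in rule]


def is_split_point(
    data: bytes,
    k: int,
    rule: List[Tuple[int, int]],
    condition: Callable[[bytes], bool],
) -> bool:
    """True only if the data of every window satisfies the condition."""
    return all(condition(data[lo:hi + 1]) for lo, hi in windows_for(k, rule))


def example_condition(window: bytes) -> bool:
    # Hypothetical stand-in for C1..C11: the byte sum of the window is even.
    return sum(window) % 2 == 0
```

With this rule every window spans 170 bytes and consecutive windows are shifted by one byte, so a single check at k covers the bytes from k-176 to k+3.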
When at least part of the data in any one of the 11 windows does not satisfy the corresponding predetermined condition, for example Wi7[ki-172, ki-3] as shown in FIG. 26, N bytes are skipped from the potential split point ki along the search direction for a data stream split point, where N is not greater than ‖B7‖+max_x(‖Ax‖). In the embodiment shown in FIG. 26, the N bytes skipped are not greater than 185 bytes, and in this embodiment N=5. A new potential split point is thereby obtained, which is denoted kj to distinguish it from the potential split point ki. According to the rule preset on the deduplication server 103 in the embodiment shown in FIG. 25, the 11 windows determined for the potential split point kj are Wj1[kj-166, kj+3], Wj2[kj-167, kj+2], Wj3[kj-168, kj+1], Wj4[kj-169, kj], Wj5[kj-170, kj-1], Wj6[kj-171, kj-2], Wj7[kj-172, kj-3], Wj8[kj-173, kj-4], Wj9[kj-174, kj-5], Wj10[kj-175, kj-6] and Wj11[kj-176, kj-7].
It is judged whether at least part of the data in Wj1[kj-166, kj+3] satisfies the predetermined condition C1, whether at least part of the data in Wj2[kj-167, kj+2] satisfies the predetermined condition C2, and so on in the same way for each of the 11 windows, up to whether at least part of the data in Wj11[kj-176, kj-7] satisfies the predetermined condition C11. Of course, in this embodiment of the present invention, the same principle is followed when judging whether the potential split point ka is a data stream split point; the specific implementation is not described again, and reference may be made to the description of judging the potential split point ki.
When at least part of the data in every one of the windows Wj1 to Wj11 satisfies its corresponding predetermined condition C1 to C11, the current potential split point kj is a data stream split point, and the data between kj and ka forms one data block. The minimum block size of 4 KB is then skipped in the same manner as for ka to obtain the next potential split point, and whether that point is a data stream split point is judged according to the rule preset on the deduplication server 103. When the potential split point kj is judged not to be a data stream split point, the next potential split point is obtained in the same manner as for ki, and whether it is a data stream split point is judged according to the rule preset on the deduplication server 103 and the foregoing method. When no data stream split point has been found once the set maximum data block is exceeded, the end position of the maximum data block is used as a forced split point.
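The overall search loop of this embodiment (skip the minimum block, test the windows, jump N bytes on failure, force a split at the maximum block) might look like the sketch below. Several details are assumptions rather than statements from the text: the function name `find_split_points`, the maximum block size of 12288 bytes, treating the start of the stream as the first ka, leaving the tail of the stream as the last block, and passing the predetermined condition in as a callable.

```python
from typing import Callable, List, Tuple

Rule = List[Tuple[int, int]]  # (Ax, Bx) pairs defining windows Wx[k-Ax, k+Bx]

# FIG. 25 rule: (166, 3), (167, 2), ..., (176, -7).
FIG25_RULE: Rule = [(166 + n, 3 - n) for n in range(11)]


def find_split_points(
    data: bytes,
    rule: Rule,
    condition: Callable[[bytes], bool],
    min_chunk: int = 4096,    # minimum block size from the text (must be >= max Ax)
    max_chunk: int = 12288,   # hypothetical maximum block size
    jump: int = 5,            # N = 5 as in this embodiment
) -> List[int]:
    """Scan left to right and return the split points found (illustrative only)."""
    reach = max(b for _, b in rule)   # bytes a window may extend to the right of k
    splits: List[int] = []
    ka = 0                            # assumption: the stream start plays the role of ka
    while True:
        k = ka + min_chunk            # skip the minimum block after ka
        limit = ka + max_chunk        # end position of the maximum block
        found = None
        while k < limit and k + reach < len(data):
            # k is a split point only if every window satisfies the condition
            if all(condition(data[k - a:k + b + 1]) for a, b in rule):
                found = k
                break
            k += jump                 # some window failed: jump N bytes
        if found is None:
            if limit >= len(data):
                return splits         # assumption: the tail of the stream is the last block
            found = limit             # forced split at the end of the maximum block
        splits.append(found)
        ka = found
```

Each returned position closes one data block, so consecutive entries delimit the blocks handed to deduplication.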
On the basis of the search for a data stream split point shown in FIG. 3, in the embodiment shown in FIG. 27, a rule is preset on the deduplication server 103. The rule is: for a potential split point k, determine 11 windows Wx[k-Ax, k+Bx] and a predetermined condition Cx corresponding to each window Wx[k-Ax, k+Bx], where x is each of the consecutive natural numbers from 1 to 11, and A1=169, B1=0; A2=170, B2=-1; A3=171, B3=-2; A4=172, B4=-3; A5=173, B5=-4; A6=174, B6=-5; A7=175, B7=-6; A8=176, B8=-7; A9=177, B9=-8; A10=168, B10=1; A11=179, B11=3; and C1=C2=C3=C4=C5=C6=C7=C8=C9=C10≠C11. The 11 windows are therefore W1[k-169, k], W2[k-170, k-1], W3[k-171, k-2], W4[k-172, k-3], W5[k-173, k-4], W6[k-174, k-5], W7[k-175, k-6], W8[k-176, k-7], W9[k-177, k-8], W10[k-168, k+1] and W11[k-179, k+3].
ka is a data stream split point, and the search direction for a data stream split point shown in FIG. 27 is from left to right. After the minimum data block of 4 KB is skipped from the data stream split point ka, the end position of that minimum 4 KB data block is used as the next potential split point ki. In this embodiment, according to the rule preset on the deduplication server 103, windows Wix[ki-Ax, ki+Bx] are determined for the potential split point ki, where x is each of the consecutive natural numbers from 1 to 11. In the embodiment shown in FIG. 27, the 11 windows determined for the potential split point ki are Wi1[ki-169, ki], Wi2[ki-170, ki-1], Wi3[ki-171, ki-2], Wi4[ki-172, ki-3], Wi5[ki-173, ki-4], Wi6[ki-174, ki-5], Wi7[ki-175, ki-6], Wi8[ki-176, ki-7], Wi9[ki-177, ki-8], Wi10[ki-168, ki+1] and Wi11[ki-179, ki+3].
It is judged whether at least part of the data in Wi1[ki-169, ki] satisfies the predetermined condition C1, whether at least part of the data in Wi2[ki-170, ki-1] satisfies the predetermined condition C2, and so on in the same way for each of the 11 windows, up to whether at least part of the data in Wi10[ki-168, ki+1] satisfies the predetermined condition C10 and whether at least part of the data in Wi11[ki-179, ki+3] satisfies the predetermined condition C11. When at least part of the data in every one of the windows Wi1 to Wi11 satisfies its corresponding predetermined condition C1 to C11, the current potential split point ki is a data stream split point. When at least part of the data in window Wi11 does not satisfy the predetermined condition C11, 1 byte is skipped from the potential split point ki along the search direction for a data stream split point to obtain a new potential split point, which is denoted kj to distinguish it from the potential split point ki.
When at least part of the data in any one of the 10 windows Wi1, Wi2, Wi3, Wi4, Wi5, Wi6, Wi7, Wi8, Wi9 and Wi10 does not satisfy the corresponding predetermined condition, for example Wi4[ki-172, ki-3] as shown in FIG. 28, N bytes are skipped from the point ki along the search direction for a data stream split point, where N is not greater than ‖B4‖+max_x(‖Ax‖). In the embodiment shown in FIG. 28, the N bytes skipped are not greater than 182 bytes, and in this embodiment N=6. A new potential split point is thereby obtained, which is denoted kj to distinguish it from the potential split point ki. According to the rule preset on the deduplication server 103 in the embodiment shown in FIG. 27, the windows determined for the potential split point kj are Wj1[kj-169, kj], Wj2[kj-170, kj-1], Wj3[kj-171, kj-2], Wj4[kj-172, kj-3], Wj5[kj-173, kj-4], Wj6[kj-174, kj-5], Wj7[kj-175, kj-6], Wj8[kj-176, kj-7], Wj9[kj-177, kj-8], Wj10[kj-168, kj+1] and Wj11[kj-179, kj+3].
It is judged whether at least part of the data in Wj1[kj-169, kj] satisfies the predetermined condition C1, whether at least part of the data in Wj2[kj-170, kj-1] satisfies the predetermined condition C2, and so on in the same way for each of the 11 windows, up to whether at least part of the data in Wj10[kj-168, kj+1] satisfies the predetermined condition C10 and whether at least part of the data in Wj11[kj-179, kj+3] satisfies the predetermined condition C11. Of course, in this embodiment of the present invention, the same principle is followed when judging whether the potential split point ka is a data stream split point; the specific implementation is not described again, and reference may be made to the description of judging the potential split point ki.
When at least part of the data in every one of the windows Wj1 to Wj11 satisfies its corresponding predetermined condition C1 to C11, the current potential split point kj is a data stream split point, and the data between kj and ka forms one data block. The minimum block size of 4 KB is then skipped in the same manner as for ka to obtain the next potential split point, and whether that point is a data stream split point is judged according to the rule preset on the deduplication server 103. When the potential split point kj is judged not to be a data stream split point, the next potential split point is obtained in the same manner as for ki, and whether it is a data stream split point is judged according to the rule preset on the deduplication server 103 and the foregoing method. When no data stream split point has been found once the set maximum data block is exceeded, the end position of the maximum data block is used as a forced split point.
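The two jump lengths of the FIG. 27 embodiment, together with the bound ‖Bx‖+max_x(‖Ax‖), can be illustrated with the short sketch below. The names `FIG27_RULE`, `jump_bound` and `jump_length` are hypothetical, and the sketch assumes the caller already knows which window x failed; the text does not say how a simultaneous failure of W11 and one of the other windows is resolved, so that case is not modelled separately.

```python
from typing import List, Tuple

# (Ax, Bx) pairs of the FIG. 27 rule; index 0 is W1 and index 10 is W11.
FIG27_RULE: List[Tuple[int, int]] = (
    [(169 + n, -n) for n in range(9)] + [(168, 1), (179, 3)]
)


def jump_bound(rule: List[Tuple[int, int]], failed_x: int) -> int:
    """Upper bound ||B_x|| + max_x(||A_x||) on the jump when window W_x fails."""
    return abs(rule[failed_x - 1][1]) + max(abs(a) for a, _ in rule)


def jump_length(rule: List[Tuple[int, int]], failed_x: int, n: int = 6) -> int:
    """Jump 1 byte if W11 failed; otherwise jump N bytes, kept within the bound."""
    if failed_x == 11:
        return 1
    return min(n, jump_bound(rule, failed_x))
```

For the failure of Wi4 shown in FIG. 28, the bound evaluates to 3 + 179 = 182, which matches the 182 bytes given in the text, and the chosen N=6 lies well inside it.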
On the basis of the search for a data stream split point shown in FIG. 3, in the embodiment shown in FIG. 29, a rule is preset on the deduplication server 103. The rule is: for a potential split point k, determine 11 windows Wx[px-Ax, px+Bx] and a predetermined condition Cx corresponding to each window Wx[px-Ax, px+Bx], where x is each of the consecutive natural numbers from 1 to 11, the probability that at least part of the data in a window Wx[px-Ax, px+Bx] satisfies the predetermined condition is 1/2, and A1=169, B1=0; A2=171, B2=-2; A3=173, B3=-4; A4=175, B4=-6; A5=177, B5=-8; A6=179, B6=-10; A7=181, B7=-12; A8=183, B8=-14; A9=185, B9=-16; A10=187, B10=-18; A11=189, B11=-20; and C1=C2=C3=C4=C5=C6=C7=C8=C9=C10=C11. The 11 windows are therefore W1[k-169, k], W2[k-171, k-2], W3[k-173, k-4], W4[k-175, k-6], W5[k-177, k-8], W6[k-179, k-10], W7[k-181, k-12], W8[k-183, k-14], W9[k-185, k-16], W10[k-187, k-18] and W11[k-189, k-20].
ka is a data stream split point, and the search direction for a data stream split point shown in FIG. 29 is from left to right. After the minimum data block of 4 KB is skipped from the data stream split point ka, the end position of that minimum 4 KB data block is used as the next potential split point ki, and points pix are determined for the potential split point ki, where, according to the rule preset on the deduplication server 103 in this embodiment, x is each of the consecutive natural numbers from 1 to 11. In the embodiment shown in FIG. 29, according to the preset rule, the 11 windows determined for the potential split point ki are Wi1[ki-169, ki], Wi2[ki-171, ki-2], Wi3[ki-173, ki-4], Wi4[ki-175, ki-6], Wi5[ki-177, ki-8], Wi6[ki-179, ki-10], Wi7[ki-181, ki-12], Wi8[ki-183, ki-14], Wi9[ki-185, ki-16], Wi10[ki-187, ki-18] and Wi11[ki-189, ki-20].
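As a rough worked estimate based on the per-window probability of 1/2 stated above: if the outcomes of the 11 window checks are treated as independent (an assumption made here for illustration; the text gives only the per-window probability, and the windows overlap), the probability that a given potential split point passes all 11 checks is:

```latex
P(\text{all 11 windows satisfy } C_x)
  = \prod_{x=1}^{11} \frac{1}{2}
  = \left(\frac{1}{2}\right)^{11}
  = \frac{1}{2048}
```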
It is judged whether at least part of the data in Wi1[ki-169, ki] satisfies the predetermined condition C1, whether at least part of the data in Wi2[ki-171, ki-2] satisfies the predetermined condition C2, and so on in the same way for each of the 11 windows, up to whether at least part of the data in Wi11[ki-189, ki-20] satisfies the predetermined condition C11.
When at least part of the data in every one of the windows Wi1 to Wi11 satisfies its corresponding predetermined condition C1 to C11, the current potential split point ki is a data stream split point. When at least part of the data in any one of the 11 windows does not satisfy the corresponding predetermined condition, for example when at least part of the data in Wi4[ki-175, ki-6] does not satisfy the predetermined condition C4 as shown in FIG. 30, the next potential split point is selected, which is denoted kj to distinguish it from the potential split point ki; kj is located to the right of ki, at a distance of 1 byte from ki. As shown in FIG. 30, according to the rule preset for the deduplication server 103, the 11 windows determined for the potential split point kj are Wj1[kj-169, kj], Wj2[kj-171, kj-2], Wj3[kj-173, kj-4], Wj4[kj-175, kj-6], Wj5[kj-177, kj-8], Wj6[kj-179, kj-10], Wj7[kj-181, kj-12], Wj8[kj-183, kj-14], Wj9[kj-185, kj-16], Wj10[kj-187, kj-18] and Wj11[kj-189, kj-20], and C1=C2=C3=C4=C5=C6=C7=C8=C9=C10=C11.
It is judged whether at least part of the data in Wj1[kj-169, kj] satisfies the predetermined condition C1, whether at least part of the data in Wj2[kj-171, kj-2] satisfies the predetermined condition C2, and so on in the same way for each of the 11 windows, up to whether at least part of the data in Wj11[kj-189, kj-20] satisfies the predetermined condition C11.
When at least part of the data in every one of the windows Wj1 to Wj11 satisfies its corresponding predetermined condition C1 to C11, the current potential split point kj is a data stream split point. When at least part of the data in any one of the windows Wj1, Wj2, Wj3, Wj4, Wj5, Wj6, Wj7, Wj8, Wj9, Wj10 and Wj11 does not satisfy the predetermined condition, for example when at least part of the data in Wj3[kj-173, kj-4] does not satisfy the predetermined condition C3 as shown in FIG. 31, then, kj being located to the right of ki, N bytes are skipped from ki along the search direction for a data stream split point, where N is not greater than ‖B4‖+max_x(‖Ax‖). In the embodiment shown in FIG. 28, N is not greater than 195 bytes, and in this embodiment N=15. The next potential split point is thereby obtained, which is denoted kl to distinguish it from the potential split points ki and kj. According to the rule preset for the deduplication server 103 in the embodiment shown in FIG. 29, the 11 windows determined for the potential split point kl are Wl1[kl-169, kl], Wl2[kl-171, kl-2], Wl3[kl-173, kl-4], Wl4[kl-175, kl-6], Wl5[kl-177, kl-8], Wl6[kl-179, kl-10], Wl7[kl-181, kl-12], Wl8[kl-183, kl-14], Wl9[kl-185, kl-16], Wl10[kl-187, kl-18] and Wl11[kl-189, kl-20]. It is judged whether at least part of the data in Wl1[kl-169, kl] satisfies the predetermined condition C1, whether at least part of the data in Wl2[kl-171, kl-2] satisfies the predetermined condition C2, and so on in the same way for each of the 11 windows, up to whether at least part of the data in Wl11[kl-189, kl-20] satisfies the predetermined condition C11.
When at least part of the data in each of the windows $W_{l1}$, $W_{l2}$, $W_{l3}$, $W_{l4}$, $W_{l5}$, $W_{l6}$, $W_{l7}$, $W_{l8}$, $W_{l9}$, $W_{l10}$ and $W_{l11}$ meets its corresponding predetermined condition $C_1$ through $C_{11}$, the current potential segmentation point $k_l$ is a data stream segmentation point. When at least part of the data in any of the windows $W_{l1}$ through $W_{l11}$ does not meet its predetermined condition, the next potential segmentation point is selected. To distinguish it from the potential segmentation points $k_i$, $k_j$ and $k_l$, it is denoted $k_m$; $k_m$ is located to the right of $k_l$, and the distance between $k_m$ and $k_l$ is 1 byte. According to the rule preset on the deduplication server 103 in the embodiment shown in FIG. 29, the 11 windows determined for the potential segmentation point $k_m$ are $W_{m1}[k_m-169, k_m]$, $W_{m2}[k_m-171, k_m-2]$, $W_{m3}[k_m-173, k_m-4]$, $W_{m4}[k_m-175, k_m-6]$, $W_{m5}[k_m-177, k_m-8]$, $W_{m6}[k_m-179, k_m-10]$, $W_{m7}[k_m-181, k_m-12]$, $W_{m8}[k_m-183, k_m-14]$, $W_{m9}[k_m-185, k_m-16]$, $W_{m10}[k_m-187, k_m-18]$ and $W_{m11}[k_m-189, k_m-20]$.
For each window, it is judged whether at least part of its data meets the corresponding predetermined condition: $W_{m1}[k_m-169, k_m]$ against $C_1$, $W_{m2}[k_m-171, k_m-2]$ against $C_2$, $W_{m3}[k_m-173, k_m-4]$ against $C_3$, $W_{m4}[k_m-175, k_m-6]$ against $C_4$, $W_{m5}[k_m-177, k_m-8]$ against $C_5$, $W_{m6}[k_m-179, k_m-10]$ against $C_6$, $W_{m7}[k_m-181, k_m-12]$ against $C_7$, $W_{m8}[k_m-183, k_m-14]$ against $C_8$, $W_{m9}[k_m-185, k_m-16]$ against $C_9$, $W_{m10}[k_m-187, k_m-18]$ against $C_{10}$, and $W_{m11}[k_m-189, k_m-20]$ against $C_{11}$.
When at least part of the data in each of the windows $W_{m1}$, $W_{m2}$, $W_{m3}$, $W_{m4}$, $W_{m5}$, $W_{m6}$, $W_{m7}$, $W_{m8}$, $W_{m9}$, $W_{m10}$ and $W_{m11}$ meets its corresponding predetermined condition $C_1$ through $C_{11}$, the current potential segmentation point $k_m$ is a data stream segmentation point. When at least part of the data in any window does not meet its predetermined condition, a jump is performed according to the scheme described above to obtain the next potential segmentation point, and it is determined whether that point is a data stream segmentation point.
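The window layout used in the paragraphs above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `windows_for` and `is_cut_point` and the callback `window_meets_condition` are hypothetical, and the only facts taken from the text are the 11 windows, the 169-byte offset, and the 2-byte shift between consecutive windows.

```python
def windows_for(k, count=11, width=169, step=2):
    """Return (z, start, end) for the windows W_z[k - width - step*(z-1), k - step*(z-1)].

    With the defaults this reproduces the bounds listed in the text:
    W_1[k-169, k], W_2[k-171, k-2], ..., W_11[k-189, k-20].
    """
    return [(z, k - width - step * (z - 1), k - step * (z - 1))
            for z in range(1, count + 1)]


def is_cut_point(k, window_meets_condition):
    """k is a cut point only if every window meets its own condition C_z.

    window_meets_condition(z, start, end) is a hypothetical callback that
    stands in for the test of condition C_z; all() stops at the first
    window that fails, so later windows are not evaluated.
    """
    return all(window_meets_condition(z, start, end)
               for z, start, end in windows_for(k))
```

For example, `windows_for(1000)` starts with `(1, 831, 1000)` and ends with `(11, 811, 980)`. What happens after a failed window (the jump to the next potential segmentation point) is left to the caller, because the jump length depends on the embodiment.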
An embodiment of the present invention provides a method for judging whether at least part of the data in a window $W_{iz}[k_i-A_z, k_i+B_z]$ meets the predetermined condition $C_z$. In this embodiment, a random function is used for the judgment. Taking the implementation shown in FIG. 21 as an example, according to the rule preset on the deduplication server 103, the window $W_{i1}[k_i-169, k_i]$ is determined for the potential segmentation point $k_i$, and it is judged whether at least part of the data in $W_{i1}[k_i-169, k_i]$ meets the predetermined condition $C_1$. As shown in FIG. 32, $W_{i1}$ denotes the window $W_{i1}[k_i-169, k_i]$. To judge whether at least part of the data in $W_{i1}[k_i-169, k_i]$ meets $C_1$, 5 bytes are selected; each "■" in FIG. 32 marks one selected byte, and two adjacent selected bytes are 42 bytes apart. The selected 5 bytes of data are reused 51 times to obtain 255 bytes in total, which increases randomness. Each byte consists of 8 bits, denoted $a_{m,1}, \ldots, a_{m,8}$, the 1st to 8th bits of the $m$-th of the 255 bytes, so the bits of the 255 bytes can be expressed as:

$$\begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,8} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,8} \\ \vdots & \vdots & & \vdots \\ a_{255,1} & a_{255,2} & \cdots & a_{255,8} \end{pmatrix}$$

When $a_{m,n}=1$, $V_{a_{m,n}}=1$; when $a_{m,n}=0$, $V_{a_{m,n}}=-1$, where $a_{m,n}$ denotes any one of $a_{m,1}, \ldots, a_{m,8}$. Applying this conversion to the bits of the 255 bytes gives the matrix $V_a$:

$$V_a = \begin{pmatrix} V_{a_{1,1}} & V_{a_{1,2}} & \cdots & V_{a_{1,8}} \\ V_{a_{2,1}} & V_{a_{2,2}} & \cdots & V_{a_{2,8}} \\ \vdots & \vdots & & \vdots \\ V_{a_{255,1}} & V_{a_{255,2}} & \cdots & V_{a_{255,8}} \end{pmatrix}$$

A large number of random numbers are selected to form a matrix; once formed, this matrix of random data remains unchanged. For example, 255*8 random numbers are selected from random numbers that follow a specific distribution (a normal distribution is taken as the example here) to form the matrix $R$:

$$R = \begin{pmatrix} h_{1,1} & h_{1,2} & \cdots & h_{1,8} \\ h_{2,1} & h_{2,2} & \cdots & h_{2,8} \\ \vdots & \vdots & & \vdots \\ h_{255,1} & h_{255,2} & \cdots & h_{255,8} \end{pmatrix}$$

The entries of the $m$-th row of $V_a$ are multiplied by the random numbers in the $m$-th row of $R$ and the products are summed to obtain one value, $S_{a_m} = V_{a_{m,1}} h_{m,1} + V_{a_{m,2}} h_{m,2} + \cdots + V_{a_{m,8}} h_{m,8}$. In this way $S_{a_1}, S_{a_2}, \ldots, S_{a_{255}}$ are obtained, and the number $K$ of values among them that satisfy a specific condition (being greater than 0 is taken as the example here) is counted. Because the matrix $R$ follows a normal distribution, $S_{a_m}$ also follows a normal distribution. According to probability theory, a normally distributed random number is greater than 0 with probability 1/2, so each of $S_{a_1}, S_{a_2}, \ldots, S_{a_{255}}$ is greater than 0 with probability 1/2, and $K$ follows a binomial distribution:

$$P(K=k) = \binom{255}{k}\left(\tfrac{1}{2}\right)^{k}\left(\tfrac{1}{2}\right)^{255-k}$$

Based on the count, it is judged whether the number $K$ of values among $S_{a_1}, S_{a_2}, \ldots, S_{a_{255}}$ that are greater than 0 is even. A random number following this binomial distribution is even with probability 1/2, so $K$ satisfies the condition with probability 1/2. When $K$ is even, at least part of the data in $W_{i1}[k_i-169, k_i]$ meets the predetermined condition $C_1$; when $K$ is odd, at least part of the data in $W_{i1}[k_i-169, k_i]$ does not meet $C_1$. Here $C_1$ means that the number $K$ of values among $S_{a_1}, S_{a_2}, \ldots, S_{a_{255}}$, obtained as described above, that are greater than 0 is even.

In the embodiment shown in FIG. 21, the windows $W_{i1}[k_i-169, k_i]$, $W_{i2}[k_i-170, k_i-1]$, $W_{i3}[k_i-171, k_i-2]$, $W_{i4}[k_i-172, k_i-3]$, $W_{i5}[k_i-173, k_i-4]$, $W_{i6}[k_i-174, k_i-5]$, $W_{i7}[k_i-175, k_i-6]$, $W_{i8}[k_i-176, k_i-7]$, $W_{i9}[k_i-177, k_i-8]$, $W_{i10}[k_i-178, k_i-9]$ and $W_{i11}[k_i-179, k_i-10]$ all have the same size, namely 169 bytes, and the way of judging whether at least part of the data in a window meets the predetermined condition is also the same; see the description above of judging whether at least part of the data in $W_{i1}[k_i-169, k_i]$ meets $C_1$. Accordingly, the corresponding marker in FIG. 32 indicates one byte selected when judging whether at least part of the data in the window $W_{i2}[k_i-170, k_i-1]$ meets the predetermined condition $C_2$, and two adjacent selected bytes are 42 bytes apart. The selected 5 bytes of data are reused 51 times to obtain 255 bytes in total, which increases randomness. Each byte consists of 8 bits, denoted $b_{m,1}, \ldots, b_{m,8}$, the 1st to 8th bits of the $m$-th of the 255 bytes, so the bits of the 255 bytes form a 255*8 matrix with entries $b_{m,n}$. When $b_{m,n}=1$, $V_{b_{m,n}}=1$; when $b_{m,n}=0$, $V_{b_{m,n}}=-1$, where $b_{m,n}$ denotes any one of $b_{m,1}, \ldots, b_{m,8}$. Applying this conversion gives the 255*8 matrix $V_b$ with entries $V_{b_{m,n}}$. Because the way of judging $W_{i1}[k_i-169, k_i]$ is the same as the way of judging $W_{i2}[k_i-170, k_i-1]$, the same matrix $R$ is used: the entries of the $m$-th row of $V_b$ are multiplied by the random numbers in the $m$-th row of $R$ and summed to obtain $S_{b_m} = V_{b_{m,1}} h_{m,1} + V_{b_{m,2}} h_{m,2} + \cdots + V_{b_{m,8}} h_{m,8}$. In this way $S_{b_1}, S_{b_2}, \ldots, S_{b_{255}}$ are obtained, and the number $K$ of values among them that satisfy the specific condition (greater than 0 in this example) is counted. Because $R$ follows a normal distribution, $S_{b_m}$ also follows a normal distribution, each of $S_{b_1}, S_{b_2}, \ldots, S_{b_{255}}$ is greater than 0 with probability 1/2, and $K$ follows the binomial distribution $P(K=k) = \binom{255}{k}\left(\tfrac{1}{2}\right)^{k}\left(\tfrac{1}{2}\right)^{255-k}$. It is judged whether $K$ is even; a random number following this binomial distribution is even with probability 1/2, so $K$ satisfies the condition with probability 1/2. When $K$ is even, at least part of the data in $W_{i2}[k_i-170, k_i-1]$ meets the predetermined condition $C_2$; when $K$ is odd, at least part of the data in $W_{i2}[k_i-170, k_i-1]$ does not meet $C_2$. Here $C_2$ means that the number $K$ of values among $S_{b_1}, S_{b_2}, \ldots, S_{b_{255}}$, obtained as described above, that are greater than 0 is even. In the implementation shown in FIG. 21, at least part of the data in $W_{i2}[k_i-170, k_i-1]$ meets the predetermined condition $C_2$.
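The random-function test described above can be sketched in a few lines. This is an illustration under stated assumptions, not the patented implementation: the 169-byte window is passed in as a byte string whose bytes are numbered 1 to 169, the selected positions 169, 127, 85, 43 and 1 follow the numbering the document gives for FIG. 32, the "1st bit" is taken to be the most significant bit, and the 5 bytes are reused in the order they were selected. All function names are hypothetical.

```python
import random

SELECTED_SERIALS = (169, 127, 85, 43, 1)  # 1-based positions, 42 bytes apart
REUSE = 51                                # 5 bytes * 51 = 255 bytes


def make_matrix(seed=0, rows=255, cols=8):
    """Fixed 255x8 matrix R of normally distributed numbers h[m][n].

    The seed makes R reproducible; the text only requires that R stays
    unchanged once it has been formed.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def count_positive(window, matrix):
    """Return K, the number of row sums S_m that are greater than 0."""
    if len(window) != 169:
        raise ValueError("expected a 169-byte window")
    expanded = [window[s - 1] for s in SELECTED_SERIALS] * REUSE
    k = 0
    for m, byte in enumerate(expanded):
        s_m = 0.0
        for n in range(8):
            bit = (byte >> (7 - n)) & 1   # assumption: bit 1 is the MSB
            v = 1 if bit else -1          # bit 1 -> +1, bit 0 -> -1
            s_m += v * matrix[m][n]
        if s_m > 0:
            k += 1
    return k


def window_meets_condition(window, matrix):
    """Condition C_z holds when K is even."""
    return count_positive(window, matrix) % 2 == 0
```

Because the matrix is fixed, the same window content always gives the same answer, which is what lets two servers find the same segmentation points in identical data.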
Likewise, the corresponding marker in FIG. 32 indicates one byte selected when judging whether at least part of the data in the window $W_{i3}[k_i-171, k_i-2]$ meets the predetermined condition $C_3$, and two adjacent selected bytes are 42 bytes apart. The selected 5 bytes of data are reused 51 times to obtain 255 bytes in total, which increases randomness. The method used for the windows $W_{i1}[k_i-169, k_i]$ and $W_{i2}[k_i-170, k_i-1]$ is then applied to judge whether at least part of the data in $W_{i3}[k_i-171, k_i-2]$ meets the predetermined condition $C_3$. In the implementation shown in FIG. 21, at least part of the data in $W_{i3}[k_i-171, k_i-2]$ meets the predetermined condition $C_3$. The corresponding marker in FIG. 32 indicates one byte selected when judging whether at least part of the data in the window $W_{i4}[k_i-172, k_i-3]$ meets the predetermined condition $C_4$, and two adjacent selected bytes are 42 bytes apart. The selected 5 bytes of data are reused 51 times to obtain 255 bytes in total, which increases randomness. The method used for the windows $W_{i1}[k_i-169, k_i]$, $W_{i2}[k_i-170, k_i-1]$ and $W_{i3}[k_i-171, k_i-2]$ is then applied to judge whether at least part of the data in $W_{i4}[k_i-172, k_i-3]$ meets the predetermined condition $C_4$. In the implementation shown in FIG. 21, at least part of the data in $W_{i4}[k_i-172, k_i-3]$ meets the predetermined condition $C_4$. The corresponding marker in FIG. 32 indicates one byte selected when judging whether at least part of the data in the window $W_{i5}[k_i-173, k_i-4]$ meets the predetermined condition $C_5$, and two adjacent selected bytes are 42 bytes apart. The selected 5 bytes of data are reused 51 times to obtain 255 bytes in total, which increases randomness. The method used for the windows $W_{i1}[k_i-169, k_i]$, $W_{i2}[k_i-170, k_i-1]$, $W_{i3}[k_i-171, k_i-2]$ and $W_{i4}[k_i-172, k_i-3]$ is then applied to judge whether at least part of the data in $W_{i5}[k_i-173, k_i-4]$ meets the predetermined condition $C_5$. In the embodiment shown in FIG. 21, at least part of the data in $W_{i5}[k_i-173, k_i-4]$ does not meet the predetermined condition $C_5$.
When at least part of the data in $W_{i5}[k_i-173, k_i-4]$ does not meet the predetermined condition $C_5$, 7 bytes are jumped from the point $p_{i5}$ along the search direction for data stream segmentation points, and the next potential segmentation point $k_j$ is obtained at the end position of the 7th byte. As shown in FIG. 22, according to the rule preset for the deduplication server 103, the window $W_{j1}[k_j-169, k_j]$ is determined for the potential segmentation point $k_j$. The way of judging whether at least part of the data in $W_{j1}[k_j-169, k_j]$ meets the predetermined condition $C_1$ is the same as the way of judging whether at least part of the data in $W_{i1}[k_i-169, k_i]$ meets $C_1$. Therefore, as shown in FIG. 33, $W_{j1}$ denotes the window $W_{j1}[k_j-169, k_j]$; to judge whether at least part of the data in $W_{j1}[k_j-169, k_j]$ meets $C_1$, 5 bytes are selected. Each "■" in FIG. 33 marks one selected byte, and two adjacent selected bytes are 42 bytes apart. The selected 5 bytes of data are reused 51 times to obtain 255 bytes in total, which increases randomness. Each byte consists of 8 bits, denoted $a'_{m,1}, \ldots, a'_{m,8}$, the 1st to 8th bits of the $m$-th of the 255 bytes, so the bits of the 255 bytes form a 255*8 matrix with entries $a'_{m,n}$. When $a'_{m,n}=1$, $V_{a'_{m,n}}=1$; when $a'_{m,n}=0$, $V_{a'_{m,n}}=-1$, where $a'_{m,n}$ denotes any one of $a'_{m,1}, \ldots, a'_{m,8}$. Applying this conversion gives the 255*8 matrix $V_{a'}$ with entries $V_{a'_{m,n}}$. Because the way of judging $W_{j1}[k_j-169, k_j]$ is the same as the way of judging $W_{i1}[k_i-169, k_i]$, the same matrix $R$ is used: the entries of the $m$-th row of $V_{a'}$ are multiplied by the random numbers in the $m$-th row of $R$ and summed to obtain $S_{a'_m} = V_{a'_{m,1}} h_{m,1} + V_{a'_{m,2}} h_{m,2} + \cdots + V_{a'_{m,8}} h_{m,8}$. In this way $S_{a'_1}, S_{a'_2}, \ldots, S_{a'_{255}}$ are obtained, and the number $K$ of values among them that satisfy the specific condition (greater than 0 in this example) is counted. Because $R$ follows a normal distribution, $S_{a'_m}$ also follows a normal distribution. According to probability theory, a normally distributed random number is greater than 0 with probability 1/2, so each of $S_{a'_1}, S_{a'_2}, \ldots, S_{a'_{255}}$ is greater than 0 with probability 1/2, and $K$ follows the binomial distribution $P(K=k) = \binom{255}{k}\left(\tfrac{1}{2}\right)^{k}\left(\tfrac{1}{2}\right)^{255-k}$. It is judged whether $K$ is even; a random number following this binomial distribution is even with probability 1/2, so $K$ satisfies the condition with probability 1/2. When $K$ is even, at least part of the data in $W_{j1}[k_j-169, k_j]$ meets the predetermined condition $C_1$; when $K$ is odd, at least part of the data in $W_{j1}[k_j-169, k_j]$ does not meet $C_1$.
The way of judging whether at least part of the data in $W_{i2}[k_i-170, k_i-1]$ meets the predetermined condition $C_2$ is the same as the way of judging whether at least part of the data in $W_{j2}[k_j-170, k_j-1]$ meets $C_2$. Therefore, the corresponding marker in FIG. 33 indicates one byte selected when judging whether at least part of the data in the window $W_{j2}[k_j-170, k_j-1]$ meets $C_2$, and two adjacent selected bytes are 42 bytes apart. The selected 5 bytes of data are reused 51 times to obtain 255 bytes in total, which increases randomness. Each byte consists of 8 bits, denoted $b'_{m,1}, \ldots, b'_{m,8}$, the 1st to 8th bits of the $m$-th of the 255 bytes, so the bits of the 255 bytes form a 255*8 matrix with entries $b'_{m,n}$. When $b'_{m,n}=1$, $V_{b'_{m,n}}=1$; when $b'_{m,n}=0$, $V_{b'_{m,n}}=-1$, where $b'_{m,n}$ denotes any one of $b'_{m,1}, \ldots, b'_{m,8}$. Applying this conversion gives the 255*8 matrix $V_{b'}$ with entries $V_{b'_{m,n}}$. Because the windows $W_{i2}[k_i-170, k_i-1]$ and $W_{j2}[k_j-170, k_j-1]$ are judged against $C_2$ in the same way, the matrix $R$ is used again: the entries of the $m$-th row of $V_{b'}$ are multiplied by the random numbers in the $m$-th row of $R$ and summed to obtain $S_{b'_m} = V_{b'_{m,1}} h_{m,1} + V_{b'_{m,2}} h_{m,2} + \cdots + V_{b'_{m,8}} h_{m,8}$. In this way $S_{b'_1}, S_{b'_2}, \ldots, S_{b'_{255}}$ are obtained, and the number $K$ of values among them that satisfy the specific condition (greater than 0 in this example) is counted. Because $R$ follows a normal distribution, $S_{b'_m}$ also follows a normal distribution. According to probability theory, a normally distributed random number is greater than 0 with probability 1/2, so each of $S_{b'_1}, S_{b'_2}, \ldots, S_{b'_{255}}$ is greater than 0 with probability 1/2, and $K$ follows the binomial distribution $P(K=k) = \binom{255}{k}\left(\tfrac{1}{2}\right)^{k}\left(\tfrac{1}{2}\right)^{255-k}$. It is judged whether $K$ is even; a random number following this binomial distribution is even with probability 1/2, so $K$ satisfies the condition with probability 1/2. When $K$ is even, at least part of the data in $W_{j2}[k_j-170, k_j-1]$ meets the predetermined condition $C_2$; when $K$ is odd, at least part of the data in $W_{j2}[k_j-170, k_j-1]$ does not meet $C_2$. Similarly, the way of judging whether at least part of the data in $W_{i3}[k_i-171, k_i-2]$ meets the predetermined condition $C_3$ is the same as the way of judging whether at least part of the data in $W_{j3}[k_j-171, k_j-2]$ meets $C_3$. The same applies to judging whether at least part of the data in $W_{j4}[k_j-172, k_j-3]$ meets $C_4$, in $W_{j5}[k_j-173, k_j-4]$ meets $C_5$, in $W_{j6}[k_j-174, k_j-5]$ meets $C_6$, in $W_{j7}[k_j-175, k_j-6]$ meets $C_7$, in $W_{j8}[k_j-176, k_j-7]$ meets $C_8$, in $W_{j9}[k_j-177, k_j-8]$ meets $C_9$, in $W_{j10}[k_j-178, k_j-9]$ meets $C_{10}$, and in $W_{j11}[k_j-179, k_j-10]$ meets $C_{11}$; details are not repeated here.
In this embodiment, a random function is used to judge whether at least part of the data in the window $W_{iz}[k_i-A_z, k_i+B_z]$ meets the predetermined condition $C_z$. Still taking the implementation shown in FIG. 21 as an example, according to the rule preset on the deduplication server 103, the window $W_{i1}[k_i-169, k_i]$ is determined for the potential segmentation point $k_i$, and it is judged whether at least part of the data in $W_{i1}[k_i-169, k_i]$ meets the predetermined condition $C_1$. As shown in FIG. 32, $W_{i1}$ denotes the window $W_{i1}[k_i-169, k_i]$. To judge whether at least part of the data in $W_{i1}[k_i-169, k_i]$ meets $C_1$, 5 bytes are selected; each "■" in FIG. 32 marks one selected byte, and two adjacent selected "■" bytes are 42 bytes apart. One implementation applies a hash function to the selected 5 bytes. The values produced by the hash function follow a fixed uniform distribution. If the value produced by the hash function is even, at least part of the data in $W_{i1}[k_i-169, k_i]$ is judged to meet the predetermined condition $C_1$; that is, $C_1$ means that the value computed with the hash function in the above manner is even. Therefore, the probability that at least part of the data in $W_{i1}[k_i-169, k_i]$ meets the predetermined condition is 1/2. In the embodiment shown in FIG. 21, the hash function is likewise used to judge whether at least part of the data in $W_{i2}[k_i-170, k_i-1]$ meets the predetermined condition $C_2$, whether at least part of the data in $W_{i3}[k_i-171, k_i-2]$ meets $C_3$, whether at least part of the data in $W_{i4}[k_i-172, k_i-3]$ meets $C_4$, and whether at least part of the data in $W_{i5}[k_i-173, k_i-4]$ meets $C_5$. For the specific implementation, refer to the description of using the hash function to judge whether at least part of the data in $W_{i1}[k_i-169, k_i]$ meets the predetermined condition $C_1$ in the embodiment shown in FIG. 21; details are not repeated here.
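The hash-based variant can be sketched as below. The source does not name a hash function, so SHA-256 from Python's standard library is used purely as a stand-in; the function names and the choice of positions 169, 127, 85, 43 and 1 (the numbering the document gives for FIG. 32, with bytes numbered from 1) are likewise assumptions made for illustration.

```python
import hashlib

SELECTED_SERIALS = (169, 127, 85, 43, 1)  # 1-based positions, 42 bytes apart


def select_bytes(window):
    """Pick the 5 selected bytes out of a 169-byte window."""
    if len(window) != 169:
        raise ValueError("expected a 169-byte window")
    return bytes(window[s - 1] for s in SELECTED_SERIALS)


def meets_condition_hash(window):
    """Condition holds when the hash value of the 5 selected bytes is even.

    SHA-256 is a stand-in for the unspecified hash function; its output
    is interpreted as a big-endian unsigned integer.
    """
    digest = hashlib.sha256(select_bytes(window)).digest()
    return int.from_bytes(digest, "big") % 2 == 0
```

Only the 5 selected bytes influence the result, so changing any other byte of the window leaves the decision unchanged. That is the sense in which the test looks at "at least part of the data" in the window.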
When at least part of the data in Wi5[ki-173, ki-4] does not satisfy the predetermined condition C5, the search jumps 7 bytes from the potential segmentation point ki along the search direction for data stream segmentation points, and the current potential segmentation point kj is obtained at the end position of the 7th byte, as shown in Figure 22. According to the rule preset for the deduplication server 103, the window Wj1[kj-169, kj] is determined for the potential segmentation point kj. Whether at least part of the data in Wj1[kj-169, kj] satisfies the predetermined condition C1 is judged in the same way as for the window Wi1[ki-169, ki]. Therefore, as shown in Figure 33, Wj1 denotes the window Wj1[kj-169, kj], and 5 bytes are selected to make the judgment. In Figure 33, each "■" denotes one selected byte, and two adjacent selected bytes "■" are 42 bytes apart. A hash function is applied to the 5 bytes selected from Wj1[kj-169, kj]; if the resulting value is even, at least part of the data in Wj1[kj-169, kj] satisfies the predetermined condition C1.
Whether at least part of the data in Wj2[kj-170, kj-1] satisfies the predetermined condition C2 is judged in the same way as for Wi2[ki-170, ki-1]. In Figure 33, the marker for this window denotes one byte selected when judging Wj2[kj-170, kj-1] against C2, and two adjacent selected bytes are 42 bytes apart. A hash function is applied to the 5 selected bytes; if the resulting value is even, at least part of the data in Wj2[kj-170, kj-1] satisfies C2.
Whether at least part of the data in Wj3[kj-171, kj-2] satisfies the predetermined condition C3 is judged in the same way as for Wi3[ki-171, ki-2]. In Figure 33, the marker for this window denotes one byte selected when judging Wj3[kj-171, kj-2] against C3, and two adjacent selected bytes are 42 bytes apart. A hash function is applied to the 5 selected bytes; if the resulting value is even, at least part of the data in Wj3[kj-171, kj-2] satisfies C3.
Whether at least part of the data in Wj4[kj-172, kj-3] satisfies the predetermined condition C4 is judged in the same way as for Wi4[ki-172, ki-3]. In Figure 33, the marker for this window denotes one byte selected when judging Wj4[kj-172, kj-3] against C4, and two adjacent selected bytes are 42 bytes apart. A hash function is applied to the 5 selected bytes; if the resulting value is even, at least part of the data in Wj4[kj-172, kj-3] satisfies C4.
The same method is used to judge Wj5[kj-173, kj-4] against C5, Wj6[kj-174, kj-5] against C6, Wj7[kj-175, kj-6] against C7, Wj8[kj-176, kj-7] against C8, Wj9[kj-177, kj-8] against C9, Wj10[kj-178, kj-9] against C10, and Wj11[kj-179, kj-10] against C11; the details are not repeated here.
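The byte selection and parity test described above can be sketched in Python. This is only an illustration under stated assumptions: the text does not name the hash function, so CRC-32 from the standard library stands in for it, and the serial numbers shown in the figures (169, 127, 85, 43, 1 for the first window) are assumed to count backwards from the potential segmentation point `k`. The function names are hypothetical.

```python
import zlib

STRIDE = 42       # gap between adjacent selected bytes, per the text
NUM_SAMPLES = 5   # bytes selected per window, per the text


def selected_serials(z):
    """Serial numbers of the bytes sampled for window z (z = 1..11).

    Window 1 uses 169, 127, 85, 43, 1; window 2 uses 170, 128, 86, 44, 2.
    """
    top = 168 + z
    return [top - STRIDE * i for i in range(NUM_SAMPLES)]


def window_satisfies(data, k, z):
    """Return True if the hash of the 5 sampled bytes is even.

    Assumption: serial number s denotes the byte data[k - s].
    Assumption: CRC-32 stands in for the unspecified hash function.
    """
    sample = bytes(data[k - s] for s in selected_serials(z))
    return zlib.crc32(sample) % 2 == 0
```

Because only 5 bytes per window are hashed rather than the whole window, each check touches a small, fixed amount of data regardless of the window length.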
In this embodiment, a random function is used to judge whether at least part of the data in the window Wiz[ki-Az, ki+Bz] satisfies the predetermined condition Cz. Taking the implementation shown in Figure 21 as an example, according to the rule preset on the deduplication server 103, the window Wi1[ki-169, ki] is determined for the potential segmentation point ki, and it is judged whether at least part of the data in Wi1[ki-169, ki] satisfies the predetermined condition C1. As shown in Figure 32, Wi1 denotes the window Wi1[ki-169, ki], and 5 bytes are selected to make the judgment. In Figure 32, the bytes "■" with serial numbers 169, 127, 85, 43 and 1 each denote one selected byte, and two adjacent selected bytes are 42 bytes apart. The bytes "■" with serial numbers 169, 127, 85, 43 and 1 are each converted into a decimal value, denoted a1, a2, a3, a4 and a5 respectively. Because one byte consists of 8 bits, each byte "■" taken as a value gives 0 ≤ ar ≤ 255 for every ar among a1, a2, a3, a4 and a5. The values a1, a2, a3, a4 and a5 form a 1*5 matrix. 256*5 random numbers are drawn from random numbers that follow a binomial distribution to form the matrix R, expressed as:

$$
R = \begin{pmatrix}
h_{0,1} & h_{0,2} & h_{0,3} & h_{0,4} & h_{0,5} \\
h_{1,1} & h_{1,2} & h_{1,3} & h_{1,4} & h_{1,5} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
h_{255,1} & h_{255,2} & h_{255,3} & h_{255,4} & h_{255,5}
\end{pmatrix}
$$

where h_{v,c} is the entry for byte value v (0 ≤ v ≤ 255) in column c (1 ≤ c ≤ 5).
The corresponding value is looked up in the matrix R according to the value of a1 and its column: for example, if a1 = 36 and a1 is in column 1, the value h_{36,1} is looked up. Likewise, if a2 = 48 and a2 is in column 2, h_{48,2} is looked up; if a3 = 26 and a3 is in column 3, h_{26,3} is looked up; if a4 = 26 and a4 is in column 4, h_{26,4} is looked up; and if a5 = 88 and a5 is in column 5, h_{88,5} is looked up. Then S1 = h_{36,1} + h_{48,2} + h_{26,3} + h_{26,4} + h_{88,5}. Because the matrix R follows a binomial distribution, S1 also follows a binomial distribution. When S1 is even, at least part of the data in Wi1[ki-169, ki] satisfies the predetermined condition C1; when S1 is odd, at least part of the data in Wi1[ki-169, ki] does not satisfy C1. The probability that S1 is even is 1/2, and C1 denotes that S1 computed in the above manner is even. In the embodiment shown in Figure 21, at least part of the data in Wi1[ki-169, ki] satisfies C1.
As shown in Figure 32, the marker for the window Wi2[ki-170, ki-1] denotes each byte selected when judging whether at least part of the data in Wi2[ki-170, ki-1] satisfies the predetermined condition C2. In Figure 32 these bytes carry the serial numbers 170, 128, 86, 44 and 2, and two adjacent selected bytes are 42 bytes apart. The bytes with serial numbers 170, 128, 86, 44 and 2 are each converted into a decimal value, denoted b1, b2, b3, b4 and b5 respectively. Because one byte consists of 8 bits, each byte taken as a value gives 0 ≤ br ≤ 255 for every br among b1, b2, b3, b4 and b5. The values b1, b2, b3, b4 and b5 form a 1*5 matrix. In this implementation, Wi1 and Wi2 are judged in the same way, so the matrix R is used again. The corresponding value is looked up in R according to each value and its column: if b1 = 66 and b1 is in column 1, h_{66,1} is looked up; if b2 = 48 and b2 is in column 2, h_{48,2} is looked up; if b3 = 99 and b3 is in column 3, h_{99,3} is looked up; if b4 = 26 and b4 is in column 4, h_{26,4} is looked up; and if b5 = 90 and b5 is in column 5, h_{90,5} is looked up. Then S2 = h_{66,1} + h_{48,2} + h_{99,3} + h_{26,4} + h_{90,5}. Because the matrix R follows a binomial distribution, S2 also follows a binomial distribution. When S2 is even, at least part of the data in Wi2[ki-170, ki-1] satisfies the predetermined condition C2; when S2 is odd, it does not. The probability that S2 is even is 1/2. In the embodiment shown in Figure 21, at least part of the data in Wi2[ki-170, ki-1] satisfies C2.
The same rule is used to judge Wi3[ki-171, ki-2] against C3, Wi4[ki-172, ki-3] against C4, Wi5[ki-173, ki-4] against C5, Wi6[ki-174, ki-5] against C6, Wi7[ki-175, ki-6] against C7, Wi8[ki-176, ki-7] against C8, Wi9[ki-177, ki-8] against C9, Wi10[ki-178, ki-9] against C10, and Wi11[ki-179, ki-10] against C11.
In the implementation shown in Figure 21, at least part of the data in Wi5[ki-173, ki-4] does not satisfy the predetermined condition C5. The search therefore jumps 7 bytes from the potential segmentation point ki along the search direction for data stream segmentation points, and the current potential segmentation point kj is obtained at the end position of the 7th byte, as shown in Figure 22. According to the rule preset for the deduplication server 103, the window Wj1[kj-169, kj] is determined for the potential segmentation point kj. Whether at least part of the data in Wj1[kj-169, kj] satisfies the predetermined condition C1 is judged in the same way as for Wi1[ki-169, ki]. Therefore, as shown in Figure 33, Wj1 denotes the window Wj1[kj-169, kj]. In Figure 33, the bytes "■" with serial numbers 169, 127, 85, 43 and 1 each denote one selected byte, and two adjacent selected bytes are 42 bytes apart. The bytes "■" with serial numbers 169, 127, 85, 43 and 1 are each converted into a decimal value, denoted a1', a2', a3', a4' and a5' respectively. Because one byte consists of 8 bits, each byte "■" taken as a value gives 0 ≤ ar' ≤ 255 for every ar' among a1', a2', a3', a4' and a5'. The values a1', a2', a3', a4' and a5' form a 1*5 matrix. Because Wj1[kj-169, kj] is judged against C1 in the same way as Wi1[ki-169, ki], the same matrix R is used, expressed as:

$$
R = \left(h_{v,c}\right), \qquad 0 \le v \le 255,\ 1 \le c \le 5
$$
The corresponding value is looked up in the matrix R according to the value of a1' and its column: for example, if a1' = 16 and a1' is in column 1, the value h_{16,1} is looked up. Likewise, if a2' = 98 and a2' is in column 2, h_{98,2} is looked up; if a3' = 56 and a3' is in column 3, h_{56,3} is looked up; if a4' = 36 and a4' is in column 4, h_{36,4} is looked up; and if a5' = 99 and a5' is in column 5, h_{99,5} is looked up. Then S1' = h_{16,1} + h_{98,2} + h_{56,3} + h_{36,4} + h_{99,5}. Because the matrix R follows a binomial distribution, S1' also follows a binomial distribution. When S1' is even, at least part of the data in Wj1[kj-169, kj] satisfies the predetermined condition C1; when S1' is odd, at least part of the data in Wj1[kj-169, kj] does not satisfy C1. The probability that S1' is even is 1/2.
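The table lookup described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the text only says the 256*5 entries follow a binomial distribution, so the parameters `trials=8` and `p=0.5`, the fixed seed, and all function names are assumptions made for the example. Columns are numbered 1 to 5 in the text and 0 to 4 in the code.

```python
import random


def make_table(rows=256, cols=5, trials=8, p=0.5, seed=0):
    """Build a rows x cols table of binomially distributed integers.

    Assumption: each entry is Binomial(trials, p), drawn as the number of
    successes in `trials` independent Bernoulli(p) trials.
    """
    rng = random.Random(seed)
    return [[sum(rng.random() < p for _ in range(trials)) for _ in range(cols)]
            for _ in range(rows)]


def table_sum(table, values):
    """S = h[a1, 1] + h[a2, 2] + ... + h[a5, 5] for byte values a1..a5."""
    return sum(table[a][col] for col, a in enumerate(values))


def satisfies(table, values):
    """The predetermined condition holds when the lookup sum is even."""
    return table_sum(table, values) % 2 == 0


R = make_table()
a = (36, 48, 26, 26, 88)         # example byte values at k_i from the text
a_prime = (16, 98, 56, 36, 99)   # example byte values at k_j from the text
s1 = table_sum(R, a)
s1_prime = table_sum(R, a_prime)
```

The same table `R` serves every window and every potential segmentation point, so the table is built once and each check costs five lookups and four additions.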
Whether at least part of the data in Wj2[kj-170, kj-1] satisfies the predetermined condition C2 is judged in the same way as for Wi2[ki-170, ki-1]. Therefore, as shown in Figure 33, the marker for this window denotes one byte selected when judging Wj2[kj-170, kj-1] against C2. The selected bytes carry the serial numbers 170, 128, 86, 44 and 2, and two adjacent selected bytes are 42 bytes apart. The bytes with serial numbers 170, 128, 86, 44 and 2 are each converted into a decimal value, denoted b1', b2', b3', b4' and b5' respectively. Because one byte consists of 8 bits, each byte taken as a value gives 0 ≤ br' ≤ 255 for every br' among b1', b2', b3', b4' and b5'. The values b1', b2', b3', b4' and b5' form a 1*5 matrix. The same matrix R is used as for judging Wi2[ki-170, ki-1] against C2. The corresponding value is looked up in R according to each value and its column: if b1' = 210 and b1' is in column 1, h_{210,1} is looked up; if b2' = 156 and b2' is in column 2, h_{156,2} is looked up; if b3' = 144 and b3' is in column 3, h_{144,3} is looked up; if b4' = 60 and b4' is in column 4, h_{60,4} is looked up; and if b5' = 90 and b5' is in column 5, h_{90,5} is looked up. Then S2' = h_{210,1} + h_{156,2} + h_{144,3} + h_{60,4} + h_{90,5}. The judgment condition is the same as for S2: when S2' is even, at least part of the data in Wj2[kj-170, kj-1] satisfies the predetermined condition C2; when S2' is odd, it does not. The probability that S2' is even is 1/2.
Similarly, whether at least part of the data in Wj3[kj-171, kj-2] satisfies the predetermined condition C3 is judged in the same way as for Wi3[ki-171, ki-2]. The same applies to judging Wj4[kj-172, kj-3] against C4, Wj5[kj-173, kj-4] against C5, Wj6[kj-174, kj-5] against C6, Wj7[kj-175, kj-6] against C7, Wj8[kj-176, kj-7] against C8, Wj9[kj-177, kj-8] against C9, Wj10[kj-178, kj-9] against C10, and Wj11[kj-179, kj-10] against C11; the details are not repeated here.
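The overall flow at one potential segmentation point, checking the 11 windows and jumping ahead when one fails, might be sketched as below. Several points are assumptions rather than statements from the text: `window_ok` is a hypothetical caller-supplied predicate standing in for whichever condition check is used, the jump is fixed at the 7 bytes of the worked example (the general rule for the jump length is defined elsewhere in the patent), and stopping at the first failing window is an illustrative choice.

```python
def next_candidate(k, window_ok, num_windows=11, span=169, jump=7):
    """Check windows W_1..W_11 at the potential segmentation point k.

    window_ok(lo, hi) judges the window [lo, hi] (hypothetical predicate).
    Returns (k, True) if every window passes, so k is a segmentation point;
    otherwise returns (k + jump, False), the next potential point.
    """
    for z in range(1, num_windows + 1):
        # W_z = [k - 168 - z, k - (z - 1)], e.g. W_5 = [k - 173, k - 4].
        lo = k - span - (z - 1)
        hi = k - (z - 1)
        if not window_ok(lo, hi):
            return k + jump, False
    return k, True
```

For instance, with `k = 1000` and a predicate that rejects only the fifth window, the call returns `(1007, False)`, matching the jump from ki to kj in the example.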
In this embodiment, a random function is used to judge whether at least part of the data in the window Wiz[ki-Az, ki+Bz] satisfies the predetermined condition Cz. Taking the implementation shown in Figure 21 as an example, according to the rule preset on the deduplication server 103, the window Wi1[ki-169, ki] is determined for the potential segmentation point ki, and it is judged whether at least part of the data in Wi1[ki-169, ki] satisfies the predetermined condition C1. As shown in Figure 32, Wi1 denotes the window Wi1[ki-169, ki], and 5 bytes are selected to make the judgment. In Figure 32, the bytes "■" with serial numbers 169, 127, 85, 43 and 1 each denote one selected byte, and two adjacent selected bytes are 42 bytes apart. The bytes "■" with serial numbers 169, 127, 85, 43 and 1 are each converted into a decimal value, denoted a1, a2, a3, a4 and a5 respectively. Because one byte consists of 8 bits, each byte "■" taken as a value gives 0 ≤ as ≤ 255 for every as among a1, a2, a3, a4 and a5. The values a1, a2, a3, a4 and a5 form a 1*5 matrix. 256*5 random numbers are drawn from random numbers that follow a binomial distribution to form the matrix R, and another 256*5 random numbers are drawn from random numbers that follow a binomial distribution to form the matrix G, expressed as:

$$
R = \begin{pmatrix}
h_{0,1} & h_{0,2} & h_{0,3} & h_{0,4} & h_{0,5} \\
h_{1,1} & h_{1,2} & h_{1,3} & h_{1,4} & h_{1,5} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
h_{255,1} & h_{255,2} & h_{255,3} & h_{255,4} & h_{255,5}
\end{pmatrix},
\qquad
G = \begin{pmatrix}
g_{0,1} & g_{0,2} & g_{0,3} & g_{0,4} & g_{0,5} \\
g_{1,1} & g_{1,2} & g_{1,3} & g_{1,4} & g_{1,5} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
g_{255,1} & g_{255,2} & g_{255,3} & g_{255,4} & g_{255,5}
\end{pmatrix}
$$

where h_{v,c} and g_{v,c} are the entries for byte value v (0 ≤ v ≤ 255) in column c (1 ≤ c ≤ 5).
The corresponding values are looked up according to the value of a1 and its column: for example, if a1 = 36 and a1 is in column 1, h_{36,1} is looked up in the matrix R and g_{36,1} in the matrix G. Likewise, if a2 = 48 and a2 is in column 2, h_{48,2} is looked up in R and g_{48,2} in G; if a3 = 26 and a3 is in column 3, h_{26,3} is looked up in R and g_{26,3} in G; if a4 = 26 and a4 is in column 4, h_{26,4} is looked up in R and g_{26,4} in G; and if a5 = 88 and a5 is in column 5, h_{88,5} is looked up in R and g_{88,5} in G. Then S1h = h_{36,1} + h_{48,2} + h_{26,3} + h_{26,4} + h_{88,5}; because the matrix R follows a binomial distribution, S1h also follows a binomial distribution. S1g = g_{36,1} + g_{48,2} + g_{26,3} + g_{26,4} + g_{88,5}; because the matrix G follows a binomial distribution, S1g also follows a binomial distribution. When at least one of S1h and S1g is even, at least part of the data in Wi1[ki-169, ki] satisfies the predetermined condition C1; when S1h and S1g are both odd, at least part of the data in Wi1[ki-169, ki] does not satisfy C1. C1 denotes that at least one of S1h and S1g obtained by the above method is even. Because S1h and S1g both follow a binomial distribution, the probability that S1h is even is 1/2 and the probability that S1g is even is 1/2, so the probability that at least one of them is even is 1 - 1/4 = 3/4. The probability that at least part of the data in Wi1[ki-169, ki] satisfies C1 is therefore 3/4. In the embodiment shown in Figure 21, at least part of the data in Wi1[ki-169, ki] satisfies C1.
In the implementation shown in Figure 21, the windows Wi1[ki-169, ki], Wi2[ki-170, ki-1], Wi3[ki-171, ki-2], Wi4[ki-172, ki-3], Wi5[ki-173, ki-4], Wi6[ki-174, ki-5], Wi7[ki-175, ki-6], Wi8[ki-176, ki-7], Wi9[ki-177, ki-8], Wi10[ki-178, ki-9] and Wi11[ki-179, ki-10] all have the same size, namely 169 bytes, and all are judged in the same way; see the description above of judging Wi1[ki-169, ki] against C1. Therefore, as shown in Figure 32, the marker for the window Wi2[ki-170, ki-1] denotes each byte selected when judging whether at least part of the data in Wi2[ki-170, ki-1] satisfies the predetermined condition C2. In Figure 32 these bytes carry the serial numbers 170, 128, 86, 44 and 2, and two adjacent selected bytes are 42 bytes apart. The bytes with serial numbers 170, 128, 86, 44 and 2 are each converted into a decimal value, denoted b1, b2, b3, b4 and b5 respectively. Because one byte consists of 8 bits, each byte taken as a value gives 0 ≤ bs ≤ 255 for every bs among b1, b2, b3, b4 and b5. The values b1, b2, b3, b4 and b5 form a 1*5 matrix. In this implementation every window is judged in the same way, so the same matrices R and G are used. If b1 = 66 and b1 is in column 1, h_{66,1} is looked up in R and g_{66,1} in G; if b2 = 48 and b2 is in column 2, h_{48,2} is looked up in R and g_{48,2} in G; if b3 = 99 and b3 is in column 3, h_{99,3} is looked up in R and g_{99,3} in G; if b4 = 26 and b4 is in column 4, h_{26,4} is looked up in R and g_{26,4} in G; and if b5 = 90 and b5 is in column 5, h_{90,5} is looked up in R and g_{90,5} in G. Then S2h = h_{66,1} + h_{48,2} + h_{99,3} + h_{26,4} + h_{90,5}; because the matrix R follows a binomial distribution, S2h also follows a binomial distribution. S2g = g_{66,1} + g_{48,2} + g_{99,3} + g_{26,4} + g_{90,5}; because the matrix G follows a binomial distribution, S2g also follows a binomial distribution. When at least one of S2h and S2g is even, at least part of the data in Wi2[ki-170, ki-1] satisfies the predetermined condition C2; when S2h and S2g are both odd, it does not. The probability that at least one of S2h and S2g is even is 3/4. In the embodiment shown in Figure 21, at least part of the data in Wi2[ki-170, ki-1] satisfies C2.
The same rule is used to judge Wi3[ki-171, ki-2] against C3, Wi4[ki-172, ki-3] against C4, Wi5[ki-173, ki-4] against C5, Wi6[ki-174, ki-5] against C6, Wi7[ki-175, ki-6] against C7, Wi8[ki-176, ki-7] against C8, Wi9[ki-177, ki-8] against C9, Wi10[ki-178, ki-9] against C10, and Wi11[ki-179, ki-10] against C11.
In the implementation shown in Figure 21, at least part of the data in Wi5[ki-173, ki-4] does not satisfy the predetermined condition C5. The search therefore jumps 7 bytes from the potential segmentation point ki along the search direction for data stream segmentation points, and the current potential segmentation point kj is obtained at the end position of the 7th byte, as shown in Figure 22. According to the rule preset for the deduplication server 103, the window Wj1[kj-169, kj] is determined for the potential segmentation point kj. Whether at least part of the data in Wj1[kj-169, kj] satisfies the predetermined condition C1 is judged in the same way as for Wi1[ki-169, ki]. Therefore, as shown in Figure 33, Wj1 denotes the window Wj1[kj-169, kj]. In Figure 33, the bytes "■" with serial numbers 169, 127, 85, 43 and 1 each denote one selected byte, and two adjacent selected bytes are 42 bytes apart. The bytes "■" with serial numbers 169, 127, 85, 43 and 1 are each converted into a decimal value, denoted a1', a2', a3', a4' and a5' respectively. Because one byte consists of 8 bits, each byte "■" taken as a value gives 0 ≤ as' ≤ 255 for every as' among a1', a2', a3', a4' and a5'. The values a1', a2', a3', a4' and a5' form a 1*5 matrix. The same matrices R and G are used as for judging Wi1[ki-169, ki] against C1, expressed respectively as:

$$
R = \left(h_{v,c}\right), \qquad G = \left(g_{v,c}\right), \qquad 0 \le v \le 255,\ 1 \le c \le 5
$$
According to the value of a_1' and its column, for example a_1' = 16 in column 1, the value of h_{16,1} is looked up in matrix R and the value of g_{16,1} is looked up in matrix G. According to the value of a_2' and its column, for example a_2' = 98 in column 2, the value of h_{98,2} is looked up in R and the value of g_{98,2} in G. According to the value of a_3' and its column, for example a_3' = 56 in column 3, the value of h_{56,3} is looked up in R and the value of g_{56,3} in G. According to the value of a_4' and its column, for example a_4' = 36 in column 4, the value of h_{36,4} is looked up in R and the value of g_{36,4} in G. According to the value of a_5' and its column, for example a_5' = 99 in column 5, the value of h_{99,5} is looked up in R and the value of g_{99,5} in G. S_1h' = h_{16,1} + h_{98,2} + h_{56,3} + h_{36,4} + h_{99,5}; because the entries of matrix R obey a binomial distribution, S_1h' also obeys a binomial distribution. S_1g' = g_{16,1} + g_{98,2} + g_{56,3} + g_{36,4} + g_{99,5}; because the entries of matrix G obey a binomial distribution, S_1g' also obeys a binomial distribution. When at least one of S_1h' and S_1g' is even, at least part of the data in W_j1[k_j-169, k_j] satisfies the predetermined condition C_1; when both S_1h' and S_1g' are odd, at least part of the data in W_j1[k_j-169, k_j] does not satisfy C_1. The probability that at least one of S_1h' and S_1g' is even is 3/4, which corresponds to each sum being odd with probability 1/2 independently of the other, so that both are odd with probability 1/4.
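The lookup-and-sum test described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the fixed seed, and the binomial parameters (`trials=8`, `p=0.5`) are assumptions chosen only to make the example self-contained.

```python
import random

ROWS, COLS = 256, 5  # one row per byte value 0..255, one column per selected byte


def make_matrix(rng, trials=8, p=0.5):
    """Build a 256x5 table of binomially distributed integers (assumed parameters)."""
    return [[sum(rng.random() < p for _ in range(trials)) for _ in range(COLS)]
            for _ in range(ROWS)]


def lookup_sum(matrix, selected):
    """Add the entries addressed by (byte value, column) for the five selected bytes."""
    return sum(matrix[value][col] for col, value in enumerate(selected))


def satisfies_condition(r, g, selected):
    """The condition holds when at least one of the two sums is even."""
    return lookup_sum(r, selected) % 2 == 0 or lookup_sum(g, selected) % 2 == 0


rng = random.Random(0)            # fixed seed so R and G stay the same for every window
R = make_matrix(rng)
G = make_matrix(rng)
selected = [16, 98, 56, 36, 99]   # the example values a_1' .. a_5' from the text
result = satisfies_condition(R, G, selected)
```

Because the same `R` and `G` are reused for every window, identical selected bytes always give the same verdict.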
Whether at least part of the data in W_j2[k_j-170, k_j-1] satisfies the predetermined condition C_2 is judged in the same way as for W_i2[k_i-170, k_i-1]. Therefore, as shown in FIG. 33, the bytes with sequence numbers 170, 128, 86, 44 and 2 are the bytes selected when judging whether at least part of the data in the window W_j2[k_j-170, k_j-1] satisfies C_2, and two adjacent selected bytes are 42 bytes apart. Each of the bytes with sequence numbers 170, 128, 86, 44 and 2 is converted into a decimal value, denoted b_1', b_2', b_3', b_4' and b_5' respectively. Because one byte consists of 8 bits, each byte is treated as one value, and any b_s' among b_1', b_2', b_3', b_4' and b_5' satisfies 0 ≤ b_s' ≤ 255. b_1', b_2', b_3', b_4' and b_5' form a 1*5 matrix. The same matrices R and G are used as when judging whether at least part of the data in the window W_i2[k_i-170, k_i-1] satisfies the predetermined condition C_2. According to the value of b_1' and its column, for example b_1' = 210 in column 1, the value of h_{210,1} is looked up in matrix R and the value of g_{210,1} in matrix G. According to the value of b_2' and its column, for example b_2' = 156 in column 2, the value of h_{156,2} is looked up in R and the value of g_{156,2} in G. According to the value of b_3' and its column, for example b_3' = 144 in column 3, the value of h_{144,3} is looked up in R and the value of g_{144,3} in G. According to the value of b_4' and its column, for example b_4' = 60 in column 4, the value of h_{60,4} is looked up in R and the value of g_{60,4} in G. According to the value of b_5' and its column, for example b_5' = 90 in column 5, the value of h_{90,5} is looked up in R and the value of g_{90,5} in G. S_2h' = h_{210,1} + h_{156,2} + h_{144,3} + h_{60,4} + h_{90,5}, and S_2g' = g_{210,1} + g_{156,2} + g_{144,3} + g_{60,4} + g_{90,5}. When at least one of S_2h' and S_2g' is even, at least part of the data in W_j2[k_j-170, k_j-1] satisfies the predetermined condition C_2; when both S_2h' and S_2g' are odd, at least part of the data in W_j2[k_j-170, k_j-1] does not satisfy C_2. The probability that at least one of S_2h' and S_2g' is even is 3/4.
Similarly, whether at least part of the data in W_j3[k_j-171, k_j-2] satisfies the predetermined condition C_3 is judged in the same way as for W_i3[k_i-171, k_i-2]. The same applies to judging whether at least part of the data in W_j4[k_j-172, k_j-3] satisfies C_4, in W_j5[k_j-173, k_j-4] satisfies C_5, in W_j6[k_j-174, k_j-5] satisfies C_6, in W_j7[k_j-175, k_j-6] satisfies C_7, in W_j8[k_j-176, k_j-7] satisfies C_8, in W_j9[k_j-177, k_j-8] satisfies C_9, in W_j10[k_j-178, k_j-9] satisfies C_10, and in W_j11[k_j-179, k_j-10] satisfies C_11; the details are not repeated here.
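The byte selection used for each window (five bytes, 42 bytes apart) can be sketched as below. The sketch assumes that sequence number n in FIG. 32 and FIG. 33 counts bytes backwards from the potential split point, so that byte n is `data[k - n]`; this reading is inferred from the window bounds and sequence numbers in the text and is not confirmed by the figures. The names `sample_numbers` and `selected_bytes` are hypothetical.

```python
STRIDE = 42  # distance between two adjacent selected bytes
COUNT = 5    # number of bytes selected per window


def sample_numbers(x):
    """Sequence numbers sampled for window x (1-based), highest first."""
    return [x + STRIDE * step for step in range(COUNT - 1, -1, -1)]


def selected_bytes(data, k, x):
    """Return the five sampled byte values for window x of potential split point k.

    Assumption: byte number n is data[k - n], i.e. numbering runs backwards from k.
    """
    return [data[k - n] for n in sample_numbers(x)]
```

For window 1 this yields sequence numbers 169, 127, 85, 43 and 1, and for window 2 it yields 170, 128, 86, 44 and 2, matching the numbers quoted in the description.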
In this embodiment, a random function is used to judge whether at least part of the data in the window W_iz[k_i-A_z, k_i+B_z] satisfies the predetermined condition C_z. Taking the implementation shown in FIG. 21 as an example, according to the rule preset on the deduplication server 103, the window W_i1[k_i-169, k_i] is determined for the potential split point k_i, and it is judged whether at least part of the data in W_i1[k_i-169, k_i] satisfies the predetermined condition C_1. As shown in FIG. 32, W_i1 denotes the window W_i1[k_i-169, k_i]. To judge whether at least part of its data satisfies C_1, 5 bytes are selected: the bytes "■" with sequence numbers 169, 127, 85, 43 and 1 in FIG. 32 each denote one selected byte, and two adjacent selected bytes are 42 bytes apart. The bytes "■" with sequence numbers 169, 127, 85, 43 and 1 are regarded in sequence as 40 bits, denoted a_1, a_2, a_3, a_4, ..., a_40. For any a_t among a_1, a_2, a_3, a_4, ..., a_40, V_at = -1 when a_t = 0 and V_at = 1 when a_t = 1; according to this correspondence between a_t and V_at, V_a1, V_a2, V_a3, V_a4, ..., V_a40 are generated. 40 random numbers are selected from random numbers obeying a normal distribution, denoted h_1, h_2, h_3, h_4, ..., h_40. S_a = V_a1*h_1 + V_a2*h_2 + V_a3*h_3 + V_a4*h_4 + ... + V_a40*h_40. Because h_1, h_2, h_3, h_4, ..., h_40 obey a normal distribution, S_a also obeys a normal distribution. When S_a is positive, at least part of the data in W_i1[k_i-169, k_i] satisfies the predetermined condition C_1; when S_a is negative or 0, at least part of the data in W_i1[k_i-169, k_i] does not satisfy C_1. The probability that S_a is positive is 1/2. In the embodiment shown in FIG. 21, at least part of the data in W_i1[k_i-169, k_i] satisfies the predetermined condition C_1.

As shown in FIG. 32, the bytes with sequence numbers 170, 128, 86, 44 and 2 are the bytes selected when judging whether at least part of the data in the window W_i2[k_i-170, k_i-1] satisfies the predetermined condition C_2, and two adjacent selected bytes are 42 bytes apart. The bytes with sequence numbers 170, 128, 86, 44 and 2 are regarded in sequence as 40 bits, denoted b_1, b_2, b_3, b_4, ..., b_40. For any b_t among b_1, b_2, b_3, b_4, ..., b_40, V_bt = -1 when b_t = 0 and V_bt = 1 when b_t = 1; according to this correspondence between b_t and V_bt, V_b1, V_b2, V_b3, V_b4, ..., V_b40 are generated. Whether at least part of the data in W_i2[k_i-170, k_i-1] satisfies C_2 is judged in the same way as for W_i1[k_i-169, k_i] and C_1, so the same random numbers h_1, h_2, h_3, h_4, ..., h_40 are used: S_b = V_b1*h_1 + V_b2*h_2 + V_b3*h_3 + V_b4*h_4 + ... + V_b40*h_40. Because h_1, h_2, h_3, h_4, ..., h_40 obey a normal distribution, S_b also obeys a normal distribution. When S_b is positive, at least part of the data in W_i2[k_i-170, k_i-1] satisfies the predetermined condition C_2; when S_b is negative or 0, at least part of the data in W_i2[k_i-170, k_i-1] does not satisfy C_2. The probability that S_b is positive is 1/2. In the embodiment shown in FIG. 21, at least part of the data in W_i2[k_i-170, k_i-1] satisfies the predetermined condition C_2. Using the same rule, it is judged in turn whether at least part of the data in W_i3[k_i-171, k_i-2] satisfies the predetermined condition C_3, in W_i4[k_i-172, k_i-3] satisfies C_4, in W_i5[k_i-173, k_i-4] satisfies C_5, in W_i6[k_i-174, k_i-5] satisfies C_6, in W_i7[k_i-175, k_i-6] satisfies C_7, in W_i8[k_i-176, k_i-7] satisfies C_8, in W_i9[k_i-177, k_i-8] satisfies C_9, in W_i10[k_i-178, k_i-9] satisfies C_10, and in W_i11[k_i-179, k_i-10] satisfies C_11.

In the implementation shown in FIG. 21, at least part of the data in W_i5[k_i-173, k_i-4] does not satisfy the predetermined condition C_5, so 7 bytes are skipped from the potential split point k_i along the search direction for data stream split points, and the current potential split point k_j is obtained at the end position of the seventh byte. As shown in FIG. 22, according to the rule preset on the deduplication server 103, the window W_j1[k_j-169, k_j] is determined for the potential split point k_j. Whether at least part of the data in W_j1[k_j-169, k_j] satisfies the predetermined condition C_1 is judged in the same way as for W_i1[k_i-169, k_i]. Therefore, as shown in FIG. 33, W_j1 denotes the window W_j1[k_j-169, k_j]; to judge whether at least part of its data satisfies C_1, 5 bytes are selected: the bytes "■" with sequence numbers 169, 127, 85, 43 and 1 in FIG. 33 each denote one selected byte, and two adjacent selected bytes are 42 bytes apart. The bytes "■" with sequence numbers 169, 127, 85, 43 and 1 are regarded in sequence as 40 bits, denoted a_1', a_2', a_3', a_4', ..., a_40'. For any a_t' among a_1', a_2', a_3', a_4', ..., a_40', V_at' = -1 when a_t' = 0 and V_at' = 1 when a_t' = 1; according to this correspondence between a_t' and V_at', V_a1', V_a2', V_a3', V_a4', ..., V_a40' are generated. Because W_j1[k_j-169, k_j] is judged in the same way as W_i1[k_i-169, k_i], the same random numbers h_1, h_2, h_3, h_4, ..., h_40 are used. S_a' = V_a1'*h_1 + V_a2'*h_2 + V_a3'*h_3 + V_a4'*h_4 + ... + V_a40'*h_40. Because h_1, h_2, h_3, h_4, ..., h_40 obey a normal distribution, S_a' also obeys a normal distribution. When S_a' is positive, at least part of the data in W_j1[k_j-169, k_j] satisfies the predetermined condition C_1; when S_a' is negative or 0, at least part of the data in W_j1[k_j-169, k_j] does not satisfy C_1. The probability that S_a' is positive is 1/2.
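The sign-weighted sum used in this embodiment can be sketched as follows. This is an illustrative sketch under stated assumptions: the bit order within each byte (most significant bit first), the standard normal parameters, the seed, and all function names are not given in the source and are chosen here only for the example.

```python
import random


def bytes_to_signs(selected):
    """Expand five bytes into 40 bits (MSB first, an assumed order); map 0 -> -1, 1 -> +1."""
    signs = []
    for value in selected:
        for shift in range(7, -1, -1):
            bit = (value >> shift) & 1
            signs.append(1 if bit else -1)
    return signs


def projection(signs, weights):
    """Compute S = V_1*h_1 + V_2*h_2 + ... + V_40*h_40."""
    return sum(v * h for v, h in zip(signs, weights))


def satisfies_condition(selected, weights):
    """The condition holds only when the sum is strictly positive."""
    return projection(bytes_to_signs(selected), weights) > 0


rng = random.Random(0)                        # fixed seed: the same h_1..h_40 for every window
H = [rng.gauss(0.0, 1.0) for _ in range(40)]  # 40 normally distributed weights
```

The weights `H` are drawn once and reused for every window of every potential split point, which is what makes the verdict depend only on the selected bytes.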
Whether at least part of the data in W_j2[k_j-170, k_j-1] satisfies the predetermined condition C_2 is judged in the same way as for W_i2[k_i-170, k_i-1]. Therefore, as shown in FIG. 33, the bytes with sequence numbers 170, 128, 86, 44 and 2 are the bytes selected when judging whether at least part of the data in the window W_j2[k_j-170, k_j-1] satisfies C_2, and two adjacent selected bytes are 42 bytes apart. The bytes with sequence numbers 170, 128, 86, 44 and 2 are regarded in sequence as 40 bits, denoted b_1', b_2', b_3', b_4', ..., b_40'. For any b_t' among b_1', b_2', b_3', b_4', ..., b_40', V_bt' = -1 when b_t' = 0 and V_bt' = 1 when b_t' = 1; according to this correspondence between b_t' and V_bt', V_b1', V_b2', V_b3', V_b4', ..., V_b40' are generated. Because W_j2[k_j-170, k_j-1] is judged in the same way as W_i2[k_i-170, k_i-1], the same random numbers h_1, h_2, h_3, h_4, ..., h_40 are used: S_b' = V_b1'*h_1 + V_b2'*h_2 + V_b3'*h_3 + V_b4'*h_4 + ... + V_b40'*h_40. Because h_1, h_2, h_3, h_4, ..., h_40 obey a normal distribution, S_b' also obeys a normal distribution. When S_b' is positive, at least part of the data in W_j2[k_j-170, k_j-1] satisfies the predetermined condition C_2; when S_b' is negative or 0, at least part of the data in W_j2[k_j-170, k_j-1] does not satisfy C_2. The probability that S_b' is positive is 1/2.
Similarly, whether at least part of the data in W_j3[k_j-171, k_j-2] satisfies the predetermined condition C_3 is judged in the same way as for W_i3[k_i-171, k_i-2]. The same applies to judging whether at least part of the data in W_j4[k_j-172, k_j-3] satisfies C_4, in W_j5[k_j-173, k_j-4] satisfies C_5, in W_j6[k_j-174, k_j-5] satisfies C_6, in W_j7[k_j-175, k_j-6] satisfies C_7, in W_j8[k_j-176, k_j-7] satisfies C_8, in W_j9[k_j-177, k_j-8] satisfies C_9, in W_j10[k_j-178, k_j-9] satisfies C_10, and in W_j11[k_j-179, k_j-10] satisfies C_11; the details are not repeated here.
In this embodiment, a random function is used to judge whether at least part of the data in the window W_iz[k_i-A_z, k_i+B_z] satisfies the predetermined condition C_z. Still taking the implementation shown in FIG. 21 as an example, according to the rule preset on the deduplication server 103, the window W_i1[k_i-169, k_i] is determined for the potential split point k_i, and it is judged whether at least part of the data in W_i1[k_i-169, k_i] satisfies the predetermined condition C_1. As shown in FIG. 32, W_i1 denotes the window W_i1[k_i-169, k_i]. To judge whether at least part of its data satisfies C_1, 5 bytes are selected: the bytes "■" with sequence numbers 169, 127, 85, 43 and 1 in FIG. 32 each denote one selected byte, and two adjacent selected bytes are 42 bytes apart. The bytes "■" with sequence numbers 169, 127, 85, 43 and 1 are converted into one decimal number in the range 0 to 2^40-1. A uniformly distributed random number generator is used to generate one designated value for each decimal number in the range 0 to 2^40-1, and the correspondence R between each decimal number in that range and its designated value is recorded. Once designated, the value corresponding to a decimal number does not change, and the designated values obey a uniform distribution. If the designated value is even, at least part of the data in W_i1[k_i-169, k_i] satisfies the predetermined condition C_1; if the designated value is odd, at least part of the data in W_i1[k_i-169, k_i] does not satisfy C_1. That is, C_1 means that the designated value obtained by the above method is even. Because the probability that a uniformly distributed random number is even is 1/2, the probability that at least part of the data in W_i1[k_i-169, k_i] satisfies the predetermined condition C_1 is 1/2. In the implementation shown in FIG. 21, the same rule is used to judge whether at least part of the data in W_i2[k_i-170, k_i-1] satisfies the predetermined condition C_2, in W_i3[k_i-171, k_i-2] satisfies C_3, in W_i4[k_i-172, k_i-3] satisfies C_4, and in W_i5[k_i-173, k_i-4] satisfies C_5; the details are not repeated here.
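A table with one entry for every number in 0 to 2^40-1 would be very large, so the sketch below fills the correspondence lazily: a key receives its designated value the first time it is looked up and keeps it afterwards. The lazy filling, the class name `FixedAssignment`, the 32-bit width of the designated values, and the big-endian byte order are all assumptions made for illustration; the source only requires that the values be uniformly distributed and never change once designated.

```python
import random


class FixedAssignment:
    """Lazily assigns each 40-bit key a uniformly random value that never changes afterwards."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._table = {}

    def value_for(self, key):
        # Designate a value on first use; later lookups return the recorded value.
        if key not in self._table:
            self._table[key] = self._rng.getrandbits(32)
        return self._table[key]


def bytes_to_number(selected):
    """Combine the five selected bytes into one integer in 0 .. 2**40 - 1 (big-endian, assumed)."""
    return int.from_bytes(bytes(selected), "big")


def satisfies_condition(mapping, selected):
    """The condition holds when the designated value for the selected bytes is even."""
    return mapping.value_for(bytes_to_number(selected)) % 2 == 0
```

The same `FixedAssignment` instance has to be shared by all windows and all potential split points so that equal byte selections always receive the same verdict.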
When at least part of the data in W_i5[k_i-173, k_i-4] does not satisfy the predetermined condition C_5, 7 bytes are skipped from the potential split point k_i along the search direction for data stream split points, and the current potential split point k_j is obtained at the end position of the seventh byte. As shown in FIG. 22, according to the rule preset on the deduplication server 103, the window W_j1[k_j-169, k_j] is determined for the potential split point k_j. Whether at least part of the data in W_j1[k_j-169, k_j] satisfies the predetermined condition C_1 is judged in the same way as for W_i1[k_i-169, k_i], so the same correspondence R between each decimal number in the range 0 to 2^40-1 and its designated value is used. As shown in FIG. 33, W_j1 denotes the window; to judge whether at least part of the data in W_j1[k_j-169, k_j] satisfies C_1, 5 bytes are selected: "■" in FIG. 33 denotes one selected byte, and two adjacent selected bytes "■" are 42 bytes apart. The bytes "■" with sequence numbers 169, 127, 85, 43 and 1 are converted into one decimal number, and the designated value corresponding to that decimal number is looked up in R. If the designated value is even, at least part of the data in W_j1[k_j-169, k_j] satisfies the predetermined condition C_1; if the designated value is odd, at least part of the data in W_j1[k_j-169, k_j] does not satisfy C_1. Because the probability that a uniformly distributed random number is even is 1/2, the probability that at least part of the data in W_j1[k_j-169, k_j] satisfies the predetermined condition C_1 is 1/2. Similarly, whether at least part of the data in W_j2[k_j-170, k_j-1] satisfies the predetermined condition C_2 is judged in the same way as for W_i2[k_i-170, k_i-1], and whether at least part of the data in W_j3[k_j-171, k_j-2] satisfies the predetermined condition C_3 is judged in the same way as for W_i3[k_i-171, k_i-2]. The same applies to judging whether at least part of the data in W_j4[k_j-172, k_j-3] satisfies C_4, in W_j5[k_j-173, k_j-4] satisfies C_5, in W_j6[k_j-174, k_j-5] satisfies C_6, in W_j7[k_j-175, k_j-6] satisfies C_7, in W_j8[k_j-176, k_j-7] satisfies C_8, in W_j9[k_j-177, k_j-8] satisfies C_9, in W_j10[k_j-178, k_j-9] satisfies C_10, and in W_j11[k_j-179, k_j-10] satisfies C_11; the details are not repeated here.
The deduplication server 103 in the embodiment of the present invention shown in FIG. 1 is an apparatus capable of implementing the technical solutions described in the embodiments of the present invention. As shown in FIG. 18, it generally includes a central processing unit, a main memory and an input/output interface, which communicate with one another. The main memory stores executable instructions, and the central processing unit executes the executable instructions stored in the main memory to perform specific functions, so that the deduplication server 103 has those functions, such as searching for data stream split points as described with reference to FIG. 20 to FIG. 33 of the embodiments of the present invention. Accordingly, as shown in FIG. 19, in the embodiments of the present invention shown in FIG. 20 to FIG. 33, a rule is preset on the deduplication server 103. The rule is: for a potential split point k, determine M windows W_x[k-A_x, k+B_x] and the predetermined condition C_x corresponding to each window W_x[k-A_x, k+B_x], where x takes the consecutive natural numbers from 1 to M, M ≥ 2, and A_x and B_x are integers.
The deduplication server 103 includes a determining unit 1901 and a judging processing unit 1902. The determining unit 1901 is configured to perform step a):
a) determining, according to the rule, a corresponding window W_iz[k_i-A_z, k_i+B_z] for the current potential split point k_i, where i and z are integers and 1 ≤ z ≤ M;
The judging processing unit 1902 is configured to judge whether at least part of the data in the window W_iz[k_i-A_z, k_i+B_z] satisfies the predetermined condition C_z;
when at least part of the data in the window W_iz[k_i-A_z, k_i+B_z] does not satisfy the predetermined condition C_z, N minimum search units U for data stream split points are skipped from the current potential split point k_i along the search direction for data stream split points, where N*U is not greater than ‖B_z‖ + max_x(‖A_x‖), to obtain a new potential split point, and the determining unit 1901 performs step a) for the new potential split point;
when at least part of the data in each window W_ix[k_i-A_x, k_i+B_x] of the M windows of the current potential split point k_i satisfies the predetermined condition C_x, the current potential split point k_i is a data stream split point.
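The interaction of the two units can be sketched as a single search loop. This is a simplified illustration, not the claimed implementation: `skip` stands for the jump length N*U and is taken as a fixed parameter here, the `conditions` are arbitrary caller-supplied predicates, and a window W_x[k-A_x, k+B_x] is read as the slice `data[k - A_x : k + B_x]`, which is an assumed boundary convention. All windows are assumed to have positive length.

```python
def find_split_point(data, start, windows, conditions, skip):
    """Return the first potential split point at which every window passes, or None.

    windows    : list of (A_x, B_x) pairs, one per window
    conditions : list of predicates, conditions[x] judges the data of window x
    skip       : jump length N*U applied when any window fails (assumed fixed)
    """
    max_a = max(a for a, _ in windows)
    max_b = max(b for _, b in windows)
    k = max(start, max_a)  # every window must lie inside the data
    while k + max(max_b, 0) <= len(data):
        for (a, b), condition in zip(windows, conditions):
            if not condition(data[k - a:k + b]):
                break          # one failing window is enough to abandon k
        else:
            return k           # all M windows satisfied their conditions
        k += skip              # jump instead of advancing one unit at a time
    return None


# Toy usage: two windows, each required to contain only zero bytes.
toy_data = b"\x01" * 10 + b"\x00" * 10
toy_windows = [(2, 0), (3, -1)]
toy_conditions = [lambda w: not any(w)] * 2
toy_result = find_split_point(toy_data, 0, toy_windows, toy_conditions, 2)
```

Evaluation stops at the first failing window, so the remaining windows of that potential split point are never examined; that early exit combined with the jump is where the method saves work.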
Further, the rule also includes: at least two windows W_ie[k_i-A_e, k_i+B_e] and W_if[k_i-A_f, k_i+B_f] satisfy |A_e+B_e| = |A_f+B_f| and C_e = C_f. Further, the rule also includes: A_e and A_f are positive integers. Further, the rule also includes: A_e - 1 = A_f and B_e + 1 = B_f.
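The constraints on a pair of windows can be checked mechanically. The sketch below is illustrative only: `same_size` and `shifted_by_one` are hypothetical helper names, a window is represented just by its offsets (A, B), and the equality C_e = C_f of the conditions is not modelled.

```python
from typing import Tuple

Offsets = Tuple[int, int]  # (A, B) for a window [k - A, k + B]


def same_size(e: Offsets, f: Offsets) -> bool:
    """|A_e + B_e| = |A_f + B_f|: both windows span the same length."""
    return abs(e[0] + e[1]) == abs(f[0] + f[1])


def shifted_by_one(e: Offsets, f: Offsets) -> bool:
    """A_e, A_f positive, A_e - 1 = A_f and B_e + 1 = B_f."""
    return e[0] > 0 and f[0] > 0 and e[0] - 1 == f[0] and e[1] + 1 == f[1]
```

For example, the offsets (170, -1) and (169, 0), which correspond to windows [k-170, k-1] and [k-169, k], pass both checks: window f is window e moved one unit along the search direction.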
Further, the judging and processing unit 1902 is specifically configured to use a random function to judge whether at least part of the data in the window W_iz[k_i-A_z, k_i+B_z] satisfies the predetermined condition C_z. Still further, the judging and processing unit 1902 specifically uses a hash function to judge whether at least part of the data in the window W_iz[k_i-A_z, k_i+B_z] satisfies the predetermined condition C_z.
Further, the judging and processing unit 1902 is configured to: when at least part of the data in the window W_iz[k_i-A_z, k_i+B_z] does not satisfy the predetermined condition C_z, jump N minimum search units U of the data stream split point from the current potential split point k_i along the search direction to obtain the new potential split point, for which the determining unit 1901 performs step a). According to the rule, the left boundary of the window W_ic[k_i-A_c, k_i+B_c] determined for the new potential split point either coincides with the right boundary of the window W_iz[k_i-A_z, k_i+B_z] or lies within the range of the window W_iz[k_i-A_z, k_i+B_z]. Here, the window W_ic[k_i-A_c, k_i+B_c] determined for the new potential split point is the first window in the sequence obtained by ordering, along the data stream search direction, the M windows determined for the new potential split point according to the rule.
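The two placements of the new first window translate into two jump lengths. The following is a sketch under stated assumptions: the search runs from left to right, the first window in search order is taken to be the one with the largest A_x (the smallest left boundary), U = 1, and the function names are hypothetical. The window offsets used in the note after the code are those of the FIG. 21 embodiment listed elsewhere in this description.

```python
from typing import List, Tuple

Offsets = Tuple[int, int]  # (A_x, B_x) for window [k - A_x, k + B_x]


def first_window_a(rule: List[Offsets]) -> int:
    """A_c of the window assumed to come first in a left-to-right search."""
    return max(a for a, _ in rule)


def jump_bound(rule: List[Offsets], z: int) -> int:
    """Upper bound on N*U: |B_z| + max_x |A_x|."""
    return abs(rule[z][1]) + max(abs(a) for a, _ in rule)


def jump_coinciding(rule: List[Offsets], z: int) -> int:
    """Jump that puts the new first window's left boundary exactly on the
    right boundary of the failing window z: k_new - A_c = k + B_z."""
    return rule[z][1] + first_window_a(rule)


def jump_inside(rule: List[Offsets], z: int) -> int:
    """Jump that puts the new first window's left boundary one unit past
    the failing window's left boundary: k_new - A_c = k - A_z + 1."""
    return first_window_a(rule) - rule[z][0] + 1
```

With the eleven windows W_i1[k_i-169, k_i] to W_i11[k_i-179, k_i-10] and a failure in the fifth window [k_i-173, k_i-4], `jump_coinciding` gives 175, `jump_inside` gives 7, and both stay below the bound of 183.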
Further, the judging and processing unit 1902 uses a random function to judge whether at least part of the data in the window W_iz[k_i-A_z, k_i+B_z] satisfies the predetermined condition C_z, which specifically includes:
Select F bytes in the window W_iz[k_i-A_z, k_i+B_z] and reuse the F bytes H times to obtain F*H bytes in total. Each byte consists of 8 bits, denoted a_{m,1}, ..., a_{m,8}, which represent the 1st to 8th bits of the m-th byte of the F*H bytes. The bits of the F*H bytes can be expressed as:

$$\begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,8} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,8} \\ \vdots & \vdots & & \vdots \\ a_{F*H,1} & a_{F*H,2} & \cdots & a_{F*H,8} \end{pmatrix}$$

When a_{m,n} = 1, V_{am,n} = 1; when a_{m,n} = 0, V_{am,n} = -1, where a_{m,n} denotes any one of a_{m,1}, ..., a_{m,8}. Converting the bits of the F*H bytes according to this relationship between a_{m,n} and V_{am,n} yields the matrix V_a:

$$V_a = \begin{pmatrix} V_{a1,1} & V_{a1,2} & \cdots & V_{a1,8} \\ V_{a2,1} & V_{a2,2} & \cdots & V_{a2,8} \\ \vdots & \vdots & & \vdots \\ V_{aF*H,1} & V_{aF*H,2} & \cdots & V_{aF*H,8} \end{pmatrix}$$

Select F*H*8 random numbers from random numbers that follow a normal distribution to form the matrix R:

$$R = \begin{pmatrix} h_{1,1} & h_{1,2} & \cdots & h_{1,8} \\ h_{2,1} & h_{2,2} & \cdots & h_{2,8} \\ \vdots & \vdots & & \vdots \\ h_{F*H,1} & h_{F*H,2} & \cdots & h_{F*H,8} \end{pmatrix}$$

Multiply the entries of the m-th row of the matrix V_a by the corresponding random numbers in the m-th row of the matrix R and sum the products to obtain one value, namely S_{am} = V_{am,1}*h_{m,1} + V_{am,2}*h_{m,2} + ... + V_{am,8}*h_{m,8}. In the same way, obtain S_{a1}, S_{a2}, ..., S_{aF*H}, and count the number K of values among S_{a1}, S_{a2}, ..., S_{aF*H} that are greater than 0. When K is even, at least part of the data in the window W_iz[k_i-A_z, k_i+B_z] satisfies the predetermined condition C_z.
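The judgment just described can be sketched in a few lines. This is a minimal illustration under assumptions that the text does not fix: bit 1 of a byte is taken to be its most significant bit, the matrix R is generated once from a seeded generator and then reused for every evaluation of the same condition, and the names `make_matrix` and `satisfies_condition` are hypothetical.

```python
import random
from typing import List, Sequence


def make_matrix(rows: int, seed: int) -> List[List[float]]:
    """Matrix R: rows x 8 numbers drawn from a normal distribution.
    Seeding (an assumption) makes the same R available on every call."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(rows)]


def satisfies_condition(selected: bytes, repeats: int,
                        matrix: Sequence[Sequence[float]]) -> bool:
    """Apply the even-K test to F selected bytes reused H = repeats times."""
    expanded = selected * repeats  # the F*H bytes
    if len(matrix) != len(expanded):
        raise ValueError("matrix R needs one row per expanded byte")
    k = 0
    for byte, row in zip(expanded, matrix):
        # V = +1 for a set bit, -1 for a clear bit (bit 1 = MSB, assumed).
        s = sum((1.0 if (byte >> (7 - n)) & 1 else -1.0) * row[n]
                for n in range(8))
        if s > 0:
            k += 1  # count the S values greater than 0
    return k % 2 == 0  # condition holds when K is even
```

Reusing one fixed R matters because the same window content has to give the same answer every time it is examined, otherwise identical data could be split differently on different runs.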
In the server-based method for searching for a data stream split point provided by the embodiments shown in FIG. 20 to FIG. 33, windows W_ix[k_i-A_x, k_i+B_x] are determined for a potential split point k_i, where x takes the consecutive natural numbers from 1 to M and M ≥ 2. Whether at least part of the data in each of the M windows satisfies the predetermined condition C_x may be judged in parallel, or the windows may be judged one after another. In the sequential case, the window W_i1[k_i-A_1, k_i+B_1] is judged first; once at least part of its data satisfies the predetermined condition C_1, the window W_i2[k_i-A_2, k_i+B_2] is judged against the predetermined condition C_2, and so on until the window W_iM[k_i-A_M, k_i+B_M] is judged against the predetermined condition C_M. The judgment of the other windows in the embodiments is the same and is not repeated here.
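The parallel and sequential judgments give the same answer and differ only in how much work is done after a failure. A minimal sketch, with hypothetical function names and thread-based parallelism chosen purely for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence

Cond = Callable[[bytes], bool]


def first_failure_sequential(windows: Sequence[bytes],
                             conds: Sequence[Cond]) -> Optional[int]:
    """Judge W_1, W_2, ... in order and stop at the first failing window."""
    for z, (window, cond) in enumerate(zip(windows, conds)):
        if not cond(window):
            return z
    return None  # every window satisfied its condition


def first_failure_parallel(windows: Sequence[bytes],
                           conds: Sequence[Cond]) -> Optional[int]:
    """Judge all M windows concurrently, then report the first failure."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda wc: wc[1](wc[0]),
                                zip(windows, conds)))
    for z, ok in enumerate(results):
        if not ok:
            return z
    return None
```

Both functions return the index of the failing window because that index determines the jump to the next potential split point.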
In addition, according to the embodiments shown in FIG. 20 to FIG. 33, a rule is preset on the deduplication server 103: for a potential split point k, determine M windows W_x[k-A_x, k+B_x] and the predetermined condition C_x corresponding to each window W_x[k-A_x, k+B_x], where x takes the consecutive natural numbers from 1 to M and M ≥ 2. In this preset rule, A_1, A_2, A_3, ..., A_M need not all be equal, B_1, B_2, B_3, ..., B_M need not all be equal, and C_1, C_2, C_3, ..., C_M need not all be the same. In the implementation shown in FIG. 21, the windows W_i1[k_i-169, k_i], W_i2[k_i-170, k_i-1], W_i3[k_i-171, k_i-2], W_i4[k_i-172, k_i-3], W_i5[k_i-173, k_i-4], W_i6[k_i-174, k_i-5], W_i7[k_i-175, k_i-6], W_i8[k_i-176, k_i-7], W_i9[k_i-177, k_i-8], W_i10[k_i-178, k_i-9], and W_i11[k_i-179, k_i-10] all have the same size, namely 169 bytes, and the way of judging whether at least part of the data in a window satisfies the predetermined condition is also the same for every window; see the above description of judging whether at least part of the data in W_i1[k_i-169, k_i] satisfies the predetermined condition C_1. In the implementation shown in FIG. 11, however, the windows W_i1[k_i-169, k_i], W_i2[k_i-170, k_i-1], W_i3[k_i-171, k_i-2], W_i4[k_i-172, k_i-3], W_i5[k_i-173, k_i-4], W_i6[k_i-174, k_i-5], W_i7[k_i-175, k_i-6], W_i8[k_i-176, k_i-7], W_i9[k_i-177, k_i-8], W_i10[k_i-168, k_i+1], and W_i11[k_i-179, k_i+3] may differ in size, and the ways of judging whether at least part of the data in a window satisfies the predetermined condition may also differ. In all embodiments, according to the rule preset for the deduplication server 103, the way of judging whether at least part of the data in the window W_i1 satisfies the predetermined condition C_1 is necessarily the same as the way of judging whether at least part of the data in the window W_j1 satisfies the predetermined condition C_1; the same holds for W_i2 and W_j2 with the predetermined condition C_2, and so on up to W_iM and W_jM with the predetermined condition C_M. Details are not repeated here.
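The two window lists can be compared numerically. The sketch below is only illustrative: the dictionaries `FIG21` and `FIG11` and the helper names are hypothetical, and the window size is measured as |A_x + B_x|, the quantity the rule itself compares, which gives the 169 bytes stated for the first list.

```python
from typing import Dict, Tuple

Offsets = Tuple[int, int]  # (A_x, B_x) for window [k - A_x, k + B_x]

# Windows W_i1[k-169, k] ... W_i11[k-179, k-10] of the first list.
FIG21: Dict[int, Offsets] = {x: (168 + x, 1 - x) for x in range(1, 12)}

# Second list: identical except for the last two windows.
FIG11: Dict[int, Offsets] = dict(FIG21)
FIG11[10] = (168, 1)   # W_i10[k-168, k+1]
FIG11[11] = (179, 3)   # W_i11[k-179, k+3]


def window_size(a: int, b: int) -> int:
    """Size measured as |A + B|."""
    return abs(a + b)


def windows_at(k: int, rule: Dict[int, Offsets]) -> Dict[int, Tuple[int, int]]:
    """Absolute (left, right) boundaries of every window for split point k."""
    return {x: (k - a, k + b) for x, (a, b) in rule.items()}
```

Because a rule stores offsets relative to k, calling `windows_at` for k_i and for k_j applies exactly the same window layout to both potential split points, which is the consistency the paragraph above requires.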
According to the embodiments shown in FIG. 20 to FIG. 33, a rule is preset on the deduplication server 103. k_a, k_i, k_j, k_l, and k_m are potential split points obtained while searching for a split point along the search direction of the data stream split point, and all of them follow this rule. The window W_x[k-A_x, k+B_x] in the embodiments of the present invention denotes a specific range from which data is selected in order to judge whether that data satisfies the predetermined condition C_x. Specifically, either part of the data or all of the data in the range may be selected for this judgment. For the window concept used in the embodiments of the present invention, refer to the window W_x[k-A_x, k+B_x]; details are not repeated here.
In the window W_x[k-A_x, k+B_x], (k-A_x) and (k+B_x) denote the two boundaries of the window: (k-A_x) is the boundary that lies, relative to the potential split point k, in the direction opposite to the search direction of the data stream split point, and (k+B_x) is the boundary that lies, relative to the potential split point k, in the search direction. Specifically, in the embodiments shown in FIG. 20 to FIG. 33 the search direction is from left to right, so (k-A_x) is the left boundary of the window and (k+B_x) is the right boundary. If the search direction in FIG. 20 to FIG. 33 were from right to left, (k-A_x) would be the right boundary of the window and (k+B_x) would be the left boundary.
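The effect of the search direction on the boundaries can be made concrete. The sketch below rests on an assumption that the text does not spell out: positions are absolute stream offsets that grow to the right, so in a right-to-left search the boundary written (k-A_x) sits at offset k + A_x. The function name `window_bounds` is hypothetical.

```python
from typing import Tuple


def window_bounds(k: int, a: int, b: int,
                  left_to_right: bool = True) -> Tuple[int, int]:
    """Return (left, right) absolute offsets of W_x[k - A_x, k + B_x].

    The boundary written (k - A_x) always lies against the search
    direction; (k + B_x) always lies along it.
    """
    if left_to_right:
        return (k - a, k + b)  # (k - A_x) is the left boundary
    return (k - b, k + a)      # (k - A_x) becomes the right boundary
```

For k = 1000, A_x = 169, and B_x = 0, a left-to-right search yields the window (831, 1000), while a right-to-left search under this convention yields (1000, 1169).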
A person of ordinary skill in the art may be aware that the key features of the embodiments of the present invention, together with the units and algorithm steps of the examples described with reference to FIG. 20 to FIG. 33, may be combined with other techniques and presented in a more complex form while still containing the key features of the present invention. A backup split point may be used in a real environment. For example, in one implementation, according to the rule preset for the deduplication server 103, 11 windows W_x[k-A_x, k+B_x] and the predetermined conditions C_x corresponding to the windows are determined for a potential split point k_i, where x takes the consecutive natural numbers from 1 to 11; when at least part of the data in each of the 11 windows W_x[k-A_x, k+B_x] satisfies the predetermined condition C_x, the potential split point k_i is a data stream split point. If no split point has been found when the set maximum data block size is exceeded, a backup preset rule may be used. The backup preset rule is similar to the rule preset on the deduplication server 103. For example, 10 windows W_x[k-A_x, k+B_x] and the predetermined conditions C_x corresponding to the windows are determined for a potential split point k_i, where x takes the consecutive natural numbers from 1 to 10; when at least part of the data in each of the 10 windows W_x[k-A_x, k+B_x] satisfies the predetermined condition C_x, the potential split point k_i is a data stream split point. If no data stream split point has been found when the set maximum data block size is exceeded, the end position of the maximum data block is used as a forced split point.
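The fallback order (preset rule, then backup rule, then forced split point) can be sketched as below. This is an illustration under assumptions: the two searches are passed in as hypothetical `Finder` callables that return a split point at or before a limit, or None, and the sketch does not prescribe how the backup rule is evaluated in practice.

```python
from typing import Callable, Optional

# (data, start, limit) -> split point in (start, limit], or None.
# Hypothetical interface for a rule-based search.
Finder = Callable[[bytes, int, int], Optional[int]]


def next_split(data: bytes, start: int, max_chunk: int,
               primary: Finder, backup: Finder) -> int:
    """Pick the end of the data block that begins at `start`."""
    limit = min(start + max_chunk, len(data))
    for finder in (primary, backup):  # preset rule first, then backup rule
        cut = finder(data, start, limit)
        if cut is not None and start < cut <= limit:
            return cut
    return limit  # forced split point at the end of the maximum data block
```

A backup rule with fewer windows is easier to satisfy than the preset rule, so it offers a content-defined boundary in cases where the stricter rule finds none; the forced split point is used only when both fail.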
According to the embodiments shown in FIG. 20 to FIG. 33, a rule is preset on the deduplication server 103. Determining M windows for a potential split point k in the rule does not necessarily require that the potential split point k exist first; the potential split point k may instead be derived from the M determined windows.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or by software depends on the particular application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such an implementation should not be considered to go beyond the scope of the present invention.
A person skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the system, apparatus, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided, it should be understood that the disclosed system and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable non-volatile storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a non-volatile storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The non-volatile storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc.
The foregoing is merely a specific implementation of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (16)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2014072115 | 2014-02-14 | ||
CNPCT/CN2014/072115 | 2014-02-14 | ||
CN201480000347.4A CN104169917B (en) | 2014-02-14 | 2014-02-27 | A method and server for searching data flow split point based on server |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201480000347.4A Division CN104169917B (en) | 2014-02-14 | 2014-02-27 | A method and server for searching data flow split point based on server |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106095971A (en) | 2016-11-09 |
CN106095971B (en) | 2019-08-13 |
Family
ID=57236991
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610439783.2A Active CN106095971B (en) | 2014-02-14 | 2014-02-27 | A kind of method and server for searching data flow cut-point based on server |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106095971B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1997011A (en) * | 2006-07-26 | 2007-07-11 | 白杰 | Data partition method and data partition device |
CN101547138A (en) * | 2008-03-26 | 2009-09-30 | 国际商业机器公司 | Method and device for quick pattern matching |
CN102082575A (en) * | 2010-12-14 | 2011-06-01 | 江苏格物信息科技有限公司 | Method for removing repeated data based on pre-blocking and sliding window |
CN102646117A (en) * | 2012-02-20 | 2012-08-22 | 华为技术有限公司 | Method and device for file data transmission |
CN103324699A (en) * | 2013-06-08 | 2013-09-25 | 西安交通大学 | Rapid data de-duplication method adapted to big data application |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8918375B2 (en) * | 2011-08-31 | 2014-12-23 | Microsoft Corporation | Content aware chunking for achieving an improved chunk size distribution |
Also Published As
Publication number | Publication date |
---|---|
CN106095971A (en) | 2016-11-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11403284B2 (en) | System for data sharing platform based on distributed data sharing environment based on block chain, method of searching for data in the system, and method of providing search index in the system | |
US11079953B2 (en) | Packing deduplicated data into finite-sized containers | |
US9626374B2 (en) | Optimizing a partition in data deduplication | |
US11615069B2 (en) | Data filtering using a plurality of hardware accelerators | |
US9064067B2 (en) | Quantum gate optimizations | |
US20150134623A1 (en) | Parallel data partitioning | |
CN106852185A (en) | Parallelly compressed encoder based on dictionary | |
US10542062B2 (en) | Method and server for searching for data stream dividing point based on server | |
JP7299334B2 (en) | Chunking method and apparatus | |
US20150088840A1 (en) | Determining segment boundaries for deduplication | |
CN106104480A (en) | Cluster-Wide Memory Management Using Affinity Preserving Signatures | |
CN112104725A (en) | Container mirror image duplicate removal method, system, computer equipment and storage medium | |
KR102026125B1 (en) | Lightweight complexity based packet-level deduplication apparatus and method, storage media storing the same | |
CN106095971B (en) | A kind of method and server for searching data flow cut-point based on server | |
KR101780534B1 (en) | Method and system for extracting image feature based on map-reduce for searching image | |
CN104169917B (en) | A method and server for searching data flow split point based on server | |
JP7505252B2 (en) | File server, deduplication system, processing method, and program | |
US11347424B1 (en) | Offset segmentation for improved inline data deduplication | |
CN107436913A (en) | For controlling inclined device and method in distributed ETL operations | |
JP6784096B2 (en) | Data distribution program, data distribution method, and data distribution device | |
US10922187B2 (en) | Data redirector for scale out | |
CN104424268B (en) | Data de-duplication method and equipment | |
Berman et al. | Accelerating duplicate data chunk recognition using NN trained by locality-sensitive hash | |
Asirvatham et al. | Data Deduplication and Integrity Check Scheme Based on Unique Tags in Edge Cloud Computing. | |
HK1173540B (en) | Method, device and system for processing repetitive data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20211221. Address after: 450046 Floor 9, building 1, Zhengshang Boya Plaza, Longzihu wisdom Island, Zhengdong New Area, Zhengzhou City, Henan Province. Patentee after: xFusion Digital Technologies Co., Ltd. Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen. Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd. |