[go: up one dir, main page]

CN118796812A - Data duplication detection method, device and electronic device based on structured evidence - Google Patents

Data duplication detection method, device and electronic device based on structured evidence Download PDF

Info

Publication number
CN118796812A
CN118796812A CN202410177061.9A CN202410177061A CN118796812A CN 118796812 A CN118796812 A CN 118796812A CN 202410177061 A CN202410177061 A CN 202410177061A CN 118796812 A CN118796812 A CN 118796812A
Authority
CN
China
Prior art keywords
data
file
structured
repeatability
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410177061.9A
Other languages
Chinese (zh)
Other versions
CN118796812B (en
Inventor
王雪
张高山
杜雪涛
王勇
常嘉岳
彭静娴
詹义
刘梦媛
朱华
巴特尔
刘仲思
方明星
尹子轩
洪东
倪宁宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Design Institute Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Design Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Design Institute Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202410177061.9A priority Critical patent/CN118796812B/en
Publication of CN118796812A publication Critical patent/CN118796812A/en
Application granted granted Critical
Publication of CN118796812B publication Critical patent/CN118796812B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本公开提供了一种基于结构化存证的数据重复度检测方法、装置及电子设备,涉及区块链技术领域,区别于传统的区块链存证时仅关注数据真实性,本公开的核心构思在于引入结构化数据信息、预设数据结构、双向滑窗及动态步长等机制进行数据处理实现存证与验证,使得存证过程不仅能保证数据的一致性和完整性,而且还能准确反映出验证数据与存证文件的重复度。本公开在基于区块链的电子证据模式下,可以在大规模数据集的重复度检测中,在保证数据存证的有效性和准确性的同时,还兼顾较低的误报率与较高的存储空间效率,从而降低存储开销,并显著拓宽存证的功能性和应用范围。

The present disclosure provides a data duplication detection method, device and electronic device based on structured evidence storage, which relates to the field of blockchain technology. Different from the traditional blockchain evidence storage that only focuses on data authenticity, the core concept of the present disclosure is to introduce structured data information, preset data structure, two-way sliding window and dynamic step size and other mechanisms to process data to achieve evidence storage and verification, so that the evidence storage process can not only ensure the consistency and integrity of the data, but also accurately reflect the duplication of the verification data and the evidence file. In the electronic evidence mode based on blockchain, the present disclosure can ensure the validity and accuracy of data evidence storage in the duplication detection of large-scale data sets, while also taking into account low false alarm rate and high storage space efficiency, thereby reducing storage overhead and significantly broadening the functionality and application scope of evidence storage.

Description

Data repeatability detection method and device based on structured certificate, and electronic equipment
Technical Field
The disclosure relates to the technical field of blockchains, in particular to a data repeatability detection method and device based on structured document storage and electronic equipment.
Background
With the rapid development of the information age, a large amount of data is recorded and saved in the form of electronic certificates. Meanwhile, the blockchain technology is an ideal choice in the field of electronic data storage and evidence by the characteristics of unique non-falsification, non-repudiation, multiparty participation and the like, so that the electronic data storage and evidence becomes an important application of the blockchain technology.
A blockchain is a distributed database whose mode of operation requires all data to be synchronized and stored on each network node. If a large amount of data is directly uplinked, a large amount of memory space and network resources will be consumed, thereby affecting the performance and operating efficiency of the blockchain. In addition, considering the openness of the blockchain ledger, directly storing the original text of the electronic evidence on the chain may bring about a risk of privacy disclosure. Therefore, in order to reduce the storage overhead of the blockchain and to ensure the privacy of the original data, it is common practice to store only the hash value (hash value) of the electronic proof in the uplink. When the authenticity of the electronic evidence needs to be verified, whether the electronic evidence is tampered can be confirmed by only comparing the hash value of the electronic evidence to be verified with the hash value stored in the uplink.
However, the conventional hash function is a one-to-one mapping relationship, which means that even a small change of data can cause a huge change of hash value, and for some application scenarios (such as detecting a version difference and a repetition degree of a document) needing to detect the content change degree, the characteristic makes the hash function unable to accurately reflect the degree of data change.
Disclosure of Invention
The present disclosure has been made in view of the above-described problems. The present disclosure provides a method, an apparatus, and an electronic device for detecting data repeatability based on structured document, and accordingly provides a computer readable storage medium and a computer program product.
According to one aspect of the present disclosure, there is provided a data repeatability detection method based on structured document, including:
the method comprises the steps of obtaining structured data information of a file to be stored by utilizing a bidirectional sliding window in advance;
storing the structured data information into a preset data structure and finishing the uplink;
When the repeatability is detected, the structured data information of the file to be detected is obtained by utilizing a bidirectional sliding window and is matched with the information in the data structure extracted from the block chain and obtained through analysis; in the matching process, matching is carried out on the content of the rest files when unmatched conditions occur by dynamically adjusting the moving step length of the sliding window;
after the matching is finished, the repeatability of the file to be detected is obtained through all the matching results.
In addition, according to a method for detecting data repeatability based on structured document according to one aspect of the present disclosure, the matching of the remaining file contents when the unmatched condition occurs by dynamically adjusting the moving step length of the sliding window includes:
Stopping sliding window movement when the structured data information of the file to be detected is not matched with the information in the data structure;
After both sliding windows stop, the window sliding is continuously executed and the matching is continuously carried out on the rest content between the two sliding windows in a dynamic decreasing step size mode.
In addition, according to a data repeatability detection method based on structured document according to an aspect of the present disclosure, the condition for confirming the end of matching includes:
all available step sizes that are dynamically adjusted have been employed or the comparison of the information content that is stored in the data structure has been completed.
In addition, according to a method for detecting data repeatability based on structured document according to one aspect of the present disclosure, the storing the structured data information into a preset data structure includes:
and constructing an array corresponding to the structured data information by utilizing the data structure, and taking the array as a storage fingerprint of the file to be stored in the data structure.
In addition, according to the data repeatability detection method based on the structured document according to one aspect of the disclosure, the calculating the repeatability of the document to be detected through all the matching results includes:
Obtaining the similarity between the file to be detected and the corresponding document according to the ratio of the successfully matched structured data to the total structured data of the corresponding document; or setting weights for different structured data of the document to be detected, and calculating the similarity between the document to be detected and the corresponding document to be detected according to the successful matching result and by combining the weights.
In addition, according to a method for detecting data repeatability based on structured document according to one aspect of the present disclosure, a method for obtaining structured data information by using a bidirectional sliding window includes:
and setting a sliding window at the head and the tail of the file respectively, moving the two sliding windows from the starting position and the ending position of the file in opposite directions based on a preset step length, and calculating corresponding hash values for file areas in the sliding windows during the movement of each sliding window.
According to another aspect of the present disclosure, there is provided a data repeatability detection device based on structured document, the detection device comprising:
The storage and certification segmentation module is used for obtaining structured data information of a file to be stored by utilizing a bidirectional sliding window in advance;
The fragmentation information processing module is used for storing the structured data information into a preset data structure and finishing the uplink;
the file verification module is used for obtaining structured data information of a file to be detected by utilizing a bidirectional sliding window when the repeatability is detected, and matching the structured data information with information in the data structure extracted from the block chain and obtained through analysis; in the matching process, matching is carried out on the content of the rest files when unmatched conditions occur by dynamically adjusting the moving step length of the sliding window;
And the repeatability calculation module is used for obtaining the repeatability of the file to be detected through all the matching results after the matching is finished.
Furthermore, according to an aspect of the present disclosure, the document authentication module includes:
the matching pause unit is used for stopping sliding window movement when the structured data information of the file to be detected is not matched with the information in the data structure;
And the dynamic step length adjusting unit is used for continuously executing window sliding and continuously matching the residual content between the two sliding windows in a dynamic step length decreasing mode after the two sliding windows are stopped.
Further, according to an aspect of the present disclosure, the data repeatability detecting device based on structured document, the condition for confirming the end of matching includes: all available step sizes that are dynamically adjusted have been employed or the comparison of the information content that is stored in the data structure has been completed.
Further, according to an aspect of the present disclosure, the fragmentation information processing module is configured to: and constructing an array corresponding to the structured data information by utilizing the data structure, and taking the array as a storage fingerprint of the file to be stored in the data structure.
Furthermore, according to an aspect of the present disclosure, the data repeatability detection device based on structured document, the repeatability calculation module includes:
The first similarity calculation unit is used for obtaining the similarity between the file to be detected and the corresponding certificate storage file according to the ratio of the successfully matched structured data to the total structured data of the corresponding certificate storage file; or the second similarity calculation unit is used for setting weights for different structured data of the document to be detected, and calculating the similarity between the document to be detected and the corresponding document to be detected according to the successful matching result and the weights.
In addition, according to an aspect of the present disclosure, a data repeatability detecting device based on structured document, a manner of obtaining structured data information by using a bidirectional sliding window includes: and setting a sliding window at the head and the tail of the file respectively, moving the two sliding windows from the starting position and the ending position of the file in opposite directions based on a preset step length, and calculating corresponding hash values for file areas in the sliding windows during the movement of each sliding window.
According to still another aspect of the present disclosure, there is provided an electronic apparatus including: a memory for storing computer readable instructions; and a processor for executing the computer readable instructions to cause the electronic device to perform the structured document based data repetition detection method as described above.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer-readable instructions, which when executed by a processor, cause the processor to perform the structured document-based data repetition detection method as described above.
According to a further aspect of the present disclosure, there is provided a computer program product for performing the method of the first aspect or any of the possible implementations of the first aspect, when the computer program product is executed by a computer. In a possible design of the fifth aspect, the relevant program related to the product may be stored in whole or in part on a memory packaged with the processor, or may be stored in part or in whole on a storage medium not packaged with the processor.
As will be described in detail below, according to the method, apparatus, electronic device, computer-readable storage medium and computer program product for detecting data repeatability based on structured document according to the embodiments of the present disclosure, unlike conventional blockchain document where only data authenticity is concerned, the core concept of the present disclosure is to introduce mechanisms such as structured data information, preset data structures, bidirectional sliding windows, dynamic step sizes, etc. to perform data processing to implement document verification and verification, so that the document verification process can not only ensure consistency and integrity of data, but also accurately reflect the repeatability of verification data and document. According to the method and the device, in the electronic evidence mode based on the blockchain, in the repeatability detection of a large-scale data set, the validity and the accuracy of data storage are ensured, and meanwhile, the lower false alarm rate and the higher storage space efficiency are also considered, so that the storage cost is reduced, and the functionality and the application range of the storage are remarkably widened.
The scheme of the disclosure widens the functions of the blockchain technology in terms of data storage and certification, so that the blockchain storage and certification mechanism has the functions of data consistency, integrity and similarity detection, almost can cover all fields needing data storage and certification, has potential value in application in a plurality of fields such as copyright protection, malicious code detection, SQL injection prevention, judicial, medical treatment, education, supply chain management and the like, and has wide market prospect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and are intended to provide further explanation of the technology claimed.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments thereof with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of embodiments of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, without limitation to the disclosure. In the drawings, like reference numerals generally refer to like parts or steps.
Fig. 1 is a schematic flow chart illustrating a data repeatability detection method based on structured document according to an embodiment of the present disclosure.
Fig. 2 is a flow diagram illustrating a dynamic step-based sliding window matching method according to an embodiment of the present disclosure.
Fig. 3 is a schematic diagram illustrating a sharded file according to an embodiment of the present disclosure.
Fig. 4 is a sliding window (slider) matching schematic illustrating a document to be inspected according to an embodiment of the present disclosure.
Fig. 5 is a functional block diagram illustrating a structured document-based data repetition detection device according to an embodiment of the present disclosure.
Fig. 6 is a hardware block diagram illustrating an electronic device according to an embodiment of the present disclosure.
Fig. 7 is a schematic diagram illustrating a computer-readable storage medium according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, exemplary embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
First, a data repetition detection method based on structured document according to an embodiment of the present disclosure will be described with reference to fig. 1 to 4. Fig. 1 is a schematic flow chart illustrating a data repeatability detection method based on structured document according to an embodiment of the present disclosure. Fig. 2 is a flow diagram further illustrating a dynamic step-based sliding window matching method according to an embodiment of the present disclosure. Fig. 3 is a schematic diagram further illustrating fragmentation of a document to be authenticated according to an embodiment of the disclosure. Fig. 4 is a sliding window matching schematic diagram further illustrating a document to be detected according to an embodiment of the present disclosure.
A data repetition detection method based on structured document according to an embodiment of the present disclosure will be described in detail with reference to fig. 1 to 4, including:
S1, obtaining structured data information of a document to be authenticated by utilizing a bidirectional sliding window in advance;
For example, the document to be stored (which can be understood as a standard article or the like on which the subsequent comparison is based) may be split by means of a sliding window, and a hash value of each fragment is calculated.
Specifically, two sliding windows can be set, wherein one sliding window slides backwards from the starting position of the file to be stored based on a preset step length, the other sliding window also slides forwards from the ending position of the file to be stored based on a preset step length, corresponding hash values are calculated for file areas in the sliding windows during sliding of each sliding window, sliding is ended when the two sliding windows meet the preset conditions, and finally hash values corresponding to a plurality of text fragments of the file to be stored are obtained.
S2, storing the structured data information into a preset data structure and finishing the uplink;
In connection with the application scenario where text repetition detection tasks are performed with low storage overhead, the data structure may consider a specific structural form with checking whether an element is in a set, for example, by a data structure with high spatial efficiency to quickly check whether an element is in multiple filter structures in a set: bloom filters, cuckoo filters, quotient filters Quotient Filter, etc., will be shown hereinafter for ease of illustration.
Specifically, for each piece of hash in the previous example, k values corresponding to the hash values are calculated by using a function of the bloom filter to form a bloom storage array corresponding to the current piece, and the bloom filter of the corresponding bit is set to be 1 according to each value, so that the whole file to be stored is stored in the bloom filter in a fuzzy hash mode, namely, the storage fingerprint of each piece in the bloom filter is obtained.
The bloom filter itself and its associated information (e.g., bloom filter function, false positive rate, etc.) may then be stored to the blockchain. It may be added here that in some preferred embodiments, a bloom filter set may be constructed using multiple bloom filters or using multiple layers of bloom filters to further increase the accuracy of the repeatability detection.
S3, when the repeatability is detected, the structured data information of the file to be detected is obtained by utilizing a bidirectional sliding window, and the structured data information is matched with the information in the data structure extracted from the block chain and obtained through analysis; in the matching process, matching is carried out on the content of the rest files when unmatched conditions occur by dynamically adjusting the moving step length of the sliding window;
Reference may be made to the following in connection with the foregoing example synthesis: when the actually measured file needs to be verified, in order to realize the calculation of the file repeatability (the reverse meaning of the actually measured file represents the change degree), a bloom filter and related information thereof can be obtained from a blockchain and analyzed, two sliding windows which move oppositely can be arranged at the head and the tail of the file to be detected, sliding starts from the two ends of the file to be detected, and the sliding windows are simultaneously in a verification stage, and during the sliding period, hash values are calculated for file areas in the sliding windows; then, matching the calculated hash value with the certificate information analyzed by the bloom filter, and accumulating the successfully matched fragments in the process, namely determining the modification degree of the file in the mode; furthermore, it is illustrated with reference to fig. 2:
step S31, stopping sliding window movement when the structured data information of the file to be detected is not matched with the information in the data structure;
and S32, after the two sliding windows are stopped, continuing to execute window sliding and continuing to match the residual content between the two sliding windows in a dynamic step-down step-size mode.
That is, when the hash value of the actually measured file does not match with the data of the bloom filter, the sliding window is stopped, and after both sliding windows are stopped, for the rest texts between the two sliding windows (i.e. the contents suspected to have modification are extracted), the sliding is continuously performed on the rest texts by adopting a dynamic step size adjustment mode (preferably a step size reduction mode), and the successfully matched fragment numbers are continuously accumulated.
This phase of continuing the matching with dynamic step size has a higher flexibility and adaptability, in particular a balance can be found between detection accuracy and storage overhead, which is difficult to achieve in general fuzzy hashing.
And S4, after the matching is finished, obtaining the repeatability of the file to be detected through all the matching results.
Under the condition that the matching can not be continuously carried out by using the dynamic step length, the matching result counted in the whole matching stage is utilized to obtain the final quantized result of the repeatability of the file to be detected and the original document. Overall, the implementation concept of this step is to calculate and measure the repeatability between the whole files according to the similarity between the structured data, and different ways of calculating the repeatability will be provided for reference later.
In connection with the above embodiments, a complete description of the implementation of the solution is provided below by way of specific examples.
1. Preprocessing a document to be stored, and unifying character codes:
each character of a file to be processed (namely, a file to be stored) is converted into the same length; specifically, all characters in the file to be processed are coded and converted, and are unified into codes with the same length, for example, but not limited to UTF-16 codes, the step can ensure that the coding length of each character is consistent, and the condition of cutting characters in the follow-up slicing links is avoided.
2. Constructing a fragment hash of a file to be stored:
(1) Slicing and initial step length setting: and setting the fragment length and the certificate storing sliding step length corresponding to the pretreatment of the files to be processed. For example, in the previous example, if UTF-16 encoding is used, the slice length and step size may be set to a multiple of 2, which may ensure that the movement of the sliding window (sliding window, slider for short) does not cut the character.
(2) Calculating the initial value of the fragment hash: in practical operation, from another point of view, the first L bytes of the file to be authenticated are regarded as a sliding window (i.e. the fragment length is L), and hash values of all bytes in the sliding window are calculated. Specifically, assuming the decimal code of the character is b [ i ], where i is from 0 to L-1, the hash value h can be calculated using the following formula:
Where a is the slip factor and q is a given large prime number.
The calculated hash value is used as the hash value of the first fragment of the file (namely the initial hash value), then the sliding window is moved backwards by S bytes according to the step length, and a new hash value is calculated in a similar way. Specifically, the new hash value h' can be efficiently calculated using the following formula by utilizing the characteristics of the hash value h and the rolling hash of the previous step: h (b')= (h (b) -old_char·a L/2-i) ·a+new_ char modq;
wherein old_char is the first character code value to be removed, new_char is the first character code value added, and the hash value is updated by iteration until The resulting hash value is used as the hash value for the second slice of the file.
3. Bidirectional sliding window evidence:
Under the above-described basic setting premise, specifically, the present disclosure proposes using a double sliding window for the opposite-direction deposit (may also be referred to as opposite-direction deposit).
(1) File slicing:
First, two sliding windows are initialized, one at the beginning of the document to be authenticated and one at the end of the document to be authenticated. Then, the hash value of the content in the two sliding windows is calculated, the sliding window is moved backward (for the beginning window) or forward (for the ending window), the sliding can be stopped until the two sliding windows meet or cross, the sliding is represented that the whole file to be stored is traversed, and u fragments are finally obtained, see the article to be stored illustrated in fig. 3, wherein the initial length l=10 and the step length s=7 of the fragments. In the first fragment illustrated in fig. 3, the character string is "pre-bed bright moon, suspicious ground", which is encoded in UTF-16BE as follows: 5E8A,524D,660E,6708, 5149,FF0C,7591, 662F,5730,4E0A. Let a=256 and q= 1000000007, the following table is given:
Through the above procedure, the Hash value Hash (1) of the first fragment (i.e., the fragment Hash referred to in this disclosure) is obtained: 565700895. next, the hash value of the second fragment is calculated, and in the above example, the sliding step is 7 characters, so the leftmost 7 characters need to be removed from the hash value, and then new 7 characters are added. In connection with the example, the new 7 character "frost". The UTF-16BE code of the first-mentioned WangMingyue is as follows: 971C,3002,4E3E,5934, 671B,660E,6708; then, the leftmost 7 characters in the hash value need to be removed. These characters are "bed front bright moon light, suspect", whose UTF-16BE codes as: 5E8A,524D,660E,6708, 5149,FF0C,7591; thus, the following table can be formed:
Rejecting characters Encoding Decimal code New character Encoding Decimal code Iterative hash value
Bed with a bed body 5E8A 24202 Cream 971C 38684 110680137
Front part 524D 21069 3002 12290 444548630
Ming dynasty 660E 26126 Lifting device 4E3E 20030 485022009
Month of moon 6708 26120 Head 5934 22836 498369576
Light source 5149 20841 Wash the looking at 671B 26395 19394918
FF0C 65292 Ming dynasty FF0C 65292 266899698
Doubt 7591 30097 Month of moon 7591 30097 696545482
By the above steps, a second piece Hash (2) "696545482" is obtained, and similarly, the two sliding windows finally obtain u=15 Hash pieces: hash (1), hash (2), hash (3), hash (…), hash (15).
(2) Creating a corresponding bloom filter:
Since bloom filters have false positives, the false positive rate is controlled by setting a reasonable bloom filter size and the number of files stored in each bloom filter so that the rate is within an acceptable range. It should be noted that in actual operation, a plurality of bloom filters may be provided to form a bloom filter set to store a large number of files, and the present disclosure is illustrated with only a single bloom filter as a schematic illustration.
(2.1) Initializing a bloom filter:
For each individual piece hash, the following steps will be performed to add it to the bloom filter: assuming that X files are stored at most, each file is divided into u segments (e.g., the previous example article is cut into 15 segments), and the overall false positive rate of the tolerable article is C, then the false positive rate formula of the bloom filter can be used: p= (1-e -kn/m)k, which gives c= (1- (1-e -kXu/m)k)u), the formula describes the relationship between the accuracy C of the whole article and the number of bits m of bloom filter, the number of hash functions k, the number of articles X and the number of fragments u of each article.
For example, if m= 491474 bits are set in the above example, x=1000, u=15, and k=3, then c=99%, and the false positive rate (false positive rate) p=0.067%. It will be appreciated that the bloom filter is characterized by retrieving whether an element is in a certain set, with only false positives (false positives) and no false negatives (false negatives). That means that if the bloom filter suggests that an element (fragment) is present in the collection, this may be wrong (false positive); but if the bloom filter suggests that an element is not present in the collection, then this must be correct (i.e., no false negative).
Thus, in the repetition comparison process mentioned later, the situation that the false judgment rate is highest is all false positives, that is, the articles are all repeated, in this case, only the false alarm situation exists, and no false alarm situation exists, that is, each slice involves the false alarm problem of the bloom filter, in this case, the accuracy of the whole article is 99%, and in other cases (that is, the partial repetition refers to both false positives and true negatives), the article accuracy can be more than 99%.
Let p_be the judgment accuracy of each slice, then in the process of P 1·P2·P·Pu, if P represents repetition, p_ =1-0.067%; if P represents no repetition, p_ =1, that is, C may represent the lowest accuracy of the entire file.
Further, for example, only a bloom filter of about 60Kb can store 15000 patches (1000 articles), and the misjudgment rate (false positive rate) of each patch is 0.067%, and since each article is divided into 15 patches, the minimum judgment accuracy of each article is more than 99%, which is seen to have strong practicability. From the above, the present disclosure can reduce the storage overhead when setting the overall minimum accuracy of the article.
(2.2) Storing the shard hash in a bloom filter:
using the initialized bloom filter, further compressing each to-be-stored fragment hash to obtain k numbers with the size of (0, m-1), and setting the corresponding bit in the bloom filter to 1.
For example, in the previous example, K 1(h1)=543;K2(h1)=16543K3(h1) = 35533, then h 1 is obtained, i.e. the bloom storage array of the first tile is (543,16543,35533), and the 542, 16542, 35532 th bloom filter is set to 1, so that the storage of the current first tile is completed. And (3) circulating the method, namely storing all fragments into the bloom filter, and storing the file into the bloom filter by a fuzzy hash method.
Thus, the previous article forms 15 corresponding arrays, which show which bit values of the bloom filter are 1, and then the corresponding hash value.
(3) Data chaining:
The relevant data uplink storage can specifically comprise:
the bloom filter itself: hash information containing all elements;
Bloom filter hash function information: a function of k cloths Long Haxi;
misjudgment rate: the false judgment rate of the whole bloom filter is convenient for knowing the storage performance;
and other data: such as time stamps, file fragment numbers, etc.
4. Document repetition detection step (schematic in connection with the document to be detected illustrated in fig. 4):
(1) Obtaining blockchain data: interact with the blockchain to obtain bloom filter information stored in the blockchain.
(2) Analytical bloom filter: after the block data is retrieved, the block is parsed by the predefined data structure to extract the stored bloom filter data.
(3) And for the file to be detected, performing head-tail bidirectional sliding by using a preset initial step S. In the sliding process, hash calculation is carried out on each sliding window (byte sequence with the length of L), the sliding windows are matched with the content of the bloom filter, the successfully matched and unrepeated fragment number is calculated, and the corresponding sliding window is stopped moving until the unmatched condition occurs.
(4) When the sliding windows at both sides stop, the rest text between the two sliding windows is extracted, and the text is a part which is possibly modified, namely a part with possibly different contents of the current actual measurement file and the verification file.
(5) The sliding hash is continued with the portion of text that may be modified in a manner that dynamically alters the step size. At the beginning of this phase, the step size may be set to a relatively large and smaller value than the initial S, and then may be gradually decreased, for example, but not limited to, using the following moving step size dynamic adjustment strategy: s n=Sn-1/2. The general fuzzy hash cannot accurately distinguish the similarity of different levels, but the method and the device find out which parts of the text suspected to have the modified content have higher repeatability with the original file fragments in a dynamic step length adjustment mode, so that the accuracy of the whole repeatability detection is higher.
(6) The execution (5) is repeated until all possible steps that have been dynamically adjusted have been taken (e.g. the step down to the move can only contain the smallest data unit of the measured file) or the number of fragments that have been matched meets the aforementioned fragment value u (i.e. the forensic content has been compared, and there is no slave).
(7) Calculating the repetition degree: for example, the similarity between the file to be detected and the document to be checked can be represented according to the ratio of the matched number of fragments to the total number u of fragments; of course, it can be understood that, besides performing similarity analysis by using the duty ratio, weights (i.e. importance) can be set for different structured data (i.e. fragments), and higher weight parameters are set for more important segmented contents, so that similarity calculation is performed according to the weights, thereby better showing the actual content repetition degree (corresponding to modification degree) among files, and the similarity calculation mode also has flexibility and adaptability.
In the above, the data repetition detection method based on the structured document according to the embodiment of the present disclosure is described. Hereinafter, a data repetition detection device based on structured document for realizing the above detection method will be further described. Fig. 5 is a functional block diagram illustrating a structured document-based data repetition detection device according to an embodiment of the present disclosure.
As shown in fig. 5, a data repetition detection device 500 based on structured document according to an embodiment of the present disclosure includes:
the certification segmentation module 501 is configured to obtain structured data information of a file to be certified by using a bidirectional sliding window in advance;
The slicing information processing module 502 is configured to store the structured data information into a preset data structure and complete uplink;
The file verification module 503 is configured to obtain structured data information of a file to be detected by using a bidirectional sliding window when the repeatability is detected, and match the structured data information with information in the data structure extracted from the blockchain and obtained by parsing; in the matching process, matching is carried out on the content of the rest files when unmatched conditions occur by dynamically adjusting the moving step length of the sliding window;
and the repeatability calculation module 504 is used for obtaining the repeatability of the file to be detected through all the matching results after the matching is finished.
Further, the file verification module includes:
the matching pause unit is used for stopping sliding window movement when the structured data information of the file to be detected is not matched with the information in the data structure;
And the dynamic step length adjusting unit is used for continuously executing window sliding and continuously matching the residual content between the two sliding windows in a dynamic step length decreasing mode after the two sliding windows are stopped.
Further, the condition for confirming the end of the matching includes: all available step sizes that are dynamically adjusted have been employed or the comparison of the information content that is stored in the data structure has been completed.
Further, the fragmentation information processing module is configured to: and constructing an array corresponding to the structured data information by utilizing the data structure, and taking the array as a storage fingerprint of the file to be stored in the data structure.
Further, the repeatability calculation module includes:
The first similarity calculation unit is used for obtaining the similarity between the file to be detected and the corresponding certificate storage file according to the ratio of the successfully matched structured data to the total structured data of the corresponding certificate storage file; or the second similarity calculation unit is used for setting weights for different structured data of the document to be detected, and calculating the similarity between the document to be detected and the corresponding document to be detected according to the successful matching result and the weights.
Fig. 6 is a hardware block diagram illustrating an electronic device 600 according to an embodiment of the disclosure. An electronic device according to an embodiment of the present disclosure includes at least a processor; and a memory for storing computer readable instructions. When the computer readable instructions are loaded and executed by the processor, the processor performs the structured document based data repetition detection method as described above.
The electronic device 600 shown in fig. 6 specifically includes: a Central Processing Unit (CPU) 601, a Graphics Processing Unit (GPU) 602, and a main memory 603. These units are interconnected by a bus 604. A Central Processing Unit (CPU) 601 and/or a Graphics Processing Unit (GPU) 602 may be used as the above-described processor, and a main memory 603 may be used as the above-described memory storing computer-readable instructions. In addition, the electronic device 600 may further comprise a communication unit 605, a storage unit 606, an output unit 607, an input unit 608 and an external device 609, which are also connected to the bus 604.
Fig. 7 is a schematic diagram illustrating a computer-readable storage medium according to an embodiment of the present disclosure. As shown in fig. 7, a computer-readable storage medium 700 according to an embodiment of the present disclosure has computer-readable instructions 701 stored thereon. The computer readable instructions 701, when executed by a processor, perform a structured document based data repetition detection method according to embodiments of the present disclosure described with reference to the above figures. The computer-readable storage medium includes, but is not limited to, for example, volatile memory and/or nonvolatile memory. The volatile memory may include, for example, random Access Memory (RAM) and/or cache memory (cache), and the like. The non-volatile memory may include, for example, read Only Memory (ROM), hard disk, flash memory, optical disk, magnetic disk, and the like.
Finally, it may also be added that a computer program product according to an embodiment of the disclosure (the product may include the above apparatus) when run on a terminal device, causes the terminal device to perform the method for detecting data repeatability based on structured document according to the foregoing embodiment or equivalent implementation. From the above description of embodiments, it will be apparent to those skilled in the art that all or part of the steps of the above described methods may be implemented in software plus necessary general purpose hardware platforms. Based on such understanding, the above-described computer program product may include, but is not limited to, an APP.
In the foregoing, the method, the device, the electronic device, the computer readable storage medium and the computer program product for detecting the data repeatability based on the structured document according to the embodiments of the present disclosure are described with reference to the accompanying drawings, and the main concept is that only the data authenticity is concerned when distinguishing from the conventional blockchain document, and the core concept of the present disclosure is to introduce the structured data information, the preset data structure, the bidirectional sliding window, the dynamic step size and other mechanisms to perform the data processing to realize the document and the verification, so that the document process can not only ensure the consistency and the integrity of the data, but also accurately reflect the repeatability of the verification data and the document. According to the method and the device, in the electronic evidence mode based on the blockchain, in the repeatability detection of a large-scale data set, the validity and the accuracy of data storage are ensured, and meanwhile, the lower false alarm rate and the higher storage space efficiency are also considered, so that the storage cost is reduced, and the functionality and the application range of the storage are remarkably widened.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The basic principles of the present disclosure have been described above in connection with specific embodiments, but it should be noted that the advantages, benefits, effects, etc. mentioned in the present disclosure are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present disclosure. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, since the disclosure is not necessarily limited to practice with the specific details described.
The block diagrams of the devices, apparatuses, devices, systems referred to in this disclosure are merely illustrative examples and are not intended to require or imply that the connections, arrangements, configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, the devices, apparatuses, devices, systems may be connected, arranged, configured in any manner. Words such as "including," "comprising," "having," and the like are words of openness and mean "including but not limited to," and are used interchangeably therewith. The terms "or" and "as used herein refer to and are used interchangeably with the term" and/or "unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to.
In addition, as used herein, the use of "or" in the recitation of items beginning with "at least one" indicates a separate recitation, such that recitation of "at least one of A, B or C" means a or B or C, or AB or AC or BC, or ABC (i.e., a and B and C), for example. Furthermore, the term "exemplary" does not mean that the described example is preferred or better than other examples.
It is also noted that in the systems and methods of the present disclosure, components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered equivalent to the present disclosure.
Various changes, substitutions, and alterations are possible to the techniques described herein without departing from the teachings of the techniques defined by the appended claims. Furthermore, the scope of the claims of the present disclosure is not limited to the particular aspects of the process, machine, manufacture, composition of matter, means, methods and acts described above. The processes, machines, manufacture, compositions of matter, means, methods, or acts, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein may be utilized. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or acts.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the disclosure to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (10)

1. The method for detecting the data repeatability based on the structured certificate is characterized by comprising the following steps of:
the method comprises the steps of obtaining structured data information of a file to be stored by utilizing a bidirectional sliding window in advance;
storing the structured data information into a preset data structure and finishing the uplink;
When the repeatability is detected, the structured data information of the file to be detected is obtained by utilizing a bidirectional sliding window and is matched with the information in the data structure extracted from the block chain and obtained through analysis; in the matching process, matching is carried out on the content of the rest files when unmatched conditions occur by dynamically adjusting the moving step length of the sliding window;
after the matching is finished, the repeatability of the file to be detected is obtained through all the matching results.
2. The method for detecting data repeatability based on structured document according to claim 1, wherein said matching the remaining file contents when unmatched condition occurs by dynamically adjusting the moving step size of the sliding window comprises:
Stopping sliding window movement when the structured data information of the file to be detected is not matched with the information in the data structure;
After both sliding windows stop, the window sliding is continuously executed and the matching is continuously carried out on the rest content between the two sliding windows in a dynamic decreasing step size mode.
3. The structured document-based data repetition detection method of claim 1, wherein the condition for confirming the end of the matching includes:
all available step sizes that are dynamically adjusted have been employed or the comparison of the information content that is stored in the data structure has been completed.
4. The method for detecting data repeatability based on structured document according to claim 1, wherein said storing the structured data information into a preset data structure comprises:
and constructing an array corresponding to the structured data information by utilizing the data structure, and taking the array as a storage fingerprint of the file to be stored in the data structure.
5. The method for detecting the repeatability of data based on structured document according to claim 1, wherein said obtaining the repeatability of the document to be detected from all the matching results comprises:
Obtaining the similarity between the file to be detected and the corresponding document according to the ratio of the successfully matched structured data to the total structured data of the corresponding document; or setting weights for different structured data of the document to be detected, and calculating the similarity between the document to be detected and the corresponding document to be detected according to the successful matching result and by combining the weights.
6. The method for detecting data repeatability based on structured document according to any one of claims 1-5, wherein the way of obtaining structured data information by using a bidirectional sliding window comprises:
and setting a sliding window at the head and the tail of the file respectively, moving the two sliding windows from the starting position and the ending position of the file in opposite directions based on a preset step length, and calculating corresponding hash values for file areas in the sliding windows during the movement of each sliding window.
7. Data repeatability detection device based on structuring is deposited a certificate, characterized in that, detection device includes:
The storage and certification segmentation module is used for obtaining structured data information of a file to be stored by utilizing a bidirectional sliding window in advance;
The fragmentation information processing module is used for storing the structured data information into a preset data structure and finishing the uplink;
the file verification module is used for obtaining structured data information of a file to be detected by utilizing a bidirectional sliding window when the repeatability is detected, and matching the structured data information with information in the data structure extracted from the block chain and obtained through analysis; in the matching process, matching is carried out on the content of the rest files when unmatched conditions occur by dynamically adjusting the moving step length of the sliding window;
And the repeatability calculation module is used for obtaining the repeatability of the file to be detected through all the matching results after the matching is finished.
8. An electronic device, comprising:
A memory for storing computer readable instructions; and
A processor for executing the computer readable instructions to cause the electronic device to perform the structured document based data repetition detection method of any one of claims 1 to 6.
9. A non-transitory computer readable storage medium storing computer readable instructions which, when executed by a processor, cause the processor to perform a structured document-based data repeatability detection method according to any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the structured document based data repetition detection method of any of claims 1 to 6.
CN202410177061.9A 2024-02-08 2024-02-08 Data duplication detection method, device and electronic equipment based on structured evidence storage Active CN118796812B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410177061.9A CN118796812B (en) 2024-02-08 2024-02-08 Data duplication detection method, device and electronic equipment based on structured evidence storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410177061.9A CN118796812B (en) 2024-02-08 2024-02-08 Data duplication detection method, device and electronic equipment based on structured evidence storage

Publications (2)

Publication Number Publication Date
CN118796812A true CN118796812A (en) 2024-10-18
CN118796812B CN118796812B (en) 2026-01-20

Family

ID=93026889

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410177061.9A Active CN118796812B (en) 2024-02-08 2024-02-08 Data duplication detection method, device and electronic equipment based on structured evidence storage

Country Status (1)

Country Link
CN (1) CN118796812B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101968796A (en) * 2010-09-09 2011-02-09 北京邮电大学 Method for segmenting bidirectionally and concurrently executed file level variable-length data
US20150347445A1 (en) * 2014-05-30 2015-12-03 International Business Machines Corporation Deduplication of file
CN105808169A (en) * 2016-03-14 2016-07-27 联想(北京)有限公司 Data deduplication method, apparatus and system
CN105912268A (en) * 2016-04-12 2016-08-31 韶关学院 Distributed data deduplocation method and apparatus based on self-matching characteristics
CN109359183A (en) * 2018-10-11 2019-02-19 南京中孚信息技术有限公司 Method, device and electronic device for checking text information
CN110991358A (en) * 2019-12-06 2020-04-10 腾讯科技(深圳)有限公司 Text comparison method and device based on block chain
US20200244440A1 (en) * 2019-07-18 2020-07-30 Alibaba Group Holding Limited Blockchain-based data evidence storage method and apparatus
CN112541199A (en) * 2020-12-16 2021-03-23 宁波云麟信息科技有限公司 Block chain-based electronic storage certificate integrity verification method and electronic equipment
CN114185850A (en) * 2021-12-17 2022-03-15 杭州电子科技大学 Cloud storage duplicate removal method and device based on sliding window block optimization algorithm
US20230385455A1 (en) * 2022-05-31 2023-11-30 Acronis International Gmbh Automatic Identification of Files with Proprietary Information

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101968796A (en) * 2010-09-09 2011-02-09 北京邮电大学 Method for segmenting bidirectionally and concurrently executed file level variable-length data
US20150347445A1 (en) * 2014-05-30 2015-12-03 International Business Machines Corporation Deduplication of file
CN105808169A (en) * 2016-03-14 2016-07-27 联想(北京)有限公司 Data deduplication method, apparatus and system
CN105912268A (en) * 2016-04-12 2016-08-31 韶关学院 Distributed data deduplocation method and apparatus based on self-matching characteristics
CN109359183A (en) * 2018-10-11 2019-02-19 南京中孚信息技术有限公司 Method, device and electronic device for checking text information
US20200244440A1 (en) * 2019-07-18 2020-07-30 Alibaba Group Holding Limited Blockchain-based data evidence storage method and apparatus
CN110991358A (en) * 2019-12-06 2020-04-10 腾讯科技(深圳)有限公司 Text comparison method and device based on block chain
CN112541199A (en) * 2020-12-16 2021-03-23 宁波云麟信息科技有限公司 Block chain-based electronic storage certificate integrity verification method and electronic equipment
CN114185850A (en) * 2021-12-17 2022-03-15 杭州电子科技大学 Cloud storage duplicate removal method and device based on sliding window block optimization algorithm
US20230385455A1 (en) * 2022-05-31 2023-11-30 Acronis International Gmbh Automatic Identification of Files with Proprietary Information

Also Published As

Publication number Publication date
CN118796812B (en) 2026-01-20

Similar Documents

Publication Publication Date Title
Roussev Hashing and data fingerprinting in digital forensics
US7950062B1 (en) Fingerprinting based entity extraction
US20170038978A1 (en) Delta Compression Engine for Similarity Based Data Deduplication
US20130198152A1 (en) Systems and methods for data compression
US9645828B2 (en) Method of searching character string, character string searching device, and recording medium
CN109299086B (en) Optimal sort key compression and index reconstruction
CN101464910B (en) Balance clustering compression method based on data similarity
CN108073815B (en) Family judgment method and system based on code slice and storage medium
CN106649360B (en) Data repeatability checking method and device
JP2017526021A (en) Error correction apparatus and method for data retrieval
Breitinger et al. Automated evaluation of approximate matching algorithms on real data
US20220156233A1 (en) Systems and methods for sketch computation
CN104866610B (en) A kind of SQLite based on similar type matching estimation deletes data reconstruction method
EP4078340B1 (en) Systems and methods for sketch computation
US20210191640A1 (en) Systems and methods for data segment processing
Petermann et al. DIMSpan: Transactional frequent subgraph mining with distributed in-memory dataflow systems
US11334669B2 (en) Method for fast and intelligent comparison and security detection of mobile malware big data
US9438704B2 (en) Communication and message-efficient protocol for computing the intersection between different sets of data
CN115794964B (en) Data verification method, device, computer equipment and storage medium
CN117526965A (en) Intelligent compression storage method for bank data, computer equipment and storage medium
CN118796812B (en) Data duplication detection method, device and electronic equipment based on structured evidence storage
CN118277620B (en) Secure cloud storage monitoring system and method based on virtual machine
US20130226941A1 (en) System and method for classifying signals using the bloom filter
CN113807087B (en) Method and device for detecting similarity of website domain names
US20220245097A1 (en) Hashing with differing hash size and compression size

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant