Method for deduplicating data without a business primary key during database storage
Technical Field
The invention relates to the technical field of database query deduplication, and in particular to a method for deduplicating data that lacks a business primary key when it is stored in a database.
Background
When a database table is designed, a business primary key field is usually defined, and the uniqueness of each record is judged by that field. Sometimes, however, externally supplied data has no business primary key. Before such data is stored, the system must check whether identical data already exists in order to decide how to proceed. Without a business primary key, the check must use every field of the record as a query condition, which is very inefficient when the table holds a large amount of data, especially when the stored fields are unsuitable for database indexes.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method for deduplicating data without a business primary key when it is stored in a database.
To this end, the invention adopts the following technical scheme: a method for deduplicating data without a business primary key during database storage, comprising the steps of:
S1: receiving external data: business data from outside the original database system is imported into the original database through a database receiving module, ensuring that the business data is imported quickly;
S2: data field conversion: a data conversion module concatenates the field names and field values of the imported external business data into a character string according to a fixed rule, ensuring that every record is concatenated by the same rule;
S3: string hashing: a data hash module computes a message digest of the concatenated string using the SHA-256 algorithm; the digest is a byte array, and every string can be hashed deterministically;
S4: digest conversion: a digest conversion module converts the byte-array digest into a character string H1, providing a convenient reference value for subsequent query comparison;
S5: secondary hashing: a digest hash module applies the FNV1_32_HASH algorithm to the string H1 to obtain an integer value H2, ensuring efficient deduplication queries once the database column is indexed;
S6: deduplication query: a deduplication module queries the database using the two hash values H1 and H2 as conditions; if matching data is found, the corresponding deduplication processing is performed; if no matching data exists, the business data is stored in the database together with H1 and H2, achieving a fast and efficient deduplication query;
S7: follow-up processing: a post-processing module of the database system performs targeted overwriting and deduplication on the query results, removing duplicated business data and ensuring the consistency and uniqueness of the data in the database system.
As a further description of the above technical solution:
the data conversion rule is: a field with name F1 and value V1 forms the pair F1=V1, and a field with name Fn and value Vn forms the pair Fn=Vn; the pairs are sorted alphabetically by field name and joined to form the final string F1=V1&...&Fn=Vn.
As a further description of the above technical solution:
the character string produced by the digest conversion module is in hexadecimal form.
As a further description of the above technical solution:
a hash operation is a method for creating a small digital fingerprint from any kind of data: a hash function compresses a message into a digest, reducing the amount of data and fixing its format, and the digest is usually represented as a short string of seemingly random letters and digits.
As a further description of the above technical solution:
the SHA algorithms (Secure Hash Algorithms) are a family of cryptographic hash functions that compute a fixed-length string for a digital message; different input messages correspond to different output strings with very high probability, and SHA-256 is one of the standards in this family.
As a further description of the above technical solution:
when the deduplication module checks whether identical data exists, it queries the database with an SQL statement; SQL is a database query language used to retrieve database data, and it can be used interactively at a terminal or embedded as a sub-language to support other programs.
As a further description of the above technical solution:
the FNV1_32_HASH algorithm is a hash algorithm that maps the input business data to a 32-bit integer.
As a further description of the above technical solution:
the database receiving module, the data conversion module, the data hash module, the digest conversion module, the digest hash module, the deduplication module and the post-processing module of the database system are connected in sequence, the output of each module feeding the input of the next.
As a further description of the above technical solution:
when the deduplication module judges whether identical data exists in the database, the field values H1 and H2 must both be used as query conditions, which improves the efficiency and accuracy of the query.
As a further description of the above technical solution:
the post-processing module of the database system consists of an overwriting module and a deduplication module: the overwriting module overwrites the matching existing records in the database one by one to ensure the uniqueness of the business data, and the deduplication module screens and removes duplicate records to ensure the consistency of the business data.
Advantageous effects
The invention provides a method for deduplicating data without a business primary key during database storage. The method has the following beneficial effects:
(1) The method avoids comparing every field of a record to judge whether two records are equal. Because the collision rate of the message digest algorithm is extremely low, equality can be judged by comparing only two fields, which allows database indexes to be used effectively and yields fast, comprehensive, efficient and accurate deduplication queries.
(2) The method trades space for time and breaks away from the traditional approach of querying and deduplicating records one by one, achieving fast deduplication queries over massive business data; the larger the data volume, the more pronounced the efficiency advantage becomes.
Drawings
FIG. 1 is a schematic flow diagram of the data processing in the deduplication method according to the invention;
FIG. 2 is a diagram of a database business table according to the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention.
As shown in FIGS. 1-2, a method for deduplicating data without a business primary key during database storage comprises the steps of:
S1: receiving external data: business data from outside the original database system is imported into the original database through a database receiving module, ensuring that the business data is imported quickly;
S2: data field conversion: a data conversion module concatenates the field names and field values of the imported external business data into a character string according to a fixed rule, ensuring that every record is concatenated by the same rule;
S3: string hashing: a data hash module computes a message digest of the concatenated string using the SHA-256 algorithm; the digest is a byte array, and every string can be hashed deterministically;
S4: digest conversion: a digest conversion module converts the byte-array digest into a character string H1, providing a convenient reference value for subsequent query comparison;
S5: secondary hashing: a digest hash module applies the FNV1_32_HASH algorithm to the string H1 to obtain an integer value H2, ensuring efficient deduplication queries once the database column is indexed;
S6: deduplication query: a deduplication module queries the database using the two hash values H1 and H2 as conditions; if matching data is found, the corresponding deduplication processing is performed; if no matching data exists, the business data is stored in the database together with H1 and H2, achieving a fast and efficient deduplication query;
S7: follow-up processing: a post-processing module of the database system performs targeted overwriting and deduplication on the query results, removing duplicated business data and ensuring the consistency and uniqueness of the data in the database system.
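The hashing chain of steps S2-S5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the record fields are invented, and the FNV-1 variant and its 32-bit constants are the commonly published parameters, which the patent does not spell out.

```python
import hashlib

FNV_PRIME_32 = 16777619      # standard 32-bit FNV prime (assumption)
FNV_OFFSET_32 = 2166136261   # standard 32-bit FNV offset basis (assumption)

def concat_fields(record: dict) -> str:
    """S2: sort fields by name and join them as F1=V1&...&Fn=Vn."""
    return "&".join(f"{k}={record[k]}" for k in sorted(record))

def fnv1_32(s: str) -> int:
    """S5: 32-bit FNV-1 hash (multiply by prime, then XOR each byte)."""
    h = FNV_OFFSET_32
    for b in s.encode("utf-8"):
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
        h ^= b
    return h

def digest_pair(record: dict) -> tuple[str, int]:
    """S3+S4: SHA-256 hex digest H1, then S5: FNV integer H2."""
    joined = concat_fields(record)
    h1 = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    h2 = fnv1_32(h1)
    return h1, h2

record = {"name": "Alice", "age": "30", "city": "Paris"}
h1, h2 = digest_pair(record)
print(concat_fields(record))  # age=30&city=Paris&name=Alice
```

Because the fields are sorted before concatenation, the same record always yields the same (H1, H2) pair regardless of the order in which its fields arrive.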
The data conversion rule is: a field with name F1 and value V1 forms the pair F1=V1, and a field with name Fn and value Vn forms the pair Fn=Vn; the pairs are sorted alphabetically by field name and joined to form the final string F1=V1&...&Fn=Vn.
The concatenation format used by the data conversion module is not fixed; for example, the data object can instead be serialized to a JSON character string.
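The JSON alternative can be sketched as follows; sorting the keys is what keeps the serialization deterministic, which is the property the concatenation rule needs before hashing. The record shown is an invented example.

```python
import json

record = {"name": "Alice", "age": "30", "city": "Paris"}

# sort_keys guarantees the same record always serializes to the same string;
# compact separators remove whitespace so the result is byte-for-byte stable
canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
print(canonical)  # {"age":"30","city":"Paris","name":"Alice"}
```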
The character string produced by the digest conversion module is in hexadecimal form; alternatively, the digest byte array can be transcoded with Base64 in the digest conversion module.
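The two digest encodings can be sketched as follows; for the same 32-byte SHA-256 digest, the Base64 form is shorter than the hexadecimal form. The input string is an invented example.

```python
import base64
import hashlib

digest = hashlib.sha256(b"a=1&b=2").digest()  # 32-byte array from step S3

h1_hex = digest.hex()                          # hexadecimal form: 64 characters
h1_b64 = base64.b64encode(digest).decode()     # Base64 form: 44 characters
print(len(h1_hex), len(h1_b64))  # 64 44
```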
A hash operation is a method for creating a small digital fingerprint from any kind of data: a hash function compresses a message into a digest, reducing the amount of data and fixing its format, and the digest is usually represented as a short string of seemingly random letters and digits.
The SHA algorithms (Secure Hash Algorithms) are a family of cryptographic hash functions that compute a fixed-length string for a digital message; different input messages correspond to different output strings with very high probability, and SHA-256 is one of the standards in this family.
When the deduplication module checks whether identical data exists, it queries the database with an SQL statement; SQL is a database query language used to retrieve database data, and it can be used interactively at a terminal or embedded as a sub-language to support other programs.
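The deduplication query of step S6 can be sketched with an in-memory SQLite table. The table and column names here are illustrative assumptions; the index on (h2, h1) is what makes the lookup fast, since the integer H2 is cheap to compare and narrows the candidates before the string H1 is checked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE biz (h1 TEXT, h2 INTEGER, payload TEXT)")
# index with the integer hash first: cheap to compare and highly selective
conn.execute("CREATE INDEX idx_h ON biz (h2, h1)")

def store_if_absent(h1: str, h2: int, payload: str) -> bool:
    """Return True if the record was new and inserted, False if a duplicate."""
    row = conn.execute(
        "SELECT 1 FROM biz WHERE h2 = ? AND h1 = ?", (h2, h1)
    ).fetchone()
    if row is not None:
        return False      # duplicate found: hand off to post-processing
    conn.execute(
        "INSERT INTO biz (h1, h2, payload) VALUES (?, ?, ?)", (h1, h2, payload)
    )
    return True

print(store_if_absent("abc123", 42, "row-1"))  # True  (first insert)
print(store_if_absent("abc123", 42, "row-1"))  # False (duplicate detected)
```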
The FNV1_32_HASH algorithm is a hash algorithm that maps the input business data to a 32-bit integer.
The hash algorithm in the data hash module is interchangeable: MD5 or SHA-1 can be used instead. The MD5 message digest algorithm is a widely used cryptographic hash function that produces a 128-bit hash value and was designed to ensure the integrity and consistency of transmitted information. SHA-1 is mainly used in the digital signature algorithm defined in the Digital Signature Standard; for messages shorter than 2^64 bits, SHA-1 produces a 160-bit digest, which the receiver can use to verify data integrity.
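MD5 and SHA-1 are available as drop-in replacements in most standard libraries; a small sketch confirming the digest sizes stated above (the input message is an invented example):

```python
import hashlib

msg = b"a=1&b=2"

md5_digest = hashlib.md5(msg).digest()        # 128-bit digest -> 16 bytes
sha1_digest = hashlib.sha1(msg).digest()      # 160-bit digest -> 20 bytes
sha256_digest = hashlib.sha256(msg).digest()  # 256-bit digest -> 32 bytes
print(len(md5_digest), len(sha1_digest), len(sha256_digest))  # 16 20 32
```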
The digest hash module can instead use the Java string hashCode or the CRC32 algorithm.
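The secondary hash can likewise be swapped; a sketch of CRC32 (via zlib) and a reimplementation of Java's `String.hashCode`, both producing a 32-bit integer. Note that Python's built-in `hash()` is randomized per process and would not be a stable substitute here.

```python
import zlib

def java_string_hash(s: str) -> int:
    """Java's String.hashCode: h = 31*h + ch, with signed 32-bit overflow."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 2**32 if h >= 2**31 else h  # reproduce Java's signed wrap-around

h1 = "abc123"
print(zlib.crc32(h1.encode("utf-8")))  # unsigned 32-bit CRC of the digest string
print(java_string_hash("abc"))         # 96354, same as Java's "abc".hashCode()
```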
The database receiving module, the data conversion module, the data hash module, the digest conversion module, the digest hash module, the deduplication module and the post-processing module of the database system are connected in sequence, the output of each module feeding the input of the next.
When the deduplication module judges whether identical data exists in the database, the field values H1 and H2 must both be used as query conditions, which improves the efficiency and accuracy of the query.
The post-processing module of the database system consists of an overwriting module and a deduplication module: the overwriting module overwrites the matching existing records in the database one by one to ensure the uniqueness of the business data, and the deduplication module screens and removes duplicate records to ensure the consistency of the business data.
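The overwriting branch can be sketched with SQLite's upsert syntax. The table name and schema are assumptions; a unique index on (h2, h1) lets the conflict clause replace the matching record in place instead of creating a duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE biz (h1 TEXT, h2 INTEGER, payload TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_h ON biz (h2, h1)")

def overwrite(h1: str, h2: int, payload: str) -> None:
    """Insert the record, or overwrite the existing duplicate in place."""
    conn.execute(
        "INSERT INTO biz (h1, h2, payload) VALUES (?, ?, ?) "
        "ON CONFLICT (h2, h1) DO UPDATE SET payload = excluded.payload",
        (h1, h2, payload),
    )

overwrite("abc123", 42, "old")
overwrite("abc123", 42, "new")  # overwrites the duplicate instead of adding a row
rows = conn.execute("SELECT payload FROM biz").fetchall()
print(rows)  # [('new',)]
```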
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above description covers only preferred embodiments of the invention, but the scope of the invention is not limited thereto; any equivalent replacement or modification of the technical solutions and the inventive concept, made by a person skilled in the art within the technical scope disclosed by the invention, shall fall within the scope of the invention.