CN115713450A - Table data watermarking method for resisting column deletion attack

Info

Publication number: CN115713450A
Application number: CN202211331263.1A
Authority: CN
Inventors: 罗森林; 杨宗源; 潘丽敏; 魏继勋
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2022-10-28
Filing date: 2022-10-28
Publication date: 2023-02-24

Abstract

The invention relates to a table data watermarking method for resisting column deletion attacks, and belongs to the technical field of computer and information science. First, watermark column identifiers are determined by combining attribute importance and data distortion tolerance. Next, a feature repair classification model is constructed from clustering labels and damaged row data; the model classifies the original data, and watermark row identifiers are determined from the class probabilities. The watermark embedding positions are then determined from the row and column identifiers, and the watermark information is embedded. Finally, in the watermark detection stage, the feature repair classification model determines the watermark row identifiers, and the watermark information is extracted in combination with the watermark column identifiers. To address the insufficient resistance of existing table data watermarking methods to column deletion attacks, the method constructs a feature repair classification model that accurately recovers the row identifiers of attacked data, which effectively improves watermark detection accuracy.

Description

Table data watermarking method for resisting column deletion attacks

Technical field

The invention relates to a table data watermarking method for resisting column deletion attacks, and belongs to the technical field of computer and information science.

Background

Tabular data is an important data resource in fields such as medical diagnosis, financial decision-making, and industrial intelligence. Once stolen and misused, it seriously infringes the rights and interests of its owners. Table data watermarking is an effective method for copyright protection and provenance tracing of tabular data, so research on it is of great significance for protecting digital assets.

Current table data watermarking methods fall into three main categories:

1. Unique primary key method

The unique primary key method underlies mainstream watermarking methods. It computes a hash of the secret key and the primary key to determine the watermark position. Different secret keys yield different watermark positions, which ensures that unauthorized users cannot obtain the watermark information. However, the method presupposes that the table data has a unique primary key: if the table has no primary key, or the primary key is deleted or modified, the watermark cannot be detected.
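The keyed-hash position selection described for the unique primary key method can be sketched as follows. This is a minimal illustration, not the scheme of any particular prior method: the function name `is_watermark_row`, the SHA-256 hash, the `|` separator, and the selection parameter `gamma` (roughly one row in `gamma` is chosen) are all assumptions.

```python
import hashlib


def is_watermark_row(secret_key: str, primary_key: str, gamma: int) -> bool:
    """Decide from a keyed hash of the primary key whether a row carries a mark.

    Assumption: a row is selected when the hash value is divisible by gamma.
    """
    digest = hashlib.sha256(f"{secret_key}|{primary_key}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % gamma == 0


# Hypothetical primary keys; the selected subset depends on the secret key.
primary_keys = ["1001", "1002", "1003", "1004"]
marked = [pk for pk in primary_keys if is_watermark_row("owner-secret", pk, gamma=2)]
```

Because the decision depends only on the secret key and the primary key, deleting or altering the primary key column makes the positions unrecoverable, which is the weakness noted above.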

2. Virtual primary key method

The virtual primary key method converts continuous attribute values to binary and splits them into high-order and low-order bits. The high-order bits are hashed to generate a virtual primary key, and the watermark is embedded in the low-order bits, which avoids the shortcoming of the unique primary key method. However, the method places high demands on the selected continuous attribute values: if the data is tampered with, the watermark becomes invalid. In addition, the method cannot generate virtual primary keys from discrete attribute values, so it cannot make full use of the data.

3. Clustering grouping method

The clustering grouping method no longer computes hash values. Instead, it groups rows directly by clustering on a distance measure, and it can use continuous and discrete attribute values at the same time, which gives it stronger algorithmic security than the virtual primary key method. However, the clustering grouping method also depends on the integrity of the attribute values used for clustering. If a clustering attribute is deleted, the one-way mapping between identifiers and attribute values is broken, the identifiers are computed incorrectly during watermark detection, and the watermark cannot be correctly recognized.

In summary, existing table data watermarking methods rely too heavily on the primary key or on the selected attribute values, and their resistance to column deletion attacks is insufficient. The present invention therefore proposes a table data watermarking method that resists column deletion attacks.

Summary of the invention

The object of the present invention is to provide a table data watermarking method that resists column deletion attacks, addressing the insufficient resistance of existing table data watermarking methods to such attacks.

The design principle of the invention is as follows. First, important attribute columns are selected as the watermark column identifiers. Second, a clustering method is used to obtain cluster labels for the row data, damaged row data is constructed, and a feature repair classification model is built from the cluster labels and the damaged row data; the model classifies the original data, and the watermark row identifiers are determined from the class probabilities. Third, the watermark information is encoded with an error correction code, the embedding positions are determined from the row and column identifiers, and the watermark information is embedded redundantly to obtain the watermarked data. Finally, the feature repair classification model is used to locate the watermark, and the watermark information is extracted and decoded to recover the embedded watermark.

The technical solution of the invention is realized through the following steps:

Step 1: select important continuous-variable attribute columns by combining attribute importance and data distortion tolerance, and determine the watermark column identifiers.

Step 2: construct a feature repair classification network model to determine the watermark row identifiers.

Step 2.1: select clustering features using a filter feature selection method.

Step 2.2: perform unsupervised clustering on the selected features with the constrained FCM algorithm to obtain cluster labels for the row data.

Step 2.3: generate damaged row data with a mask vector, and train the feature repair classification network model using the cluster labels and the damaged row data.

Step 2.4: use the model to compute the class probabilities of each row, add a group identifier to each original row according to the class probabilities, and select rows as the watermark row identifiers.

Step 3: redundantly embed the watermark information into the original data.

Step 3.1: encode the watermark information in binary form and add an error correction code.

Step 3.2: determine the watermark embedding positions from the watermark row identifiers and watermark column identifiers, and redundantly embed the watermark code with the LSB algorithm.

Step 4: perform watermark detection on the watermarked data.

Step 4.1: obtain the watermark row identifiers with the feature repair classification network, and combine them with the watermark column identifiers to determine the watermark embedding positions.

Step 4.2: extract and decode the watermark code to recover the watermark information.

Beneficial effects

Compared with the unique primary key method, the invention can embed a watermark in data that has no primary key.

Compared with the virtual primary key method, the invention selects watermark row identifiers through unsupervised clustering, can use continuous and discrete attribute values at the same time, and makes full use of the data.

Compared with the clustering grouping method, the invention builds a feature repair classification model and uses feature repair encoding to classify damaged data correctly. It also selects the rows in which to embed redundant information according to the class probabilities output by the classification network, which reduces the distortion of the data's statistical characteristics.

Description of drawings

FIG. 1 is a schematic diagram of the table data watermarking method for resisting column deletion attacks according to the invention.

FIG. 2 is a structural diagram of the feature repair classification network.

Detailed description

To better illustrate the purpose and advantages of the invention, an embodiment of the method is described in further detail below with reference to an example.

The experimental data comes from Checkup, a real biological information dataset. The data used in the watermarking experiment is listed in Table 1.

Table 1. Data watermarking experiment dataset

The experiment uses the row identification accuracy Acc_loc as the evaluation metric, to assess how well the method recovers the row identifiers after column attributes involved in the identifier computation have been deleted. The row identification accuracy is computed as:

$$Acc_{loc} = \frac{1}{n}\sum_{j=1}^{n} \mathbb{1}\big[\hat{y}(r_j) = y(r_j)\big]$$

where r_j is the j-th row of the table data, y is the group category before the column deletion attack, ŷ is the group category after the column deletion attack, and n is the number of rows.
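As a small illustration of the metric, the sketch below assumes that Acc_loc is the fraction of rows whose group category is the same before and after the column deletion attack. The function name and the sample labels are hypothetical.

```python
def row_identification_accuracy(labels_before, labels_after):
    """Fraction of rows whose group label is unchanged by the attack."""
    if len(labels_before) != len(labels_after):
        raise ValueError("label sequences must have the same length")
    if not labels_before:
        raise ValueError("at least one row is required")
    matches = sum(1 for y, y_hat in zip(labels_before, labels_after) if y == y_hat)
    return matches / len(labels_before)


# Hypothetical group labels for five rows before and after deleting columns.
acc = row_identification_accuracy([0, 1, 2, 1, 0], [0, 1, 1, 1, 0])
```

In this made-up example four of the five rows keep their group, so the accuracy is 0.8.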

The experiment was run on one computer and one server. The computer has an Intel i9-9900 CPU and 32 GB of RAM and runs 64-bit Windows 11; the server has a GeForce GTX 1080Ti GPU and runs 64-bit Ubuntu 20.04 Linux.

The specific procedure of the experiment is as follows:

Step 1: sort the continuous attribute columns in descending order according to their variance σ and mean μ, using the ranking score T:

$$T = \ln\mu + \log_{10}\sigma$$

Using this ranking as a reference, and weighing the two subjective factors of attribute importance and data distortion tolerance, select the attribute columns that serve as the column identifiers for watermark embedding.
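The ranking score can be computed per column as in the following sketch. It takes σ to be the population variance, as the text calls it, and it only produces the reference ranking; the final choice of columns remains a subjective decision. The column names and values are invented for illustration.

```python
import math
import statistics


def column_score(values):
    """Ranking score T = ln(mean) + log10(sigma) for one continuous column.

    Assumption: sigma is the population variance of the column.
    """
    mu = statistics.fmean(values)
    sigma = statistics.pvariance(values)
    if mu <= 0 or sigma <= 0:
        raise ValueError("the score needs a positive mean and a positive variance")
    return math.log(mu) + math.log10(sigma)


def rank_columns(table):
    """Return column names sorted by descending score T."""
    return sorted(table, key=lambda name: column_score(table[name]), reverse=True)


# Hypothetical continuous columns of a small health-check table.
table = {
    "height_cm": [160.0, 172.0, 181.0, 168.0],
    "glucose": [5.1, 5.3, 4.9, 5.2],
    "platelets": [210.0, 340.0, 150.0, 280.0],
}
ranking = rank_columns(table)
```

Columns with a large mean and a large spread rank first, which matches the intuition that they tolerate small least-significant-bit changes better.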

Step 2: construct the feature repair classification network model and use it to determine the watermark row identifiers.

Step 2.1: use a filter feature selection method to compute the correlation coefficients and variances of the features. From the high-variance features, select those with high correlation coefficients as clustering features, which increases the redundancy of the clustering attributes. The number of selected features is max{0.8k, ca}, where k is the number of clusters and ca is the number of continuous attribute columns.
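One possible reading of this filter step is sketched below: keep the highest-variance columns, then prefer those that correlate most strongly with the other candidates. The scoring rule (mean absolute Pearson correlation), the parameters `n_high_variance` and `n_select`, and the sample data are assumptions; the text does not specify them beyond the feature count.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def select_cluster_features(table, n_select, n_high_variance):
    """Pick n_select mutually correlated columns among the highest-variance ones."""
    candidates = sorted(
        table, key=lambda c: statistics.pvariance(table[c]), reverse=True
    )[:n_high_variance]

    def mean_abs_corr(column):
        others = [o for o in candidates if o != column]
        if not others:
            return 0.0
        return sum(abs(pearson(table[column], table[o])) for o in others) / len(others)

    return sorted(candidates, key=mean_abs_corr, reverse=True)[:n_select]


# Hypothetical columns: a and b are strongly correlated, d has almost no variance.
table = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 13.0],
    "c": [5.0, 1.0, 6.0, 2.0, 5.0, 1.0],
    "d": [1.0, 1.1, 1.0, 1.1, 1.0, 1.1],
}
selected = select_cluster_features(table, n_select=2, n_high_variance=3)
```

Choosing correlated features means that when one clustering column is deleted, the remaining ones still carry much of its information.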

Step 2.2: perform unsupervised clustering on the clustering features with the constrained FCM algorithm. The objective function for training the constrained FCM model is defined over the cluster centers c_i, the data rows r_j, and the membership degree of the j-th row in the i-th class, subject to the constraints that all clusters have the same size and that the membership degrees of each row sum to 1. The cluster label of each row is obtained from the clustering result.
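For orientation, the sketch below implements plain fuzzy c-means (FCM), not the constrained variant: it assumes the standard FCM objective (memberships raised to a fuzzifier power, multiplied by squared Euclidean distances to the centers), keeps the constraint that each row's memberships sum to 1, and does not enforce the equal-cluster-size constraint. The fuzzifier value 2, the deterministic initialization, and the sample rows are assumptions.

```python
def membership_row(row, centers, fuzzifier):
    """Membership degrees of one row in every cluster; they sum to 1."""
    sq_dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centers]
    if any(d == 0.0 for d in sq_dists):
        # The row coincides with one or more centers: share membership among them.
        hits = [1.0 if d == 0.0 else 0.0 for d in sq_dists]
        return [h / sum(hits) for h in hits]
    power = 1.0 / (fuzzifier - 1.0)
    inverse = [(1.0 / d) ** power for d in sq_dists]
    total = sum(inverse)
    return [v / total for v in inverse]


def fcm(rows, k, fuzzifier=2.0, iterations=50):
    """Plain FCM by alternating updates; returns (centers, memberships)."""
    dim = len(rows[0])
    step = max(1, len(rows) // k)
    centers = [list(rows[i * step]) for i in range(k)]  # deterministic start
    for _ in range(iterations):
        memberships = [membership_row(r, centers, fuzzifier) for r in rows]
        for i in range(k):
            weights = [u[i] ** fuzzifier for u in memberships]
            total = sum(weights)
            centers[i] = [
                sum(w * r[d] for w, r in zip(weights, rows)) / total
                for d in range(dim)
            ]
    memberships = [membership_row(r, centers, fuzzifier) for r in rows]
    return centers, memberships


# Hypothetical two-feature rows forming two well-separated groups.
rows = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
centers, memberships = fcm(rows, k=2)
labels = [u.index(max(u)) for u in memberships]
```

The hard cluster label of a row is taken as the cluster with the largest membership degree. A constrained version would additionally balance the cluster sizes, which matters here because each group later carries one watermark bit.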

Step 2.3: use a mask vector m to generate damaged row data from the original row data r. The mask vector is m = [m₀, m₁, …, m_{β−1}]ᵀ, where each m_i is sampled from a Bernoulli distribution.

The feature repair classification model is then trained as follows. The damaged data is fed into an autoencoder network for encoding. The feature repair network restores the damaged-data encoding z to repaired data, and the mean squared error (MSE) between the repaired data and the original row data r is used as the loss to train the feature repair network. At the same time, the classification network classifies the encoding z, and the cross-entropy (CE) between the predicted class and the clustering label y is used as the loss to train the classification network. The two losses are combined to train the autoencoder network, so that the encoding carries information about both the original data and its cluster category. The trained model provides both feature repair encoding and data classification, and its final output is the data classification result.
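The data side of this training step can be sketched without a neural network library. The example assumes that a damaged row is the element-wise product of the row and the mask (masked features become 0) and that the two losses are combined as a weighted sum; `keep_prob`, `weight`, and the sample values are hypothetical, and the network outputs are stand-ins rather than results of a trained model.

```python
import math
import random


def damage_row(row, keep_prob, rng):
    """Zero out each feature whose Bernoulli(keep_prob) mask entry is 0."""
    mask = [1 if rng.random() < keep_prob else 0 for _ in row]
    return [x * m for x, m in zip(row, mask)], mask


def mse(repaired, original):
    """Mean squared error between repaired and original row data."""
    return sum((a - b) ** 2 for a, b in zip(repaired, original)) / len(original)


def cross_entropy(class_probs, label):
    """Cross-entropy of predicted class probabilities against a cluster label."""
    return -math.log(class_probs[label])


def joint_loss(repaired, original, class_probs, label, weight=1.0):
    """Assumed combination: repair loss plus weighted classification loss."""
    return mse(repaired, original) + weight * cross_entropy(class_probs, label)


rng = random.Random(0)
row = [0.5, 1.5, 2.5, 3.5]
damaged, mask = damage_row(row, keep_prob=0.5, rng=rng)

# Stand-in outputs of the repair and classification heads for one row.
loss = joint_loss(
    repaired=[0.5, 1.0, 2.5, 3.0],
    original=row,
    class_probs=[0.1, 0.7, 0.2],
    label=1,
)
```

Training on randomly masked rows is what lets the model classify rows whose clustering columns were deleted by an attacker, since a deleted column looks like a masked feature.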

Step 2.4: input the original data into the feature repair classification model and apply Softmax to the model's classification output to obtain the probability that each row belongs to each category. The category with the highest probability is taken as the group identifier of the row. Then compute the difference between the maximum and minimum class probabilities of each row, and select the rows whose probability difference exceeds a preset threshold to determine the watermark row identifiers.
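The selection rule of this step can be sketched as follows, with the classifier output replaced by hypothetical logits. The function names and the threshold value 0.6 are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of classifier outputs."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def assign_groups(logit_rows, threshold):
    """Return (group id per row, indices of rows chosen as watermark rows)."""
    group_ids, selected = [], []
    for index, logits in enumerate(logit_rows):
        probs = softmax(logits)
        group_ids.append(probs.index(max(probs)))
        # Only confidently classified rows become watermark rows.
        if max(probs) - min(probs) > threshold:
            selected.append(index)
    return group_ids, selected


# Hypothetical classifier outputs for three rows and three clusters.
logit_rows = [[4.0, 0.5, 0.1], [1.0, 1.1, 0.9], [0.2, 0.3, 3.5]]
groups, watermark_rows = assign_groups(logit_rows, threshold=0.6)
```

Every row receives a group identifier, but the second row is classified with low confidence and is therefore not used as a watermark row.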

Step 3.1: convert the watermark information to binary form using ASCII encoding, and add a Reed-Solomon (RS) error correction code to the converted bits to obtain the watermark code. The watermark code length l must satisfy:

$$k \times (\alpha - 1) < l < k \times \alpha$$

where k is the number of cluster categories and α is the number of column identifiers.
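The binary conversion and the length condition can be sketched as below. The RS error correction step is deliberately left out because it needs an external library; the example also assumes 8 bits per ASCII character, most significant bit first. The message "BIT" and the values of k and α are made up.

```python
def text_to_bits(message):
    """ASCII-encode a message as a list of bits, 8 per character, MSB first."""
    data = message.encode("ascii")
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def length_fits(code_length, k, alpha):
    """Check the condition k * (alpha - 1) < l < k * alpha."""
    return k * (alpha - 1) < code_length < k * alpha


bits = text_to_bits("BIT")            # RS parity bits would be appended here
fits = length_fits(len(bits), k=10, alpha=3)
```

With 24 code bits, 10 clusters and 3 watermark columns satisfy the condition (20 < 24 < 30), while 8 clusters and 3 columns do not, because the upper bound is strict.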

Step 3.2: divide the watermark code into α substrings of length k, denoted {W₀, W₁, …, W_{α−1}}. Substring W_i is embedded with the LSB algorithm into the groups of the i-th column, as follows:

$$y_j.A_i = \big[\mathrm{LSB}([y_j.A_i]_2,\ \chi,\ j.W_i)\big]_{10}$$

where j.W_i denotes the j-th bit of the i-th substring; y_j.A_i denotes the data whose classification category is y_j and whose column attribute is A_i, that is, the watermark embedding position determined by the row and column identifiers; LSB denotes least-significant-bit embedding; and χ is the number of bits of y_j.A_i used to embed j.W_i. At the same time, the code opposite to j.W_i is embedded in the same way in the rows of the same group that were not selected as watermark embedding positions, which reduces the distortion of the data's statistical characteristics.
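The embedding rule can be illustrated with integer-valued cells. This sketch assumes that one bit is written per cell at a fixed bit position (standing in for χ), that the code is zero-padded to k × α bits before splitting, and that continuous attributes have been converted to non-negative integers beforehand; none of these details is stated in the text, and all names and values are hypothetical.

```python
def split_code(bits, k, alpha):
    """Split the watermark code into alpha substrings of length k (zero-padded)."""
    if len(bits) > k * alpha:
        raise ValueError("watermark code is longer than k * alpha")
    padded = bits + [0] * (k * alpha - len(bits))
    return [padded[i * k:(i + 1) * k] for i in range(alpha)]


def lsb_embed(value, bit, position=0):
    """Overwrite one low-order bit of a non-negative integer."""
    return (value & ~(1 << position)) | (bit << position)


def embed(rows, groups, selected, columns, substrings, position=0):
    """Embed bit j of substring i into column i of every row of group j.

    Rows outside `selected` receive the opposite bit, as the step describes.
    """
    marked = [dict(row) for row in rows]
    for i, column in enumerate(columns):
        for index, row in enumerate(marked):
            bit = substrings[i][groups[index]]
            if index not in selected:
                bit ^= 1
            row[column] = lsb_embed(row[column], bit, position)
    return marked


# Hypothetical table: two watermark columns, two groups, rows 0 and 1 selected.
rows = [{"a": 10, "b": 7}, {"a": 11, "b": 4}, {"a": 8, "b": 9}]
substrings = split_code([1, 0, 0, 1], k=2, alpha=2)
marked = embed(rows, groups=[0, 1, 0], selected={0, 1},
               columns=["a", "b"], substrings=substrings)
```

Every selected row of a group carries the same bit, so the watermark is stored redundantly, and the opposite bit in the unselected rows keeps the overall count of 0 and 1 low-order bits closer to balanced.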

Step 4.1: input the watermarked data into the feature repair classification model and process the model output to obtain the watermark row identifiers. The processing is the same as in step 2.4, except that, to allow for distortion introduced during data transmission, the probability difference threshold used for watermark detection is smaller than the one used for watermark embedding. The watermark embedding positions are then obtained from the watermark column identifiers retained by the watermark owner.

Step 4.2: extract the watermark code by majority voting:

$$j.WB_i = \mathrm{Vote}\big(\mathrm{LSB}([y_j.A_i]_2,\ \chi)\big)$$

where j.WB_i denotes the j-th bit of the i-th substring of the extracted watermark code. The same processing is applied to the rows of the same group that were not selected as watermark embedding positions, and the watermark codes extracted from the watermark rows and from the non-watermark rows are compared. If the two are identical, the probability difference threshold of step 4.1 must be reselected until they differ. Finally, the binary watermark code WB is decoded to recover the original watermark information: message = ASCII⁻¹(WB).
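Majority-vote extraction and ASCII decoding can be sketched as follows. The example assumes one embedded bit per integer cell at a fixed bit position, 8 bits per character, and a default bit of 0 for a group with no selected rows; RS decoding and the threshold reselection loop are omitted, and all names and values are hypothetical.

```python
from collections import Counter


def lsb_extract(value, position=0):
    """Read one low-order bit of a non-negative integer."""
    return (value >> position) & 1


def extract_substring(rows, groups, selected, column, k, position=0):
    """Recover one bit per group by majority vote over the selected rows."""
    bits = []
    for group in range(k):
        votes = Counter(
            lsb_extract(row[column], position)
            for index, row in enumerate(rows)
            if groups[index] == group and index in selected
        )
        bits.append(votes.most_common(1)[0][0] if votes else 0)
    return bits


def bits_to_text(bits):
    """Decode complete 8-bit groups back to ASCII characters."""
    chars = []
    for start in range(0, len(bits) - len(bits) % 8, 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        chars.append(chr(byte))
    return "".join(chars)


# Hypothetical watermarked column with two groups; one row of group 0 was altered.
rows = [{"a": 11}, {"a": 10}, {"a": 9}, {"a": 12}, {"a": 6}]
groups = [0, 1, 0, 1, 0]
recovered = extract_substring(rows, groups, selected={0, 1, 2, 3, 4}, column="a", k=2)
text = bits_to_text([0, 1, 0, 0, 0, 0, 1, 0])
```

The vote tolerates the single altered row in group 0 because the two intact rows outnumber it, which is the purpose of the redundant embedding.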

Test results: using the table data watermarking method for resisting column deletion attacks, the experiment performed watermark embedding, column deletion attacks, and watermark detection on the Checkup dataset. With 50% of the clustering feature attributes deleted, the invention achieves a row identification accuracy of 0.492, showing good resistance to column deletion attacks and effectively enhancing the security of table data watermarks.

The above description explains the purpose, technical solution, and beneficial effects of the invention in further detail. It should be understood that the above is only a specific embodiment of the invention and does not limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A table data watermarking method for resisting column deletion attacks, characterized in that the method comprises the following steps:

Step 1. Select important continuous variable attribute columns in combination with attribute importance and data distortion tolerance, and determine the watermark column identification;

Step 2: Construct a feature repair classification network model to determine the watermark row identifiers. First, select clustering features using a filter feature selection method. Second, perform unsupervised clustering on the selected features with the constrained FCM algorithm to obtain cluster labels for the row data. Then, generate damaged row data with a mask vector and use the damaged row data to train the feature repair classification network model. Finally, use the model to compute the class probabilities of each row, add a group identifier to each original row according to the class probabilities, and select rows as watermark row identifiers;

Step 3: Redundantly embed the watermark information into the original data. First, encode the watermark information in binary form and add an error correction code. Then, determine the watermark embedding positions from the watermark row identifiers and watermark column identifiers, and redundantly embed the watermark code with the LSB algorithm;

Step 4: Perform watermark detection on the watermarked data. First, use the feature restoration classification network to obtain the watermark row identifier, combine the watermark column identifier to determine the watermark embedding position, and finally extract the watermark code and decode it to restore the watermark information.

2. The table data watermarking method for resisting column deletion attacks according to claim 1, characterized in that: in step 2, the feature repair classification model is trained as follows: the damaged data is fed into an autoencoder network for encoding, and the feature repair network restores the damaged-data encoding z to repaired data; the mean squared error (MSE) between the repaired data and the original row data r is used as the loss to train the feature repair network; at the same time, the classification network classifies the encoding z, and the cross-entropy (CE) between the predicted class and the clustering label y is used as the loss to train the classification network; the two losses are combined to train the autoencoder network, so that the encoding carries information about both the original data and its cluster category. The trained model provides both feature repair encoding and data classification, and its final output is the data classification result.

3. The table data watermarking method for resisting column deletion attacks according to claim 1, characterized in that: in step 2, the original data is input into the feature repair classification model, and Softmax is applied to the model's classification output to obtain the probability that each row belongs to each category; the category with the highest probability is taken as the group identifier of the row; the difference between the maximum and minimum class probabilities is computed, and the rows whose probability difference exceeds a preset threshold are selected to determine the watermark row identifiers.

4. The table data watermarking method for resisting column deletion attacks according to claim 1, characterized in that: in step 3, the watermark code is divided into α substrings of length k, denoted {W₀, W₁, …, W_{α−1}}, and the LSB algorithm is used to embed substring W_i into the groups of the i-th column, as follows:

$$y_j.A_i = \big[\mathrm{LSB}([y_j.A_i]_2,\ \chi,\ j.W_i)\big]_{10}$$

where j.W_i denotes the j-th bit of the i-th watermark code substring; y_j.A_i denotes the data whose classification category is y_j and whose column attribute is A_i, so that the same watermark information is embedded in all row data of the same classification category; LSB denotes least-significant-bit embedding; and χ is the number of bits of y_j.A_i used to embed j.W_i. At the same time, the code opposite to j.W_i is embedded in the same way in the rows of the same group that were not selected as watermark embedding positions.