WO2025098109A1

WO2025098109A1 - Ownership verification method and processing method for structured data set, device, and medium

Info

Publication number: WO2025098109A1
Application number: PCT/CN2024/125335
Authority: WO
Inventors: 朱文涛; 周文红; 张超; 刘洋; 杨立宝; 王铎
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2023-11-06
Filing date: 2024-10-16
Publication date: 2025-05-15
Anticipated expiration: 2026-05-06
Also published as: CN117521038A; CN117521038B

Abstract

The present application relates to an ownership verification method and processing method for a structured data set, a device, and a medium. The ownership verification method comprises: acquiring a structured data set; from a target object to be verified, acquiring secret information corresponding to the structured data set, a proportional label of watermark data, and specific mathematical properties corresponding to the watermark data, wherein by means of a preset rule, the probability that a verification value calculated from any data satisfying a predetermined data format and the secret information meets a preset mathematical feature is smaller than a first threshold, and the proportional label is greater than the first threshold; on the basis of the secret information and the specific mathematical properties, identifying the watermark data from the structured data set; and collecting statistics about a proportional result of the watermark data in the structured data set, and determining the ownership relationship between the target object and the structured data set on the basis of the proportional result and the proportional label.

Description

Ownership verification method, processing method, equipment and medium for structured data sets

Related Applications

This application claims priority to Chinese patent application No. 2023114671462, filed on November 6, 2023 and entitled “Ownership verification method, processing method, device and medium for structured data sets”, the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of information security, and in particular to an ownership verification method, a processing method, a device and a medium for a structured data set.

Background Art

Watermarking is a familiar concept in multimedia copyright technology. For example, additional data used for copyright identification, such as the creator's identity information, is embedded in multimedia content files such as images, audio and video in a form visible or invisible to the human eye, so as to determine the copyright ownership of the content and safeguard the legitimate rights and interests of the creator. Embedded watermarking is widely used to confirm the ownership of unstructured data.

Structured data, such as records composed of fields like telephone numbers and ID numbers, does not support embedded watermarks, so other means of watermarking are needed. In the related art, structured data is generally marked with column watermarks or row watermarks. A column watermark is an additional data field that has no (or little) practical meaning, or simply a decorative mark added to the format of existing data. A row watermark refers to synthesizing multiple groups of forged data on the basis of the original structured data and mixing them into the data set, so that the forged data serves as the watermark of the structured data. The defect of the column watermark is that the added field with no (or little) practical meaning is very easy to distinguish. After other persons or organizations (hereinafter, a data set thief is taken as an example) obtain the structured data, they can easily identify and strip the column watermark by machine, after which the ownership of the structured data is difficult to identify. The defect of the row watermark is that the forged structured data usually differs significantly from the business data in format or content and is difficult to blend in completely, so removing the row watermark is likewise relatively easy for the data set thief. Once the watermark has been removed, the data set thief can maliciously exploit the structured data, and it is then difficult to protect the legitimate rights and interests of the data set owner.

Summary of the invention

In view of this, embodiments of the present application provide an ownership verification method, a processing method, a device and a medium for a structured data set.

One aspect of the present application provides an ownership verification method for a structured data set, comprising the following steps:

acquiring a structured data set, wherein the structured data set includes a plurality of pieces of structured data, each piece of structured data is business data or watermark data, and the business data and the watermark data satisfy the same predetermined data format;

acquiring, from a target object to be verified, secret information corresponding to the structured data set, a ratio label of the watermark data and a specific mathematical property corresponding to the watermark data, wherein the specific mathematical property constrains a check value, calculated from the secret information and the watermark data by a preset rule, to conform to a preset mathematical feature; under the preset rule, the probability that a check value calculated from the secret information and any data satisfying the predetermined data format conforms to the preset mathematical feature is less than a first threshold; and the ratio label is greater than the first threshold;

identifying the watermark data from the structured data set according to the secret information and the specific mathematical property; and

calculating a ratio result of the watermark data in the structured data set, and determining the ownership relationship between the target object and the structured data set according to the ratio result and the ratio label.

In some embodiments, identifying the watermark data from the structured data set according to the secret information and the specific mathematical property comprises:

calculating a first check value from the secret information and the structured data by the preset rule;

determining, according to the specific mathematical property, whether the first check value conforms to the preset mathematical feature; and

if the first check value conforms to the preset mathematical feature, determining the structured data to be watermark data.

In some embodiments, the preset rule includes a message authentication code algorithm or a deterministic digital signature algorithm.

In some embodiments, determining the ownership relationship between the target object and the structured data set according to the ratio result and the ratio label comprises:

calculating a difference value between the ratio result and the ratio label; and

if the difference value is less than a second threshold, determining that the target object is the owner of the structured data set.

In some embodiments, calculating the difference value between the ratio result and the ratio label comprises:

calculating the difference between the ratio result and the ratio label, and taking the absolute value of the difference as the difference value; or

calculating the difference between the ratio result and the ratio label, and taking the ratio of the absolute value of the difference to the ratio label as the difference value.
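The two difference measures and the comparison with the second threshold can be sketched in Python as follows. This is a minimal illustration only: the function names, the `relative` switch and the example numbers are assumptions made here, not part of the application.

```python
def absolute_difference(ratio_result: float, ratio_label: float) -> float:
    """Difference value as the absolute value of (ratio result - ratio label)."""
    return abs(ratio_result - ratio_label)


def relative_difference(ratio_result: float, ratio_label: float) -> float:
    """Difference value as the absolute difference divided by the ratio label."""
    if ratio_label <= 0:
        raise ValueError("ratio_label must be positive")
    return abs(ratio_result - ratio_label) / ratio_label


def is_owner(ratio_result: float, ratio_label: float,
             second_threshold: float, relative: bool = False) -> bool:
    """The target object is treated as the owner when the chosen
    difference value is less than the second threshold."""
    measure = relative_difference if relative else absolute_difference
    return measure(ratio_result, ratio_label) < second_threshold


# Hypothetical numbers: 4.8% of records were identified as watermark data,
# while the claimed ratio label is 5%.
print(is_owner(0.048, 0.05, second_threshold=0.01))
print(is_owner(0.048, 0.05, second_threshold=0.10, relative=True))
```

The relative measure scales the tolerance with the ratio label, which may be more convenient when the ratio label itself is small.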

Another aspect of the present application discloses a processing method for a structured data set, comprising the following steps:

acquiring an original data set and ownership mark information, wherein the original data set is used to store structured data, and the structured data satisfies a predetermined data format; the ownership mark information includes secret information, a ratio label and a specific mathematical property; the specific mathematical property constrains a check value, calculated from the secret information and watermark data by a preset rule, to conform to a preset mathematical feature; under the preset rule, the probability that a check value calculated from the secret information and any data satisfying the predetermined data format conforms to the preset mathematical feature is less than a first threshold; and the ratio label is greater than the first threshold;

determining watermark data from data satisfying the predetermined data format according to the secret information and the specific mathematical property;

determining, according to the amount of business data contained in the original data set and the ratio label, a target quantity of watermark data to be added to the original data set; and

adding the target quantity of watermark data to the original data set to obtain a target data set.
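The processing steps above can be sketched roughly as follows. The sketch rests on several assumptions that the application does not prescribe: HMAC-SHA256 stands in for the preset rule, an 11-digit phone number stands in for the predetermined data format, and "the first digest byte equals 0xff" stands in for the preset mathematical feature (probability about 2^-8 if the digest is modeled as uniformly random). It also assumes that the ratio label is the share of watermark data in the final data set, so that for n business records the target quantity w satisfies w / (n + w) ≈ ratio label. All function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def check_value(secret: bytes, record: str) -> bytes:
    """Assumed preset rule: HMAC-SHA256 keyed with the secret information."""
    return hmac.new(secret, record.encode("utf-8"), hashlib.sha256).digest()


def has_feature(digest: bytes) -> bool:
    """Assumed preset mathematical feature: first byte is 0xff."""
    return digest[0] == 0xFF


def target_watermark_count(business_count: int, ratio_label: float) -> int:
    """Solve w / (business_count + w) = ratio_label for w and round."""
    if not 0 < ratio_label < 1:
        raise ValueError("ratio_label must lie strictly between 0 and 1")
    return round(business_count * ratio_label / (1 - ratio_label))


def random_phone_number() -> str:
    """Candidate record in the assumed format: 11 digits starting with 1."""
    return "1" + "".join(secrets.choice("0123456789") for _ in range(10))


def generate_watermarks(secret: bytes, count: int) -> list[str]:
    """Rejection sampling: keep only candidates whose check value
    conforms to the feature; duplicates are discarded."""
    found: set[str] = set()
    while len(found) < count:
        candidate = random_phone_number()
        if has_feature(check_value(secret, candidate)):
            found.add(candidate)
    return sorted(found)
```

With these assumptions, `target_watermark_count(950, 0.05)` yields 50, and each accepted watermark costs about 256 candidate trials on average. A rarer feature lowers the chance that genuine business data matches by accident, at the price of more trials per watermark.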

In some embodiments, acquiring the ownership mark information comprises:

acquiring association information corresponding to the original data set, the association information being used to characterize the ownership of the original data set; and

generating the secret information according to the association information.

In some embodiments, adding the target quantity of watermark data to the original data set to obtain the target data set comprises:

determining the target quantity of insertion positions in the original data set; and

adding each piece of watermark data at one of the insertion positions in the original data set to obtain the target data set.

In some embodiments, determining the target quantity of insertion positions in the original data set comprises:

processing the original data set with a random insertion algorithm, a group insertion algorithm, a time series mixing algorithm or a hybrid encryption algorithm to determine the target quantity of insertion positions.
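The text here names the insertion algorithms without defining them. One plausible reading of random insertion is sketched below: the final positions of the watermark records are drawn uniformly at random, and the business records keep their original relative order. The function name and the optional `rng` parameter are assumptions for illustration.

```python
import random
from typing import Optional, Sequence


def insert_at_random_positions(original: Sequence[str],
                               watermarks: Sequence[str],
                               rng: Optional[random.Random] = None) -> list[str]:
    """Return a target data set in which each watermark record occupies
    one randomly chosen position among the combined records."""
    if rng is None:
        rng = random.Random()
    total = len(original) + len(watermarks)
    # Choose which final positions will hold watermark records.
    slots = set(rng.sample(range(total), len(watermarks)))
    shuffled = list(watermarks)
    rng.shuffle(shuffled)
    wm_iter = iter(shuffled)
    orig_iter = iter(original)
    # Fill every position from the matching source, preserving business order.
    return [next(wm_iter) if i in slots else next(orig_iter)
            for i in range(total)]


target = insert_at_random_positions(["a", "b", "c", "d"], ["w1", "w2"])
print(target)
```

Where the unpredictability of positions matters, `random.SystemRandom()` can be passed as `rng` so that positions come from the operating system's entropy source.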

Another aspect of the present application discloses an ownership verification device for a structured data set, comprising:

a first acquisition unit, configured to acquire a structured data set, wherein the structured data set includes a plurality of pieces of structured data, each piece of structured data is business data or watermark data, and the business data and the watermark data satisfy the same predetermined data format;

a second acquisition unit, configured to acquire, from a target object to be verified, secret information corresponding to the structured data set, a ratio label of the watermark data and a specific mathematical property corresponding to the watermark data, wherein the specific mathematical property constrains a check value, calculated from the secret information and the watermark data by a preset rule, to conform to a preset mathematical feature; under the preset rule, the probability that a check value calculated from the secret information and any data satisfying the predetermined data format conforms to the preset mathematical feature is less than a first threshold; and the ratio label is greater than the first threshold;

a processing unit, configured to identify the watermark data from the structured data set according to the secret information and the specific mathematical property; and

a statistical unit, configured to calculate a ratio result of the watermark data in the structured data set, and determine the ownership relationship between the target object and the structured data set according to the ratio result and the ratio label.

Another aspect of the present application discloses a processing device for a structured data set, comprising:

an information acquisition unit, configured to acquire an original data set and ownership mark information, wherein the original data set is used to store structured data, and the structured data satisfies a predetermined data format; the ownership mark information includes secret information, a ratio label and a specific mathematical property; the specific mathematical property constrains a check value, calculated from the secret information and watermark data by a preset rule, to conform to a preset mathematical feature; under the preset rule, the probability that a check value calculated from the secret information and any data satisfying the predetermined data format conforms to the preset mathematical feature is less than a first threshold; and the ratio label is greater than the first threshold;

a first determining unit, configured to determine watermark data from data satisfying the predetermined data format according to the secret information and the specific mathematical property;

a second determining unit, configured to determine, according to the amount of business data contained in the original data set and the ratio label, a target quantity of watermark data to be added to the original data set; and

a data set acquisition unit, configured to add the target quantity of watermark data to the original data set to obtain a target data set.

Another aspect of the present application discloses an electronic device, including a processor and a memory;

the memory is used to store a program;

the processor executes the program to implement the ownership verification method for a structured data set or the processing method for a structured data set described above.

Another aspect of the present application discloses a computer-readable storage medium, wherein the storage medium stores a program, and the program is executed by a processor to implement the ownership verification method for a structured data set or the processing method for a structured data set described above.

The details of one or more embodiments of the present application are set forth in the following drawings and description. Other features, objects and advantages of the present application will become apparent from the description, the drawings and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a column watermark in conventional technology;

FIG. 2 is a schematic diagram of a row watermark in conventional technology;

FIG. 3 is a schematic flow chart of an ownership verification method for a structured data set provided in an embodiment of the present application;

FIG. 4 is a schematic flow chart of determining watermark data from structured data provided in an embodiment of the present application;

FIG. 5 is a schematic flow chart of a processing method for a structured data set provided in an embodiment of the present application;

FIG. 6 is a schematic diagram of a predetermined data format of business data provided in an embodiment of the present application;

FIG. 7 is a schematic flow chart of generating secret information provided in an embodiment of the present application;

FIG. 8 is a schematic diagram of inserting watermark data into an original data set provided in an embodiment of the present application;

FIG. 9 is a schematic structural diagram of an ownership verification device for a structured data set provided in an embodiment of the present application;

FIG. 10 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.

DETAILED DESCRIPTION

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present application without creative effort fall within the scope of protection of the present application.

In the related art, watermarking techniques for structured data sets fall mainly into two types: column watermarks and row watermarks.

Specifically, a column watermark is an additional data field that has no (or little) practical meaning, or simply a decorative mark added to the format of existing data. As shown in FIG. 1, the mobile phone number 12300762185 is decorated to become {#12300762185#} in order to signal "this is my data"; here {# and #} are the column watermark. The defect of the column watermark is that the added field with no (or little) practical meaning is very easy to distinguish. After other persons or organizations (hereinafter, a data set thief is taken as an example) obtain the structured data, they can easily identify and strip the column watermark by machine, after which the ownership of the structured data is difficult to identify.

In contrast, a row watermark does not alter data along the column dimension. Instead, multiple groups of forged data are generated on the basis of the original structured data and mixed into the data set, and these forged data serve as the watermark of the structured data, which is equivalent to inserting "entire fake data records". As shown in FIG. 2, a batch of fake users is inserted into a mobile phone number data set, and their phone numbers take the form 111xxxxxxxx, which does not exist in practice, in order to signal "this is my data". The defect of the row watermark is that the forged structured data usually differs significantly from the business data in format or content. For example, numbers such as 111xxxxxxxx are easily recognized as fake users by people in the industry, so the forged data is difficult to blend in completely, and removing the row watermark is likewise relatively easy for a data set thief.

The embodiments of the present application improve on the existing row watermark technology and propose an ownership verification method, a processing method, a device and a medium for a structured data set.

In the embodiments of the present application, the data set owner refers to an entity that lawfully enjoys the legal rights and interests related to the structured data set. Ideally, the structured data set of the data set owner is not maliciously exploited by a data set thief, and the data set owner can use the business data in the structured data set for commercial activities as normal while enjoying the related legal rights and interests. In the information age, however, illegal acquisition and malicious use of structured data sets are common, which seriously affects the information security of data set owners and prevents them from enjoying their legal rights and interests. The data watermarking technology of the embodiments of the present application is therefore needed to protect the information security of structured data sets.

In the embodiments of the present application, the data set verifier refers to an entity that wants to confirm the ownership of a given structured data set.

To solve the problem of confirming the ownership of a structured data set, as shown in FIG. 3, an embodiment of the present application proposes an ownership verification method for a structured data set, which can be applied at the data set verifier. Specifically, the method includes the following steps:

Step 310: acquire a structured data set, wherein the structured data set includes a plurality of pieces of structured data, each piece of structured data is business data or watermark data, and the business data and the watermark data satisfy the same predetermined data format;

Step 320: acquire, from a target object to be verified, secret information corresponding to the structured data set, a ratio label of the watermark data and a specific mathematical property corresponding to the watermark data, wherein the specific mathematical property constrains a check value, calculated from the secret information and the watermark data by a preset rule, to conform to a preset mathematical feature; under the preset rule, the probability that a check value calculated from the secret information and any data satisfying the predetermined data format conforms to the preset mathematical feature is less than a first threshold;

Step 330: identify the watermark data from the structured data set according to the secret information and the specific mathematical property;

Step 340: calculate a ratio result of the watermark data in the structured data set, and determine the ownership relationship between the target object and the structured data set according to the ratio result and the ratio label.
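Steps 310 to 340 can be sketched in Python as follows. The sketch is deliberately abstract: the test for watermark data is passed in as a callable, and the decision rule uses the absolute difference against a second threshold, as described in the summary. The function and parameter names are assumptions for illustration.

```python
from typing import Callable, Iterable


def verify_ownership(records: Iterable[str],
                     is_watermark: Callable[[str], bool],
                     ratio_label: float,
                     second_threshold: float) -> bool:
    """Return True when the observed share of watermark data is close
    enough to the ratio label claimed by the target object."""
    # Step 310: the structured data set under verification.
    data = list(records)
    if not data:
        return False
    # Step 330: identify watermark data with the supplied predicate, which
    # is expected to wrap the secret information and the specific
    # mathematical property obtained in step 320.
    watermark_count = sum(1 for record in data if is_watermark(record))
    # Step 340: ratio result and comparison with the ratio label.
    ratio_result = watermark_count / len(data)
    return abs(ratio_result - ratio_label) < second_threshold


# Toy data: records starting with "w" play the role of watermark data.
toy = ["w1", "w2", "w3", "w4", "w5"] + [f"b{i}" for i in range(95)]
print(verify_ownership(toy, lambda r: r.startswith("w"), 0.05, 0.01))
```

A party that does not hold the correct secret information would supply a predicate that matches only a share of records below the first threshold, so the ratio result would fall far from any ratio label greater than that threshold.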

An embodiment of the present application provides an ownership verification method for a structured data set, which can be applied at the data set verifier. Specifically, the structured data set whose ownership is to be verified is first acquired and the object to be verified is determined; in the embodiments of the present application this object is referred to as the target object. The target object may be an object that claims to hold the structured data set whose ownership is to be verified, or an object that provides the structured data set to the data set verifier; the embodiments of the present application place no limitation on this.

In the embodiments of the present application, the acquired structured data set includes a plurality of pieces of structured data that satisfy the same predetermined data format; for example, every piece of structured data has the same number of fields and each corresponding field has the same data format. Each piece of structured data in the structured data set is either business data or watermark data. Business data refers to normal, genuine data, which may include fields such as a mobile phone number and an ID card number; watermark data is constructed fake data. In the embodiments of the present application, the business data and the watermark data satisfy the same predetermined data format, and no difference between the two can be seen in format or content. In other words, the watermark data is not marked in any visible way (unlike the earlier example in which synthesized mobile phone numbers start with a prefix such as 111 that does not exist in practice), and the watermark data does not serve any warning function. Consequently, without the corresponding secret information, no one can determine whether a piece of structured data is business data or watermark data; that is, no one can distinguish business data from watermark data.

Next, the secret information corresponding to the structured data set, the ratio label of the watermark data and the specific mathematical property of the watermark data can be acquired from the target object. In the embodiments of the present application, the owner of a structured data set needs to select in advance, for each structured data set for which ownership may need to be asserted, the secret information, the ratio label of the watermark data and the specific mathematical property of the watermark data. The secret information must be kept confidential and is disclosed to a data set verifier (such as a regulatory department or a law enforcement agency) only when necessary. On the basis of the secret information, the data set owner can generate watermark data with the specific mathematical property. A piece of watermark data is indistinguishable in appearance from a piece of business data, but among all possible data satisfying the predetermined data format, only a very small proportion of synthesized data satisfies the specific mathematical property preset by the data set owner and thus qualifies as watermark data. Therefore, in the embodiments of the present application, the data set owner can control the proportion of watermark data inserted into the structured data set and record that proportion as the ratio label. After obtaining the secret information, the data set verifier can use it to identify each piece of watermark data, and then determine whether the data set belongs to the target object by checking the proportion of watermark data in the structured data set.

In the embodiments of the present application, the specific mathematical property of the watermark data means that the check value computed from the secret information and the watermark data under the preset rule conforms to a preset mathematical feature, that is, the check value exhibits a particular mathematical pattern. The check value can be computed in many ways, using suitable cryptographic primitives. For example, a message authentication code (MAC) can serve as the check value. MACs are a dedicated class of algorithms in cryptography. A MAC algorithm takes a key and data as input (the key is chosen secretly in advance; each input may be a readable string or an arbitrary bit string) and outputs a fixed-length (for example, 256-bit), random-looking data fingerprint (also called a data digest, or simply the message authentication code). Although the formula of a MAC is not complicated, its output cannot be predicted and is determined only once both the key and the data have been supplied. MACs also have the following mathematical properties: (a) the fingerprint is sensitive to both the key and the data, so a change in either input, even of a single bit, produces a completely different output; (b) computing the fingerprint from the key and the data is easy, but for a sufficiently strong key and data drawn from a sufficiently large sample space, recovering either the key or the data from the fingerprint is infeasible. When a MAC algorithm is used in the embodiments of the present application, the secret information serves as the key and each structured record serves as the data, and the algorithm yields the check value for that record.
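The behaviour described above can be observed with a few lines of Python. This is a minimal sketch, assuming HMAC-SHA-256 as the message authentication code and UTF-8 string encoding; the key `hypothetical-secret` and the helper names `check_value` and `differing_bits` are illustrative and do not come from the application.

```python
import hashlib
import hmac


def check_value(secret: str, record: str) -> bytes:
    """Return the HMAC-SHA-256 check value of one record under the secret."""
    return hmac.new(secret.encode("utf-8"), record.encode("utf-8"),
                    hashlib.sha256).digest()


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


secret = "hypothetical-secret"                  # placeholder key (assumption)
v1 = check_value(secret, "12300000000")
v2 = check_value(secret, "12300000001")         # record changed in one character
v3 = check_value(secret + "!", "12300000000")   # key changed

print(v1.hex())
print(differing_bits(v1, v2), "of 256 bits differ when the record changes")
print(differing_bits(v1, v3), "of 256 bits differ when the key changes")
```

For a well-behaved MAC, roughly half of the 256 output bits are expected to flip in each comparison, which is what property (a) describes.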

In the embodiments of the present application, the check value computed from the secret information and the watermark data conforms to a preset mathematical feature; the specific form of that feature is not limited. For example, with a message authentication code as the check value, the preset mathematical feature may be as follows: in some embodiments, the check value of a watermark record begins with 16 consecutive 1 bits, that is, with the two bytes ffff in hexadecimal; in some embodiments, the check value ends with 20 consecutive 0 bits; in some embodiments, the second byte and the second-to-last byte of the check value are both hexadecimal aa. Because a message authentication code is, in form, a fixed-length random number, each of these examples is a low-probability event (2^-16, 2^-20, and 2^-16 respectively). The specific mathematical property therefore ensures that only a small fraction of synthetic data can become watermark data. In other words, under the preset rule, the probability that the check value computed from the secret information and any data satisfying the predetermined data format conforms to the preset mathematical feature is less than a first threshold. The first threshold can be determined from the total amount of data in the predetermined data format and from actual requirements. In general, it can be made as small as possible to reduce interference from ordinary business data that happen to satisfy the feature. For example, the first threshold can be set to 2^-10. The embodiments of the present application do not limit the size of the first threshold.
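The three example features can be written as simple predicates over the 32-byte check value. The following is a sketch only; the function names and the `FEATURE_PROBABILITY` table are illustrative, and the probabilities assume that the check value behaves like a uniformly random bit string.

```python
FIRST_THRESHOLD = 2 ** -10   # example value of the first threshold from the text


def starts_with_16_one_bits(mac: bytes) -> bool:
    """True if the check value begins with the two bytes ff ff."""
    return mac[:2] == b"\xff\xff"


def ends_with_20_zero_bits(mac: bytes) -> bool:
    """True if the lowest 20 bits of the check value are all zero."""
    return int.from_bytes(mac, "big") & ((1 << 20) - 1) == 0


def second_and_penultimate_byte_are_aa(mac: bytes) -> bool:
    """True if the second byte and the second-to-last byte are both 0xaa."""
    return mac[1] == 0xAA and mac[-2] == 0xAA


# Chance that a uniformly random check value satisfies each feature.
FEATURE_PROBABILITY = {
    "starts_with_16_one_bits": 2 ** -16,
    "ends_with_20_zero_bits": 2 ** -20,
    "second_and_penultimate_byte_are_aa": 2 ** -16,
}

for name, p in FEATURE_PROBABILITY.items():
    print(f"{name}: {p:.3e} (below first threshold: {p < FIRST_THRESHOLD})")
```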

It can be understood that, in the embodiments of the present application, once the data set verifier has obtained the secret information with the authorization of the target object (and only then), it can check each structured record against the chosen mathematical property and thereby determine whether each record is watermark data. In this way, the legitimate data set owner can reveal to the verifier the watermark data that are mixed into the structured data set and look like business data, and can prove its ownership claim from the proportion of watermark data in the data set. A key effect of this process is that only the data set owner and an authorized verifier can tell business data from watermark data, so only they can measure the proportion of watermark data. A target object that is not the legitimate owner cannot know the correct secret information. If the target object cannot provide any secret information, it can be determined not to be the legitimate owner of the structured data set. If it provides incorrect secret information, the check values computed from that information and the structured data will not identify the watermark records correctly, and the resulting proportion will differ greatly from the correct ratio label, so it can likewise be determined not to be the legitimate owner.

It should also be noted that, in the embodiments of the present application, in order to determine the target object's ownership of the structured data set accurately and unambiguously, the data set owner generally needs to add enough watermark data that their proportion far exceeds the probability with which the check value of an ordinary record in the predetermined data format happens to conform to the preset mathematical feature. That is, the ratio label must be much larger than the first threshold, so that a structured data set that has been processed with watermark data can be clearly recognized as such. The embodiments of the present application place no specific restriction on the ratio label; for example, it can be set to 5%.

It can be understood that, compared with similar technologies, the present application has four main distinguishing features:

1. It can be used to prove the data set owner's ownership claim over a structured data set, even if the fields of the data set contain "unmodifiable data" such as ID card numbers and mobile phone numbers;

2. Like some watermarking algorithms, the present application involves cryptography, but it is not tied to any particular cryptographic algorithm. Any class of cryptographic primitive that can generate a check value for structured data from secret information (such as a message authentication code or a deterministic digital signature) can serve as the underlying algorithm of the preset rule;

3. Unlike image watermarks or audio/video watermarks, the present application does not perform copyright identification, such as content tracing, by extracting information such as the producer's identity from the data. Instead, it uses the secret information to verify each structured record in the data set one by one, identifies the watermark data, and then determines the ownership of the data set from the proportion of watermark data in it. Here, verifying one by one means checking whether the check value of each structured record satisfies the specific mathematical property;

4. When the target object proves its ownership claim over the structured data to a data set verifier (such as a regulatory department or law enforcement agency), it must authorize the verifier to obtain the secret information for the structured data; no unauthorized party can perform the verification.

In some embodiments, referring to FIG. 4, identifying the watermark data from the structured data set according to the secret information and the specific mathematical property includes:

Step 410, computing a first check value from the secret information and the structured data under the preset rule;

Step 420, determining, according to the specific mathematical property, whether the first check value conforms to the preset mathematical feature;

Step 430, if the first check value conforms to the preset mathematical feature, determining the structured data to be watermark data.

In the embodiments of the present application, when watermark data are identified among the structured data, a check value is first computed from the secret information and a structured record under the preset rule; this value is denoted the first check value. It is then determined, according to the specific mathematical property, whether the first check value conforms to the preset mathematical feature. If it does, the structured record is determined to be watermark data; if it does not, the record is determined to be business data.
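Steps 410 to 430 can be sketched as follows, assuming HMAC-SHA-256 as the preset rule and passing the preset mathematical feature in as a predicate. The names `first_check_value`, `identify_watermarks` and `ratio_result` are hypothetical helpers, not terms defined by the application.

```python
import hashlib
import hmac
from typing import Callable, List


def first_check_value(secret: str, record: str) -> bytes:
    """Step 410: compute the check value of one record under the secret."""
    return hmac.new(secret.encode("utf-8"), record.encode("utf-8"),
                    hashlib.sha256).digest()


def identify_watermarks(secret: str, records: List[str],
                        feature: Callable[[bytes], bool]) -> List[str]:
    """Steps 420 and 430: keep the records whose check value has the feature."""
    return [r for r in records if feature(first_check_value(secret, r))]


def ratio_result(secret: str, records: List[str],
                 feature: Callable[[bytes], bool]) -> float:
    """Proportion of records in the data set that are identified as watermark data."""
    if not records:
        return 0.0
    return len(identify_watermarks(secret, records, feature)) / len(records)


def starts_with_ffff(mac: bytes) -> bool:
    """Example feature from the text: check value begins with 16 one bits."""
    return mac[:2] == b"\xff\xff"
```

A verifier would call `ratio_result(secret, records, starts_with_ffff)` on the data set under examination and compare the returned proportion with the ratio label.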

In some embodiments, determining the ownership relationship between the target object and the structured data set according to the ratio result and the ratio label includes:

calculating a difference value between the ratio result and the ratio label;

if the difference value is less than a second threshold, determining that the target object is the owner of the structured data set.

In the embodiments of the present application, when the ownership relationship between the target object and the structured data set is determined from the ratio result and the ratio label, some business data may happen to satisfy the specific mathematical property, and the structured data set may have been modified, extended, or pruned by a party that misappropriated it. The ratio result and the ratio label may therefore not agree exactly. A difference value between the ratio result and the ratio label can be calculated, and its definition can be chosen flexibly as needed. For example, in some embodiments, the difference between the ratio result and the ratio label is calculated and its absolute value is taken as the difference value; in some embodiments, the ratio of that absolute value to the ratio label (or to the ratio result) is calculated and taken as the difference value. It can be understood that the larger the difference value, the further apart the ratio result and the ratio label are, and the less likely the target object is to be the owner of the structured data set; the smaller the difference value, the closer they are, and the more likely the target object is to be the owner. A threshold, denoted the second threshold, can be set. If the calculated difference value is less than the second threshold, the target object can be determined to be the owner of the structured data set; if it is greater than or equal to the second threshold, the target object can be determined not to be the owner.
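The two ways of computing the difference value and the comparison with the second threshold can be sketched as below. The function names and the threshold values used in the example calls are assumptions chosen for illustration; the application does not prescribe a value for the second threshold.

```python
def difference_value(ratio_result: float, ratio_label: float,
                     relative: bool = False) -> float:
    """Absolute difference, or that difference as a fraction of the ratio label."""
    diff = abs(ratio_result - ratio_label)
    return diff / ratio_label if relative else diff


def is_owner(ratio_result: float, ratio_label: float,
             second_threshold: float, relative: bool = False) -> bool:
    """The target object is accepted as owner if the difference value is below the threshold."""
    return difference_value(ratio_result, ratio_label, relative) < second_threshold


# Ratio label of 5%; the second thresholds 0.01 and 0.10 are hypothetical choices.
print(is_owner(0.048, 0.05, second_threshold=0.01))                  # measured 4.8%
print(is_owner(0.00002, 0.05, second_threshold=0.01))                # measured 0.002%
print(is_owner(0.048, 0.05, second_threshold=0.10, relative=True))   # relative form
```

A measured proportion of 4.8% lies within either tolerance, while a proportion near the chance level (as would result from a wrong secret) does not.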

Referring to FIG. 5, the embodiments of the present application also provide a processing method for a structured data set. The method can be used to generate a structured data set containing watermark data and can be used by the data set owner. Specifically, the processing method includes:

Step 510, obtaining an original data set and ownership mark information, wherein the original data set is used to store structured data, and the structured data satisfy a predetermined data format; the ownership mark information includes secret information, a ratio label, and a specific mathematical property; the specific mathematical property constrains the check value, computed from the secret information and the watermark data under a preset rule, to conform to a preset mathematical feature; under the preset rule, the probability that the check value computed from the secret information and any data satisfying the predetermined data format conforms to the preset mathematical feature is less than a first threshold; and the ratio label is greater than the first threshold;

Step 520, determining watermark data from among the data satisfying the predetermined data format according to the secret information and the specific mathematical property;

Step 530, determining, according to the number of business records contained in the original data set and the ratio label, the target number of watermark records to be added to the original data set;

Step 540, adding the target number of watermark records to the original data set to obtain a target data set.

The embodiments of the present application provide a processing method for a structured data set that can be used by the data set owner. Specifically, the data set owner obtains an original data set and ownership mark information. The original data set is a structured data set used to store the relevant structured data. These structured data satisfy a predetermined data format, and the watermark data will later be searched for and selected on the basis of that format. The embodiments of the present application do not limit the predetermined data format. For example, as shown in FIG. 6, when the original data set stores mobile phone numbers, the predetermined data format is the structure country code + national destination code + subscriber number; similarly, when the original data set stores user addresses, the predetermined data format is the structure province + city + county/district + town/street + community. Knowing the predetermined data format helps to produce watermark data that resemble real business data, so that without the specific secret information no party can tell whether a record in the structured data set is business data or watermark data.

It should be noted in particular that, in the embodiments of the present application, all of the data in the target data set may be watermark data. In this case, the original data set obtained need not contain any real business data.

In the embodiments of the present application, the ownership mark information is used to mark the ownership of the original data set and can include the secret information, the ratio label, and the specific mathematical property. The meaning of each item has been described in the preceding embodiments and is not repeated here. The three items are generally not coupled to one another, so they can be selected in any order. The secret information must be kept confidential and is disclosed, by authorization, to a specific data set verifier (a regulatory department or law enforcement agency) only when necessary; the other items need to be disclosed to the data set verifier and may also be made public.

In the embodiments of the present application, based on the chosen ownership mark information, candidate records are selected one at a time from the data satisfying the predetermined data format according to some strategy (random selection or traversal in some order), and the check value of each candidate is computed under the preset rule. If the result satisfies the selected mathematical property, the candidate is output as watermark data; otherwise it is discarded (the candidate is not watermark data). In this way, watermark data can be determined from among the data satisfying the predetermined data format. The target number of watermark records to be added to the original data set is determined from the number of business records in the original data set and the ratio label. For example, if the original data set has 9500 business records and the ratio label is 5%, then 500 watermark records are needed, since 500 / (9500 + 500) = 5%. Once the required number of watermark records has been obtained, they can be mixed into the business data so that the proportion of watermark data in the resulting data set equals the preselected ratio label, which yields the processed target data set.
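The count in the example follows from solving w / (b + w) = p for w, where b is the number of business records and p the ratio label, which gives w = b * p / (1 - p). The sketch below computes that target number and collects watermark records by traversing candidates. It assumes HMAC-SHA-256 as the preset rule; the secret `hypothetical-secret`, the candidate range and the function names are illustrative only.

```python
import hashlib
import hmac
from itertools import islice
from typing import Callable, Iterable, Iterator


def target_watermark_count(business_count: int, ratio_label: float) -> int:
    """Solve w / (business_count + w) = ratio_label for w."""
    if not 0.0 < ratio_label < 1.0:
        # ratio_label == 1 is the all-watermark case, which has no business data.
        raise ValueError("ratio_label must lie strictly between 0 and 1")
    return round(business_count * ratio_label / (1.0 - ratio_label))


def generate_watermarks(secret: str, candidates: Iterable[str],
                        feature: Callable[[bytes], bool]) -> Iterator[str]:
    """Yield the candidates whose HMAC-SHA-256 check value has the feature."""
    key = secret.encode("utf-8")
    for candidate in candidates:
        mac = hmac.new(key, candidate.encode("utf-8"), hashlib.sha256).digest()
        if feature(mac):
            yield candidate


print(target_watermark_count(9500, 0.05))   # 500, as in the example above

# Traverse hypothetical 123-prefix numbers and keep the first three hits.
candidates = (str(n) for n in range(12300000000, 12301000000))
feature = lambda mac: mac[:2] == b"\xff\xff"
print(list(islice(generate_watermarks("hypothetical-secret", candidates, feature), 3)))
```

Because only about one candidate in 65536 satisfies the 16-bit feature, the loop evaluates many candidates per hit; this is the search cost that the text attributes to watermark generation.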

In the embodiments of the present application, watermark data are indistinguishable in appearance from business data, and among all possible data only a very small proportion satisfy the specific mathematical property and thus become watermark data. A large number of candidates must therefore be evaluated one by one to find the relatively few that satisfy the property. The search may select candidates at random or traverse the data under some condition; the present application does not limit this.

In some embodiments, referring to FIG. 7, obtaining the ownership mark information includes:

Step 710, obtaining association information corresponding to the original data set, the association information being used to characterize the ownership of the original data set;

Step 720, generating the secret information according to the association information.

In the embodiments of the present application, in some cases, to make it easier to clarify the ownership of a data set, information with natural-language meaning can be used to generate the secret information when the data set is processed, which facilitates any later attribute proof. Specifically, association information corresponding to the original data set can be obtained. The association information characterizes the ownership of the original data set, for example "A公司XXX专用" ("for the exclusive use of XXX of Company A"). The secret information is then generated from the association information, for example by using the association information directly as the secret information or by processing it first; the present application does not limit this.

In some embodiments, adding the target number of watermark records to the original data set to obtain the target data set includes:

determining the target number of insertion positions in the original data set;

adding each watermark record to the original data set at one of the insertion positions to obtain the target data set.

In the embodiments of the present application, the target number of watermark records can be added to the original data set by mixing. For example, the original data set can be processed with a random insertion algorithm, a group insertion algorithm, a time-series mixing algorithm, or a hybrid encryption algorithm to determine the target number of insertion positions. As shown in FIG. 8, mixing here means inserting the watermark data relatively evenly among the existing business data, so that the position of a record in the data set (its row in the data table) does not reveal whether it is business data or watermark data. Specifically, the target number of insertion positions is determined in the original data set, and each watermark record is added at one insertion position to obtain the target data set. For example, if the original data set has 9500 business records and 500 watermark records are inserted at random, the resulting structured data set has 10000 records, of which 5% are watermark data, but their positions cannot be determined and follow no pattern. This improves the security of the ownership marking of the structured data set.
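Random insertion, one of the mixing options mentioned above, can be sketched as follows. The function name `mix_in_watermarks` is hypothetical, and Python's general-purpose `random` module is used only for illustration; a deployment that needs unpredictable positions might pass `random.SystemRandom()` as the generator instead.

```python
import random
from typing import List, Optional


def mix_in_watermarks(business: List[str], watermarks: List[str],
                      rng: Optional[random.Random] = None) -> List[str]:
    """Place each watermark record at a distinct, uniformly chosen row position."""
    rng = rng or random.Random()
    total = len(business) + len(watermarks)
    # Choose which rows of the target data set will hold watermark records.
    slots = set(rng.sample(range(total), len(watermarks)))
    biz_iter, wm_iter = iter(business), iter(watermarks)
    return [next(wm_iter) if i in slots else next(biz_iter) for i in range(total)]


business = [f"business-{i}" for i in range(9500)]     # placeholder records
watermarks = [f"watermark-{i}" for i in range(500)]   # placeholder records
target = mix_in_watermarks(business, watermarks)
print(len(target), sum(r.startswith("watermark") for r in target) / len(target))
```

The business records keep their original relative order, and every choice of 500 watermark rows out of 10000 is equally likely, so a row's position carries no information about its type.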

The technical solution of the present application is introduced and explained below with reference to specific application scenarios.

The embodiments of the present application apply both to business data that contain a single field and to business data that contain multiple fields; one example of each is described below. The examples are intended to illustrate the concepts and calculations involved in the present application and do not reflect the real world (for example, they use mobile phone numbers in a hypothetical 123 number range). To keep the description concise and to make the examples easy to check, the following technical conventions apply to all of them:

Strings are encoded with the internationally used UTF-8 encoding. For example, the string "数学", which consists of two Chinese characters, is encoded as an array of 6 bytes whose hexadecimal representation is e695b0e5ada6. For compatibility with most programming languages, watermark verification uses the widely adopted message authentication code HMAC-SHA-256, whose output is an array of 32 bytes; test vectors are given in IETF RFC 4231.
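Both conventions can be checked with Python's standard library. The snippet below encodes the string "数学" and evaluates HMAC-SHA-256 on Test Case 2 of RFC 4231 (key "Jefe", data "what do ya want for nothing?"); it is a sanity check of the tooling, not part of the method itself.

```python
import hashlib
import hmac

# UTF-8 encoding of the two-character string from the text.
encoded = "数学".encode("utf-8")
print(len(encoded), encoded.hex())   # 6 e695b0e5ada6

# RFC 4231, Test Case 2.
mac = hmac.new(b"Jefe", b"what do ya want for nothing?", hashlib.sha256)
print(mac.digest_size, mac.hexdigest())
```

The second line of output should show a 32-byte digest equal to the HMAC-SHA-256 value listed for that test case in RFC 4231.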

These conventions serve only to make the examples easier to describe and in no way limit the generality of the present application. In practice, the present application is not restricted to a particular encoding or a particular cryptographic algorithm. For example, the character encoding may follow the Chinese standard GB 18030, and the message authentication code underlying the mathematical property of the watermark data may be HMAC-SM3 or CMAC-SM4 from the Chinese standards, or the SHA-3-based KMAC used internationally.

Example 1: Providing attribute proof for a data set containing only a single field.

Suppose that an operator needs a batch of mobile phone numbers in the 123 number range for a test project carried out in February 2024. To set these numbers apart, the company uses only numbers whose message authentication code begins with 16 consecutive 1 bits, that is, with hexadecimal ffff, and these numbers are no longer assigned to normal services. In other words, a mobile phone number whose message authentication code begins with ffff is watermark data (the secret information needed to compute the message authentication code is known only to the company). Because the company uses only such numbers in the test project, watermark data make up 100% of the structured data set it generates. Following the processing method provided in the present application, the company, as the data set owner, traverses the numbers upward from 12300000000 and thereby generates a data set consisting entirely of watermark data: 12300023180, 12300034919, 12300078978, 12300088650, 12300393151, 12300421487, 12300600146, 12300814686, 12300857998, 12301037953, and so on.

Suppose that after the project is launched, this batch of mobile phone numbers is exposed. To allay possible concerns, the company declares to the regulatory department that all of the numbers are reserved numbers used for testing (not numbers assigned to real individual users) and discloses to it (and only to it) the secret information that was selected before the project began and is used to compute the message authentication code: "中国某某股份有限公司2024年2月测试专用" (roughly, "China XX Co., Ltd., for testing in February 2024 only"; this key is for illustration only, and real secret information must be a value that cannot be guessed).

The regulatory department, as the data set verifier, encodes the secret information and each mobile phone number as UTF-8 strings and applies HMAC-SHA-256 to obtain the message authentication codes shown in Table 1 below (only 10 records are shown for reasons of space):

Table 1

The company also informs the regulatory department that the specific mathematical property of the watermark data is that the message authentication code begins with 16 consecutive 1 bits. For any random record, the probability of satisfying this property is 2^-16, so the probability that a given record is watermark data by chance is only about 1.5 in 100,000. As shown in Table 1, the regulatory department verifies that all of the exposed mobile phone numbers are watermark data, so it has good reason to believe that these numbers are test numbers selected in advance by the company rather than numbers assigned to real individual users and leaked in a security incident. If a data set had instead been leaked in a security incident, it would be infeasible to find, after the fact, secret information under which all of the data satisfy the specific mathematical property (and are thereby confirmed as watermark data). Moreover, the secret information provided by the company, "中国某某股份有限公司2024年2月测试专用", itself indicates the ownership of the data set, which further supports the conclusion that this structured data set consists of test numbers selected in advance.
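The verifier's calculation in Example 1 can be sketched as below, using the example key and the first three numbers quoted in the text. Whether these inputs reproduce check values beginning with ffff depends on the exact key bytes used in the application, which this sketch does not confirm, so no expected output is stated. The helper names `mac_hex` and `is_watermark` are illustrative.

```python
import hashlib
import hmac

SECRET = "中国某某股份有限公司2024年2月测试专用"            # example key quoted in the text
PUBLISHED = ["12300023180", "12300034919", "12300078978"]  # first numbers listed in the text


def mac_hex(secret: str, number: str) -> str:
    """HMAC-SHA-256 of the UTF-8 encoded number under the UTF-8 encoded secret."""
    return hmac.new(secret.encode("utf-8"), number.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def is_watermark(secret: str, number: str) -> bool:
    """Watermark test of Example 1: check value begins with 16 one bits."""
    return mac_hex(secret, number).startswith("ffff")


for number in PUBLISHED:
    print(number, mac_hex(SECRET, number)[:8], is_watermark(SECRET, number))

print(2 ** -16)   # chance that a random record passes: about 1.5 in 100,000
```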

Example 2: Providing attribute proof for a data set containing multiple fields.

When data containing multiple fields are verified, all key fields must take part in the calculation that decides whether a record satisfies the specific mathematical property. At the source, this means that when the owner generates watermark data, all key fields take part in the computation that produces the authentication information (an HMAC-SHA-256 message authentication code in these examples), and one or more of the key fields may hold synthetic values as the situation requires. A simple way to involve all key fields is to concatenate their values directly as strings and then encode the result (with UTF-8 in these examples).
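The "simple way" described above can be sketched as follows: the field values are joined without a separator, encoded as UTF-8, and passed to HMAC-SHA-256. The record values, the secret `hypothetical-secret` and the function names are placeholders invented for illustration.

```python
import hashlib
import hmac
from typing import Sequence


def record_check_value(secret: str, fields: Sequence[str]) -> bytes:
    """Concatenate all key fields directly, encode as UTF-8, and compute HMAC-SHA-256."""
    message = "".join(fields).encode("utf-8")
    return hmac.new(secret.encode("utf-8"), message, hashlib.sha256).digest()


def ends_with_20_zero_bits(mac: bytes) -> bool:
    """Feature used in Example 2: the lowest 20 bits of the check value are zero."""
    return int.from_bytes(mac, "big") & ((1 << 20) - 1) == 0


# Hypothetical record: ID card number, mobile phone number, name.
record = ("000000000000000000", "12300000000", "张三")
mac = record_check_value("hypothetical-secret", record)
print(mac.hex(), ends_with_20_zero_bits(mac))
```

One consequence of direct concatenation is that the field boundaries are not part of the message, so ("ab", "c") and ("a", "bc") yield the same check value. For fields with fixed formats, such as ID card and phone numbers, this is usually harmless; where it matters, a separator or length prefix could be added, which would be a departure from the convention stated in the text.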

Suppose that two e-commerce companies, Company A and Company B, are competitors, and that Company A's customer data (containing at least three fields: ID card number, mobile phone number, and name) have long been stolen by Company B through illegal means. Company A wants to provide attribute proof for its customer data sets, so each quarter it selects a piece of secret information and, for that quarter's transaction customer data set, generates watermark data and inserts them at a ratio of 5% (one watermark record for every 19 business records, at randomly chosen positions). Without the secret information, Company B cannot identify these watermark data and is not even aware that they exist.

假定以下表2是甲公司于2025年一季度插入其客户数据集的水印数据样本(每个字段都可以是合成的)，其中每行数据都可以按前述“简便做法”先拼接起来再编码计算消息鉴别码：Assume that the following Table 2 is a sample of watermark data inserted by Company A into its customer data set in the first quarter of 2025 (each field can be synthetic), where each row of data can be concatenated and then encoded to calculate the message authentication code according to the aforementioned "simple method":

表2
Table 2

当执法机关针对乙公司的不法行为进行抓捕时，甲公司向执法机关披露其相应于某份数据集的秘密信息是"中国甲方股份有限公司2025年一季度测试专用"，其中告知水印数据具有的特定数学性质是得到的消息鉴别码以连续20个0比特结尾。对任何一条数据，满足此数学性质的概率是2^-20次方；也即一条随机数据成为水印数据的概率不到百万分之一。When a law enforcement agency moves against Company B for its illegal conduct, Company A discloses to the agency that its secret information for a given data set is "China Party A Co., Ltd., for testing only, first quarter of 2025", and that the specific mathematical property of the watermark data is that the resulting message authentication code ends with 20 consecutive 0 bits. For any random piece of data, the probability of satisfying this property is 2^-20; that is, the probability that a random piece of data qualifies as watermark data is less than one in a million.
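The trailing-zero property can be tested with a bit mask on the MAC interpreted as an integer. The sketch below assumes, as before, that the secret information serves as the HMAC key and that the record has already been concatenated into one string; `has_trailing_zeros` is a hypothetical name.

```python
import hashlib
import hmac


def has_trailing_zeros(secret: str, record: str, n_bits: int = 20) -> bool:
    """True if the HMAC-SHA-256 tag ends with `n_bits` consecutive 0 bits."""
    tag = hmac.new(secret.encode("utf-8"), record.encode("utf-8"), hashlib.sha256).digest()
    value = int.from_bytes(tag, "big")
    mask = (1 << n_bits) - 1          # the lowest n_bits bits set to 1
    return value & mask == 0


# Chance that one random record passes the 20-bit test.
print(2 ** -20)  # 9.5367431640625e-07, i.e. just under one in a million
```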

执法机关经甲公司授权获得上述秘密信息后，对缴获的数据集进行核验发现，有约5％的数据确系满足甲公司所主张特定数学性质的水印数据，样例如下表3：After obtaining the above secret information with the authorization of Company A, the law enforcement agency verified the seized data set and found that about 5% of the data was indeed watermark data that met the specific mathematical properties claimed by Company A. An example is shown in Table 3 below:

表3
Table 3

可以理解的是，若乙公司未窃取甲公司数据，缴获数据集中出现甲公司水印数据的概率应不足百万分之一。因缴获数据集中甲公司水印数据占比高达约5％，故执法机关确认该批数据的权属为甲方所有，认可甲公司的数据被乙公司窃取了的主张。It can be understood that if Company B had not stolen Company A's data, the probability of any given record in the seized data set qualifying as Company A's watermark data would be less than one in a million. Because Company A's watermark data accounts for about 5% of the seized data set, the law enforcement agency confirms that the data belongs to Company A and accepts Company A's claim that its data was stolen by Company B.
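The gap between the two figures can be made explicit with a short worked bound. It is a sketch under one assumption that is not stated in the application: for records not generated by Company A, each record passes the 20-zero-bit test with probability $p = 2^{-20}$. Here $N$ is the number of records in the seized data set and $X$ is the number of records that pass by chance.

```latex
\[
  p = 2^{-20} \approx 9.54 \times 10^{-7}, \qquad \mathbb{E}[X] = N p
\]
\[
  \Pr\left[\, X \ge 0.05\,N \,\right]
  \;\le\; \frac{\mathbb{E}[X]}{0.05\,N}
  \;=\; \frac{2^{-20}}{0.05}
  \;\approx\; 1.9 \times 10^{-5}
  \qquad \text{(Markov's inequality)}
\]
```

The observed proportion of 5% is roughly $0.05 / 2^{-20} \approx 5.2 \times 10^{4}$ times the chance rate. Markov's inequality gives only a loose bound; if the records are additionally assumed to pass independently, the actual probability is far smaller for any data set of realistic size.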

可以理解的是，本申请实施例中，将水印数据隐藏于业务数据中，使两者在格式、内容等方方面面看不出区别。不具备特定秘密信息的情况下，任何人都无法判断数据是业务数据还是水印数据。本申请基于水印数据与业务数据所形成的比例特征确认结构化数据集的权属，而且引入了秘密信息，相较于直接通过水印数据确认权属的方式更难以被破解，能够较好地保护数据集所有方的合法权益。It can be understood that in the embodiments of the present application, the watermark data is hidden among the business data so that the two are indistinguishable in format, content, and every other respect. Without the specific secret information, no one can determine whether a piece of data is business data or watermark data. The present application confirms the ownership of a structured data set based on the proportional characteristic formed by the watermark data and the business data, and additionally introduces secret information. Compared with confirming ownership directly through watermark data, this approach is more difficult to defeat and better protects the legitimate rights and interests of the data set owner.

下面参照附图描述根据本申请实施例提出的结构化数据集的权属验证装置。The following describes the ownership verification device for a structured data set proposed according to an embodiment of the present application with reference to the accompanying drawings.

参照图9，本申请实施例中提出的结构化数据集的权属验证装置，包括：Referring to FIG. 9, the ownership verification device for a structured data set proposed in an embodiment of the present application includes:

第一获取单元910，用于获取结构化数据集；结构化数据集中包括有多条结构化数据，每条结构化数据为业务数据或者水印数据，业务数据和水印数据满足相同的预定数据格式；A first acquisition unit 910 is used to acquire a structured data set; the structured data set includes a plurality of structured data, each of which is business data or watermark data, and the business data and the watermark data meet the same predetermined data format;

第二获取单元920，用于从待验证的目标对象处获取结构化数据集对应的秘密信息、水印数据的比例标签和水印数据对应的特定数学性质；特定数学性质用于约束通过预设规则、使用秘密信息和水印数据计算得到的校验值符合预设的数学特征；其中，通过预设规则，任一满足预定数据格式的数据和秘密信息计算得到的校验值符合预设的数学特征的概率小于第一阈值；比例标签大于第一阈值；The second acquisition unit 920 is used to acquire the secret information corresponding to the structured data set, the proportion label of the watermark data and the specific mathematical property corresponding to the watermark data from the target object to be verified; the specific mathematical property is used to constrain the check value calculated by using the secret information and the watermark data according to the preset rules to meet the preset mathematical characteristics; wherein, according to the preset rules, the probability that the check value calculated by any data satisfying the predetermined data format and the secret information meets the preset mathematical characteristics is less than a first threshold; the proportion label is greater than the first threshold;

处理单元930，用于根据秘密信息和特定数学性质，从结构化数据集中识别出水印数据；A processing unit 930, configured to identify watermark data from a structured data set based on the secret information and a specific mathematical property;

统计单元940，用于统计水印数据在结构化数据集中所占的比例结果，根据比例结果和比例标签，确定目标对象和结构化数据集的权属关系。The statistical unit 940 is used to count the proportion of the watermark data in the structured data set, and determine the ownership relationship between the target object and the structured data set according to the proportion result and the proportion label.

在一些实施例中，处理单元930还用于通过预设规则，对秘密信息和结构化数据进行计算，得到第一校验值；根据特定数学性质，判断第一校验值是否符合预设的数学特征；若第一校验值符合预设的数学特征，将结构化数据确定为水印数据。In some embodiments, the processing unit 930 is also used to calculate the secret information and structured data according to preset rules to obtain a first verification value; based on specific mathematical properties, determine whether the first verification value meets the preset mathematical characteristics; if the first verification value meets the preset mathematical characteristics, determine the structured data as watermark data.
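The identification and counting steps performed by the processing unit and the statistical unit can be sketched together. This is an illustration only: it assumes HMAC-SHA-256 as the preset rule, the secret information as the key, and "begins with `n_bits` consecutive 1 bits" as the preset mathematical feature; the function names are hypothetical.

```python
import hashlib
import hmac
from typing import Sequence


def is_watermark(secret: str, record: str, n_bits: int = 16) -> bool:
    """Compute the check value and test whether its top `n_bits` bits are all 1."""
    tag = hmac.new(secret.encode("utf-8"), record.encode("utf-8"), hashlib.sha256).digest()
    value = int.from_bytes(tag, "big")
    return value >> (256 - n_bits) == (1 << n_bits) - 1


def watermark_ratio(records: Sequence[str], secret: str, n_bits: int = 16) -> float:
    """Fraction of records in the data set identified as watermark data."""
    if not records:
        return 0.0
    hits = sum(is_watermark(secret, record, n_bits) for record in records)
    return hits / len(records)
```

The returned fraction is the ratio result that is then compared against the ratio label supplied by the target object.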

在一些实施例中，统计单元940还用于计算比例结果和比例标签之间的差异值；若差异值小于第二阈值，确定目标对象为结构化数据集的权属所有方。In some embodiments, the statistical unit 940 is further configured to calculate a difference value between the ratio result and the ratio label; if the difference value is less than a second threshold, it is determined that the target object is the owner of the structured data set.

在一些实施例中，统计单元940还用于计算比例结果和比例标签之间的差值，将差值的绝对值确定为差异值。In some embodiments, the statistical unit 940 is further configured to calculate a difference between the ratio result and the ratio label, and determine an absolute value of the difference as a difference value.

在一些实施例中，统计单元940还用于计算比例结果和比例标签之间的差值，将差值的绝对值占比例标签的比例确定为差异值。In some embodiments, the statistical unit 940 is further configured to calculate the difference between the proportion result and the proportion label, and determine the ratio of the absolute value of the difference to the proportion label as the difference value.
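The two difference values described above, and the comparison against the second threshold, can be written out as follows. The function names and the threshold values in the usage lines are hypothetical; the application does not fix specific thresholds in this passage.

```python
def absolute_difference(ratio_result: float, ratio_label: float) -> float:
    """Difference value as the absolute value of (ratio result - ratio label)."""
    return abs(ratio_result - ratio_label)


def relative_difference(ratio_result: float, ratio_label: float) -> float:
    """Difference value as the absolute difference divided by the ratio label."""
    return abs(ratio_result - ratio_label) / ratio_label


def is_owner(ratio_result: float, ratio_label: float,
             second_threshold: float, relative: bool = False) -> bool:
    """The target object is treated as owner if the difference value is below the threshold."""
    if relative:
        diff = relative_difference(ratio_result, ratio_label)
    else:
        diff = absolute_difference(ratio_result, ratio_label)
    return diff < second_threshold


# Observed 4.8% against a claimed 5%, with an assumed 10% relative tolerance.
print(is_owner(0.048, 0.05, second_threshold=0.10, relative=True))  # True
```

The relative form scales the tolerance with the ratio label, which may be more convenient when different data sets use very different watermark proportions.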

本申请实施例中提出一种结构化数据集的处理装置，包括：In an embodiment of the present application, a structured data set processing device is proposed, comprising:

信息获取单元，用于获取原始数据集和权属标记信息；其中，原始数据集用于存储结构化数据，结构化数据满足预定数据格式；权属标记信息包括秘密信息、比例标签和特定数学性质；特定数学性质用于约束通过预设规则、使用秘密信息和水印数据计算得到的校验值符合预设的数学特征；其中，通过预设规则，任一满足预定数据格式的数据和秘密信息计算得到的校验值符合预设的数学特征的概率小于第一阈值；比例标签大于第一阈值；An information acquisition unit is used to acquire an original data set and ownership mark information; wherein the original data set is used to store structured data, and the structured data meets a predetermined data format; the ownership mark information includes secret information, a ratio label and a specific mathematical property; the specific mathematical property is used to constrain a check value calculated by using a preset rule, secret information and watermark data to meet a preset mathematical feature; wherein, according to the preset rule, the probability that any check value calculated by using data meeting a predetermined data format and secret information meets the preset mathematical feature is less than a first threshold; the ratio label is greater than the first threshold;

第一确定单元，用于根据秘密信息和特定数学性质，从满足预定数据格式的数据中确定水印数据；A first determining unit is used to determine watermark data from data satisfying the predetermined data format according to the secret information and the specific mathematical property;
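One way the first determining unit could find watermark data is a simple search: enumerate candidate values in the predetermined format and keep those whose check value has the required property. The sketch below is an assumption about how such a search might look, not a description of the applicant's procedure; the 11-digit candidate format, the `prefix` parameter, and the function names are hypothetical.

```python
import hashlib
import hmac
from typing import List


def passes(secret: str, record: str, n_bits: int) -> bool:
    """True if the HMAC-SHA-256 tag begins with `n_bits` consecutive 1 bits."""
    tag = hmac.new(secret.encode("utf-8"), record.encode("utf-8"), hashlib.sha256).digest()
    value = int.from_bytes(tag, "big")
    return value >> (256 - n_bits) == (1 << n_bits) - 1


def generate_watermarks(secret: str, how_many: int,
                        n_bits: int = 16, prefix: str = "138") -> List[str]:
    """Enumerate synthetic 11-digit numbers and keep those that satisfy the property."""
    found: List[str] = []
    serial = 0
    while len(found) < how_many:
        candidate = f"{prefix}{serial:08d}"
        if passes(secret, candidate, n_bits):
            found.append(candidate)
        serial += 1
    return found
```

On average about `2 ** n_bits` candidates are tried per watermark record, so 16 bits costs roughly 65,536 MAC computations per record. A sequential search is used here for brevity; drawing candidates at random would make the resulting records less conspicuous.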

第二确定单元，用于根据原始数据集中包含的业务数据的数量以及比例标签，确定需要加入原始数据集的水印数据的目标数量；A second determining unit is used to determine a target amount of watermark data to be added to the original data set according to the amount of business data and the ratio label included in the original data set;
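The target amount can be derived from the number of business records and the ratio label. The sketch assumes the ratio label is the share of watermark data in the final (target) data set, which matches the embodiment in which one watermark record per 19 business records is described as 5%; the function name is hypothetical.

```python
def target_watermark_count(n_business: int, ratio_label: float) -> int:
    """Number of watermark records w such that w / (n_business + w) is about ratio_label."""
    if not 0 <= ratio_label < 1:
        raise ValueError("ratio_label must be in [0, 1)")
    return round(n_business * ratio_label / (1 - ratio_label))


print(target_watermark_count(19, 0.05))    # 1
print(target_watermark_count(1900, 0.05))  # 100
```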

数据集获取单元，用于将目标数量的水印数据添加到原始数据集中，得到目标数据集。The data set acquisition unit is used to add a target amount of watermark data to the original data set to obtain a target data set.

在一些实施例中，信息获取单元用于获取和原始数据集对应的关联信息；关联信息用于表征原始数据集的权属；根据关联信息，生成秘密信息。In some embodiments, the information acquisition unit is used to acquire associated information corresponding to the original data set; the associated information is used to characterize the ownership of the original data set; and secret information is generated based on the associated information.

在一些实施例中，数据集获取单元还用于在原始数据集中确定目标数量的插入位置；将每个水印数据添加到原始数据集中的一个插入位置处，得到目标数据集。In some embodiments, the data set acquisition unit is further used to determine a target number of insertion positions in the original data set; and add each watermark data to an insertion position in the original data set to obtain a target data set.
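Random selection of insertion positions, one of the options mentioned in the embodiments, can be sketched as below. The function name `insert_watermarks` and the optional `rng` parameter are assumptions added for illustration and testability.

```python
import random
from typing import List, Optional, Sequence


def insert_watermarks(business: Sequence[str], watermarks: Sequence[str],
                      rng: Optional[random.Random] = None) -> List[str]:
    """Insert each watermark record at a randomly chosen position in the data set."""
    rng = rng or random.Random()
    target = list(business)                    # leave the original data set untouched
    for mark in watermarks:
        position = rng.randint(0, len(target))  # inclusive: may also append at the end
        target.insert(position, mark)
    return target
```

The relative order of the business records is preserved, and the result contains `len(business) + len(watermarks)` records.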

在一些实施例中，数据集获取单元还用于采用随机插入算法、分组插入算法、时间序列混杂算法或混合加密算法处理原始数据集，确定目标数量的插入位置。可以理解的是，上述方法实施例中的内容均适用于本装置实施例中，本装置实施例所具体实现的功能与上述方法实施例相同，并且达到的有益效果与上述方法实施例所达到的有益效果也相同。In some embodiments, the data set acquisition unit is further used to process the original data set using a random insertion algorithm, a group insertion algorithm, a time series hybrid algorithm or a hybrid encryption algorithm to determine the insertion position of the target number. It can be understood that the contents of the above method embodiments are all applicable to the present device embodiment, and the functions specifically implemented by the present device embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.

参照图10，本申请实施例提供了一种电子设备，包括：Referring to FIG. 10 , an embodiment of the present application provides an electronic device, including:

至少一个处理器1010；at least one processor 1010;

至少一个存储器1020，用于存储至少一个程序；At least one memory 1020, used to store at least one program;

当至少一个程序被至少一个处理器1010执行时，使得至少一个处理器1010实现结构化数据集的权属验证方法或者结构化数据集的处理方法。When at least one program is executed by at least one processor 1010, at least one processor 1010 implements a method for verifying ownership of a structured data set or a method for processing a structured data set.

同理，上述方法实施例中的内容均适用于本电子设备实施例中，本电子设备实施例所具体实现的功能与上述方法实施例相同，并且达到的有益效果与上述方法实施例所达到的有益效果也相同。Similarly, the contents of the above method embodiments are all applicable to the electronic device embodiments. The functions specifically implemented by the electronic device embodiments are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.

本申请实施例还提供了一种计算机可读存储介质，其中存储有处理器1010可执行的程序，处理器1010可执行的程序在由处理器1010执行时用于执行上述的结构化数据集的权属验证方法或者结构化数据集的处理方法。An embodiment of the present application also provides a computer-readable storage medium, which stores a program executable by the processor 1010. When the program executable by the processor 1010 is executed by the processor 1010, it is used to execute the above-mentioned method for verifying the ownership of the structured data set or the method for processing the structured data set.

同理，上述方法实施例中的内容均适用于本计算机可读存储介质实施例中，本计算机可读存储介质实施例所具体实现的功能与上述方法实施例相同，并且达到的有益效果与上述方法实施例所达到的有益效果也相同。Similarly, the contents of the above method embodiments are all applicable to the computer-readable storage medium embodiments. The functions specifically implemented by the computer-readable storage medium embodiments are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.

在一些可选择的实施例中，在方框图中提到的功能/操作可以不按照操作示图提到的顺序发生。例如，取决于所涉及的功能/操作，连续示出的两个方框实际上可以被大体上同时地执行或方框有时能以相反顺序被执行。此外，在本申请的流程图中所呈现和描述的实施例以示例的方式被提供，目的在于提供对技术更全面的理解。所公开的方法不限于本文所呈现的操作和逻辑流程。可选择的实施例是可预期的，其中各种操作的顺序被改变以及其中被描述为较大操作的一部分的子操作被独立地执行。In some alternative embodiments, the functions/operations mentioned in the block diagrams may occur out of the order mentioned in the operational illustrations. For example, depending on the functions/operations involved, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order. In addition, the embodiments presented and described in the flowcharts of the present application are provided by way of example, for the purpose of providing a more comprehensive understanding of the technology. The disclosed methods are not limited to the operations and logical flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of a larger operation are executed independently.

此外，虽然在功能性模块的背景下描述了本申请，但应当理解的是，除非另有相反说明，功能和/或特征中的一个或多个可以被集成在单个物理装置和/或软件模块中，或者一个或多个功能和/或特征可以在单独的物理装置或软件模块中被实现。还可以理解的是，有关每个模块的实际实现的详细讨论对于理解本申请是不必要的。更确切地说，考虑到在本文中公开的装置中各种功能模块的属性、功能和内部关系的情况下，在工程师的常规技术内将会了解该模块的实际实现。因此，本领域技术人员运用普通技术就能够在无需过度试验的情况下实现在权利要求书中所阐明的本申请。还可以理解的是，所公开的特定概念仅仅是说明性的，并不意在限制本申请的范围，本申请的范围由所附权利要求书及其等同方案的全部范围来决定。In addition, although the present application is described in the context of functional modules, it should be understood that, unless otherwise specified, one or more of the functions and/or features can be integrated into a single physical device and/or software module, or one or more functions and/or features can be implemented in a separate physical device or software module. It is also understood that a detailed discussion of the actual implementation of each module is unnecessary for understanding the present application. More specifically, in view of the properties, functions, and internal relationships of the various functional modules in the device disclosed herein, the actual implementation of the module will be understood within the conventional techniques of the engineer. Therefore, those skilled in the art can implement the present application set forth in the claims without excessive experimentation using ordinary techniques. It is also understood that the specific concepts disclosed are merely illustrative and are not intended to limit the scope of the present application, which is determined by the full scope of the attached claims and their equivalents.

功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本申请的技术方案本质上或者说对传统技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台设备(可以是个人计算机，服务器，或者网络设备等)执行本申请各个实施例方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器(ROM，Read-Only Memory)、随机存取存储器(RAM，Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。If the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, or the part that contributes to the traditional technology or the part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium, including a number of instructions to enable a device (which can be a personal computer, server, or network device, etc.) to execute all or part of the steps of the various embodiments of the present application. The aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), disk or optical disk and other media that can store program code.

在流程图中表示或在此以其他方式描述的逻辑和/或步骤，例如，可以被认为是用于实现逻辑功能的可执行指令的定序列表，可以具体实现在任何计算机可读介质中，以供指令执行系统、装置或设备(如基于计算机的系统、包括处理器的系统或其他可以从指令执行系统、装置或设备取指令并执行指令的系统)使用，或结合这些指令执行系统、装置或设备而使用。就本说明书而言，“计算机可读介质”可以是任何可以包含、存储、通信、传播或传输程序以供指令执行系统、装置或设备或结合这些指令执行系统、装置或设备而使用的装置。The logic and/or steps represented in the flowchart or otherwise described herein, for example, can be considered as an ordered list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by an instruction execution system, device or apparatus (such as a computer-based system, a system including a processor, or other system that can fetch instructions from an instruction execution system, device or apparatus and execute instructions), or in conjunction with such instruction execution systems, devices or apparatuses. For the purposes of this specification, "computer-readable medium" can be any device that can contain, store, communicate, propagate or transmit a program for use by an instruction execution system, device or apparatus, or in conjunction with such instruction execution systems, devices or apparatuses.

计算机可读介质的更具体的示例(非穷尽性列表)包括以下：具有一个或多个布线的电连接部(电子装置)，便携式计算机盘盒(磁装置)，随机存取存储器(RAM)，只读存储器(ROM)，可擦除可编辑只读存储器(EPROM或闪速存储器)，光纤装置，以及便携式光盘只读存储器(CDROM)。另外，计算机可读介质甚至可以是可在其上打印程序的纸或其他合适的介质，因为可以例如通过对纸或其他介质进行光学扫描，接着进行编辑、解译或必要时以其他合适方式进行处理来以电子方式获得程序，然后将其存储在计算机存储器中。More specific examples of computer-readable media (a non-exhaustive list) include the following: an electrical connection with one or more wires (electronic device), a portable computer disk case (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable and programmable read-only memory (EPROM or flash memory), a fiber optic device, and a portable compact disk read-only memory (CDROM). In addition, the computer-readable medium may even be a paper or other suitable medium on which the program is printed, since the program may be obtained electronically, for example, by optically scanning the paper or other medium, followed by editing, deciphering or, if necessary, processing in another suitable manner, and then stored in a computer memory.

应当理解，本申请的各部分可以用硬件、软件、固件或它们的组合来实现。在上述实施方式中，多个步骤或方法可以用存储在存储器中且由合适的指令执行系统执行的软件或固件来实现。例如，如果用硬件来实现，和在另一实施方式中一样，可用本领域公知的下列技术中的任一项或他们的组合来实现：具有用于对数据信号实现逻辑功能的逻辑门电路的离散逻辑电路，具有合适的组合逻辑门电路的专用集成电路，可编程门阵列(PGA)，现场可编程门阵列(FPGA)等。It should be understood that the various parts of the present application can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following technologies known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.

在本说明书的上述描述中，参考术语“一个实施方式/实施例”、“另一实施方式/实施例”或“某些实施方式/实施例”等的描述意指结合实施方式或示例描述的具体特征、结构、材料或者特点包含于本申请的至少一个实施方式或示例中。在本说明书中，对上述术语的示意性表述不一定指的是相同的实施方式或示例。而且，描述的具体特征、结构、材料或者特点可以在任何的一个或多个实施方式或示例中以合适的方式结合。In the above description of this specification, the description with reference to the terms "one embodiment/example", "another embodiment/example" or "certain embodiments/examples" etc. means that the specific features, structures, materials or characteristics described in conjunction with the embodiment or example are included in at least one embodiment or example of the present application. In this specification, the schematic representation of the above terms does not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any one or more embodiments or examples in a suitable manner.

尽管已经示出和描述了本申请的实施方式，本领域的普通技术人员可以理解：在不脱离本申请的原理和宗旨的情况下可以对这些实施方式进行多种变化、修改、替换和变型，本申请的范围由权利要求及其等同物限定。Although the embodiments of the present application have been shown and described, those skilled in the art will appreciate that various changes, modifications, substitutions and variations may be made to the embodiments without departing from the principles and spirit of the present application, and that the scope of the present application is defined by the claims and their equivalents.

以上所述实施例的各技术特征可以进行任意的组合，为使描述简洁，未对上述实施例中的各个技术特征所有可能的组合都进行描述，然而，只要这些技术特征的组合不存在矛盾，都应当认为是本说明书记载的范围。The technical features of the above-described embodiments may be arbitrarily combined. To make the description concise, not all possible combinations of the technical features in the above-described embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

以上所述实施例仅表达了本申请的几种实施方式，其描述较为具体和详细，但并不能因此而理解为对申请专利范围的限制。应当指出的是，对于本领域的普通技术人员来说，在不脱离本申请构思的前提下，还可以做出若干变形和改进，这些都属于本申请的保护范围。因此，本申请专利的保护范围应以所附权利要求为准。 The above-described embodiments only express several implementation methods of the present application, and the descriptions thereof are relatively specific and detailed, but they cannot be construed as limiting the scope of the patent application. It should be pointed out that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the concept of the present application, and these all belong to the protection scope of the present application. Therefore, the protection scope of the patent application shall be subject to the attached claims.

Claims

A method for verifying ownership of a structured data set comprises the following steps:

Acquire a structured data set; the structured data set includes a plurality of structured data, each of which is business data or watermark data, and the business data and the watermark data satisfy the same predetermined data format;

The secret information corresponding to the structured data set, the proportional label of the watermark data and the specific mathematical property corresponding to the watermark data are obtained from the target object to be verified; the specific mathematical property is used to constrain the check value calculated by using the secret information and the watermark data according to the preset rules to meet the preset mathematical characteristics; wherein, according to the preset rules, the probability that any check value calculated by using the data satisfying the predetermined data format and the secret information meets the preset mathematical characteristics is less than a first threshold; the proportional label is greater than the first threshold;

identifying the watermark data from the structured data set according to the secret information and the specific mathematical property;

The proportion of the watermark data in the structured data set is calculated, and the ownership relationship between the target object and the structured data set is determined according to the proportion result and the proportion label.

The method for verifying ownership of a structured data set according to claim 1, wherein identifying the watermark data from the structured data set according to the secret information and the specific mathematical property comprises:

The secret information and the structured data are calculated according to the preset rule to obtain a first verification value;

According to the specific mathematical property, determining whether the first verification value meets the preset mathematical feature;

If the first check value meets the preset mathematical characteristic, the structured data is determined as watermark data.

The method for verifying ownership of a structured data set according to claim 1 or 2, wherein the preset rule comprises a message authentication code algorithm or a deterministic digital signature algorithm.

The method for verifying ownership of a structured data set according to claim 1, wherein determining the ownership relationship between the target object and the structured data set according to the ratio result and the ratio label comprises:

Calculating a difference value between the ratio result and the ratio label;

If the difference value is less than a second threshold, it is determined that the target object is the owner of the structured data set.

The method for verifying ownership of a structured data set according to claim 4, wherein calculating the difference value between the ratio result and the ratio label comprises:

A difference between the ratio result and the ratio label is calculated, and an absolute value of the difference is determined as a difference value.

The difference between the ratio result and the ratio label is calculated, and the ratio of the absolute value of the difference to the ratio label is determined as the difference value.

A method for processing a structured data set comprises the following steps:

Acquire an original data set and ownership mark information; wherein the original data set is used to store structured data, and the structured data meets a predetermined data format; the ownership mark information includes secret information, a ratio label and a specific mathematical property; the specific mathematical property is used to constrain a check value calculated by using the secret information and watermark data according to a preset rule to meet a preset mathematical feature; wherein, according to the preset rule, the probability that any check value calculated by using the data meeting the predetermined data format and the secret information meets the preset mathematical feature is less than a first threshold; the ratio label is greater than the first threshold;

Determining watermark data from data satisfying the predetermined data format according to the secret information and the specific mathematical property;

Determining a target amount of watermark data to be added to the original data set according to the amount of business data included in the original data set and the ratio label;

The target amount of watermark data is added to the original data set to obtain a target data set.

The method for processing a structured data set according to claim 7, wherein acquiring the ownership mark information comprises:

Acquire association information corresponding to the original data set; the association information is used to characterize the ownership of the original data set;

The secret information is generated based on the associated information.

The method for processing a structured data set according to claim 7, wherein adding the target amount of watermark data to the original data set to obtain the target data set comprises:

Determining an insertion position of the target quantity in the original data set;

Each watermark data is added to an insertion position in the original data set to obtain a target data set.

The method for processing a structured data set according to claim 9, wherein determining the insertion position of the target number in the original data set comprises:

The original data set is processed by a random insertion algorithm, a group insertion algorithm, a time series hybrid algorithm or a hybrid encryption algorithm to determine the insertion positions of the target quantity.

A device for verifying ownership of a structured data set, comprising:

A first acquisition unit is used to acquire a structured data set; the structured data set includes a plurality of structured data, each of which is business data or watermark data, and the business data and the watermark data meet the same predetermined data format;

A second acquisition unit is used to acquire, from the target object to be verified, the secret information corresponding to the structured data set, the ratio label of the watermark data and the specific mathematical property corresponding to the watermark data; the specific mathematical property is used to constrain the check value calculated by using the secret information and the watermark data according to the preset rule to meet the preset mathematical feature; wherein, according to the preset rule, the probability that a check value calculated from any data satisfying the predetermined data format and the secret information meets the preset mathematical feature is less than a first threshold; and the ratio label is greater than the first threshold;

a processing unit, configured to identify the watermark data from the structured data set according to the secret information and the specific mathematical property;

A statistical unit is used to count the proportion of the watermark data in the structured data set, and determine the ownership relationship between the target object and the structured data set according to the proportion result and the proportion label.

The device for verifying ownership of a structured data set according to claim 11, wherein the processing unit is further configured to:

A structured data set processing device, comprising:

An information acquisition unit, used to acquire an original data set and ownership mark information; wherein the original data set is used to store structured data, and the structured data meets a predetermined data format; the ownership mark information includes secret information, a ratio label and a specific mathematical property; the specific mathematical property is used to constrain a check value calculated by using the secret information and watermark data according to a preset rule to meet a preset mathematical feature; wherein, according to the preset rule, the probability that any check value calculated by using the data meeting the predetermined data format and the secret information meets the preset mathematical feature is less than a first threshold; the ratio label is greater than the first threshold;

A first determining unit, configured to determine watermark data from data satisfying the predetermined data format according to the secret information and the specific mathematical property;

A second determining unit, configured to determine a target amount of watermark data to be added to the original data set according to the amount of business data included in the original data set and the ratio label;

The data set acquisition unit is used to add the target amount of watermark data to the original data set to obtain a target data set.

An electronic device comprises a processor and a memory;

The memory is used to store programs;

The processor executes the program to implement the method according to any one of claims 1 to 10.

A computer-readable storage medium stores a program, wherein the program is executed by a processor to implement the method according to any one of claims 1 to 10.