CN115168504A

CN115168504A - A method and device for determining functional dependencies

Info

Publication number: CN115168504A
Application number: CN202210699530.4A
Authority: CN
Inventors: 王天振; 陈长城; 庞艳蓓; 顾云帆
Original assignee: Alibaba Cloud Computing Ltd
Current assignee: Alibaba Cloud Computing Ltd
Priority date: 2022-06-20
Filing date: 2022-06-20
Publication date: 2022-10-11

Abstract

One or more embodiments of the present specification provide a method and an apparatus for determining function dependence, including: acquiring an original data set, wherein the original data set comprises a plurality of tuples, and each tuple comprises field values corresponding to at least two fields; generating a corresponding partition set aiming at each field respectively, wherein the partition set comprises at least one tuple set, and the values of the tuples in each tuple set on the fields corresponding to the partition set are the same; sampling all tuple sets to obtain tuple pairs, wherein each tuple pair comprises two tuples belonging to the same tuple set; generating a corresponding invalid function dependency for each tuple group and adding the invalid function dependency into an invalid function dependency set to obtain all invalid function dependencies corresponding to the sampled tuple groups; and inverting the invalid function dependence set to obtain an effective function dependence set, wherein the effective function dependence set comprises effective function dependence corresponding to the original data set.

Description

A method and device for determining functional dependencies

技术领域technical field

本说明书一个或多个实施例涉及数据处理领域，尤其涉及一种函数依赖的确定方法及装置。One or more embodiments of this specification relate to the field of data processing, and in particular, to a method and apparatus for determining functional dependencies.

背景技术Background technique

随着数据资产的日益增多，数据处理流程变得愈加复杂，人工处理数据的方式已经难以应对大数据时代下爆发式增长的企业数据。在面对字段数量巨大、数据类型繁多的数据集时，往往需要先确定字段之间的关系，再进一步对数据集进行处理，而使用函数依赖表征字段之间的关系是现阶段的主流方法。With the increasing number of data assets, the data processing process has become more and more complex, and the way of manual data processing has been difficult to cope with the explosive growth of enterprise data in the era of big data. When faced with a dataset with a huge number of fields and a wide variety of data types, it is often necessary to determine the relationship between the fields first, and then further process the dataset. Using functional dependencies to characterize the relationship between fields is the mainstream method at this stage.

虽然相关技术中提出了一些基于数据集确定函数依赖的方式，但往往效率较低，增加了计算和时间开销。Although some methods for determining functional dependencies based on data sets are proposed in the related art, they are often inefficient and increase computational and time overhead.

发明内容SUMMARY OF THE INVENTION

有鉴于此，本说明书一个或多个实施例提供一种函数依赖的确定方法及装置，可以解决相关技术中存在的不足。In view of this, one or more embodiments of the present specification provide a method and apparatus for determining functional dependencies, which can solve the deficiencies in the related art.

为实现上述目的，本说明书一个或多个实施例提供技术方案如下：To achieve the above purpose, one or more embodiments of this specification provide the following technical solutions:

根据本说明书一个或多个实施例的第一方面，提出了一种函数依赖的确定方法，所述方法包括：According to a first aspect of one or more embodiments of the present specification, a method for determining functional dependencies is proposed, and the method includes:

获取原始数据集，所述原始数据集包含多个元组，每个元组包含至少两个字段对应的字段值；Obtaining an original data set, the original data set includes a plurality of tuples, and each tuple includes field values corresponding to at least two fields;

针对每个字段分别生成相应的划分集，所述划分集中包含至少一个元组集合，每一元组集合内的元组在所述划分集对应字段上的字段值相同；A corresponding partition set is generated for each field, the partition set includes at least one tuple set, and the tuples in each tuple set have the same field value on the corresponding field of the partition set;

对所有元组集合进行采样得到元组对，每一元组对包含属于同一元组集合的两个元组；Sampling all tuple sets to obtain tuple pairs, each tuple pair contains two tuples belonging to the same tuple set;

针对每一元组对生成相应的无效函数依赖并添加至无效函数依赖集合中，以得到已采样的元组对对应的所有无效函数依赖；A corresponding invalid functional dependency is generated for each tuple pair and added to the invalid functional dependency set to obtain all invalid functional dependencies corresponding to the sampled tuple pair;

对所述无效函数依赖集合进行反演得到有效函数依赖集合，所述有效函数依赖集合包含所述原始数据集对应的有效函数依赖。Inverting the invalid functional dependency set to obtain a valid functional dependency set, where the valid functional dependency set includes valid functional dependencies corresponding to the original data set.

根据本说明书一个或多个实施例的第二方面，提出了一种基于函数依赖的敏感字段的推理方法，所述方法包括：According to a second aspect of one or more embodiments of the present specification, a method for reasoning based on a sensitive field of functional dependencies is proposed, and the method includes:

获取基于数据集生成的函数依赖，所述函数依赖用于表征所述数据集所含字段的关系；obtaining a functional dependency generated based on a data set, where the functional dependency is used to characterize the relationship of fields contained in the data set;

确定用户在所述数据集的字段中标记的敏感字段；determine the sensitive fields that the user has flagged in the fields of the dataset;

从所述函数依赖中确定目标函数依赖，所述目标函数依赖的左部集所含字段属于所述敏感字段；Determine objective functional dependencies from the functional dependencies, and the fields contained in the left part set of the objective functional dependencies belong to the sensitive fields;

若所述目标函数依赖的右部集所含字段区别于所述敏感字段，则判定所述目标函数依赖的右部集所含字段为潜在的敏感字段。If the field contained in the right part set of the objective functional dependency is different from the sensitive field, it is determined that the field contained in the right part set of the objective functional dependence is a potential sensitive field.

根据本说明书一个或多个实施例的第三方面，提出了一种函数依赖的确定装置，所述装置包括：According to a third aspect of one or more embodiments of the present specification, an apparatus for determining functional dependencies is provided, the apparatus comprising:

第一获取单元：获取原始数据集，所述原始数据集包含多个元组，每个元组包含至少两个字段对应的字段值；The first obtaining unit: obtains an original data set, the original data set includes a plurality of tuples, and each tuple includes field values corresponding to at least two fields;

划分单元：针对每个字段分别生成相应的划分集，所述划分集中包含至少一个元组集合，每一元组集合内的元组在所述划分集对应字段上的字段值相同；A division unit: respectively generate a corresponding division set for each field, the division set includes at least one tuple set, and the tuples in each tuple set have the same field value on the corresponding field of the division set;

采样单元：对所有元组集合进行采样得到元组对，每一元组对包含属于同一元组集合的两个元组；Sampling unit: sampling all tuple sets to obtain tuple pairs, each tuple pair contains two tuples belonging to the same tuple set;

生成单元：针对每一元组对生成相应的无效函数依赖并添加至无效函数依赖集合中，以得到已采样的元组对对应的所有无效函数依赖；Generating unit: generate corresponding invalid functional dependencies for each tuple pair and add them to the invalid functional dependency set to obtain all invalid functional dependencies corresponding to the sampled tuple pairs;

反演单元：对所述无效函数依赖集合进行反演得到有效函数依赖集合，所述有效函数依赖集合包含所述原始数据集对应的有效函数依赖。Inversion unit: inverting the invalid functional dependency set to obtain a valid functional dependency set, where the valid functional dependency set includes valid functional dependencies corresponding to the original data set.

根据本说明书一个或多个实施例的第四方面，提出了一种基于函数依赖的敏感字段的推理装置，所述装置包括：According to a fourth aspect of one or more embodiments of the present specification, an inference device based on a sensitive field of functional dependencies is proposed, the device comprising:

第二获取单元：获取基于数据集生成的函数依赖，所述函数依赖用于表征所述数据集所含字段的关系；The second obtaining unit: obtains a functional dependency generated based on the data set, the functional dependency is used to characterize the relationship of the fields contained in the data set;

第四确定单元：确定用户在所述数据集的字段中标记的敏感字段；the fourth determining unit: determining the sensitive fields marked by the user in the fields of the data set;

第五确定单元：从所述函数依赖中确定目标函数依赖，所述目标函数依赖的左部集所含字段属于所述敏感字段；The fifth determination unit: determine the objective functional dependency from the functional dependency, and the fields contained in the left part set of the objective functional dependency belong to the sensitive field;

判定单元：若所述目标函数依赖的右部集所含字段区别于所述敏感字段，则判定所述目标函数依赖的右部集所含字段为潜在的敏感字段。Determining unit: if the fields included in the right part set of the objective functional dependency are different from the sensitive fields, determine that the fields included in the right part set of the objective functional dependence are potential sensitive fields.

根据本说明书一个或多个实施例的第五方面，提出了一种电子设备，包括：According to a fifth aspect of one or more embodiments of the present specification, an electronic device is proposed, including:

处理器；processor;

用于存储处理器可执行指令的存储器；memory for storing processor-executable instructions;

其中，所述处理器通过运行所述可执行指令以实现如第一方面或第二方面所述的方法。Wherein, the processor implements the method according to the first aspect or the second aspect by executing the executable instructions.

根据本说明书一个或多个实施例的第六方面，提出了一种计算机可读存储介质，其上存储有计算机指令，该指令被处理器执行时实现如第一方面或第二方面所述方法的步骤。According to a sixth aspect of one or more embodiments of the present specification, there is provided a computer-readable storage medium having computer instructions stored thereon, the instructions, when executed by a processor, implement the method according to the first aspect or the second aspect A step of.

由以上技术方案可见，本说明书一个或多个实施例提供的函数依赖的确定方法先获取包含多个元组的原始数据集，该原始数据集中的元组包含至少两个字段对应的字段值，接着针对每个字段生成包含至少一个元组集合的划分集，然后对所有元组集合进行采样得到元组对，该元组对中包含属于同一元组集合的两个元组，针对每一元组对生成相应的无效函数依赖并添加至无效函数依赖集合中，以得到已采样的元组对对应的所有无效函数依赖。由于同一元组集合所含元组在某一字段上的字段值相同，因此基于采样得到的元组对必定可以生成无效函数依赖。相对于直接对原始数据集的元组进行对比，先生成划分集可以减少元组对数量，从而减少后续通过元组对生成无效函数依赖的计算与时间开销。对无效函数依赖集合进行反演得到有效函数依赖集合，该有效函数依赖集合包含原始数据集对应的有效函数依赖。与直接验证函数依赖是否成立相比，本说明书通过无效函数依赖反演生成待验证的函数依赖，并利用无效函数依赖集合确定有效函数依赖，既减少了需要验证的函数依赖数量，又简化了验证函数依赖的步骤，从而提高了确定函数依赖的效率。It can be seen from the above technical solutions that the method for determining functional dependencies provided by one or more embodiments of this specification first obtains an original data set containing multiple tuples, and the tuples in the original data set contain field values corresponding to at least two fields, Next, a division set containing at least one tuple set is generated for each field, and then all tuple sets are sampled to obtain a tuple pair, which contains two tuples belonging to the same tuple set, for each tuple set The corresponding invalid functional dependencies are generated and added to the invalid functional dependency set to obtain all invalid functional dependencies corresponding to the sampled tuple pairs. Since the tuples contained in the same tuple set have the same field value on a certain field, invalid functional dependencies must be generated based on the tuple pairs obtained by sampling. Compared with directly comparing the tuples of the original data set, generating the partition set first can reduce the number of tuple pairs, thereby reducing the computational and time overhead of generating invalid function dependencies through tuple pairs later. The valid functional dependency set is obtained by inverting the invalid functional dependency set, and the valid functional dependency set includes the valid functional dependencies corresponding to the original data set. Compared with directly verifying whether a functional dependency is established, this specification generates functional dependencies to be verified through invalid functional dependency inversion, and uses the invalid functional dependency set to determine valid functional dependencies, which not only reduces the number of functional dependencies that need to be verified, but also simplifies the verification. steps of functional dependencies, thereby improving the efficiency of determining functional dependencies.

附图说明Description of drawings

图1是一示例性实施例提供的一种函数依赖的确定方法的系统架构图。FIG. 1 is a system architecture diagram of a method for determining functional dependencies provided by an exemplary embodiment.

图2是一示例性实施例提供的一种函数依赖的确定方法的流程图。FIG. 2 is a flowchart of a method for determining functional dependencies provided by an exemplary embodiment.

图3是一示例性实施例提供的一种生成划分集的示意图。FIG. 3 is a schematic diagram of generating a partition set according to an exemplary embodiment.

图4是一示例性实施例提供的一种按顺序采样的示意图。FIG. 4 is a schematic diagram of sequential sampling provided by an exemplary embodiment.

图5是一示例性实施例提供的一种生成无效函数依赖的示意图。FIG. 5 is a schematic diagram of generating an invalid functional dependency according to an exemplary embodiment.

图6是一示例性实施例提供的一种构建二叉搜索树的示意图。FIG. 6 is a schematic diagram of constructing a binary search tree provided by an exemplary embodiment.

图7是一示例性实施例提供的一种生成有效函数依赖的示意图。FIG. 7 is a schematic diagram of generating effective functional dependencies according to an exemplary embodiment.

图8是一示例性实施例提供的一种滑动串口采样的示意图。FIG. 8 is a schematic diagram of a sliding serial port sampling provided by an exemplary embodiment.

图9是一示例性实施例提供的一种基于函数依赖的敏感字段的推理方法的流程图。FIG. 9 is a flowchart of a method for reasoning based on a sensitive field of functional dependencies provided by an exemplary embodiment.

图10是一示例性实施例提供的一种设备的示意结构图。Fig. 10 is a schematic structural diagram of a device provided by an exemplary embodiment.

图11是一示例性实施例提供的一种函数依赖的确定装置的框图。Fig. 11 is a block diagram of an apparatus for determining functional dependencies provided by an exemplary embodiment.

图12是一示例性实施例提供的一种基于函数依赖的敏感字段的推理装置的框图。FIG. 12 is a block diagram of an inference apparatus based on a sensitive field of functional dependencies provided by an exemplary embodiment.

具体实施方式Detailed ways

这里将详细地对示例性实施例进行说明，其示例表示在附图中。下面的描述涉及附图时，除非另有表示，不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本说明书一个或多个实施例相一致的所有实施方式。相反，它们仅是与如所附权利要求书中所详述的、本说明书一个或多个实施例的一些方面相一致的装置和方法的例子。Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with one or more embodiments of this specification. Rather, they are merely examples of apparatus and methods consistent with some aspects of one or more embodiments of this specification, as recited in the appended claims.

需要说明的是：在其他实施例中并不一定按照本说明书示出和描述的顺序来执行相应方法的步骤。在一些其他实施例中，其方法所包括的步骤可以比本说明书所描述的更多或更少。此外，本说明书中所描述的单个步骤，在其他实施例中可能被分解为多个步骤进行描述；而本说明书中所描述的多个步骤，在其他实施例中也可能被合并为单个步骤进行描述。It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. In addition, a single step described in this specification may be decomposed into multiple steps for description in other embodiments; and multiple steps described in this specification may also be combined into a single step in other embodiments. describe.

为对本说明书一个或多个实施例进行进一步说明，提供下列实施例：To further illustrate one or more embodiments of this specification, the following examples are provided:

图1是一示例性实施例提供的一种函数依赖的确定方法的系统架构图，如图1所示，该系统架构图包括：数据库11、服务器12；数据库11用于存储和提供数据集，而服务器12用于对数据库11所提供的数据集进行处理，以确定出函数依赖。FIG. 1 is a system architecture diagram of a method for determining functional dependencies provided by an exemplary embodiment. As shown in FIG. 1, the system architecture diagram includes: a database 11 and a server 12; the database 11 is used for storing and providing data sets, The server 12 is used to process the data set provided by the database 11 to determine functional dependencies.

其中，函数依赖是指一个属性集可以决定另一属性集时，称另一属性集依赖于该属性集。从数学角度对函数依赖进行介绍：设R(U)是一个属性集U上的关系模式，X和Y是U的子集，若对于R(U)的任意两个可能的关系r1、r2，若r1[x]＝r2[x],则r1[y]＝r2[y]，或者若r1[y]不等于r2[y],则r1[x]不等于r2[x]，称X决定Y，或者Y依赖于X。Among them, functional dependency means that when one attribute set can determine another attribute set, the other attribute set is said to depend on the attribute set. Introduce functional dependencies from a mathematical point of view: let R(U) be a relational schema on an attribute set U, X and Y are subsets of U, if for any two possible relations r1, r2 of R(U), If r1[x]=r2[x], then r1[y]=r2[y], or if r1[y] is not equal to r2[y], then r1[x] is not equal to r2[x], which is called X decision Y, or Y depends on X.

以学生表的场景为例，通过一个学生的学号可以确定该学生的姓名，通过一个学生的姓名也可以确定该学生的学号，这种情况称为姓名属性依赖于学号属性，或者学号属性依赖于姓名属性，用“{学号}→{姓名}”、“{姓名}→{学号}”表示这两个函数依赖。通过一个学生的年龄无法确定该学生的姓名，但是通过一个学生的姓名可以确定该学生的年龄，这种情况称为姓名属性决定年龄属性，或者年龄属性依赖于姓名属性，用“{姓名}→{年龄}”表示该函数依赖。通过一个学生的年龄无法确定该学生的学校，通过一个学生的学校也无法确定该学生的课程，这种情况下，年龄属性和学校属性之间不存在函数依赖关系。Taking the scenario of the student table as an example, the name of a student can be determined by the student number of a student, and the student number of the student can also be determined by the name of a student. This situation is called the name attribute depends on the student number attribute, or The number attribute depends on the name attribute, and these two functional dependencies are represented by "{student number}→{name}" and "{name}→{student number}". The name of a student cannot be determined by the age of the student, but the age of the student can be determined by the name of a student. This situation is called the name attribute determines the age attribute, or the age attribute depends on the name attribute, with "{name}→ {age}" indicates the functional dependency. A student's school cannot be determined from a student's age, nor can the student's course be determined from a student's school. In this case, there is no functional dependency between the age attribute and the school attribute.

可以看出，属性集X和属性集Y之间的关系模式与函数依赖相对应：当X和Y之间是一对一关系时，如学号和姓名之间就是一对一关系，则存在函数依赖“X→Y”或者“Y→X”；当X和Y之间是一对多关系时，如年龄和姓名之间就是一对多关系，则存在函数依赖“Y→X”；如果X和Y之间是多对多关系，如学生和课程之间就是多对多关系，则X和Y之间不存在函数依赖。It can be seen that the relationship pattern between attribute set X and attribute set Y corresponds to functional dependency: when there is a one-to-one relationship between X and Y, such as a one-to-one relationship between student ID and name, there is a one-to-one relationship between X and Y. Functional dependency "X→Y" or "Y→X"; when there is a one-to-many relationship between X and Y, such as a one-to-many relationship between age and name, there is a functional dependency "Y→X"; if There is a many-to-many relationship between X and Y. For example, there is a many-to-many relationship between students and courses, and there is no functional dependency between X and Y.

在本说明书的技术方案中，数据库11可以为任意类型的关系数据库，该数据库11所维护的数据集被承载于表格中；其中，数据集中的每条记录被记载为表格中的一行数据，即一个元组。比如在上述学生表的场景下，每个学生的姓名、学号、年龄等信息在学生表中被记载在一行，形成为相应的元组。同时，表格的列为字段，而每行数据包含对应于各个字段的字段值，比如：每个学生的信息包含对应于姓名、学号、年龄等字段的字段值，即为相应学生在这些字段所表征的维度上的属性。In the technical solution of this specification, the database 11 can be any type of relational database, and the data set maintained by the database 11 is carried in a table; wherein, each record in the data set is recorded as a row of data in the table, that is, a tuple. For example, in the scenario of the above student table, information such as the name, student ID, age and other information of each student is recorded in a row in the student table, forming a corresponding tuple. At the same time, the columns of the table are fields, and each row of data contains field values corresponding to each field. For example, the information of each student contains field values corresponding to fields such as name, student ID, age, etc., that is, the corresponding student in these fields. Attributes on the dimension represented.

服务器12可以为包含一独立主机的物理服务器，或者主机集群承载的虚拟服务器。在运行过程中，服务器12从数据库11获得数据集，并通过本说明书的技术方案对该数据集进行处理，从而确定出函数依赖。其中，虽然在图1所示的实施例中，数据库11示意为与服务器12相互独立，但在一些情况下，数据库11也可以被部署在服务器12的本地存储空间，本说明书并不对此进行限制。The server 12 may be a physical server comprising an independent host, or a virtual server hosted by a cluster of hosts. During the running process, the server 12 obtains the data set from the database 11, and processes the data set through the technical solution of this specification, thereby determining the functional dependency. Wherein, although in the embodiment shown in FIG. 1, the database 11 is shown as being independent of the server 12, but in some cases, the database 11 can also be deployed in the local storage space of the server 12, which is not limited in this specification .

实际上，本说明书的技术方案可以通过对数据集的处理，高效、快捷地分析得到相应的函数依赖，以便于根据所得的函数依赖实现进一步的处理目的。例如，在敏感字段分析的场景下，通过确定出函数依赖，可以用于对已知的敏感字段进行分析，从而推测出潜在的敏感字段。下面结合实施例对本说明书的技术方案进行说明。In fact, the technical solution of the present specification can analyze and obtain corresponding functional dependencies efficiently and quickly by processing the data set, so as to realize further processing purposes according to the obtained functional dependencies. For example, in the scenario of sensitive field analysis, by determining the functional dependency, it can be used to analyze the known sensitive fields, so as to infer potential sensitive fields. The technical solutions of the present specification will be described below with reference to the embodiments.

图2是一示例性实施例提供的一种函数依赖的确定方法的流程图。如图2所示，该方法可以包括以下步骤：FIG. 2 is a flowchart of a method for determining functional dependencies provided by an exemplary embodiment. As shown in Figure 2, the method may include the following steps:

步骤201，获取原始数据集，所述原始数据集包含多个元组，每个元组包含至少两个字段对应的字段值。Step 201: Obtain an original data set, where the original data set includes multiple tuples, and each tuple includes field values corresponding to at least two fields.

数据集可以指一种由数据所组成的集合，通常以表格形式出现。每一列代表一个特定字段，每一行对应一个元组，每个元组在字段上对应有字段值。原始数据集可以指未经过处理的数据集，与之对应的是替换数据集，原始数据集处理后转换为替换数据集的过程会在后续的步骤中进行详细介绍，此处不再赘述。A dataset can refer to a collection of data, usually in tabular form. Each column represents a specific field, each row corresponds to a tuple, and each tuple corresponds to a field value on the field. The original data set can refer to the unprocessed data set, which corresponds to the replacement data set. The process of converting the original data set into the replacement data set after processing will be described in detail in the subsequent steps, and will not be repeated here.

元组序号Tuple sequence number 姓名Name 性别gender 年龄age 血压blood pressure 用药Medication 11 小红little red 女Female 1111 正常normal 药物XDrugX 22 小白noob 男male 2020 低Low 药物BDrug B 33 小黑little black 男male 6565 正常normal 药物XDrugX 44 小蓝little blue 男male 3535 高high 药物ADrug A 55 小绿little green 女Female 24twenty four 高high 药物ADrug A 66 小黄little yellow 男male 24twenty four 正常normal 药物XDrugX

表1Table 1

下面以表1为例对函数依赖的确定方法进行详细介绍，表1是一示例性实施例提供的一种数据集，如表1所示，该原始数据集包括6个元组，分别对应6个元组序号“1、2、3、4、5、6”，还包括5个不同的字段，分别为“姓名、性别、年龄、血压、用药”。6个元组在这5个字段上对应有字段值，例如：元组序号为“1”的元组，在“姓名”字段上对应的字段值为“小红”，在“性别”字段上对应的字段值为“女”，在“年龄”字段上对应的字段值为“11”，在“血压”字段上对应的字段值为“正常”，在“用药”字段上对应的字段值为“药物X”。不同元组在同一字段上的字段值可以相同也可以不同，例如：元组序号为“3”的元组在“姓名”字段上的字段值为“小黑”，在“血压”字段上的字段值为“正常”，该元组与元组序号为“1”的元组在“姓名”字段上的字段值不相同，在血压字段上的字段值相同。在后续的步骤中，正是通过对比各个元组在同一字段上字段值的差异，来生成相应的无效函数依赖。Table 1 is used as an example to introduce the method for determining functional dependencies in detail below. Table 1 is a data set provided by an exemplary embodiment. As shown in Table 1, the original data set includes 6 tuples, corresponding to 6 tuples respectively. The tuple serial number "1, 2, 3, 4, 5, 6" also includes 5 different fields, namely "name, gender, age, blood pressure, medication". The 6 tuples have field values corresponding to these 5 fields, for example: a tuple whose tuple serial number is "1", the corresponding field value in the "Name" field is "Xiaohong", and in the "Gender" field The corresponding field value is "female", the corresponding field value on the "age" field is "11", the corresponding field value on the "blood pressure" field is "normal", and the corresponding field value on the "medication" field is "Drug X". The field values of different tuples on the same field can be the same or different. For example, the field value of the tuple with the tuple number "3" on the "Name" field is "Little Black", and the field value on the "Blood Pressure" field. The field value is "Normal", the tuple and the tuple whose tuple sequence number is "1" have different field values in the "Name" field, and the same field value in the blood pressure field. In the subsequent steps, the corresponding invalid functional dependencies are generated by comparing the differences in the field values of each tuple on the same field.

步骤202，针对每个字段分别生成相应的划分集，所述划分集中包含至少一个元组集合，每一元组集合内的元组在所述划分集对应字段上的字段值相同。Step 202: Generate a corresponding partition set for each field, the partition set includes at least one tuple set, and the tuples in each tuple set have the same field value on the corresponding field of the partition set.

图3是一示例性实施例提供的一种生成划分集的示意图，如图3所示，针对表1的5个字段生成相应的5个划分集，每个划分集都包含多个元组集合，每个元组集合内的元组在该划分集对应字段上的字段值相同，例如：针对“性别”字段，仅存在2种字段值，分别为“男”、“女”，在“性别”字段上字段值为“男”的元组对应的元组序号有“2、3、4、6”，在“性别”字段上字段值为“女”的元组对应的元组序号有“1、5”，那么分别将元组序号“2、3、4、6”对应的元组和元组序号“1，5”对应的元组添加至两个不同的元组集合，而这两个元组集合都被划分至“性别”划分集。最终结果如图3所示，存在5个划分集，分别对应“姓名、性别、年龄、血压、用药”，在“性别”划分集中存在2个元组集合，第一个元组集合为“{2，3，4，6}”，存在4个元组，这4个元组在性别字段上的字段值为“男”，第二个元组集合“{1，5}”，存在2个元组，这2个元组在性别字段上的字段值为“女”。FIG. 3 is a schematic diagram of generating a partition set provided by an exemplary embodiment. As shown in FIG. 3 , corresponding five partition sets are generated for the five fields in Table 1, and each partition set includes multiple tuple sets. , the tuples in each tuple set have the same field value on the corresponding field of the division set, for example: for the "gender" field, there are only two field values, namely "male" and "female", and in the "gender" field, there are only two field values. The tuple number corresponding to the tuple whose field value is "male" on the "" field is "2, 3, 4, 6", and the tuple number corresponding to the tuple whose field value is "female" on the "gender" field is " 1, 5", then add the tuple corresponding to the tuple number "2, 3, 4, 6" and the tuple corresponding to the tuple number "1, 5" to two different tuple sets, and the two Each tuple set is divided into the "gender" partition set. The final result is shown in Figure 3. There are 5 division sets, corresponding to "name, gender, age, blood pressure, and medication" respectively. There are 2 tuple sets in the "gender" division set, and the first tuple set is "{ 2, 3, 4, 6}", there are 4 tuples, the field value of these 4 tuples on the gender field is "male", the second tuple set "{1, 5}", there are 2 Tuple, the field value of these 2 tuples on the gender field is "female".

该实施例通过将字段值相同的元组划分在同一元组集合中，使得在后续对元组集合进行采样的过程中，仅针对在某一字段上具有相同字段值的元组进行采样，减少了采样的元组对数量，提高了采样效率，并进一步减少了后续生成函数依赖的计算开销。In this embodiment, the tuples with the same field value are divided into the same tuple set, so that in the subsequent process of sampling the tuple set, only the tuples with the same field value in a certain field are sampled, reducing the number of The number of sampled tuple pairs is increased, the sampling efficiency is improved, and the computational overhead of subsequent generation function dependencies is further reduced.

步骤203，对所有元组集合进行采样得到元组对，每一元组对包含属于同一元组集合的两个元组。Step 203, sampling all tuple sets to obtain tuple pairs, each tuple pair includes two tuples belonging to the same tuple set.

在采样过程中，可以对划分集或者划分集中的元组集合进行排序，并按照一定顺序对元组集合进行采样，也可以不对元组集合进行排序，随机进行采样，本说明书并不对此进行限制。During the sampling process, the partition set or the tuple set in the partition set can be sorted, and the tuple set can be sampled in a certain order, or the tuple set can be sampled randomly without sorting, which is not limited in this specification. .

在一实施例中，将每一划分集作为一个队列单元，并根据效率值分别对每一队列单元中的元组集合进行排序，所述效率值的大小与相应元组集合包含的元组数量呈正相关；在对任一队列单元中的元组集合进行采样时，按照效率值从大到小的顺序依次对该队列单元中的元组集合进行采样得到元组对。In an embodiment, each partition set is used as a queue unit, and the tuple sets in each queue unit are sorted according to the efficiency value, the size of the efficiency value and the number of tuples included in the corresponding tuple set. It is positively correlated; when sampling the tuple set in any queue unit, the tuple set in the queue unit is sampled in order according to the efficiency value in descending order to obtain tuple pairs.

图4是一示例性实施例提供的一种按顺序采样的示意图，如图4所示，在“姓名”划分集中，字段值为“男”的元组集合的元组数量为4，字段值为“女”的元组集合的元组数量为2，那么字段值为“男”的元组集合的效率值要高于另一元组集合，因此先对该元组集合进行采样。而对字段值为“男”的元组集合进行采样的结果为“{2，3}、{3，4}、{4，6}、{2，4}、{3，6}、{4，6}”，对字段值为“女”的元组集合的采样结果为“{1，5}”，得到7个元组对。同理，其他划分集中的元组集合也根据效率值进行排序并采样。最终的采样结果应为“{2，3}、{3，4}、{4，6}、{1，5}、{5，6}、{1，3}、{3，6}、{4，5}、{2，4}、{1，6}、{2，6}”，一共11个元组对。Fig. 4 is a schematic diagram of sequential sampling provided by an exemplary embodiment. As shown in Fig. 4, in the "name" partition set, the number of tuples in the tuple set with the field value "male" is 4, and the field value is 4. If the number of tuples in the "female" tuple set is 2, then the efficiency value of the tuple set whose field value is "male" is higher than that of another tuple set, so the tuple set is sampled first. The result of sampling the set of tuples whose field value is "male" is "{2, 3}, {3, 4}, {4, 6}, {2, 4}, {3, 6}, {4 , 6}", the sampling result of the tuple set whose field value is "female" is "{1, 5}", and 7 tuple pairs are obtained. Similarly, sets of tuples in other partition sets are also sorted and sampled according to the efficiency value. The final sampling result should be "{2, 3}, {3, 4}, {4, 6}, {1, 5}, {5, 6}, {1, 3}, {3, 6}, { 4, 5}, {2, 4}, {1, 6}, {2, 6}", a total of 11 tuple pairs.

该实施例根据元组集合的效率值对元组集合进行排序，对效率值高的元组集合进行优先采样，使得多数元组对可以在采样前期就被提取出来，让整个函数依赖的增长曲线更早的趋近于平缓，从而可以提前结束采样，节省采样时间。In this embodiment, the tuple sets are sorted according to the efficiency values of the tuple sets, and the tuple sets with high efficiency values are preferentially sampled, so that most tuple pairs can be extracted in the early stage of sampling, and the growth curve of the entire function dependence can be reduced. The earlier tends to be flat, so that sampling can be ended early, saving sampling time.

步骤204，针对每一元组对生成相应的无效函数依赖并添加至无效函数依赖集合中，以得到已采样的元组对对应的所有无效函数依赖。Step 204 , generate corresponding invalid functional dependencies for each tuple pair and add them to the invalid functional dependency set, so as to obtain all invalid functional dependencies corresponding to the sampled tuple pairs.

无效函数依赖可以为不成立的函数依赖，无效函数依赖的左部集为两个元组具有相同字段值的所有字段，右部集为相应元组对所含两个元组具有不同字段值的任一字段。在数据集中，相同的字段值的字段是无法决定不同字段值的字段的。以元组对“{2，3}”为例，元组2和元组3只有在“性别”字段上的字段值相同，在其余字段上的字段值均不同，显然，通过“性别”字段是无法确定元组2和元组3在其他字段上的字段值的，因此使用该种方式生成的函数依赖一定是不成立的函数依赖，即无效函数依赖。An invalid functional dependency can be a functional dependency that does not hold true. The left part of an invalid functional dependency is all fields whose two tuples have the same field value, and the right part is any corresponding tuple with different field values. a field. In a dataset, fields with the same field value cannot determine fields with different field values. Taking the tuple pair "{2, 3}" as an example, tuple 2 and tuple 3 only have the same field value on the "gender" field, and the field values on the other fields are different. Obviously, through the "gender" field It is impossible to determine the field values of tuple 2 and tuple 3 on other fields, so the functional dependencies generated in this way must be invalid functional dependencies, that is, invalid functional dependencies.

图5是一示例性实施例提供的一种生成无效函数依赖的示意图，如图5所示，根据元组对中的2个元组在所有字段上的字段值差异生成无效函数依赖，以元组对“{2，3}”为例，元组2和元组3只有在“性别”字段上的字段值相同，在其余字段上的字段值均不同，将“性别”字段作为无效函数依赖的左部集，将其余字段作为无效函数依赖的右部集，这样可以生成4个无效函数依赖，分别为“{性别}→{姓名}、{性别}→{年龄}、{性别}→{血压}、{性别}→{用药}”。同理，对所有元组对内的元组均进行对比，生成无效函数依赖。其中，通过元组对“{2，3}、{3，4}、{4，6}、{1，5}、{2，4}、{2，6}”得到的函数依赖相同，均为：{性别}→{姓名}、{性别}→{年龄}、{性别}→{血压}、{性别}→{用药}；通过元组对“{5，6}”得到的函数依赖为：{年龄}→{姓名}、{年龄}→{性别}、{年龄}→{血压}、{年龄}→{用药}；通过元组对“{1，3}{4，5}{1，6}”得到的函数依赖均为：{血压，用药}→{姓名}、{血压，用药}→{性别}、{血压，用药}→{年龄}；通过元组对“{3，6}”得到的函数依赖为：{性别，血压，用药}→{姓名}、{性别，血压，用药}→{年龄}。该实施例通过将元组对在所有字段上的字段值进行对比，得到了无效函数依赖，为后续生成有效函数依赖做铺垫。FIG. 5 is a schematic diagram of generating an invalid functional dependency provided by an exemplary embodiment. As shown in FIG. 5 , an invalid functional dependency is generated according to the field value difference of two tuples in a tuple pair on all fields, and the Take the group pair "{2, 3}" as an example, tuple 2 and tuple 3 only have the same field value in the "gender" field, and the field values in the other fields are different, and the "gender" field is regarded as an invalid function dependency The left set of , and the rest of the fields are used as the right set of invalid functional dependencies, so that 4 invalid functional dependencies can be generated, namely "{gender}→{name}, {gender}→{age}, {gender}→{ blood pressure}, {gender}→{medication}". Similarly, the tuples within all tuple pairs are compared to generate invalid functional dependencies. Among them, the functional dependencies obtained by the tuple pairs "{2, 3}, {3, 4}, {4, 6}, {1, 5}, {2, 4}, {2, 6}" are the same. is: {sex}→{name}, {sex}→{age}, {sex}→{blood pressure}, {sex}→{medication}; the functional dependency obtained by tuple pair "{5, 6}" is : {age}→{name}, {age}→{gender}, {age}→{blood pressure}, {age}→{medication}; pass the tuple pair "{1,3}{4,5}{1 , 6}”, the functional dependencies obtained are: {blood pressure, medication}→{name}, {blood pressure, medication}→{gender}, {blood pressure, medication}→{age}; }" The obtained functional dependencies are: {gender, blood pressure, medication}→{name}, {gender, blood pressure, medication}→{age}. In this embodiment, invalid functional dependencies are obtained by comparing the field values on all fields with tuples, which pave the way for subsequent generation of valid functional dependencies.

该实施例生成的无效函数依赖中，“{性别}→{姓名}”和“{性别，血压，用药}→{姓名}”这两个函数依赖是存在包含关系的，在右部集相同的情况下，第一给无效函数依赖的左部集属于第二个无效函数依赖的左部集。实际上，在“性别”字段、“血压”字段以及“用药”字段三个字段都无法决定“姓名”字段的情况下，“性别”字段一定是无法决定“姓名”字段的，所以第一个无效函数依赖其实是不必要的，这种不必要的无效函数依赖会对后续生成有效函数依赖的效率造成影响。In the invalid functional dependencies generated by this example, the two functional dependencies "{sex}→{name}" and "{sex, blood pressure, medication}→{name}" have an inclusive relationship, and the right part sets the same case, the left set of the first given invalid functional dependency belongs to the left set of the second invalid functional dependency. In fact, in the case that the "Sex" field, the "Blood pressure" field and the "Medication" field cannot determine the "Name" field, the "Sex" field must not be able to determine the "Name" field, so the first Invalid functional dependencies are actually unnecessary, and such unnecessary invalid functional dependencies will affect the efficiency of subsequent generation of valid functional dependencies.

在一实施例中，确定不必要的无效函数依赖，所述不必要的无效函数依赖的左部集被包含于其他无效函数依赖的左部集，所述不必要的无效函数依赖的右部集与所述其他函数依赖的右部集相同；所述在任一无效函数依赖的左部集中添加候选字段，包括：在区别于所述不必要的无效函数依赖的任一无效函数依赖的左部集中添加候选字段。In one embodiment, determining unnecessary invalid functional dependencies whose left-hand set is contained in the left-hand set of other invalid functional dependencies, the unnecessary right-hand set of invalid functional dependencies The same as the right part set of the other functional dependencies; the adding a candidate field in the left part set of any invalid functional dependency includes: in the left part set of any invalid functional dependency different from the unnecessary invalid functional dependency Add candidate fields.

进一步的，通过构建以无效函数依赖的左部集为路径的二叉搜索树，来剪除不必要的函数依赖，提高无效函数依赖转换为有效函数依赖的效率。Further, by constructing a binary search tree with the left set of invalid functional dependencies as a path, unnecessary functional dependencies are pruned, and the efficiency of converting invalid functional dependencies into valid functional dependencies is improved.

图6是一示例性实施例提供的一种构建二叉搜索树的示意图，如图6所示，针对每个字段建立一颗二叉搜索树，为了方便画图，这里将“姓名、性别、年龄、血压、用药”五个字段以“A、B、C、D、E”表示。以“血压”字段为例，将“血压”字段作为根节点，从所有无效函数依赖中找出左部集包含血压字段的无效函数依赖，结果为“{血压，用药}→{姓名}、{血压，用药}→{性别}、{血压，用药}→{年龄}、{性别，血压，用药}→{姓名}、{性别，血压，用药}→{年龄}”。这5条无效函数依赖的右部集包含3个字段，分别为“{姓名}、{性别}、{年龄}”，因此将A、B、C画在根节点旁边(在实际的构建过程中无效函数依赖的右部集并不会出现在二叉树中，本说明书将右部集展示于节点旁边是为了方便理解)。5条无效函数依赖的左部集均包含“{血压，用药}”，因此在连成叶节点E，并在节点E边上画出A、B、C表示右部集。由于后两个无效函数依赖的左部集还包含性别字段，且这里两个无效函数依赖的右部集分别为“{姓名}、{年龄}”，因此将原先的E节点分为两个，左E节点旁边画上A、C，右E节点画上B。构建左E节点的子节点B，在子节点B边上延续A、C作为右部集。如图所示，在构建的二叉搜索树中存在两条分支，分别代表为“DEB→A、DEB→C”和“DE→B”，即“{性别，血压，用药}→{姓名}、{性别，血压，用药}→{年龄}、{血压，用药}→{性别}”。不难看出，“{血压，用药}→{姓名}、{血压，用药}→{年龄}”这两条无效函数依赖被剪除了，这其实不难理解，既然性别、血压、用药三个字段都无法决定姓名字段，那么血压和用药两个字段就更无法决定了。该实施例通过构建以无效函数依赖的左部集为路径的二叉搜索树，剪除了没必要的无效函数依赖，减少了后续的计算开销。Fig. 6 is a schematic diagram of constructing a binary search tree provided by an exemplary embodiment. As shown in Fig. 6, a binary search tree is established for each field. , blood pressure, medication" five fields are represented by "A, B, C, D, E". Take the "blood pressure" field as an example, take the "blood pressure" field as the root node, and find out the invalid functional dependencies whose left set contains the blood pressure field from all invalid functional dependencies. The result is "{blood pressure, medication}→{name}, { blood pressure, medication}→{sex}, {blood pressure, medication}→{age}, {sex, blood pressure, medication}→{name}, {sex, blood pressure, medication}→{age}”. The right-hand set of these 5 invalid functional dependencies contains 3 fields, namely "{name}, {gender}, {age}", so A, B, C are drawn next to the root node (in the actual construction process The right-hand set of invalid functional dependencies does not appear in a binary tree, and this specification displays the right-hand set next to the node for ease of understanding). The left sets of the five invalid functional dependencies all contain "{blood pressure, medication}", so they are connected to form a leaf node E, and A, B, and C are drawn on the edge of node E to represent the right set. Since the left-hand sets of the latter two invalid functional dependencies also contain gender fields, and the right-hand sets of the two invalid functional dependencies here are "{name}, {age}" respectively, the original E node is divided into two, Draw A and C next to the left E node, and draw B on the right E node. Construct the child node B of the left E node, and continue A and C as the right set on the edge of the child node B. As shown in the figure, there are two branches in the constructed binary search tree, which are represented as "DEB→A, DEB→C" and "DE→B", namely "{sex, blood pressure, medication}→{name} , {gender, blood pressure, medication}→{age}, {blood pressure, medication}→{gender}”. It is not difficult to see that the two invalid functional dependencies of "{blood pressure, medication}→{name}, {blood pressure, medication}→{age}" have been cut off, which is not difficult to understand, since the three fields of gender, blood pressure, and medication are The name field cannot be determined, so the two fields of blood pressure and medication cannot be determined. In this embodiment, by constructing a binary search tree with the left part set of invalid functional dependencies as a path, unnecessary invalid functional dependencies are pruned, and the subsequent calculation overhead is reduced.

数据集中，字段值的数据种类并不一定都相同，在元组对内两个元组进行字段值对比时，往往会因为数据类型的原因占用大量储存空间，可以用其他方式代替具体的字段值来进行对比。In the data set, the data types of field values are not necessarily the same. When comparing the field values of two tuples in a tuple pair, it often takes up a lot of storage space due to the data type. Other methods can be used to replace specific field values. to compare.

在一实施例中，将所述多个元组在所述原始数据集中对应的字段值分别替换为：相应字段值所属字段对应的划分集的序号与该字段值所属元组所处的元组集合的序号组合形成的字段值序号，以生成替换数据集；In an embodiment, the field values corresponding to the multiple tuples in the original data set are respectively replaced by: the serial number of the partition set corresponding to the field to which the corresponding field value belongs and the tuple where the tuple to which the field value belongs is located. The sequence number of the field value formed by the combination of the sequence numbers of the set to generate the replacement data set;

所述针对每一元组对生成相应的无效函数依赖，包括：根据所述替换数据集生成所述无效函数依赖。The generating a corresponding invalid functional dependency for each tuple pair includes: generating the invalid functional dependency according to the replacement data set.

元组序号为“1”的元组为例，假设“姓名、性别、年龄、血压、用药”这五个字段分别对应序号“1、2、3、4、5”，该元组在“姓名”划分集所处元组集合的序号为1(为了方便理解，元组集合的序号可以按照图2中元组集合在划分集中的书写顺序进行排序，例如：“性别”划分集中字段值为“男”的元组集合的序号为1，字段值为“女”的元组集合的序号为2)，那么该元组在“姓名”字段的字段值可以用“1-1”替换。同理，对其他字段值也作相应替换，替换后的结果如表2所示。在后续对元组对内两个元组的字段值进行比对时，可以直接对比序号的差异，无需比对具体字段值。Take the tuple with the tuple sequence number "1" as an example. Assume that the five fields of "name, gender, age, blood pressure, and medication" correspond to the sequence numbers "1, 2, 3, 4, and 5" respectively. "The sequence number of the tuple set where the division set is located is 1 (for the convenience of understanding, the sequence number of the tuple set can be sorted according to the writing order of the tuple set in the division set in Figure 2, for example: "Gender" The field value of the division set is " The serial number of the tuple set of "male" is 1, and the serial number of the tuple set of the field value of "female" is 2), then the field value of the tuple in the "name" field can be replaced with "1-1". Similarly, other field values are also replaced accordingly, and the results after replacement are shown in Table 2. When comparing the field values of the two tuples in the tuple pair subsequently, the difference in serial numbers can be directly compared without comparing specific field values.

元组序号Tuple sequence number 姓名Name 性别gender 年龄age 血压blood pressure 用药Medication 11 1-11-1 2-22-2 3-23-2 4-14-1 5-15-1 22 1-21-2 2-12-1 3-33-3 4-34-3 5-35-3 33 1-31-3 2-12-1 3-43-4 4-14-1 5-15-1 44 1-41-4 2-12-1 3-53-5 4-24-2 5-25-2 55 1-51-5 2-22-2 3-13-1 4-24-2 5-25-2 66 1-61-6 2-12-1 3-13-1 4-14-1 5-15-1

表2Table 2

该实施例将原有的字段值替换为同一的序号，一方面，考虑到了计算机底层存储结构的特性，减小了计算所需要的时间和空间开销；另一方面，在不影响后续计算的情况下，对原始数据进行了转换，减小了隐私泄露的风险。This embodiment replaces the original field value with the same serial number. On the one hand, considering the characteristics of the underlying storage structure of the computer, the time and space overhead required for calculation is reduced; The original data is transformed to reduce the risk of privacy leakage.

步骤205，对所述无效函数依赖集合进行反演得到有效函数依赖集合，所述有效函数依赖集合包含所述原始数据集对应的有效函数依赖。Step 205 , inverting the invalid functional dependency set to obtain a valid functional dependency set, where the valid functional dependency set includes valid functional dependencies corresponding to the original data set.

泛化指的是一种函数依赖之间的覆盖关系，一个无效函数依赖的泛化的左部集被包含于该无效函数依赖的左部集，右部集与该无效函数依赖的右部集相同，例如无效函数依赖“{性别，血压，用药}→{姓名}”可以完全覆盖无效函数依赖“{血压，用药}→{姓名}”，那么可以称“{血压，用药}→{姓名}”是“{性别，血压，用药}→{姓名}”的泛化。Generalization refers to a covering relationship between functional dependencies. The left part of the generalization of an invalid functional dependency is included in the left part of the invalid functional dependency, and the right part is the right part of the invalid functional dependency. The same, for example, the invalid function dependency "{gender, blood pressure, medication}→{name}" can completely cover the invalid function dependency "{blood pressure, medication}→{name}", then it can be called "{blood pressure, medication}→{name} " is a generalization of "{gender, blood pressure, medication}→{name}".

在一实施例中，所述对所述无效函数依赖集合进行反演得到有效函数依赖集合，包括：提取所述无效函数依赖集合中的任一无效函数依赖，所述无效函数依赖的左部集为相应元组对所含两个元组具有相同字段值的所有字段，右部集为相应元组对所含两个元组具有不同字段值的任一字段；在所述任一无效函数依赖的左部集中添加候选字段，得到待验证的函数依赖，所述候选字段为所述至少两个字段中区别于所述任一无效函数依赖的左部集和右部集所含字段的其他字段；若所述待验证的函数依赖不属于所述无效函数依赖集合及该集合中的无效函数依赖的泛化，则确定所述待验证的函数依赖为有效函数依赖，并将所述有效函数依赖添加至有效函数依赖集合中。In one embodiment, inverting the invalid functional dependency set to obtain a valid functional dependency set includes: extracting any invalid functional dependency in the invalid functional dependency set, a left part of the invalid functional dependency set. is all fields with the same field value in the two tuples contained in the corresponding tuple pair, and the right part is any field in which the two tuples contained in the corresponding tuple pair have different field values; in any of the invalid functional dependencies Add candidate fields to the left set of the to-be-verified functional dependencies, and the candidate fields are other fields in the at least two fields that are different from the fields contained in the left set and the right set of any invalid functional dependency. If the functional dependency to be verified does not belong to the generalization of the invalid functional dependency set and the invalid functional dependency in this set, then determine that the functional dependency to be verified is an effective functional dependency, and the valid functional dependency is Add to the set of valid functional dependencies.

下面集合图7对反演法进行详细介绍，图7是一示例性实施例提供的一种生成有效函数依赖的示意图，如图7所示，在任一无效函数依赖的左部集中添加候选字段，并验证其是否转换为有效函数依赖。以无效函数依赖“{血压，用药}→{年龄}”为例，该无效函数依赖的左部集和右部集所含字段为“血压”字段、“年龄”字段以及“用药”字段，因此候选字段可以为“性别”字段和“姓名”字段。将这2个字段分别添加指无效函数依赖的左部集中，得到2条待验证的函数依赖“{性别，血压，用药}→{年龄}、{姓名，血压，用药}→{年龄}”，可以看出，这2条函数依赖中，第一条待验证的函数依赖属于无效函数依赖集合，第二条待验证的函数依赖不属于无效函数依赖集合，因此可以判定“{姓名，血压，用药}→{年龄}”为有效函数依赖。该实施例将候选字段添加至无效函数依赖中得到待验证的函数依赖，并通过验证该待验证的无效函数依赖是否属于无效函数依赖集合，来判定该待验证的函数依赖是否为有效函数依赖。该过程将无效函数依赖转换为有效函数依赖，无需直接对函数依赖是否有效进行验证，而是利用无效函数依赖集合进行判定，减少了计算开销，提高了确定函数依赖的效率。The inversion method is described in detail below with reference to FIG. 7. FIG. 7 is a schematic diagram of generating a valid functional dependency provided by an exemplary embodiment. As shown in FIG. 7, a candidate field is added to the left part of any invalid functional dependency, and verify that it converts to a valid functional dependency. Taking the invalid functional dependency "{blood pressure, medication}→{age}" as an example, the fields in the left and right sets of the invalid functional dependency are the "blood pressure" field, the "age" field, and the "medication" field, so Candidate fields can be the "Gender" field and the "Name" field. Add these two fields to the left set of invalid functional dependencies respectively, and get two functional dependencies to be verified: "{gender, blood pressure, medication}→{age}, {name, blood pressure, medication}→{age}", It can be seen that among these two functional dependencies, the first functional dependency to be verified belongs to the invalid functional dependency set, and the second functional dependency to be verified does not belong to the invalid functional dependency set, so it can be determined that "{name, blood pressure, medication }→{age}" is a valid functional dependency. In this embodiment, candidate fields are added to the invalid functional dependencies to obtain the functional dependencies to be verified, and it is determined whether the functional dependencies to be verified are valid functional dependencies by verifying whether the invalid functional dependencies to be verified belong to the set of invalid functional dependencies. This process converts invalid functional dependencies into valid functional dependencies, without directly verifying whether the functional dependencies are valid, but uses the invalid functional dependency set for judgment, which reduces computational overhead and improves the efficiency of determining functional dependencies.

进一步的，若所述待验证的函数依赖属于所述无效函数依赖集合或者该集合中的无效函数依赖的泛化，则在所述待验证的函数依赖的左部集中继续添加区别于已添加的候选字段的其他候选字段，直至所述待验证的函数依赖被确定为所述有效函数依赖。继续往第二条待验证的函数依赖中添加候选字段，由于“{性别，血压，用药}→{年龄}”已添加的候选字段为“性别”字段，所以再次添加的候选字段应为“姓名”字段，得到待验证的函数依赖“{姓名，性别，血压，用药}→{年龄}”，该函数依赖不属于无效函数依赖集合，因此可以判定“{姓名，性别，血压，用药}→{年龄}”为有效函数依赖。该实施例使得无效函数依赖可以充分转换为有效函数依赖，避免确定的函数依赖有所遗漏。Further, if the functional dependency to be verified belongs to the set of invalid functional dependencies or the generalization of the invalid functional dependencies in the set, then continue to add in the left set of the functional dependencies to be verified that are different from the ones that have been added. Other candidate fields of candidate fields until the functional dependency to be verified is determined to be the valid functional dependency. Continue to add candidate fields to the second functional dependency to be verified. Since the added candidate field of "{gender, blood pressure, medication}→{age}" is the "gender" field, the candidate field added again should be "name" " field, get the functional dependency to be verified "{name, gender, blood pressure, medication}→{age}", this functional dependency does not belong to the set of invalid functional dependencies, so it can be determined that "{name, gender, blood pressure, medication}→{ age}" is a valid functional dependency. This embodiment enables invalid functional dependencies to be fully converted into valid functional dependencies, avoiding omission of certain functional dependencies.

如前所述，在采样过程中，可以根据效率值对元组集合进行排序，由于效率值高的元组集合提取的元组对数量更多，效率值低的元组集合提取的元组对数量更少，所以在面对元组数量以及字段数量较多的数据集时，若先对效率值较高的元组集合进行采样，可以确保在采样前期就可以提取大部分元组对。在实际情况下，为了保证采样的效率，可以在一定程度上牺牲采样的充分度，以提高生成函数依赖的效率。As mentioned above, in the sampling process, the tuple set can be sorted according to the efficiency value. Since the tuple set with high efficiency value extracts more tuple pairs, the tuple pair extracted from the tuple set with low efficiency value Therefore, when faced with a dataset with a large number of tuples and a large number of fields, if the tuple set with a higher efficiency value is sampled first, it can ensure that most of the tuple pairs can be extracted in the early stage of sampling. In practical situations, in order to ensure the efficiency of sampling, the adequacy of sampling can be sacrificed to a certain extent to improve the efficiency of generating function dependencies.

在一实施例中，对所有元组集合进行多轮采样得到元组对，包括：在首轮采样中，按照初始大小的采样窗口对所有元组集合进行滑动采样；或者，在非首轮采样中，在前一轮的基础上增大所述采样窗口，并对所有元组集合重启一轮滑动采样；其中，处于滑动窗口两端的元组用于生成一元组对；若当前轮滑动采样结束后计算出所述无效函数依赖的增长速率达到第一预设下限阈值，则结束采样；否则，进入下一轮滑动采样。In one embodiment, performing multiple rounds of sampling on all tuple sets to obtain tuple pairs includes: in the first round of sampling, performing sliding sampling on all tuple sets according to a sampling window of the initial size; or, in non-first rounds of sampling , increase the sampling window on the basis of the previous round, and restart a round of sliding sampling for all tuple sets; wherein, the tuples at both ends of the sliding window are used to generate a tuple pair; if the current round of sliding sampling ends After it is calculated that the growth rate of the invalid function dependency reaches the first preset lower limit threshold, the sampling is ended; otherwise, the next round of sliding sampling is entered.

图8是一示例性实施例提供的一种滑动窗口采样的示意图，如图8所示，使用滑动窗口对元组集合进行多轮采样(由于元组数量为1个的元组集合无法抽取两个元组作为元组对，不具备采样的价值，所以这里不将其排进队列中)。在首轮采样中，采样窗口的初始大小为2，从队列头部的元组集合，即“性别”划分集中字段值为“男”的元组集合，开始对所有元组集合进行滑动采样，提取滑动窗口两端的两个元组作为元组对。首轮采样的采样结果为：{2，3}、{3，4}、{4，6}、{1，5}、{5，6}、{1，3}、{3，6}、{4，5}。由图4可知，通过这些元组对生成的无效函数依赖包含了所有无效函数依赖，这里就不再一一说明。由于是首轮采样，所以无效函数依赖的增长速率为100％。FIG. 8 is a schematic diagram of a sliding window sampling provided by an exemplary embodiment. As shown in FIG. 8 , a sliding window is used to perform multiple rounds of sampling on a tuple set (because a tuple set with one tuple cannot extract two tuples as tuple pairs do not have the value of sampling, so they are not enqueued here). In the first round of sampling, the initial size of the sampling window is 2. From the set of tuples at the head of the queue, that is, the set of tuples whose field value is "male" in the "gender" partition set, sliding sampling is performed on all sets of tuples. Extract the two tuples at both ends of the sliding window as a tuple pair. The sampling results of the first round of sampling are: {2, 3}, {3, 4}, {4, 6}, {1, 5}, {5, 6}, {1, 3}, {3, 6}, {4, 5}. It can be seen from Figure 4 that the invalid functional dependencies generated by these tuples include all invalid functional dependencies, which will not be described one by one here. Since it is the first round of sampling, the growth rate of invalid functional dependencies is 100%.

在第二轮采样中，将滑动窗口大小增大至3，遍历所有元组集合，第二轮采样的采样结果为：{2，4}、{3，6}、{1，6}。显然，通过这些元组对生成的无效函数依赖一定与第一轮生成的无效函数依赖重复，因此这一轮无效函数依赖的增长速率为0。若第一预设下限阈值为5％，则第二轮结束后的无效函数依赖的增长速率已经达到了第一预设下限阈值，可以结束采样。为了验证第三轮采样的价值并不大，这里依旧将滑动窗口大小增大至4，以进行第三轮采样，采样结果为：{2，6}。通过元组对“{2，6}”生成的无效函数依赖也与首轮采样生成的无效函数依赖重复。因此可以看出，随着采样轮次的进行，采样的效率会越来越低，设置第一预设下限阈值可以在确保准确度的情况下，减少时间和计算开销。In the second round of sampling, the sliding window size is increased to 3, and all tuple sets are traversed. The sampling results of the second round of sampling are: {2, 4}, {3, 6}, {1, 6}. Obviously, the invalid functional dependencies generated by these tuple pairs must be repeated with the invalid functional dependencies generated in the first round, so the growth rate of invalid functional dependencies in this round is 0. If the first preset lower limit threshold is 5%, then the growth rate of the invalid function dependence after the second round has reached the first preset lower limit threshold, and the sampling can be ended. In order to verify that the value of the third round of sampling is not large, the sliding window size is still increased to 4 for the third round of sampling, and the sampling result is: {2, 6}. The invalid functional dependencies generated by the tuple pair "{2, 6}" also duplicate the invalid functional dependencies generated by the first round of sampling. Therefore, it can be seen that with the progress of the sampling round, the sampling efficiency will become lower and lower, and setting the first preset lower limit threshold can reduce the time and calculation overhead while ensuring the accuracy.

进一步的，若当前轮滑动采样结束后计算出所述有效函数依赖的增长速率未达到第二预设下限阈值，则进入下一轮滑动采样，直至任一轮滑动采样后计算出所述有效函数依赖的增长速率达到预设阈值。Further, if it is calculated that the growth rate of the effective function dependence does not reach the second preset lower threshold after the current round of sliding sampling ends, enter the next round of sliding sampling until the effective function is calculated after any round of sliding sampling. The growth rate of the dependency reaches a preset threshold.

在每轮采样结束后，不仅需要计算无效函数依赖的增长速率，还需要计算有效函数依赖的增长速率，这里的有效函数依赖的增长速率是指每轮采样结束后新增的无效函数依赖生成的有效函数依赖的增长速率。例如：若在一轮采样后，新增无效函数依赖“{年龄}→{姓名}”，通过该无效函数依赖生成的有效函数依赖为“{性别，年龄}→{姓名}、{血压，年龄}→{姓名}、{用药，年龄}→{姓名}”，若这3个有效函数依赖并未在之前的轮次生成过，则有效函数依赖的增长速率为新增有效函数依赖的数量(3)与有效函数依赖总数量(之前生成的有效函数依赖数量加上3)的比值。After each round of sampling, it is necessary to calculate not only the growth rate of invalid functional dependencies, but also the growth rate of valid functional dependencies. The growth rate of valid functional dependencies here refers to the newly added invalid functional dependencies after each round of sampling. The growth rate of the effective functional dependence. For example: if after one round of sampling, an invalid function dependency "{age}→{name}" is added, the valid function dependency generated by the invalid function dependency is "{gender, age}→{name}, {blood pressure, age }→{name}, {medication, age}→{name}", if these three valid functional dependencies have not been generated in previous rounds, the growth rate of effective functional dependencies is the number of new effective functional dependencies ( 3) Ratio to the total number of effective functional dependencies (the number of effective functional dependencies generated before plus 3).

在上面的实施例中，由于首轮采样就得出了所有无效函数依赖，所以第二轮采样结束后有效函数依赖的增长速率为0，无需继续进行采样。除了这种情况之外，还会出现无效函数依赖的增长速率达到了第一预设下限阈值，但是有效函数依赖的增长速率未达到第二预设下限阈值，此时采样不会结束，依旧需要继续采样。只有在无效函数依赖的增长速率达到第一预设下限阈值，且有效函数依赖的增长速率达到第二预设下限阈值的情况下，采样才会结束。In the above embodiment, since all invalid functional dependencies are obtained in the first round of sampling, the growth rate of valid functional dependencies is 0 after the second round of sampling ends, and there is no need to continue sampling. In addition to this situation, the growth rate of the invalid function dependence has reached the first preset lower threshold, but the growth rate of the effective function dependence has not reached the second preset lower threshold. At this time, the sampling will not end, and it is still necessary to Continue sampling. Sampling ends only when the growth rate of the invalid function dependency reaches the first preset lower threshold and the growth rate of the valid function dependency reaches the second preset lower threshold.

该实施例在提高采样效率的同时，在一定程度上确保了采样的充分度，从而减小了遗漏函数依赖的可能性。While improving the sampling efficiency, this embodiment ensures the adequacy of sampling to a certain extent, thereby reducing the possibility of missing functional dependencies.

在上述的实施例中，基于原始数据集可以生成有效函数依赖，该有效函数依赖可以用于表征原始数据集中字段之间的关系。以有效函数依赖“{性别，年龄}→{姓名}”为例，该函数依赖左部集中包括“性别”字段和“年龄”字段，右部集中包括“姓名”字段，所以该有效函数依赖可以表示：通过一个元组在“性别”字段的字段值和在“年龄”字段的字段值，可以确定该元组在“姓名”字段的字段值。由于函数依赖的特殊作用，所以函数依赖往往被应用于确定字段关系的场景中。In the above-mentioned embodiment, an effective functional dependency can be generated based on the original data set, and the effective functional dependence can be used to characterize the relationship between the fields in the original data set. Take the valid functional dependency "{gender, age}→{name}" as an example, the left part of the functional dependency includes the "gender" field and the "age" field, and the right part includes the "name" field, so the valid functional dependency can be Representation: The field value of the tuple in the "Name" field can be determined by the field value in the "Gender" field and the field value in the "Age" field of a tuple. Due to the special role of functional dependencies, functional dependencies are often used in scenarios where field relationships are determined.

在一实施例中，用户需要确定表1所示的数据集中“性别”字段和“年龄”字段之间的关系。确定有效函数依赖中不存在“{性别}→{年龄}”或者“{年龄}→{性别}”，且无效函数依赖集合中存在“{性别，血压，用药}→{年龄}”和“{年龄}→{性别}”。可以看出，在确定一个元组在“性别”字段的字段值的情况下，无法确定该元组在“年龄”字段的字段值，反之亦然。因此，“性别”字段和“年龄”字段之间不存在直接联系，两个字段无法进行相互推导。In one embodiment, the user needs to determine the relationship between the "gender" field and the "age" field in the data set shown in Table 1. It is determined that "{sex}→{age}" or "{age}→{sex}" does not exist in the valid functional dependencies, and there are "{sex, blood pressure, medication}→{age}" and "{ age}→{gender}". It can be seen that in the case of determining the field value of a tuple in the "gender" field, the field value of the tuple in the "age" field cannot be determined, and vice versa. Therefore, there is no direct connection between the "Gender" field and the "Age" field, and the two fields cannot be derived from each other.

在另一实施例中，用户需要确定表1所示的数据集中“姓名”字段和“年龄”字段之间的关系。确定有效函数依赖中存在“{姓名}→{年龄}”，不存在“{年龄}→{姓名}”，且无效函数依赖集合中存在“{年龄}→{姓名}”。可以看出，在确定一个元组在“姓名”字段的字段值的情况下，可以确定该元组在“年龄”字段的字段值；在确定一个元组在“年龄”字段的字段值的情况下，却无法确定该元组在“姓名”字段的字段值。因此，通过“姓名”字段的字段值可以推导出“年龄”字段的字段值，通过“年龄”字段的字段值却无法推导出“姓名”字段的字段值，“姓名”字段和“年龄”字段之间的关系是：“姓名”字段单方面确定“年龄”字段。In another embodiment, the user needs to determine the relationship between the "name" field and the "age" field in the dataset shown in Table 1. It is determined that "{name}→{age}" exists in valid functional dependencies, "{age}→{name}" does not exist, and "{age}→{name}" exists in the set of invalid functional dependencies. It can be seen that in the case of determining the field value of a tuple in the "name" field, the field value of the tuple in the "age" field can be determined; in the case of determining the field value of a tuple in the "age" field , but could not determine the field value of the tuple in the "Name" field. Therefore, the field value of the "age" field can be deduced from the field value of the "name" field, but the field value of the "name" field, the "name" field and the "age" field cannot be deduced from the field value of the "age" field. The relationship is: The "Name" field unilaterally determines the "Age" field.

具体的，确定函数依赖的方法可以被应用于敏感字段识别场景。在一实施例中，确定用户在所述至少两个字段中标记的敏感字段；从所述有效函数依赖中确定目标函数依赖，所述目标函数依赖的左部集所含字段属于所述敏感字段；若所述目标函数依赖的右部集所含字段区别于所述敏感字段，则判定所述目标函数依赖的右部集所含字段为潜在的敏感字段。Specifically, the method of determining functional dependencies can be applied to sensitive field recognition scenarios. In one embodiment, the sensitive fields marked by the user in the at least two fields are determined; the objective functional dependencies are determined from the effective functional dependencies, and the fields contained in the left part set of the objective functional dependencies belong to the sensitive fields ; If the fields contained in the right part set of the objective functional dependency are different from the sensitive fields, then determine that the fields contained in the right part set of the objective functional dependence are potential sensitive fields.

敏感字段可以指泄露后可能会给社会或个人带来严重危害的数据字段，包括个人隐私信息，例如姓名、性别等。用户对敏感字段的标记可以是将字段简单标记未敏感字段和非敏感字段，也可以是对字段进行分级，例如：字段级别划分为5个级别，将非敏感字段设置为0级，将敏感字段随着敏感程度的提高设置的字段级别越高，敏感程度最高的字段为5级，本说明书并不对具体的标记方法进行限制。Sensitive fields can refer to data fields that may cause serious harm to society or individuals after leakage, including personal privacy information, such as name, gender, etc. The user can mark sensitive fields by simply marking the fields as non-sensitive fields and non-sensitive fields, or by classifying the fields. For example, the field level is divided into 5 levels, the non-sensitive fields are set to level 0, and the sensitive fields are set to As the sensitivity increases, the higher the field level is set, and the field with the highest sensitivity is level 5. This specification does not limit the specific marking method.

若用户标记的敏感字段为：性别、年龄，以这两个敏感字段为左部集的有效函数依赖有：{性别，年龄}→{姓名}、{性别，年龄}→{血压}、{性别，年龄}→{用药}。这3个有效函数依赖的右部集分别为“姓名”字段、“血压”字段以及“用药”字段，这3个字段都不是敏感字段，因此可以判定“姓名”字段、“血压”字段、“用药”字段均为潜在的敏感字段。If the sensitive fields marked by the user are: gender and age, the valid functional dependencies with these two sensitive fields as the left set are: {gender, age}→{name}, {gender, age}→{blood pressure}, {gender , age}→{medication}. The right-hand sets of these three valid functional dependencies are the "name" field, the "blood pressure" field, and the "medication" field. These three fields are not sensitive fields, so it can be determined that the "name" field, "blood pressure" field, "" "Medication" fields are all potentially sensitive fields.

该实施例将函数依赖的确定方法应用于敏感字段识别领域，基于函数依赖确定字段间的关系，并通过函数依赖和敏感字段推理潜在的敏感字段，避免人工标记敏感字段，降低了时间和人力成本。This embodiment applies the method for determining functional dependencies to the field of sensitive field identification, determines the relationship between fields based on functional dependencies, and infers potential sensitive fields through functional dependencies and sensitive fields, avoiding manual marking of sensitive fields and reducing time and labor costs .

图9是一示例性实施例提供的一种基于函数依赖的敏感字段的推理方法的流程图，如图9所示，该方法可以包括以下步骤：FIG. 9 is a flowchart of a method for reasoning based on a sensitive field of functional dependencies provided by an exemplary embodiment. As shown in FIG. 9 , the method may include the following steps:

步骤901，获取基于数据集生成的函数依赖，所述函数依赖用于表征所述数据集中字段之间的关系。Step 901: Obtain a functional dependency generated based on a data set, where the functional dependency is used to characterize the relationship between fields in the data set.

函数依赖的确定方式可以通过前述实施例实现，也可以采用相关技术中的其他算法实现，本说明书并不对此进行限制。The way of determining the functional dependency may be implemented by the foregoing embodiments, or may be implemented by other algorithms in the related art, which are not limited in this specification.

步骤902，确定用户在所述数据集的字段中标记的敏感字段。Step 902: Determine the sensitive fields marked by the user in the fields of the data set.

步骤903，从所述函数依赖中确定目标函数依赖，所述目标函数依赖的左部集所含字段属于所述敏感字段。Step 903: Determine an objective functional dependency from the functional dependencies, and the fields contained in the left part set of the objective functional dependencies belong to the sensitive fields.

步骤904，若所述目标函数依赖的右部集所含字段区别于所述敏感字段，则判定所述目标函数依赖的右部集所含字段为潜在的敏感字段。Step 904 , if the fields contained in the right-hand set of the objective functional dependency are different from the sensitive fields, determine that the fields contained in the right-hand set of the objective functional dependence are potential sensitive fields.

这里仍以表1的数据集为例，若用户标记的敏感字段为“年龄”字段和“血压”字段，以这两个字段为左部集的函数依赖包括：{血压，年龄}→{姓名}、{血压，年龄}→{性别}、{血压，年龄}→{用药}。这3个函数依赖的右部集分别为“姓名”字段、“性别”字段以及“用药”字段，这3个字段都不是敏感字段，因此可以判定“姓名”字段、“性别”字段以及“用药”字段为潜在的敏感字段。Still taking the data set in Table 1 as an example, if the sensitive fields marked by the user are the "age" field and the "blood pressure" field, the functional dependencies with these two fields as the left set include: {blood pressure, age}→{name }, {blood pressure, age}→{gender}, {blood pressure, age}→{medication}. The right sets of these three functional dependencies are the "name" field, the "gender" field and the "medication" field. These three fields are not sensitive fields, so it can be determined that the "name" field, the "gender" field and the "medication" field ” field is a potentially sensitive field.

该实施例通过函数依赖表征数据集中字段之间的关系，使得用户可以基于函数依赖和已标记的敏感字段推理潜在的敏感字段，降低了时间和人力成本，提高了敏感字段识别的效率。This embodiment characterizes the relationship between fields in the data set through functional dependencies, so that users can infer potential sensitive fields based on functional dependencies and marked sensitive fields, reducing time and labor costs, and improving the efficiency of sensitive field identification.

图10是一示例性实施例提供的一种设备的示意结构图。请参考图10，在硬件层面，该设备包括处理器1002、内部总线1004、网络接口1006、内存1009以及非易失性存储器1010，当然还可能包括其他功能所需要的硬件。本说明书一个或多个实施例可以基于软件方式来实现，比如由处理器1002从非易失性存储器1010中读取对应的计算机程序到内存1008中然后运行。当然，除了软件实现方式之外，本说明书一个或多个实施例并不排除其他实现方式，比如逻辑器件抑或软硬件结合的方式等等，也就是说以下处理流程的执行主体并不限定于各个逻辑单元，也可以是硬件或逻辑器件。Fig. 10 is a schematic structural diagram of a device provided by an exemplary embodiment. Referring to FIG. 10 , at the hardware level, the device includes a processor 1002 , an internal bus 1004 , a network interface 1006 , a memory 1009 and a non-volatile memory 1010 , and may also include hardware required for other functions. One or more embodiments of the present specification may be implemented based on software, for example, the processor 1002 reads the corresponding computer program from the non-volatile memory 1010 into the memory 1008 and then executes it. Of course, in addition to software implementations, one or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware, etc., that is to say, the execution subjects of the following processing procedures are not limited to each Logic unit, which can also be hardware or logic device.

请参考图11，一种函数依赖的确定装置可以应用于如图11所示的设备中，以实现本说明书的技术方案，该装置可以包括：Please refer to FIG. 11 , an apparatus for determining functional dependencies can be applied to the device shown in FIG. 11 to realize the technical solution of this specification, and the apparatus may include:

第一获取单元1101，用于获取原始数据集，所述原始数据集包含多个元组，每个元组包含至少两个字段对应的字段值；a first obtaining unit 1101, configured to obtain an original data set, where the original data set includes multiple tuples, and each tuple includes field values corresponding to at least two fields;

划分单元1102，用于针对每个字段分别生成相应的划分集，所述划分集中包含至少一个元组集合，每一元组集合内的元组在所述划分集对应字段上的字段值相同；The dividing unit 1102 is configured to generate a corresponding dividing set for each field, the dividing set includes at least one tuple set, and the tuples in each tuple set have the same field value on the corresponding field of the dividing set;

采样单元1103，用于对所有元组集合进行采样得到元组对，每一元组对包含属于同一元组集合的两个元组；Sampling unit 1103, for sampling all tuple sets to obtain tuple pairs, each tuple pair includes two tuples belonging to the same tuple set;

生成单元1104，用于针对每一元组对生成相应的无效函数依赖并添加至无效函数依赖集合中，以得到已采样的元组对对应的所有无效函数依赖；A generating unit 1104, configured to generate a corresponding invalid functional dependency for each tuple pair and add it to the invalid functional dependency set, so as to obtain all invalid functional dependencies corresponding to the sampled tuple pair;

反演单元1105，对所述无效函数依赖集合进行反演得到有效函数依赖集合，所述有效函数依赖集合包含所述原始数据集对应的有效函数依赖。The inversion unit 1105 inverts the invalid functional dependency set to obtain a valid functional dependency set, where the valid functional dependency set includes valid functional dependencies corresponding to the original data set.

可选的，还包括：Optionally, also include:

替换单元1106，用于将所述多个元组在所述原始数据集中对应的字段值分别替换为：相应字段值所属字段对应的划分集的序号与该字段值所属元组所处的元组集合的序号组合形成的字段值序号，以生成替换数据集；The replacement unit 1106 is configured to replace the field values corresponding to the multiple tuples in the original data set with: the serial number of the division set corresponding to the field to which the corresponding field value belongs and the tuple where the tuple to which the field value belongs is located. The sequence number of the field value formed by the combination of the sequence numbers of the set to generate the replacement data set;

所述生成单元1104具体用于根据所述替换数据集生成所述无效函数依赖。The generating unit 1104 is specifically configured to generate the invalid functional dependency according to the replacement data set.

可选的，所述采样单元1103具体用于将每一划分集作为一个队列单元，并根据效率值分别对每一队列单元中的元组集合进行排序，所述效率值的大小与相应元组集合包含的元组数量呈正相关；Optionally, the sampling unit 1103 is specifically configured to use each partition set as a queue unit, and sort the tuple sets in each queue unit according to the efficiency value, the size of the efficiency value being related to the corresponding tuple. The number of tuples contained in the set is positively correlated;

在对任一队列单元中的元组集合进行采样时，按照效率值从大到小的顺序依次对该队列单元中的元组集合进行采样得到元组对。When sampling the tuple set in any queue unit, the tuple set in the queue unit is sampled in order according to the efficiency value in descending order to obtain a tuple pair.

可选的，所述采样单元1103具体用于在首轮采样中，按照初始大小的采样窗口对所有元组集合进行滑动采样；或者，在非首轮采样中，在前一轮的基础上增大所述采样窗口，并对所有元组集合重启一轮滑动采样；其中，处于滑动窗口两端的元组用于生成一元组对；Optionally, the sampling unit 1103 is specifically configured to perform sliding sampling on all tuple sets according to the sampling window of the initial size in the first round of sampling; The sampling window is enlarged, and a round of sliding sampling is restarted for all tuple sets; wherein, the tuples at both ends of the sliding window are used to generate a tuple pair;

若当前轮滑动采样结束后计算出所述无效函数依赖的增长速率达到第一预设下限阈值，则结束采样；否则，进入下一轮滑动采样。If it is calculated that the growth rate of the invalid function dependency reaches the first preset lower limit threshold after the current round of sliding sampling ends, the sampling is ended; otherwise, the next round of sliding sampling is entered.

可选的，所述采样单元1103具体用于若当前轮滑动采样结束后计算出所述有效函数依赖的增长速率未达到第二预设下限阈值，则进入下一轮滑动采样，直至任一轮滑动采样后计算出所述有效函数依赖的增长速率达到预设阈值。Optionally, the sampling unit 1103 is specifically configured to enter the next round of sliding sampling if it is calculated that the growth rate of the effective function dependence does not reach the second preset lower threshold after the current round of sliding sampling ends, until any wheel sliding. After dynamic sampling, it is calculated that the growth rate of the effective function dependence reaches a preset threshold.

可选的，所述反演单元1105具体用于：Optionally, the inversion unit 1105 is specifically used for:

提取所述无效函数依赖集合中的任一无效函数依赖，所述无效函数依赖的左部集为相应元组对所含两个元组具有相同字段值的所有字段，右部集为相应元组对所含两个元组具有不同字段值的任一字段；Extract any invalid functional dependency in the invalid functional dependency set, the left part of the invalid functional dependency set is all fields with the same field value for the two tuples contained in the corresponding tuple, and the right part is the corresponding tuple any field that has different field values for the two tuples it contains;

在所述任一无效函数依赖的左部集中添加候选字段，得到待验证的函数依赖，所述候选字段为所述至少两个字段中区别于所述任一无效函数依赖的左部集和右部集所含字段的其他字段；A candidate field is added to the left set of any invalid functional dependency to obtain the functional dependency to be verified, and the candidate field is the left set and right set of the at least two fields that are different from the any invalid functional dependency. other fields of the fields contained in the part set;

若所述待验证的函数依赖不属于所述无效函数依赖集合及该集合中的无效函数依赖的泛化，则确定所述待验证的函数依赖为有效函数依赖，并将所述有效函数依赖添加至有效函数依赖集合中。If the functional dependency to be verified does not belong to the set of invalid functional dependencies and the generalization of the invalid functional dependencies in the set, the functional dependency to be verified is determined to be a valid functional dependency, and the valid functional dependency is added into the set of valid functional dependencies.

可选的，还包括：Optionally, also include:

第一确定单元1107，用于确定不必要的无效函数依赖，所述不必要的无效函数依赖的左部集被包含于其他无效函数依赖的左部集，所述不必要的无效函数依赖的右部集与所述其他函数依赖的右部集相同；The first determination unit 1107 is configured to determine unnecessary invalid functional dependencies, the left part set of the unnecessary invalid functional dependencies is included in the left part sets of other invalid functional dependencies, and the right part of the unnecessary invalid functional dependencies is included. The part set is the same as the right part set of the other functional dependencies;

所述反演单元1105具体用于在区别于所述不必要的无效函数依赖的任一无效函数依赖的左部集中添加候选字段。The inversion unit 1105 is specifically configured to add a candidate field in the left part set of any invalid functional dependency that is different from the unnecessary invalid functional dependency.

可选的，还包括：Optionally, also include:

添加单元1108，用于若所述待验证的函数依赖属于所述无效函数依赖集合或者该集合中的无效函数依赖的泛化，则在所述待验证的函数依赖的左部集中继续添加区别于已添加的候选字段的其他候选字段，直至所述待验证的函数依赖被确定为所述有效函数依赖。The adding unit 1108 is configured to, if the functional dependency to be verified belongs to the set of invalid functional dependencies or a generalization of the invalid functional dependencies in the set, continue to add in the left set of the functional dependencies to be verified the difference from Other candidate fields of candidate fields that have been added until the functional dependency to be verified is determined to be the valid functional dependency.

可选的，还包括：Optionally, also include:

第二确定单元1109，用于确定用户在所述至少两个字段中标记的敏感字段；a second determining unit 1109, configured to determine the sensitive fields marked by the user in the at least two fields;

第三确定单元1110，用于从所述有效函数依赖中确定目标函数依赖，所述目标函数依赖的左部集所含字段属于所述敏感字段；a third determining unit 1110, configured to determine an objective functional dependency from the effective functional dependency, and the fields contained in the left part set of the objective functional dependency belong to the sensitive field;

第一判定单元1111，用于若所述目标函数依赖的右部集所含字段区别于所述敏感字段，则判定所述目标函数依赖的右部集所含字段为潜在的敏感字段。The first determining unit 1111 is configured to determine that a field included in the right-hand set of the objective functional dependency is a potential sensitive field if the field included in the right-hand set of the objective functional dependency is different from the sensitive field.

请参考图12，一种基于函数依赖的敏感字段的推理装置可以应用于如图12所示的设备中，以实现本说明书的技术方案，该装置可以包括：Please refer to FIG. 12 , an inference apparatus based on functional dependency sensitive fields can be applied to the device shown in FIG. 12 to realize the technical solution of this specification, and the apparatus may include:

第二获取单元，用于获取基于数据集生成的函数依赖，所述函数依赖用于表征所述数据集所含字段的关系；a second obtaining unit, configured to obtain a functional dependency generated based on a data set, where the functional dependency is used to characterize the relationship of fields included in the data set;

第四确定单元，用于确定用户在所述数据集的字段中标记的敏感字段；a fourth determining unit, configured to determine the sensitive fields marked by the user in the fields of the data set;

第五确定单元，用于从所述函数依赖中确定目标函数依赖，所述目标函数依赖的左部集所含字段属于所述敏感字段；a fifth determining unit, for determining an objective functional dependency from the functional dependency, and the field contained in the left part set of the objective functional dependency belongs to the sensitive field;

第二判定单元，用于若所述目标函数依赖的右部集所含字段区别于所述敏感字段，则判定所述目标函数依赖的右部集所含字段为潜在的敏感字段。The second determination unit is configured to determine that the fields included in the right-hand set of objective functional dependencies are potential sensitive fields if the fields included in the right-hand set of objective functional dependencies are different from the sensitive fields.

上述实施例阐明的系统、装置、模块或单元，具体可以由计算机芯片或实体实现，或者由具有某种功能的产品来实现。一种典型的实现设备为计算机，计算机的具体形式可以是个人计算机、膝上型计算机、蜂窝电话、相机电话、智能电话、个人数字助理、媒体播放器、导航设备、电子邮件收发设备、游戏控制台、平板计算机、可穿戴设备或者这些设备中的任意几种设备的组合。The systems, devices, modules or units described in the above embodiments may be specifically implemented by computer chips or entities, or by products with certain functions. A typical implementing device is a computer, which may be in the form of a personal computer, laptop computer, cellular phone, camera phone, smart phone, personal digital assistant, media player, navigation device, email sending and receiving device, game control desktop, tablet, wearable, or a combination of any of these devices.

在一个典型的配置中，计算机包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

内存可能包括计算机可读介质中的非永久性存储器，随机存取存储器(RAM)和/或非易失性内存等形式，如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。Memory may include non-persistent memory in computer readable media, random access memory (RAM) and/or non-volatile memory in the form of, for example, read only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括，但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带、磁盘存储、量子存储器、基于石墨烯的存储介质或其他磁性存储设备或任何其他非传输介质，可用于存储可以被计算设备访问的信息。按照本文中的界定，计算机可读介质不包括暂存电脑可读媒体(transitory media)，如调制的数据信号和载波。Computer-readable media includes both persistent and non-permanent, removable and non-removable media, and storage of information may be implemented by any method or technology. Information may be computer readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cartridges, disk storage, quantum memory, graphene-based storage media or other magnetic storage devices or any other non-transmission media can be used to store information that can be accessed by computing devices. As defined herein, computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.

还需要说明的是，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、商品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、商品或者设备所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括所述要素的过程、方法、商品或者设备中还存在另外的相同要素。It should also be noted that the terms "comprising", "comprising" or any other variation thereof are intended to encompass a non-exclusive inclusion such that a process, method, article or device comprising a series of elements includes not only those elements, but also Other elements not expressly listed, or which are inherent to such a process, method, article of manufacture, or apparatus are also included. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in the process, method, article of manufacture, or device that includes the element.

上述对本说明书特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下，在权利要求书中记载的动作或步骤可以按照不同于实施例中的顺序来执行并且仍然可以实现期望的结果。另外，在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序才能实现期望的结果。在某些实施方式中，多任务处理和并行处理也是可以的或者可能是有利的。The foregoing describes specific embodiments of the present specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve desirable results. Additionally, the processes depicted in the figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

在本说明书一个或多个实施例使用的术语是仅仅出于描述特定实施例的目的，而非旨在限制本说明书一个或多个实施例。在本说明书一个或多个实施例和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式，除非上下文清楚地表示其他含义。还应当理解，本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。The terminology used in one or more embodiments of this specification is for the purpose of describing a particular embodiment only and is not intended to limit the one or more embodiments of this specification. As used in the specification or embodiments and the appended claims, the singular forms "a," "the," and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It will also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.

应当理解，尽管在本说明书一个或多个实施例可能采用术语第一、第二、第三等来描述各种信息，但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如，在不脱离本说明书一个或多个实施例范围的情况下，第一信息也可以被称为第二信息，类似地，第二信息也可以被称为第一信息。取决于语境，如在此所使用的词语“如果”可以被解释成为“在……时”或“当……时”或“响应于确定”。It should be understood that although the terms first, second, third, etc. may be used in this specification to describe various information, such information should not be limited by these terms. These terms are only used to distinguish the same type of information from each other. For example, the first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information without departing from the scope of one or more embodiments of the present specification. Depending on the context, the word "if" as used herein can be interpreted as "at the time of" or "when" or "in response to determining."

以上所述仅为本说明书一个或多个实施例的较佳实施例而已，并不用以限制本说明书一个或多个实施例，凡在本说明书一个或多个实施例的精神和原则之内，所做的任何修改、等同替换、改进等，均应包含在本说明书一个或多个实施例保护的范围之内。The above descriptions are only preferred embodiments of one or more embodiments of this specification, and are not intended to limit one or more embodiments of this specification. All within the spirit and principles of one or more embodiments of this specification, Any modifications, equivalent replacements, improvements, etc. made should be included within the protection scope of one or more embodiments of this specification.

Claims

1. a determination method of functional dependence, is characterized in that, described method comprises:

Obtaining an original data set, the original data set includes a plurality of tuples, and each tuple includes field values corresponding to at least two fields;

A corresponding partition set is generated for each field, the partition set includes at least one tuple set, and the tuples in each tuple set have the same field value on the corresponding field of the partition set;

Sampling all tuple sets to obtain tuple pairs, each tuple pair contains two tuples belonging to the same tuple set;

A corresponding invalid functional dependency is generated for each tuple pair and added to the invalid functional dependency set to obtain all invalid functional dependencies corresponding to the sampled tuple pair;

Inverting the invalid functional dependency set to obtain a valid functional dependency set, where the valid functional dependency set includes valid functional dependencies corresponding to the original data set.

2. The method according to claim 1, wherein

The method further includes: respectively replacing the field values corresponding to the plurality of tuples in the original data set with: the serial number of the division set corresponding to the field to which the corresponding field value belongs and the tuple where the tuple to which the field value belongs is located. The sequence number of the field value formed by the combination of the sequence numbers of the set to generate the replacement data set;

The generating a corresponding invalid functional dependency for each tuple pair includes: generating the invalid functional dependency according to the replacement data set.

3. The method according to claim 1, wherein the sampling of all tuple sets to obtain tuple pairs, comprising:

Taking each partition set as a queue unit, and sorting the tuple sets in each queue unit according to the efficiency value, the size of the efficiency value is positively correlated with the number of tuples included in the corresponding tuple set;

When sampling the tuple set in any queue unit, the tuple set in the queue unit is sampled in order according to the efficiency value in descending order to obtain a tuple pair.

4. The method according to claim 1, wherein the sampling of all tuple sets to obtain tuple pairs, comprising:

In the first round of sampling, sliding sampling is performed on all tuple sets according to the sampling window of the initial size; or, in the non-first round of sampling, the sampling window is increased on the basis of the previous round, and all tuple sets are Restart a round of sliding sampling; among them, the tuples at both ends of the sliding window are used to generate a tuple pair;

If it is calculated that the growth rate of the invalid function dependency reaches the first preset lower limit threshold after the current round of sliding sampling ends, the sampling is ended; otherwise, the next round of sliding sampling is entered.

5. The method according to claim 4, wherein the sampling of all tuple sets to obtain tuple pairs, comprising:

If it is calculated after the current round of sliding sampling that the growth rate of the effective functional dependency does not reach the second preset lower threshold, enter the next round of sliding sampling until the increase of the effective functional dependence is calculated after any round of sliding sampling The rate reaches a preset threshold.

6. The method according to claim 1, wherein the inverting the invalid functional dependency set to obtain a valid functional dependency set, comprising:

Extract any invalid functional dependency in the invalid functional dependency set, the left part of the invalid functional dependency set is all fields with the same field value for the two tuples contained in the corresponding tuple, and the right part is the corresponding tuple any field that has different field values for the two tuples it contains;

A candidate field is added to the left set of any invalid functional dependency to obtain the functional dependency to be verified, and the candidate field is the left set and right set of the at least two fields that are different from the any invalid functional dependency. other fields of the fields contained in the part set;

If the functional dependency to be verified does not belong to the set of invalid functional dependencies and the generalization of the invalid functional dependencies in the set, the functional dependency to be verified is determined to be a valid functional dependency, and the valid functional dependency is added into the set of valid functional dependencies.

7. The method of claim 6, wherein

The method further includes determining unnecessary invalid functional dependencies whose left-hand set is contained in the left-hand set of other invalid functional dependencies, the unnecessary right-hand set of invalid functional dependencies the same as the right-hand set of said other functional dependencies;

The adding a candidate field to the left part set of any invalid functional dependency includes: adding a candidate field to the left part set of any invalid functional dependency that is different from the unnecessary invalid functional dependency.

8. The method of claim 6, further comprising:

If the functional dependency to be verified belongs to the set of invalid functional dependencies or a generalization of the invalid functional dependencies in the set, continue to add fields that are different from the candidate fields already added in the left set of the functional dependencies to be verified. other candidate fields until the functional dependency to be verified is determined to be the valid functional dependency.

9. The method of claim 1, further comprising:

determining the sensitive fields marked by the user in the at least two fields;

Determine objective functional dependencies from the effective functional dependencies, and the fields contained in the left part set of the objective functional dependencies belong to the sensitive fields;

If the field contained in the right part set of the objective functional dependency is different from the sensitive field, it is determined that the field contained in the right part set of the objective functional dependence is a potential sensitive field.

10. An inference method based on functional dependency sensitive fields, characterized in that the method comprises:

Obtaining a functional dependency generated based on the dataset, the functional dependency is used to characterize the relationship between the fields in the dataset;

determine the sensitive fields that the user has flagged in the fields of the dataset;

Determine objective functional dependencies from the functional dependencies, and the fields contained in the left part set of the objective functional dependencies belong to the sensitive fields;

11. An electronic device, characterized in that, comprising:

processor;

memory for storing processor-executable instructions;

Wherein, the processor implements the method of any one of claims 1-10 by executing the executable instructions.

12. A computer-readable storage medium on which computer instructions are stored, wherein the instructions, when executed by a processor, implement the steps of the method according to any one of claims 1-10.