CN115048367B - Method, device, terminal and storage medium for determining target set

Info

Publication number: CN115048367B
Application number: CN202210626216.3A
Authority: CN
Inventors: 韩哲; 蒋嘉琦; 陈鑫; 吴浩然; 李亚朋
Original assignee: Du Xiaoman Technology Beijing Co Ltd
Current assignee: Du Xiaoman Technology Beijing Co Ltd
Priority date: 2022-06-02
Filing date: 2022-06-02
Publication date: 2025-08-05
Anticipated expiration: 2042-06-02
Also published as: CN115048367A

Abstract

The present application discloses a method, device, terminal and storage medium for determining a target set, including: receiving a first data source and a second data source; determining a first combination column and a second combination column based on the first data source and the second data source respectively; analyzing the first combination column and the second combination column in a negotiated interactive manner to determine a separator; determining a first index number set corresponding to the first combination column and a second index number set corresponding to the second combination column based on the first combination column, the second combination column and the separator, so as to obtain a target set through the first index number set and the second index number set. The present invention flexibly selects multiple columns of data for combination to obtain a combination column corresponding to the multiple columns of data. The target set can be obtained without manually converting the multiple columns of data in the combination column into a single column of data. The user can flexibly select and freely combine column data to form a combination column according to needs, and conveniently and efficiently implement the intersection operation of different combination columns.

Description

Target set determining method, device, terminal and storage medium

Technical Field

The present application relates to the field of data security technologies, and in particular, to a method, an apparatus, a terminal, and a storage medium for determining a target set.

Background

Private set intersection (PSI) means that the participating parties obtain the intersection of the data they hold without revealing any additional information. Before multiple vendors perform a joint computation, PSI is typically used to find the samples common to the parties' data without revealing information about the non-common samples.

At present, private set intersection over a combination of multiple columns of data (i.e. a column combination) is generally performed indirectly: the user cleans the data of each column combination, converts each cleaned column combination into single-column data, and then inputs the single-column data into the system to perform the PSI operation.

However, performing private set intersection on multi-column data combinations in this way involves cumbersome steps and is therefore inefficient.

Disclosure of Invention

The application mainly aims to provide a method, a device, a terminal and a storage medium for determining a target set, so as to solve the problem of low efficiency in the related art.

To achieve the above object, in a first aspect, the present application provides a method for determining a target set, including:

receiving a first data source and a second data source;

determining a first combined column and a second combined column based on the first data source and the second data source, respectively, wherein the combined column is formed by combining a plurality of columns of data;

analyzing the first combined column and the second combined column by utilizing a negotiation interaction mode, and determining a separator;

and determining a first index number set corresponding to the first combined column and a second index number set corresponding to the second combined column based on the first combined column, the second combined column and the separator, so as to obtain a target set through the first index number set and the second index number set.

In one possible implementation, determining the first combined column and the second combined column based on the first data source and the second data source, respectively, includes:

generating a first data table and a second data table based on the first data source and the second data source, respectively;

selecting a preset number of columns of data from the first data table and the second data table respectively to obtain a preset number of first column data and a preset number of second column data;

and combining the preset number of first column data and the preset number of second column data respectively to obtain a first combined column and a second combined column.

In one possible implementation, a first data source is sent by a first client and a second data source is sent by a second client;

analyzing the first combination column and the second combination column by utilizing a negotiation interaction mode to determine separators, wherein the method comprises the following steps:

Determining a separator based on the first combined column and the second combined column in the case where the first client is a negotiation initiator;

In the case where the second client is the negotiation initiator, the separator is determined based on the first combined column and the second combined column.

In one possible implementation, in a case where the first client is a negotiation initiator, determining the separator based on the first combined column and the second combined column includes:

determining a first character difference set and a second character difference set based on the first combined column and the second combined column, respectively;

if any difference set of the first character difference set or the second character difference set is empty, acquiring a current time stamp, and determining a separator based on the current time stamp, wherein the separator is obtained by sequentially performing character string conversion, hash operation and character string interception on the current time stamp;

if the first character difference set and the second character difference set are not empty, selecting any character from the first character difference set as a target character;

and if the target character exists in the second character difference set, taking the target character as a separator.

In one possible implementation, the method further includes:

If the second character difference set does not have the target character, taking the second client as a negotiation initiator, and selecting any character from the second character difference set as the target character;

If the target character exists in the first character difference set, the target character is used as a separator;

And if the target character does not exist in the first character difference set, repeating the step of selecting any character from the first character difference set as the target character.

In one possible implementation, in a case where the second client is a negotiation initiator, determining the separator based on the first combined column and the second combined column includes:

If the first character difference set and the second character difference set are not empty, selecting any character from the second character difference set as a target character;

and if the target character exists in the first character difference set, taking the target character as a separator.

In one possible implementation, the method further includes:

if the first character difference set does not have the target character, the first client is used as a negotiation initiator, and any character is selected from the first character difference set to be used as the target character;

If the target character exists in the second character difference set, the target character is used as a separator;

and if the second character difference set does not have the target character, repeating the step of selecting any character from the second character difference set as the target character.

In one possible implementation, determining the first character difference set and the second character difference set based on the first combined column and the second combined column, respectively, includes:

Counting all characters in the first combination column to form a first character set, and differencing the preset character set and the first character set to obtain a first character difference set;

Counting all characters in the second combination column to form a second character set, and differencing the preset character set and the second character set to obtain a second character difference set.

In one possible implementation, determining, based on the first combined column, the second combined column, and the separator, a first set of index numbers corresponding to the first combined column and a second set of index numbers corresponding to the second combined column includes:

preprocessing the first combination column, the second combination column and the separator to obtain first combination data corresponding to the first combination column, a third index number set corresponding to the first combination data, and a second combination data corresponding to the second combination column and a fourth index number set corresponding to the second combination data;

and carrying out intersection operation on the first combined data and the second combined data, and combining the third index number set and the fourth index number set to obtain a first index number set corresponding to the first combined column and a second index number set corresponding to the second combined column.

In a second aspect, an embodiment of the present invention provides a device for determining a target set, including:

the data receiving module is used for receiving the first data source and the second data source;

a combined column determining module for determining a first combined column and a second combined column based on the first data source and the second data source, respectively, wherein the combined column is formed by combining a plurality of columns of data;

the separator determining module is used for analyzing the first combined column and the second combined column by utilizing a negotiation interaction mode to determine a separator;

the target set determining module is used for determining a first index number set corresponding to the first combined column and a second index number set corresponding to the second combined column based on the first combined column, the second combined column and the separator, so that a target set is obtained through the first index number set and the second index number set.

In a third aspect, an embodiment of the present invention provides a terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of any one of the above methods for determining a target set when executing the computer program.

In a fourth aspect, embodiments of the present invention provide a computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of any one of the above methods for determining a target set.

The embodiments of the invention provide a method, a device, a terminal and a storage medium for determining a target set, comprising: receiving a first data source and a second data source; determining a first combined column and a second combined column based on the first data source and the second data source respectively; analyzing the first combined column and the second combined column through negotiation interaction to determine a separator; and determining, based on the first combined column, the second combined column and the separator, a first index number set corresponding to the first combined column and a second index number set corresponding to the second combined column, so as to obtain the target set through the two index number sets. In this method, multiple columns of data are flexibly selected and combined to obtain the combined columns (namely the first combined column and the second combined column), and the multiple columns of data in a combined column need not be manually converted into single-column data. Private set intersection is performed directly on the first combined column and the second combined column to obtain the first index number set and the second index number set, and the corresponding data are then queried directly by index number to obtain the target set. The user can flexibly select and freely combine column data as needed to form combined columns, which achieves automated, integrated and flexible intersection of multi-column data combinations without manual data cleaning.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application, are incorporated in and constitute a part of this specification. The drawings and their description are illustrative of the application and are not to be construed as unduly limiting the application. In the drawings:

FIG. 1 is a flowchart of a method for determining a target set according to an embodiment of the present invention;

FIG. 2 is a flowchart of an implementation of a method for determining a target set according to another embodiment of the present invention;

FIG. 3 is a schematic diagram of a data table formed by storing source data according to an embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating the selecting and numbering operations of column data according to an embodiment of the present invention;

FIG. 5 is a flowchart of a method for negotiating delimiters according to an embodiment of the present invention;

FIG. 6 is a flowchart of separator negotiation with the first client (A) as the negotiation initiator, provided by an embodiment of the present invention;

FIG. 7 is a flowchart of separator negotiation with the second client (B) as the negotiation initiator, provided by an embodiment of the present invention;

FIG. 8 is a schematic diagram of a data table formed by the intersection preprocessing according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of PSI operation results provided by an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of a device for determining a target set according to an embodiment of the present invention;

FIG. 11 is a schematic diagram of a terminal according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein.

It should be understood that, in the various embodiments of the present invention, the sequence number of each process does not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present invention.

It should be understood that in the present invention, "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements that are expressly listed or inherent to such process, method, article, or apparatus.

It should be understood that in the present invention, "plurality" means two or more. "And/or" merely describes an association between the associated objects and means that three relationships may exist; for example, "A and/or B" covers three cases: A alone, A and B together, and B alone. The character "/" generally indicates an "or" relationship between the objects before and after it. "Comprising A, B and C" and "comprising A, B, C" mean that all three of A, B and C are comprised; "comprising A, B or C" means that one of A, B and C is comprised; "comprising A, B and/or C" means that any one, any two or all three of A, B and C are comprised.

It should be understood that in the present invention, "B corresponding to A" or "A corresponding to B" means that B is associated with A and that B can be determined from A. Determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information. A matching B means that the similarity between A and B is greater than or equal to a preset threshold.

As used herein, "if" may be interpreted, depending on the context, as "when", "upon", "in response to determining" or "in response to detecting".

The technical scheme of the invention is described in detail below by specific examples. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the following description will be made by way of specific embodiments with reference to the accompanying drawings.

In one embodiment, as shown in fig. 1, a method for determining a target set is provided, including the following steps:

Step S101, receiving a first data source and a second data source.

Referring to FIG. 2, the two data sources on which private set intersection is performed in the present application are a first data source sent by a first client A and a second data source sent by a second client B. Once received, the first data source and the second data source are imported, and a first data table and a second data table are generated respectively after the import. The imported data sources may take the form of CSV files, MySQL, Hive and the like.

Step S102, determining a first combined column and a second combined column based on the first data source and the second data source respectively.

The combined column is formed by combining multiple columns of data. A first data table and a second data table, each containing multiple columns of data, are generated based on the first data source and the second data source respectively. Determining the first combined column is described taking a first data table with N columns of data as an example. Specifically, 3 columns, namely column 1, column 4 and column 10, are selected from the first data table; the selected columns are then numbered and combined in numbering order to form the first combined column, i.e. column 1, column 4 and column 10. The second combined column is formed from the second data source in a similar manner, which is not specifically limited here.

Step S103, analyzing the first combined column and the second combined column by utilizing a negotiation interaction mode, and determining the separator.

The negotiation interaction mode means that the first client and the second client determine the separator through a negotiation mode, that is, the obtained separator needs to be determined by the first client and the second client together and is approved by both the first client and the second client.

Step S104, based on the first combination column, the second combination column and the separator, determining a first index number set corresponding to the first combination column and a second index number set corresponding to the second combination column, so as to obtain a target set through the first index number set and the second index number set.

The embodiment of the invention provides a method for determining a target set, comprising: receiving a first data source and a second data source; determining a first combined column and a second combined column based on the first data source and the second data source respectively; analyzing the first combined column and the second combined column through negotiation interaction to determine a separator; and determining, based on the first combined column, the second combined column and the separator, a first index number set corresponding to the first combined column and a second index number set corresponding to the second combined column, so as to obtain the target set through the two index number sets. In this method, multiple columns of data are flexibly selected and combined to obtain the combined columns (namely the first combined column and the second combined column), and the multiple columns of data in a combined column need not be manually converted into single-column data. Private set intersection is performed directly on the first combined column and the second combined column to obtain the first index number set and the second index number set, and the corresponding data are then queried directly by index number to obtain the target set. The user can flexibly select and freely combine column data as needed to form combined columns, so that private set intersection of different combined columns is implemented conveniently and efficiently.
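The index-number bookkeeping in step S104 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the names preprocess and intersect_indexes and the sample rows are hypothetical, and the intersection is an ordinary in-memory set intersection standing in for a real PSI protocol, so it provides no privacy.

```python
from typing import Dict, List, Sequence, Tuple

# (index number, values of the selected columns in numbering order)
Row = Tuple[int, Sequence[str]]


def preprocess(rows: List[Row], separator: str) -> Dict[str, List[int]]:
    """Join each row's values with the separator and remember its index number."""
    combined: Dict[str, List[int]] = {}
    for index, values in rows:
        combined.setdefault(separator.join(values), []).append(index)
    return combined


def intersect_indexes(
    rows_a: List[Row], rows_b: List[Row], separator: str
) -> Tuple[List[int], List[int]]:
    """Return the index numbers, per side, of the rows whose combined data match."""
    data_a = preprocess(rows_a, separator)
    data_b = preprocess(rows_b, separator)
    # Assumption: a plain set intersection replaces the PSI protocol here.
    common = data_a.keys() & data_b.keys()
    index_a = sorted(i for key in common for i in data_a[key])
    index_b = sorted(i for key in common for i in data_b[key])
    return index_a, index_b


rows_a = [(0, ("alice", "1990", "bj")), (1, ("bob", "1985", "sh"))]
rows_b = [(0, ("carol", "1970", "gz")), (1, ("alice", "1990", "bj"))]
index_a, index_b = intersect_indexes(rows_a, rows_b, "|")
print(index_a, index_b)
```

Each side can then look up its own rows by the returned index numbers to assemble the target set.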

In one embodiment, step S102 includes:

step S201, a first data table and a second data table are generated based on the first data source and the second data source respectively.

After a data source is imported, a data table corresponding to it is generated automatically; that is, a first data table corresponding to the first data source and a second data table corresponding to the second data source are generated, and the imported data sources are stored in the form of the data table shown in FIG. 3. Specifically, the first row contains the index header "index" followed by the column names column-1, column-2 and so on of the data source. Each remaining row consists of a unique index value and the data values, where the index is generated automatically and in order.
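As a rough illustration of the table layout described above, the following Python sketch imports CSV text and prepends an automatically generated index. The function name build_table, the sample data and the choice to start the index at 0 are assumptions made for the example; the source does not specify them.

```python
import csv
import io
from typing import List


def build_table(csv_text: str) -> List[list]:
    """Store imported CSV data as rows prefixed by an auto-generated index."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # First row: the index header followed by the source's column names.
    table: List[list] = [["index"] + header]
    # Remaining rows: a unique, ordered index value plus the data values.
    for index, row in enumerate(reader):
        table.append([index] + row)
    return table


source = "column-1,column-2\nx1,y1\nx2,y2\n"
table = build_table(source)
for row in table:
    print(row)
```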

Step S202, selecting preset number of column data from a first data table and a second data table respectively to obtain preset number of first column data and preset number of second column data;

step S203, respectively combining the first column data with the preset number and the second column data with the preset number to obtain a first combined column and a second combined column.

The terms "first column data" and "second column data" refer to column data taken from the first data table and the second data table respectively; they do not denote one particular column in those tables.

The selection and numbering of column data is described with reference to fig. 4, and the tables to which the column data belongs are distinguished by columnA and columnB, i.e. columnA represents the column corresponding to the first table and columnB represents the column corresponding to the second table. Specifically, columnA-2, columnA-50 and columnA-52 are selected from the first data table and numbered 1, 3 and 2 respectively, columnB-1, columnB-3 and columnB-82 are selected from the second data table and numbered 2, 1 and 3 respectively, namely the first combination columns columnA-2, columnA-52 and columnA-50 and the second combination columns columnB-3, columnB-1 and columnB-82 are finally obtained.
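The selection and numbering just described can be sketched in Python as follows. The helper name combine_columns and the single sample record are hypothetical; only the column names and numbers come from the example above.

```python
from typing import Dict, List, Tuple


def combine_columns(
    table: List[Dict[str, str]], numbering: Dict[str, int]
) -> Tuple[List[str], List[Tuple[str, ...]]]:
    """Order the selected columns by their assigned number and project each record."""
    ordered = sorted(numbering, key=numbering.get)
    rows = [tuple(record[name] for name in ordered) for record in table]
    return ordered, rows


# One illustrative record of the first data table (values are made up).
table_a = [{"columnA-2": "x", "columnA-50": "z", "columnA-52": "y"}]
numbering_a = {"columnA-2": 1, "columnA-50": 3, "columnA-52": 2}

ordered_a, rows_a = combine_columns(table_a, numbering_a)
print(ordered_a)
print(rows_a)
```

Sorting by the assigned number rather than by position in the table is what yields the order columnA-2, columnA-52, columnA-50 for the first combined column.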

After the first combined column and the second combined column are determined, the separator is determined through negotiation interaction. The quality of the intersection result in the present application depends on the separator so determined: the values of the columns in the first combined column and the second combined column are spliced into a character string (with the separator between them) for comparison, and if the separator appears in the data values of any column, the accuracy of the PSI result is affected.

As shown in connection with FIG. 4, the values in columns columnA-2, columnA-52 and columnA-50 of party A must match, one to one, the values in columns columnB-3, columnB-1 and columnB-82 of party B. Assume, for example, that party A has a record whose values in columnA-2, columnA-52 and columnA-50 are "a,", "b" and "c" respectively, and that party B has a record whose values in columnB-3, columnB-1 and columnB-82 are "a", ",b" and "c" respectively. These two records obviously do not match, so the correct PSI result for the column combination is a miss. However, if "," is chosen as the separator, the comparison method of the present application converts the records of both party A and party B into the string "a,,b,c", and the intersection reports a hit. To avoid such errors, the present application adds a separator negotiation step to ensure that the separator selected by A and B does not appear in the data values of either party.
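The collision risk can be reproduced with a few lines of Python. The record values below are chosen for illustration only (they are an assumption, not data from the source): the two records differ column by column, yet joining them with a separator that also occurs inside the values produces identical strings.

```python
from typing import Sequence


def join_record(record: Sequence[str], separator: str) -> str:
    """Splice the column values of one record into a single comparison string."""
    return separator.join(record)


# Illustrative records: they differ in the first and second columns.
record_a = ("a,", "b", "c")
record_b = ("a", ",b", "c")

# With "," as separator both records collapse to the same string: a false hit.
collides = join_record(record_a, ",") == join_record(record_b, ",")

# With a separator that appears in neither record, the mismatch is preserved.
safe = join_record(record_a, "|") != join_record(record_b, "|")

print(collides, safe)
```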

In an embodiment, referring to FIGS. 5 to 7, since the first data source is sent by the first client and the second data source is sent by the second client, the specific implementation of step S103 is described separately for each client acting as the negotiation initiator.

In the case where the first client is a negotiation initiator, determining the separator based on the first combined column and the second combined column comprises:

(1) A first character difference set and a second character difference set are determined based on the first combined column and the second combined column, respectively.

Specifically, all characters in the first combination column are counted to form a first character set, the preset character set and the first character set are subjected to difference to obtain a first character difference set, all characters in the second combination column are counted to form a second character set, and the preset character set and the second character set are subjected to difference to obtain a second character difference set.

A denotes the first client and B denotes the second client. Party A and party B each scan the data values of their PSI combined column, that is, party A scans the data values in the first combined column and party B scans the data values in the second combined column. Each party counts all characters appearing in its data values, forming the first character set A_CharSet and the second character set B_CharSet.

Assuming that the character set formed by the ASCII code table is ASC, the differences between ASC and the sets A_CharSet and B_CharSet are taken respectively, yielding the first character difference set A_EXCEPT = ASC - A_CharSet and the second character difference set B_EXCEPT = ASC - B_CharSet.
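A minimal Python sketch of this scan-and-difference step is given below. It assumes the preset character set ASC is the 128 ASCII code points, as in the text; the function name character_difference_set and the sample row are hypothetical.

```python
from typing import FrozenSet, Iterable, Sequence, Set

# Preset character set: the 128 code points of the ASCII table.
ASC: FrozenSet[str] = frozenset(chr(code) for code in range(128))


def character_difference_set(
    combined_rows: Iterable[Sequence[str]], preset: FrozenSet[str] = ASC
) -> Set[str]:
    """Return the preset characters that never occur in the combined column."""
    used = {ch for row in combined_rows for value in row for ch in value}
    return set(preset - used)


# Illustrative first combined column with a single row.
rows_a = [("a,", "b")]
a_except = character_difference_set(rows_a)
print("," in a_except, "|" in a_except, len(a_except))
```

Any character left in the difference set is safe to use as a separator for that party's data, which is why the negotiation only ever proposes characters from these sets.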

(2) If either the first character difference set or the second character difference set is empty, a current time stamp is obtained, and a separator is determined based on the current time stamp.

The separator is obtained by sequentially performing character string conversion, hash operation and character string interception on the current timestamp, and the separator determining step is described in a specific embodiment below.

As shown in FIG. 6, when party A is the negotiation initiator and party B is the participant, if at least one of the first character difference set A_EXCEPT and the second character difference set B_EXCEPT is an empty set, party A obtains the current timestamp and converts it into string form, Str(GetCurrentTimeMillis()), and then performs a hash operation (including, but not limited to, MD5, SHA1, SHA256, etc.) on the string to obtain Hash(Str(GetCurrentTimeMillis())). Finally, the character string composed of the first 16 bytes of the hash result, namely Str(Byte_(0,15)[Hash(Str(GetCurrentTimeMillis()))]), is taken as the final negotiated separator and sent to party B, and the flow of party A ends. Party B receives Str(Byte_(0,15)[Hash(Str(GetCurrentTimeMillis()))]) from party A and uses it as the final negotiated separator, and the flow of party B ends.
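The timestamp fallback can be sketched in Python as follows. Two points are assumptions made for the example: SHA-256 is picked from the hash functions the text lists, and "the first 16 bytes" is interpreted as the first 16 characters of the hexadecimal digest so that the separator is printable. The function name fallback_separator is hypothetical.

```python
import hashlib
import time
from typing import Optional


def fallback_separator(now_ms: Optional[int] = None) -> str:
    """Derive a separator from the current timestamp: string -> hash -> truncate."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # current time in milliseconds
    text = str(now_ms)  # character string conversion
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()  # hash operation
    return digest[:16]  # character string interception (assumed: hex characters)


separator = fallback_separator()
print(separator)
```

A 16-character hash prefix is unlikely, though not guaranteed, to occur inside either party's data values, which is presumably why it serves as the fallback when no single free character exists.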

(3) And if the first character difference set and the second character difference set are not empty, selecting any character from the first character difference set as a target character, and if the target character exists in the second character difference set, taking the target character as a separator.

Referring to FIG. 6, when party A is the negotiation initiator and party B is the participant, if neither the first character difference set A_EXCEPT nor the second character difference set B_EXCEPT is empty, a character a is selected from the set A_EXCEPT and sent to party B.

After receiving the character a sent by party A, party B judges whether a is in the second character difference set B_EXCEPT. If so, party B confirms a as the final negotiated separator and feeds back to party A the conclusion that a can be used as the separator, and the flow of party B ends; party A then also confirms a as the final negotiated separator, and the flow of party A ends.

(4) If the target character does not exist in the second character difference set, party B feeds back to party A that the target character cannot be used as a separator, party A removes the target character from the first character difference set, and the flow with party B as the negotiation initiator is executed (see the flow with the second client as the negotiation initiator for details). If B's negotiation does not succeed either, A and B continue to alternate as negotiation initiator until a separator is negotiated.

If a is not in the second character difference set B_EXCEPT, the separator is determined by having party B and party A take turns as the negotiation initiator. Specifically, when party A starts as the negotiation initiator, party A first acts as the initiator of the negotiation flow shown in FIG. 6; if that flow does not end with a separator, party B acts as the initiator and negotiation is carried out again according to the flow shown in FIG. 7; if the negotiation still has not ended, party A again acts as the initiator. The case where party B starts as the negotiation initiator is similar and is not described again here.

Referring to figs. 5 to 7, when party A is the negotiation initiator, an arbitrary character a is selected from A_EXCEPT and sent to party B. If a is not in the second character difference set B_EXCEPT, party B becomes the negotiation initiator; if neither the first character difference set A_EXCEPT nor the second character difference set B_EXCEPT is empty, a character b is selected from B_EXCEPT and sent to party A.

After receiving the character b from party B, party A checks whether b is in the first character difference set A_EXCEPT. If so, party A confirms b as the finally negotiated separator, reports to party B that b can be used as the separator, and ends its flow; party B then also confirms b as the finally negotiated separator and ends its flow.

If b is not in the first character difference set A_EXCEPT, party A becomes the negotiation initiator again. If either the first character difference set A_EXCEPT or the second character difference set B_EXCEPT is empty, the current timestamp is obtained and the separator is determined based on it; if neither set is empty, a character c is selected from A_EXCEPT and sent to party B. Parties A and B take turns as negotiation initiator until the separator is determined.

In addition, when party A is the negotiation initiator and the target character a is not in the second character difference set, a is removed from the first character difference set, and the check then continues with other characters until the separator is determined.
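As a non-limiting illustration, the alternating negotiation described above can be sketched in Python as follows. The sketch simulates both parties in one process, whereas in the described flow each party only sees the single character sent to it. The function name `negotiate_separator`, the `fallback` callable and the use of `min()` to pick "any" character are assumptions made for this example only.

```python
def negotiate_separator(a_except, b_except, fallback, a_starts=True):
    """Alternate initiators until a character found in both difference sets is agreed.

    a_except / b_except: characters that do NOT occur in each party's combination column.
    fallback: zero-argument callable used when either set is empty
              (the text derives this value from the current timestamp).
    """
    a_left, b_left = set(a_except), set(b_except)  # work on copies
    initiator_is_a = a_starts
    while True:
        if not a_left or not b_left:
            return fallback()
        own, other = (a_left, b_left) if initiator_is_a else (b_left, a_left)
        # "Any character" may be chosen; min() only makes this sketch deterministic.
        candidate = min(own)
        if candidate in other:
            return candidate          # both parties can use it as the separator
        own.discard(candidate)        # rejected: the initiator removes it from its set
        initiator_is_a = not initiator_is_a  # the other party initiates next


separator = negotiate_separator({"|", "#"}, {"|", "~"}, fallback=lambda: "FALLBACK")
```

Each round either ends the negotiation or removes one character from a finite set, so the loop terminates either with a shared character or with the timestamp-based fallback.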

In the case where the second client is the negotiation initiator, determining the separator based on the first combined column and the second combined column includes:

(1) If either the first character difference set or the second character difference set is empty, a current time stamp is obtained, and a separator is determined based on the current time stamp. The separator is obtained by sequentially performing character string conversion, hash operation and character string interception on the current timestamp.

(2) If neither the first character difference set nor the second character difference set is empty, any character is selected from the second character difference set as the target character, and if the target character exists in the first character difference set, the target character is used as the separator.

(3) If the first character difference set does not have the target character, any character is selected from the first character difference set as the target character, if the second character difference set has the target character, the target character is used as the separator, and if the second character difference set does not have the target character, the step of selecting any character from the second character difference set as the target character is repeatedly executed.

Steps (1) - (3) in this embodiment are the same as steps (2) - (4) in determining the separator if the first client is the negotiation initiator, and are not described here again.
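As a non-limiting illustration of the timestamp-based fallback in step (1), the following Python sketch applies string conversion, a hash operation and string truncation in that order. The text does not name a hash function or a separator length, so the use of SHA-256, the default length of 8 and the function name `timestamp_separator` are assumptions for this example; how the two parties come to share the same timestamp is likewise not covered here.

```python
import hashlib
import time


def timestamp_separator(timestamp=None, length=8):
    """Derive a separator from a timestamp: string conversion, hash, truncation."""
    if timestamp is None:
        timestamp = time.time()           # current timestamp
    text = str(timestamp)                 # 1. character string conversion
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()  # 2. hash (SHA-256 assumed)
    return digest[:length]                # 3. character string interception


fixed = timestamp_separator(1654128000)   # same input always yields the same separator
```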

In one embodiment, step S104 includes:

Step S301, preprocessing the first combination column, the second combination column and the separator to obtain first combination data corresponding to the first combination column, a third index number set corresponding to the first combination data, and a second combination data corresponding to the second combination column and a fourth index number set corresponding to the second combination data.

Specifically, assume that columnA-2, columnA-52 and columnA-50 in the first combination column have the record values "a", "b" and "c" respectively, that columnB-3, columnB-1 and columnB-82 in the second combination column have the record values "a", "b" and "c" respectively, and that "|" is used as the separator. Then one piece of combination data formed from the first combination column is "a|b|c", and one piece of combination data formed from the second combination column is also "a|b|c". Joining every record in the first and second combination columns with the separator in this way yields the first and second combination data.
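As a non-limiting illustration, the splicing in this example can be sketched in Python as follows. The helper name `splice` is an assumption for this example. The second half shows what happens when the separator already occurs in a data value, which is presumably why the negotiation selects a character absent from both combination columns.

```python
def splice(values, separator):
    """Join one record's selected column values into a single combination value."""
    return separator.join(str(v) for v in values)


# The example from the text: three column values joined with "|".
first = splice(["a", "b", "c"], "|")
second = splice(["a", "b", "c"], "|")

# If a value already contains the separator, two different records
# collapse to the same spliced value and would be matched by mistake.
collision_left = splice(["a|b", "c"], "|")
collision_right = splice(["a", "b|c"], "|")
```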

Since the first combination column and the second combination column are selected from fig. 3, the index numbers corresponding to each value in the first combination column and the second combination column can be found from fig. 3, and are the third index number set and the fourth index number set respectively.

The step of preprocessing the first combination column and the second combination column comprises the steps of establishing a first empty table corresponding to the first combination column and a second empty table corresponding to the second combination column, and then respectively inserting combination data formed in the first combination column and the second combination column and corresponding index numbers into the first empty table and the second empty table according to the header fields in the first empty table and the second empty table to obtain a third data table and a fourth data table.

Specifically, two temporary empty data tables, namely the first empty table and the second empty table, are created. Each table has two header fields, index-set and column-group, which represent the index number set and the spliced column-combination data value (that is, the combination data) respectively.

The first data table and the second data table are traversed row by row. According to the column combination and column numbers selected by the user, the data values are taken out one by one and spliced, with the separator between them, into a combined data value value-group, which is inserted into the empty table together with the index number index. The completed data table is shown in fig. 8.

When data is inserted, in order to avoid repeated insertion of the same combined data, whether the value-group to be inserted exists in the data value corresponding to the column-group column in the data table is checked. If not, the index and the value-group are directly inserted, and if so, the corresponding record is found in the data table, and the index is added into the index-set of the corresponding row.
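As a non-limiting illustration, the preprocessing with de-duplicated insertion can be sketched in Python as follows. A dictionary that maps each value-group to its index-set stands in for the temporary table with the header fields column-group and index-set; the function name `build_intersection_table` and the sample rows are assumptions for this example.

```python
def build_intersection_table(rows, columns, separator):
    """Return {value_group: index_set} for one party's data table.

    rows: iterable of (index, record) pairs, where record maps column name to value.
    columns: the column names selected by the user, in splicing order.
    """
    table = {}
    for index, record in rows:
        value_group = separator.join(str(record[c]) for c in columns)
        # If the value-group already exists, only its index-set grows;
        # otherwise a new row is created with a one-element index-set.
        table.setdefault(value_group, set()).add(index)
    return table


rows = [
    (0, {"x": "a", "y": "b"}),
    (1, {"x": "c", "y": "d"}),
    (2, {"x": "a", "y": "b"}),   # duplicate of row 0 after splicing
]
table = build_intersection_table(rows, ["x", "y"], "|")
```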

Step S302, carrying out intersection operation on the first combined data and the second combined data, and combining the third index number set and the fourth index number set to obtain a first index number set corresponding to the first combined column and a second index number set corresponding to the second combined column.

The intersection operation refers to private set intersection.

After the above preprocessing, a private set intersection (PSI) operation may be performed to determine the index number sets, as shown in fig. 9, which specifically includes the following steps:

(1) After the intersection preprocessing, party A and party B generate a third data table TB-A and a fourth data table TB-B respectively.

(2) A private set intersection operation is performed on the column-group column data of the data tables TB-A and TB-B. Solutions commonly used in the art may be adopted, including but not limited to oblivious transfer, hashing, public key encryption, garbled circuits and homomorphic encryption, and all combined data values (value-group) in the intersection are recorded. For example, as shown in fig. 9, the combined data values in rows 1 and 3 of party A are PSI hits, and the combined data values in rows 2 and 3 of party B are PSI hits.

(3) According to the combined data values (value-group) in the intersection, the corresponding record rows are located in the data tables TB-A and TB-B, and all corresponding index-sets are found. The index-set values of all intersection rows of party A and of party B are aggregated separately to form the index sets IndexSet_PSI(A) and IndexSet_PSI(B) obtained from the combination intersection. For example, as shown in fig. 9, IndexSet_PSI(A) obtained by party A from the combination intersection is {0,3,4}, and IndexSet_PSI(B) obtained by party B from the combination intersection is {2,4,5,6,7}.
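As a non-limiting illustration, steps (1) to (3) can be sketched in Python as follows. A plain set intersection is used here only as a stand-in for the PSI protocol, so this sketch provides no privacy protection. The function name `index_sets_from_intersection` is an assumption, and the contents of `tb_a` and `tb_b` are made up so that the aggregated index sets match the values quoted from fig. 9; they are not the actual contents of that figure.

```python
def index_sets_from_intersection(tb_a, tb_b):
    """Aggregate the index-sets of all value-groups present in both tables.

    tb_a / tb_b: {value_group: index_set} tables produced by the preprocessing.
    """
    # Stand-in for the PSI protocol: a plain intersection of the column-group values.
    common = tb_a.keys() & tb_b.keys()
    index_set_a = set().union(*(tb_a[v] for v in common))
    index_set_b = set().union(*(tb_b[v] for v in common))
    return common, index_set_a, index_set_b


# Hypothetical tables: rows 1 and 3 of A and rows 2 and 3 of B are in the intersection.
tb_a = {"a|b|c": {0, 3}, "x|y|z": {1, 2}, "d|e|f": {4}}
tb_b = {"q|r|s": {0, 1, 3}, "a|b|c": {2, 4, 5}, "d|e|f": {6, 7}}
common, index_set_psi_a, index_set_psi_b = index_sets_from_intersection(tb_a, tb_b)
```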

Then, by traversing the index values in IndexSet_PSI(A) and IndexSet_PSI(B) respectively and using them as index numbers, the source data row information corresponding to the PSI column-combination intersection result can be obtained by looking them up in the data tables generated when the source data were imported (as shown in fig. 3). The effect of the PSI combination intersection is equivalent to requiring that the columnA-2 data value equals the columnB-3 data value, the columnA-52 data value equals the columnB-1 data value, and the columnA-50 data value equals the columnB-82 data value.

The above results may be downloaded, that is, after the PSI operation is completed, the user may view the statistical information of the multi-column combined intersection result, and may download the record row data in the intersection in the data table (as shown in fig. 3), so as to perform the result analysis.

It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present invention in any way.

The following are device embodiments of the invention, for details not described in detail therein, reference may be made to the corresponding method embodiments described above.

Fig. 10 shows a schematic structural diagram of a target set determining device according to an embodiment of the present invention. For convenience of explanation, only the portion related to the embodiment of the present invention is shown. The target set determining device includes a data receiving module 1001, a combined column determining module 1002, a separator determining module 1003, and a target set determining module 1004, specifically as follows:

A data receiving module 1001, configured to receive a first data source and a second data source;

a combined column determining module 1002, configured to determine a first combined column and a second combined column based on the first data source and the second data source, where the combined column is formed by combining multiple columns of data;

The separator determining module 1003 is configured to analyze the first combined column and the second combined column by using a negotiation interaction manner, and determine a separator;

the target set determining module 1004 is configured to determine, based on the first combined column, the second combined column, and the separator, a first index number set corresponding to the first combined column and a second index number set corresponding to the second combined column, so as to obtain a target set through the first index number set and the second index number set.

In one possible implementation, the combined column determination module 1002 includes:

the table generation sub-module is used for generating a first data table and a second data table based on the first data source and the second data source respectively;

The column data selecting sub-module is used for selecting preset number of column data from the first data table and the second data table respectively to obtain preset number of first column data and preset number of second column data;

and the column data combination sub-module is used for respectively combining the first column data with the preset number and the second column data with the preset number to obtain a first combination column and a second combination column.

The separator determination module 1003 includes:

A first negotiation submodule, configured to determine a separator based on the first combination column and the second combination column in a case where the first client is a negotiation initiator;

and the second negotiation submodule is used for determining the separator based on the first combination column and the second combination column under the condition that the second client is a negotiation initiator.

In one possible implementation, the first negotiation submodule includes:

A first character difference set determining unit configured to determine a first character difference set and a second character difference set based on the first combination column and the second combination column, respectively;

The first judging unit is used for acquiring a current time stamp if any difference set of the first character difference set or the second character difference set is empty, and determining a separator based on the current time stamp, wherein the separator is obtained by sequentially performing character string conversion, hash operation and character string interception on the current time stamp;

The second judging unit is used for selecting any character from the first character difference set as a target character if the first character difference set and the second character difference set are not empty;

And the third judging unit is used for taking the target character as a separator if the target character exists in the second character difference set.

In one possible implementation, the first negotiation submodule further includes:

A fourth judging unit, configured to select any character from the second character difference set as a target character if the second character difference set does not have the target character;

A fifth judging unit, configured to take the target character as a separator if the target character exists in the first character difference set;

and a sixth judging unit for repeating the step of selecting any character from the first character difference set as the target character if the target character does not exist in the first character difference set.

In one possible implementation, the second negotiation sub-module comprises:

a seventh judging unit, configured to obtain a current timestamp if any one of the first character difference set or the second character difference set is empty, and determine a separator based on the current timestamp, where the separator is obtained by sequentially performing character string conversion, hash operation, and character string interception on the current timestamp;

an eighth judging unit, configured to select any character from the second character difference set as a target character if neither the first character difference set nor the second character difference set is empty;

And a ninth judging unit, configured to take the target character as the separator if the target character exists in the first character difference set.

In one possible implementation, the second negotiation submodule further includes:

A tenth judging unit, configured to select any character from the first character difference set as a target character if the target character does not exist in the first character difference set;

An eleventh judging unit for taking the target character as a separator if the target character exists in the second character difference set;

and a twelfth judging unit for repeating the step of selecting any character from the second character difference set as the target character if the target character does not exist in the second character difference set.

In one possible implementation, the first character difference set determining unit or the second character difference set determining unit includes:

The first statistics subunit is used for counting all characters in the first combination column to form a first character set, and differencing the preset character set and the first character set to obtain a first character difference set;

And the second statistics subunit is used for counting all characters in the second combination column to form a second character set, and performing difference between the preset character set and the second character set to obtain a second character difference set.
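As a non-limiting illustration, the computation performed by the two statistics subunits can be sketched in Python as follows. The text does not specify the preset character set, so taking ASCII punctuation as the preset is an assumption for this example, as are the names `PRESET_CHARSET` and `character_difference_set`.

```python
import string

# Assumed preset character set: ASCII punctuation (not specified in the text).
PRESET_CHARSET = frozenset(string.punctuation)


def character_difference_set(combination_column, preset=PRESET_CHARSET):
    """Return the preset characters that never occur in the combination column.

    combination_column: iterable of records, each an iterable of column values.
    """
    used = set()
    for record in combination_column:
        for value in record:
            used.update(str(value))   # count every character that appears
    return set(preset) - used         # difference: preset minus used characters


a_except = character_difference_set([("a|b", "c#d")], preset={"|", "#", "~"})
```

Every character in the resulting set is safe to use as a separator for that party's column, since it does not appear in any of its values.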

In one possible implementation, the target set determination module 1004 includes:

the preprocessing submodule is used for preprocessing the first combination column, the second combination column and the separator to obtain first combination data corresponding to the first combination column, a third index number set corresponding to the first combination data, second combination data corresponding to the second combination column and a fourth index number set corresponding to the second combination data;

And the PSI operation sub-module is used for carrying out the intersection operation on the first combined data and the second combined data, and combining the third index number set and the fourth index number set to obtain a first index number set corresponding to the first combined column and a second index number set corresponding to the second combined column.

Fig. 11 is a schematic diagram of a terminal according to an embodiment of the present invention. As shown in fig. 11, the terminal 11 of this embodiment includes a processor 110, a memory 111, and a computer program 112 stored in the memory 111 and executable on the processor 110. When executing the computer program 112, the processor 110 implements the steps in the above embodiments of the target set determining method, for example steps S101 to S104 shown in fig. 1. Alternatively, when executing the computer program 112, the processor 110 implements the functions of the modules/units in the above embodiments of the target set determining device, for example the functions of the modules 1001 to 1004 shown in fig. 10.

The present invention also provides a readable storage medium having a computer program stored therein, which when executed by a processor is configured to implement the method for determining a target set provided in the above-described various embodiments.

The readable storage medium may be a computer storage medium or a communication medium. Communication media include any medium that facilitates transfer of a computer program from one place to another. Computer storage media can be any available media that can be accessed by a general-purpose or special-purpose computer. For example, a readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Alternatively, the readable storage medium may be integral to the processor. The processor and the readable storage medium may reside in an application-specific integrated circuit (ASIC). In addition, the ASIC may reside in a user device. The processor and the readable storage medium may also reside as discrete components in a communication device. The readable storage medium may be a read-only memory (ROM), a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

The present invention also provides a program product comprising execution instructions stored in a readable storage medium. The at least one processor of the device may read the execution instructions from the readable storage medium, the execution instructions being executed by the at least one processor to cause the device to implement the method of determining a set of targets provided by the various embodiments described above.

In the above embodiment of the apparatus, it should be understood that the processor may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the present invention may be executed directly by a hardware processor, or by a combination of hardware and software modules in a processor.

The foregoing embodiments are merely for illustrating the technical solution of the present invention, but not for limiting the same, and although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that the technical solution described in the foregoing embodiments may be modified or substituted for some of the technical features thereof, and that these modifications or substitutions should not depart from the spirit and scope of the technical solution of the embodiments of the present invention and should be included in the protection scope of the present invention.

Claims

1. A method for determining a target set, comprising:

receiving a first data source and a second data source;

Determining a first combined column and a second combined column based on the first data source and the second data source respectively, wherein the combined column is formed by combining multiple columns of data;

Analyzing the first combination column and the second combination column in a negotiation interaction manner to determine a separator;

Based on the first combination column, the second combination column, and the delimiter, determining a first index number set corresponding to the first combination column and a second index number set corresponding to the second combination column, so as to obtain a target set through the first index number set and the second index number set;

The first data source is sent by a first client, and the second data source is sent by a second client;

The analyzing the first combination column and the second combination column in a negotiation interaction manner to determine a separator includes:

In a case where the first client is a negotiation initiator, determining the delimiter based on the first combination column and the second combination column;

In a case where the second client is the negotiation initiator, determining the delimiter based on the first combination column and the second combination column;

The determining the delimiter based on the first combination column and the second combination column when the first client is the negotiation initiator includes:

Determining a first character difference set and a second character difference set based on the first combination column and the second combination column respectively;

If either the first character difference set or the second character difference set is empty, obtaining a current timestamp, and determining the delimiter based on the current timestamp, wherein the delimiter is obtained by sequentially performing string conversion, hash operation, and string truncation on the current timestamp;

If both the first character difference set and the second character difference set are not empty, selecting any character from the first character difference set as the target character;

If the target character exists in the second character difference set, using the target character as the separator;

The determining the first character difference set and the second character difference set based on the first combination column and the second combination column respectively includes:

Counting all characters in the first combination column to form a first character set, and performing a subtraction between a preset character set and the first character set to obtain a first character difference set;

Counting all characters in the second combination column to form a second character set, and performing a subtraction between the preset character set and the second character set to obtain a second character difference set.

2. The method for determining a target set according to claim 1, wherein determining the first combination column and the second combination column based on the first data source and the second data source respectively comprises:

generating a first data table and a second data table based on the first data source and the second data source respectively;

Selecting a preset number of columns of data from the first data table and the second data table respectively to obtain a preset number of first columns of data and a preset number of second columns of data;

The preset number of first column data and the preset number of second column data are respectively combined to obtain the first combined column and the second combined column.

3. The method for determining a target set according to claim 1, further comprising:

If the target character does not exist in the second character difference set, taking the second client as the negotiation initiator, and selecting any character from the second character difference set as the target character;

If the target character exists in the first character difference set, using the target character as the separator;

If the target character does not exist in the first character difference set, the step of selecting any character from the first character difference set as the target character is repeated.

4. The method for determining a target set according to claim 1, wherein, when the second client is the negotiation initiator, determining the delimiter based on the first combination column and the second combination column comprises:

If both the first character difference set and the second character difference set are not empty, selecting any character from the second character difference set as the target character;

If the target character exists in the first character difference set, the target character is used as the separator.

5. The method for determining a target set according to claim 4, further comprising:

If the target character does not exist in the first character difference set, taking the first client as the negotiation initiator, and selecting any character from the first character difference set as the target character;

If the target character does not exist in the second character difference set, the step of selecting any character from the second character difference set as the target character is repeated.

6. The method for determining a target set according to claim 1, wherein determining a first index number set corresponding to the first combination column and a second index number set corresponding to the second combination column based on the first combination column, the second combination column, and the delimiter comprises:

Preprocessing the first combination column, the second combination column, and the delimiter to obtain first combination data corresponding to the first combination column and a third index number set corresponding to the first combination data, and second combination data corresponding to the second combination column and a fourth index number set corresponding to the second combination data;

An intersection operation is performed on the first combination data and the second combination data, and combined with the third index number set and the fourth index number set to obtain a first index number set corresponding to the first combination column and a second index number set corresponding to the second combination column.

7. A device for determining a target set, comprising:

A data receiving module, configured to receive a first data source and a second data source;

a combination column determining module, configured to determine a first combination column and a second combination column based on the first data source and the second data source, respectively, wherein the combination column is formed by combining multiple columns of data;

a delimiter determination module, configured to analyze the first combination column and the second combination column in a negotiation interaction manner to determine a delimiter;

a target set determining module, configured to determine, based on the first combination column, the second combination column, and the delimiter, a first index number set corresponding to the first combination column and a second index number set corresponding to the second combination column, so as to obtain a target set using the first index number set and the second index number set;

8. A terminal comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the target set determination method according to any one of claims 1 to 6 when executing the computer program.

9. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method for determining a target set according to any one of claims 1 to 6.