CN106709800B

CN106709800B - Community division method and device based on feature matching network

Info

Publication number: CN106709800B
Application number: CN201611110731.7A
Authority: CN
Inventors: 李旭瑞; 邱雪涛; 赵金涛; 钟毅; 胡奕
Original assignee: China Unionpay Co Ltd
Current assignee: China Unionpay Co Ltd
Priority date: 2016-12-06
Filing date: 2016-12-06
Publication date: 2020-08-11
Anticipated expiration: 2036-12-06
Also published as: TWI662421B; CN106709800A; WO2018103456A1; TW201822022A

Abstract

The embodiment of the invention relates to the field of data processing, in particular to a community division method and device based on a feature matching network, which are used for dividing communities. In the embodiment of the invention, a K-bit hash vector corresponding to each account information is determined according to preset K hash functions; sequentially dividing the hash vector corresponding to each account number information into m-K/K sub-hash vectors; for each class, dividing account information with the same sub-hash vector into the same group; calculating the similarity between the account information in the same group; if the similarity between the account information is larger than a threshold value, establishing an interconnection edge between the account information to form a feature matching network; according to the characteristic matching network, community division is carried out on the account information, and then community analysis can be carried out according to the divided communities to find abnormal communities.

Description

A method and device for community division based on feature matching network

技术领域technical field

本发明实施例涉及数据处理领域，尤其涉及一种基于特征匹配网络的社团划分方法和装置。Embodiments of the present invention relate to the field of data processing, and in particular, to a method and device for community division based on a feature matching network.

背景技术Background technique

目前，国内信用卡市场面临的风险形势日益严峻，信用卡套现、伪卡欺诈、盗卡欺诈等案件日益增加，具体的，信用卡套现是指持卡人通过虚假消费交易或与商户合谋刷卡后获取现金，之后退款或购买容易变现商品后变卖获取现金等行为、伪卡欺诈是指按照银行卡的磁条信息格式写磁，凸印或平印伪造真实有效的银行卡进行交易的欺诈行为；盗卡欺诈是指欺诈者获得真实持卡人的部分或者全部信息并假冒真实持卡人对账户的信息进行变更以达到欺诈目的的行为。信用卡犯罪手段不断向着高科技、集团化、专业化发展，案件实施过程更为隐蔽，手法不断翻新，这对银行和持卡人的资金安全构成威胁，成为制约信用卡产业长期健康发展的重要因素。At present, the risk situation facing the domestic credit card market is becoming increasingly severe, and cases of credit card cash out, counterfeit card fraud, and card theft fraud are increasing. Subsequent refunds or purchases of easy-to-realize goods and then selling for cash, etc. Fake card fraud refers to the fraudulent behavior of writing magnetic, embossing or lithographic forging real and valid bank cards to conduct transactions according to the magnetic stripe information format of the bank card; card theft Fraud refers to the behavior that fraudsters obtain part or all of the real cardholder's information and impersonate the real cardholder to change the account information to achieve fraudulent purposes. The criminal methods of credit cards are constantly developing towards high-tech, group-based and professional development. The implementation process of cases is more hidden and the methods are constantly being renovated. This poses a threat to the financial security of banks and cardholders, and has become an important factor restricting the long-term healthy development of the credit card industry.

面对各种各样的欺诈手段，现有技术中，通常采用聚类的方法来应对，然而采用这种方法存在多种缺陷，例如，一方面，如果后续对反欺诈模型添加数据，会对反欺诈模型更新数据造成困难，另一方面，经过聚类之后，虽然能将节点划分为若干类，但群体内的结构以及结构之间的关联仍然难以描述。In the face of various fraudulent methods, in the prior art, the clustering method is usually used to deal with it. However, this method has many defects. For example, on the one hand, if data is subsequently added to the anti-fraud model, it will The anti-fraud model makes it difficult to update the data. On the other hand, after clustering, although the nodes can be divided into several categories, the structure within the group and the relationship between the structures are still difficult to describe.

综上所述，现有技术中存在着如果后续对反欺诈模型添加数据，造成反欺诈模型更新数据困难；经过聚类之后，群体内的结构以及结构之间的关联仍然难以描述的问题，因此，需要采取有效的措施来解决以上问题。To sum up, there is a problem in the prior art that if data is subsequently added to the anti-fraud model, it will be difficult to update the data of the anti-fraud model; after clustering, the structure within the group and the relationship between the structures are still difficult to describe. , need to take effective measures to solve the above problems.

发明内容SUMMARY OF THE INVENTION

本发明实施例提供一种基于特征匹配网络的社团划分方法和装置，用以解决现有技术中存在着如果后续对反欺诈模型添加数据，造成反欺诈模型更新数据困难、经过聚类之后，群体内的结构以及结构之间的关联仍然难以描述的问题。Embodiments of the present invention provide a method and device for community division based on a feature matching network, to solve the problem in the prior art that if data is subsequently added to the anti-fraud model, it is difficult to update the data of the anti-fraud model, and after clustering, the group The internal structure and the relationship between the structures are still difficult to describe.

本发明实施例提供一种基于特征匹配网络的社团划分方法，包括：An embodiment of the present invention provides a method for dividing a community based on a feature matching network, including:

根据预设的K个哈希函数，确定每个账号信息对应的K位哈希向量；According to the preset K hash functions, determine the K-bit hash vector corresponding to each account information;

将每个账号信息对应的哈希向量，顺序划分为m＝K/k类子哈希向量；Divide the hash vector corresponding to each account information into m=K/k sub-hash vectors in sequence;

针对每个类，将子哈希向量相同的账号信息划分为同一组；For each class, the account information with the same sub-hash vector is divided into the same group;

计算同一组内的各账号信息之间的相似度；Calculate the similarity between account information in the same group;

若各账号信息之间的相似度大于阈值，则在各账号信息之间建立互连边，形成特征匹配网络；If the similarity between the account information is greater than the threshold, then establish interconnected edges between the account information to form a feature matching network;

根据特征匹配网络，对各账号信息进行社团划分。According to the feature matching network, each account information is divided into communities.

可选地，计算同一组内的各账号信息之间的相似度，包括：Optionally, calculating the similarity between account information in the same group, including:

若第i账号信息与第j账号信息位于n类同组中，则将n/m作为第i帐号信息与第j账号信息之间的相似度；第i账号信息与第j账号信息为各账号信息中的任一个。If the i-th account information and the j-th account information are in the same group of n types, n/m is taken as the similarity between the i-th account information and the j-th account information; the i-th account information and the j-th account information are the respective accounts any of the information.

若第i账号信息与第j账号信息位于同一组中，统计第i账号信息的哈希向量与第j账号信息的哈希向量中位于同一位且哈希向量值相同的个数h；第i账号信息与第j账号信息为各账号信息中的任一个；If the i-th account information and the j-th account information are in the same group, count the number h of the hash vector of the i-th account information and the hash vector of the j-th account information that are in the same position and have the same hash vector value; The account information and the jth account information are any of the account information;

第i账号信息与第j账号信息的相似度s＝h/K。The degree of similarity between the i-th account information and the j-th account information is s=h/K.

可选地，根据预设的K个哈希函数，确定每个账号信息对应的K位哈希向量，包括：Optionally, according to the preset K hash functions, determine the K-bit hash vector corresponding to each account information, including:

根据公式(1)确定每个账号信息对应的K位哈希向量

Determine the K-bit hash vector corresponding to each account information according to formula (1)

其中，2'b表示

是一个二进制数，

是预设的K个哈希函数中的一个，Among them, 2'b means

is a binary number,

is one of the preset K hash functions,

表示账号信息的特征向量，其中，

c₁,c₂…,c_d表示账号信息的特征属性，

表示随机选取的一个非零向量，

is a feature vector representing account information, where,

c ₁ ,c ₂ …,c _d represent the characteristic attributes of account information,

represents a randomly selected non-zero vector,

可选地，根据特征匹配网络，对各账号信息进行社团划分，包括：Optionally, according to the feature matching network, each account information is divided into communities, including:

(1)将各账号信息划分在特征匹配网络中不同的社区中；(1) Divide the account information into different communities in the feature matching network;

(2)根据各账号信息之间的相似度，计算每个账号信息的相似强度，从而生成节点相似强度矩阵；(2) Calculate the similarity strength of each account information according to the similarity between the account information, thereby generating a node similarity strength matrix;

(3)针对每个账号信息，从节点相似强度矩阵中账号信息所在的行，按相似强度从大到小的的顺序尝试将账号信息划至其他社区中；若账号信息自第p社区划分至第q社区后的模块度差为正数，则将账号信息划分至第q社区后结束；(3) For each account information, from the row where the account information is located in the node similarity strength matrix, try to assign the account information to other communities in the order of similarity strength from large to small; if the account information is divided from the pth community to If the modularity difference after the qth community is a positive number, the account information will be divided into the qth community and it will end;

(4)重复执行，直到社区结构不再改变为止。(4) Repeat the execution until the community structure no longer changes.

可选地，根据各账号之间的相似度，计算每个账号信息的相似强度，包括：Optionally, according to the similarity between the accounts, the similarity strength of each account information is calculated, including:

根据公式(2)计算第i账号信息与第j账号信息之间的相似强度s_i,j；Calculate the similarity strength _si,j between the ith account information and the jth account information according to formula (2);

其中，w(z)＝w_ai,z 公式(2)

Among them, w(z)=w _ai,z Formula (2)

其中，Γ(i)表示第i账号信息的邻居集合，Γ(i)∩Γ(j)表示第i账号信息与第j账号信息的共同邻居集合，w_ai,z为任意账号信息ai与第z账号信息之间的边的权重和。Among them, Γ(i) represents the neighbor set of the ith account information, Γ(i)∩Γ(j) represents the common neighbor set of the ith account information and the jth account information, w _ai,z is the arbitrary account information ai and the jth account information. The weight sum of edges between z account information.

本发明实施例还提供一种基于特征匹配网络的社团划分装置，包括：The embodiment of the present invention also provides a community division device based on a feature matching network, including:

确定单元：用于根据预设的K个哈希函数，确定每个账号信息对应的K位哈希向量；Determining unit: used to determine the K-bit hash vector corresponding to each account information according to the preset K hash functions;

第一划分单元：用于将每个账号信息对应的哈希向量，顺序划分为m＝K/k类子哈希向量；The first division unit: used to divide the hash vector corresponding to each account information into m=K/k sub-hash vectors in sequence;

第二划分单元：用于针对每个类，将子哈希向量相同的账号信息划分为同一组；The second division unit: for each class, the account information with the same sub-hash vector is divided into the same group;

计算单元：用于计算同一组内的各账号信息之间的相似度；Calculation unit: used to calculate the similarity between account information in the same group;

形成网络单元，用于若各账号信息之间的相似度大于阈值，则在各账号信息之间建立互连边，形成特征匹配网络；forming a network unit, for if the similarity between the account information is greater than the threshold, then establishing an interconnection edge between the account information to form a feature matching network;

第三划分单元：用于根据特征匹配网络，对各账号信息进行社团划分。The third division unit is used to perform community division for each account information according to the feature matching network.

可选地，计算单元具体用于：Optionally, the computing unit is specifically used for:

可选地，计算单元具体还用于：Optionally, the computing unit is further used for:

可选地，确定单元用于：Optionally, the determination unit is used to:

根据公式(3)确定每个账号信息对应的K位哈希向量

Determine the K-bit hash vector corresponding to each account information according to formula (3)

其中，2'b表示

是一个二进制数，

是预设的K个哈希函数中的一个，Among them, 2'b means

is a binary number,

is one of the preset K hash functions,

表示账号信息的特征向量，其中，

c₁,c₂…,c_d表示账号信息的特征属性，

表示随机选取的一个非零向量，

is a feature vector representing account information, where,

represents a randomly selected non-zero vector,

可选地，第三划分单元具体用于：Optionally, the third dividing unit is specifically used for:

根据公式(4)计算第i账号信息与第j账号信息之间的相似强度s_i,j；Calculate the similarity strength _si,j between the ith account information and the jth account information according to formula (4);

其中，w(z)＝w_ai,z 公式(4)

Among them, w(z)=w _ai,z Formula (4)

本发明实施例中提供了一种基于特征匹配网络的社团划分方法和装置，根据预设的K个哈希函数，确定每个账号信息对应的K位哈希向量；将每个账号信息对应的哈希向量，顺序划分为m＝K/k类子哈希向量；针对每个类，将子哈希向量相同的账号信息划分为同一组；计算同一组内的各账号信息之间的相似度；若各账号信息之间的相似度大于阈值，则在各账号信息之间建立互连边，形成特征匹配网络；根据特征匹配网络，对各账号信息进行社团划分。本发明实施例中首先通过根据预设的K个哈希函数，确定每个账号信息对应的K位哈希向量，对于网络中数量巨大的账号信息来说，仅仅产生两个哈希值的哈希函数是不够的，因此确定每个账号信息对应的K位哈希向量能够应对复杂的网络账号信息。然后针对每个类，将子哈希向量相同的账号信息划分为一组，计算同一组内任意账号信息之间的相似度，能够避免针对整个网络中任意账号信息之间计算相似度而带来的计算量非常大的缺点；本发明技术方案能够有效减少账号信息之间相似度的计算量，仅仅计算同一组内的账号信息之间的相似度。最后根据确定各账号信息之间的相似度大于阈值，在各账号信息之间建立互连边，形成特征匹配网络；根据特征匹配网络，对各账号信息进行社团划分，能够更精准的对各账号信息进行社团划分，这样不仅能够使社团之间的关联关系很清楚，而且能够对划分的社团进行分析，找出异常社团，进而对异常社团内的账号进行异常账号排查，更加有针对性地找出欺诈账号，提高应对欺诈账号的效率。此外，如果需要对划分出的社团添加账号信息，只需要对该添加的账号信息重复以上简单的几个步骤，将所添加的账号信息更新到相应的位置即可，并不会产生更新困难的问题。The embodiments of the present invention provide a method and device for community division based on a feature matching network. According to preset K hash functions, a K-bit hash vector corresponding to each account information is determined; The hash vector is divided into m=K/k sub-hash vectors in order; for each class, the account information with the same sub-hash vector is divided into the same group; the similarity between the account information in the same group is calculated ; If the similarity between the account information is greater than the threshold, then establish interconnected edges between the account information to form a feature matching network; according to the feature matching network, each account information is divided into communities. In the embodiment of the present invention, the K-bit hash vector corresponding to each account information is determined according to the preset K hash functions. For a huge amount of account information in the network, only two hash values are generated. The hash function is not enough, so determining the K-bit hash vector corresponding to each account information can deal with complex network account information. Then, for each class, the account information with the same sub-hash vector is divided into a group, and the similarity between any account information in the same group is calculated, which can avoid the calculation of the similarity between any account information in the entire network. However, the technical solution of the present invention can effectively reduce the calculation amount of the similarity between account information, and only calculate the similarity between account information in the same group. Finally, according to the determination that the similarity between the account information is greater than the threshold, an interconnected edge is established between the account information to form a feature matching network; according to the feature matching network, each account information is divided into communities, which can be more accurate for each account. The information is divided into communities, which can not only make the relationship between communities clear, but also analyze the divided communities, find out abnormal communities, and then conduct abnormal account investigations on accounts in abnormal communities, so as to find out more targeted Generate fraudulent accounts and improve the efficiency of dealing with fraudulent accounts. In addition, if you need to add account information to the divided community, you only need to repeat the above simple steps for the added account information, and update the added account information to the corresponding location, and there will be no difficulty in updating. question.

附图说明Description of drawings

为了更清楚地说明本发明实施例中的技术方案，下面将对实施例描述中所需要使用的附图作简要介绍。In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the following will briefly introduce the accompanying drawings used in the description of the embodiments.

图1为本发明实施例提供了一种基于特征匹配网络的社团划分方法流程示意图；1 provides a schematic flowchart of a method for dividing a community based on a feature matching network according to an embodiment of the present invention;

图2为本发明实施例提供了本发明的整体思路流程图；FIG. 2 provides a flow chart of the overall idea of the present invention for an embodiment of the present invention;

图3为本发明实施例提供的一种基于特征匹配网络的社团划分装置结构示意图。FIG. 3 is a schematic structural diagram of a community division apparatus based on a feature matching network according to an embodiment of the present invention.

具体实施方式Detailed ways

为了使本发明的目的、技术方案及有益效果更加清楚明白，以下结合附图及实施例，对本发明进行进一步详细说明。应当理解，此处所描述的具体实施例仅仅用以解释本发明，并不用于限定本发明。In order to make the objectives, technical solutions and beneficial effects of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention, but not to limit the present invention.

应理解，本发明实施例的技术方案可以应用于各种银行出现的网络欺诈手段的场景，比如可以是信用卡产品的欺诈、银行卡产品的欺诈、盗卡欺诈、伪卡欺诈、套现欺诈等等。本发明实施例的技术方案的应用场景也可以是对异常账号信息社团的发现、发现特定种类欺诈的共性、根据欺诈账号信息样本发现其它欺诈账号信息、帮助发现未知欺诈类型等。It should be understood that the technical solutions of the embodiments of the present invention can be applied to various scenarios of network fraud in banks, such as fraud of credit card products, fraud of bank card products, fraud of stolen cards, fraud of counterfeit cards, fraud of cashing out, etc. . The application scenarios of the technical solutions of the embodiments of the present invention may also be the discovery of abnormal account information communities, the discovery of the commonality of specific types of fraud, the discovery of other fraudulent account information based on fraudulent account information samples, and the help to discover unknown fraud types, etc.

图1示例性示出了本发明实施例提供的一种基于特征匹配网络的社团划分方法流程示意图，如图1所示，包括以下步骤：FIG. 1 exemplarily shows a schematic flowchart of a method for dividing a community based on a feature matching network provided by an embodiment of the present invention. As shown in FIG. 1 , the method includes the following steps:

步骤S101：根据预设的K个哈希函数，确定每个账号信息对应的K位哈希向量；Step S101: According to the preset K hash functions, determine the K-bit hash vector corresponding to each account information;

步骤S102：将每个账号信息对应的哈希向量，顺序划分为m＝K/k类子哈希向量；Step S102: Divide the hash vector corresponding to each account information into m=K/k sub-hash vectors in sequence;

步骤S103：针对每个类，将子哈希向量相同的账号信息划分为同一组；Step S103: for each class, divide account information with the same sub-hash vector into the same group;

步骤S104：计算同一组内的各账号信息之间的相似度；Step S104: Calculate the similarity between account information in the same group;

步骤S105：若各账号信息之间的相似度大于阈值，则在各账号信息之间建立互连边，形成特征匹配网络；Step S105: If the similarity between the account information is greater than the threshold, then establish an interconnection edge between the account information to form a feature matching network;

步骤S106：根据特征匹配网络，对各账号信息进行社团划分。Step S106: According to the feature matching network, perform community division on each account information.

步骤S101中，根据预设的K个哈希函数，确定每个账号信息对应的K位哈希向量，具体来说，经过每个预设的哈希函数的处理都能得到一位哈希向量，那么，根据预设的K个哈希函数，就可以产生K位哈希向量，而每个账号信息对应K位哈希向量，具体实施中，每个账号信息是包含多个特征属性的，如果仅仅使用现有技术中一个账号信息只用一个哈希函数来表示的话，会存在不足以表达一个账号信息的多个特征属性的缺点，所以，本步骤可以有效避免这个缺点。其中，K的取值可以根据具体实施中各账号信息的具体情况来设定，比如，K可以设定为4，那么账号信息就可以表示为一个4位的哈希向量。In step S101, a K-bit hash vector corresponding to each account information is determined according to the preset K hash functions. Specifically, a one-bit hash vector can be obtained by processing each preset hash function. , then, according to the preset K hash functions, a K-bit hash vector can be generated, and each account information corresponds to a K-bit hash vector. In the specific implementation, each account information contains multiple characteristic attributes, If only one hash function is used to represent one account information in the prior art, there will be a disadvantage that it is insufficient to express multiple characteristic attributes of one account information. Therefore, this step can effectively avoid this shortcoming. The value of K can be set according to the specific situation of each account information in the specific implementation. For example, K can be set to 4, then the account information can be represented as a 4-bit hash vector.

步骤S102：将每个账号信息对应的哈希向量，顺序划分为m＝K/k类子哈希向量，具体来说，比如，K＝4，k＝2，那么，就将每个账号信息为4位的哈希向量划分为2类子哈希向量，划分的好处是为后续计算账号间的相似度减少计算量，避免出现像现有技术中并没有对账号信息的哈希向量进行划分而出现直接对所有账号信息中的任意两个账号来进行相似度计算而造成的计算量特别大的缺点。Step S102: Divide the hash vector corresponding to each account information into m=K/k sub-hash vectors in order. Specifically, for example, K=4, k=2, then, each account information The 4-bit hash vector is divided into two types of sub-hash vectors. The advantage of the division is to reduce the amount of calculation for the subsequent calculation of the similarity between accounts, and to avoid the occurrence of hash vectors that do not divide account information as in the prior art. However, there is a disadvantage of a particularly large amount of calculation caused by directly performing similarity calculation on any two accounts in all account information.

步骤S103：针对每个类，将子哈希向量相同的账号信息划分为同一组，具体来说，对每个账号信息划分为各类之后，针对划分的每个类，将子哈希向量相同的账号信息划分为同一组，比如，K＝4，k＝2的话，在第1类中，所有账号信息中4位哈希向量中前两位相同的为一组，同样，在第2类中，所有账号信息中4位哈希向量中后两位相同的账号信息为一组。这样划分的目的也是为了后面减少计算相似度的计算量，只计算各类之间子哈希向量相同的账号信息之间的相似度。Step S103: For each class, divide the account information with the same sub-hash vector into the same group. The account information is divided into the same group, for example, if K=4, k=2, in the first category, all account information in the 4-bit hash vector with the same first two digits is a group. Similarly, in the second category , the account information with the same last two digits in the 4-bit hash vector in all account information is a group. The purpose of this division is also to reduce the amount of calculation for calculating the similarity later, and only calculate the similarity between account information with the same sub-hash vector between various types.

步骤S104：计算同一组内的各账号信息之间的相似度，具体实施中，可以统计同一组内各账号信息的哈希向量的位相同的个数与位的大小的比值，比如，账号信息1的哈希向量为0010，账号信息2的哈希向量为0011，按照K＝4，k＝2，那么两个账号信息在第一类中位于同一组，则确定位于同一组的两个账号信息的相似度；那么，两个账号信息的哈希向量的位相同的个数是3，位的大小是4位的，所以，这两个账号信息之间的相似度为3/4，也可以根据关于相似度的计算公式来计算同一组内的任意两个账号信息之间的相似度，比如相似度的计算公式可以是欧式距离、余弦距离、杰卡德距离公式等。一方面，相比于计算所有账号信息中的任意两个账号信息的相似度，只计算同一组内的任意两个账号信息之间的相似度能够大大减少计算量。比如，取N个账号信息样本，那么N个账号信息样本就被分到了2^k个组内，每个组内的账号信息样本数为N/2^k，每组内进行任意两个账号信息进行相似度计算的次数为

2^k个组进行任意两个账号信息进行相似度计算的次数为

因此，所有类需要进行相似度计算的次数就为

其中，

是划分的类的个数，这个值是一个根据实际情况可以进行控制的常数，而传统的方法计算所有账号中任意两个账号信息进行相似度计算需要进行

次，综上可以看出，采用本发明的计算同一组内的任意两个账号信息之间的相似度的计算量比传统的方法计算所有账号中任意两个账号信息的相似度的计算量大约缩减2^k级别的倍数。另一方面，每一组内的账号信息的相似度是较大的，所以对同一组内的账号信息进行相似度计算，也能够提高网络建立的效率和准确率。Step S104: Calculate the similarity between the account information in the same group. In the specific implementation, the ratio of the same number of bits to the size of the bits of the hash vector of the account information in the same group can be counted, for example, the account information The hash vector of 1 is 0010, and the hash vector of account information 2 is 0011. According to K=4, k=2, then the two account information are in the same group in the first category, then it is determined that the two accounts are in the same group The similarity of information; then, the same number of bits of the hash vector of the two account information is 3, and the size of the bits is 4, so the similarity between the two account information is 3/4, also The similarity between any two account information in the same group may be calculated according to the calculation formula of similarity, for example, the calculation formula of similarity may be Euclidean distance, cosine distance, Jaccard distance formula, etc. On the one hand, compared to calculating the similarity between any two account information in all account information, only calculating the similarity between any two account information in the same group can greatly reduce the amount of calculation. For example, taking N account information samples, then the N account information samples are divided into 2 ^k groups, the number of account information samples in each group is N/2 ^k , and any two account information samples in each group are processed. The number of times of similarity calculation is

The number of times that 2 ^k groups perform similarity calculation for any two account information is:

Therefore, the number of times the similarity needs to be calculated for all classes is

in,

is the number of divided classes, this value is a constant that can be controlled according to the actual situation, and the traditional method to calculate the similarity of any two account information in all accounts needs to be calculated

Second, it can be seen from the above that the calculation amount of calculating the similarity between any two account information in the same group using the present invention is about Downscaling by a multiple of ^2k levels. On the other hand, the similarity of account information in each group is relatively large, so calculating the similarity of account information in the same group can also improve the efficiency and accuracy of network establishment.

步骤S105：若各账号信息之间的相似度大于阈值，则在各账号信息之间建立互连边，形成特征匹配网络，具体来说，如果任意两个账号信息之间的相似度大于阈值，就在任意两个账号信息之间建立一条互连边，边的权重就是两个账号信息之间的相似度值，最终形成特征匹配网络。具体实施中，阈值的选取可以选择较高的值没这样最终可以生成较为稀疏的特征匹配网络，便于后续的计算，另外，阈值的取值可以根据实际情况进行调整。Step S105: If the similarity between the account information is greater than the threshold, then establish interconnected edges between the account information to form a feature matching network. Specifically, if the similarity between any two account information is greater than the threshold, An interconnected edge is established between any two account information, and the weight of the edge is the similarity value between the two account information, and finally a feature matching network is formed. In a specific implementation, a higher value can be selected for the threshold value, so that a relatively sparse feature matching network can be finally generated, which is convenient for subsequent calculations. In addition, the value of the threshold value can be adjusted according to the actual situation.

步骤S106：根据特征匹配网络，对各账号信息进行社团划分，具体来说，根据计算出来的各账号信息之间的相似度值，相似度值越接近的越容易被划分到同一个社团中。划分社团之后，对于网络中的欺诈账号更容易去排查，可以计算欺诈账号样本在每个社团中的比例，比例较大的，则该社团为异常社团的可能性就越大，可以根据业务需要进行相关调查，再对异常社团内的账号根据一些指标来进行计算，找出具有代表性的账号，对这些具有代表性的账号再进行相关案件排查，其中，一些指标可以是社团内账号信息的度中心性、紧密中心性、特征向量中心性等；或者也可以对社团内的账号信息进行特征再分析，以期发现该社团的一些共同行为的特征，进行有针对性地欺诈预防。此外，如果新加入的账号信息形成新的社团，则可以根据前面查出来的异常社团进行比对，这对于未知欺诈的侦测与预防是大有裨益的。Step S106: According to the feature matching network, each account information is divided into communities. Specifically, according to the calculated similarity value between each account information, the closer the similarity value is, the easier it is to be divided into the same community. After the community is divided, it is easier to check the fraudulent accounts in the network. The proportion of fraudulent account samples in each community can be calculated. The larger the ratio, the greater the possibility that the community is an abnormal community, and can be based on business needs. Carry out relevant investigations, and then calculate the accounts in the abnormal community according to some indicators, find out representative accounts, and then carry out related case investigations on these representative accounts. Some of the indicators can be the account information in the community. Degree centrality, closeness centrality, feature vector centrality, etc.; or feature re-analysis of account information in the community, in order to discover some common behavioral characteristics of the community, and conduct targeted fraud prevention. In addition, if the newly added account information forms a new community, it can be compared according to the abnormal community detected earlier, which is of great benefit to the detection and prevention of unknown fraud.

计算同一组内的各账号信息之间的相似度，可以以下面两种方法来计算：To calculate the similarity between account information in the same group, the following two methods can be used:

方式1：可选地，计算同一组内的各账号信息之间的相似度，包括：若第i账号信息与第j账号信息位于n类同组中，则将n/m作为第i帐号信息与第j账号信息之间的相似度；第i账号信息与第j账号信息为各账号信息中的任一个，具体来说，在所有账号信息中任意取两个账号信息，比如称为账号信息1与账号信息2，m取3，也就是账号信息1与账号信息2分在了3类中，这3类分别称为第1类、第2类、第3类，假设这两个账号信息在第1类与第3类中同组，那么，这两个账号信息在这3类中的相似度为2/3。Method 1: Optionally, calculating the similarity between account information in the same group, including: if the i-th account information and the j-th account information are located in the same group of the n-th category, taking n/m as the i-th account information The similarity with the jth account information; the ith account information and the jth account information are any of the account information, specifically, two account information are arbitrarily taken from all account information, such as called account information 1 and account information 2, m takes 3, that is, account information 1 and account information 2 are divided into 3 categories, these 3 categories are called category 1, category 2, category 3, assuming these two account information If the first category and the third category are in the same group, then the similarity of the two account information in these three categories is 2/3.

方式2：可选地，计算同一组内的各账号信息之间的相似度，包括：若第i账号信息与第j账号信息位于同一组中，统计第i账号信息的哈希向量与第j账号信息的哈希向量中位于同一位且哈希向量值相同的个数h；第i账号信息与第j账号信息为各账号信息中的任一个；第i账号信息与第j账号信息的相似度s＝h/K，具体来说，如果所有账号信息中任意的两个账号信息，账号信息1与账号信息2位于同一组，并且账号信息1与账号信息2都是4位的，也就是K为4，账号信息1与账号信息2的4位哈希向量中，前3位是完全相同的，第4位不同，那么，账号信息1与账号信息2的相似度s为3/4。Mode 2: Optionally, calculating the similarity between account information in the same group, including: if the i-th account information and the j-th account information are in the same group, calculating the hash vector of the i-th account information and the j-th account information. In the hash vector of the account information, the number h that is located in the same position and has the same hash vector value; the i-th account information and the j-th account information are any of the account information; the i-th account information and the j-th account information are similar Degree s=h/K, specifically, if any two account information in all account information, account information 1 and account information 2 are in the same group, and both account information 1 and account information 2 are 4-bit, that is K is 4. In the 4-bit hash vector of account information 1 and account information 2, the first three bits are exactly the same, and the fourth bit is different. Then, the similarity s between account information 1 and account information 2 is 3/4.

以上两种计算同一组内各个账号信息之间的相似度的计算方法，可以得出，第1中方法是计算的两个账号信息在各个类中的相似度，而第2种方法是计算的被分到了各类中同一组中的两个账号信息之间的相似度，可以看出，这两种方法中，相比于第2种方法，第1种方法是比较粗略的计算两个账号信息所属的类与类之间的相似度，而第2种计算的两个账号信息在同一组之间的相似度则更精准。不过，这两种方法都相比于现有技术中利用欧式距离公式等来计算网络中所有账号信息中任意两个账号信息之间的相似度的计算量上得到了明显的改善，进一步加速了网络的建立。The above two calculation methods for calculating the similarity between account information in the same group, it can be concluded that the first method is to calculate the similarity of two account information in each category, and the second method is to calculate The similarity between the two account information in the same group is divided into various categories. It can be seen that in these two methods, compared with the second method, the first method is to roughly calculate the two accounts. The similarity between the classes to which the information belongs, and the similarity between the two account information in the same group calculated by the second method is more accurate. However, both of these two methods have been significantly improved in terms of the amount of calculation used to calculate the similarity between any two account information in all account information in the network by using the Euclidean distance formula in the prior art, which further accelerates the establishment of the network.

可选地，根据预设的K个哈希函数，确定每个账号信息对应的K位哈希向量，包括：根据公式(1)确定每个账号信息对应的K位哈希向量

Optionally, determining the K-bit hash vector corresponding to each account information according to the preset K hash functions, including: determining the K-bit hash vector corresponding to each account information according to formula (1)

其中，2'b表示

是一个二进制数，

是预设的K个哈希函数中的一个，Among them, 2'b means

is a binary number,

is one of the preset K hash functions,

表示账号信息的特征向量，其中，

c₁,c₂…,c_d表示账号信息的特征属性，

表示随机选取的一个非零向量，

具体来说，预设的哈希函数是

是预设的K个哈希函数中的任一个，哈希函数

的值用0或1来表示，也就是说这样的一个哈希函数只能产生两个哈希值，对于数量巨大的账号信息来说明显是不够的，所以根据这样的哈希函数，来确定每个账号的K位哈希向量

是一个K位的二进制数，比如，可以是6位的二进制数，具体可以为010110，那么，

其中，

表示账号信息的特征向量，

c₁,c₂…,c_d表示账号信息的特征属性，具体的账号信息特征属性可以是交易金额、交易时间、交易地点、交易地点数、转账地点、转账金额、转账次数等。其中，各账号信息的特征向量在具体实施中可以经过筛选来得到一批理论上效果最好的特征向量，具体地，在一定时间段内抽取欺诈账号信息样本以及正常账号信息样本，将抽取的欺诈账号信息样本以及正常账号信息样本组合为一个整体账号信息样本，根据业务经验进行整体账号信息的数据预处理、特征筛选及属性相关性分析等步骤之后，筛选出一批理论上效果最好的特征向量。根据预设的K个哈希函数，确定每个账号信息对应的K位哈希向量，能够充分提取每个账号信息的特征属性并用特征向量表示出来，能够应对复杂的网络中账号信息数量巨大的情况。此外，需要说明的是，第一，每个账号信息对应的K位哈希向量

的确定实际上是经过一个哈希随机映射的过程得来的，是由

经过哈希映射得到

这里使用哈希随机映射的主要目的是使得使得账号信息的特征向量能映射为0或1的统一表示，以便后续处理，而并非简单的降维；第二，原来的特征向量

映射到新的哈希空间中，会使得在原来的特征向量相似的数据在新的哈希空间中数据也相似的概率很大，这个概率为：

符合相似度s到概率p的单调递增映射关系。

is a feature vector representing account information, where,

represents a randomly selected non-zero vector,

Specifically, the preset hash function is

is any one of the preset K hash functions, the hash function

The value of is represented by 0 or 1, that is to say, such a hash function can only generate two hash values, which is obviously not enough for a huge amount of account information, so according to such a hash function, to determine K-bit hash vector for each account

is a K-bit binary number, for example, it can be a 6-bit binary number, specifically 010110, then,

in,

The feature vector representing account information,

c ₁ , c ₂ . . . , c _d represent characteristic attributes of account information, and specific characteristic attributes of account information may be transaction amount, transaction time, transaction location, number of transaction locations, transfer location, transfer amount, and number of transfers. Among them, the feature vectors of each account information can be screened in the specific implementation to obtain a batch of feature vectors with the best theoretical effect. Specifically, fraudulent account information samples and normal account information samples are extracted within a certain period of time, Fraudulent account information samples and normal account information samples are combined into an overall account information sample. After data preprocessing, feature screening, and attribute correlation analysis of the overall account information are performed according to business experience, a batch of theoretically the best results is selected. Feature vector. According to the preset K hash functions, the K-bit hash vector corresponding to each account information can be determined, which can fully extract the characteristic attributes of each account information and express it with the characteristic vector, which can cope with the huge amount of account information in the complex network. Happening. In addition, it should be noted that, first, the K-bit hash vector corresponding to each account information

The determination of is actually obtained through a process of hash random mapping, which is determined by

get hash map

The main purpose of using hash random mapping here is to make the feature vector of account information map to a unified representation of 0 or 1 for subsequent processing, rather than simple dimensionality reduction; second, the original feature vector

Mapping into the new hash space will make the data similar in the original feature vector have a high probability of being similar in the new hash space. This probability is:

It conforms to the monotonically increasing mapping relationship of similarity s to probability p.

以上实施方式中，对于每个账号信息对应的K位哈希向量以及将每个账号信息对应的哈希向量顺序划分为m＝K/k类子哈希向量的关系，下面以一个表格的方式将其展示出来，表1示例性地示出了账号信息样本与类之间的关系，如表1所示：In the above embodiment, for the K-bit hash vector corresponding to each account information and the relationship of sequentially dividing the hash vector corresponding to each account information into m=K/k sub-hash vectors, the following is a table. To show it, Table 1 exemplarily shows the relationship between account information samples and classes, as shown in Table 1:

表1：账号信息样本与类之间的关系Table 1: Relationship between account information samples and classes

表1中，账号信息样本与类之间的关系可以表示成一个K行N列的矩阵，N表示取的账号信息样本数，c₁到c_N代表N个账号信息样本，将N个账号信息样本分到m＝K/k个类，其中，表格中除第一行之外下面的每一行代表一个类，N个账号信息样本被分到了2^k个组内。In Table 1, the relationship between account information samples and classes can be expressed as a matrix with K rows and N columns, where N represents the number of account information samples taken, c ₁ to c _N represent N account information samples, and the N account information The samples are divided into m=K/k classes, wherein each row below except the first row in the table represents a class, and the N account information samples are divided into ^2k groups.

其中，w(z)＝w_ai,z 公式(2)

Among them, w(z)=w _ai,z Formula (2)

具体实施中，第(1)步骤，初始化特征匹配网络，将每个账号信息划分到不同的社区中，这一步骤中的划分可以是随机划分的；第(2)步骤，根据公式(2)来计算各账号信息的相似强度，具体地，假如账号信息1与账号信息2的共同邻居是账号信息3，账号信息1与账号信息2合起来与账号信息3的边的权重是5，那么，任意账号信息ai与账号信息3相连边的权重为5，因而，账号信息1与账号信息2的相似强度是1/5，类似的，其它账号信息之间也是用此方法来计算。假如，取4个账号信息样本，经过计算之后，形成一个4*4的矩阵，假如，这个矩阵为

从这个矩阵可以看出，账号信息1与账号信息2的相似度为0.25，账号信息1与账号信息3的相似度为0.7，账号信息2与账号信息3的相似度为0.4；第(3)步骤，从这个相似强度矩阵中账号信息所在的行，按相似强度从大到小的的顺序尝试将账号信息划至其他社区中，例如从这个相似矩阵第一行可以看出，想要把账号信息1划分到其它某一社团中时，优先选择相似度较大的账号信息3(第一行中0.6最大)所在的社区中去。如果△Q<0，再将账号信息1尝试划分到账号信息4(第一行中0.4次大)所在的社团中去。如果△Q<0，则再将账号信息1尝试划分到账号信息2所在的社团中去。如果仍然△Q<0，则账号信息1作为一个独立的社团进行保留，矩阵不做更新，再进行第2行的计算。如果上述尝试过程中只要发现△Q>0,比如优先尝试的将账号1划分到相似度较大的账号信息3(第一行中0.6最大)所在的社区中去以后，发现△Q>0，那么表示尝试成功，第一行计算结束。由于此时账号1的状态已经发生改变，因此将矩阵中第一行第一列所有数据删除，表示后续账号信息不再与账号信息1进行比较，也就是，

变成

然后以同样的过程开始新一轮的尝试计算，即对账号信息2进行社团划分。其中，模块度差△Q的计算公式：

来验证上面对账号信息的尝试划分社区是否正确，其中，n表示网络中所有的权重，k_i表示与顶点i连接的边的权重，k_i,in表示账号信息i在社区内部的权重之和，Σ_in表示社区内部的边权重和，Σ_tot表示与社区内部账号信息连接的边的权重和，包括社区内部的边以及社区外部的边，若△Q为正数，则接受本次的划分，若不为正数，则放弃本次的划分。通过账号信息的相似强度矩阵的计算，优先将账号信息划分到与其最相似的邻居账号信息的社团中去，大大节省了社团划分的尝试次数，进一步提高了算法的速度，另外，对账号信息尝试的划分是否合理通过模块度差公式来验证，更加有效保证了尝试划分的合理性与准确性。In the specific implementation, in step (1), the feature matching network is initialized, and each account information is divided into different communities. The division in this step can be randomly divided; in step (2), according to formula (2) To calculate the similarity strength of each account information, specifically, if the common neighbor of account information 1 and account information 2 is account information 3, and the weight of the edge of account information 1 and account information 2 combined with account information 3 is 5, then, The weight of the edge connecting any account information ai and account information 3 is 5. Therefore, the similarity strength of account information 1 and account information 2 is 1/5. Similarly, other account information is also calculated by this method. Suppose, take 4 account information samples, and after calculation, form a 4*4 matrix, if, this matrix is

It can be seen from this matrix that the similarity between account information 1 and account information 2 is 0.25, the similarity between account information 1 and account information 3 is 0.7, and the similarity between account information 2 and account information 3 is 0.4; Step, from the row of account information in this similarity strength matrix, try to assign account information to other communities in order of similarity strength from large to small. For example, it can be seen from the first row of this similarity matrix that if you want to add account When the information 1 is divided into some other community, the community where the account information 3 with the greater similarity (0.6 in the first row is the largest) is preferentially selected. If △Q<0, then try to divide account information 1 into the community where account information 4 (0.4 times larger in the first row) is located. If △Q<0, then try to divide account information 1 into the community where account information 2 is located. If △Q<0, account information 1 is retained as an independent community, the matrix is not updated, and then the calculation of the second line is performed. If only △Q>0 is found during the above attempt, for example, the first attempt is to divide account 1 into the community where account information 3 with greater similarity (0.6 in the first row is the largest) is located, and △Q>0 is found. Then it means that the attempt is successful, and the calculation of the first line ends. Since the status of account 1 has changed at this time, all data in the first row and first column of the matrix is deleted, indicating that the subsequent account information is no longer compared with account information 1, that is,

become

Then, a new round of attempted calculation is started with the same process, that is, the community division of account information 2 is performed. Among them, the calculation formula of the modularity difference △Q:

To verify whether the above attempt to divide the community for account information is correct, where n represents all the weights in the network, _ki represents the weight of the edge connected to vertex i, and ki _{, in} represent the weight of account information i within the community. sum, Σ _in represents the sum of the edge weights within the community, Σ _tot represents the weight sum of the edges connected to the account information within the community, including the edges inside the community and the edges outside the community, if △Q is a positive number, then accept the current If it is not a positive number, this division will be discarded. Through the calculation of the similarity strength matrix of the account information, the account information is preferentially divided into the community with the most similar neighbor account information, which greatly saves the number of attempts to divide the community and further improves the speed of the algorithm. Whether the division is reasonable is verified by the modularity difference formula, which more effectively ensures the rationality and accuracy of the attempted division.

为了更好的理解本发明技术方案，图2示例性地示出了本发明的整体思路流程图，如图2所示：In order to better understand the technical solution of the present invention, Fig. 2 exemplarily shows the flow chart of the overall idea of the present invention, as shown in Fig. 2:

步骤S201：将各账号信息的特征属性通过哈希映射的方法映射为一个多位的哈希映射向量；Step S201: Map the feature attributes of each account information into a multi-digit hash map vector by a hash mapping method;

步骤S202：将各账号信息的哈希映射向量进行分类；Step S202: classify the hash map vector of each account information;

步骤S203：对于每个类，将哈希映射向量相同的账号信息划分为一组；Step S203: for each class, divide account information with the same hash map vector into a group;

步骤S204：对每组中的任意两个账号信息进行相似度计算；Step S204: Perform similarity calculation on any two account information in each group;

步骤S205：若每组中的任意两个账号信息的相似度大于阈值，则建立这两个账号信息之间的互连边，边的权重为相似度，从而形成特征匹配网络，其中，形成的特征匹配网络是稀疏的特征匹配网络；Step S205: If the similarity of any two account information in each group is greater than the threshold, establish an interconnected edge between the two account information, and the weight of the edge is the similarity, thereby forming a feature matching network, wherein the formed The feature matching network is a sparse feature matching network;

步骤S206：根据特征匹配网络中各账号信息的相似强度矩阵对特征匹配网络进行社团划分。Step S206: Divide the feature matching network into communities according to the similarity strength matrix of each account information in the feature matching network.

与现有技术相比，本发明实施例中，第一，通过随机哈希映射的方法将各账号信息的特征属性映射到一个新的哈希空间中，形成各账号信息的哈希映射向量，对各账号信息的哈希映射向量进行分类，能够在高相似度的账号信息之间建立边，有效避免了大量的任意两个账号信息之间的相似度计算，且高效地为每条边建立了可信的权重值，能够提高后续社团划分的精度与速度；第二，根据各账号信息的相似度建立了特征匹配网络，然后根据网络中各账号信息的相似强度矩阵对特征匹配网络进行社团划分，不仅可以有效发现异常社团并进行有针对性地措施，同时可以侦测未知的欺诈类型，而且通过相似强度矩阵对对特征匹配网络进行社团划分，即优先将账号信息划分到与其最相似的邻居账号信息的社团中去，大大节省了社团划分尝试的次数，进一步提高了算法的速度；第三，通过形成特征匹配网络，相关账号信息间的相似度作为边的权重被永久存储，即使有较多的新的账号信息进来，也不会对网络中原来的互连边产生影响，仅仅需要将新的账号信息插入到原特征匹配网络中。在向原特征匹配网络图添加新数账号信息的时候，仍然先采用随机哈希映射方法及对各账号信息进行分类，然后与类内的账号信息进行相似度计算，如果该相似度大于阈值，则添加新的边。后续只需要进行计算量较小但是更加精准的社团划分算法即可实现功能。同时，特征匹配网络的结构能更加清晰地展示社团内部及社团间的关联结构，这是传统聚类方法所不能实现的。Compared with the prior art, in this embodiment of the present invention, first, the characteristic attributes of each account information are mapped into a new hash space by a random hash mapping method to form a hash map vector of each account information, By classifying the hash map vector of each account information, an edge can be established between account information with high similarity, effectively avoiding a large number of similarity calculations between any two account information, and efficiently establishing each edge. It can improve the accuracy and speed of subsequent community division; secondly, a feature matching network is established according to the similarity of each account information, and then the feature matching network is community based on the similarity strength matrix of each account information in the network. Division, not only can effectively discover abnormal communities and take targeted measures, but also detect unknown types of fraud, and divide the feature matching network through the similarity strength matrix. The number of community division attempts is greatly saved, and the speed of the algorithm is further improved; thirdly, by forming a feature matching network, the similarity between related account information is permanently stored as the weight of the edge, even if there are When more new account information comes in, it will not affect the original interconnected edges in the network, and it is only necessary to insert the new account information into the original feature matching network. When adding new account information to the original feature matching network graph, the random hash mapping method is still used to classify each account information, and then the similarity is calculated with the account information in the class. If the similarity is greater than the threshold, then Add new edges. In the follow-up, only a small but more accurate community division algorithm needs to be performed to realize the function. At the same time, the structure of the feature matching network can more clearly show the association structure within and between communities, which cannot be achieved by traditional clustering methods.

基于相同构思，本发明实施例提供的一种基于特征匹配网络的社团划分装置，如图3所示，该装置包括确定单元301、第一划分单元302、第二划分单元303、计算单元304、形成网络单元305和第三划分单元306。其中：Based on the same concept, an embodiment of the present invention provides an apparatus for community division based on a feature matching network. As shown in FIG. 3, the apparatus includes a determination unit 301, a first division unit 302, a second division unit 303, a calculation unit 304, A network unit 305 and a third dividing unit 306 are formed. in:

确定单元301：用于根据预设的K个哈希函数，确定每个账号信息对应的K位哈希向量；Determining unit 301: for determining the K-bit hash vector corresponding to each account information according to the preset K hash functions;

第一划分单元302：用于将每个账号信息对应的哈希向量，顺序划分为m＝K/k类子哈希向量；The first dividing unit 302 is used to divide the hash vector corresponding to each account information into m=K/k sub-hash vectors in sequence;

第二划分单元303：用于针对每个类，将子哈希向量相同的账号信息划分为同一组；Second dividing unit 303: for dividing the account information with the same sub-hash vector into the same group for each class;

计算单元304：用于计算同一组内的各账号信息之间的相似度；Calculation unit 304: used to calculate the similarity between account information in the same group;

形成网络单元305：用于若各账号信息之间的相似度大于阈值，则在各账号信息之间建立互连边，形成特征匹配网络；Forming a network unit 305: if the similarity between the account information is greater than a threshold, then establishing an interconnection edge between the account information to form a feature matching network;

第三划分单元306：用于根据特征匹配网络，对各账号信息进行社团划分。The third dividing unit 306 is configured to perform community division on each account information according to the feature matching network.

可选地，计算单元304具体用于：Optionally, the computing unit 304 is specifically used for:

可选地，计算单元304具体还用于：Optionally, the computing unit 304 is also specifically used for:

可选地，确定单元301用于：Optionally, the determining unit 301 is configured to:

根据公式(3)确定每个账号信息对应的K位哈希向量

其中，2'b表示

是一个二进制数，

是预设的K个哈希函数中的一个，Among them, 2'b means

is a binary number,

is one of the preset K hash functions,

表示账号信息的特征向量，其中，

c₁,c₂…,c_d表示账号信息的特征属性，

表示随机选取的一个非零向量，

is a feature vector representing account information, where,

represents a randomly selected non-zero vector,

可选地，第三划分单元306具体用于：Optionally, the third dividing unit 306 is specifically configured to:

(3)针对每个账号信息，从节点相似强度矩阵中账号信息所在的行，按相似强度从大到小的的顺序尝试将账号信息划至其他社区中；若账号信息自第p社区划分至第q社区后的模块度差为正数，则将账号信息划分至第q社区后结束；(3) For each account information, from the row where the account information is located in the node similarity strength matrix, try to assign the account information to other communities in the order of similarity strength from large to small; if the account information is divided from the pth community to If the modularity difference after the qth community is a positive number, the account information is divided into the qth community and it ends;

其中，w(z)＝w_ai,z 公式(4)

Among them, w(z)=w _ai,z Formula (4)

从上述内容可看出：本发明实施例中提供一种基于特征匹配网络的社团划分装置，根据预设的K个哈希函数，确定每个账号信息对应的K位哈希向量；将每个账号信息对应的哈希向量，顺序划分为类子哈希向量；针对每个类，将子哈希向量相同的账号信息划分为同一组；计算同一组内的各账号信息之间的相似度；若各账号信息之间的相似度大于阈值，则在各账号信息之间建立互连边，形成特征匹配网络；根据特征匹配网络，对各账号信息进行社团划分根据各账号信息之间的相似度，对各账号信息进行社团划分。本发明实施例中首先通过根据预设的K个哈希函数，确定每个账号信息对应的K位哈希向量，对于网络中数量巨大的账号信息来说，仅仅产生两个哈希值的哈希函数是不够的，因此确定每个账号信息对应的K位哈希向量能够应对复杂的网络账号信息。然后针对每个类，将子哈希向量相同的账号信息划分为一组，计算同一组内任意账号信息之间的相似度，能够避免针对整个网络中任意账号信息之间计算相似度而带来的计算量非常大的缺点；本发明技术方案能够有效减少账号信息之间相似度的计算量，仅仅计算同一组内的账号信息之间的相似度。最后根据确定各账号信息之间的相似度大于阈值，在各账号信息之间建立互连边，形成特征匹配网络；根据特征匹配网络，对各账号信息进行社团划分，能够更精准的对各账号信息进行社团划分，这样不仅能够使社团之间的关联关系很清楚，而且能够对划分的社团进行分析，找出异常社团，进而对异常社团内的账号进行异常账号排查，更加有针对性地找出欺诈账号，提高应对欺诈账号的效率。此外，如果需要对划分出的社团添加账号信息，只需要对该添加的账号信息重复以上简单的几个步骤，将所添加的账号信息更新到相应的位置即可，并不会产生更新困难的问题。It can be seen from the above that: an embodiment of the present invention provides a community division device based on a feature matching network, according to the preset K hash functions, determine the K-bit hash vector corresponding to each account information; The hash vector corresponding to the account information is sequentially divided into class sub-hash vectors; for each class, the account information with the same sub-hash vector is divided into the same group; the similarity between the account information in the same group is calculated; If the similarity between the account information is greater than the threshold, then establish interconnected edges between the account information to form a feature matching network; according to the feature matching network, each account information is divided into communities according to the similarity between the account information , to group each account information. In the embodiment of the present invention, the K-bit hash vector corresponding to each account information is determined according to the preset K hash functions. For a huge amount of account information in the network, only two hash values are generated. The hash function is not enough, so determining the K-bit hash vector corresponding to each account information can deal with complex network account information. Then, for each class, the account information with the same sub-hash vector is divided into a group, and the similarity between any account information in the same group is calculated, which can avoid the calculation of the similarity between any account information in the entire network. However, the technical solution of the present invention can effectively reduce the calculation amount of the similarity between account information, and only calculate the similarity between account information in the same group. Finally, according to the determination that the similarity between the account information is greater than the threshold, an interconnected edge is established between the account information to form a feature matching network; according to the feature matching network, each account information is divided into communities, which can be more accurate for each account. The information is divided into communities, which can not only make the relationship between communities clear, but also analyze the divided communities, find out abnormal communities, and then conduct abnormal account investigations on accounts in abnormal communities, so as to find out more targeted Generate fraudulent accounts and improve the efficiency of dealing with fraudulent accounts. In addition, if you need to add account information to the divided community, you only need to repeat the above simple steps for the added account information, and update the added account information to the corresponding location, and there will be no difficulty in updating. question.

本领域内的技术人员应明白，本发明的实施例可提供为方法、或计算机程序产品。因此，本发明可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且，本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, or as a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

本发明是参照根据本发明实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing device to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing device produce Means for implementing the functions specified in a flow or flow of a flowchart and/or a block or blocks of a block diagram.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture comprising instruction means, the instructions The apparatus implements the functions specified in the flow or flow of the flowcharts and/or the block or blocks of the block diagrams.

这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上，使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded on a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process such that The instructions provide steps for implementing the functions specified in the flow or blocks of the flowcharts and/or the block or blocks of the block diagrams.

尽管已描述了本发明的优选实施例，但本领域内的技术人员一旦得知了基本创造性概念，则可对这些实施例作出另外的变更和修改。所以，所附权利要求意欲解释为包括优选实施例以及落入本发明范围的所有变更和修改。Although preferred embodiments of the present invention have been described, additional changes and modifications to these embodiments may occur to those skilled in the art once the basic inventive concepts are known. Therefore, the appended claims are intended to be construed to include the preferred embodiment and all changes and modifications that fall within the scope of the present invention.

显然，本领域的技术人员可以对本发明进行各种改动和变型而不脱离本发明的精神和范围。这样，倘若本发明的这些修改和变型属于本发明权利要求及其等同技术的范围之内，则本发明也意图包含这些改动和变型在内。It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit and scope of the invention. Thus, provided that these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include these modifications and variations.

Claims

1. a community division method based on feature matching network, is characterized in that, comprises:

According to the preset K hash functions, determine the K-bit hash vector corresponding to each account information;

Divide the hash vector corresponding to each account information into m=K/k types of sub-hash vectors in order; wherein, k is the number of hash vectors in each type of sub-hash vector;

For each class, the account information with the same sub-hash vector is divided into the same group;

Calculate the similarity between account information in the same group;

If the similarity between the various account information is greater than the threshold, then establish interconnected edges between the various account information to form a feature matching network;

According to the feature matching network, the information of each account is divided into groups.

2. The method of claim 1, wherein calculating the similarity between each account information in the same group, comprising:

If the i-th account information and the j-th account information are in the same group of n types, n/m is taken as the similarity between the i-th account information and the j-th account information; The jth account information is any one of the account information.

3. The method according to claim 1, wherein calculating the similarity between each account information in the same group, comprising:

If the i-th account information and the j-th account information are in the same group, count the number of the hash vector of the i-th account information and the hash vector of the j-th account information that are in the same position and have the same hash vector value h; the i-th account information and the j-th account information are any of the account information;

The similarity between the i-th account information and the j-th account information is s=h/K.

4. The method of claim 1, wherein, according to preset K hash functions, determine the K-bit hash vector corresponding to each account information, comprising:

Among them, 2'b means

is a binary number,

is one of the preset K hash functions,

is a feature vector representing account information, where,

represents a randomly selected non-zero vector,

5. The method according to any one of claims 1 to 4, wherein, according to the feature matching network, performing community division on the account information, comprising:

(1) dividing each account information into different communities in the feature matching network;

(2) Calculate the similarity strength of each account information according to the similarity between the account information, thereby generating a node similarity strength matrix;

(3) For each account information, from the row where the account information is located in the node similarity strength matrix, try to assign the account information to other communities in descending order of similarity strength; If the modularity difference after the information is divided into the qth community from the pth community is a positive number, the account information is divided into the qth community and the end is ended;

(4) Repeat the execution until the community structure no longer changes.

6. The method according to claim 5, wherein calculating the similarity strength of each account information according to the similarity between the account information, comprising:

Calculate the similarity strength _si,j between the i-th account information and the j-th account information according to formula (2);

Wherein, Γ(i) represents the neighbor set of the ith account information, Γ(i)∩Γ(j) represents the common neighbor set of the ith account information and the jth account information, w _ai,z is The weight sum of the edges between any account information ai and the zth account information.

7. A community division device based on a feature matching network, characterized in that, comprising:

a determination unit, configured to determine the K-bit hash vector corresponding to each account information according to the preset K hash functions;

The first division unit is used to divide the hash vector corresponding to each account information into m=K/k type sub-hash vectors in order; wherein, k is the number of hash vectors in each type of sub-hash vector;

The second division unit is used to divide account information with the same sub-hash vector into the same group for each class;

a calculation unit, used to calculate the similarity between account information in the same group;

forming a network unit, for if the similarity between the account information is greater than a threshold, then establishing an interconnection edge between the account information to form a feature matching network;

The third dividing unit is configured to perform community division on each account information according to the feature matching network.

8. The apparatus of claim 7, wherein

The computing unit is specifically configured to use n/m as the similarity between the i-th account information and the j-th account information if the i-th account information and the j-th account information are in the same group of n types; The i-th account information and the j-th account information are any one of the account information.

9. The apparatus of claim 7, wherein

The computing unit is further configured to, if the i-th account information and the j-th account information are located in the same group, count the hash vector of the i-th account information and the hash vector of the j-th account information in the same position. And the same number h of hash vector values; the i-th account information and the j-th account information are any one of the account information;

10. The apparatus of claim 7, wherein the determining unit is configured to determine the K-bit hash vector corresponding to each account information according to formula (3)

Among them, 2'b means

is a binary number,

is one of the preset K hash functions,

is a feature vector representing account information, where,

represents a randomly selected non-zero vector,

11. The device of any one of claims 7 to 10, wherein

The third dividing unit is specifically used for (1) dividing each account information into different communities in the feature matching network;

(3) For each account information, from the row where the account information is located in the node similarity strength matrix, try to assign the account information to other communities in order of similarity strength from large to small; if the If the modularity difference after the account information is divided into the qth community from the pth community is a positive number, the account information is divided into the qth community and the end is ended;

(4) Repeat the execution until the community structure no longer changes.

12. The apparatus of claim 11, wherein

The calculating unit is further configured to calculate the similarity strength _si,j between the i-th account information and the j-th account information according to formula (4);