
CN114298203B - Method, apparatus, device and computer readable medium for data classification - Google Patents


Info

Publication number
CN114298203B
CN114298203B (application CN202111593409.5A)
Authority
CN
China
Prior art keywords
data
classified
clustering
cluster
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111593409.5A
Other languages
Chinese (zh)
Other versions
CN114298203A (en)
Inventor
闵际达
郭瑞
申世豪
张战胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taikang Insurance Group Co Ltd
Taikang Pension Insurance Co Ltd
Original Assignee
Taikang Insurance Group Co Ltd
Taikang Pension Insurance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taikang Insurance Group Co Ltd, Taikang Pension Insurance Co Ltd filed Critical Taikang Insurance Group Co Ltd
Priority to CN202111593409.5A
Publication of CN114298203A
Application granted
Publication of CN114298203B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, an apparatus, a device and a computer readable medium for data classification, and relates to the field of computer technology. One embodiment of the method comprises the following steps: dividing data into classes by adopting a KNN algorithm; according to the classification of the data, a data similarity threshold and a data clustering threshold, storing cluster midpoints in the data in an in-cluster server, and storing outliers in the data, bridge nodes in the data and multi-vertices in the data in a bridge server; and acquiring data from the in-cluster server or the bridge server according to a to-be-classified similarity threshold and a to-be-classified clustering parameter of the data to be classified in a clustering request, together with the data similarity threshold and the data clustering threshold, so as to determine the group to which the data to be classified belongs. This embodiment can improve the speed and accuracy of data classification.

Description

Method, apparatus, device and computer readable medium for data classification
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a computer readable medium for data classification.
Background
In current enterprise data platforms, in order to develop services accurately and provide matching follow-up industry services to enterprises, enterprises are classified according to certain classification rules. However, because enterprises are numerous and the classification rules are disordered, the same enterprise may be assigned to several industries at once. For example, according to the original classification rules, the features contained by an enterprise may classify it into both the IT industry and the financial industry. This leads to confused enterprise positioning, and suitable follow-up business services cannot be provided to the enterprise.
In the process of implementing the present invention, the inventors found that at least the following problem exists in the prior art: enterprise classification is performed only by exhaustive judgment with basic if-else statements, writing all judgment conditions into the system, so the classification speed is low and the accuracy is low.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, apparatus, device, and computer readable medium for data classification, which can improve the speed and accuracy of data classification.
To achieve the above object, according to one aspect of the embodiments of the present invention, there is provided a method of classifying data, including:
dividing data into classes by adopting a KNN algorithm;
according to the classification of the data, a data similarity threshold and a data clustering threshold, storing cluster midpoints in the data in an in-cluster server, and storing outliers in the data, bridge nodes in the data and multi-vertices in the data in a bridge server;
and acquiring data from the in-cluster server or the bridge server according to a to-be-classified similarity threshold and a to-be-classified clustering parameter of the data to be classified in a clustering request, together with the data similarity threshold and the data clustering threshold, so as to determine the group to which the data to be classified belongs.
Dividing the data into classes by the KNN algorithm includes:
dividing the data into classes by a KNN algorithm optimized with an insertion sort algorithm.
Dividing the data into classes by the KNN algorithm further includes:
dividing the data into classes by a KNN algorithm optimized with an insertion sort algorithm and by setting a flag for exchanged data.
Storing cluster midpoints in the data in the in-cluster server, and storing outliers, bridge nodes and multi-vertices in the data in the bridge server, according to the classification of the data, the data similarity threshold and the data clustering threshold, includes:
according to the classification of the data, the data similarity threshold and the data clustering threshold, taking the data smaller than the data similarity threshold and equal to the data clustering threshold as cluster midpoints, and storing the cluster midpoints in the in-cluster server;
according to the data similarity threshold and the data clustering threshold, taking the data in the classification that are greater than the data similarity threshold and greater than the data clustering threshold as outliers, and storing the outliers in the bridge server;
according to the classification of the data, the data similarity threshold and the data clustering threshold, taking the data equal to the data similarity threshold, equal to the data clustering threshold and belonging to only one cluster as bridge nodes, and storing the bridge nodes in the bridge server;
and according to the classification of the data, the data similarity threshold and the data clustering threshold, taking the data equal to the data similarity threshold, equal to the data clustering threshold and belonging to a plurality of clusters as multi-vertices, and storing the multi-vertices in the bridge server.
Acquiring data from the in-cluster server or the bridge server, according to the to-be-classified similarity threshold and the to-be-classified clustering parameter of the data to be classified in the clustering request together with the data similarity threshold and the data clustering threshold, so as to determine the group to which the data to be classified belongs, includes:
if the to-be-classified similarity threshold of the data to be classified is smaller than or equal to the data similarity threshold, and the to-be-classified clustering parameter of the data to be classified is smaller than or equal to the data clustering threshold, acquiring data from the in-cluster server to determine the group to which the data to be classified belongs;
and if the to-be-classified similarity threshold of the data to be classified is greater than the data similarity threshold, and the to-be-classified clustering parameter of the data to be classified is greater than the data clustering threshold, acquiring data from the bridge server to determine the group to which the data to be classified belongs.
Acquiring data from the bridge server to determine the group to which the data to be classified belongs, when the to-be-classified similarity threshold is greater than the data similarity threshold and the to-be-classified clustering parameter is greater than the data clustering threshold, includes:
if the to-be-classified similarity threshold of the data to be classified is greater than the data similarity threshold, and the to-be-classified clustering parameter of the data to be classified is greater than the data clustering threshold, acquiring data from the bridge server, the data to be classified and the acquired data belonging to the same group;
and storing the data to be classified and the acquired data in the in-cluster server.
The data includes enterprise classification data.
According to a second aspect of an embodiment of the present invention, there is provided an apparatus for classifying data, including:
a division module for dividing data into classes by adopting a KNN algorithm;
a storage module for storing cluster midpoints in the data in an in-cluster server, and storing outliers in the data, bridge nodes in the data and multi-vertices in the data in a bridge server, according to the classification of the data, a data similarity threshold and a data clustering threshold;
and a grouping module for acquiring data from the in-cluster server or the bridge server according to a to-be-classified similarity threshold and a to-be-classified clustering parameter of the data to be classified in a clustering request, together with the data similarity threshold and the data clustering threshold, so as to determine the group to which the data to be classified belongs.
According to a third aspect of an embodiment of the present invention, there is provided an electronic device for data classification, including:
one or more processors;
storage means for storing one or more programs,
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods as described above.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable medium having stored thereon a computer program which when executed by a processor implements a method as described above.
One embodiment of the above invention has the following advantage or beneficial effect: data are divided into classes by adopting a KNN algorithm; according to the classification of the data, a data similarity threshold and a data clustering threshold, cluster midpoints in the data are stored in an in-cluster server, and outliers, bridge nodes and multi-vertices in the data are stored in a bridge server; and data are acquired from the in-cluster server or the bridge server according to a to-be-classified similarity threshold and a to-be-classified clustering parameter of the data to be classified in a clustering request, together with the data similarity threshold and the data clustering threshold, so as to determine the group to which the data to be classified belongs. By acquiring data from the in-cluster server or the bridge server, the group to which the data to be classified belongs can be determined without repeatedly judging whether grouping conditions are satisfied, so that the speed and accuracy of data classification can be improved.
Further effects of the above non-conventional alternatives are described below in connection with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of enterprise classification implemented with if-else;
FIG. 2 is a schematic diagram of the main flow of a method of data classification according to an embodiment of the invention;
FIG. 3 is a clustering schematic diagram of a KNN algorithm according to an embodiment of the invention;
FIG. 4 is a schematic flow chart of storing data in servers according to an embodiment of the invention;
FIG. 5 is a schematic diagram of in-cluster servers storing cluster midpoints according to an embodiment of the invention;
FIG. 6 is a schematic flow chart of acquiring data to determine the group to which data to be classified belong according to an embodiment of the present invention;
FIG. 7 is a schematic flow chart of updating an in-cluster server according to an embodiment of the invention;
FIG. 8 is a schematic diagram of data to be classified and clusters according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of the main structure of an apparatus for data classification according to an embodiment of the present invention;
FIG. 10 is an exemplary system architecture diagram to which embodiments of the present invention may be applied;
FIG. 11 is a schematic diagram of a computer system suitable for implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the existing enterprise data platform, developers cannot achieve accurate classification of enterprises according to the classification rules, and the system judges enterprise features using basic if-else statements.
Referring specifically to FIG. 1, FIG. 1 is a schematic diagram of enterprise classification implemented with if-else. In FIG. 1, the enterprise features are traversed by if-else judgment statements to classify the enterprises. When a new enterprise feature file is added, the judgment statements need to be entered again for renewed judgment.
A large number of similar judgment statements are written into the system, the code is long, and when new enterprise classification features appear, developers still need to maintain the redundant code.
The existing distributed clustering algorithms have, first, a large computation cost. When performing a graph clustering task, existing algorithms need to compute the similarity between each vertex and its neighbors and compare it with given parameters. When the data volume is large, computing the similarity between vertices consumes a high computation cost, so the real-time requirement of users cannot be met. Second, the space cost is large. In a big data environment, the number of vertices in the graph is typically large. Moreover, since a vertex is usually connected to multiple edges, in most cases the number of edges in the graph is much larger than the number of vertices. As a result, existing algorithms require costly space to maintain the vertex information and edge information of the graph. Third, the communication cost is high. As previously described, the vertices of the original graph are typically partitioned across multiple hosts. A graph clustering algorithm in a traditional distributed environment needs to continuously read the data set from each node for computation and then store the results to different nodes, with the consequence that the I/O overhead between disks is large and the communication time is long.
It can be seen that the existing classification schemes are slow and not accurate enough.
In order to solve the problems of low speed and low accuracy of the classification scheme, the following technical solution in the embodiments of the present invention can be adopted.
Referring to FIG. 2, FIG. 2 is a schematic diagram of the main flow of a method of data classification according to an embodiment of the present invention, in which data to be classified are compared with data acquired from an in-cluster server or a bridge server. As shown in FIG. 2, the method specifically includes the following steps.
In the embodiment of the invention, classes of data first need to be divided on the basis of existing data, and data of different categories are stored in different servers. Then, the group to which the data to be classified belongs is determined by comparing the data to be classified with data from the corresponding server.
S201, classifying data by adopting a KNN algorithm.
The KNN algorithm generally comprises three elements: a set of labeled objects, the distance between objects, and the value of the threshold k, i.e., the number of nearest objects considered. The value of k and the way the distance between two points is calculated determine the final classification.
Referring to FIG. 3, FIG. 3 is a clustering schematic diagram of a KNN algorithm according to an embodiment of the present invention. With k set to 3, the triangle icon is a new sample. After the distances between objects have been calculated, the dotted circle in the left graph encloses the three nearest neighbors; since the circle contains two circle icons and one rectangle icon, circle icons are in the majority, and the triangle icon is therefore classified into the category to which the circles belong. In the right graph, also with k set to 3, the diamond icon is classified into the category to which the rectangles belong according to the proportion of icons within the dotted circle. This is the classification manner of the KNN algorithm.
In the embodiment of the invention, the data is classified by adopting a KNN algorithm. As one example, data refers to existing data. Furthermore, on the basis of the data classification, the classification of the data to be classified is realized.
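As a minimal illustrative sketch of the classification manner just described (the patent gives no code; the function names and the use of Euclidean distance are assumptions):

```python
import math
from collections import Counter

def knn_classify(sample, labeled_points, k=3):
    # Sort the labeled objects by Euclidean distance to the new sample
    # and keep the k nearest neighbors.
    nearest = sorted(labeled_points, key=lambda p: math.dist(sample, p[0]))[:k]
    # Majority vote among the k nearest neighbors decides the class.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

With k = 3 this reproduces the FIG. 3 scenario: if two of the three nearest neighbors are circles, the new sample is assigned to the circle class.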
In one embodiment of the invention, data are divided into classes using a KNN algorithm optimized with an insertion sort algorithm.
The optimization of the KNN algorithm is mainly embodied in the sorting process. After calculating the distances, the plain KNN algorithm sorts all values with bubble sort or selection sort.
The KNN algorithm optimized with the insertion sort algorithm is optimized in the following respects.
The first optimization is to reduce the number of comparisons.
If there are two ordered arrays [2,3,4,6,7] and [10], the merge function will compare 10 against the elements of the first array five times before the first array is exhausted. Because each element of the first array is less than 10, the elements of the first array are placed into the merge list one at a time.
For this situation, the optimized KNN algorithm uses insertion sort for small-scale subarrays, so that 10 is efficiently inserted at the proper position of the ordered sequence [2,3,4,6,7] and the number of comparisons is reduced.
The second optimization is that elements are not copied to the auxiliary array in the merge() method, saving the time of copying the array.
Two sorting methods are called alternately: the first sorts data from the input array into the auxiliary array; the second sorts data from the auxiliary array back into the input array.
The optimized KNN algorithm has the lowest average time complexity and higher stability. It should be noted that the insertion sort algorithm in the embodiment of the present invention may be used for small-scale subarrays. Sorting the small-scale subarrays by insertion rather than by merging reduces the consumption of memory and time, realizing an optimization of the merge sort algorithm.
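The first optimization can be sketched as a merge sort that falls back to insertion sort for small subarrays; the cutoff value and all names are assumptions, not taken from the patent:

```python
CUTOFF = 8  # assumed size at or below which subarrays are insertion-sorted

def insertion_sort(a, lo, hi):
    # Insert each element into place within a[lo..hi]; this needs few
    # comparisons when the run is small or nearly ordered.
    for i in range(lo + 1, hi + 1):
        v, j = a[i], i - 1
        while j >= lo and a[j] > v:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = v

def merge_sort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= CUTOFF:
        insertion_sort(a, lo, hi)  # small subarray: insert instead of merge
        return
    mid = (lo + hi) // 2
    merge_sort(a, lo, mid)
    merge_sort(a, mid + 1, hi)
    # Standard merge of the two sorted halves.
    merged, i, j = [], lo, mid + 1
    while i <= mid and j <= hi:
        if a[i] <= a[j]:
            merged.append(a[i]); i += 1
        else:
            merged.append(a[j]); j += 1
    merged += a[i:mid + 1] + a[j:hi + 1]
    a[lo:hi + 1] = merged
```

In the [2,3,4,6,7] / [10] example above, the combined run is small enough that insertion sort places 10 directly at the end instead of merging.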
In one embodiment of the invention, in addition to optimizing the KNN algorithm with the insertion sort algorithm, a flag can be set for further optimization. That is, data are divided into classes by a KNN algorithm optimized with an insertion sort algorithm and by setting a flag for exchanged data.
Specifically, while the KNN algorithm is optimized with the insertion sort algorithm, it is also optimized by setting a flag bit. During merge sort, each element continuously approaches its final position; if no position exchange occurs during one comparison pass of the merge sort, the sequence is already ordered. Therefore, a flag bit is set during sorting to judge whether elements have been exchanged, so that unnecessary comparisons are reduced and the aim of optimization is achieved.
As an example, during the recursive sorting process, if an element exchange occurs, the flag is set to 1 and the merge sort continues. If a recursion level finds no element exchange, the flag is set to 0, the recursion is exited, and no further merging is performed.
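The flag-bit idea above might be realized as follows: each recursion level returns 1 if any exchange occurred and 0 otherwise, and merging is skipped when the two halves are already in order. This is a sketch under assumed semantics, not the patent's code:

```python
def merge_sort_flagged(a, lo=0, hi=None):
    # Returns the flag: 1 if elements were exchanged (a merge was needed),
    # 0 if the range was already ordered and merging could be skipped.
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return 0
    mid = (lo + hi) // 2
    flag = merge_sort_flagged(a, lo, mid) | merge_sort_flagged(a, mid + 1, hi)
    if a[mid] <= a[mid + 1]:
        return flag  # halves already in order across the boundary: no merge
    merged, i, j = [], lo, mid + 1
    while i <= mid and j <= hi:
        if a[i] <= a[j]:
            merged.append(a[i]); i += 1
        else:
            merged.append(a[j]); j += 1
    merged += a[i:mid + 1] + a[j:hi + 1]
    a[lo:hi + 1] = merged
    return 1  # an exchange occurred at this level
```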
In one embodiment of the invention, the value of k in the KNN algorithm is determined by cross-validation. Cross-validation splits the sample data into training data and validation data according to a certain ratio. As one example, the sample data are split at a ratio of 6:4.
Starting from a small value of k, the value of k is continuously increased; the error on the validation set is then computed through cross-validation, and finally the k with the minimum error rate is chosen.
The solution of the embodiment of the invention is applied to the scenario of enterprise classification, and the calculation of the k value takes the enterprise classification into account. 300 pieces of sample data were prepared and split at a ratio of 6:4 into 180 pieces of training data and 120 pieces of validation data.
Starting from k=1, the value of k is continuously increased. Taking the 180 pieces of training data and their classification results as sample data, the error rate on the 120 pieces of validation data is calculated through cross-validation: when k is 1, the error rate is 50%. As k increases to 2, the error rate on the 120 pieces of validation data decreases, and when k increases to 10, the error rate on the 120 pieces of validation data is minimal. As k continues to increase, the error rate rises again, so the value of k is set to 10.
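The k-selection procedure above can be sketched as a hold-out validation loop; the helper `_knn` and all names are illustrative assumptions:

```python
import math
from collections import Counter

def _knn(sample, train, k):
    # Plain KNN: majority vote among the k nearest training points.
    near = sorted(train, key=lambda p: math.dist(sample, p[0]))[:k]
    return Counter(lbl for _, lbl in near).most_common(1)[0][0]

def choose_k(samples, labels, max_k, train_ratio=0.6):
    # Hold-out validation with a 6:4 split of the sample data, as in
    # the example above; increase k from 1 and keep the k with the
    # minimum error rate on the validation part.
    n_train = int(len(samples) * train_ratio)
    train = list(zip(samples[:n_train], labels[:n_train]))
    valid = list(zip(samples[n_train:], labels[n_train:]))
    best_k, best_err = 1, float("inf")
    for k in range(1, max_k + 1):
        err = sum(_knn(x, train, k) != y for x, y in valid) / len(valid)
        if err < best_err:
            best_k, best_err = k, err
    return best_k
```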
S202, according to the classification of the data, the data similarity threshold and the data clustering threshold, storing the cluster midpoints in the data in the in-cluster server, and storing the outliers in the data, the bridge nodes in the data and the multi-vertices in the data in the bridge server.
Dividing the classes of data only by the KNN algorithm is slow, and the prediction result is not accurate enough.
In the embodiment of the invention, on the basis of the classification of the data divided by the KNN algorithm, the data similarity threshold and the data clustering threshold, the data are stored in a classified manner. As one example, the data similarity threshold equals 0.6 and the data clustering threshold equals 0.6.
The data similarity threshold e is the structural similarity between two data nodes. The structural similarity is the ratio of the number of common neighbors of the two data nodes to the geometric mean of their numbers of neighbors, where the neighbor set of each node contains the node itself. That is, the data similarity threshold is a parameter determined by the numbers of neighbors of two data nodes.
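Under this definition, the structural similarity between two data nodes might be computed as follows (the adjacency representation and names are assumptions):

```python
import math

def structural_similarity(graph, u, v):
    # sigma(u, v) = |N(u) & N(v)| / sqrt(|N(u)| * |N(v)|), where N(x) is
    # the neighbor set of x including x itself, per the definition above.
    # `graph` is an adjacency dict: node -> set of neighbor nodes.
    nu = graph[u] | {u}
    nv = graph[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))
```

For a triangle, any two nodes share all three neighbors, giving similarity 1.0; for the endpoints of a path 1-2-3, only node 2 is shared, giving 0.5.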
The data clustering threshold u is a parameter for distinguishing data nodes. In particular, the data clustering threshold is used to distinguish bridge nodes, outliers and multi-vertices. A bridge node is an isolated node adjacent to at least two clusters. An outlier is an isolated node adjacent to only one cluster or not adjacent to any cluster. A multi-vertex is a vertex that is divided into one cluster and is also connected to vertices outside that cluster. The difference between multi-vertices and bridge nodes is that a bridge node connects two or more clusters but cannot be divided into any one of them, while a multi-vertex is divided into a cluster and is also connected to vertices outside the cluster.
In the embodiment of the invention, the data similarity threshold is a parameter obtained from the classification of the data by the KNN algorithm, and the data clustering threshold is a preset parameter.
Referring to FIG. 4, FIG. 4 is a schematic flow chart of storing data in servers according to an embodiment of the present invention, which specifically includes the following steps.
S401, according to the classification of the data, the data similarity threshold and the data clustering threshold, taking the data smaller than the data similarity threshold and equal to the data clustering threshold as cluster midpoints, and storing the cluster midpoints in the in-cluster server.
In the embodiment of the invention, data belonging to the same class form a cluster. That is, the data may be divided into a plurality of clusters according to their classification. Specifically, among the data belonging to the same cluster, data smaller than the data similarity threshold and equal to the data clustering threshold are taken as cluster midpoints, and the cluster midpoints are stored in the in-cluster server. As one example, the data of each cluster are stored in the in-cluster server of the same cluster. When the in-cluster server of one cluster has no storage space left, the remaining data of that cluster may also be stored in the in-cluster server of another cluster.
Referring to FIG. 5, FIG. 5 is a schematic diagram of in-cluster servers storing cluster midpoints according to an embodiment of the invention. FIG. 5 includes three in-cluster servers, namely in-cluster server 1, in-cluster server 2 and in-cluster server 3. In-cluster server 1 stores the data of the first cluster, and in-cluster server 2 stores the data of the second cluster. The data v11 to v15 of the first cluster are stored in in-cluster server 3 because the storage space of in-cluster server 1 is insufficient. Similarly, the data v22 to v27 of the second cluster are stored in in-cluster server 3 because the storage space of in-cluster server 2 is insufficient.
Specifically, the capacity of each in-cluster server is C, and the data corresponding to the clusters, that is, the subgraphs {G1, G2, ..., Gn}, are allocated to the in-cluster servers {E1, E2, ..., En}. The allocation rule is: if |Gi|+|Gj| <= C, the clusters Gi and Gj may be allocated to the same in-cluster server; otherwise, they cannot be allocated to the same in-cluster server. If a cluster Gi satisfies |Gi| > C, the vertices in Gi are partitioned again under other parameters.
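Assuming a first-fit packing strategy (the patent only states the pairing rule, so the strategy and all names here are assumptions), the allocation of clusters to in-cluster servers might be sketched as:

```python
def allocate_clusters(cluster_sizes, capacity):
    # Greedily pack clusters onto in-cluster servers of capacity C:
    # clusters may share a server only while their total size stays <= C.
    # A cluster with |Gi| > C must be re-partitioned under other parameters.
    servers = []  # each entry: [used_size, [cluster indices]]
    for idx, size in enumerate(cluster_sizes):
        if size > capacity:
            raise ValueError(f"cluster {idx} exceeds capacity; re-partition it")
        for srv in servers:
            if srv[0] + size <= capacity:  # fits on an existing server
                srv[0] += size
                srv[1].append(idx)
                break
        else:
            servers.append([size, [idx]])  # open a new in-cluster server
    return servers
```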
S402, according to the data similarity threshold and the data clustering threshold, taking the data in the classification that are greater than the data similarity threshold and greater than the data clustering threshold as outliers, and storing the outliers in the bridge server.
In the embodiment of the invention, outliers are screened out according to the data similarity threshold and the data clustering threshold. Specifically, data greater than the data similarity threshold and greater than the data clustering threshold are taken as outliers, and the outliers are stored in the bridge server.
S403, according to the classification of the data, the data similarity threshold and the data clustering threshold, taking the data equal to the data similarity threshold, equal to the data clustering threshold and belonging to only one cluster as bridge nodes, and storing the bridge nodes in the bridge server.
In the embodiment of the invention, bridge nodes are screened out according to the classification of the data, the data similarity threshold and the data clustering threshold. Specifically, the cluster to which a piece of data belongs is determined according to the classification of the data: the category to which the data belongs is the cluster to which it belongs. Data equal to the data similarity threshold, equal to the data clustering threshold and belonging to only one cluster are taken as bridge nodes, and the bridge nodes are stored in the bridge server.
S404, according to the classification of the data, the data similarity threshold and the data clustering threshold, taking the data equal to the data similarity threshold, equal to the data clustering threshold and belonging to a plurality of clusters as multi-vertices, and storing the multi-vertices in the bridge server.
In the embodiment of the invention, multi-vertices are screened out according to the classification of the data, the data similarity threshold and the data clustering threshold. Specifically, data equal to the data similarity threshold, equal to the data clustering threshold and belonging to a plurality of clusters are taken as multi-vertices, and the multi-vertices are stored in the bridge server.
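The four-way decision of steps S401 to S404 can be summarized in one dispatch function; the threshold comparisons follow the wording above literally, and the exact semantics of "equal" as well as all names are assumptions:

```python
def node_role(similarity, cluster_param, clusters, eps=0.6, mu=0.6):
    # Map a data node to its storage role per S401-S404. `clusters` is the
    # number of clusters the node belongs to; eps and mu are the data
    # similarity and clustering thresholds (0.6 each in the example above).
    if similarity < eps and cluster_param == mu:
        return "cluster midpoint"   # stored in an in-cluster server (S401)
    if similarity > eps and cluster_param > mu:
        return "outlier"            # stored in the bridge server (S402)
    if similarity == eps and cluster_param == mu:
        # Equal to both thresholds: distinguish by cluster membership.
        return "bridge node" if clusters == 1 else "multi-vertex"  # S403/S404
    return "unclassified"
```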
In the embodiment of FIG. 4, the cluster midpoints are stored in the in-cluster servers, while, to reduce communication cost, the outliers, bridge nodes and multi-vertices are stored in the bridge server; the servers thus store the data in a classified manner.
S203, acquiring data from the in-cluster server or the bridge server according to the to-be-classified similarity threshold and the to-be-classified clustering parameter of the data to be classified in the clustering request, together with the data similarity threshold and the data clustering threshold, so as to determine the group to which the data to be classified belongs.
After a clustering request is received, the server from which data need to be acquired is determined based on the to-be-classified similarity threshold and the to-be-classified clustering parameter of the data to be classified, together with the data similarity threshold and the data clustering threshold; then the data to be classified are compared with the acquired data to determine the group to which the data to be classified belong.
From the perspective of clustering, the vertex of the data to be classified falls into one of the following three categories: cluster midpoint, out-of-cluster point and multi-vertex.
Cluster midpoint: a vertex that must belong to its original cluster. If the vertex is a cluster midpoint, there is no need to find a new cluster for it.
Out-of-cluster point: a vertex that must not belong to its original cluster. If the vertex is an out-of-cluster point, a new cluster needs to be found for it.
Multi-vertex: it needs to be further evaluated whether the vertex belongs to a new cluster or to its original cluster.
Referring to fig. 6, fig. 6 is a schematic flow chart of acquiring data to determine a packet to which data to be classified belongs according to an embodiment of the present invention, and specifically includes the following steps:
S601, if the to-be-classified similarity threshold of the data to be classified is smaller than or equal to the data similarity threshold, and the to-be-classified clustering parameter of the data to be classified is smaller than or equal to the data clustering threshold, acquire data from an in-cluster server to determine the group to which the data to be classified belongs.
The data to be classified in the clustering request carries two parameters: the to-be-classified similarity threshold and the to-be-classified clustering parameter. When the to-be-classified similarity threshold is smaller than or equal to the data similarity threshold and the to-be-classified clustering parameter is smaller than or equal to the data clustering threshold, the vertex of the data to be classified lies inside a cluster, i.e. the data to be classified is a cluster midpoint, so the data can be acquired from an in-cluster server.
The in-cluster servers store the cluster midpoints of each cluster, and the group to which the data to be classified belongs is determined by comparing the data to be classified with the cluster midpoints of each cluster. As an example, suppose there are 3 in-cluster servers in total, each storing the data of one cluster. Three vertices are acquired from these 3 in-cluster servers, and the data to be classified is compared with them to determine its group.
S602, if the to-be-classified similarity threshold of the data to be classified is larger than the data similarity threshold, and the to-be-classified clustering parameter of the data to be classified is larger than the data clustering threshold, acquire data from the bridge server to determine the group to which the data to be classified belongs.
When the to-be-classified similarity threshold of the data to be classified is larger than the data similarity threshold and the to-be-classified clustering parameter is larger than the data clustering threshold, the vertex of the data to be classified lies outside every cluster, i.e. the data to be classified is not a cluster midpoint, so the data can be acquired from the bridge server.
Specifically, the bridge server stores the data nodes that are independent of every cluster: the outliers, bridge nodes and multiple vertices. The group to which the data to be classified belongs is determined by comparing the data to be classified with the data nodes of the bridge server, for example by comparing them one by one.
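The routing of steps S601 and S602 can be sketched as follows. The function name and threshold values are illustrative assumptions (the stored thresholds 0.4 and 6 are taken from the later worked example), not taken literally from the embodiment.

```python
# Illustrative routing of a clustering request (S601/S602).
DATA_SIM_THRESHOLD = 0.4       # data similarity threshold
DATA_CLUSTER_THRESHOLD = 6     # data clustering threshold

def route_request(sim_threshold, cluster_param):
    """Decide which server group serves a clustering request."""
    if (sim_threshold <= DATA_SIM_THRESHOLD
            and cluster_param <= DATA_CLUSTER_THRESHOLD):
        # The vertex lies inside a cluster: compare against the
        # cluster midpoints held by the in-cluster servers (S601).
        return "in-cluster"
    if (sim_threshold > DATA_SIM_THRESHOLD
            and cluster_param > DATA_CLUSTER_THRESHOLD):
        # The vertex lies outside every cluster: compare against the
        # outliers, bridge nodes and multiple vertices (S602).
        return "bridge"
    # Mixed case: not covered by S601/S602.
    return "undetermined"

print(route_request(0.3, 5))   # in-cluster
print(route_request(0.5, 8))   # bridge
```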
Referring to fig. 7, fig. 7 is a schematic flow chart of updating servers in a cluster according to an embodiment of the present invention, which specifically includes:
S701, if the to-be-classified similarity threshold of the data to be classified is larger than the data similarity threshold, and the to-be-classified clustering parameter of the data to be classified is larger than the data clustering threshold, acquire data from the bridge server to determine that the data to be classified and the acquired data belong to the same group.
When the to-be-classified similarity threshold of the data to be classified is larger than the data similarity threshold and the to-be-classified clustering parameter is larger than the data clustering threshold, the data can be acquired from the bridge server to implement the grouping. Since the data nodes in the bridge server do not belong to any cluster, and the data to be classified and the data acquired from the bridge server belong to the same group, a new cluster can be created.
S702, store the data to be classified and the acquired data in an in-cluster server.
Because the bridge server only stores data that does not belong to any cluster, the data to be classified and the acquired data need to be stored in an in-cluster server, thereby updating the in-cluster servers.
In the embodiment of fig. 7, the data of the newly created cluster is stored on an in-cluster server, thereby updating the in-cluster servers.
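A minimal sketch of the fig. 7 update, assuming in-memory containers stand in for the in-cluster servers and the bridge server; all names are illustrative.

```python
# Illustrative update of fig. 7: a dict stands in for the in-cluster
# servers and a list for the bridge server.
def form_new_cluster(candidate, bridge_store, in_cluster_store, matches):
    """Group the candidate with its matching bridge-server nodes into a
    new cluster (S701), store the cluster on an in-cluster server and
    drop the moved nodes from the bridge server (S702)."""
    cluster_id = "C%d" % (len(in_cluster_store) + 1)
    in_cluster_store[cluster_id] = [candidate] + matches
    for node in matches:
        bridge_store.remove(node)
    return cluster_id

bridge = ["v7", "v9", "v12"]               # outliers / bridge nodes
clusters = {"C1": ["v1", "v2"], "C2": ["v3", "v4"]}
cid = form_new_cluster("v20", bridge, clusters, matches=["v7", "v12"])
print(cid, clusters[cid], bridge)          # C3 ['v20', 'v7', 'v12'] ['v9']
```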
Embodiments of the present invention are described below in connection with specific cluster requests.
The graph G(V, E) is classified by the KNN algorithm, with the data similarity threshold 0.4 and the data clustering threshold 6. A total of three clusters are found: C1, C2, C3, which are assigned to 3 in-cluster servers respectively. In addition, one bridge server is used to store all bridge nodes, outliers and multiple vertices.
Suppose the clustering request with parameters Q(0.5,8) is submitted: the to-be-classified similarity threshold 0.5 is greater than 0.4, and the to-be-classified clustering parameter 8 is greater than 6. It can then be inferred that, under the parameters Q(0.5,8), every vertex in C1 cannot fall into any other cluster, such as {C2} or {C3}.
Furthermore, if a vertex is an outlier under the parameters (0.4,6), then it remains an outlier under Q(0.5,8).
Referring to fig. 8, fig. 8 is a schematic diagram of data to be classified and clusters according to an embodiment of the present invention. The cluster { C1} in FIG. 8 includes 10 data nodes; the cluster { C2} includes 16 data nodes. The vertex of the data to be classified is v11.
Under the parameters (0.4,6), the vertex v11 cannot be assigned to any cluster. When it is considered under (0.5,8), it would in principle be necessary to calculate the similarity between v11 and its neighbors and analyze whether they can be placed in the same cluster. However, because 0.5 is greater than 0.4 and 8 is greater than 6, the similarity between v11 and its neighbors does not need to be calculated; only the similarities between the other vertices in the bridge server and their neighbors need to be calculated to complete the clustering.
Under the parameters (0.4,6), suppose the three clusters C1, C2, C3 are stored on the in-cluster servers C1Server, C2Server and C3Server respectively, while the outliers, bridge nodes and multiple vertices are stored on the bridge node server (Hub server). When the new clustering request (0.5,8) is submitted, since 0.4<0.5 and 6<8, a vertex that is a core vertex under (0.4,6) is still a core vertex under (0.5,8), so the similarities between the vertices in C1, C2, C3 and their neighbors need not be recalculated. Only the similarities between the vertices on the Hub server and their neighbors need to be calculated; if such vertices can be placed in the same cluster, the clusters they belong to are merged into a new cluster.
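Under these assumptions, the re-clustering for Q(0.5,8) touches only the Hub server. The sketch below is illustrative: the per-vertex similarity scores and neighbor counts are made up, not taken from the example.

```python
# Illustrative Hub-server data: vertex -> (best neighbor similarity,
# neighbor count). Scores are made up for this sketch.
hub_vertices = {
    "b1": (0.55, 9),   # bridge node whose cluster can now be merged
    "o1": (0.30, 2),   # outlier under (0.4,6); stays an outlier
}

def recluster_hub(request_eps, request_mu):
    """Re-evaluate only the Hub-server vertices under the new request."""
    joined, still_hub = [], []
    for v, (sim, deg) in hub_vertices.items():
        if sim >= request_eps and deg >= request_mu:
            joined.append(v)       # its clusters are merged into a new one
        else:
            still_hub.append(v)    # remains on the Hub server
    return joined, still_hub

print(recluster_hub(0.5, 8))       # (['b1'], ['o1'])
```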
In the embodiment of the invention, the data is classified by the KNN algorithm; according to the classification of the data, the data similarity threshold and the data clustering threshold, the cluster midpoints in the data are stored in in-cluster servers, and the outliers, bridge nodes and multiple vertices in the data are stored in the bridge server; then, according to the to-be-classified similarity threshold and to-be-classified clustering parameter of the data to be classified in the clustering request, together with the data similarity threshold and the data clustering threshold, data is acquired from an in-cluster server or the bridge server to determine the group to which the data to be classified belongs. Because the data is acquired from the in-cluster servers or the bridge server, the group to which the data to be classified belongs can be determined without repeatedly judging whether the grouping condition is satisfied, which improves the speed and accuracy of data classification.
Specifically, the classification results output by the KNN algorithm, one per cluster, are distributed to the in-cluster servers, while the outliers, bridge nodes and multiple vertices are stored in the bridge server. When a clustering request is submitted, only the submitted clustering parameters need to be compared with the clustering parameters that have already been computed.
If the submitted parameters are larger than the existing clustering parameters, then any vertex in one completed cluster can never be placed in the same cluster as a vertex in another cluster. In this case, only the clustering on the bridge server needs to be considered, and there is no need to read vertex and edge information spread across multiple servers. Meanwhile, the data nodes already stored on the bridge server need not be assigned to any cluster, because a node that is an outlier or bridge node under smaller parameters is still an outlier or bridge node under larger parameters; communication costs are thereby reduced.
In the other case, the submitted parameters are smaller than the existing clustering parameters. Vertices that have already been placed in the same cluster need not be recalculated; only the data nodes stored on the bridge server need to be considered, i.e. whether they can now be assigned to a cluster. If they cannot, they simply remain stored on the bridge server, which again reduces communication costs.
Throughout the clustering process, the main check is whether the clusters represented on the bridge server can be merged. When the bridge server has received all the multiple vertices, it must also check whether any vertices exceed the data similarity threshold; if so, the corresponding clusters need to be merged together. Conversely, if a vertex stored on the bridge server is detected to belong to a certain cluster, the vertex is first moved to the server storing that cluster and then deleted from the bridge server.
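The merge check can be sketched with a small union-find structure; the vertex pairs, similarity scores and names below are illustrative assumptions, not from the embodiment.

```python
# Illustrative merge check on the bridge server via a tiny union-find.
def merge_clusters(pairs, sim, threshold, parent):
    """Union the clusters of every vertex pair whose similarity exceeds
    the data similarity threshold."""
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        if sim[(a, b)] > threshold:
            parent[find(a)] = find(b)       # merge the two clusters
    return {v: find(v) for v in parent}

parent = {"C1": "C1", "C2": "C2", "C3": "C3"}
sim = {("C1", "C2"): 0.7, ("C2", "C3"): 0.2}
roots = merge_clusters(list(sim), sim, 0.4, parent)
print(roots["C1"] == roots["C2"], roots["C1"] == roots["C3"])  # True False
```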
In general, the operations of these steps are performed on a local server; they neither involve all vertices nor are computed on all servers. The worst case is that a previously formed cluster cannot be stored entirely on the same server. In that case the communication cost is still relatively high, but the computation does not span all servers, so a great deal of communication time is saved.
In summary, the graph data is first clustered by the KNN algorithm, the original graph is partitioned according to the clustering result, and each subgraph is assigned to a server, keeping to the principle that vertices in the same cluster are assigned to the same server as far as possible. Graph data here is one form of the data.
At the same time, the outliers, bridge nodes and multiple vertices are stored in the bridge server. When a clustering request arrives, the clustering computation is performed on the in-cluster servers as far as possible; otherwise, the clustering task can be completed quickly by accessing the bridge server.
The scheme of the embodiment of the invention has the following advantages:
(1) KNN algorithm classification
By adopting the optimized KNN algorithm, the whole graph can be divided into a group of clusters, and the data similarity threshold and the data clustering threshold between vertices can be obtained from the clustering result. When a clustering request is submitted, the structural similarity does not need to be calculated again.
(2) Vertices are assigned to different server nodes.
According to the result of the division, it is ensured that vertices in the same cluster are stored on as few server nodes as possible. In this way, the communication costs can be greatly reduced.
(3) Processing cluster requests
When a clustering request is submitted, the vertex relationships between servers need not be considered in most cases; at most, the vertices in the bridge server have to be accessed to complete the clustering. In this way, communication costs are reduced again.
The following illustrates an implementation of enterprise classification using the technical solution in the embodiment of the present invention.
First, the business characteristics on which the enterprise classification is based are introduced. Besides classification according to basic business characteristics, annuity-enterprise characteristics and insurance-enterprise characteristics are preset as newly added industry characteristics. They are used to mine potential annuity and insurance enterprise clients, broaden the coverage of the annuity business, and solve the problem of disordered enterprise classification.
As an example, an enterprise may belong to both the communication and the financial category according to its basic business characteristics. However, because the enterprise shows more characteristics and stronger support in terms of annuities or insurance, it is instead classified as an annuity or insurance enterprise. A subsequent business department can then quickly find, according to the annuity/insurance classification, enterprises that may become major clients, develop them into major clients, and effectively increase revenue.
Classification features of the newly added industry are divided into two main categories: annuity enterprise features and insurance enterprise features.
The annuity enterprise features include:
(1) Enterprise trusted scale: the larger the trusted scale of an enterprise, the more likely it is to develop into an annuity enterprise; the trusted scale is therefore used as an annuity-enterprise classification criterion.
(2) Number of staff: the more employees an enterprise has, the more likely it is to develop into an annuity enterprise later; the employee count is therefore used as an annuity-enterprise classification criterion.
(3) Enterprise support strength: the greater an enterprise's support for annuities, the easier it is for the enterprise to develop into an annuity enterprise. The support strength is computed from three weighted factors, namely whether a welfare plan exists, the attitude of enterprise executives, and the demands of ordinary staff, with the weight ratio set to 3:4:3; this helps mine potential users.
(4) Whether annuity business exists: enterprises with existing annuity business are classified into the annuity-enterprise category by default.
(5) Company registered capital: the larger an enterprise's registered capital, the stronger the company and the more likely it is to develop into an annuity enterprise; the registered capital is therefore used as an annuity-enterprise classification criterion.
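The 3:4:3 weighting of item (3) can be sketched as follows; the sub-scores, their [0, 1] scale and the function name are illustrative assumptions.

```python
# Illustrative 3:4:3 weighting for enterprise support strength.
def support_strength(welfare_plan, executive_attitude, staff_demand):
    """Each sub-score is assumed to lie in [0, 1]; the weights follow
    the 3:4:3 ratio of welfare plan : executive attitude : staff demand."""
    return 0.3 * welfare_plan + 0.4 * executive_attitude + 0.3 * staff_demand

print(support_strength(1.0, 0.5, 0.8))
```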
The insurance enterprise features include:
(1) Average income of enterprise staff: the higher the average income of an enterprise's employees, the more likely they are to purchase insurance to secure their own lives; the average income is therefore included among the insurance-enterprise features.
(2) Enterprise employee education: the higher the employees' education level, the stronger their insurance awareness and independent thinking; the education level is therefore included among the insurance-enterprise features.
(3) Enterprise employee assets: the more assets an enterprise's employees have, the stronger their intention to manage their finances and the more likely they are to purchase insurance to secure themselves; the employees' assets are therefore included among the insurance-enterprise features.
(4) Average age of enterprise staff: the smaller the average age of an enterprise's employees, the more likely they are to develop into customers, the longer the return period for the business, and the longer the company can manage the investment; the average age is therefore included among the insurance-enterprise features.
The enterprise classification is described below in connection with a concrete enterprise.
Define an insuring unit whose name is, say, a certain capital development communication company. Assuming this company exists in the industry-standard column of the insuring organization, find the group of its corresponding classification and use the classification's step size as the x-axis coordinate of its point; for example, the communication classification has a step size of 2. The point's coordinates are found by matching the insuring-unit name; suppose the coordinates are (2, 4). The Euclidean distance between this point and the midpoints of the existing training model is then calculated.
For example, the Euclidean distance between the point and the coordinates (2, 6) is 2, and the Euclidean distance between the point and (3, 5) is about 1.414. After the Euclidean distances between the point and all points in the existing training model have been calculated in the same way, the point coordinates and distances are stored in a HashMap; the values in the HashMap are then stored in a list and sorted. According to the k value of the existing training model, the first 10 sorted point coordinates are taken out and their classes judged, and the majority class among them is taken as the enterprise classification of the point. If an even k produces a tie, the next entry is taken from the sorted list and the class judgment is repeated, so that the enterprise is finally assigned to the majority class among the k+1 entries.
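A minimal sketch of the KNN vote just described, including the tie-breaking step that extends to k+1 neighbors; the training points and labels are illustrative, though (2, 6) and (3, 5) match the distances computed in the text for the query point (2, 4).

```python
import math
from collections import Counter

def knn_classify(point, training, k):
    """training: list of ((x, y), label). Returns the majority label of
    the k nearest points; on a tie, one more neighbor is inspected."""
    ranked = sorted(training, key=lambda t: math.dist(point, t[0]))
    while k <= len(ranked):
        votes = Counter(label for _, label in ranked[:k])
        top = votes.most_common(2)
        if len(top) < 2 or top[0][1] > top[1][1]:
            return top[0][0]
        k += 1                     # tie: extend to k + 1 neighbors
    return top[0][0]

# Illustrative training points and labels.
training = [((2, 6), "telecom"), ((3, 5), "telecom"), ((2, 3), "telecom"),
            ((9, 9), "finance"), ((8, 7), "finance")]
print(knn_classify((2, 4), training, k=3))   # telecom
```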
Then the coordinates of all enterprises in the sample are taken as nodes, the data similarity threshold e is set to 0.6, and the data clustering threshold u is set to the k value. The enterprises in each category are clustered as follows: two vertices are placed in the same cluster when their structural similarity is not smaller than e and the number of their neighbors is not smaller than u. According to the result of the cluster allocation, vertices in the same cluster are guaranteed to be stored on as few server nodes as possible. In this way, the communication cost of accessing the enterprise data is greatly reduced, the speed is higher, and the classification result is more accurate.
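The clustering condition can be sketched as below, assuming a SCAN-style structural similarity over closed neighborhoods (an assumption, since the embodiment does not give the formula); the graph, e = 0.6 and u = 3 are illustrative.

```python
import math

def structural_similarity(graph, a, b):
    """SCAN-style similarity over closed neighborhoods (an assumption)."""
    na, nb = graph[a] | {a}, graph[b] | {b}
    return len(na & nb) / math.sqrt(len(na) * len(nb))

def core_vertices(graph, e, u):
    """Vertices having at least u neighbors with similarity >= e."""
    cores = set()
    for v in graph:
        close = [w for w in graph[v]
                 if structural_similarity(graph, v, w) >= e]
        if len(close) >= u:
            cores.add(v)
    return cores

graph = {                      # illustrative undirected graph
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a"},
}
print(sorted(core_vertices(graph, 0.6, 3)))   # ['a', 'b', 'c', 'd']
```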
In addition, the existing training model is updated against the industry standard at regular intervals and verified again. Suppose that after some time the above company no longer belongs to the communication classification; the system then deploys the updated industry standard on a server, automatically retrains the model, and finds the most suitable k value by cross-validation to continue the classification.
When a new clustering request is submitted, the clustering can be supported by accessing the vertices in the bridge server, without considering the vertex relationships between servers, so that the system keeps updating and learning automatically. If the final classification result is wrong, error information is provided and the case is transferred to manual service.
When a new clustering task is submitted, it can be completed merely by locating, through the bridge server, the servers that need to participate in the computation and by computing a small number of outliers and bridge nodes. Compared with the existing distributed graph clustering algorithms, this distributed graph-computing approach stores vertices on specific server nodes according to the previous clustering results and their characteristics; when a new clustering request is submitted, neither the similarity of all nodes nor the data on every server needs to be computed, which saves a large amount of computation cost as well as communication cost.
Referring to fig. 9, fig. 9 is a schematic diagram of a main structure of a data classification apparatus according to an embodiment of the present invention, where the data classification apparatus may implement a data classification method, as shown in fig. 9, where the data classification apparatus specifically includes:
a dividing module 901, configured to classify data by adopting a KNN algorithm;
a storage module 902, configured to store, according to the classification of the data, the data similarity threshold and the data clustering threshold, the cluster midpoints in the data in in-cluster servers, and to store the outliers in the data, the bridge nodes in the data and the multiple vertices in the data in the bridge server;
The grouping module 903 is configured to obtain data from the servers in the cluster or the bridge server according to a similarity threshold value to be classified and a clustering parameter to be classified of the data to be classified in the clustering request, and the data similarity threshold value and the data clustering threshold value, so as to determine an affiliated group of the data to be classified.
In one embodiment of the present invention, the dividing module 901 is specifically configured to classify the data by adopting a KNN algorithm optimized by an insertion sorting algorithm.
In one embodiment of the present invention, the dividing module 901 is specifically configured to classify the data by adopting a KNN algorithm optimized by a permutation algorithm, with an identifier set for the exchanged data.
In one embodiment of the present invention, the storage module 902 is specifically configured to: according to the classification of the data, the data similarity threshold and the data clustering threshold, take the data that is smaller than the data similarity threshold and equal to the data clustering threshold as cluster midpoints, and store the cluster midpoints in in-cluster servers;
according to the classification of the data, the data similarity threshold and the data clustering threshold, take the data that is larger than the data similarity threshold and larger than the data clustering threshold as outliers, and store the outliers in the bridge server;
according to the classification of the data, the data similarity threshold and the data clustering threshold, take the data that is equal to the data similarity threshold, equal to the data clustering threshold and belongs to only one cluster as bridge nodes, and store the bridge nodes in the bridge server;
and according to the classification of the data, the data similarity threshold and the data clustering threshold, take the data that is equal to the data similarity threshold, equal to the data clustering threshold and belongs to more than one cluster as multiple vertices, and store the multiple vertices in the bridge server.
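The four storage rules above can be sketched as a single routing function; the comparisons follow the text literally, and all names and default threshold values are illustrative assumptions.

```python
# Illustrative routing of a vertex to its storage destination.
def classify_for_storage(sim, cluster_value, n_clusters,
                         sim_threshold=0.4, cluster_threshold=6):
    """Apply the four storage rules; returns the vertex category."""
    if sim < sim_threshold and cluster_value == cluster_threshold:
        return "cluster midpoint"      # stored on an in-cluster server
    if sim > sim_threshold and cluster_value > cluster_threshold:
        return "outlier"               # stored on the bridge server
    if sim == sim_threshold and cluster_value == cluster_threshold:
        if n_clusters > 1:
            return "multiple vertex"   # bridge server
        return "bridge node"           # bridge server
    return "undetermined"

print(classify_for_storage(0.3, 6, 1))   # cluster midpoint
print(classify_for_storage(0.5, 8, 0))   # outlier
print(classify_for_storage(0.4, 6, 3))   # multiple vertex
```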
In one embodiment of the present invention, the grouping module 903 is specifically configured to obtain data from the server in the cluster if the similarity threshold to be classified of the data to be classified is less than or equal to the data similarity threshold and the clustering parameter to be classified of the data to be classified is less than or equal to the data clustering threshold, so as to determine the group of the data to be classified;
And if the similarity threshold value to be classified of the data to be classified is larger than the data similarity threshold value and the clustering parameter to be classified of the data to be classified is larger than the data clustering threshold value, acquiring the data from the bridging server to determine the group of the data to be classified.
In one embodiment of the present invention, the grouping module 903 is specifically configured to obtain data from the bridge server if the similarity threshold to be classified of the data to be classified is greater than the data similarity threshold and the clustering parameter to be classified of the data to be classified is greater than the data clustering threshold, so as to determine that the data to be classified and the obtained data belong to the same group;
and storing the data to be classified and the acquired data in the server in the cluster.
In one embodiment of the invention, the data includes enterprise classification data.
Fig. 10 illustrates an exemplary system architecture 1000 of a method of data classification or apparatus of data classification to which embodiments of the invention may be applied.
As shown in fig. 10, a system architecture 1000 may include terminal devices 1001, 1002, 1003, a network 1004, and a server 1005. The network 1004 serves as a medium for providing a communication link between the terminal apparatuses 1001, 1002, 1003 and the server 1005. The network 1004 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user can interact with a server 1005 via a network 1004 using terminal apparatuses 1001, 1002, 1003 to receive or transmit messages or the like. Various communication client applications such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 1001, 1002, 1003.
The terminal devices 1001, 1002, 1003 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 1005 may be a server providing various services, such as a background management server (merely an example) providing support for shopping-type websites browsed by the user using the terminal apparatuses 1001, 1002, 1003. The background management server may analyze and process the received data such as the product information query request, and feedback the processing result (e.g., the target push information, the product information—only an example) to the terminal device.
It should be noted that, the method for classifying data provided in the embodiment of the present invention is generally executed by the server 1005, and accordingly, the device for classifying data is generally disposed in the server 1005.
It should be understood that the number of terminal devices, networks and servers in fig. 10 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 11, there is illustrated a schematic diagram of a computer system 1100 suitable for use in implementing the terminal device of an embodiment of the present invention. The terminal device shown in fig. 11 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 11, the computer system 1100 includes a Central Processing Unit (CPU) 1101, which can execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1102 or a program loaded from a storage section 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data required for the operation of the system 1100 are also stored. The CPU 1101, ROM 1102, and RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to bus 1104.
The following components are connected to the I/O interface 1105: an input section 1106 including a keyboard, a mouse, and the like; an output portion 1107 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section 1108 including a hard disk or the like; and a communication section 1109 including a network interface card such as a LAN card, a modem, and the like. The communication section 1109 performs communication processing via a network such as the internet. The drive 1110 is also connected to the I/O interface 1105 as needed. Removable media 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed as needed in drive 1110, so that a computer program read therefrom is installed as needed in storage section 1108.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1109, and/or installed from the removable media 1111. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 1101.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, for example described as: a processor comprising a division module, a storage module and a grouping module.
The names of these modules do not, in some cases, constitute a limitation on the modules themselves; for example, the transmitting unit may also be described as "a unit that transmits a picture acquisition request to a connected server".
As another aspect, the present invention also provides a computer readable medium, which may be contained in the apparatus described in the above embodiments, or may exist alone without being assembled into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to:
classify data using a KNN algorithm;
according to the classification of the data, a data similarity threshold and a data clustering threshold, store in-cluster points of the data in a cluster server, and store outlier points of the data, bridge nodes of the data and multi-vertex points of the data in a bridging server;
acquire data from the cluster server or the bridging server according to a to-be-classified similarity threshold and a to-be-classified clustering parameter of the data to be classified in a clustering request, together with the data similarity threshold and the data clustering threshold, so as to determine the group to which the data to be classified belongs.
According to the technical solution of the embodiments of the present invention, data are classified using a KNN algorithm; according to the classification of the data, a data similarity threshold and a data clustering threshold, in-cluster points of the data are stored in a cluster server, and outlier points of the data, bridge nodes of the data and multi-vertex points of the data are stored in a bridging server; and data are acquired from the cluster server or the bridging server according to the to-be-classified similarity threshold and the to-be-classified clustering parameter of the data to be classified in a clustering request, together with the data similarity threshold and the data clustering threshold, so as to determine the group to which the data to be classified belongs. By acquiring data directly from the cluster server or the bridging server, the group of the data to be classified can be determined without repeatedly judging whether a grouping condition is satisfied, so that the speed and accuracy of data classification can be improved.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (9)

1. A method of data classification, comprising:
classifying data using a KNN algorithm;
according to the classification of the data, a data similarity threshold and a data clustering threshold, storing in-cluster points of the data in a cluster server, and storing outlier points of the data, bridge nodes of the data and multi-vertex points of the data in a bridging server;
acquiring data from the cluster server or the bridging server according to a to-be-classified similarity threshold and a to-be-classified clustering parameter of the data to be classified in a clustering request, together with the data similarity threshold and the data clustering threshold, so as to determine the group to which the data to be classified belongs;
wherein storing in-cluster points of the data in the cluster server, and storing outlier points of the data, bridge nodes of the data and multi-vertex points of the data in the bridging server according to the classification of the data, the data similarity threshold and the data clustering threshold comprises:
according to the classification of the data, the data similarity threshold and the data clustering threshold, taking the data whose similarity is smaller than the data similarity threshold and whose clustering parameter is equal to the data clustering threshold as in-cluster points, and storing the in-cluster points in the cluster server;
according to the data similarity threshold and the data clustering threshold, taking the data whose similarity is larger than the data similarity threshold and whose clustering parameter is larger than the data clustering threshold as outlier points, and storing the outlier points in the bridging server;
according to the classification of the data, the data similarity threshold and the data clustering threshold, taking the data that is equal to the data similarity threshold and the data clustering threshold and belongs to only one cluster as bridge nodes, and storing the bridge nodes in the bridging server;
and according to the classification of the data, the data similarity threshold and the data clustering threshold, taking the data that is equal to the data similarity threshold and the data clustering threshold and belongs to a plurality of clusters as multi-vertex points, and storing the multi-vertex points in the bridging server.
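The four storage rules above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the per-point similarity and clustering values, the dict layout, and the two in-memory lists standing in for the cluster server and the bridging server are all assumed names.

```python
def partition_points(points, sim_threshold, cluster_threshold):
    """Route each classified point to the cluster store or the bridge store.

    Each point is a dict carrying its similarity value, its clustering
    value, and the set of clusters the KNN step assigned it to.
    """
    cluster_store, bridge_store = [], []
    for p in points:
        sim, clu = p["similarity"], p["clustering"]
        if sim < sim_threshold and clu == cluster_threshold:
            cluster_store.append(p)        # in-cluster point
        elif sim > sim_threshold and clu > cluster_threshold:
            bridge_store.append(p)         # outlier point
        elif sim == sim_threshold and clu == cluster_threshold:
            # one assigned cluster -> bridge node; several assigned
            # clusters -> multi-vertex point; both kinds are kept on
            # the bridging server
            bridge_store.append(p)
    return cluster_store, bridge_store
```

Combinations not named by the rules are simply not stored, mirroring the enumeration in the claim.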
2. The method of data classification according to claim 1, wherein classifying data using a KNN algorithm comprises:
classifying the data using a KNN algorithm optimized by an insertion sort algorithm.
3. The method of data classification according to claim 1, wherein classifying data using a KNN algorithm comprises:
classifying the data using a KNN algorithm optimized by an insertion sort algorithm, with an identifier set for exchanged data.
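Claims 2 and 3 leave the insertion-sort optimisation abstract. One common reading, sketched below under that assumption, is to keep the running list of k nearest neighbours ordered by an insertion-sort step, so most candidates are rejected after comparing against the current farthest neighbour. The Euclidean distance and the data layout are illustrative choices, not details from the patent.

```python
import math

def knn_classify(query, labeled_points, k=3):
    """Return the majority label among the k nearest labeled points.

    `labeled_points` is a list of ((x, y), label) pairs; the k-best list
    is maintained in ascending distance order via an insertion sort step.
    """
    best = []  # (distance, label), kept sorted ascending by distance
    for (x, y), label in labeled_points:
        d = math.dist(query, (x, y))
        # insertion sort step: slide left to find the slot for d
        i = len(best)
        while i > 0 and best[i - 1][0] > d:
            i -= 1
        best.insert(i, (d, label))
        if len(best) > k:
            best.pop()  # drop the current farthest neighbour
    labels = [label for _, label in best]
    return max(set(labels), key=labels.count)
```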
4. The method according to claim 1, wherein acquiring data from the cluster server or the bridging server according to the to-be-classified similarity threshold and the to-be-classified clustering parameter of the data to be classified in the clustering request, together with the data similarity threshold and the data clustering threshold, so as to determine the group of the data to be classified comprises:
if the to-be-classified similarity threshold of the data to be classified is smaller than or equal to the data similarity threshold, and the to-be-classified clustering parameter of the data to be classified is smaller than or equal to the data clustering threshold, acquiring data from the cluster server to determine the group of the data to be classified;
and if the to-be-classified similarity threshold of the data to be classified is larger than the data similarity threshold, and the to-be-classified clustering parameter of the data to be classified is larger than the data clustering threshold, acquiring data from the bridging server to determine the group of the data to be classified.
5. The method according to claim 4, wherein acquiring data from the bridging server to determine the group of the data to be classified if the to-be-classified similarity threshold is larger than the data similarity threshold and the to-be-classified clustering parameter of the data to be classified is larger than the data clustering threshold comprises:
if the to-be-classified similarity threshold of the data to be classified is larger than the data similarity threshold, and the to-be-classified clustering parameter of the data to be classified is larger than the data clustering threshold, acquiring data from the bridging server so that the data to be classified and the acquired data belong to the same group;
and storing the data to be classified and the acquired data in the cluster server.
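The lookup in claims 4 and 5 can be sketched as below. The `fetch`/`save` interface on the two store objects is a hypothetical stand-in for the cluster server and the bridging server; the patent does not specify this API.

```python
def resolve_group(request, sim_threshold, cluster_threshold,
                  cluster_store, bridge_store):
    """Pick the store to query and return the group of the matched data."""
    sim, clu = request["similarity"], request["clustering"]
    if sim <= sim_threshold and clu <= cluster_threshold:
        # claim 4, first branch: look up on the cluster server
        match = cluster_store.fetch(request)
    elif sim > sim_threshold and clu > cluster_threshold:
        # claim 4, second branch: look up on the bridging server
        match = bridge_store.fetch(request)
        # claim 5: the new point joins the matched group, and both
        # are written back to the cluster server
        cluster_store.save(request, match)
    else:
        return None  # combination not covered by claim 4
    return match["group"]
```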
6. The method of data classification as claimed in claim 1, wherein the data comprises enterprise classification data.
7. An apparatus for data classification, comprising:
a division module configured to classify data using a KNN algorithm;
a storage module configured to store, according to the classification of the data, a data similarity threshold and a data clustering threshold, in-cluster points of the data in a cluster server, and to store outlier points of the data, bridge nodes of the data and multi-vertex points of the data in a bridging server;
wherein, according to the classification of the data, the data similarity threshold and the data clustering threshold, the data whose similarity is smaller than the data similarity threshold and whose clustering parameter is equal to the data clustering threshold is taken as in-cluster points, and the in-cluster points are stored in the cluster server;
according to the data similarity threshold and the data clustering threshold, the data whose similarity is larger than the data similarity threshold and whose clustering parameter is larger than the data clustering threshold is taken as outlier points, and the outlier points are stored in the bridging server;
according to the classification of the data, the data similarity threshold and the data clustering threshold, the data that is equal to the data similarity threshold and the data clustering threshold and belongs to only one cluster is taken as bridge nodes, and the bridge nodes are stored in the bridging server;
according to the classification of the data, the data similarity threshold and the data clustering threshold, the data that is equal to the data similarity threshold and the data clustering threshold and belongs to a plurality of clusters is taken as multi-vertex points, and the multi-vertex points are stored in the bridging server;
and a grouping module configured to acquire data from the cluster server or the bridging server according to a to-be-classified similarity threshold and a to-be-classified clustering parameter of the data to be classified in a clustering request, together with the data similarity threshold and the data clustering threshold, so as to determine the group of the data to be classified.
8. An electronic device for data classification, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
9. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-6.
CN202111593409.5A 2021-12-23 2021-12-23 Method, apparatus, device and computer readable medium for data classification Active CN114298203B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111593409.5A CN114298203B (en) 2021-12-23 2021-12-23 Method, apparatus, device and computer readable medium for data classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111593409.5A CN114298203B (en) 2021-12-23 2021-12-23 Method, apparatus, device and computer readable medium for data classification

Publications (2)

Publication Number Publication Date
CN114298203A CN114298203A (en) 2022-04-08
CN114298203B true CN114298203B (en) 2024-10-25

Family

ID=80969931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111593409.5A Active CN114298203B (en) 2021-12-23 2021-12-23 Method, apparatus, device and computer readable medium for data classification

Country Status (1)

Country Link
CN (1) CN114298203B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115114978B (en) * 2022-06-20 2025-03-25 泰康保险集团股份有限公司 Enterprise classification method, device, electronic device and readable medium

Citations (2)

Publication number Priority date Publication date Assignee Title
CN108764319A (en) * 2018-05-21 2018-11-06 北京京东尚科信息技术有限公司 A kind of sample classification method and apparatus
CN113673550A (en) * 2021-06-30 2021-11-19 浙江大华技术股份有限公司 Clustering method, clustering device, electronic equipment and computer-readable storage medium

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN109815788B (en) * 2018-12-11 2024-05-31 平安科技(深圳)有限公司 Image clustering method, device, storage medium and terminal device
CN114911935A (en) * 2019-05-17 2022-08-16 爱酷赛股份有限公司 Cluster analysis method, cluster analysis system, and cluster analysis program
CN112307848B (en) * 2019-08-01 2024-04-30 惠普发展公司,有限责任合伙企业 Detecting spoofed speakers in video conferencing
CN110232373B (en) * 2019-08-12 2020-01-03 佳都新太科技股份有限公司 Face clustering method, device, equipment and storage medium
CN113807458A (en) * 2021-09-27 2021-12-17 北京臻观数智科技有限公司 Method for improving face clustering result based on space-time and group information

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN108764319A (en) * 2018-05-21 2018-11-06 北京京东尚科信息技术有限公司 A kind of sample classification method and apparatus
CN113673550A (en) * 2021-06-30 2021-11-19 浙江大华技术股份有限公司 Clustering method, clustering device, electronic equipment and computer-readable storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant