CN104731916A - Optimizing initial center K-means clustering method based on density in data mining - Google Patents
- Publication number
- CN104731916A CN104731916A CN201510131975.2A CN201510131975A CN104731916A CN 104731916 A CN104731916 A CN 104731916A CN 201510131975 A CN201510131975 A CN 201510131975A CN 104731916 A CN104731916 A CN 104731916A
- Authority
- CN
- China
- Prior art keywords
- density
- data object
- data
- cluster
- initial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Abstract
The invention relates to a K-means clustering method for data mining that optimizes the initial centers based on density. The method comprises the following steps: step 1, a required data set is given and the cluster number K is determined; step 2, the density of every data object in the data set is calculated, and the average density of the data set is calculated from the obtained densities; step 3, the minimum density distance value of each data object in the data set is calculated; step 4, the minimum density distance values of the data objects are sorted in descending order and, according to the determined cluster number K, the data objects corresponding to the first K minimum density distance values whose density is greater than the average density are selected as the initial cluster centers; step 5, starting from the obtained initial cluster centers, the data set is clustered with the K-means clustering method until the clustering result is output. The method reduces computational complexity, improves classification accuracy, is highly stable, and converges quickly.
Description
Technical field
The present invention relates to a clustering method, in particular to a density-based K-means clustering method with optimized initial centers for data mining, and belongs to the technical field of cluster analysis.
Background technology
Data mining is one of the hot topics of current computer research. Cluster analysis, an unsupervised machine learning method, studies how to automatically divide a set of data objects into different clusters so that, under a given criterion, objects in the same cluster have high similarity while objects in different clusters have low similarity. Cluster analysis is widely applied in frontier disciplines such as machine learning, data mining, speech recognition, image segmentation, business analysis, and bioinformatics. Traditional clustering algorithms fall into five main classes: partition-based, hierarchical, density-based, grid-based, and model-based.
Among clustering algorithms, the K-means algorithm belongs to the partition-based class and is known for being concise, fast, and efficient. However, the original K-means algorithm has several defects: 1) it requires the user to supply the value K, i.e. the number of clusters, which is usually chosen by experience, so determining K correctly is difficult; 2) it is sensitive to the initial cluster centers, and the quality of the chosen initial centers affects both the clustering result and the efficiency of the algorithm; 3) it is sensitive to abnormal data, which can trap the result in a locally optimal solution.
At present, some scholars have improved on the initial-center problem; to keep the result from falling into a local optimum, widely dispersed points far from one another are usually chosen as initial center points. If only the distance factor is considered, however, outliers are easily chosen, which harms the clustering effect. Other scholars therefore filter out outliers from the angle of density. A remaining problem is that initial center points may still be chosen inside the same cluster: even if a point's density is relatively large, once some point of the corresponding cluster has already been selected as a center point, a representative point in another class should be selected instead; otherwise the result again tends to fall into a locally optimal solution.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a density-based K-means clustering method with optimized initial centers for data mining; it reduces computational complexity, improves classification accuracy, is highly stable, and converges quickly.
According to the technical scheme provided by the invention, the density-based K-means clustering method with optimized initial centers for data mining comprises the following steps:
Step 1: specify the required data set and determine the cluster number K.
Step 2: calculate the density of every data object in the data set, and calculate the average density of the data set from the obtained densities.
Step 3: calculate the minimum density distance value of each data object in the data set.
Step 4: sort the minimum density distance values of the data objects in descending order and, according to the determined cluster number K, select as the initial cluster centers the data objects corresponding to the first K minimum density distance values whose density is greater than the average density.
Step 5: starting from the initial cluster centers obtained above, cluster the data set with the K-means clustering method and output the clustering result.
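The five steps above can be sketched end to end. The sketch below is a minimal illustration, not the patent's reference implementation; the function and variable names are our own, and the neighbourhood radius R is taken as a given parameter:

```python
import math

def density_based_centers(X, K, R):
    """Steps 2-4 of the method: densities, average density, minimum
    density distance values, then the first K high-density objects."""
    n = len(X)
    # Step 2: density = number of objects strictly inside the R-neighbourhood
    dens = [sum(1 for xj in X if 0 < math.dist(xi, xj) < R) for xi in X]
    mean_dens = sum(dens) / n
    # Step 3: minimum density distance value of every object
    mdd = []
    for i in range(n):
        d_to = [math.dist(X[i], xj) for xj in X]
        higher = [d_to[j] for j in range(n) if dens[j] > dens[i]]
        mdd.append(min(higher) if higher else max(d_to))
    # Step 4: sort descending by MDD, keep only objects denser than average
    order = sorted(range(n), key=mdd.__getitem__, reverse=True)
    return [X[i] for i in order if dens[i] > mean_dens][:K]
```

The centers returned here then seed an ordinary K-means run (step 5).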
Step 5 comprises the following steps:
Step 5.1: according to the selected initial cluster centers, assign each data object in the data set to its nearest initial cluster center, and calculate the error sum of squares of the data objects over the K clusters to obtain the initial error sum of squares.
Step 5.2: after the data objects have been assigned to their nearest initial cluster centers, calculate the cluster center of each of the K clusters to obtain the revised cluster centers.
Step 5.3: according to the revised cluster centers, determine the error sum of squares of the data objects over the K clusters to obtain the revised error sum of squares.
Step 5.4: when the difference between the revised and the initial error sum of squares does not meet the convergence condition, take the revised cluster centers as the new initial cluster centers and repeat the above steps until the difference between the revised and the initial error sum of squares meets the convergence condition.
For a data set X = {x_i | i = 1, 2, ..., n} whose data objects have m-dimensional features, the density of a data object x_i is the number of data objects inside its neighbourhood:

density(x_i) = |{x_j ∈ X : 0 < d(x_i, x_j) < R}|

where d(x_i, x_j) is the Euclidean distance between data objects x_i and x_j, and R is the neighbourhood radius of x_i.
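Under this definition the density is simply a neighbour count; a minimal sketch (the helper name is our own):

```python
import math

def density(X, i, R):
    """Number of data objects strictly inside x_i's R-neighbourhood,
    excluding x_i itself (0 < d(x_i, x_j) < R)."""
    return sum(1 for xj in X if 0 < math.dist(X[i], xj) < R)
```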
For a data object x_i, compute the distances from x_i to every data object whose density is larger than that of x_i; the minimum density distance value of x_i is then the minimum of these distances. When x_i is the data object with the maximum density, its minimum density distance value is instead the maximum distance between x_i and any data object in the data set.
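This definition translates directly into code; a sketch assuming the densities have already been computed (names are our own):

```python
import math

def min_density_distance(X, dens, i):
    """Distance from x_i to its nearest strictly-higher-density object;
    for the densest object, the farthest distance in the data set."""
    d_to = [math.dist(X[i], xj) for xj in X]
    higher = [d_to[j] for j in range(len(X)) if dens[j] > dens[i]]
    return min(higher) if higher else max(d_to)
```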
Advantages of the invention: considering the overall characteristics of a class, ordinary points inside a class are distinguished from its maximum-density point. Point densities within the same class are correlated, and this correlation is reflected in the minimum distance from a point to any point of higher density: for an ordinary point, the point realizing this distance necessarily lies in the same cluster, so the distance value is small; for the center point of a class, i.e. the maximum-density point of the class, the point realizing this distance is a high-density point of another class, so the distance value is large. Ordinary points and center points can thus be distinguished. The distance value of an outlier is also large, but outliers are filtered out by the average density. In this way high-quality initial cluster center points are screened out, and the K-means clustering method is finally applied. Simulation results show that, compared with the existing K-means method, the invention achieves higher accuracy and fewer iterations and can converge quickly.
Brief description of the drawings
Fig. 1 is a schematic diagram of cluster center points selected at random.
Fig. 2 is a schematic diagram of the initial cluster center points obtained by the invention.
Fig. 3 shows the distribution of the data set containing 240 data objects.
Fig. 4 is the flow chart of the invention.
Embodiment
The invention is further described below in conjunction with the specific drawings and embodiments.
As shown in Fig. 4, in order to improve classification accuracy, stability, and convergence speed, the clustering method of the invention comprises the following steps:
Step 1: specify the required data set and determine the cluster number K.
In the embodiment of the invention, for a data set X = {x_i | i = 1, 2, ..., n} whose data objects have m-dimensional features, C_j (j = 1, 2, ..., K) denotes the K clusters and c_j (j = 1, 2, ..., K) denotes the initial cluster centers.
Step 2: calculate the density of every data object in the data set, and calculate the average density of the data set from the obtained densities.
In the embodiment of the invention, the density of a data object x_i in the data set is

density(x_i) = |{x_j ∈ X : 0 < d(x_i, x_j) < R}|

where d(x_i, x_j) is the Euclidean distance between data objects x_i and x_j, and R is the neighbourhood radius of x_i. The neighbourhood radius is obtained as follows: compute the distances between all data objects, sort them in ascending order, and take as R the value at position num × percent, where num is the number of distances between data objects (uniquely determined by the given data set) and percent is a coverage percentage; in a specific implementation, a percent of 1%-2% works well. For any data object x_i, the circular region centered on x_i with radius R is called the neighbourhood of x_i, i.e. neighbourhood = {x | 0 < d(x, x_i) < R}. The average density of the data set is the arithmetic mean of the densities of all data objects; the concrete computation is well known to those skilled in the art and is not repeated here.
Step 3: calculate the minimum density distance value of each data object in the data set.
In the embodiment of the invention, for a data object x_i, compute the distances from x_i to every data object whose density is larger than that of x_i; the minimum density distance value of x_i is then the minimum of these distances. When x_i is the data object with the maximum density, its minimum density distance value is instead the maximum distance between x_i and any data object in the data set.
Step 4: sort the minimum density distance values of the data objects in descending order and, according to the determined cluster number K, select as the initial cluster centers the data objects corresponding to the first K minimum density distance values whose density is greater than the average density.
In the embodiment of the invention, the minimum density distance values of the data objects are sorted in descending order, and the data objects corresponding to the first K minimum density distance values whose density is greater than the average density are chosen as the initial cluster centers, giving c_j (j = 1, 2, ..., K).
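Given the density and minimum-density-distance arrays, the selection in step 4 is a sort and a filter; a minimal sketch (names are our own):

```python
def choose_initial_centers(X, dens, mdd, K):
    """Descending sort by minimum density distance value; keep the first
    K objects whose density exceeds the average density."""
    mean_dens = sum(dens) / len(dens)
    order = sorted(range(len(X)), key=mdd.__getitem__, reverse=True)
    return [X[i] for i in order if dens[i] > mean_dens][:K]
```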
Step 5: starting from the initial cluster centers obtained above, cluster the data set with the K-means clustering method and output the clustering result.
In the embodiment of the invention, clustering the data set with the K-means clustering method is well known to those skilled in the art; in particular, step 5 comprises the following steps:
Step 5.1: according to the selected initial cluster centers, assign each data object in the data set to its nearest initial cluster center, and calculate the error sum of squares of the data objects over the K clusters to obtain the initial error sum of squares.
The initial error sum of squares is computed as

E = Σ_{i=1}^{K} Σ_{t=1}^{n_i} d(x_it, c_i)²

where x_it is the t-th data object of the i-th cluster and n_i is the number of data objects in the i-th cluster. The smaller the initial error sum of squares E, the higher the similarity within the clusters; conversely, the larger E, the lower the similarity.
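As a sketch, the error sum of squares over an explicit cluster assignment (names are our own):

```python
import math

def sum_of_squared_error(clusters, centers):
    """E = sum over all clusters of the squared distances from each
    data object to its cluster centre."""
    return sum(math.dist(x, c) ** 2
               for cluster, c in zip(clusters, centers)
               for x in cluster)
```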
Step 5.2: after the data objects have been assigned to their nearest initial cluster centers, calculate the cluster center of each of the K clusters to obtain the revised cluster centers.
In the embodiment of the invention, the revised cluster centers are obtained by

c_j = (1 / n_j) Σ_{x ∈ C_j} x,  j = 1, 2, ..., K

where n_j is the number of data objects in the j-th cluster.
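The revised centre of each cluster is the component-wise mean of its members; a minimal sketch:

```python
def revise_centers(clusters):
    """New centre c_j = component-wise mean of cluster C_j's objects."""
    return [tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            for cluster in clusters]
```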
Step 5.3: according to the revised cluster centers, determine the error sum of squares of the data objects over the K clusters to obtain the revised error sum of squares.
In the embodiment of the invention, the revised error sum of squares is computed in the same way as the initial error sum of squares and is not repeated here.
Step 5.4: when the difference between the revised and the initial error sum of squares does not meet the convergence condition, take the revised cluster centers as the new initial cluster centers and repeat the above steps until the difference between the revised and the initial error sum of squares meets the convergence condition.
In a specific implementation the convergence condition can be determined from the data set; usually, when the difference between the revised and the initial error sum of squares is less than a fixed value, or stabilizes at a fixed value, the convergence condition is considered satisfied.
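Steps 5.1 to 5.4 together form the standard Lloyd iteration; below is a minimal sketch in which a hypothetical tolerance `tol` stands in for the convergence condition (function name and defaults are our own):

```python
import math

def kmeans_iterate(X, centers, tol=1e-6, max_iter=100):
    """Assign objects, revise centres, and stop once the change in the
    error sum of squares falls below tol."""
    prev_e = float("inf")
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for x in X:  # steps 5.1/5.2: nearest-centre assignment
            j = min(range(len(centers)), key=lambda k: math.dist(x, centers[k]))
            clusters[j].append(x)
        # revised centres; an empty cluster keeps its old centre
        centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
        e = sum(math.dist(x, centers[j]) ** 2  # step 5.3: revised SSE
                for j, cl in enumerate(clusters) for x in cl)
        if abs(prev_e - e) < tol:  # step 5.4: convergence condition
            break
        prev_e = e
    return centers, clusters
```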
To verify the validity of the clustering method of the invention, the proposed method and the traditional random-center method were run under the same experimental environment and their performance was compared. Experimental environment: operating system Mac OS X 10.10.1, software Matlab 2014a and Eclipse, hardware Intel Core i7 2.4 GHz with 8 GB of memory.
First, the algorithm's ability to choose representative center points was tested.
A self-defined data set with 6 classes was adopted; it tests the validity of the center-point selection algorithm in choosing a representative point for each class on an irregular data set, and its ability to distinguish ordinary high-density points from representative center points. The data distribution is shown in Fig. 1 and Fig. 2: Fig. 1 shows the center points obtained by random selection, and Fig. 2 shows the center points obtained by the clustering method of the invention. Asterisks are the center points picked by the algorithm; circles are ordinary points.
As can be seen from Fig. 1, the center points obtained by the existing random selection method are unstable and not very representative; ordinary high-density points and effective center points are not distinguished, and some center points lie in the same cluster. In Fig. 2, the obtained center points are distributed one per class, a representative center point is chosen for each class, effective center points are well distinguished from ordinary high-density points, and the phenomenon of center points being chosen inside the same cluster is avoided. Moreover, during center-point selection, outliers are filtered out by the average density, laying a good foundation for the subsequent steps of the algorithm.
Second, the accuracy of the algorithm's center-point selection was tested.
The test data set is Iris from the UCI database (3 classes, 4 attributes, 150 objects). By consultation, the actual centers of the Iris data set are (5.00, 3.42, 1.46, 0.24), (6.58, 2.97, 5.55, 2.02), and (5.93, 2.77, 4.26, 1.32). For the algorithm parameter percent, the clustering effect is better when percent = 2%. The experimental results of the clustering method of the invention and of the existing random selection method, compared with the real Iris data centers, are shown in Table 1.
Table 1: comparison with the original Iris data set centers
As can be seen from Table 1, the center points obtained by the clustering method of the invention have a smaller standard error with respect to the original Iris data set centers and are closer to the actual centers; the accuracy is high and stable, and the points are more representative, showing that the algorithm is effective on the original Iris data set.
Third, the clustering effect of the algorithm was tested.
The test data are the raw Iris, Wine, and Glass Identification data sets from the UCI database. The clustering effect is measured by the error sum of squares E: the smaller E, the closer the objects in a class are to their cluster center and the better the clustering effect; conversely, the larger E, the worse the effect. Each of the two clustering methods was run 20 times on each data set and the resulting E values were averaged, as shown in Table 2. The table shows that the mean error sum of squares E of the clustering method of the invention is lower than that of the random algorithm, indicating that the invention's improvement of the K-means initial center points improves the clustering effect.
Table 2: comparison of the clustering effects of the two algorithms
Finally, the running time and iteration count of the algorithm were tested.
The K-means algorithm suits data sets with a convex structure. The experimental data were generated with mean vectors (1.0, 1.0), (3.3, 3.3), (-3.25, 3.25), (-3.25, -2.25), and (3.25, -2.25) and corresponding covariance matrices [0.9 0; 0 0.3], [0.7 0; 0 0.5], [1.5 0; 0 0.9], [0.3 0; 0 1.3], and [1.1 0; 0 0.7], producing two-dimensional normally distributed data sets of 240, 1100, 4500, and 11000 points. The distribution of the 240-point data set is shown in Fig. 3.
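Because the covariance matrices above are diagonal, each coordinate can be sampled independently; a sketch of the data generation (the seed and helper name are our own, and `random.gauss` takes a standard deviation, hence the square roots of the diagonal variances):

```python
import random

def make_gaussian_blobs(means, variances, n_per_cluster, seed=0):
    """Sample 2-D normal clusters with diagonal covariances."""
    rng = random.Random(seed)
    pts = []
    for (mx, my), (vx, vy) in zip(means, variances):
        pts += [(rng.gauss(mx, vx ** 0.5), rng.gauss(my, vy ** 0.5))
                for _ in range(n_per_cluster)]
    return pts

means = [(1.0, 1.0), (3.3, 3.3), (-3.25, 3.25), (-3.25, -2.25), (3.25, -2.25)]
variances = [(0.9, 0.3), (0.7, 0.5), (1.5, 0.9), (0.3, 1.3), (1.1, 0.7)]
data = make_gaussian_blobs(means, variances, 48)  # 5 x 48 = 240 points
```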
High-quality initial center points help reduce the running time and the iteration count of the K-means algorithm. Each of the two initial-center generation algorithms, combined with K-means, was run 20 times on each data set; the average running time and average iteration count for the different data volumes are shown in Table 3.
Table 3: running-time comparison of the two methods for different data volumes
As the number of data objects grows, the running time of both methods increases, but the clustering method of the invention takes the least time. As the data volume increases, the number of iterations used by the clustering method of the invention is clearly smaller than that of the random-center algorithm. Compared with traditional random center selection, the clustering method of the invention spends extra time computing the minimum density distance (MDD) when selecting center points, yet its actual running time is smaller than that of the traditional K-means algorithm. The reason is that the chosen initial cluster center points are highly representative and of high quality, which reduces the algorithm's iteration count and saves time in the subsequent steps; the initial cluster center points of the invention therefore let the K-means algorithm converge quickly and improve its operating efficiency.
The choice of initial cluster centers has a large influence on the clustering effect of the K-means algorithm; a bad choice makes the clustering result unstable and prone to falling into a locally optimal solution. On the basis of the computed data-object densities, the invention further computes the minimum density distance, thereby filtering out points that are dense but not representative; outliers are filtered out by the average density, so representative initial cluster center points are selected for the data set and used as one of the initial conditions of the K-means algorithm. The experimental results show that the clustering method of the invention helps the K-means algorithm converge quickly, improves its operating efficiency and accuracy, and improves the clustering effect, verifying the validity of the algorithm.
Claims (4)
1. A density-based K-means clustering method with optimized initial centers for data mining, characterized in that the clustering method comprises the following steps:
Step 1: specify the required data set and determine the cluster number K.
Step 2: calculate the density of every data object in the data set, and calculate the average density of the data set from the obtained densities.
Step 3: calculate the minimum density distance value of each data object in the data set.
Step 4: sort the minimum density distance values of the data objects in descending order and, according to the determined cluster number K, select as the initial cluster centers the data objects corresponding to the first K minimum density distance values whose density is greater than the average density.
Step 5: starting from the initial cluster centers obtained above, cluster the data set with the K-means clustering method and output the clustering result.
2. The density-based K-means clustering method with optimized initial centers for data mining according to claim 1, characterized in that step 5 comprises the following steps:
Step 5.1: according to the selected initial cluster centers, assign each data object in the data set to its nearest initial cluster center, and calculate the error sum of squares of the data objects over the K clusters to obtain the initial error sum of squares.
Step 5.2: after the data objects have been assigned to their nearest initial cluster centers, calculate the cluster center of each of the K clusters to obtain the revised cluster centers.
Step 5.3: according to the revised cluster centers, determine the error sum of squares of the data objects over the K clusters to obtain the revised error sum of squares.
Step 5.4: when the difference between the revised and the initial error sum of squares does not meet the convergence condition, take the revised cluster centers as the new initial cluster centers and repeat the above steps until the difference between the revised and the initial error sum of squares meets the convergence condition.
3. The density-based K-means clustering method with optimized initial centers for data mining according to claim 1, characterized in that, for a data set X = {x_i | i = 1, 2, ..., n} whose data objects have m-dimensional features, the density of a data object x_i is

density(x_i) = |{x_j ∈ X : 0 < d(x_i, x_j) < R}|

where d(x_i, x_j) is the Euclidean distance between data objects x_i and x_j, i = 1, 2, ..., n, j = 1, 2, ..., n, and R is the neighbourhood radius of x_i.
4. The density-based K-means clustering method with optimized initial centers for data mining according to claim 1, characterized in that, for a data object x_i, the distances from x_i to every data object whose density is larger than that of x_i are computed, and the minimum density distance value of x_i is the minimum of these distances; when x_i is the data object with the maximum density, its minimum density distance value is the maximum distance between x_i and any data object in the data set.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201510131975.2A CN104731916A (en) | 2015-03-24 | 2015-03-24 | Optimizing initial center K-means clustering method based on density in data mining |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201510131975.2A CN104731916A (en) | 2015-03-24 | 2015-03-24 | Optimizing initial center K-means clustering method based on density in data mining |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN104731916A true CN104731916A (en) | 2015-06-24 |
Family
ID=53455803
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201510131975.2A Pending CN104731916A (en) | 2015-03-24 | 2015-03-24 | Optimizing initial center K-means clustering method based on density in data mining |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN104731916A (en) |
Cited By (23)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106874923A (en) * | 2015-12-14 | 2017-06-20 | 阿里巴巴集团控股有限公司 | A kind of genre classification of commodity determines method and device |
| CN105631416A (en) * | 2015-12-24 | 2016-06-01 | 华侨大学 | Method for carrying out face recognition by using novel density clustering |
| CN105631416B (en) * | 2015-12-24 | 2018-11-13 | 华侨大学 | The method for carrying out recognition of face is clustered using novel density |
| CN105903692A (en) * | 2016-05-19 | 2016-08-31 | 四川长虹电器股份有限公司 | Lithium ion battery consistency screening method |
| CN105903692B (en) * | 2016-05-19 | 2018-05-25 | 四川长虹电器股份有限公司 | Lithium ion battery conformity classification method |
| CN107358368A (en) * | 2017-07-21 | 2017-11-17 | 国网四川省电力公司眉山供电公司 | A kind of robust k means clustering methods towards power consumer subdivision |
| CN107358368B (en) * | 2017-07-21 | 2021-07-20 | 国网四川省电力公司眉山供电公司 | A Robust k-means Clustering Method for Electricity User Segmentation |
| CN108154173A (en) * | 2017-12-21 | 2018-06-12 | 陕西科技大学 | A kind of oil-water interface measuring device of crude oil storage tank and method |
| CN108154173B (en) * | 2017-12-21 | 2021-08-24 | 陕西科技大学 | A crude oil storage tank oil-water interface measuring device and method |
| CN108536648B (en) * | 2018-03-30 | 2021-07-06 | 武汉大学 | Conversion solution and optimization method of partial discharge nonlinear model based on multiple ultrasonic sensors |
| CN108536648A (en) * | 2018-03-30 | 2018-09-14 | 武汉大学 | Shelf depreciation nonlinear model conversion based on multiple ultrasonic sensors solves and optimization method |
| CN109002513A (en) * | 2018-07-04 | 2018-12-14 | 深圳软通动力科技有限公司 | A kind of data clustering method and device |
| CN110047509A (en) * | 2019-03-28 | 2019-07-23 | 国家计算机网络与信息安全管理中心 | A kind of two-stage Subspace partition method and device |
| CN111221915A (en) * | 2019-04-18 | 2020-06-02 | 江苏大学 | Online learning resource quality analysis method based on CWK-means |
| CN111221915B (en) * | 2019-04-18 | 2024-01-09 | 西安睿德培欣教育科技有限公司 | Online learning resource quality analysis method based on CWK-means |
| CN110222747A (en) * | 2019-05-24 | 2019-09-10 | 河海大学 | A kind of clustering method of optimization |
| CN110222747B (en) * | 2019-05-24 | 2022-08-16 | 河海大学 | An optimized clustering method |
| CN110414569A (en) * | 2019-07-03 | 2019-11-05 | 北京小米智能科技有限公司 | Cluster realizing method and device |
| US11501099B2 (en) | 2019-07-03 | 2022-11-15 | Beijing Xiaomi Intelligent Technology Co., Ltd. | Clustering method and device |
| CN110825826A (en) * | 2019-11-07 | 2020-02-21 | 深圳大学 | Cluster computing method, device, terminal and storage medium |
| CN111860700A (en) * | 2020-09-22 | 2020-10-30 | 深圳须弥云图空间科技有限公司 | Energy consumption classification method and device, storage medium and equipment |
| CN114493229A (en) * | 2022-01-20 | 2022-05-13 | 广东电网有限责任公司电力调度控制中心 | Regulation and control business arrangement agent method and system based on unsupervised learning technology |
| CN114493229B (en) * | 2022-01-20 | 2024-10-15 | 广东电网有限责任公司电力调度控制中心 | Method and system for scheduling and proxy of regulation and control business based on unsupervised learning technology |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20150624 |