[go: up one dir, main page]

TWI402701B - Method for density-based data clustering - Google Patents

Method for density-based data clustering Download PDF

Info

Publication number
TWI402701B
TWI402701B TW98131795A TW98131795A TWI402701B TW I402701 B TWI402701 B TW I402701B TW 98131795 A TW98131795 A TW 98131795A TW 98131795 A TW98131795 A TW 98131795A TW I402701 B TWI402701 B TW I402701B
Authority
TW
Taiwan
Prior art keywords
point
points
data
query
seed
Prior art date
Application number
TW98131795A
Other languages
Chinese (zh)
Other versions
TW201112016A (en
Inventor
Cheng Fa Tsai
Heng Fu Yeh
Original Assignee
Univ Nat Pingtung Sci & Tech
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Nat Pingtung Sci & Tech filed Critical Univ Nat Pingtung Sci & Tech
Priority to TW98131795A priority Critical patent/TWI402701B/en
Publication of TW201112016A publication Critical patent/TW201112016A/en
Application granted granted Critical
Publication of TWI402701B publication Critical patent/TWI402701B/en

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Description

基於密度式之資料分群方法Density-based data grouping method

本發明係關於一種資料分群方法,尤其是一種基於密度式之資料分群方法。The invention relates to a data grouping method, in particular to a density based data grouping method.

資料探勘(Data Mining)主要係提供使用者發掘原始資料中隱含著的有用資訊,用以於龐大資料集中尋找各資料間所隱含的特徵與關係,以建立一套完整資料分析模式。藉此,可應用於如商業交易行為分析、空間資料分析、文件管理及網路入侵行為分析等各種不同領域,以有效發掘潛藏且有用之資訊,進而提供決策人員參考。另外,在資料探勘中之資料分群技術可讓使用者快速得知資料本質間的相關程度,例如顧客購買行為、年齡市場區隔等。資料分群是指在原始資料庫中依據某些自訂的維度特性將各相似性高的資料歸類到各自的群組裡。Data Mining mainly provides users with the useful information hidden in the original data to find the features and relationships implied between the data in a large data set to establish a complete data analysis model. In this way, it can be applied to various fields such as business transaction behavior analysis, spatial data analysis, file management and network intrusion behavior analysis to effectively explore hidden and useful information, and provide reference for decision makers. In addition, the data grouping technology in data exploration allows users to quickly know the correlation between the nature of the data, such as customer purchase behavior, age market segmentation and so on. Data grouping refers to the classification of data with high similarity into their respective groups based on some custom dimensional characteristics in the original database.

然而,隨著服務更加多變及所必須擷取的隱藏訊息更多,現今的資料分群技術開始趨向將是否可輕易處理極大量的資料做為評估效能的一項重大因素。以下針對幾種較具代表性的習知資料分群方法進行說明:However, as services become more diverse and there are more hidden messages that must be captured, today's data-sorting technologies are beginning to trend toward the ability to easily process extremely large amounts of data as a significant factor in assessing performance. The following is a description of several representative data clustering methods:

1、K-means資料分群方法:步驟一係預先隨機選擇k個資料點(即進行分群作業後所產生之資料群集數量),並將該k個資料點設定為k個資料群集的初始形心點;步驟二係依據各初始形心點計算各資料點x彼此間之遠近距離,亦即對每一個資料點x尋找最接近的初始形心點,並將各資料點x分別加入最接近的初始形心點,以進行分群動作;步驟三係由目前的各資料點x分佈重新計算出較佳的形心點;最後係重覆步驟二及步驟三,直至所有資料群集的形心點皆已不再變動時,即終止資料分群作業。1. K-means data grouping method: Step 1 pre-randomly selects k data points (that is, the number of data clusters generated after grouping operations), and sets the k data points as the initial centroids of k data clusters. Point; step 2 is to calculate the distance between each data point x according to each initial centroid point, that is, to find the closest initial centroid point for each data point x, and add each data point x to the nearest one. The initial centroid point is used for grouping action; the third step is to recalculate the preferred centroid point from the current data point x distribution; finally, repeat steps 2 and 3 until the centroids of all data clusters are When there is no longer any change, the data grouping operation is terminated.

然而,雖然該習知K-means資料分群方法的分群速度相當快速,但因為一開始隨機選取初始形心點的因素,使得每次的分群結果皆不甚穩定,且在進行步驟二的過程中,僅採用簡單的距離判斷,故造成分群結果不佳。However, although the conventional K-means data grouping method is quite fast, the clustering results are not stable at all because of the initial selection of the initial centroid points, and in the process of step two. Only simple distance judgment is used, which results in poor grouping results.

2、DBSCAN資料分群方法:步驟一係由一資料集中之數個資料點預先隨機選擇其中一資料點做為初始種子點;步驟二係判斷目前初始種子點的半徑ε範圍內是否有超過半徑MinPts的資料點,若達到門檻值則將目前範圍內的資料點歸類到同一群集內並作為種子點,並從範圍內的其它種子點逐一進行擴張;步驟三係持續前述步驟二,直到該資料集中所有資料點都被歸類完畢為止。該習知DBSCAN資料分群方法是以較為合乎邏輯的密度判斷方式來進行分群,故可用以濾除雜訊及適用於不規則圖樣的資料點等。2, DBSCAN data grouping method: Step 1 is to select one of the data points as the initial seed point in advance by several data points in a data set; Step 2 is to determine whether the current radius of the initial seed point exceeds the radius MinPts If the threshold is reached, the data points in the current range are classified into the same cluster and used as seed points, and expanded from other seed points in the range; Step 3 continues the foregoing step 2 until the data All the data points in the collection are classified. The conventional DBSCAN data grouping method is grouped in a more logical density judgment manner, so it can be used to filter out noise and data points suitable for irregular patterns.

然而,上述習知DBSCAN資料分群方法因為必須對每個資料點進行繁複的擴張及密度判斷,故造成分群時間較為冗長。However, the above-mentioned conventional DBSCAN data grouping method is complicated because of the complicated expansion and density judgment of each data point.

3、IDBSCAN資料分群方法:主要針對前述習知DBSCAN資料分群方法係循序判斷資料點進行擴散而耗時的行為進行改良,而採用經由減少查詢次數而提升分群速度的策略。該習知IDBSCAN資料分群方法係於擴張種子點半徑ε之掃描範圍邊界上等距設置8個標記邊界點,該擴張種子點半徑ε之掃描範圍內的資料點僅選取最靠近該8個標記邊界點之資料點作為種子點,如此減少種子點之數量,便可減少重複的擴張動作,以克服該習知DBSCAN資料分群方法中種子點數量過多而造成速度緩慢之缺點。3, IDBSCAN data grouping method: mainly for the above-mentioned conventional DBSCAN data grouping method is to gradually judge the data points to spread and time-consuming behavior improvement, and adopt the strategy of improving the grouping speed by reducing the number of queries. The conventional IDBSCAN data grouping method is to set eight mark boundary points equidistantly on the scan range boundary of the expanded seed point radius ε, and the data points in the scan range of the expanded seed point radius ε are only selected closest to the eight mark boundaries. The data point of the point is used as a seed point, so that the number of seed points can be reduced, and the repeated expansion action can be reduced to overcome the shortcoming of the excessive number of seed points in the conventional DBSCAN data grouping method.

然而,上述習知IDBSCAN資料分群方法由於離該擴張種子點較近的資料點覆蓋面積較大,若將該些資料點納入作為種子點,則會增加搜尋的時間成本。再且,即使該擴張種子點掃描範圍內之種子點數量係不大於8個,惟相鄰之種子點其掃描範圍重疊比例較高,造成其重複擴張之比例亦相當高,進而增加時間成本。However, the above-mentioned conventional IDBSCAN data grouping method has a large coverage area of data points close to the expanded seed point, and if these data points are included as seed points, the time cost of searching is increased. Moreover, even if the number of seed points in the scanning range of the expanded seed is not more than 8, the adjacent seed points have a higher overlapping ratio of scanning ranges, and the proportion of repeated expansion is also relatively high, thereby increasing the time cost.

4、KIDBSCAN資料分群方法:主要針對前述習知IDBSCAN資料分群方法中尋找標記邊界點時,若依序從資料集中所讀進來的資料點是位於低密度的區域時,將會造成多餘種子點搜尋的動作,而造成擴張不佳的效果。為了減少取樣次數,該習知KIDBSCAN資料分群方法提出加入菁英點來進行擴張的概念。其所要設定的三個參數為:菁英點(K)、半徑(ε)與最少包含點(MinPts)。而執行步驟說明如下:(1)先利用K-means演算法找到資料集中的K個重心點,再尋找距離這些重心點最近的K個資料點將其定義為該菁英點;(2)將K個菁英點移到資料集的最前端;(3)執行IDBSCAN資料分群方法。如此便可加快資料分群之速度,惟所能減少的分群時間仍相當有限。4. KIDBSCAN data grouping method: mainly for the above-mentioned known IDBSCAN data grouping method, when searching for the boundary point of the mark, if the data points read in from the data set are located in the low-density area, the redundant seed point search will be caused. The action that caused the poor expansion. In order to reduce the number of sampling times, the conventional KIDBSCAN data grouping method proposes the concept of adding an elite point for expansion. The three parameters to be set are: elite point (K), radius (ε) and minimum inclusion point (MinPts). The execution steps are as follows: (1) Firstly use the K-means algorithm to find the K center of gravity points in the data set, and then find the K data points closest to these center of gravity points to define it as the elite point; (2) K The elite points are moved to the forefront of the data set; (3) the IDBSCAN data grouping method is implemented. This speeds up the grouping of data, but the grouping time that can be reduced is still quite limited.

5、NPUST資料分群方法:主要針對前述習知DBSCAN、IDBSCAN及KIDBSCAN資料分群方法中以每一資料點進行擴張動作時,必須對資料集中所有資料點進行擴張詢問,將會造成擴張不佳的效果。為了減少擴張詢問次數,習知NPUST資料分群方法提出利用習知K-means及IDBSCAN資料分群方法來進行擴張的概念;其所要設定之三個參數為:群數(K)、半徑(ε)與最少包含點(MinPts)。其執行步驟說明如下:(1)先利用習知K-means資料分群方法將資料集分佈空間切割為K個子空間;(2)利用該習知IDBSCAN資料分群方法分別於K個子空間進行分群,當進行擴張詢問時,該習知IDBSCAN資料分群方法之掃描範圍限制於本身之子空間內;(3)定義各群集之邊界點,並計算任二邊界點之間的距離,將距離最接近之二邊界點所屬的群集進行合併為一群,重複進行合併直到群集總數到達所設定之群數(K)即完成分群動作,如此,便可減少該習知IDBSCAN資料分群方法對所有資料點的擴張詢問,以加快分群之速度。5. NPUST data grouping method: mainly for the above-mentioned conventional DBSCAN, IDBSCAN and KIDBSCAN data grouping methods, when each data point is expanded, it is necessary to expand the query of all data points in the data set, which will result in poor expansion. . In order to reduce the number of expansion queries, the conventional NPUST data grouping method proposes the concept of expansion using the conventional K-means and IDBSCAN data grouping methods; the three parameters to be set are: group number (K), radius (ε) and Minimum inclusion point (MinPts). The execution steps are as follows: (1) Firstly, the data set distribution space is cut into K subspaces by using the conventional K-means data grouping method; (2) using the conventional IDBSCAN data grouping method to perform grouping in K subspaces respectively. When performing the expansion query, the scanning range of the conventional IDBSCAN data grouping method is limited to the subspace of itself; (3) defining the boundary points of each cluster, and calculating the distance between any two boundary points, the closest boundary of the distance The clusters to which the points belong are merged into a group, and the merge is repeated until the total number of clusters reaches the set number of groups (K), that is, the grouping action is completed, so that the expansion query of all the data points by the conventional IDBSCAN data grouping method can be reduced. Speed up the speed of grouping.

然而,上述習知NPUST資料分群方法之群數(K)的設定必須依據資料集包含的群集數而定,若將群數(K)設定愈大,代表切割後每個子空間愈小,即進行擴張詢問之範圍亦相對愈小,執行效率自然會提高。惟,在未知資料集之群集數時,該群數(K)設定若與資料集之群集數相差太大,則會產生不正確的分群結果。However, the number of groups (K) of the above-mentioned conventional NPUST data grouping method must be determined according to the number of clusters included in the data set. If the number of groups (K) is set to be larger, the smaller each subspace after cutting is performed, that is, The scope of the expansion inquiry is relatively small, and the execution efficiency will naturally increase. However, when the number of clusters of the unknown data set is set, if the number of groups (K) is too different from the number of clusters of the data set, an incorrect clustering result will result.

基於上述原因,有必要進一步改良上述習知資料分群方法。For the above reasons, it is necessary to further improve the above-described conventional data grouping method.

本發明目的乃改良上述缺點,以提供一種基於密度式之資料分群方法,將資料點之搜尋限制於一查詢範圍內之資料點者。The object of the present invention is to improve the above disadvantages to provide a density-based data grouping method for limiting the search of data points to data points within a query range.

本發明目的係提供一種基於密度式之資料分群方法,該資料分群方法具有提升分群效率者。The object of the present invention is to provide a density-based data grouping method, which has the ability to improve grouping efficiency.

為達到前述發明目的,本發明所運用之技術內容如下:一種基於密度式之資料分群方法係包含一設定步驟,係設定一掃描半徑參數、一最少包含點參數及一種子列表;一切割步驟,係依據該掃描半徑參數將具有數個資料點之資料集的分佈空間進行切割,以獲得數個網格,使該些資料點分佈於該些網格內;一讀取步驟,於該資料集中讀取一資料點作為核心點;一擴張詢問步驟,係將該核心點所在網格及周圍鄰近網格定義為一查詢範圍,且位於該查詢範圍內之資料點定義為查詢點,分別計算此核心點與該些查詢點之間的距離,將距離小於或等於該掃描半徑參數之查詢點定義為鄰近點;一分群判斷步驟,係判斷該些鄰近點之數量是否小於該最少包含點參數,若判斷為「否」,則將該核心點及鄰近點視為同一群集,並進行一邊界標記步驟,若判斷為「是」,則將該核心點及鄰近點視為雜訊,並進行一第一判斷步驟;該邊界標記步驟,係於核心點之掃描範圍邊界上標記數個邊界記號,將距離該些邊界記號最接近之鄰近點加入該種子列表中,並由該種子列表讀取一種子作為核心點,重新進行該擴張詢問步驟;該第一判斷步驟,係判斷是否該種子列表中之種子皆完成該擴張詢問步驟,若判斷為「是」,則進行一終止判斷步驟,若判斷為「否」,則由該種子列表中讀取一種子作為核心點,重新進行該擴張詢問步驟;及該終止判斷步驟,係判斷是否所有資料點皆已完成分群或視為雜訊,若判斷為「是」,則終止,若判斷為「否」,則重新進行該讀取步驟。In order to achieve the foregoing object, the technical content of the present invention is as follows: a density-based data grouping method includes a setting step of setting a scan radius parameter, a minimum inclusion point parameter, and a sub-list; a cutting step, According to the scanning radius parameter, the distribution space of the data set having several data points is cut to obtain a plurality of meshes, so that the data points are distributed in the meshes; and a reading step is performed in the data set Reading a data point as a core point; an expansion query step is to define the grid where the core point is located and the surrounding adjacent grid as a query range, and the data points located in the query range are defined as query points, and respectively calculate this The distance between the core point and the query points defines a query point whose distance is less than or equal to the scan radius parameter as a neighboring point; a group judgment step determines whether the number of the neighboring points is less than the minimum included point parameter, If the judgment is "No", the core point and the neighboring point are regarded as the same cluster, and a boundary marking step is performed. If the judgment is "Yes" The core point and the neighboring point are regarded as noise, and a first determining step is performed; the boundary marking step marks a plurality of boundary marks on the boundary of the scan range of the core point, and the closest to the boundary marks is The neighboring point is added to the seed list, and the seed list is read as a core point, and the expansion query step is re-executed; the first determining step determines whether the seed in the seed list completes the expansion query step. If the determination is "Yes", a termination determination step is performed. If the determination is "No", then one of the seed lists is read as a core point, and the expansion inquiry step is performed again; and the termination determination step is determined. Whether all the data points have been grouped or regarded as noise, if it is judged as "Yes", it will terminate. If the judgment is "No", the reading step will be repeated.

為讓本發明之上述及其他目的、特徵及優點能更明顯易懂,下文特舉本發明之較佳實施例,並配合所附圖式,作詳細說明如下:請參照第1及2圖所示,本發明較佳實施例之基於密度式之資料分群方法,係藉由一電腦系統連接至少一資料庫作為執行架構,該資料庫中係存有一資料集1,該資料集1係由數筆資料點11所共同組成之群集,本發明之基於密度式之資料分群方法係包含一設定步驟S1、一切割步驟S2、一讀取步驟S3、一擴張詢問步驟S4、一分群判斷步驟S5、一邊界標記步驟S6、一第一判斷步驟S7及一終止判斷步驟S8。藉由上述步驟流程,可快速且正確的完成資料分群作業。The above and other objects, features and advantages of the present invention will become more <RTIgt; Preferably, the density-based data grouping method according to the preferred embodiment of the present invention connects at least one database as an execution architecture by using a computer system, and the data library has a data set 1 stored in the database. The density data grouping method of the present invention comprises a setting step S1, a cutting step S2, a reading step S3, an expansion query step S4, a grouping determining step S5, A boundary marking step S6, a first determining step S7 and a termination determining step S8. Through the above steps, the data grouping operation can be completed quickly and correctly.

請參照第1及2圖所示,本發明較佳實施例之基於密度式之資料分群方法之設定步驟S1,以於該電腦系統設定一掃描半徑(Eps)參數R、一最少包含點(Minpts)參數及一種子列表。更詳言之,該掃描半徑參數R及最少包含點參數係為正比的關係,當該掃描半徑參數R的值設定愈大,該最少包含點參數的值也跟著設定愈大,反之,當該掃描半徑參數R的值設定愈小,該最少包含點參數的值也跟著設定愈小,藉此,以提高資料分群的正確率。此外,為方便後續說明,於此將「掃描範圍S」一詞定義為以該資料集1的任一個資料點11為中心,並以該掃描半徑參數R作為半徑進行掃描所涵蓋之範圍。Referring to FIG. 1 and FIG. 2, a setting step S1 of the density-based data grouping method according to the preferred embodiment of the present invention is used to set a scan radius (Eps) parameter R and a minimum inclusion point (Minpts) in the computer system. ) parameters and a sublist. More specifically, the scan radius parameter R and the minimum inclusion point parameter are in a proportional relationship. When the value of the scan radius parameter R is set larger, the value of the minimum inclusion point parameter is also set to be larger, and vice versa. The smaller the value of the scan radius parameter R is set, the smaller the value of the minimum included point parameter is, so as to improve the correct rate of data grouping. In addition, for convenience of the following description, the term "scanning range S" is defined herein as the range covered by scanning with the scanning radius parameter R as the center centering on any of the data points 11 of the data set 1.

請參照第1及3圖所示,本發明較佳實施例之基於密度式之資料分群方法之切割步驟S2,係依據該掃描半徑參數R將具有該數個資料點11之分佈空間進行切割,以獲得數個網格2,使該些資料點11分佈於該些網格2內,該掃描半徑參數R即代表該網格2之尺寸;舉例而言,以二維平面座標為例進行說明,該數個資料點11皆分佈於一個二維平面空間中,因此,各個資料點11皆對應有一個二維平面座標(X,Y),該切割步驟S2係針對所有資料點11每一維度之座標(X座標或Y座標),分別找出該座標的最大值,並如以下所述公式(1)計算該維度被切割之網格數:Referring to FIG. 1 and FIG. 3, the cutting step S2 of the density-based data grouping method according to the preferred embodiment of the present invention cuts the distribution space having the plurality of data points 11 according to the scanning radius parameter R. Obtaining a plurality of grids 2, the data points 11 are distributed in the grids 2, and the scanning radius parameter R represents the size of the grid 2; for example, a two-dimensional plane coordinate is taken as an example for illustration. The plurality of data points 11 are all distributed in a two-dimensional plane space. Therefore, each data point 11 corresponds to a two-dimensional plane coordinate (X, Y), and the cutting step S2 is for each data point 11 for each dimension. The coordinates (X coordinate or Y coordinate) are used to find the maximum value of the coordinate, and calculate the number of meshes that the dimension is cut as shown in the following formula (1):

其中,Gi 為該i維度之網格數,Ci-Max 為該i座標的最大值,且i=X、Y、Z...等,R為該掃描半徑參數,S代表常數項,如以下所述之公式(2)表示:Where G i is the number of grids of the i dimension, C i-Max is the maximum value of the i coordinate, and i=X, Y, Z, etc., R is the scan radius parameter, and S represents a constant term. Equation (2) as described below indicates:

更詳言之,於二維平面座標中i係為X及Y,意即GX 代表X維度之網格數,GY 代表Y維度之網格數,CX-Max 為X座標的最大值,CY-Max 為Y座標的最大值,分別將該最大值CX-Max 及CY-Max 除以該掃描半徑參數R,並以”高斯符號[]”將所得的”商數”以整數表示,若能整除則表示所得的”商數”即為該X或Y維度之網格數,不必加該常數項S;若不能整除,則將所得的”商數”再加該常數項S即為該X或Y維度之網格數。藉此,確保將該數資料點11之分佈空間切割後,所獲得之數個網格2大小相等,且該資料集1中所有資料點11均位於該些網格2內。More specifically, in the two-dimensional plane coordinates, i is X and Y, meaning that G X represents the number of grids in the X dimension, G Y represents the number of grids in the Y dimension, and C X-Max is the maximum value of the X coordinate. , C Y-Max is the maximum value of the Y coordinate, and the maximum values C X-Max and C Y-Max are respectively divided by the scanning radius parameter R, and the obtained "quotient" is obtained by "Gaussian symbol []" An integer indicates that if divisible, the resulting "quotient" is the number of grids in the X or Y dimension, and the constant term S is not added; if it is not divisible, the resulting "quotient" is added to the constant term. S is the number of grids in the X or Y dimension. Thereby, after the distribution space of the data point 11 is cut, the obtained plurality of meshes 2 are equal in size, and all the data points 11 in the data set 1 are located in the meshes 2.

請參照第1及3圖所示,本發明較佳實施例之基於密度式之資料分群方法之讀取步驟S3,係由該資料集1中讀取其中一筆資料點11作為核心點12,並進行該擴張詢問步驟S4。Referring to FIGS. 1 and 3, the reading step S3 of the density-based data grouping method according to the preferred embodiment of the present invention reads one of the data points 11 as the core point 12 from the data set 1, and The expansion inquiry step S4 is performed.

請參照第1、4及5圖所示,本發明較佳實施例之基於密度式之資料分群方法之擴張詢問步驟S4,係將該核心點12所在網格2a及數鄰近網格2b定義為一查詢範圍,且位於該查詢範圍內之資料點11定義為查詢點13(如第4圖所示),分別計算此核心點12與該些查詢點13之間的距離,將距離小於或等於該掃描半徑參數R之查詢點13定義為鄰近點14(如第5圖所示)。Referring to Figures 1, 4 and 5, the expansion query step S4 of the density-based data grouping method according to the preferred embodiment of the present invention defines the grid 2a and the number of adjacent grids 2b of the core point 12 as A query range, and the data point 11 located in the query range is defined as the query point 13 (as shown in FIG. 4), and the distance between the core point 12 and the query points 13 is calculated respectively, and the distance is less than or equal to The query point 13 of the scan radius parameter R is defined as the neighboring point 14 (as shown in Figure 5).

請再參照第4圖所示,舉例而言,以網格狀的分佈而言,該查詢範圍之定義係如以下所述三種樣態:其一、該核心點12所在網格2a可以明確地找出8個鄰近網格2b;其二、該核心點12’所在網格2a’可以明確地找出3個鄰近網格2b’;其三、該核心點12”所在網格2a”可以明確地找出5個鄰近網格2b”;更詳言之,本發明所定義之「查詢範圍」便為該核心點12所在網格2a及該些鄰近網格2b所涵蓋之範圍;藉此,該核心點12僅需針對該查詢範圍內之查詢點13計算距離,以減少該核心點12與所有資料點11計算距離的次數,進而節省於該資料集1中搜尋大量資料點11的時間成本。Referring to FIG. 4 again, for example, in the case of a grid-like distribution, the definition of the query range is as follows: First, the grid 2a of the core point 12 can be explicitly Find 8 adjacent grids 2b; second, the grid 2a' where the core points 12' are located can clearly find 3 adjacent grids 2b'; third, the grid 2a of the core points 12" can be clear Find 5 adjacent meshes 2b"; more specifically, the "query range" defined by the present invention is the range covered by the mesh 2a of the core point 12 and the adjacent meshes 2b; The core point 12 only needs to calculate the distance for the query point 13 in the query range to reduce the number of times the core point 12 calculates the distance from all the data points 11, thereby saving the time cost of searching for a large number of data points 11 in the data set 1. .

請參照第1及5圖所示,本發明較佳實施例之基於密度式之資料分群方法之分群判斷步驟S5,係判斷該些鄰近點14之數量是否小於該最少包含點參數,若判斷為「否」,則將該核心點12及該些鄰近點14視為同一群集,並進行一邊界標記步驟S6;若判斷為「是」,則將該核心點12及該些鄰近點14視為雜訊,並進行一第一判斷步驟S7。Referring to FIG. 1 and FIG. 5, the grouping judging step S5 of the density-based data grouping method according to the preferred embodiment of the present invention determines whether the number of the neighboring points 14 is less than the minimum inclusion point parameter, and if it is determined to be If no, the core point 12 and the neighboring points 14 are regarded as the same cluster, and a boundary marking step S6 is performed; if the determination is YES, the core point 12 and the neighboring points 14 are regarded as The noise is transmitted, and a first judgment step S7 is performed.

請參照第1及6圖所示,本發明較佳實施例之基於密度式之資料分群方法之邊界標記步驟S6,係於該分群判斷步驟S5所獲得群集之核心點12的掃描範圍S邊界上標記數個邊界記號3,於該核心點12之掃描範圍S內,將距離該些邊界記號3最接近之鄰近點14加入該種子列表中當做種子,再由該種子列表讀取一種子作為核心點12,並重新進行該擴張詢問步驟S4;舉例而言,本實施例係於該群集之核心點12的掃描範圍S邊界上,依順時針方向依序等距設置8個邊界記號3a、3b、3c、3d、3e、3f、3g及3h;接著,再分別計算該些邊界記號3a、3b、3c、3d、3e、3f、3g及3h與此核心點12之掃描範圍S內所有鄰近點14之間的距離,分別將距離最接近該些邊界記號3a、3b、3c、3d、3e、3f、3g及3h之鄰近點14a、14b、14c、14d、14e、14f及14g加入該種子列表中當做種子;此外,請參照第6圖所示,若有二個以上之邊界記號3(如第6圖所示為二個邊界記號3f及3g)所對應最接近之鄰近點14f係為相同,則僅需將比鄰近點14f加入該種子列表一次。Referring to FIGS. 1 and 6, the boundary labeling step S6 of the density-based data grouping method according to the preferred embodiment of the present invention is performed on the boundary of the scanning range S of the core point 12 of the cluster obtained in the grouping determining step S5. Marking a plurality of boundary marks 3, in the scan range S of the core point 12, adding the neighboring points 14 closest to the boundary marks 3 to the seed list as a seed, and then reading a sub-category as a core from the seed list Point 12, and repeat the expansion query step S4; for example, the embodiment is on the scan range S boundary of the core point 12 of the cluster, and 8 boundary marks 3a, 3b are arranged in a clockwise direction in an equidistant manner. 3c, 3d, 3e, 3f, 3g, and 3h; then, calculate the boundary marks 3a, 3b, 3c, 3d, 3e, 3f, 3g, and 3h and all adjacent points in the scan range S of the core point 12, respectively. The distance between 14 is added to the seed list by adjacent points 14a, 14b, 14c, 14d, 14e, 14f and 14g which are closest to the boundary marks 3a, 3b, 3c, 3d, 3e, 3f, 3g and 3h, respectively. As a seed; in addition, please refer to Figure 6, if there are two 3 marks the boundary (as shown in FIG. 6 is a two boundary marks 3f and 3g) of the nearest adjacent points corresponding to the same line 14f, 14f adjacent points than the only added to the list of primary seeds.

請參照第7圖所示,一般而言,靠近該核心點12之鄰近點14的掃描範圍S’與該核心點12的掃描範圍S重疊比例較高,又由於該核心點12之掃描範圍S內的資料點11數量已大於或等於該最少包含點參數,因此,靠近該核心點12之鄰近點14的掃描範圍S’內的資料點11視為雜訊之機率通常較低,若將靠近該核心點12之鄰近點14加入該種子列表當做種子,則將增加執行的時間成本。然而,該鄰近點14a之掃描範圍S’’與該核心點12之掃描範圍S重疊比例較低;因此,請參照第6圖所示,本發明之邊界標記步驟S6每次僅分別選取最靠近該些邊界記號3之一個鄰近點14加入該種子列表中當做種子,可篩選去除掉較靠近該核心點12之鄰近點14,如此,便可於維持高分群準確率的前提下,有效提升分群作業的效率。Referring to FIG. 7, in general, the scan range S' of the adjacent point 14 near the core point 12 overlaps with the scan range S of the core point 12, and the scan range S of the core point 12 is also The number of data points 11 in the data point is greater than or equal to the minimum inclusion point parameter. Therefore, the probability that the data point 11 in the scanning range S' of the neighboring point 14 near the core point 12 is regarded as noise is generally low, if it is to be close The neighboring point 14 of the core point 12 joins the seed list as a seed, which increases the time cost of execution. However, the scanning range S'' of the neighboring point 14a and the scanning range S of the core point 12 overlap to a lower ratio; therefore, referring to FIG. 6, the boundary marking step S6 of the present invention only selects the closest one at a time. A neighboring point 14 of the boundary mark 3 is added to the seed list as a seed, and the neighboring point 14 closer to the core point 12 can be selected and removed, so that the cluster can be effectively improved while maintaining high group accuracy. The efficiency of the work.

請參照第1及2圖所示,本發明較佳實施例之基於密度式之資料分群方法之第一判斷步驟S7,係判斷是否該種子列表中之種子皆已完成該擴張詢問步驟S4,若判斷為「是」,則進行一終止判斷步驟S8;若判斷為「否」,則由該種子列表中讀取另一種子作為核心點12,重新進行該擴張詢問步驟S4;更詳言之,該種子列表中之種子在進行該擴張詢問步驟S4後即從該種子列表中刪除,然而,於該邊界點標記步驟S6中持續會有數資料點11加入該種子列表,因此,該種子列表係不斷有資料點11被刪除或加入,直到該種子列表中無任何種子可進行該擴張詢問步驟S4,即完成一個群集的擴散動作,並進行該終止判斷步驟S8。Referring to FIGS. 1 and 2, the first determining step S7 of the density-based data grouping method according to the preferred embodiment of the present invention determines whether the seed in the seed list has completed the expansion query step S4. If the determination is YES, a termination determination step S8 is performed; if the determination is "NO", another seed is read from the seed list as the core point 12, and the expansion inquiry step S4 is performed again; more specifically, The seed in the seed list is deleted from the seed list after performing the expansion query step S4. However, in the boundary point marking step S6, the number of data points 11 continues to be added to the seed list. Therefore, the seed list is continuously The data point 11 is deleted or added until the seed list has no seeds to perform the expansion inquiry step S4, that is, a cluster diffusion operation is completed, and the termination determination step S8 is performed.

請參照第1及2圖所示,本發明較佳實施例之基於密度式之資料分群方法之終止判斷步驟S8,係判斷是否所有資料點11皆已完成分群或視為雜訊,若判斷為「是」,則終止,即完成整個資料集1之分群動作;若判斷為「否」,則重新進行該讀取步驟S3。Referring to FIG. 1 and FIG. 2, the termination determining step S8 of the density-based data grouping method according to the preferred embodiment of the present invention determines whether all the data points 11 have been grouped or regarded as noise, and if it is determined to be If "Yes", the termination is completed, that is, the grouping operation of the entire data set 1 is completed; if the determination is "NO", the reading step S3 is re-executed.

此外,請參照第8圖所示,於該擴張詢問步驟S4及該分群判斷步驟S5獲得一群集C,且利用該邊界標記步驟S6將該群集C之數個鄰近點14加入該種子列表中,並由該種子列表讀取一種子作為核心點12,以進行該擴張詢問步驟S4及該分群判斷步驟S5情形下,若於該分群判斷步驟S5判斷該核心點12之掃描範圍S內的鄰近點14數量小於該最少包含點參數時,則將該核心點12及該核心點12之掃描範圍S內所有鄰近點14視為雜訊。然而,該核心點12之掃描範圍S係與該群集C互相重疊,使該核心點12之掃描範圍S與該群集C具有數交叉資料點11a,由於該群集C之所有鄰近點14皆已視為同一群集,因此,當將該核心點12之掃描範圍S內所有鄰近點14視為雜訊時,該群集C與該核心點12之掃描範圍S重疊區域的數交叉資料點11a仍視為歸屬於該群集C。In addition, as shown in FIG. 8, a cluster C is obtained in the expansion query step S4 and the cluster determination step S5, and the neighboring points 14 of the cluster C are added to the seed list by using the boundary labeling step S6. And the seed list is read as a core point 12, in the case of performing the expansion query step S4 and the grouping determining step S5, if the neighboring point in the scanning range S of the core point 12 is determined in the grouping determining step S5 When the number of 14 is less than the minimum inclusion point parameter, then all the neighboring points 14 in the scanning range S of the core point 12 and the core point 12 are regarded as noise. However, the scan range S of the core point 12 overlaps with the cluster C such that the scan range S of the core point 12 and the cluster C have a number of intersections 11a, since all neighbors 14 of the cluster C have been viewed For the same cluster, therefore, when all the neighboring points 14 in the scanning range S of the core point 12 are regarded as noise, the number of intersections 11a of the overlapping area of the cluster C and the scanning range S of the core point 12 is still regarded as Belongs to the cluster C.

為驗證本發明之資料分群方法具有分群效率高之優點,於此針對資料集A至F進行分群,並與習知K-means、DBSCAN、IDBSCAN、KIDBSCAN及NPUST資料分群方法進行比較。其中,資料集A至F皆具有575,000筆資料點,且含75,000筆之雜訊點。其中,資料集A至F之圖形皆不相同,且所包含之正確群集數分別為10、5、2、4、4及4;此外,本次實驗模擬所使用之設備包含CPU[Intel Pentium D3.4GHz]、記憶體[2GB],並以Java作為演算法之實作程式語言;再者,本次實驗模擬針對不同資料集其設定之參數如表一所示,每一資料集A至F皆實驗30次後取其平均值如表二所示。In order to verify that the data grouping method of the present invention has the advantage of high grouping efficiency, the data sets A to F are grouped and compared with the conventional K-means, DBSCAN, IDBSCAN, KIDBSCAN and NPUST data grouping methods. Among them, data sets A to F have 575,000 data points and contain 75,000 noise points. Among them, the data sets A to F are different, and the correct number of clusters included are 10, 5, 2, 4, 4 and 4; in addition, the equipment used in this experiment simulation includes CPU [Intel Pentium D3 .4GHz], memory [2GB], and Java as the implementation language of the algorithm; in addition, this experiment simulates the parameters set for different data sets as shown in Table 1, each data set A to F After the experiment was performed for 30 times, the average value was as shown in Table 2.

請參照表二所示,由習知資料分群方法與本發明之基於密度式之資料分群方法模擬結果比較,可得知本發明之基於密度式之資料分群方法可於維持相當高之分群正確率及雜訊濾除率的前提下,大幅降低執行時間的成本,可驗證本發明之資料分群方法確實有良好之分群效果。Referring to Table 2, comparing the conventional data grouping method with the density-based data grouping method simulation result of the present invention, it can be known that the density-based data grouping method of the present invention can maintain a relatively high group correcting rate. Under the premise of noise filtering rate, the cost of execution time is greatly reduced, and it can be verified that the data grouping method of the present invention does have a good grouping effect.

本發明之基於密度式之資料分群方法,係藉由將數資料點之分佈空間切割成網格狀,使該些資料點分佈於該些網格內,並以該核心點所在網格及周圍鄰近網格定義該查詢範圍,及在該查詢範圍內定義數查詢點,以將資料點之搜尋限制於該查詢範圍內之查詢點的功效。The density-based data grouping method of the present invention divides the distribution space of the data points into a grid shape, and distributes the data points in the grids, and the grid and the surrounding area of the core points The proximity grid defines the scope of the query, and defines a number of query points within the scope of the query to limit the search of the data points to the power of the query points within the scope of the query.

本發明之基於密度式之資料分群方法,係藉由將資料點之搜尋限制於該查詢範圍內之查詢點,僅需計算該核心點與該些查詢點之間的距離以定義出鄰近點,以便於該分群判斷步驟中,加速判斷是否分群或視為雜訊,使得本發明之基於密度式之資料分群方法具有提升分群效率的功效。The density-based data grouping method of the present invention is to limit the search of the data points to the query points in the query range, and only needs to calculate the distance between the core points and the query points to define the neighboring points. In order to facilitate the judging whether to group or treat the noise in the group judging step, the density-based data grouping method of the present invention has the effect of improving the grouping efficiency.

雖然本發明已利用上述較佳實施例揭示,然其並非用以限定本發明,任何熟習此技藝者在不脫離本發明之精神和範圍之內,相對上述實施例進行各種更動與修改仍屬本發明所保護之技術範疇,因此本發明之保護範圍當視後附之申請專利範圍所界定者為準。While the invention has been described in connection with the preferred embodiments described above, it is not intended to limit the scope of the invention. The technical scope of the invention is protected, and therefore the scope of the invention is defined by the scope of the appended claims.

[本發明][this invention]

1...資料集1. . . Data set

11...資料點11. . . Data point

12...核心點12. . . Core point

12’...核心點12’. . . Core point

12’’...核心點12’’. . . Core point

13...查詢點13. . . Query point

14...鄰近點14. . . Adjacent point

14a至14g...鄰近點14a to 14g. . . Adjacent point

2...網格2. . . grid

2a...網格2a. . . grid

2a’...網格2a’. . . grid

2a’’...網格2a’’. . . grid

2b...鄰近網格2b. . . Adjacent grid

2b’...鄰近網格2b’. . . Adjacent grid

2b’’...鄰近網格2b’’. . . Adjacent grid

3...邊界記號3. . . Boundary mark

3a至3h...邊界記號3a to 3h. . . Boundary mark

C...群集C. . . Cluster

S...掃描範圍S. . . Scanning range

S’...掃描範圍S’. . . Scanning range

S’’...掃描範圍S’’. . . Scanning range

第1圖:本發明較佳實施例之資料分群方法的流程圖。Figure 1 is a flow chart of a data grouping method in accordance with a preferred embodiment of the present invention.

第2圖:本發明較佳實施例之掃描範圍示意圖。Figure 2 is a schematic illustration of the scanning range of a preferred embodiment of the invention.

第3圖:本發明較佳實施例之切割步驟示意圖。Figure 3 is a schematic illustration of the cutting steps of a preferred embodiment of the invention.

第4圖:本發明較佳實施例之擴張詢問步驟示意圖。Figure 4 is a schematic illustration of the steps of the expansion query in accordance with a preferred embodiment of the present invention.

第5圖:本發明較佳實施例之擴張詢問步驟示意圖。Figure 5 is a schematic illustration of the steps of the expansion query in accordance with a preferred embodiment of the present invention.

第6圖:本發明較佳實施例之邊界標記步驟示意圖。Figure 6 is a schematic illustration of the boundary marking steps of a preferred embodiment of the invention.

第7圖:本發明較佳實施例之邊界標記步驟示意圖。Figure 7 is a schematic illustration of the boundary marking steps of a preferred embodiment of the invention.

第8圖:本發明較佳實施例之分群判斷步驟示意圖。Figure 8 is a schematic diagram showing the steps of grouping judging in accordance with a preferred embodiment of the present invention.

Claims (4)

一種基於密度式之資料分群方法,包含:一設定步驟,係設定一掃描半徑參數、一最少包含點參數及一種子列表;一切割步驟,係依據該掃描半徑參數將具有數個資料點之資料集的分佈空間進行切割,以獲得數個網格,使該些資料點分佈於該些網格內;一讀取步驟,於該資料集中讀取一資料點作為核心點;一擴張詢問步驟,係將該核心點所在網格及數鄰近網格定義為一查詢範圍,且位於該查詢範圍內之資料點定義為查詢點,分別計算此核心點與該些查詢點之間的距離,將距離小於或等於該掃描半徑參數之查詢點定義為鄰近點;一分群判斷步驟,係判斷該些鄰近點之數量是否小於該最少包含點參數,若判斷為「否」,則將該核心點及鄰近點視為同一群集,並進行一邊界標記步驟,若判斷為「是」,則將該核心點及鄰近點視為雜訊,並進行一第一判斷步驟;該邊界標記步驟,係以該核心點為中心,以該掃描半徑參數進行掃描所涵蓋範圍的邊界上標記數個邊界記號,將距離該些邊界記號最接近之鄰近點加入該種子列表中,並由該種子列表讀取一種子作為核心點,重新進行該擴張詢問步驟;該第一判斷步驟,係判斷是否該種子列表中之種子皆完 成該擴張詢問步驟,若判斷為「是」,則進行一終止判斷步驟,若判斷為「否」,則由該種子列表中讀取一種子作為核心點,重新進行該擴張詢問步驟;及該終止判斷步驟,係判斷是否所有資料點皆已完成分群或視為雜訊,若判斷為「是」,則終止,若判斷為「否」,則重新進行該讀取步驟。 A density-based data grouping method includes: a setting step of setting a scan radius parameter, a minimum inclusion point parameter, and a sub-list; and a cutting step, the data having a plurality of data points according to the scan radius parameter The set distribution space is cut to obtain a plurality of grids, so that the data points are distributed in the grids; a reading step, reading a data point as a core point in the data set; and an expansion query step, The grid of the core point and the number of adjacent grids are defined as a query range, and the data points located in the query range are defined as query points, and the distance between the core points and the query points is calculated separately, and the distance is determined. A query point less than or equal to the scan radius parameter is defined as a neighboring point; a group judging step is to determine whether the number of the neighboring points is less than the minimum inclusion point parameter, and if the determination is "No", the core point and the neighboring point are The point is regarded as the same cluster, and a boundary marking step is performed. If the determination is yes, the core point and the neighboring point are regarded as noises, and a first judgment is made. The boundary marking step is centered on the core point, and the boundary points of the range covered by the scan radius parameter are marked with a plurality of boundary marks, and the neighboring points closest to the boundary marks are added to the seed list. And reading, by the seed list, a child as a core point, and performing the expansion inquiry step again; the first determining step is to determine whether the seed in the seed list is finished In the expansion inquiry step, if the determination is YES, a termination determination step is performed, and if the determination is "NO", a sub-sub-read is read from the seed list as a core point, and the expansion inquiry step is performed again; The termination judgment step determines whether all the data points have been grouped or regarded as noise. If the determination is "Yes", the process is terminated. If the determination is "No", the reading step is repeated. 依申請專利範圍第1項所述之基於密度式之資料分群方法,其中,該切割步驟中係找出該分佈空間中每一維度之座標的最大值,並如以下所述之公式計算該維度被切割之網格數: 其中,Gi 為該i維度之網格數,S代表常數項,如以下所述之公式表示: 在上述公式中,Ci-Max 為該i座標的最大值,R為該掃描半徑參數。The density-based data grouping method according to claim 1, wherein the cutting step finds a maximum value of coordinates of each dimension in the distribution space, and calculates the dimension according to the formula described below. Number of grids being cut: Where G i is the number of grids of the i dimension, and S represents a constant term, as expressed by the formula described below: In the above formula, C i-Max is the maximum value of the i coordinate, and R is the scan radius parameter. 依申請專利範圍第1或2項所述之基於密度式之資料分群方法,其中,該邊界標記步驟中二個以上之邊界記號所對應最接近之鄰近點相同,則僅將此鄰近點加入該種子列表一次。 According to the density-based data grouping method according to Item 1 or 2 of the patent application, wherein the adjacent points corresponding to the two or more boundary marks in the boundary marking step are the same, only the neighboring points are added to the Seed list once. 依申請專利範圍第1或2項所述之基於密度式之資料分群方法,其中,該第一判斷步驟中由該種子列表讀取一 種子作為核心點進行該擴張詢問步驟後,該種子即從該種子列表中刪除。The density-based data grouping method according to claim 1 or 2, wherein the first determining step reads a list from the seed list After the seed performs the expansion query step as a core point, the seed is removed from the seed list.
TW98131795A 2009-09-21 2009-09-21 Method for density-based data clustering TWI402701B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW98131795A TWI402701B (en) 2009-09-21 2009-09-21 Method for density-based data clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW98131795A TWI402701B (en) 2009-09-21 2009-09-21 Method for density-based data clustering

Publications (2)

Publication Number Publication Date
TW201112016A TW201112016A (en) 2011-04-01
TWI402701B true TWI402701B (en) 2013-07-21

Family

ID=44909100

Family Applications (1)

Application Number Title Priority Date Filing Date
TW98131795A TWI402701B (en) 2009-09-21 2009-09-21 Method for density-based data clustering

Country Status (1)

Country Link
TW (1) TWI402701B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI463341B (en) * 2012-12-22 2014-12-01 Univ Nat Pingtung Sci & Tech Data clustering method based on density

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6298140B1 (en) * 1998-02-20 2001-10-02 Christos Manavopoulos Electroacoustic transducer with improved tonal quality
TW200828053A (en) * 2006-12-22 2008-07-01 Univ Nat Pingtung Sci & Tech A method for grid-based data clustering
TW200846950A (en) * 2007-05-29 2008-12-01 Univ Nat Pingtung Sci & Tech Data clustering method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6298140B1 (en) * 1998-02-20 2001-10-02 Christos Manavopoulos Electroacoustic transducer with improved tonal quality
TW200828053A (en) * 2006-12-22 2008-07-01 Univ Nat Pingtung Sci & Tech A method for grid-based data clustering
TW200846950A (en) * 2007-05-29 2008-12-01 Univ Nat Pingtung Sci & Tech Data clustering method

Also Published As

Publication number Publication date
TW201112016A (en) 2011-04-01

Similar Documents

Publication Publication Date Title
TWI385544B (en) Density-based data clustering method
Cheng et al. A novel approximate spectral clustering algorithm with dense cores and density peaks
CN102663100B (en) A two-stage hybrid particle swarm optimization clustering method
CN111523576B (en) A Density Peak Clustering Outlier Detection Method Applicable to Electronic Mass Inspection
CN111475596A (en) Sub-segment similarity matching method based on multi-level track coding tree
TWI396106B (en) Grid-based data clustering method
CN104573705A (en) Clustering method for building laser scan point cloud data
CN110493221A (en) A kind of network anomaly detection method based on the profile that clusters
WO2014109127A1 (en) Index generating device and method, and search device and search method
CN107357844A (en) Outlier detection method and device
CN104731916A (en) Optimizing initial center K-means clustering method based on density in data mining
TWI391837B (en) Data clustering method based on density
CN106202388B (en) A kind of user gradation Automated Partition Method and system
CN102184216A (en) Automatic clustering method based on data field grid division
CN103714153A (en) Density clustering method based on limited area data sampling
TWI460680B (en) Data clustering method based on density
CN110781943A (en) A Clustering Method Based on Adjacent Grid Search
CN109978023A (en) Feature selection approach and computer storage medium towards higher-dimension big data analysis
CN115965318B (en) Logistics center site selection method based on variable center evolution clustering
CN105631465A (en) Density peak-based high-efficiency hierarchical clustering method
TWI402701B (en) Method for density-based data clustering
CN111708853A (en) A Taxi Hotspot Extraction Method Based on Characterized Density Peak Clustering
KR20140130014A (en) Method for producing co-occurrent subgraph for graph classification
CN119356021B (en) Optical proximity correction method and device, storage medium and electronic equipment
CN102184215A (en) Data-field-based automatic clustering method

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees