CN103150354A

CN103150354A - Data mining algorithm based on rough set

Info

Publication number: CN103150354A
Application number: CN2013100548420A
Authority: CN
Inventors: 王少夫
Original assignee: Individual
Current assignee: Individual
Priority date: 2013-01-30
Filing date: 2013-01-30
Publication date: 2013-06-12

Abstract

To improve outlier detection, the invention provides a method for mining outlier data based on rough set theory and data mining techniques. Uncertain information is studied using a rough feature selection approach together with a distance measure over similar knowledge granularities, so that the performance of the data features is retained while the number of features is reduced. Objects are then sorted by the given feature values to reduce computational complexity. Finally, an experimental analysis is carried out on a lymph data set. The results show that the algorithm detects most of the outliers and that, compared with conventional algorithms, its outlier detection performance is improved by about 10 to 20%, which demonstrates a clear advantage.

Description

A data mining algorithm based on rough sets

Technical field

The present invention relates to a data mining method based on rough sets and belongs to the field of computer information technology.

Technical background

With the development of modern communication technology, more and more data are being collected and combined, making it possible to build large social networks. For example, a network of relations between users can be built from e-mail logs, or a social network can be built from the contact information that users submit through web logs, online address books and similar channels. Today's social networks are therefore far larger than earlier ones: they usually contain thousands or tens of thousands of nodes, and some contain nearly a million. Networks of this size and complexity cannot be analysed effectively with simple mathematics or manual processing. Data mining is the non-trivial process of discovering valid, novel, potentially useful and ultimately understandable patterns in large volumes of data. It emerged as a research field precisely to resolve the predicament of having massive amounts of data but lacking effective means of analysis. It now plays a major role in many areas, including bioinformatics and natural language processing.

To obtain the best possible data mining results, a model is built using a specific algorithm: a new data mining algorithm for outliers. Using a rough feature selection method and a distance measure over similar knowledge granularities, uncertain information is studied and the performance of the data features is preserved while their number is reduced. Objects are then sorted by the given feature values to reduce computational complexity.

Summary of the invention

The present invention proposes a data mining method based on rough sets. The method mainly addresses the problem of mining outlier data and aims to obtain the best possible data mining results.

To achieve the above object, the technical scheme adopted by the present invention is as follows. The method first uses a rough feature selection method with a distance measure over similar knowledge granularities to study uncertain information, preserving the performance of the data features while reducing their number. Objects are then sorted by the given feature values to reduce computational complexity. Finally, an experimental analysis is carried out on the lymph data set. The results show that the algorithm detects most of the outliers.

The specific steps of the proposed technical scheme are as follows:

A rough set embeds the classification of knowledge into the set itself, as a part of what constitutes the set. When deciding whether an object a belongs to a set X, three cases are usually distinguished: (1) object a may or may not belong to X; (2) object a definitely does not belong to X; (3) object a definitely belongs to X. The definition is given below.

Suppose that U is a non-empty finite set and I is an equivalence relation on U; the pair K = (U, I) is called the approximation space of U. Suppose that X is a subset of U and x is an object in U. The set of all objects indiscernible from x is denoted I(x), and every object in I(x) has the same characteristic attributes as x. For every subset X ⊆ U and an equivalence relation I ∈ Ind(K), two subsets can be defined.

The lower approximation of X with respect to I is given by formula 1:

$$I_*(X) = \bigcup \{\, Y \in U/I \mid Y \subseteq X \,\} = \{\, x \in U \mid [x]_I \subseteq X \,\} \tag{1}$$

The upper approximation of X with respect to I is given by formula 2:

$$I^*(X) = \bigcup \{\, Y \in U/I \mid Y \cap X \neq \emptyset \,\} = \{\, x \in U \mid [x]_I \cap X \neq \emptyset \,\} \tag{2}$$

The boundary region of X is given by formula 3:

$$BND(X) = I^*(X) - I_*(X) \tag{3}$$

BND(X) is the difference between the upper and lower approximations of X. If BND(X) is the empty set, X is said to be crisp with respect to I; if BND(X) is not empty, X is said to be a rough set with respect to I. The structure of such a set is shown in Fig. 1.
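Formulas 1 to 3 can be illustrated with a minimal Python sketch. The helper names and the toy data below are hypothetical and are not taken from the patent; the equivalence relation is assumed to be "has the same attribute value".

```python
from collections import defaultdict


def equivalence_classes(universe, key):
    """Partition `universe` into U/I, where x I y iff key(x) == key(y)."""
    classes = defaultdict(set)
    for x in universe:
        classes[key(x)].add(x)
    return list(classes.values())


def lower_approximation(classes, target):
    """Union of the classes fully contained in `target` (formula 1)."""
    result = set()
    for c in classes:
        if c <= target:
            result |= c
    return result


def upper_approximation(classes, target):
    """Union of the classes that intersect `target` (formula 2)."""
    result = set()
    for c in classes:
        if c & target:
            result |= c
    return result


def boundary_region(classes, target):
    """Upper approximation minus lower approximation (formula 3)."""
    return upper_approximation(classes, target) - lower_approximation(classes, target)


# Hypothetical toy data: objects 1..6, indiscernible when they share a colour.
colour = {1: "red", 2: "red", 3: "blue", 4: "blue", 5: "green", 6: "green"}
U = set(colour)
X = {1, 2, 3}

classes = equivalence_classes(U, colour.get)
lower = lower_approximation(classes, X)
upper = upper_approximation(classes, X)
boundary = boundary_region(classes, X)
print(lower, upper, boundary)
```

In this toy case the boundary region is not empty, because object 3 is in X while the indiscernible object 4 is not, so X is rough with respect to the chosen relation.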

Rough set theory is regarded knowledge as to domain division, thereby makes knowledge have graininess.

If K=(U, Y) is a knowledge base, R ∈ Y is the undistinguishable relation on domain U, and the granularity of knowledge R ∈ Y is designated as KG (R)

KG (R) = \frac{| R |}{{| U |}^{2}} - - - (4)

The resolution Dis(R) of the knowledge R is

$$Dis(R) = 1 - KG(R) \tag{5}$$
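Formulas 4 and 5 can be checked numerically with a short Python sketch. It assumes that |R| counts the ordered pairs of the indiscernibility relation, so that an equivalence class of size n contributes n * n pairs; the function names and the example labels are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def knowledge_granularity(labels):
    """KG(R) = |R| / |U|^2, with R counted as a set of ordered pairs.

    `labels` gives, for each object of U, the label of its equivalence class.
    A class of size n contributes n * n ordered pairs to R.
    """
    n_objects = len(labels)
    pairs = sum(size * size for size in Counter(labels).values())
    return Fraction(pairs, n_objects * n_objects)


def resolution(labels):
    """Dis(R) = 1 - KG(R) (formula 5)."""
    return 1 - knowledge_granularity(labels)


# Hypothetical example: 4 objects split into classes of sizes 2, 1 and 1.
labels = ["a", "a", "b", "c"]
kg = knowledge_granularity(labels)  # (4 + 1 + 1) / 16
dis = resolution(labels)
print(kg, dis)
```

Under this reading, the coarsest partition (a single class) gives KG(R) = 1 and the finest partition (all singletons) gives KG(R) = 1/|U|, so a higher resolution corresponds to a finer partition.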

Data mining is the process of extracting information and knowledge that is implicit, previously unknown and potentially useful from large amounts of incomplete, noisy, fuzzy and random real-world application data.

Outliers are detected using rough sets; the algorithm consists of the following steps:

(1) Input the system information according to the initial state.

(2) Sort the information and partition it into equivalence classes.

(3) Check the number of attributes.

(4) Build a descending sequence of attributes.

(5) Repeat steps 2 and 3; otherwise, compute the knowledge granularity and the weights of the objects.

(6) Check the number of attributes and the number of objects; otherwise, sort the outliers.
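The steps above leave the loop conditions, the attribute ordering and the weight formula unspecified, so the following Python sketch is only one possible reading and not the patent's exact procedure. It assumes that each attribute partitions the objects into equivalence classes, that an object falling into small classes is more likely to be an outlier, and that all attributes carry equal weight. The function names and the toy table are hypothetical.

```python
from collections import Counter


def outlier_scores(table):
    """Score each object by how small its equivalence classes are.

    `table` maps an object id to a tuple of attribute values. For every
    attribute, objects with equal values form one equivalence class; an
    object that keeps landing in small classes receives a score near 1.
    """
    n_objects = len(table)
    n_attributes = len(next(iter(table.values())))
    scores = dict.fromkeys(table, 0.0)
    for j in range(n_attributes):
        # Class sizes of the partition induced by attribute j.
        sizes = Counter(row[j] for row in table.values())
        for obj, row in table.items():
            scores[obj] += 1.0 - sizes[row[j]] / n_objects
    # Uniform attribute weights (an assumption): average over attributes.
    return {obj: total / n_attributes for obj, total in scores.items()}


def rank_outliers(table):
    """Return object ids sorted from most to least outlying."""
    scores = outlier_scores(table)
    return sorted(scores, key=lambda obj: (-scores[obj], obj))


# Hypothetical information table with 5 objects and 2 attributes.
table = {
    "o1": ("low", "yes"),
    "o2": ("low", "yes"),
    "o3": ("low", "yes"),
    "o4": ("low", "no"),
    "o5": ("high", "no"),
}
ranking = rank_outliers(table)
print(ranking)
```

The object whose attribute values are shared by the fewest other objects is ranked first. A granularity-based weight per attribute could replace the uniform average, but the patent text does not state that formula.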

Technical effect of the present invention: the outlier data mining scheme proposed above achieves good data mining results, and the larger the number of outliers, the more evident the advantage of this outlier mining algorithm becomes. It also lays a foundation for further study of the regularity and dynamics of outliers in complex networks.

Description of drawings

Fig. 1 is a structural diagram of a rough set.

Fig. 2 shows the experimental results of the comparative outlier analysis on the lymph data set.

Embodiment

Embodiment 1:

The lymph data set is examined in batches: the number of outliers among the first 7 objects is detected first, then among the first 9 and the first 12 objects, and the outliers are obtained in order. The results are compared with those of the KNN and DIS algorithms.

The lymph data set is loaded into an information table IS(U, A), where U contains all 480 instances of the lymph data set and A contains its 9 attributes. Outlier detection is performed on IS(U, A). The experimental results are shown in Fig. 2.

The above application shows that, by using the rough feature selection method and a distance measure over similar knowledge granularities, uncertain information can be studied and the performance of the data features preserved while their number is reduced. Objects are then sorted by the given feature values to reduce computational complexity. The algorithm detects most of the outliers and is well suited to mining useful information.

The above is only a preferred embodiment of the present invention and is not intended to limit it. Those skilled in the art can obviously make various changes and modifications to the present invention without departing from its spirit and scope. If such changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to cover them as well.

Claims

1. A data mining method based on rough sets, characterized in that: the method proposes an approach to outlier data mining based on rough set theory and data mining techniques. Using a rough feature selection method and a distance measure over similar knowledge granularities, uncertain information is studied and the performance of the data features is preserved while their number is reduced. Objects are then sorted by the given feature values to reduce computational complexity. Finally, an experimental analysis is carried out on the lymph data set. The results show that the algorithm detects most of the outliers and is well suited to mining useful information.

2. The method according to claim 1, characterized in that a rough set embeds the classification of knowledge into the set itself, as a part of what constitutes the set. When deciding whether an object a belongs to a set X, three cases are usually distinguished: (1) object a may or may not belong to X; (2) object a definitely does not belong to X; (3) object a definitely belongs to X. Rough set theory regards knowledge as a partition of the universe, which gives knowledge its granularity.

3. The method according to claim 1, characterized in that outliers are detected using rough sets, the algorithm consisting of the following steps:

(1) Input the system information according to the initial state.

(2) Sort the information and partition it into equivalence classes.

(3) Check the number of attributes.

(4) Build a descending sequence of attributes.

(6) Check the number of attributes and the number of objects; otherwise, sort the outliers.