CN103150354A - Data mining algorithm based on rough set - Google Patents
- Publication number
- CN103150354A (application number CN2013100548420A)
- Authority
- CN
- China
- Prior art keywords
- data mining
- data
- rough
- knowledge
- algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
To improve outlier detection, the invention provides an outlier data mining method based on rough set theory and data mining techniques. Uncertain information is studied using a rough feature selection approach together with a distance measure based on similar knowledge granularity, so that the number of data features is reduced while their performance is retained. Objects are then sorted by the given feature values to reduce computational complexity. Finally, an experimental analysis is carried out on a lymph data set. The results show that the algorithm detects most of the outliers and, compared with conventional algorithms, improves outlier detection performance by about 10 to 20%.
Description
Technical field
The present invention relates to a data mining method based on rough sets and belongs to the field of computer information technology.
Technical background
With the development of modern communication technology, more and more data are collected and combined, which makes it possible to build large social networks. For example, a network of relationships between users can be built from e-mail logs, or a social network can be built from the contact information that users submit through web logs, online address books and similar channels. Present-day social networks are therefore much larger than early ones: they usually contain several thousand or tens of thousands of nodes, and some contain close to a million. Simple mathematics and manual processing cannot analyse such large and complex networks effectively. Data mining is the non-trivial process of discovering valid, novel, potentially useful and ultimately understandable patterns in large volumes of data. It emerged as a research field precisely to resolve the predicament of having massive data but lacking effective means of analysis. It already plays a major role in many areas, including bioinformatics and natural language processing.
To obtain the best data mining effect, a suitable algorithm and model are needed, and a new data mining algorithm for outliers is therefore proposed. It uses a rough feature selection method and a distance measure based on similar knowledge granularity to study uncertain information, retaining the performance of the data features while reducing their number. Objects are then sorted by the given feature values to reduce computational complexity.
Summary of the invention
The present invention proposes a data mining method based on rough sets. The method mainly addresses the problem of mining outliers and aims to obtain the best data mining effect.
To achieve this, the technical solution adopted by the invention is as follows. The method first uses a rough feature selection method and a distance measure based on similar knowledge granularity to study uncertain information, retaining the performance of the data features while reducing their number. Objects are then sorted by the given feature values to reduce computational complexity. Finally, an experimental analysis is carried out on a lymph data set. The results show that the algorithm detects most of the outliers.
The specific steps of the technical solution proposed by the invention are as follows.
A rough set embeds classification knowledge in the set itself, as part of what constitutes the set. Whether an object a belongs to a set X is usually divided into three cases: (1) a may or may not belong to X; (2) a certainly does not belong to X; (3) a certainly belongs to X. The definitions are given below.
Let U be a non-empty finite set and I an equivalence relation on U. The pair K = (U, I) is called an approximation space of U. Let X be a subset of U and x an object in U. The set of all objects indiscernible from x is denoted I(x), also written [x]_I; every object in I(x) has the same characteristic attributes as x. For every subset X ⊆ U and every equivalence relation I ∈ Ind(K), two subsets can be defined.
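The lower approximation of X with respect to I is given by formula (1):
$I_*(X) = \bigcup \{ Y \in U/I \mid Y \subseteq X \} = \{ x \in U \mid [x]_I \subseteq X \}$ (1)
The upper approximation of X with respect to I is given by formula (2):
$I^*(X) = \bigcup \{ Y \in U/I \mid Y \cap X \neq \emptyset \} = \{ x \in U \mid [x]_I \cap X \neq \emptyset \}$ (2)
The boundary region of X is given by formula (3):
$\mathrm{BND}(X) = I^*(X) - I_*(X)$ (3)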
BND(X) is the difference between the upper and lower approximations of X. If BND(X) is empty, X is said to be crisp (exact) with respect to I; if BND(X) is not empty, X is said to be a rough set with respect to I. The structure of such a set is shown in Fig. 1.
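As an illustration of these definitions, the following minimal Python sketch computes the lower approximation, upper approximation and boundary region of a target set. The function names, the toy universe and the single attribute used to define indiscernibility are all hypothetical and are not taken from the patent.

```python
from collections import defaultdict


def partition(universe, key):
    """Group objects into equivalence classes: objects with equal key(x) are indiscernible."""
    classes = defaultdict(set)
    for x in universe:
        classes[key(x)].add(x)
    return list(classes.values())


def approximations(universe, key, target):
    """Return (lower, upper, boundary) of `target` with respect to the partition U/I."""
    target = set(target)
    lower, upper = set(), set()
    for block in partition(universe, key):
        if block <= target:   # block lies entirely inside X: certainly in X
            lower |= block
        if block & target:    # block overlaps X: possibly in X
            upper |= block
    return lower, upper, upper - lower


# Toy data (hypothetical): objects 1..6 are indiscernible when they share an attribute value.
attribute = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c", 6: "c"}
U = set(attribute)
X = {1, 2, 3}
lower, upper, boundary = approximations(U, attribute.get, X)
print(lower, upper, boundary)
```

Here the boundary is non-empty because objects 3 and 4 are indiscernible but only 3 lies in X, so X is rough with respect to this partition.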
Rough set theory regards knowledge as a partition of the universe, which gives knowledge its granularity.
Let K = (U, Y) be a knowledge base and R ∈ Y an indiscernibility relation on the universe U. The granularity of knowledge R is denoted KG(R).
The discernibility (resolution) Dis(R) of knowledge R is
Dis(R) = 1 - KG(R) (5)
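The relationship between granularity and discernibility can be sketched in Python. The formula for KG(R) is not given in this text, so the sketch assumes a definition commonly used in the rough set literature, KG(R) = Σ|X_i|² / |U|², where the X_i are the equivalence classes of R. That definition, the function names and the sample column are assumptions for illustration only.

```python
from collections import Counter


def knowledge_granularity(values):
    """Assumed definition: KG(R) = sum(|X_i|^2) / |U|^2 over the classes X_i induced by `values`."""
    n = len(values)
    class_sizes = Counter(values).values()
    return sum(size * size for size in class_sizes) / (n * n)


def discernibility(values):
    """Dis(R) = 1 - KG(R), as in formula (5)."""
    return 1.0 - knowledge_granularity(values)


# Hypothetical attribute column: three equivalence classes of size 2 each.
column = ["a", "a", "b", "b", "c", "c"]
kg = knowledge_granularity(column)
dis = discernibility(column)
print(kg, dis)
```

Under this assumed definition, an attribute that cannot distinguish any objects has KG = 1 and Dis = 0, and finer partitions give smaller KG and larger Dis.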
Data mining is the process of extracting information and knowledge that is implicit, previously unknown and potentially useful from large amounts of incomplete, noisy, fuzzy and random real-world application data.
Outliers are detected using rough sets by the following steps:
(1) Input the information system in its initial state.
(2) Sort the information and partition it into equivalence classes.
(3) Check the number of attributes.
(4) Build a decreasing sequence of attributes.
(5) Repeat steps (2) and (3); otherwise compute the knowledge granularity and weight of each object.
(6) Check the number of attributes and the number of objects; otherwise sort the outliers.
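The steps above leave the scoring formula unspecified. The following Python sketch shows one possible reading, not the patented procedure itself: each attribute induces a partition, attributes are weighted by an assumed discernibility 1 - Σ|X_i|²/|U|², each object is scored by how small its equivalence classes are, and objects are sorted by decreasing score. Every function name, the weighting scheme and the sample table are hypothetical.

```python
from collections import Counter


def outlier_scores(rows):
    """Score each row; a higher score means more outlier-like (illustrative heuristic only)."""
    n = len(rows)
    n_attrs = len(rows[0])
    # One partition per attribute, stored as value -> class size.
    counts = [Counter(row[j] for row in rows) for j in range(n_attrs)]
    # Assumed discernibility of each attribute: 1 - sum(|X_i|^2) / |U|^2.
    dis = [1.0 - sum(c * c for c in cnt.values()) / (n * n) for cnt in counts]
    total = sum(dis)
    # Normalised attribute weights; fall back to uniform if no attribute discerns anything.
    weights = [d / total for d in dis] if total > 0 else [1.0 / n_attrs] * n_attrs
    scores = []
    for row in rows:
        # Rarity of the row's equivalence class under each single attribute.
        rarity = [1.0 - counts[j][row[j]] / n for j in range(n_attrs)]
        scores.append(sum(w * r for w, r in zip(weights, rarity)))
    return scores


def rank_outliers(rows):
    """Return row indices sorted from most to least outlier-like."""
    scores = outlier_scores(rows)
    return sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)


# Hypothetical information table with two categorical attributes.
table = [
    ("low", "yes"),
    ("low", "yes"),
    ("low", "yes"),
    ("low", "no"),
    ("high", "no"),
]
ranking = rank_outliers(table)
print(ranking)
```

In this toy table the last row ranks first because it falls into the smallest equivalence class under both attributes.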
Technical effect of the invention: the technical solution for outlier data mining proposed above improves the data mining effect, and the larger the number of outliers, the more clearly the advantage of this outlier mining algorithm shows. It also lays a foundation for further study of the regularity and dynamics of outliers in complex networks.
Description of drawings
Fig. 1 is a structural diagram of a rough set.
Fig. 2 shows the experimental results of the comparative outlier analysis on the lymph data set.
Embodiment
Embodiment 1:
The lymph data set is examined in batches: the number of outliers among the first 7 objects is detected first, then among the first 9 and the first 12 objects, and the outliers found are listed in order. The results are compared with those of the KNN and DIS algorithms.
The lymph data set is input as an information table IS(U, A), where U contains all 480 instances of the lymph data set and A contains its 9 attributes. Outlier detection is performed on IS(U, A). The experimental results are shown in Fig. 2.
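One way to read the batch comparison is as a count of known outliers among the top-ranked objects at cut-offs 7, 9 and 12. The Python sketch below illustrates that reading only. The ranking and the set of true outliers are made-up values, not results from the lymph data set, and the function name is hypothetical.

```python
def hits_at_k(ranking, true_outliers, ks=(7, 9, 12)):
    """For each cut-off k, count how many known outliers appear among the k top-ranked objects."""
    truth = set(true_outliers)
    return {k: sum(1 for obj in ranking[:k] if obj in truth) for k in ks}


# Hypothetical ranking of 15 object ids (most suspicious first) and hypothetical ground truth.
ranking = [3, 11, 0, 7, 14, 2, 9, 5, 1, 12, 4, 8, 6, 10, 13]
true_outliers = {3, 7, 9, 1, 6, 13}
result = hits_at_k(ranking, true_outliers)
print(result)
```

Running the same function on the rankings produced by different detectors gives directly comparable counts at each cut-off.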
As the above application shows, the method uses rough feature selection and a distance measure based on similar knowledge granularity to study uncertain information, retaining the performance of the data features while reducing their number. Objects are then sorted by the given feature values to reduce computational complexity. The algorithm detects most of the outliers and mines useful information well.
The above is only a preferred embodiment of the invention and does not limit the invention. Those skilled in the art can obviously make various changes and modifications to the invention without departing from its spirit and scope. If such changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.
Claims (3)
1. A data mining method based on rough sets, characterized in that: the method proposes an outlier data mining method based on rough set theory and data mining techniques. It uses a rough feature selection method and a distance measure based on similar knowledge granularity to study uncertain information, retaining the performance of the data features while reducing their number. Objects are then sorted by the given feature values to reduce computational complexity. Finally, an experimental analysis is carried out on a lymph data set. The results show that the algorithm detects most of the outliers and mines useful information well.
2. The method according to claim 1, characterized in that a rough set embeds classification knowledge in the set itself, as part of what constitutes the set. Whether an object a belongs to a set X is usually divided into three cases: (1) a may or may not belong to X; (2) a certainly does not belong to X; (3) a certainly belongs to X. Rough set theory regards knowledge as a partition of the universe, which gives knowledge its granularity.
3. The method according to claim 1, characterized in that outliers are detected using rough sets by the following steps:
(1) Input the information system in its initial state.
(2) Sort the information and partition it into equivalence classes.
(3) Check the number of attributes.
(4) Build a decreasing sequence of attributes.
(5) Repeat steps (2) and (3); otherwise compute the knowledge granularity and weight of each object.
(6) Check the number of attributes and the number of objects; otherwise sort the outliers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013100548420A CN103150354A (en) | 2013-01-30 | 2013-01-30 | Data mining algorithm based on rough set |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013100548420A CN103150354A (en) | 2013-01-30 | 2013-01-30 | Data mining algorithm based on rough set |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103150354A (en) | 2013-06-12 |
Family
ID=48548431
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013100548420A, pending, CN103150354A (en) | 2013-01-30 | 2013-01-30 | Data mining algorithm based on rough set |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103150354A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103699622A (en) * | 2013-12-19 | 2014-04-02 | 浙江工商大学 | Rough set and granular computing merged method for mining online data of distributed heterogeneous mass urban safety data flows |
CN104698838A (en) * | 2014-12-23 | 2015-06-10 | 清华大学 | Discourse domain based dynamic division and learning fuzzy scheduling rule mining method |
CN105245498A (en) * | 2015-08-28 | 2016-01-13 | 中国航天科工集团第二研究院七〇六所 | Attack digging and detecting method based on rough set |
CN105824785A (en) * | 2016-03-11 | 2016-08-03 | 中国石油大学(华东) | Rapid abnormal point detection method based on penalized regression |
CN110858974A (en) * | 2018-08-23 | 2020-03-03 | 华为技术有限公司 | Communication method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2002073530A1 (en) * | 2001-03-07 | 2002-09-19 | Rockwell Scientific Company, Llc | Data mining apparatus and method with user interface based ground-truth tool and user algorithms |
US20020169735A1 (en) * | 2001-03-07 | 2002-11-14 | David Kil | Automatic mapping from data to preprocessing algorithms |
US20090119281A1 (en) * | 2007-11-03 | 2009-05-07 | Andrew Chien-Chung Wang | Granular knowledge based search engine |
CN102142031A (en) * | 2011-03-18 | 2011-08-03 | 南京邮电大学 | Rough set-based mass data partitioning method |
Non-Patent Citations (1)
Title |
---|
Chen Yuming et al.: "Outlier data mining algorithm based on knowledge granularity", Computer Engineering and Applications * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103699622A (en) * | 2013-12-19 | 2014-04-02 | 浙江工商大学 | Rough set and granular computing merged method for mining online data of distributed heterogeneous mass urban safety data flows |
CN104698838A (en) * | 2014-12-23 | 2015-06-10 | 清华大学 | Discourse domain based dynamic division and learning fuzzy scheduling rule mining method |
CN104698838B (en) * | 2014-12-23 | 2017-03-29 | 清华大学 | Based on the fuzzy scheduling rule digging method that domain dynamic is divided and learnt |
CN105245498A (en) * | 2015-08-28 | 2016-01-13 | 中国航天科工集团第二研究院七〇六所 | Attack digging and detecting method based on rough set |
CN105824785A (en) * | 2016-03-11 | 2016-08-03 | 中国石油大学(华东) | Rapid abnormal point detection method based on penalized regression |
CN110858974A (en) * | 2018-08-23 | 2020-03-03 | 华为技术有限公司 | Communication method and device |
CN110858974B (en) * | 2018-08-23 | 2021-04-20 | 华为技术有限公司 | Communication method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Fournier-Viger et al. | VMSP: Efficient vertical mining of maximal sequential patterns | |
JP6047017B2 (en) | Pattern extraction apparatus and control method | |
CN103150354A (en) | Data mining algorithm based on rough set | |
CN109948125A (en) | Method and system of improved Simhash algorithm in text deduplication | |
CN105138916B (en) | Multi-trace rogue program characteristic detection method based on data mining | |
CN102306190A (en) | Method for dynamically updating rule set during changing process of attribute set in rough set | |
CN106980651B (en) | Crawling seed list updating method and device based on knowledge graph | |
CN103955542A (en) | Method of item-all-weighted positive or negative association model mining between text terms and mining system applied to method | |
CN105335368A (en) | Product clustering method and apparatus | |
CN107291877A (en) | A kind of Mining Frequent Itemsets based on Apriori algorithm | |
CN109145605A (en) | A kind of Android malware family clustering method based on SinglePass algorithm | |
Lee et al. | Hashnwalk: Hash and random walk based anomaly detection in hyperedge streams | |
Ren et al. | A weighted adaptive mean shift clustering algorithm | |
Yun et al. | An efficient approach for mining weighted approximate closed frequent patterns considering noise constraints | |
CN110096900A (en) | A kind of Frequent Pattern Mining method of efficient difference secret protection | |
Christen | Towards parameter-free blocking for scalable record linkage | |
Kusumakumari et al. | Frequent pattern mining on stream data using Hadoop CanTree-GTree | |
CN105677757A (en) | Big data similarity join method based on prefix-affix filtering | |
Wang et al. | A new way to choose splitting attribute in ID3 algorithm | |
CN106354753A (en) | Bayes classifier based on pattern discovery in data flow | |
Lin et al. | Mining of high average-utility patterns with item-level thresholds | |
CN103491074A (en) | Botnet detection method and device | |
CN109543049A (en) | Method and system for automatically pushing materials according to writing characteristics | |
CN107609110A (en) | The method for digging and device of maximum various frequent mode based on classification tree | |
Yin et al. | An efficient clustering algorithm for mixed type attributes in large dataset |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20130612 |