CN103150354A - Data mining algorithm based on rough set - Google Patents

Data mining algorithm based on rough set

Publication number
CN103150354A
Authority
CN
China
Prior art keywords
data mining
data
rough
knowledge
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013100548420A
Other languages
Chinese (zh)
Inventor
王少夫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN2013100548420A
Publication of CN103150354A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

To improve outlier detection, the invention provides an outlier data mining method based on rough set theory and data mining techniques. Uncertain information is studied using a rough feature selection approach together with a distance measure between similar knowledge granularities, so that the performance of the data is retained while the number of features is reduced. Objects are then sorted by the given feature values to reduce computational complexity. Finally, an experimental analysis is carried out on a lymph data set. The results show that the algorithm detects most of the outliers and that, compared with conventional algorithms, its outlier detection performance is improved by about 10 to 20%.

Description

A data mining algorithm based on rough sets
Technical field
The present invention relates to a data mining method based on rough sets and belongs to the field of computer information technology.
Technical background
With the development of modern communication technology, more and more data are being collected and combined, which makes it possible to build large social networks. For example, a network of relationships between users can be built from e-mail logs, or a social network can be built from the contact information that users submit through web logs, online address books and similar means. Present-day social networks are therefore much larger than early networks: they usually contain several thousand or tens of thousands of nodes, and some contain as many as a million nodes. Networks this large and complex cannot be analyzed effectively with simple mathematics and manual processing. Data mining is the non-trivial process of discovering valid, novel, potentially useful and ultimately understandable patterns in large amounts of data. It is the research field that arose to resolve the predicament of having large amounts of data but lacking effective means of analysis. It already plays a major role in many areas, including bioinformatics and natural language processing.
To obtain the best data mining effect, a model is built using a new algorithm for outlier data mining. A rough feature selection method and a distance measure between similar knowledge granularities are used to study uncertain information while retaining the performance of the data as the number of features is reduced. Objects are then sorted by the given feature values to reduce computational complexity.
Summary of the invention
The present invention proposes a data mining method based on rough sets. The method mainly addresses the problem of outlier data mining, so as to obtain the best data mining effect.
To achieve the above object, the technical solution adopted by the invention is as follows. First, the method uses a rough feature selection method and a distance measure between similar knowledge granularities to study uncertain information, retaining the performance of the data while reducing the number of features. Objects are then sorted by the given feature values to reduce computational complexity. Finally, an experimental analysis is carried out on the lymph data set. The results show that the algorithm can detect most of the outliers.
The specific steps of the proposed technical solution are as follows:
A rough set embeds the classification of knowledge into the set, as part of what constitutes the set. When judging whether an object a belongs to a set X, three cases are usually distinguished: (1) a may or may not belong to X; (2) a certainly does not belong to X; (3) a certainly belongs to X. The definitions are given below.
Let U be a non-empty finite set (the universe) and I an equivalence relation on U. The pair K = (U, I) is called an approximation space on U. Let X be a subset of U and x an object in U. The set of all objects indiscernible from x is denoted I(x) (written [x]_I in the formulas below); every object in I(x) has the same characteristic attributes as x. For every subset X ⊆ U and every equivalence relation I ∈ Ind(K), two subsets can be defined.
The lower approximation of X with respect to I is given by formula (1):
$$I_*(X) = \bigcup \{\, Y \in U/I \mid Y \subseteq X \,\} = \{\, x \in U \mid [x]_I \subseteq X \,\} \tag{1}$$
The upper approximation of X with respect to I is given by formula (2):
$$I^*(X) = \bigcup \{\, Y \in U/I \mid Y \cap X \neq \emptyset \,\} = \{\, x \in U \mid [x]_I \cap X \neq \emptyset \,\} \tag{2}$$
The boundary region of X is given by formula (3):
$$BND(X) = I^*(X) - I_*(X) \tag{3}$$
BND(X) is the difference between the upper and lower approximations of X. If BND(X) is empty, X is said to be crisp with respect to I; if BND(X) is not empty, X is said to be a rough set with respect to I. The structure of the set is shown in Fig. 1.
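As an illustration of formulas (1) to (3), the following is a minimal Python sketch that computes the lower approximation, the upper approximation and the boundary region of a set. The function names, the toy universe and the use of a key function to induce the equivalence relation are assumptions made for this example; they do not come from the patent.

```python
from collections import defaultdict

def equivalence_classes(universe, key):
    """Partition `universe` into classes of objects that share the same key(x)."""
    classes = defaultdict(set)
    for x in universe:
        classes[key(x)].add(x)
    return list(classes.values())

def lower_approximation(classes, X):
    """Union of the equivalence classes fully contained in X (formula 1)."""
    return set().union(*(Y for Y in classes if Y <= X))

def upper_approximation(classes, X):
    """Union of the equivalence classes that intersect X (formula 2)."""
    return set().union(*(Y for Y in classes if Y & X))

def boundary_region(classes, X):
    """Upper approximation minus lower approximation (formula 3)."""
    return upper_approximation(classes, X) - lower_approximation(classes, X)

# Toy example (hypothetical data): objects are indiscernible when they
# have the same remainder modulo 3, giving classes {1,4}, {2,5}, {3,6}.
U = {1, 2, 3, 4, 5, 6}
classes = equivalence_classes(U, key=lambda x: x % 3)
X = {1, 4, 5}

lower = lower_approximation(classes, X)
upper = upper_approximation(classes, X)
bnd = boundary_region(classes, X)
```

In this toy case the lower approximation is {1, 4}, the upper approximation is {1, 2, 4, 5}, and the boundary region {2, 5} is non-empty, so X is rough with respect to the chosen relation.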
Rough set theory regards knowledge as a partition of the universe, which gives knowledge a granularity.
Let K = (U, Y) be a knowledge base and R ∈ Y an indiscernibility relation on the universe U. The granularity of the knowledge R, denoted KG(R), is
$$KG(R) = \frac{|R|}{|U|^2} \tag{4}$$
The discernibility of the knowledge R, denoted Dis(R), is
$$Dis(R) = 1 - KG(R) \tag{5}$$
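Formulas (4) and (5) can be evaluated directly from the equivalence classes of R. Because R is an equivalence relation, it contains exactly the ordered pairs of objects that lie in the same class, so |R| equals the sum of the squared class sizes. The sketch below relies on that identity; the function names and the toy partitions are assumptions for illustration only.

```python
def knowledge_granularity(classes, universe_size):
    """KG(R) = |R| / |U|^2, with |R| computed as the sum of |Y|^2 over classes Y."""
    relation_size = sum(len(Y) ** 2 for Y in classes)
    return relation_size / universe_size ** 2

def discernibility(classes, universe_size):
    """Dis(R) = 1 - KG(R)."""
    return 1 - knowledge_granularity(classes, universe_size)

# Toy partitions of a six-object universe (hypothetical data).
three_pairs = [{1, 4}, {2, 5}, {3, 6}]
singletons = [{1}, {2}, {3}, {4}, {5}, {6}]
one_block = [{1, 2, 3, 4, 5, 6}]

kg_pairs = knowledge_granularity(three_pairs, 6)      # 12 / 36
kg_finest = knowledge_granularity(singletons, 6)      # 6 / 36
kg_coarsest = knowledge_granularity(one_block, 6)     # 36 / 36
```

The coarsest partition gives KG(R) = 1 and Dis(R) = 0, while the finest partition gives the smallest granularity, 1/|U|, and therefore the highest discernibility.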
Data mining is the process of extracting, from large amounts of incomplete, noisy, fuzzy and random real-world application data, the information and knowledge that are implicit in the data, not known in advance, and potentially useful.
Outliers are detected using rough sets by an algorithm consisting of the following steps:
(1) Input the information system according to its initial state.
(2) Sort the information and partition it into equivalence classes.
(3) Check the number of attributes.
(4) Build a descending sequence of attributes.
(5) Repeat steps 2 and 3; otherwise, compute the knowledge granularity and the weight of each object.
(6) Check the number of attributes, then the number of objects; otherwise, sort the outliers.
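The steps above leave the exact weighting and loop conditions unspecified, so the following Python sketch is only one possible reading, not the patented procedure. It covers the partitioning, weighting and ranking steps and omits the attribute ordering of step 4. The assumption, introduced here for illustration, is that an object receives a higher weight under an attribute when its equivalence class for that attribute is small, and that objects are ranked by their mean weight. The function name `rank_outliers` and the toy table are hypothetical.

```python
from collections import Counter

def rank_outliers(table, attributes):
    """Rank row indices of `table` from most to least outlying.

    Assumed scoring (not taken from the patent): for each attribute a,
    an object x gets weight 1 - |[x]_a| / |U|, and its score is the
    mean of these weights over all attributes.
    """
    n = len(table)
    # Partition the objects per attribute; only the class sizes are needed.
    class_sizes = {a: Counter(row[a] for row in table) for a in attributes}
    scored = []
    for index, row in enumerate(table):
        weights = [1 - class_sizes[a][row[a]] / n for a in attributes]
        scored.append((sum(weights) / len(weights), index))
    # Highest score first; ties are broken by the original row order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [index for _, index in scored]

# Toy information table (hypothetical data, two attributes).
table = [
    {"a": "x", "b": "p"},
    {"a": "x", "b": "p"},
    {"a": "x", "b": "p"},
    {"a": "x", "b": "q"},
    {"a": "y", "b": "q"},
]
ranking = rank_outliers(table, ["a", "b"])
```

In the toy table, row 4 holds the only "y" value and one of two "q" values, so it is ranked first, followed by row 3.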
Technical effect of the invention: the outlier data mining scheme proposed above can obtain the best data mining effect, and the larger the number of outliers, the more clearly the advantage of this outlier data mining algorithm appears. It lays a foundation for further research on the regularity and dynamics of outliers in complex networks.
Description of drawings
Fig. 1 shows the structure of a rough set.
Fig. 2 shows the experimental results of the comparative analysis of outliers on the lymph data set.
Embodiment
Embodiment 1:
The lymph data set is examined in batches: the number of outliers among the first 7 objects is determined first, then among the first 9 and the first 12 objects, and the outliers are obtained in order. The results are compared with the KNN and DIS algorithms.
The lymph data set is input as an information table IS(U, A), where U contains all 480 instances of the lymph data set and A contains its 9 attributes. Outlier detection is performed on IS(U, A). The experimental results are shown in Fig. 2.
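The batch evaluation described in this embodiment can be sketched as a count of known outliers among the first 7, 9 and 12 ranked objects. The Python sketch below assumes that a ranking of object identifiers and a set of known outliers are already available; the function name and all identifiers in the example are made up for illustration and are not results from the lymph data set.

```python
def outliers_in_top_k(ranking, known_outliers, ks=(7, 9, 12)):
    """For each k, count the known outliers among the first k ranked objects."""
    known = set(known_outliers)
    return {k: sum(1 for obj in ranking[:k] if obj in known) for k in ks}

# Hypothetical ranking and ground truth, for illustration only.
ranking = [3, 17, 8, 42, 5, 11, 29, 2, 30, 7, 19, 23, 1]
known_outliers = {3, 8, 5, 30, 23, 99}

hits = outliers_in_top_k(ranking, known_outliers)
```

Running the same count on the rankings produced by different detectors gives a like-for-like comparison at each cutoff.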
As the above application shows, a rough feature selection method and a distance measure between similar knowledge granularities are used to study uncertain information while retaining the performance of the data as the number of features is reduced. Objects are then sorted by the given feature values to reduce computational complexity. The algorithm can detect most of the outliers and mines useful information well.
The above are only preferred embodiments of the invention and are not intended to limit it. Those skilled in the art can obviously make various changes and modifications to the invention without departing from its spirit and scope. If such changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

Claims (3)

1. A data mining method based on rough sets, characterized in that: the method, based on rough set theory and data mining techniques, provides a method for outlier data mining. A rough feature selection method and a distance measure between similar knowledge granularities are used to study uncertain information while retaining the performance of the data as the number of features is reduced. Objects are then sorted by the given feature values to reduce computational complexity. Finally, an experimental analysis is carried out on the lymph data set. The results show that the algorithm can detect most of the outliers and mines useful information well.
2. The method according to claim 1, characterized in that a rough set embeds the classification of knowledge into the set, as part of what constitutes the set. When judging whether an object a belongs to a set X, three cases are usually distinguished: (1) a may or may not belong to X; (2) a certainly does not belong to X; (3) a certainly belongs to X. Rough set theory regards knowledge as a partition of the universe, which gives knowledge a granularity.
3. The method according to claim 1, characterized in that outliers are detected using rough sets by an algorithm consisting of the following steps:
(1) Input the information system according to its initial state.
(2) Sort the information and partition it into equivalence classes.
(3) Check the number of attributes.
(4) Build a descending sequence of attributes.
(5) Repeat steps 2 and 3; otherwise, compute the knowledge granularity and the weight of each object.
(6) Check the number of attributes, then the number of objects; otherwise, sort the outliers.
CN2013100548420A 2013-01-30 2013-01-30 Data mining algorithm based on rough set Pending CN103150354A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013100548420A CN103150354A (en) 2013-01-30 2013-01-30 Data mining algorithm based on rough set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013100548420A CN103150354A (en) 2013-01-30 2013-01-30 Data mining algorithm based on rough set

Publications (1)

Publication Number Publication Date
CN103150354A true CN103150354A (en) 2013-06-12

Family

ID=48548431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013100548420A Pending CN103150354A (en) 2013-01-30 2013-01-30 Data mining algorithm based on rough set

Country Status (1)

Country Link
CN (1) CN103150354A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002073530A1 (en) * 2001-03-07 2002-09-19 Rockwell Scientific Company, Llc Data mining apparatus and method with user interface based ground-truth tool and user algorithms
US20020169735A1 (en) * 2001-03-07 2002-11-14 David Kil Automatic mapping from data to preprocessing algorithms
US20090119281A1 (en) * 2007-11-03 2009-05-07 Andrew Chien-Chung Wang Granular knowledge based search engine
CN102142031A (en) * 2011-03-18 2011-08-03 南京邮电大学 Rough set-based mass data partitioning method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
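Chen Yuming (陈玉明) et al.: "Outlier data mining algorithm based on knowledge granularity" (基于知识粒度的异常数据挖掘算法), Computer Engineering and Applications (《计算机工程与应用》) *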

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699622A (en) * 2013-12-19 2014-04-02 浙江工商大学 Rough set and granular computing merged method for mining online data of distributed heterogeneous mass urban safety data flows
CN104698838A (en) * 2014-12-23 2015-06-10 清华大学 Discourse domain based dynamic division and learning fuzzy scheduling rule mining method
CN104698838B (en) * 2014-12-23 2017-03-29 清华大学 Based on the fuzzy scheduling rule digging method that domain dynamic is divided and learnt
CN105245498A (en) * 2015-08-28 2016-01-13 中国航天科工集团第二研究院七〇六所 Attack digging and detecting method based on rough set
CN105824785A (en) * 2016-03-11 2016-08-03 中国石油大学(华东) Rapid abnormal point detection method based on penalized regression
CN110858974A (en) * 2018-08-23 2020-03-03 华为技术有限公司 Communication method and device
CN110858974B (en) * 2018-08-23 2021-04-20 华为技术有限公司 Communication method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130612