CN113780372A - Heterogeneous characteristic mixed extraction method - Google Patents

Heterogeneous characteristic mixed extraction method

Info

Publication number
CN113780372A
Authority
CN
China
Prior art keywords
feature
heterogeneous
attribute
numerical
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110974424.8A
Other languages
Chinese (zh)
Inventor
乔付
刘瑶
郝博麟
刘忠艳
姜微
熊建芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lingnan Normal University
Original Assignee
Lingnan Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-08-24
Filing date
2021-08-24
Publication date
2021-12-10
Application filed by Lingnan Normal University
Priority to CN202110974424.8A
Publication of CN113780372A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for mixed extraction of heterogeneous features, which belongs to the field of pattern recognition and machine learning. A heterogeneous data attribute decision table contains both numerical feature attributes and category feature attributes. The heterogeneous data attributes are divided into a numerical feature attribute space and a category feature attribute space, the granularity of each sample on the union of the two spaces is calculated, and the approximate upper and lower limits of the target subset are then calculated, from which the mixed features can be extracted. This method is a key preprocessing step in pattern recognition and machine learning, and can provide accurate mixed features for the correct classification of data with heterogeneous feature attributes.

Figure 202110974424

Description

Heterogeneous characteristic mixed extraction method
Technical Field
The invention belongs to the field of pattern recognition and machine learning, and particularly relates to feature extraction.
Background Art
Feature extraction is a preprocessing step for pattern recognition and machine learning; correct classification is possible only if data features are extracted accurately. Feature extraction for homogeneous data is relatively straightforward, but heterogeneous data composed of both numerical feature attributes and category feature attributes requires a hybrid extraction method.
Disclosure of Invention
The invention aims to provide a heterogeneous feature hybrid extraction method that solves the problem of extracting mixed features from data in the field of pattern recognition and machine learning.
To achieve this purpose, the invention adopts the following technical scheme: a heterogeneous data attribute decision table contains numerical feature attributes and category feature attributes; the heterogeneous data attributes are divided into a numerical feature attribute space and a category feature attribute space; the granularity of each sample on the union of the two spaces is calculated; the approximate upper limit and the approximate lower limit of the target subset are calculated; and the mixed features can then be extracted.
Compared with the prior art, the invention has the following beneficial effect: the method solves the problem of mixed extraction from heterogeneous feature data and serves as a preprocessing step for data classification.
Drawings
Fig. 1 is a heterogeneous decision table.
Detailed Description
A heterogeneous feature mixed extraction method proceeds as follows. The decision table of a structured data information system can be expressed as:
$$DT = \langle U, A \rangle \tag{1}$$
where the universe $U = \{x_1, x_2, \ldots, x_n\}$ is a non-empty finite sample set, $A = \{a_1, a_2, \ldots, a_m\}$ is the feature attribute set, and $n$ and $m$ are arbitrary natural numbers.
Let $A = C \cup D$, where $C$ is the set of condition attributes and $D$ is the decision attribute. For arbitrary $x_i \in U$ and $B \subseteq C$, the neighborhood $\delta_B(x_i)$ of $x_i$ in the feature space $B$ can be expressed as:
$$\delta_B(x_i) = \{\, x_j \mid x_j \in U,\ \Delta_B(x_i, x_j) \le \delta \,\} \tag{2}$$
where $\Delta$ is a distance function, which may be the Manhattan, Euclidean, or Chebyshev distance, chosen according to the specific attributes in the decision table; $\delta$ is a threshold that may take any non-negative real value and determines the granularity of the neighborhood; and $i$ and $j$ are arbitrary positive integers.
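The neighborhood of equation (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function names (`manhattan`, `euclidean`, `chebyshev`, `neighborhood`) and the sample values are assumptions made here, and samples are identified by their list index.

```python
import math
from typing import Callable, Sequence, Set

Vector = Sequence[float]


def manhattan(a: Vector, b: Vector) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def euclidean(a: Vector, b: Vector) -> float:
    """Straight-line distance (math.dist requires Python 3.8+)."""
    return math.dist(a, b)


def chebyshev(a: Vector, b: Vector) -> float:
    """Largest absolute coordinate difference."""
    return max(abs(p - q) for p, q in zip(a, b))


def neighborhood(
    samples: Sequence[Vector],
    i: int,
    delta: float,
    distance: Callable[[Vector, Vector], float] = euclidean,
) -> Set[int]:
    """Indices j with distance(samples[i], samples[j]) <= delta, as in equation (2)."""
    if delta < 0:
        raise ValueError("delta must be a non-negative real number")
    return {j for j, s in enumerate(samples) if distance(samples[i], s) <= delta}


# Hypothetical numerical samples, not taken from Fig. 1.
samples = [(0.00, 0.00), (0.03, 0.04), (0.50, 0.50)]
for fn in (manhattan, euclidean, chebyshev):
    print(fn.__name__, sorted(neighborhood(samples, 0, 0.06, fn)))
```

With the threshold 0.06, the second sample falls inside the neighborhood of the first under the Euclidean distance (0.05) and the Chebyshev distance (0.04) but not under the Manhattan distance (0.07), which shows how the choice of distance function changes the granularity.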
Let $B_1 \subseteq A$ and $B_2 \subseteq A$ denote the numerical feature attribute set and the category feature attribute set of the heterogeneous feature attribute set, respectively. The neighborhood granularity of a sample $x$ on the feature attribute sets $B_1$, $B_2$ and $B_1 \cup B_2$ can then be expressed as:
$$\delta_{B_1}(x) = \{\, x_i \mid \Delta_{B_1}(x, x_i) \le \delta,\ x_i \in U \,\} \tag{3}$$
$$\delta_{B_2}(x) = \{\, x_i \mid \Delta_{B_2}(x, x_i) = 0,\ x_i \in U \,\} \tag{4}$$
$$\delta_{B_1 \cup B_2}(x) = \{\, x_i \mid \Delta_{B_1}(x, x_i) \le \delta \wedge \Delta_{B_2}(x, x_i) = 0,\ x_i \in U \,\} \tag{5}$$
where the operation $\wedge$ denotes conjunction and $i$ is any positive integer. Equation (3) covers numerical feature attributes, equation (4) covers category feature attributes, and equation (5) covers mixed numerical and category attributes. According to equations (3) and (4), samples in the same neighborhood have the same value on the category features, and their distance on the numerical features is within the threshold $\delta$.
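The three granules described above (numerical only, category only, and mixed) can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names, the use of the Euclidean distance for the numerical part, and the four sample rows are hypothetical and are not taken from the patent's figures.

```python
import math
from typing import Hashable, Sequence, Set, Tuple

Numeric = Tuple[float, ...]
Category = Tuple[Hashable, ...]


def numeric_neighborhood(numeric: Sequence[Numeric], i: int, delta: float) -> Set[int]:
    """Samples whose numerical distance to sample i does not exceed delta."""
    return {j for j, v in enumerate(numeric) if math.dist(numeric[i], v) <= delta}


def category_neighborhood(categorical: Sequence[Category], i: int) -> Set[int]:
    """Samples whose category values are all identical to those of sample i."""
    return {j for j, c in enumerate(categorical) if c == categorical[i]}


def mixed_neighborhood(
    numeric: Sequence[Numeric],
    categorical: Sequence[Category],
    i: int,
    delta: float,
) -> Set[int]:
    """Conjunction of both conditions: close numerically AND equal categorically."""
    return numeric_neighborhood(numeric, i, delta) & category_neighborhood(categorical, i)


# Hypothetical samples: one numerical and one category attribute each.
numeric = [(0.10,), (0.15,), (0.12,), (0.80,)]
categorical = [("a",), ("a",), ("b",), ("a",)]

print(sorted(numeric_neighborhood(numeric, 0, 0.1)))
print(sorted(category_neighborhood(categorical, 0)))
print(sorted(mixed_neighborhood(numeric, categorical, 0, 0.1)))
```

Writing the mixed granule as a set intersection is one way to express the conjunction: a sample belongs to it only if it satisfies both the numerical and the category condition.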
For arbitrary $X \subseteq U$, the two subsets of objects associated with $X$ in the decision table $\langle U, A \rangle$, i.e. the upper and lower approximations, can be expressed as:
$$\overline{A}X = \{\, x_i \mid \delta(x_i) \cap X \neq \emptyset,\ x_i \in U \,\}$$
$$\underline{A}X = \{\, x_i \mid \delta(x_i) \subseteq X,\ x_i \in U \,\}$$
Figure BDA0003227129230000034
i.e. the set of extracted features.
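A short sketch of the lower and upper approximations follows, assuming the standard neighborhood rough set definitions: a sample belongs to the lower approximation when its whole neighborhood lies inside the target subset, and to the upper approximation when its neighborhood shares at least one sample with it. The function names and the neighborhoods listed here are hypothetical.

```python
from typing import Sequence, Set


def lower_approximation(neighborhoods: Sequence[Set[int]], target: Set[int]) -> Set[int]:
    """Samples whose neighborhood is entirely contained in the target subset."""
    return {i for i, nb in enumerate(neighborhoods) if nb <= target}


def upper_approximation(neighborhoods: Sequence[Set[int]], target: Set[int]) -> Set[int]:
    """Samples whose neighborhood overlaps the target subset."""
    return {i for i, nb in enumerate(neighborhoods) if nb & target}


# Hypothetical neighborhoods; neighborhoods[i] is the granule of sample i.
neighborhoods = [{0, 1}, {0, 1}, {2, 3}, {2, 3, 4}, {3, 4}]
target = {0, 1, 2}

lower = lower_approximation(neighborhoods, target)
upper = upper_approximation(neighborhoods, target)
print(sorted(lower), sorted(upper), sorted(upper - lower))
```

The difference between the two sets contains the samples that cannot be assigned to the target subset with certainty at the chosen granularity.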
Consider a heterogeneous decision table whose data set consists of a numerical feature attribute and a category feature attribute, as shown in Fig. 1, where numerical_attribute is the numerical feature attribute, category_attribute is the category feature attribute, and Decision is the decision attribute.
Numerical feature attribute:
Figure BDA0003227129230000035
Category feature attribute:
Figure BDA0003227129230000036
With the threshold $\delta = 0.1$, the neighborhood granularity of each sample on the numerical feature attribute is computed according to equation (3) using the Euclidean distance:
Figure BDA0003227129230000037
Figure BDA0003227129230000038
The neighborhood granularity of each sample on the category feature attribute is computed according to equation (4):
Figure BDA0003227129230000039
The two subsets induced by the decision attribute are $X_1 = \{x_1, x_3, x_6\}$ and $X_2 = \{x_2, x_4, x_5\}$.
The neighborhood granularity of each sample on the combined numerical and category attributes is computed according to equation (5):
Figure BDA00032271292300000310
Figure BDA00032271292300000311
Figure BDA00032271292300000312
The approximate upper limit and approximate lower limit of $X_1$ and $X_2$ on the numerical and category feature attributes are then calculated according to the upper and lower approximation formulas:
$AX_1 = \{x_1, x_3, x_6\}$,
Figure BDA00032271292300000313
$AX_2 = \{x_2, x_4, x_5\}$,
Figure BDA00032271292300000314
Figure BDA0003227129230000041
and
Figure BDA0003227129230000042
are the sets of extracted features.
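The worked example can be traced end to end with a small script. The attribute values of Fig. 1 are not reproduced in this text, so the values below are hypothetical; they were chosen only so that the decision classes match $X_1 = \{x_1, x_3, x_6\}$ and $X_2 = \{x_2, x_4, x_5\}$ from the text. The threshold 0.1 and the Euclidean distance follow the description; all function and variable names are assumptions.

```python
import math

# Hypothetical decision table: name -> (numerical value, category value, decision).
samples = {
    "x1": ((0.10,), ("a",), 1),
    "x2": ((0.15,), ("b",), 2),
    "x3": ((0.50,), ("a",), 1),
    "x4": ((0.55,), ("b",), 2),
    "x5": ((0.90,), ("b",), 2),
    "x6": ((0.92,), ("a",), 1),
}
DELTA = 0.1  # threshold used in the worked example


def mixed_neighborhood(name, delta=DELTA):
    """Samples within delta numerically that also share the category value."""
    num, cat, _ = samples[name]
    return {
        other
        for other, (n, c, _) in samples.items()
        if math.dist(num, n) <= delta and c == cat
    }


def lower(target, delta=DELTA):
    """Lower approximation: neighborhood fully inside the target subset."""
    return {s for s in samples if mixed_neighborhood(s, delta) <= target}


def upper(target, delta=DELTA):
    """Upper approximation: neighborhood overlapping the target subset."""
    return {s for s in samples if mixed_neighborhood(s, delta) & target}


# Decision classes induced by the decision attribute.
X1 = {s for s, (_, _, d) in samples.items() if d == 1}
X2 = {s for s, (_, _, d) in samples.items() if d == 2}

print(sorted(lower(X1)), sorted(upper(X1)))
print(sorted(lower(X2)), sorted(upper(X2)))
```

In this hypothetical table, x1 and x2 are numerically close but differ in category, so only the mixed granule separates them; with it, the lower and upper approximations of each decision class coincide with the class itself.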
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (1)

1. A heterogeneous feature mixed extraction method, characterized in that the method is specifically as follows: let $B_1 \subseteq A$ and $B_2 \subseteq A$ denote the numerical feature attribute set and the category feature attribute set of the heterogeneous feature attribute set, respectively; the neighborhood granularity of a sample $x$ on the feature attribute sets $B_1$, $B_2$ and $B_1 \cup B_2$ can then be expressed as:
$$\delta_{B_1}(x) = \{\, x_i \mid \Delta_{B_1}(x, x_i) \le \delta,\ x_i \in U \,\} \tag{1}$$
$$\delta_{B_2}(x) = \{\, x_i \mid \Delta_{B_2}(x, x_i) = 0,\ x_i \in U \,\} \tag{2}$$
$$\delta_{B_1 \cup B_2}(x) = \{\, x_i \mid \Delta_{B_1}(x, x_i) \le \delta \wedge \Delta_{B_2}(x, x_i) = 0,\ x_i \in U \,\} \tag{3}$$
where the operation $\wedge$ denotes conjunction and $i$ is any positive integer; equation (1) covers numerical feature attributes, equation (2) covers category feature attributes, and equation (3) covers mixed numerical and category attributes; according to equations (1) and (2), samples have the same value on the category features, and their distance on the numerical features is within the threshold $\delta$;
for arbitrary $X \subseteq U$, the two subsets of objects associated with $X$ in the decision table $\langle U, A \rangle$, i.e. the upper and lower approximations, are expressed as:
$$\overline{A}X = \{\, x_i \mid \delta(x_i) \cap X \neq \emptyset,\ x_i \in U \,\}$$
$$\underline{A}X = \{\, x_i \mid \delta(x_i) \subseteq X,\ x_i \in U \,\}$$
Figure FDA0003227129220000019
is the extracted feature set.
CN202110974424.8A 2021-08-24 2021-08-24 Heterogeneous characteristic mixed extraction method Pending CN113780372A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110974424.8A CN113780372A (en) 2021-08-24 2021-08-24 Heterogeneous characteristic mixed extraction method

Publications (1)

Publication Number Publication Date
CN113780372A (en) 2021-12-10

Family

ID=78838796

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110974424.8A Pending CN113780372A (en) 2021-08-24 2021-08-24 Heterogeneous characteristic mixed extraction method

Country Status (1)

Country Link
CN (1) CN113780372A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140337255A1 (en) * 2013-05-07 2014-11-13 Wise Io, Inc. Scalable, memory-efficient machine learning and prediction for ensembles of decision trees for homogeneous and heterogeneous datasets
US20180121759A1 (en) * 2016-10-28 2018-05-03 International Business Machines Corporation Simultaneous feature extraction and dictionary learning using deep learning architectures for characterization of images of heterogeneous tissue samples
CN109101632A (en) * 2018-08-15 2018-12-28 中国人民解放军海军航空大学 Product quality abnormal data retrospective analysis method based on manufacture big data
CN111553127A (en) * 2020-04-03 2020-08-18 河南师范大学 A method and device for feature selection of multi-label text data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QINGHUA HU, DAREN YU, JINFU LIU, CONGXIN WU: "Neighborhood rough set based heterogeneous feature subset selection", INFORMATION SCIENCES, vol. 178, no. 18, pages 2 *
李志华; 顾言; 陈孟涛; 王士同; 陈秀宏: "Structural entropy clustering algorithm for heterogeneous data" (异构数据的结构熵聚类算法), Computer Science (计算机科学), no. 02 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20211210