CN119397222B

CN119397222B - Data feature extraction method based on artificial intelligence

Info

Publication number: CN119397222B
Application number: CN202411975543.5A
Authority: CN
Inventors: 孙勇; 胡泽平; 万青
Original assignee: Shenzhen Tobo Software Co ltd
Current assignee: Shenzhen Tobo Software Co ltd
Priority date: 2024-12-31
Filing date: 2024-12-31
Publication date: 2025-05-06
Anticipated expiration: 2044-12-31
Also published as: CN119397222A

Abstract

The application provides a data feature extraction method and system based on artificial intelligence. The method comprises: acquiring heterogeneous data of a plurality of clients; performing feature extraction according to the heterogeneous data to generate a standardized feature set; performing model aggregation according to the standardized feature set to determine important features; determining target data in the heterogeneous data according to the important features; performing data preprocessing according to the target data to generate standardized data; determining sensitive information in the standardized data according to the standardized data and a preset deep learning model; and determining target features according to the sensitive information and the standardized data. Data features can thus be extracted according to the importance of the data while private data is protected, which greatly reduces the processing time of data feature extraction and reduces the complexity and computational cost of data processing.

Description

Data feature extraction method based on artificial intelligence

Technical Field

The application relates to the technical field of feature extraction, in particular to a data feature extraction method based on artificial intelligence.

Background

In modern data processing systems, data feature extraction is a critical link in Artificial Intelligence (AI) technology. It involves extracting representative features from a large amount of raw data to facilitate subsequent model training and analysis.

However, with the continued development of data processing technology, the protection of private data is becoming more and more important. Private data refers to information related to personal privacy, such as name, identification card number, address, bank card number, etc. If such data is disclosed without authorization, it may pose a serious threat to personal privacy.

In the prior art, the importance of the data is not considered during feature extraction, so useless data is analyzed together with useful data. Analyzing this large volume of data makes feature extraction time-consuming and reduces efficiency. In addition, to protect private data, encryption must be applied to all of the data, which increases the complexity and computational cost of data processing.

Disclosure of Invention

In view of the foregoing, the present application has been developed to provide an artificial intelligence based data feature extraction method and system thereof that overcomes or at least partially solves the problems, including:

A data feature extraction method based on artificial intelligence, the method comprising:

Acquiring heterogeneous data of a plurality of clients;

Performing feature extraction according to the heterogeneous data to generate a standardized feature set;

Performing model aggregation processing according to the standardized feature set to determine important features;

Determining target data in the heterogeneous data according to the important features;

performing data preprocessing according to the target data to generate standardized data;

determining sensitive information in the standardized data according to the standardized data and a preset deep learning model;

and determining target characteristics according to the sensitive information and the standardized data.

Further, the step of generating a standardized feature set by feature extraction according to the heterogeneous data includes:

Performing principal component analysis processing on the heterogeneous data to generate dimension reduction data;

determining a feature set according to the dimension reduction data;

determining initial characteristics according to the characteristic set and a preset algorithm, wherein the preset algorithm is a LASSO algorithm;

And carrying out standardization processing on the initial features to generate a standardized feature set.

Further, the step of normalizing the initial feature to generate a normalized feature set includes:

determining a mean and variance corresponding to each of the initial features;

determining a sample value from the mean and the variance;

And generating the standardized feature set according to the mean value, the variance and the sample value.

Further, the step of determining important features by performing model aggregation processing according to the standardized feature set includes:

determining model parameters corresponding to a plurality of clients according to the standardized feature set;

carrying out weighted average treatment on the model parameters to obtain initial parameters;

Regularizing the initial parameters to obtain target parameters;

Generating an important feature model according to the target parameters and the standardized feature set;

And determining important features according to the important feature model.

Further, the step of performing data preprocessing according to the target data to generate standardized data includes:

Converting the target data according to a preset format to generate format data;

Determining local data belonging to a participant in the format data;

encrypting according to the local data to generate encrypted data;

Performing federated learning processing on the encrypted data to generate a standardized model;

And generating the standardized data according to the standardized model.

Further, the step of determining the sensitive information in the standardized data according to the standardized data and a preset deep learning model includes:

performing feature extraction on the standardized data through a convolutional neural network and a recurrent neural network to generate text features;

carrying out named entity recognition processing on the text features to determine key information;

and carrying out semantic analysis on the key information to determine the sensitive information.

Further, the step of determining the target feature from the sensitive information and the standardized data includes:

Performing feature extraction on the standardized data according to the sensitive information to generate an initial feature extraction model;

Determining a feature type corresponding to each extracted feature according to the initial feature extraction model and the sensitive information, wherein the feature type comprises filtering and normal;

Generating a target feature extraction model according to the extracted features whose feature type is normal, the initial feature extraction model and a preset evaluation index, wherein the preset evaluation index comprises one or more of accuracy, recall and F1 value;

And generating target features according to the target feature extraction model.

The embodiment of the application also discloses a data characteristic extraction system based on artificial intelligence, which comprises:

the acquisition module is used for acquiring heterogeneous data of a plurality of clients;

The first generation module is used for carrying out feature extraction according to the heterogeneous data to generate a standardized feature set;

The first determining module is used for determining important features by performing model aggregation processing according to the standardized feature set;

the second determining module is used for determining target data in the heterogeneous data according to the important characteristics;

The third determining module is used for carrying out data preprocessing according to the target data to generate standardized data;

The fourth determining module is used for determining sensitive information in the standardized data according to the standardized data and a preset deep learning model;

and a fifth determining module, configured to determine a target feature according to the sensitive information and the standardized data.

An embodiment of the present application also discloses a computer device, including a processor, a memory, and a computer program stored on the memory and capable of running on the processor, where the computer program when executed by the processor implements the steps of an artificial intelligence based data feature extraction method as described above.

An embodiment of the application also discloses a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, implements the steps of an artificial intelligence based data feature extraction method as described above.

The application has the following advantages:

In the embodiments of the application, the following problems of the prior art are addressed: the importance of the data is not considered during feature extraction, so useless data is analyzed together with useful data, and the large volume of analysis makes feature extraction time-consuming and inefficient; in addition, encryption must be applied to all of the data to protect private data, which increases the complexity and computational cost of data processing. The application provides a solution that extracts data features according to the importance of the data while protecting private data. Specifically, heterogeneous data of a plurality of clients is acquired; feature extraction is performed according to the heterogeneous data to generate a standardized feature set; model aggregation processing is performed according to the standardized feature set to determine important features; target data in the heterogeneous data is determined according to the important features; data preprocessing is performed according to the target data to generate standardized data; sensitive information in the standardized data is determined according to the standardized data and a preset deep learning model; and target features are determined according to the sensitive information and the standardized data.
Because the target data is determined according to the important features before preprocessing, sensitive information identification and target feature determination are carried out, data features are extracted according to the importance of the data while private data is protected, so that the processing time of data feature extraction is greatly shortened and the complexity and computational cost of data processing are reduced.

Drawings

In order to more clearly illustrate the technical solutions of the present application, the drawings that are needed in the description of the present application will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.

FIG. 1 is a flow chart of steps of an artificial intelligence based data feature extraction method according to an embodiment of the present application;

FIG. 2 is a block diagram of an artificial intelligence based data feature extraction system according to one embodiment of the present application;

fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

In order that the manner in which the above recited objects, features and advantages of the present application are obtained will become more readily apparent, a more particular description of the application briefly described above will be rendered by reference to the appended drawings. It will be apparent that the described embodiments are some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

Referring to fig. 1, a flowchart of steps in an artificial intelligence based data feature extraction method according to an embodiment of the present application is shown;

S110, heterogeneous data of a plurality of clients are obtained;

S120, carrying out feature extraction according to the heterogeneous data to generate a standardized feature set;

S130, performing model aggregation processing according to the standardized feature set to determine important features;

S140, determining target data in the heterogeneous data according to the important characteristics;

S150, preprocessing data according to the target data to generate standardized data;

S160, determining sensitive information in the standardized data according to the standardized data and a preset deep learning model;

S170, determining target characteristics according to the sensitive information and the standardized data.

Next, a data feature extraction method based on artificial intelligence in the present exemplary embodiment will be further described.

As described in the step S110, heterogeneous data of a plurality of clients is acquired.

It should be noted that the heterogeneous data is collected from multiple clients and may come from different fields or application scenarios, so it has different characteristics and distributions. For this reason it is difficult to find a unified method to process all heterogeneous data, and the model parameters of different clients may fail to converge, which reduces model performance. In another embodiment, data from the medical field may include patient history, physical examination results, and the like.

As described in the step S120, feature extraction is performed according to the heterogeneous data to generate a standardized feature set.

In one embodiment of the present invention, the specific process of "generating a standardized feature set based on feature extraction of the heterogeneous data" in step S120 may be further described in conjunction with the following description.

As will be described in the following steps,

S210, performing principal component analysis processing on the heterogeneous data to generate dimension reduction data;

S220, determining a feature set according to the dimension reduction data;

S230, determining initial characteristics according to the characteristic set and a preset algorithm, wherein the preset algorithm is a LASSO algorithm;

S240, carrying out standardization processing on the initial features to generate a standardized feature set.

It should be noted that principal component analysis (PCA) is a common dimension reduction technique for projecting high-dimensional data into a low-dimensional space while preserving the main features of the data as much as possible. The main purpose of PCA is to reduce the dimensionality of the data while minimizing information loss.

As an example, the heterogeneous data is first standardized so that each feature has a mean of 0 and a variance of 1, which eliminates the influence of different feature scales. A covariance matrix is then computed to reflect the linear relationships among the features of the data; for data with n features, the covariance matrix is an n × n matrix. Eigenvalue decomposition of the covariance matrix yields n eigenvalues and their corresponding eigenvectors, where each eigenvalue represents the variance of the data along the direction of its eigenvector and each eigenvector represents a projection direction. The eigenvalues are sorted in descending order and the eigenvectors corresponding to the k largest eigenvalues are selected, where k is the dimension to reduce to; these k eigenvectors form a new coordinate system. Projecting the heterogeneous data onto the selected eigenvectors yields the dimension reduction data. Representative features are extracted from the dimension reduction data to form a feature set, the LASSO (least absolute shrinkage and selection operator) algorithm performs feature selection on the feature set to obtain further optimized, more significant initial features, and the initial features are then standardized to generate the standardized feature set.
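The PCA portion of the example above can be sketched as follows. This is a minimal illustration using NumPy, not the patented implementation: the function name `pca_reduce`, the synthetic data and the choice of k = 2 are assumptions made for the example, and the LASSO selection step is omitted because the text does not specify the regression target it would be fitted against.

```python
import numpy as np

def pca_reduce(data, k):
    """Project `data` (rows = samples, columns = features) onto its top-k principal components."""
    # Standardize each feature to mean 0 and variance 1.
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    std[std == 0] = 1.0  # assumption: leave constant features unscaled
    standardized = (data - mean) / std

    # n x n covariance matrix of the n features.
    cov = np.cov(standardized, rowvar=False)

    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Sort in descending order and keep the k leading eigenvectors.
    order = np.argsort(eigenvalues)[::-1][:k]
    components = eigenvectors[:, order]

    # Projecting onto the new coordinate system gives the dimension reduction data.
    return standardized @ components, eigenvalues[order]

# Hypothetical input: 200 samples, 5 features, two of them strongly correlated.
rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 5))
samples[:, 1] = 2.0 * samples[:, 0] + 0.1 * rng.normal(size=200)

reduced, variances = pca_reduce(samples, k=2)
print(reduced.shape)  # (200, 2)
```

Each returned eigenvalue equals the sample variance of the data along the corresponding component, which is why sorting by eigenvalue keeps the directions that carry the most information.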

The initial features are normalized to generate a normalized feature set, as described in step S240.

In one embodiment of the present invention, the specific process of "normalizing the initial feature to generate a normalized feature set" described in step S240 may be further described in conjunction with the following description.

As will be described in the following steps,

S310, determining a mean and a variance corresponding to each initial feature;

S320, determining a sample value according to the mean value and the variance;

S330, generating the standardized feature set according to the mean value, the variance and the sample value.

It should be noted that the mean of each initial feature is adjusted to 0 and its variance is adjusted to 1 to generate the standardized feature set; through standardization, the data of different features can be compared on the same scale.

As one example, for each feature \(x_i\), its mean \(\mu_i\) and variance \(\sigma_i^2\) are calculated:

\[ \mu_i = \frac{1}{n} \sum_{j=1}^{n} x_{ij} \]

\[ \sigma_i^2 = \frac{1}{n} \sum_{j=1}^{n} (x_{ij} - \mu_i)^2 \]

where \(x_{ij}\) is the \(j\)-th sample value of the feature \(x_i\) and \(n\) is the total number of samples.

Each sample value is then normalized using the calculated mean and the standard deviation \(\sigma_i\), i.e. the square root of the variance:

\[ z_{ij} = \frac{x_{ij} - \mu_i}{\sigma_i} \]

The normalized data \(z_{ij}\) then has a mean of 0 and a variance of 1, and the normalized value \(z_{ij}\) replaces the original data \(x_{ij}\).
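The standardization formulas can be checked with a short NumPy sketch. The function name `standardize` and the sample matrix are illustrative assumptions; note that the code stores samples as rows and features as columns, so the feature index i of the formulas corresponds to a column here.

```python
import numpy as np

def standardize(features):
    """Z-score each column: z = (x - mu) / sigma."""
    mu = features.mean(axis=0)
    # Population variance (divide by n), matching the variance formula in the text.
    sigma = np.sqrt(((features - mu) ** 2).mean(axis=0))
    # Assumption: a constant column has sigma = 0 and is simply mapped to zeros.
    safe_sigma = np.where(sigma == 0, 1.0, sigma)
    return (features - mu) / safe_sigma

# Hypothetical data: 3 samples of 2 features on very different scales.
x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
z = standardize(x)
print(z.mean(axis=0), z.var(axis=0))
```

After standardization both columns contain the same values, which illustrates how features on different scales become directly comparable.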

As described in the step S130, the important features are determined by performing a model aggregation process according to the standardized feature set.

In one embodiment of the present invention, the specific process of determining important features by performing model aggregation processing according to the standardized feature sets in step S130 may be further described in conjunction with the following description.

As will be described in the following steps,

S410, determining model parameters corresponding to a plurality of clients according to the standardized feature set;

S420, carrying out weighted average processing on the model parameters to obtain initial parameters;

S430, regularizing the initial parameters to obtain target parameters;

S440, generating an important feature model according to the target parameters and the standardized feature set;

S450, determining important features according to the important feature model.

The extracted and normalized features, i.e., the standardized feature set, are subjected to model aggregation.

As an example, the model parameters of the different clients are first weighted and averaged to obtain the initial parameters. The initial parameters are then regularized to obtain the target parameters, so that the target parameters converge better. Finally, an important feature model is generated from the target parameters and the standardized feature set, and the important features are obtained from it.

In a specific implementation, the regularization process is L2 regularization, which can prevent overfitting.
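A minimal sketch of the aggregation steps, under several assumptions that the text does not state: each client contributes one parameter vector, the averaging weights are proportional to hypothetical client sample counts, the L2 regularization is applied as a closed-form shrinkage of the averaged parameters, and feature importance is read off the magnitude of the resulting weights.

```python
import numpy as np

def aggregate(client_params, client_sizes):
    """Weighted average of per-client parameter vectors (weights sum to 1)."""
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()
    return weights @ np.asarray(client_params, dtype=float)

def l2_shrink(params, lam):
    """Minimizer of 0.5*||w - params||^2 + 0.5*lam*||w||^2, which is params / (1 + lam)."""
    return params / (1.0 + lam)

# Hypothetical parameters from two clients holding 100 and 300 samples.
client_params = [[1.0, 0.0, 2.0],
                 [3.0, 0.2, 2.0]]
client_sizes = [100, 300]

initial = aggregate(client_params, client_sizes)  # initial parameters
target = l2_shrink(initial, lam=0.25)             # target parameters

# Assumption: rank features by the magnitude of their aggregated weight.
important = np.argsort(-np.abs(target))[:2]
print(initial, target, important)
```

The shrinkage pulls every parameter toward zero by the same factor, which is the usual effect of an L2 penalty and limits how far any single client can dominate the aggregated model.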

As described in the step S150, data preprocessing is performed according to the target data to generate standardized data.

In one embodiment of the present invention, the specific process of "generating standardized data by preprocessing data according to the target data" in step S150 may be further described in conjunction with the following description.

As will be described in the following steps,

S510, converting the target data into format data according to a preset format;

S520, determining local data belonging to the participant in the format data;

S530, carrying out encryption processing according to the local data to generate encrypted data;

S540, performing federated learning processing on the encrypted data to generate a standardized model;

S550, generating the standardized data according to the standardized model.

It should be noted that the preset format is a data format into which different types of data are converted so that secure multi-party computation can be performed on them. The data is encrypted for transmission, and model training and optimization are carried out in a distributed manner, so that knowledge is shared while data privacy is protected.

As an example, the target data is first unified into a data format on which secure multi-party computation can be performed. The local data belonging to each participant is then identified in the format data and is encrypted and perturbed to generate the encrypted data, which protects data privacy. Secure multi-party computation is used to encrypt the model parameters and to secure the collaborative training process; for example, the model parameters may be encrypted using the Paillier encryption algorithm or the ElGamal encryption algorithm so that only the participants can decrypt and access them. Federated learning processing is then performed on the encrypted data to generate the standardized model, during which a more effective defense mechanism against attacks may be introduced. Finally, the standardized data is obtained from the standardized model.
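To illustrate why an additively homomorphic scheme such as Paillier suits the aggregation of encrypted parameters, the following toy sketch implements textbook Paillier with the generator g = n + 1. It is an assumption-laden illustration only: the primes are far too small to be secure, the helper names are hypothetical, real-valued model parameters would first need a fixed-point integer encoding, and `pow(x, -1, n)` requires Python 3.8 or later.

```python
import math
import random

def keygen(p, q):
    """Build a Paillier key pair from two primes (toy sizes only, not secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid because the generator is g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """c = g^m * r^n mod n^2 with g = n + 1 and random r coprime to n."""
    n_sq = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(n + 1, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(n, private_key, c):
    """m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    lam, mu = private_key
    l_value = (pow(c, lam, n * n) - 1) // n
    return (l_value * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying two ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)

n, private_key = keygen(1009, 1013)
c_sum = add_encrypted(n, encrypt(n, 1200), encrypt(n, 345))
print(decrypt(n, private_key, c_sum))  # 1545
```

An aggregator can therefore sum contributions from several participants without ever seeing an individual value; only a holder of the private key can decrypt the total.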

In a specific implementation, text data is converted into numerical data, image data is converted into pixel-value data, and so on. Noise is added to the local data through a Laplace mechanism or a Gaussian mechanism, and the model parameters are encrypted through the Paillier encryption algorithm or the ElGamal encryption algorithm, so that only the participants can decrypt and access the model parameters.
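The noise-adding step can be sketched with the standard Laplace mechanism, in which the noise scale is the sensitivity divided by the privacy budget epsilon. The function name and the values chosen for `sensitivity` and `epsilon` are assumptions for illustration; the text does not specify them.

```python
import numpy as np

def laplace_mechanism(values, sensitivity, epsilon, rng):
    """Add Laplace noise with scale = sensitivity / epsilon to every value."""
    scale = sensitivity / epsilon
    return values + rng.laplace(loc=0.0, scale=scale, size=np.shape(values))

rng = np.random.default_rng(42)
local_data = np.array([0.2, 0.5, 0.9])  # hypothetical local values
noisy = laplace_mechanism(local_data, sensitivity=1.0, epsilon=0.5, rng=rng)
print(noisy)
```

A smaller epsilon gives a larger noise scale and therefore stronger privacy at the cost of accuracy, so the budget has to be chosen per application.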

As described in the step S160, sensitive information in the standardized data is determined according to the standardized data and a preset deep learning model.

In one embodiment of the present invention, the specific process of determining the sensitive information in the standardized data according to the standardized data and the preset deep learning model in step S160 may be further described in conjunction with the following description.

As will be described in the following steps,

S610, performing feature extraction on the standardized data through a convolutional neural network and a recurrent neural network to generate text features;

S620, carrying out named entity recognition processing on the text features to determine key information;

S630, carrying out semantic analysis on the key information to determine the sensitive information.

It should be noted that the sensitive information refers to information that may cause significant loss to an individual, an enterprise or a country once revealed, for example, personal identity information (such as an identification card number or a bank card number) and trade secrets (such as a customer list or a financial statement).

As an example, sensitive information identification is performed on the target data by utilizing a machine learning algorithm and natural language processing technology. Specifically, a deep learning model is trained on the data through a Convolutional Neural Network (CNN) and a recurrent neural network (RNN), and the sensitive privacy information in the data is identified.

In a specific implementation, the natural language processing technology uses word embedding to convert the text in the target data into vector representations that a machine learning model can process. Key information in the text is extracted through Named Entity Recognition (NER) and partial syntactic analysis, which improves the accuracy of sensitive information recognition. The extracted key information is then analyzed and judged comprehensively through semantic analysis and context understanding, so that multiple types of sensitive information are recognized and processed.
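The learned CNN/RNN and NER pipeline is too large to reproduce here, so the sketch below substitutes a plainly different technique, a rule-based regular-expression detector, purely to show the shape of the step's input and output (text in, labelled sensitive spans out). The patterns, labels and sample text are hypothetical; real identifiers would need checksum and context validation, which is the kind of judgment the semantic analysis stage is described as providing.

```python
import re

# Hypothetical patterns standing in for a trained recognizer.
PATTERNS = {
    # 17 digits followed by a digit or X, not embedded in a longer digit run.
    "id_card_number": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    # A standalone run of exactly 16 or exactly 19 digits.
    "bank_card_number": re.compile(r"(?<!\d)(?:\d{16}|\d{19})(?!\d)"),
}

def find_sensitive(text):
    """Return (label, matched_text) pairs in order of appearance."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(hits)]

# Synthetic sample text with made-up numbers.
sample = "ID 12345619900101123X was registered with card 1234567890123456."
print(find_sensitive(sample))
```

A rule-based detector like this only catches fixed formats; free-text items such as customer lists are the cases where the entity recognition and semantic analysis described above would be needed.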

As described in the step S170, a target feature is determined according to the sensitive information and the standardized data.

In one embodiment of the present invention, the specific process of "determining target features from the sensitive information and the standardized data" described in step S170 may be further described in conjunction with the following description.

As will be described in the following steps,

S710, carrying out feature extraction on the standardized data according to the sensitive information to generate an initial feature extraction model;

S720, determining a feature type corresponding to each extracted feature according to the initial feature extraction model and the sensitive information, wherein the feature type comprises filtering and normal;

S730, generating a target feature extraction model according to the extracted features whose feature type is normal, the initial feature extraction model and a preset evaluation index, wherein the preset evaluation index comprises one or more of accuracy, recall and F1 value;

S740, generating target features according to the target feature extraction model.

In the feature extraction process, the extracted features are filtered by using the result of the sensitive information identification, and the part containing the sensitive privacy information is shielded.

As an example, each extracted feature is checked for sensitive privacy information. If it contains such information, its feature type is determined to be filtering, and the feature is shielded from subsequent feature extraction and application; otherwise its feature type is normal. The feature extraction model is then evaluated using the accuracy, recall and F1 value among the preset indexes to obtain the target feature extraction model, which ensures the performance and stability of the model and the accuracy and reliability of the extracted features. The target features are obtained through the target feature extraction model.

In one embodiment, if the model evaluation result is not ideal, the model parameters or algorithms need to be adjusted to improve the performance of the model.
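The filtering and evaluation steps can be sketched in plain Python. The feature names, the set of sensitive names and the label vectors are hypothetical, and the metrics are computed for a binary task with 1 as the positive class; the text does not specify the evaluation task, so this only illustrates how the three indexes are derived.

```python
def filter_features(features, sensitive_names):
    """Label each feature 'filtering' or 'normal' and keep only the normal ones."""
    types = {name: ("filtering" if name in sensitive_names else "normal")
             for name in features}
    kept = {name: value for name, value in features.items()
            if types[name] == "normal"}
    return types, kept

def evaluate(y_true, y_pred):
    """Return (accuracy, recall, F1) for binary labels where 1 is positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, recall, f1

# Hypothetical extracted features; 'id_number' was flagged as sensitive.
features = {"age": 0.3, "id_number": 0.9, "visit_count": 0.5}
types, kept = filter_features(features, sensitive_names={"id_number"})

accuracy, recall, f1 = evaluate([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
print(types, sorted(kept), accuracy, recall, f1)
```

If the resulting scores fall below whatever threshold an implementer chooses, the adjustment of model parameters or algorithms mentioned above would be the next step.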

For system embodiments, the description is relatively simple as it is substantially similar to method embodiments, and reference is made to the description of method embodiments for relevant points.

Referring to FIG. 2, a block diagram of an artificial intelligence based data feature extraction system is shown, in accordance with one embodiment of the present application;

an artificial intelligence based data feature extraction system, the system comprising:

An obtaining module 210, configured to obtain heterogeneous data of a plurality of clients;

a first generating module 220, configured to perform feature extraction according to the heterogeneous data to generate a standardized feature set;

a first determining module 230, configured to determine important features by performing a model aggregation process according to the standardized feature set;

A second determining module 240, configured to determine target data in the heterogeneous data according to the important features;

a third determining module 250, configured to perform data preprocessing according to the target data to generate standardized data;

A fourth determining module 260, configured to determine sensitive information in the standardized data according to the standardized data and a preset deep learning model;

A fifth determining module 270 is configured to determine a target feature according to the sensitive information and the standardized data.

In an embodiment of the present invention, the first generating module 220 includes:

The first generation submodule is used for carrying out principal component analysis processing on the heterogeneous data to generate dimension reduction data;

a first determining sub-module for determining a feature set according to the dimension reduction data;

the second determining submodule is used for determining initial characteristics according to the characteristic set and a preset algorithm, wherein the preset algorithm is a LASSO algorithm;

and the second generation sub-module is used for carrying out standardization processing on the initial characteristics to generate a standardized characteristic set.

In an embodiment of the present invention, the second generating sub-module includes:

a first determining unit configured to determine a mean and a variance corresponding to each of the initial features;

a second determining unit configured to determine a sample value according to the mean and the variance;

A first generation unit for generating the normalized feature set according to the mean, the variance and the sample value.

In an embodiment of the present invention, the first determining module 230 includes:

a third determining sub-module, configured to determine model parameters corresponding to a plurality of clients according to the standardized feature set;

the first processing submodule is used for carrying out weighted average processing on the model parameters to obtain initial parameters;

The second processing sub-module is used for carrying out regularization processing on the initial parameters to obtain the target parameters;

the third generation sub-module is used for generating an important feature model according to the target parameters and the standardized feature set;

and the fourth determination submodule is used for determining important features according to the important feature model.

In an embodiment of the present invention, the third determining module 250 includes:

A fourth generation sub-module, configured to convert the target data according to a preset format to generate format data;

a fourth determining submodule, configured to determine local data belonging to a participant in the format data;

a fifth generation sub-module, configured to perform encryption processing according to the local data to generate encrypted data;

a sixth generation sub-module, configured to perform federated learning processing on the encrypted data to generate a standardized model;

and a seventh generation sub-module, configured to generate the standardized data according to the standardized model.
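One way to picture this preprocessing flow is the sketch below. It is only an illustration under strong assumptions: pairwise additive masking stands in for the encryption processing, the "standardized model" is taken to be a global mean and standard deviation learned from the masked local statistics, and a seeded pseudo-random generator replaces the key agreement a real protocol would need. None of these choices, nor any of the names, come from the application.

```python
import math
import random

def local_statistics(values):
    """Statistics a participant computes locally: count, sum, sum of squares."""
    return [float(len(values)), sum(values), sum(v * v for v in values)]

def pairwise_masks(n_participants, dim, seed=0):
    """Each pair (i, j) shares a random vector; i adds it and j subtracts it."""
    rng = random.Random(seed)  # stand-in for pairwise key agreement
    masks = [[0.0] * dim for _ in range(n_participants)]
    for i in range(n_participants):
        for j in range(i + 1, n_participants):
            shared = [rng.uniform(-1e3, 1e3) for _ in range(dim)]
            for k in range(dim):
                masks[i][k] += shared[k]
                masks[j][k] -= shared[k]
    return masks

def aggregate(masked_updates):
    """Server-side sum; the pairwise masks cancel, revealing only the totals."""
    dim = len(masked_updates[0])
    return [sum(u[k] for u in masked_updates) for k in range(dim)]

# Hypothetical local data held by three participants.
local_data = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
stats = [local_statistics(d) for d in local_data]
masks = pairwise_masks(len(stats), 3)
masked = [[s + m for s, m in zip(st, mk)] for st, mk in zip(stats, masks)]

count, total, total_sq = aggregate(masked)
global_mean = total / count
global_std = math.sqrt(max(total_sq / count - global_mean ** 2, 0.0))

# Each participant applies the shared model to its own data.
standardized_data = [[(v - global_mean) / global_std for v in d] for d in local_data]
```

Production secure aggregation typically works with modular integer arithmetic rather than floating-point masks; floats are used here only to keep the sketch short.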

In an embodiment of the present invention, the fourth determining module 260 includes:

the eighth generation submodule is used for carrying out feature extraction on the standardized data through a convolutional neural network and a recurrent neural network to generate text features;

a fifth determining submodule, configured to perform named entity recognition processing on the text features to determine key information;

and the sixth determination submodule is used for carrying out semantic analysis on the key information to determine the sensitive information.

In an embodiment of the present invention, the fifth determining module 270 includes:

a ninth generation sub-module, configured to perform feature extraction on the standardized data according to the sensitive information to generate an initial feature extraction model;

a seventh determining submodule, configured to determine a feature type corresponding to each extracted feature according to the initial feature extraction model and the sensitive information, where the feature type includes filtering and normal;

a tenth generation sub-module, configured to generate a target feature extraction model according to the extracted features whose feature type is normal, the initial feature extraction model, and a preset evaluation index, where the preset evaluation index includes one or more of an accuracy rate, a recall rate, and an F1 value;

and an eleventh generation sub-module for generating target features according to the target feature extraction model.
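The split into feature types and the preset evaluation index can be sketched as follows. The sketch assumes that a feature is typed "filtering" when it appears among the sensitive information and "normal" otherwise, and that the evaluation indices are computed for a binary task; the feature names, labels and predictions are hypothetical.

```python
def split_by_type(feature_names, sensitive_names):
    """Label each extracted feature as 'filtering' or 'normal'."""
    types = {
        name: ("filtering" if name in sensitive_names else "normal")
        for name in feature_names
    }
    normal = [name for name in feature_names if types[name] == "normal"]
    return types, normal

def evaluation_indices(y_true, y_pred):
    """Accuracy, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "recall": recall, "f1": f1}

# Hypothetical extracted features and sensitive information.
extracted = ["id_number", "purchase_count", "phone", "region"]
sensitive = {"id_number", "phone"}
feature_types, normal_features = split_by_type(extracted, sensitive)

# Hypothetical validation labels and predictions of a candidate model.
indices = evaluation_indices([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
```

A candidate model built on `normal_features` could then be accepted as the target feature extraction model once its indices reach a chosen threshold; the threshold itself is not given in the application.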

Referring to FIG. 3, a computer device for implementing the artificial intelligence based data feature extraction method of the present invention may specifically include the following:

The computer device 12 described above is in the form of a general purpose computing device and the components of the computer device 12 may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.

Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.

The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (commonly referred to as a "hard disk drive"). Although not shown in fig. 3, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk such as a CD-ROM, DVD-ROM, or other optical media may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. The memory may include at least one program product having a set (e.g., at least one) of program modules 42, the program modules 42 being configured to carry out the functions of embodiments of the invention.

A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, a memory, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules 42, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.

The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, camera, etc.), one or more devices that enable a user to interact with the computer device 12, and/or any devices (e.g., network card, modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet, through network adapter 20. As shown, network adapter 20 communicates with other modules of computer device 12 via bus 18. It should be appreciated that although not shown in FIG. 3, other hardware and/or software modules may be used in connection with computer device 12, including, but not limited to, microcode, device drivers, redundant processing units 16, external disk drive arrays, RAID systems, tape drives, and data backup storage system 34, among others.

The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, to implement an artificial intelligence-based data feature extraction method provided by an embodiment of the present invention.

When executing the program, the processing unit 16 implements the following steps: acquiring heterogeneous data of a plurality of clients; performing feature extraction according to the heterogeneous data to generate a standardized feature set; performing model aggregation according to the standardized feature set to determine important features; determining target data in the heterogeneous data according to the important features; performing data preprocessing according to the target data to generate standardized data; determining sensitive information in the standardized data according to the standardized data and a preset deep learning model; and determining target features according to the sensitive information and the standardized data.

In an embodiment of the present application, the present application further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements an artificial intelligence based data feature extraction method as provided in all embodiments of the present application:

When the program is executed by a processor, the following steps are implemented: acquiring heterogeneous data of a plurality of clients; performing feature extraction according to the heterogeneous data to generate a standardized feature set; performing model aggregation according to the standardized feature set to determine important features; determining target data in the heterogeneous data according to the important features; performing data preprocessing according to the target data to generate standardized data; determining sensitive information in the standardized data according to the standardized data and a preset deep learning model; and determining target features according to the sensitive information and the standardized data.

Any combination of one or more computer readable media may be employed. The computer readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.

While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the application.

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or terminal that comprises the element.

The data feature extraction method and system based on artificial intelligence provided by the present application have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present application, and the above description of the examples is intended only to aid in understanding the method and core concept of the present application. Meanwhile, for those skilled in the art, the specific embodiments and the scope of application may vary in accordance with the concept of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims

1. A method for extracting data features based on artificial intelligence, the method comprising:

acquiring heterogeneous data of a plurality of clients;

Performing feature extraction according to the heterogeneous data to generate a standardized feature set, performing principal component analysis on the heterogeneous data to generate dimension reduction data, determining a feature set according to the dimension reduction data, determining initial features according to the feature set and a preset algorithm, wherein the preset algorithm is a LASSO algorithm, performing standardization on the initial features to generate a standardized feature set, determining a mean value and a variance corresponding to each initial feature, determining a sample value according to the mean value and the variance, and generating the standardized feature set according to the mean value, the variance and the sample value;

Determining target features according to the sensitive information and the standardized data, which comprises: performing feature extraction on the standardized data according to the sensitive information to generate an initial feature extraction model; determining a feature type corresponding to each extracted feature according to the initial feature extraction model and the sensitive information, wherein the feature types comprise filtering and normal; generating a target feature extraction model according to the extracted features whose feature type is normal, the initial feature extraction model and a preset evaluation index, wherein the preset evaluation index comprises one or more of accuracy, recall rate and F1 value; and generating the target features according to the target feature extraction model.

2. The method of claim 1, wherein the step of performing model aggregation according to the standardized feature set to determine important features comprises:

regularizing according to the initial parameters to obtain target parameters;

And determining important features according to the important feature model.

3. The method of claim 1, wherein the step of generating standardized data by data preprocessing from the target data comprises:

Determining local data belonging to a participant in the format data;

encrypting according to the local data to generate encrypted data;

And generating the standardized data according to the standardized model.

4. The method of claim 1, wherein the step of determining sensitive information in the standardized data according to the standardized data and a preset deep learning model comprises:

5. An artificial intelligence based data feature extraction system, the system comprising:

The first generation module is used for carrying out feature extraction according to the heterogeneous data to generate a standardized feature set; performing principal component analysis processing on the heterogeneous data to generate dimension reduction data, determining a feature set according to the dimension reduction data, determining initial features according to the feature set and a preset algorithm, wherein the preset algorithm is a LASSO algorithm, performing standardization processing on the initial features to generate a standardized feature set, determining a mean value and a variance corresponding to each initial feature, determining a sample value according to the mean value and the variance, and generating the standardized feature set according to the mean value, the variance and the sample value;

The fifth determining module is used for determining target features according to the sensitive information and the standardized data, which comprises: performing feature extraction on the standardized data according to the sensitive information to generate an initial feature extraction model; determining a feature type corresponding to each extracted feature according to the initial feature extraction model and the sensitive information, wherein the feature types comprise filtering and normal; generating a target feature extraction model according to the extracted features whose feature type is normal, the initial feature extraction model and a preset evaluation index, wherein the preset evaluation index comprises one or more of accuracy, recall rate and F1 value; and generating the target features according to the target feature extraction model.

6. A computer device comprising a processor, a memory and a computer program stored on the memory and capable of running on the processor, which computer program, when executed by the processor, implements the method of any one of claims 1 to 4.

7. A computer readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the method according to any of claims 1 to 4.