CN114816506B - Quick processing method and device for model features, storage medium and electronic equipment
- Publication number
- CN114816506B (CN202210425982.3A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/71—Version control; Configuration management
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2148—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/30—Creation or generation of source code
- G06F8/35—Creation or generation of source code model driven
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/44—Encoding
- G06F8/443—Optimisation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/60—Software deployment
- G06F8/61—Installation
- G06F8/63—Image based installation; Cloning; Build to order
Abstract
The embodiment of the invention discloses a method and a device for rapidly processing model features, a storage medium and electronic equipment. The method comprises the following steps: determining a plurality of functional modules corresponding to different processing types of the model features, and generating a configuration subfile to be configured for each functional module; classifying the features in the data set to be processed according to feature type; determining the corresponding configuration subfiles to be configured for the different types of features, acquiring the configuration parameters of each configuration subfile, and generating a complete configuration file; and processing the features of the corresponding type based on the generated complete configuration file.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for rapidly processing model features, a storage medium and electronic equipment.
Background
With the growing popularity of artificial intelligence, more and more business scenarios are handled with the help of algorithmic models, in which features play a decisive role. Feature processing accounts for most of the workload in the whole modeling process, and once a model is online it is equally important that the feature calculation logic used for prediction stays consistent with the feature calculation logic used during training.
In the current modeling process, a developer has to attend to the processing logic itself, which demands a high level of business expertise and involves a heavy workload. Moreover, to keep the feature calculation logic of the online prediction model consistent with that used during training, code is often developed once for training and then developed again for online prediction, which wastes time and labor.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention aims to provide a method and a device for rapidly processing model features, a storage medium and electronic equipment, so that each time features are processed for model training, a developer only needs to attend to the processing steps rather than the processing logic, and the model features can be processed simply by configuring the corresponding configuration files.
A further object of the invention is to provide a method and a device for rapidly processing model features, a storage medium and electronic equipment, in which a processing image is generated from the model feature processing procedure, so that the feature calculation logic after the model goes online is consistent with the feature calculation logic during training.
In order to achieve the above object, the present invention provides a method for rapidly processing model features, comprising the steps of:
Determining a plurality of functional modules corresponding to different processing types of the model features, and respectively generating configuration subfiles to be configured for each functional module;
Classifying the features in the data set to be processed according to the feature types;
Determining the corresponding configuration subfiles to be configured for the different types of features, acquiring the configuration parameters of each configuration subfile, and generating a complete configuration file;
And processing the corresponding type of features based on the generated complete configuration file.
Optionally, in the above method embodiments of the present invention, the plurality of functional modules corresponding to different processing types of the model features include a missing value processing module, a standardization processing module, a binning processing module, and an encoding processing module.
Optionally, in the above embodiments of the present invention, the configuration subfile to be configured has configuration parameters that are provided for a developer to configure.
Optionally, in the above embodiments of the present invention, in the step of classifying the features in the data set to be processed according to the feature types, the features in the data set to be processed are classified into a numerical type feature and a category type feature, and a numerical type feature list and a category type feature list are generated.
Optionally, in the foregoing embodiments of the present invention, in the step of determining the corresponding configuration subfiles to be configured for the different types of features, acquiring the configuration parameters of each configuration subfile, and generating a complete configuration file, the configuration subfiles to be configured that are required for feature processing are determined for the numerical type features, the configuration parameters are acquired based on the determined configuration subfiles to be configured, and the configuration subfiles are integrated after their configuration parameters have been acquired, to obtain the complete configuration file for processing the numerical type features.
Optionally, in the above embodiments of the present invention, for the numerical type features, the determined configuration subfiles to be configured include a configuration subfile to be configured for missing value processing, a configuration subfile to be configured for standardization processing, and a configuration subfile to be configured for binning processing.
Optionally, in the foregoing embodiments of the present invention, in the step of determining the corresponding configuration subfiles to be configured for the different types of features, acquiring the configuration parameters of each configuration subfile, and generating a complete configuration file, the configuration subfiles to be configured that are required for feature processing are determined for the category type features, the configuration parameters are acquired based on the determined configuration subfiles to be configured, and the configuration subfiles are integrated after their configuration parameters have been acquired, to obtain the complete configuration file for processing the category type features.
Optionally, in the above embodiments of the present invention, for the category type features, the determined configuration subfiles to be configured include a configuration subfile to be configured for missing value processing and a configuration subfile to be configured for encoding processing.
Optionally, in the above embodiments of the present invention, after the processing of the corresponding type of features based on the generated complete configuration file, the method further includes:
Generating a processing image from the feature processing performed during the training phase.
In order to achieve the above object, the present invention further provides a device for rapidly processing model features, comprising:
The to-be-configured configuration subfile generating module is used for determining a plurality of functional modules corresponding to different processing types of the model features and generating a configuration subfile to be configured for the implementation of each functional module;
The feature classification module is used for classifying the features in the data set to be processed according to the feature types;
The configuration file generation module is used for determining the corresponding configuration subfiles to be configured for the different types of features, acquiring the configuration parameters of each configuration subfile, and generating a complete configuration file;
And the feature processing module is used for processing the features of the corresponding types based on the configuration file generated by the configuration file generating module.
To achieve the above object, the present invention also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above-described model feature rapid processing method.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored in the memory and running on the processor, wherein the processor executes the computer program to realize the steps of the model feature rapid processing method.
Compared with the prior art, the method and the device for rapidly processing model features, the storage medium and the electronic equipment perform feature processing by means of configuration files, so that each time features are processed for model training a developer only needs to attend to the processing steps rather than the processing logic; configuring the corresponding configuration file is sufficient, which greatly shortens the feature processing flow. In addition, the model feature processing procedure is automatically packaged into an image file, which avoids developing the processing twice (once for training and once for going online) and ensures that the processing is consistent between the two.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing embodiments of the present invention in more detail with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, and not constitute a limitation to the invention. In the drawings, like reference numerals generally refer to like parts or steps.
FIG. 1 is a flow chart of a method for quickly processing model features according to an exemplary embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a model feature rapid processing apparatus according to an exemplary embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device provided in an exemplary embodiment of the present invention.
Detailed Description
Hereinafter, exemplary embodiments according to the present invention will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some embodiments of the present invention and not all embodiments of the present invention, and it should be understood that the present invention is not limited by the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
It will be appreciated by those of skill in the art that the terms "first," "second," etc. in embodiments of the present invention are used merely to distinguish between different steps, devices or modules, etc., and do not represent any particular technical meaning nor necessarily logical order between them.
It should also be understood that in embodiments of the present invention, "plurality" may refer to two or more, and "at least one" may refer to one, two or more.
It should also be appreciated that any component, data, or structure referred to in an embodiment of the invention may be generally understood as one or more without explicit limitation or the contrary in the context.
In addition, the term "and/or" in the present invention merely describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. In the present invention, the character "/" generally indicates an "or" relationship between the objects before and after it.
It should also be understood that the description of the embodiments of the present invention emphasizes the differences between the embodiments, and that the same or similar features may be referred to each other, and for brevity, will not be described in detail.
Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, the techniques, methods, and apparatus should be considered part of the specification.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Embodiments of the invention are operational with numerous other general purpose or special purpose computing system environments or configurations with electronic devices, such as terminal devices, computer systems, servers, etc. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with the terminal device, computer system, server, or other electronic device include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, small computer systems, mainframe computer systems, and distributed cloud computing technology environments that include any of the foregoing, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc., that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment in which tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computing system storage media including memory storage devices.
Exemplary method
Fig. 1 is a flow chart of a method for quickly processing model features according to an exemplary embodiment of the present invention. The embodiment can be applied to electronic equipment, as shown in fig. 1, the method for rapidly processing model features comprises the following steps:
Step 101, determining a plurality of functional modules corresponding to different processing types of the model features, and generating configuration subfiles to be configured for the implementation of each functional module respectively.
In the modeling process, the processing of model features mainly comprises missing value processing of the features, standardization processing and binning processing of the numerical type features, and encoding processing of the category type features. In the embodiment of the invention, the processing of the model features is therefore divided into a plurality of functional modules corresponding to the different processing types, namely a missing value processing module, a standardization processing module, a binning processing module and an encoding processing module, and a corresponding configuration subfile is generated for each functional module to implement the function of that module.
The missing value processing module is used to fill the missing values of the model features. In general, filling missing values requires configuration parameters such as the filling mode, whether to fill with a specified value, the filling method for numerical type features, and the filling method for category type features. The filling modes include a simple sampling mode and a kNN-algorithm sampling mode; a user can configure either one as needed, and the missing values are filled according to the sampling result of the configured mode. For example, for feature a: 1, 2, null, 4, 2, assuming that the configured filling mode is simple sampling and the result of the system's simple sampling is 2, the missing value (null) is filled with 2, and the result of processing the missing value of feature a through the configuration subfile is 1, 2, 2, 4, 2. Mean filling, median filling or frequency filling can be configured as the filling method for numerical type features, and frequency filling can be selected as the filling method for category type features. For the missing value processing module, the configuration subfile to be configured is therefore generated for these configuration parameters.
In the embodiment of the present invention, the configuration subfiles to be configured by the missing value processing module are exemplified as follows:
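The example listing itself is not reproduced in this text. Purely as an illustrative sketch, a missing value configuration subfile and its application to the feature a example above might look as follows. All key names and the helper fill_missing are hypothetical, and the sketch uses frequency filling rather than the simple sampling or kNN sampling modes, which are not implemented here.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical configuration subfile for missing value processing.
# Every key name and value below is an illustrative assumption.
missing_value_config = {
    "fill_mode": "simple",               # "simple" or "knn"; not used by this sketch
    "fill_value": None,                  # a specified value to fill with, or None
    "numeric_strategy": "most_frequent", # "mean", "median" or "most_frequent"
    "category_strategy": "most_frequent",
}


def fill_missing(values, strategy, fill_value=None):
    """Fill None entries of one feature column according to the strategy."""
    present = [v for v in values if v is not None]
    if fill_value is not None:
        filler = fill_value
    elif strategy == "mean":
        filler = mean(present)
    elif strategy == "median":
        filler = median(present)
    elif strategy == "most_frequent":
        filler = Counter(present).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [filler if v is None else v for v in values]


feature_a = [1, 2, None, 4, 2]
filled = fill_missing(
    feature_a,
    missing_value_config["numeric_strategy"],
    missing_value_config["fill_value"],
)
print(filled)
```

With frequency filling, the value 2 occurs most often in feature a, so the missing entry is filled with 2, which matches the result described above.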
The standardization processing module is used to standardize the numerical type features. In general, the configuration parameters for standardizing numerical type features mainly comprise the processing mode and the feature columns to be standardized. The processing mode can be configured as maximum absolute value scaling, min-max scaling, or standard scaling. Maximum absolute value scaling first obtains the maximum absolute value of each feature column and then divides each feature by that value for the corresponding column, so that the features are scaled into the interval [-1, 1]. Min-max scaling first obtains the maximum value and the minimum value of each feature column, then subtracts the minimum value of the corresponding column from each feature and divides the result by the difference between the maximum value and the minimum value, so that the features are scaled into the interval [0, 1]. Standard scaling removes the mean from the features and scales them to unit variance, so that the processed features have a mean of 0 and a standard deviation of 1, matching the parameters of a standard normal distribution. The feature column parameter specifies which feature columns of the model features are to be standardized.
In the embodiment of the present invention, an example of a configuration subfile to be configured by the standardized processing module is as follows:
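The example listing is not reproduced in this text. As a hedged sketch only, a standardization configuration subfile and the three scaling modes could be expressed as below. The key names and the helper scale_column are hypothetical, min-max scaling is written in its conventional form (x - min) / (max - min), and the code assumes that a column is not constant.

```python
from statistics import mean, pstdev

# Hypothetical configuration subfile for standardization processing.
# Key names and values are illustrative assumptions.
standardize_config = {
    "method": "min_max",   # "max_abs", "min_max" or "standard"
    "columns": ["age"],    # feature columns to be standardized
}


def scale_column(values, method):
    """Scale one numerical feature column with the configured method."""
    if method == "max_abs":
        max_abs = max(abs(v) for v in values)
        return [v / max_abs for v in values]           # result lies in [-1, 1]
    if method == "min_max":
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]  # result lies in [0, 1]
    if method == "standard":
        mu, sigma = mean(values), pstdev(values)
        return [(v - mu) / sigma for v in values]      # mean 0, standard deviation 1
    raise ValueError(f"unknown method: {method}")


data = {"age": [10, 20, 30, 40, 50]}
scaled = {
    col: scale_column(data[col], standardize_config["method"])
    for col in standardize_config["columns"]
}
print(scaled)
```

Only the columns named in the configuration are touched, so the developer chooses the method and the columns without writing scaling logic.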
The binning processing module is used to bin the numerical type features. Binning refers to dividing a continuous value into a plurality of intervals and treating the values in each interval as one class; for example, after binning, age can be divided into teenagers, young people, middle-aged people and elderly people. The binned result can be encoded with 0-1 encoding or numerical encoding, and the binning strategy can make either the width of each bin or the number of feature values in each bin consistent. The configuration parameters for binning include the feature columns to be binned, the method used to encode the transformed result, and the strategy used to define the bin widths. The encoding method can be configured as one of onehot, onehot-dense and ordinal: the onehot method encodes the transformed result with one-hot encoding and returns a sparse matrix, with ignored features always stacked to the right; the onehot-dense method encodes the transformed result with one-hot encoding and returns a dense array, with ignored features always stacked to the right; and the ordinal method returns the bin identifiers encoded as integers. The strategy used to define the bin widths can be configured as one of uniform, quantile and kmeans: with the uniform strategy, all bins in each feature have the same width; with the quantile strategy, all bins in each feature contain the same number of points; and with the kmeans strategy, the values in each bin share the same nearest center of a one-dimensional k-means cluster.
In the embodiment of the invention, the configuration subfiles to be configured by the binning processing module are exemplified as follows:
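The example listing is not reproduced in this text. The following is a minimal sketch, under stated assumptions, of what a binning configuration subfile might contain and how the uniform strategy with ordinal or dense one-hot output could be computed. The key names, the n_bins parameter and both helper functions are hypothetical; the quantile and kmeans strategies and the sparse onehot output are not implemented.

```python
# Hypothetical configuration subfile for binning processing.
# Key names and values are illustrative assumptions.
binning_config = {
    "columns": ["age"],     # feature columns to be binned
    "n_bins": 4,            # number of bins (assumed parameter)
    "encode": "ordinal",    # "onehot", "onehot-dense" or "ordinal"
    "strategy": "uniform",  # "uniform", "quantile" or "kmeans"
}


def uniform_bin(values, n_bins):
    """Equal-width binning: return an integer bin identifier per value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The column maximum would fall just past the last edge, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def one_hot_dense(bin_ids, n_bins):
    """Dense 0-1 encoding of integer bin identifiers."""
    return [[1 if i == b else 0 for i in range(n_bins)] for b in bin_ids]


ages = [0, 15, 30, 45, 60, 80]
bins = uniform_bin(ages, binning_config["n_bins"])
print(bins)
print(one_hot_dense(bins, binning_config["n_bins"]))
```

Here the range 0 to 80 is split into four bins of width 20, so the ordinal output is one bin identifier per age and the dense one-hot output is one row of four 0-1 flags per age.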
The encoding processing module is used to encode the category type features so that they can participate in numerical calculation. The configuration parameters for encoding include the feature columns to be encoded and the encoding mode, where the encoding mode can be configured as 0-1 encoding or category encoding. 0-1 encoding, namely one-hot, means that the feature is encoded with 0-1 values; for example, if the sex feature takes the values male and female, and the sex feature of one piece of data is male, it is encoded as [1, 0]. Category encoding, namely ordinal, means that the feature is encoded by category, with codes starting from 1; for example, if the sex feature takes the values male and female, and the sex feature of one piece of data is female, it is encoded as 2.
In the embodiment of the present invention, an example of a configuration subfile to be configured by the encoding processing module is as follows:
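The example listing is not reproduced in this text. As an illustrative sketch only, an encoding configuration subfile and the two encoding modes described above could look like this. The key names and the helper encode_column are hypothetical, and the category order is assumed to be supplied explicitly.

```python
# Hypothetical configuration subfile for encoding processing.
# Key names and values are illustrative assumptions.
encoding_config = {
    "columns": ["sex"],   # feature columns to be encoded
    "method": "onehot",   # "onehot" (0-1 encoding) or "ordinal" (category encoding)
}


def encode_column(values, categories, method):
    """Encode one category type feature column.

    `categories` fixes the category order, for example ["male", "female"].
    """
    index = {c: i for i, c in enumerate(categories)}
    if method == "onehot":
        return [
            [1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values
        ]
    if method == "ordinal":
        return [index[v] + 1 for v in values]  # codes start from 1
    raise ValueError(f"unknown method: {method}")


categories = ["male", "female"]
onehot = encode_column(["male"], categories, "onehot")
ordinal = encode_column(["female"], categories, "ordinal")
print(onehot, ordinal)
```

With the category order male, female, the value male becomes [1, 0] under 0-1 encoding and the value female becomes 2 under category encoding, as in the examples above.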
Step 102, classifying the features in the data set to be processed according to the feature types.
In the embodiment of the invention, the model features included in the data set comprise numerical type features and category type features. The data set to be processed is therefore initialized, and the model features in it are classified according to feature type to determine which are numerical type features and which are category type features. A numerical type feature list is generated from the determined numerical type features, and a category type feature list is generated from the determined category type features.
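The classification step can be sketched as follows. This is only one possible rule, assumed for illustration: a column is treated as numerical when all of its non-missing values are numbers, and as a category type feature otherwise. The function name and the dictionary layout of the data set are hypothetical.

```python
from numbers import Number


def classify_features(dataset):
    """Split column names into a numerical list and a category list.

    `dataset` maps a column name to its list of values; None marks a
    missing value. A column is numerical when every non-missing value
    is a number (booleans are treated as categories here).
    """
    numeric_features, category_features = [], []
    for name, values in dataset.items():
        present = [v for v in values if v is not None]
        is_numeric = bool(present) and all(
            isinstance(v, Number) and not isinstance(v, bool) for v in present
        )
        (numeric_features if is_numeric else category_features).append(name)
    return numeric_features, category_features


dataset = {
    "age": [23, None, 41],
    "income": [3.5, 4.0, None],
    "sex": ["male", "female", None],
}
numeric_list, category_list = classify_features(dataset)
print(numeric_list, category_list)
```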
Step 103, determining corresponding configuration subfiles to be configured according to different types of features, obtaining configuration parameters of each configuration subfile, and generating a complete configuration file.
In the embodiment of the invention, for the numerical type features in the numerical type feature list, the configuration subfiles to be configured that are required for feature processing are determined, namely the configuration subfile to be configured for missing value processing, the configuration subfile to be configured for standardization processing, and the configuration subfile to be configured for binning processing. Based on the determined configuration subfiles to be configured, the developer configures the corresponding configuration parameters as needed. After the developer has configured the parameters, the configured parameters are acquired, and the configuration subfiles are integrated to obtain a complete configuration file for processing the numerical type features.
Similarly, for the category type features in the category type feature list, the configuration subfiles to be configured that are required for feature processing are determined, namely the configuration subfile to be configured for missing value processing and the configuration subfile to be configured for encoding processing. Based on the determined configuration subfiles to be configured, the developer configures the corresponding configuration parameters as needed. After the developer has configured the parameters, the configured parameters are acquired, and the configuration subfiles are integrated to obtain a complete configuration file for processing the category type features.
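The integration of configured subfiles into a complete configuration file can be sketched as below. The structure of the merged file, the subfile names and the function build_complete_config are assumptions made for illustration; only the pairing of feature types with required subfiles is taken from the description.

```python
import json

# Subfiles required per feature type, following the description above.
# The identifiers themselves are hypothetical.
REQUIRED_SUBFILES = {
    "numeric": ["missing_value", "standardize", "binning"],
    "category": ["missing_value", "encoding"],
}


def build_complete_config(feature_type, features, subfiles):
    """Merge the configured subfiles for one feature type into one config."""
    steps = []
    for name in REQUIRED_SUBFILES[feature_type]:
        if name not in subfiles:
            raise KeyError(f"subfile not configured: {name}")
        steps.append({"module": name, "params": subfiles[name]})
    return {"feature_type": feature_type, "features": features, "steps": steps}


# Parameters as a developer might have configured them (illustrative values).
configured = {
    "missing_value": {"strategy": "most_frequent"},
    "encoding": {"method": "onehot"},
}
category_config = build_complete_config("category", ["sex"], configured)
print(json.dumps(category_config, indent=2))
```

Keeping the steps as an ordered list makes the processing order explicit in the complete configuration file, and a missing subfile is reported before any feature is processed.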
Step 104, processing the corresponding type of features based on the configuration file generated in step 103.
After the complete configuration file is generated, it defines only which processing is applied to the features and how that processing is performed; the generated complete configuration file and the corresponding features are then input into a processing system for feature processing. Specifically, for the features in the numerical type feature list, the complete configuration file for processing the numerical type features is input into the processing system, and the features in the numerical type feature list are processed one by one based on that configuration file, so as to achieve rapid feature processing. For the features in the category type feature list, the complete configuration file for processing the category type features is input into the processing system, and the features in the category type feature list are processed one by one based on that configuration file, so as to achieve rapid feature processing.
Optionally, in an embodiment of the present invention, after step 104, a method for quickly processing model features further includes:
Step 105, generating a processing image from the feature processing procedure of step 104 in the training phase.
Specifically, during the feature processing of step 104, the configuration file is input to the processing system, and the system state in which feature processing is performed based on the configuration file to obtain the feature processing result is packaged into an image file.
In general, feature processing occurs in two processes, training and prediction. The inputs to both are raw data, so for the two processes to produce the same results, their processing must be consistent. For example, suppose the missing values of a certain feature are filled by frequency. A large amount of data is available during training, so the most frequent value can be determined; during prediction, however, only a single piece of data is available, so prediction depends on the training process. The prior art solves this problem by developing code separately for the training process and for online prediction, which is time-consuming and labor-intensive. In the embodiment of the invention, the configuration file is input into the processing system in the training stage, and the training process in which feature processing is performed based on the configuration file to obtain the feature processing result is packaged into an image file. This avoids developing one set of code for training and another for online prediction, and ensures the consistency of the data.
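The dependence of prediction on training-time state can be illustrated with a small sketch. It does not build an image file: serializing the learned state to JSON is used here only as a simplified stand-in for packaging the processing state, and the function names are hypothetical.

```python
import json
from collections import Counter


def fit_frequency_fill(values):
    """Training: learn the most frequent non-missing value of a column."""
    present = [v for v in values if v is not None]
    return {"fill_value": Counter(present).most_common(1)[0][0]}


def apply_frequency_fill(values, state):
    """Fill missing values using state learned during training."""
    return [state["fill_value"] if v is None else v for v in values]


# Training phase: many rows are available, so the frequency can be computed.
training_column = ["red", "blue", "red", None, "red"]
state = fit_frequency_fill(training_column)
saved = json.dumps(state)  # stand-in for packaging the state into an image

# Online prediction: a single record, processed with the saved state.
restored = json.loads(saved)
online_result = apply_frequency_fill([None], restored)
print(online_result)
```

A single prediction record carries no frequency information of its own, so the fill value has to come from the state saved at training time; reusing that state is what keeps the two processes consistent.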
Exemplary apparatus
Fig. 2 is a schematic structural diagram of a model feature rapid processing apparatus according to an exemplary embodiment of the present invention. As shown in fig. 2, a rapid model feature processing device of the present embodiment includes:
the to-be-configured configuration sub-file generating module 201 is configured to determine a plurality of functional modules corresponding to different processing types of the model feature, and generate to-be-configured configuration sub-files for implementation of each functional module respectively.
In the modeling process, the processing of the model features mainly comprises the processing of missing values of the features, the standardized processing and the binning processing of the digital type features and the encoding processing of the type features, and in the embodiment of the invention, the processing of the model features is divided into a plurality of functional modules corresponding to different processing types of the model features, namely the missing value processing module, the standardized processing module, the binning processing module and the encoding processing module, and corresponding configuration subfiles are generated for each functional module respectively to realize corresponding functions of each functional module.
The missing value processing module is configured to fill in missing values of the model features. In general, missing-value filling needs to consider configuration parameters such as the filling mode, whether to fill with a specified value, the filling method for numerical features, and the filling method for category features. The filling modes include a simple sampling mode and a kNN-algorithm sampling mode; the user configures one of them as needed, and missing values are filled according to the sampling result of the chosen mode. For example, for feature a: 1, 2, null, 4, 2, assume the configured filling mode is simple sampling. If the system's simple sampling returns 2, the missing value (null) is filled with 2, and the result of processing the missing value of feature a through the configuration subfile is 1, 2, 2, 4, 2. For numerical features, the filling method can be configured as mean filling, median filling or frequency filling; for category features, frequency filling can be selected. For the missing value processing module, the to-be-configured configuration subfile is therefore generated for these configuration parameters.
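The mean, median and frequency filling methods described above can be sketched as follows. This is a minimal illustration only: the function name `fill_missing` and its parameters are hypothetical, `None` stands in for a missing value, and the simple-sampling and kNN-sampling modes mentioned in the text are not covered.

```python
from statistics import mean, median, mode


def fill_missing(values, strategy="frequency", fill_value=None):
    """Return a copy of `values` with None entries replaced.

    `strategy` and `fill_value` are hypothetical parameter names that
    mirror the configuration parameters described in the text.
    """
    observed = [v for v in values if v is not None]
    if fill_value is not None:
        replacement = fill_value          # an explicitly specified value wins
    elif strategy == "mean":
        replacement = mean(observed)
    elif strategy == "median":
        replacement = median(observed)
    elif strategy == "frequency":
        replacement = mode(observed)      # most frequent observed value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [replacement if v is None else v for v in values]


feature_a = [1, 2, None, 4, 2]
filled = fill_missing(feature_a, strategy="frequency")
print(filled)
```

With frequency filling, the most frequent observed value of feature a is 2, so the filled feature matches the 1, 2, 2, 4, 2 result given in the text's example.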
The normalization processing module is used to normalize numerical features. In general, the configuration parameters for normalizing numerical features mainly comprise the processing mode and the feature columns to be normalized. The processing mode can be configured as maximum-absolute-value scaling, min-max scaling or standardization scaling. Maximum-absolute-value scaling first obtains the maximum absolute value of each feature column and then divides each feature by the maximum absolute value of its column, so that the feature is scaled into the interval [-1, 1]. Min-max scaling first obtains the maximum and minimum of each feature column, then subtracts the column minimum from each feature and divides by the difference between the maximum and the minimum, so that the feature is scaled into the interval [0, 1]. Standardization scaling removes the mean from the feature and divides by the standard deviation, so that the processed feature has a mean of 0 and a standard deviation of 1. The configuration parameter for the feature columns specifies which feature columns of the model features are to be normalized.
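The three scaling modes can be sketched in a few lines. The sketch below assumes the conventional definitions (division by the maximum absolute value, (x - min) / (max - min), and (x - mean) / standard deviation); the function names are hypothetical, and a constant column would need separate handling because it leads to division by zero.

```python
from statistics import mean, pstdev


def max_abs_scale(column):
    """Divide by the largest absolute value; result lies in [-1, 1]."""
    peak = max(abs(v) for v in column)
    return [v / peak for v in column]


def min_max_scale(column):
    """Map the column minimum to 0 and the column maximum to 1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def standardize(column):
    """Remove the mean and divide by the population standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


column = [-4.0, 0.0, 2.0, 4.0]
print(max_abs_scale(column))  # [-1.0, 0.0, 0.5, 1.0]
print(min_max_scale(column))  # [0.0, 0.5, 0.75, 1.0]
print(standardize(column))
```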
The binning processing module is used to bin numerical features. Binning refers to dividing a continuous value into several intervals and treating each interval as one class. For example, after binning, ages can be divided into teenagers, young people, middle-aged people and elderly people. The binned result can be encoded with 0-1 encoding or numerical encoding, and the binning strategy can keep either the width of each bin or the number of features in each bin consistent. The configuration parameters for binning include the feature columns to be binned, the method used to encode the converted result, and the strategy used to define the bin widths. The encoding method can be configured as one of onehot, onehot-dense and ordinal. The onehot method encodes the converted result with one-hot encoding and returns a sparse matrix; ignored features are always stacked to the right. The onehot-dense method encodes the converted result with one-hot encoding and returns a dense array; ignored features are always stacked to the right. The ordinal method returns the bin identifiers encoded as integers. The strategy used to define the bin widths can be configured as one of uniform, quantile and kmeans. With the uniform strategy, all bins in each feature have the same width. With the quantile strategy, all bins in each feature contain the same number of points. With the kmeans strategy, the values in each bin share the same nearest center of a one-dimensional k-means cluster.
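To make the uniform strategy and the ordinal and one-hot encodings concrete, here is a small sketch in plain Python. The function names and the age values are illustrative assumptions, a dense list stands in for the onehot-dense output, and the quantile and kmeans strategies are not shown.

```python
def uniform_edges(values, n_bins):
    """Bin edges for the 'uniform' strategy: all bins have the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def ordinal_encode(value, edges):
    """Return the integer bin identifier of `value`.

    An interior edge belongs to the bin on its right; the last bin also
    contains the maximum value.
    """
    n_bins = len(edges) - 1
    for i in range(1, n_bins):
        if value < edges[i]:
            return i - 1
    return n_bins - 1


def one_hot(index, n_bins):
    """Dense one-hot vector for a bin identifier."""
    return [1 if i == index else 0 for i in range(n_bins)]


ages = [5, 17, 30, 45, 62, 80]
edges = uniform_edges(ages, 3)                      # [5.0, 30.0, 55.0, 80.0]
codes = [ordinal_encode(a, edges) for a in ages]    # [0, 0, 1, 1, 2, 2]
dense = [one_hot(c, 3) for c in codes]
print(edges, codes, dense[2])
```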
The encoding processing module is used to encode category features so that they can take part in numerical calculation. The configuration parameters for encoding comprise the feature columns to be encoded and the encoding mode, which can be configured as 0-1 encoding or category encoding. 0-1 encoding converts the feature into a 0-1 vector. For example, if the gender feature takes the values male and female and the gender of one record is male, the encoded feature is [1, 0]. Category encoding assigns each category an integer code, starting from 1. For example, if the gender feature takes the values male and female and the gender of one record is female, the encoded feature is 2.
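The two encoding modes can be illustrated with the gender example from the text. The helper names below are hypothetical, and the category order (first-seen order) is an assumption, since the text does not state how the order is determined.

```python
def fit_categories(values):
    """Record the distinct categories in first-seen order (an assumption)."""
    return list(dict.fromkeys(values))


def zero_one_encode(value, categories):
    """0-1 encoding: one position per category, 1 marks the match."""
    return [1 if value == c else 0 for c in categories]


def category_encode(value, categories):
    """Category encoding: integer codes that start from 1."""
    return categories.index(value) + 1


categories = fit_categories(["male", "female", "male"])
print(zero_one_encode("male", categories))    # [1, 0]
print(category_encode("female", categories))  # 2
```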
The feature classification module 202 classifies features in the data set to be processed according to feature types.
In the embodiment of the invention, the model features in the data set comprise numerical features and category features. The data set to be processed is therefore initialized, and its model features are classified by feature type into numerical features and category features. A numerical type feature list is generated from the numerical features so determined, and a category type feature list is generated from the category features so determined.
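A sketch of this classification step follows. The text does not say how the feature type is determined, so the rule used here (a column is numerical when every observed value is a non-boolean number) is an assumption, as are the function name and the sample records.

```python
from numbers import Number


def classify_features(rows):
    """Split column names into a numerical list and a category list."""
    numeric, category = [], []
    for name in rows[0]:
        observed = [r[name] for r in rows if r[name] is not None]
        is_numeric = bool(observed) and all(
            isinstance(v, Number) and not isinstance(v, bool) for v in observed
        )
        (numeric if is_numeric else category).append(name)
    return numeric, category


rows = [
    {"age": 31, "income": 5200.0, "gender": "male"},
    {"age": None, "income": 4100.5, "gender": "female"},
]
numeric_features, category_features = classify_features(rows)
print(numeric_features)   # ['age', 'income']
print(category_features)  # ['gender']
```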
The configuration file generating module 203 is configured to determine corresponding configuration subfiles to be configured for different types of features, obtain configuration parameters of each configuration subfile, and generate a complete configuration file.
In the embodiment of the invention, for the numerical features in the numerical type feature list, the to-be-configured configuration subfiles required for feature processing are determined, namely the subfile for missing value processing, the subfile for normalization processing and the subfile for binning processing. Based on these subfiles, a developer sets the corresponding configuration parameters as needed. Once the developer has set the parameters, they are acquired, and all configuration subfiles with acquired parameters are integrated into a complete configuration file for processing numerical features.
Similarly, for the category features in the category type feature list, the to-be-configured configuration subfiles required for feature processing are determined, namely the subfile for missing value processing and the subfile for encoding processing. Based on these subfiles, a developer sets the corresponding configuration parameters as needed. Once the developer has set the parameters, they are acquired, and all configuration subfiles with acquired parameters are integrated into a complete configuration file for processing category features.
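The integration of configured subfiles into a complete configuration file might look like the sketch below. The text does not specify a file format or parameter names, so the dictionary layout, the keys, and the `build_config` helper are all hypothetical; JSON is used only as one plausible on-disk form.

```python
import json

# Hypothetical to-be-configured subfile templates, one per functional module.
# None marks a parameter the developer still has to set.
SUBFILE_TEMPLATES = {
    "missing_value": {"strategy": None},
    "normalization": {"method": None, "columns": None},
    "binning": {"columns": None, "encode": None, "strategy": None},
    "encoding": {"columns": None, "method": None},
}

# Which subfiles each feature type requires, as described in the text.
REQUIRED_SUBFILES = {
    "numeric": ["missing_value", "normalization", "binning"],
    "category": ["missing_value", "encoding"],
}


def build_config(feature_type, developer_params):
    """Fill the required subfiles and merge them into one configuration."""
    config = {}
    for module in REQUIRED_SUBFILES[feature_type]:
        subfile = dict(SUBFILE_TEMPLATES[module])
        subfile.update(developer_params.get(module, {}))
        unset = [key for key, value in subfile.items() if value is None]
        if unset:
            raise ValueError(f"{module}: unset parameters {unset}")
        config[module] = subfile
    return config


category_config = build_config("category", {
    "missing_value": {"strategy": "frequency"},
    "encoding": {"columns": ["gender"], "method": "zero_one"},
})
print(json.dumps(category_config, indent=2))
```

Rejecting a subfile with unset parameters is a design choice of this sketch: it surfaces an incomplete configuration before any feature is processed.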
The feature processing module 204 processes the corresponding type of features based on the configuration file generated by the configuration file generating module 203.
The complete configuration file only defines which processing is applied to the features and how. After it is generated, the complete configuration file and the corresponding features are input into a processing system for feature processing. Specifically, for the features in the numerical type feature list, the complete configuration file for processing numerical features is input into the processing system, and the features in the numerical type feature list are processed one by one based on that file, so as to achieve rapid feature processing. For the features in the category type feature list, the complete configuration file for processing category features is input into the processing system, and the features in the category type feature list are processed one by one based on that file, likewise achieving rapid feature processing.
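One way to picture the processing system is as a dispatcher that walks the feature list and applies each configured module in turn. The sketch below assumes this design; the handler names, the configuration keys and the column data are hypothetical, and the handlers are deliberately simplified so that they ignore most of their parameters.

```python
from statistics import mode


def fill_by_frequency(values, params):
    """Simplified missing-value handler: always fills by frequency."""
    most_common = mode([v for v in values if v is not None])
    return [most_common if v is None else v for v in values]


def encode_categories(values, params):
    """Simplified encoding handler: category codes that start from 1."""
    categories = list(dict.fromkeys(values))
    return [categories.index(v) + 1 for v in values]


# Hypothetical mapping from configuration section to handler.
HANDLERS = {"missing_value": fill_by_frequency, "encoding": encode_categories}


def process_features(columns, feature_list, config):
    """Process the listed features one by one, as the configuration directs."""
    processed = {}
    for name in feature_list:
        values = columns[name]
        for module, params in config.items():  # dict order is the processing order
            values = HANDLERS[module](values, params)
        processed[name] = values
    return processed


columns = {"gender": ["male", None, "female", "male"]}
config = {
    "missing_value": {"strategy": "frequency"},
    "encoding": {"method": "category"},
}
result = process_features(columns, ["gender"], config)
print(result)  # {'gender': [1, 1, 2, 1]}
```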
Optionally, in an embodiment of the present invention, a device for quickly processing model features further includes:
The image file generating module 205 is configured to generate a processing image from the feature processing process performed by the feature processing module 204 in the training stage.
Specifically, during the feature processing of step 104 in the training stage, the configuration file may be input to the processing system, and the system state produced while performing feature processing based on the configuration file to obtain the feature processing result may be packaged into an image file.
In general, feature processing takes place in two stages, training and prediction. Both stages take raw data as input, so for the two stages to produce the same results their processing must be consistent. Take missing-value filling as an example: if a feature is filled by frequency, the training stage has a large amount of data from which the most frequent value can be determined, whereas the prediction stage usually sees only a single record, so prediction depends on what was learned during training. In the prior art this is handled by developing separate code for the training process and for online prediction, which is time-consuming and labor-intensive. In the embodiment of the invention, the configuration file is input into the processing system in the training stage, and the training process that performs feature processing based on the configuration file and obtains the feature processing result is packaged into an image file. This avoids developing one set of code for training and another for online prediction, and ensures data consistency.
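The dependency of prediction on training can be illustrated with frequency filling. In the sketch below, the state learned during training is serialized and later restored for a single prediction record. The serialized JSON string is only a stand-in for the packaged image file, whose actual format the text does not specify, and all names are hypothetical.

```python
import json
from statistics import mode


def fit_frequency_fill(training_values):
    """Training stage: learn the most frequent value from many records."""
    observed = [v for v in training_values if v is not None]
    return {"strategy": "frequency", "fill_value": mode(observed)}


def apply_fill(state, values):
    """Prediction stage: reuse the learned value, even for a single record."""
    return [state["fill_value"] if v is None else v for v in values]


training_column = ["a", "b", "a", None, "a"]
state = fit_frequency_fill(training_column)
image = json.dumps(state)          # stand-in for the packaged image file

restored = json.loads(image)       # loaded again at prediction time
prediction = apply_fill(restored, [None])
print(prediction)  # ['a']
```

Because the prediction stage only reads the restored state, the same filling logic serves both stages and no second code path has to be maintained.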
Exemplary electronic device
Fig. 3 is a block diagram of an electronic device provided in an exemplary embodiment of the present invention. The electronic device may be either or both of a first device and a second device, or a stand-alone device independent of them, which may communicate with the first device and the second device to receive the acquired input signals from them. As shown in Fig. 3, the electronic device includes one or more processors 31 and a memory 32.
The processor 31 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
The memory 32 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 31 to implement the model feature rapid processing method of the various embodiments of the present disclosure described above and/or other desired functions. In one example, the electronic device may further include an input device 33 and an output device 34, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown).
The input device 33 may include, for example, a keyboard, a mouse, and the like.
The output device 34 can output various information to the outside. The output device 34 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, etc.
Of course, only some of the components of the electronic device relevant to the present disclosure are shown in fig. 3 for simplicity, components such as buses, input/output interfaces, etc. being omitted. In addition, the electronic device may include any other suitable components depending on the particular application.
Exemplary computer program product and computer readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in a model feature fast processing method according to various embodiments of the present disclosure described in the "exemplary methods" section of this specification.
The program code of the computer program product for performing the operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium, having stored thereon computer program instructions, which when executed by a processor, cause the processor to perform the steps in a model feature fast processing method according to various embodiments of the present disclosure described in the "exemplary methods" section of the present disclosure.
The computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, Random Access Memory (RAM), Read-Only Memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present disclosure have been described above in connection with specific embodiments, but it should be noted that the advantages, benefits, effects, etc. mentioned in the present disclosure are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present disclosure. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, since the disclosure is not necessarily limited to practice with the specific details described.
In this specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be understood by reference to one another. The system embodiments are described relatively briefly because they essentially correspond to the method embodiments; for the relevant points, refer to the description of the method embodiments.
The block diagrams of the devices, apparatuses, equipment and systems referred to in this disclosure are merely illustrative examples and are not intended to require or imply that the connections, arrangements and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, equipment and systems may be connected, arranged and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended terms that mean "including but not limited to" and are used interchangeably with that phrase. The terms "or" and "and" as used herein refer to, and are used interchangeably with, the term "and/or" unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the apparatus, devices and methods of the present disclosure, components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered equivalent to the present disclosure. The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the disclosure to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.
Claims (7)
1. A quick processing method for model features comprises the following steps:
Determining a plurality of functional modules corresponding to different processing types of the model features, and generating a configuration subfile to be configured for each functional module respectively, wherein the functional modules corresponding to the different processing types of the model features comprise: missing value processing, normalization processing, binning processing and encoding processing;
Classifying the features in the data set to be processed according to the feature types, including: dividing the features in the data set to be processed into a numerical value type feature and a category type feature, and generating a numerical value type feature list and a category type feature list;
Determining corresponding configuration subfiles to be configured according to different types of features, acquiring configuration parameters of each configuration subfile, and generating a complete configuration file, wherein the method comprises the following steps: determining a configuration sub-file to be configured required for processing the numerical value type characteristics, acquiring configuration parameters based on the determined configuration sub-file to be configured, and integrating all the configuration sub-files after acquiring the configuration parameters to obtain a complete configuration file for processing the numerical value type characteristics; determining a configuration sub-file to be configured required for processing the category type characteristics, acquiring configuration parameters based on the determined configuration sub-file to be configured, and integrating the configuration sub-files after acquiring the configuration parameters to obtain a complete configuration file for processing the category type characteristics;
and processing the corresponding type of features based on the generated complete configuration file, and generating a processing mirror image in the feature processing process in the training stage.
2. A method for rapidly processing model features as claimed in claim 1, characterized in that: the configuration sub-file to be configured is provided with configuration parameters for a developer to configure.
3. A method for rapidly processing model features as claimed in claim 1, characterized in that: for the numerical value type feature, the determined configuration subfiles to be configured comprise a configuration subfile to be configured for missing value processing, a configuration subfile to be configured for normalization processing and a configuration subfile to be configured for binning processing.
4. A method for rapidly processing model features as claimed in claim 1, characterized in that: for the category type feature, the determined configuration subfiles to be configured comprise a configuration subfile to be configured for missing value processing and a configuration subfile to be configured for encoding processing.
5. A model feature rapid processing apparatus comprising:
a to-be-configured configuration sub-file generating module, configured to determine a plurality of functional modules corresponding to different processing types of the model features, and to generate a configuration subfile to be configured for implementing each functional module respectively, wherein the functional modules corresponding to the different processing types of the model features comprise: missing value processing, normalization processing, binning processing and encoding processing;
a feature classification module, configured to classify the features in the data set to be processed according to the feature types, including: dividing the features in the data set to be processed into a numerical value type feature and a category type feature, and generating a numerical value type feature list and a category type feature list;
a configuration file generating module, configured to determine corresponding configuration subfiles to be configured according to different types of features, acquire configuration parameters of each configuration subfile, and generate a complete configuration file, including: determining a configuration sub-file to be configured required for processing the numerical value type features, acquiring configuration parameters based on the determined configuration sub-file to be configured, and integrating all the configuration sub-files after acquiring the configuration parameters to obtain a complete configuration file for processing the numerical value type features; determining a configuration sub-file to be configured required for processing the category type features, acquiring configuration parameters based on the determined configuration sub-file to be configured, and integrating the configuration sub-files after acquiring the configuration parameters to obtain a complete configuration file for processing the category type features;
and a feature processing module, configured to process the features of the corresponding types based on the configuration file generated by the configuration file generating module, and to generate a processing image from the feature processing process of the training stage.
6. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the model feature fast processing method according to any one of claims 1 to 4.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the model feature fast processing method according to any one of claims 1 to 4 when the computer program is executed.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210425982.3A CN114816506B (en) | 2022-04-21 | 2022-04-21 | Quick processing method and device for model features, storage medium and electronic equipment |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114816506A | 2022-07-29 |
| CN114816506B | 2024-08-09 |
Family
ID=82506455
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210425982.3A Active CN114816506B (en) | 2022-04-21 | 2022-04-21 | Quick processing method and device for model features, storage medium and electronic equipment |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114816506B (en) |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109460396A (en) * | 2018-10-12 | 2019-03-12 | 中国平安人寿保险股份有限公司 | Model treatment method and device, storage medium and electronic equipment |
Family Cites Families (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108733639B (en) * | 2018-04-09 | 2023-08-01 | 中国平安人寿保险股份有限公司 | Configuration parameter adjustment method and device, terminal equipment and storage medium |
| CN108764273B (en) * | 2018-04-09 | 2023-12-05 | 中国平安人寿保险股份有限公司 | Data processing method, device, terminal equipment and storage medium |
| CN111382347A (en) * | 2018-12-28 | 2020-07-07 | 广州市百果园信息技术有限公司 | Object feature processing and information pushing method, device and equipment |
| US11914606B2 (en) * | 2019-03-04 | 2024-02-27 | Walmart Apollo, Llc | Systems and methods for a machine learning framework |
| CN112487180B (en) * | 2019-09-12 | 2024-06-11 | 北京地平线机器人技术研发有限公司 | Text classification method and apparatus, computer-readable storage medium, and electronic device |
| EP4094194A1 (en) * | 2020-01-23 | 2022-11-30 | Umnai Limited | An explainable neural net architecture for multidimensional data |
| US20220101178A1 (en) * | 2020-09-25 | 2022-03-31 | EMC IP Holding Company LLC | Adaptive distributed learning model optimization for performance prediction under data privacy constraints |
| CN112394942B (en) * | 2020-11-24 | 2021-06-04 | 深圳君南信息系统有限公司 | Distributed software development compiling method and software development platform based on cloud computing |
| CN113094116B (en) * | 2021-04-01 | 2022-10-11 | 中国科学院软件研究所 | A deep learning application cloud configuration recommendation method and system based on load feature analysis |
Also Published As
| Publication number | Publication date |
|---|---|
| CN114816506A (en) | 2022-07-29 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |