CN118520303A - Automatic machine learning method and system based on structured data

Info

Publication number: CN118520303A
Application number: CN202410972993.2A
Authority: CN
Inventors: 陈铁金; 李国庆
Original assignee: Athena Eyes Co Ltd
Current assignee: Athena Eyes Co Ltd
Priority date: 2024-07-19
Filing date: 2024-07-19
Publication date: 2024-08-20

Abstract

The present application discloses an automatic machine learning method and system based on structured data. The method comprises the following steps: obtaining a structured data set and labeling a target variable for the structured data; performing type identification on the structured data set to obtain a target task type and a target feature type; constructing corresponding feature engineering according to the target feature type to obtain a target feature combination; constructing a base model according to the target task type and selecting a corresponding model fusion strategy; inputting the target feature combination and the target variable into the base model for training to obtain a target base model; and fusing the target base model according to the model fusion strategy to obtain a fusion model, and automatically deploying the target base model and the fusion model. The method greatly reduces the workload of manually selecting a model, improves the versatility and flexibility of the model across tasks, and can handle multiple tasks at one time, thereby saving considerable time and effort and improving efficiency.

Description

An automatic machine learning method and system based on structured data

Technical Field

The present application relates to the technical field of machine learning, and in particular to an automatic machine learning method and system based on structured data.

Background Art

In today's information society, data has become the core driving force for technological progress and business development. Structured data, such as tabular data in relational databases, is widely used in machine learning because of its inherent regularity and predictability. However, traditional machine learning frameworks often require users to have a deep background in data science and programming in order to perform a series of tedious operations such as feature engineering, model selection, and parameter tuning. This not only increases the cost of learning, but also limits the application of machine learning technology in a wider range of fields. With the development of automation and intelligent technology, automatic machine learning (AutoML) has gradually become a research hotspot. AutoML aims to reduce manual intervention in the machine learning process through automated pipelines, thereby improving modeling efficiency and lowering technical barriers. AutoML frameworks and systems based on structured data emerged in this context, providing a more convenient and efficient solution for processing and analyzing structured data.

At present, analysis of existing open source AutoML frameworks reveals deficiencies in the following aspects: 1) Generality and adaptability. Some AutoML frameworks adapt well to specific types of data or problem areas but are not flexible enough for others. They cannot cover all possible machine learning scenarios and needs, which limits their generality in diverse applications. 2) Model interpretability. In fields that require understanding of the model decision process (such as finance and medical care), lack of interpretability may become an obstacle to the adoption of AutoML. 3) Limitations of hyperparameter optimization. Although existing AutoML frameworks provide hyperparameter optimization, the search space may still be limited, so the global optimum may not be found. 4) Limitations of ensemble learning. Some AutoML systems may not fully utilize ensemble learning methods.

In view of this, how to provide an automatic machine learning method based on structured data that is versatile, flexible, and efficient is a technical problem that urgently needs to be solved by those skilled in the art.

Summary of the Invention

In order to solve the above technical problems, the purpose of the present invention is to provide an automatic machine learning method and system based on structured data, which greatly reduces the workload of manual model selection, improves the versatility and flexibility of the model across tasks, and can handle multiple tasks at one time, thereby saving considerable time and effort and improving efficiency.

The first object of the present invention is to provide an automatic machine learning method based on structured data;

The technical solution provided by the present invention is as follows:

An automatic machine learning method based on structured data comprises the following steps:

Obtaining a structured data set, and labeling a target variable for the structured data;

Performing type identification on the structured data set to obtain a target task type and a target feature type;

Constructing corresponding feature engineering according to the target feature type to obtain a target feature combination;

Constructing a base model according to the target task type, and selecting a model fusion strategy corresponding to the base model;

Inputting the target feature combination into the base model for training to obtain a target base model;

Fusing the target base model according to the model fusion strategy to obtain a fusion model, and automatically deploying the target base model and the fusion model.

Preferably, performing type identification on the structured data set to obtain the target task type and the target feature type specifically includes:

After removing duplicates from the target variable in the structured data set, counting the number of categories:

If the number of categories is equal to 2, it is identified as a binary classification task type;

If the data type identified for the target variable is a string and the number of categories counted after deduplication is greater than 2, it is identified as a multi-classification task type;

If the data type identified for the target variable is an integer or a floating point number, it is identified as a regression task;

Removing the target variable from the structured data set to obtain a target feature set;

Traversing the target feature set to obtain the target feature type.

Preferably, before acquiring the target feature combination according to the target feature type, the method further includes:

Performing data preprocessing on the target feature type.

Preferably, constructing corresponding feature engineering according to the target feature type to obtain a target feature combination specifically includes:

Performing feature transformation according to the target feature type to obtain target features;

Performing feature selection on the target features to obtain a weight score for each target feature;

Iterating over each candidate number of target features, and performing recursive feature elimination and 5-fold cross-validation to obtain an average cross-validation score;

Obtaining the target feature combination according to the average cross-validation score.

Preferably, constructing a base model according to the target task type specifically includes:

Selecting a corresponding algorithm as the base model according to the target task type.

Preferably, inputting the target feature combination into the base model for training to obtain a target base model specifically includes:

Inputting the target feature combination into the list of base models for parallel training to obtain the target base model.

The second object of the present invention is to provide an automatic machine learning system based on structured data;

An automatic machine learning system based on structured data, comprising: a first acquisition module, an identification module, a second acquisition module, a construction module, a training module and a deployment module;

The first acquisition module is used to acquire a structured data set and set a target variable for the structured data;

The identification module is used to perform type identification on the structured data set to obtain a target task type and a target feature type;

The second acquisition module is used to construct corresponding feature engineering according to the target feature type to obtain a target feature combination;

The construction module is used to construct a base model according to the target task type and select a model fusion strategy corresponding to the base model;

The training module is used to input the target feature combination into the base model for training to obtain a target base model;

The deployment module is used to fuse the target base model according to the model fusion strategy to obtain a fusion model, and automatically deploy the target base model and the fusion model.

The third object of the present invention is to provide an electronic device;

An electronic device, comprising:

at least one processor; and

a memory communicatively connected to the at least one processor, the memory storing a computer program executable by the at least one processor, the computer program being executed by the at least one processor so as to enable the at least one processor to perform the steps of the automatic machine learning method based on structured data.

The fourth object of the present invention is to provide a computer-readable storage medium;

A computer-readable storage medium, the storage medium being used to store a computer program, wherein the computer program is used to cause a computer to execute the steps of the automatic machine learning method based on structured data.

The present invention provides an automatic machine learning method based on structured data, comprising the steps of: obtaining a structured data set and labeling a target variable for the structured data; performing type identification on the structured data set to obtain a target task type and a target feature type; constructing corresponding feature engineering according to the target feature type to obtain a target feature combination; constructing a base model according to the target task type, and selecting a model fusion strategy corresponding to the base model; inputting the target feature combination into the base model for training to obtain a target base model; fusing the target base model according to the model fusion strategy to obtain a fusion model, and automatically deploying the target base model and the fusion model. The method greatly reduces the workload of manually selecting a model, improves the versatility and flexibility of the model across tasks, and can handle multiple tasks at one time, thereby saving considerable time and effort and improving efficiency.

The present invention also provides an automatic machine learning system based on structured data. Since the system solves the same technical problem as the automatic machine learning method based on structured data and belongs to the same technical concept, it has the same beneficial effects, which are not repeated here.

Brief Description of the Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in the present application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative work.

FIG. 1 is a flow chart of an automatic machine learning method based on structured data in an embodiment of the present invention;

FIG. 2 is a schematic diagram of the structure of an automatic machine learning system based on structured data in an embodiment of the present invention;

FIG. 3 is a schematic diagram of the structure of an electronic device in an embodiment of the present invention.

Detailed Description

In order to enable those skilled in the art to better understand the technical solutions in this application, the technical solutions in the embodiments of this application are described clearly and completely below. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the scope of protection of this application.

It should be noted that when an element is referred to as being "fixed on" or "set on" another element, it can be directly on the other element or indirectly set on the other element; when an element is referred to as being "connected to" another element, it can be directly connected to the other element or indirectly connected to the other element.

It should be understood that the orientation or position relationships indicated by the terms "length", "width", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inside", "outside", etc. are based on the orientation or position relationships shown in the drawings. They are used only for the convenience of describing the present application and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation. They therefore should not be understood as a limitation on the present application.

In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of the features. In the description of this application, "multiple" and "several" mean two or more, unless otherwise clearly and specifically defined.

It should be noted that the structures, proportions, sizes, etc. illustrated in the drawings of this specification are only used to accompany the contents disclosed in the specification so that those familiar with this technology can understand and read them, and are not used to limit the conditions under which this application can be implemented. Therefore, they have no substantive technical significance. Any structural modification, change in proportional relationship or adjustment of size should still fall within the scope of the technical content disclosed in this application, provided that it does not affect the effects and purposes that can be achieved by this application.

As shown in FIG. 1, an embodiment of the present invention provides an automatic machine learning method based on structured data, comprising the following steps:

S1. Obtain a structured data set and label the target variable for the structured data;

In step S1, a structured data set is imported through the data import module. There are two import methods: uploading a local file directly, or reading the structured data set directly from a table in a database. While importing the structured data set, the column holding the target variable must be labeled. If there is a user primary key, the column holding the user primary key must also be specified; if there is none, it does not need to be specified and the system defaults it to empty.
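The two import paths in step S1 can be sketched as follows. This is only a minimal illustration, assuming pandas, a CSV file, and an SQLite connection; the names `ImportedDataset`, `import_from_file`, and `import_from_table` are hypothetical and do not come from the application.

```python
import io
import sqlite3
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class ImportedDataset:
    """Hypothetical container for an imported table and its column roles."""
    frame: pd.DataFrame
    target_col: str
    key_col: Optional[str] = None  # stays empty when there is no user primary key


def _with_roles(frame, target_col, key_col):
    # The target column is mandatory; the primary-key column is optional.
    for col in (target_col, key_col):
        if col is not None and col not in frame.columns:
            raise KeyError(f"column {col!r} not found in the imported data")
    return ImportedDataset(frame, target_col, key_col)


def import_from_file(path_or_buffer, target_col, key_col=None):
    """Import path 1: a local (uploaded) CSV file."""
    return _with_roles(pd.read_csv(path_or_buffer), target_col, key_col)


def import_from_table(conn, table, target_col, key_col=None):
    """Import path 2: read a whole table through a database connection."""
    frame = pd.read_sql_query(f'SELECT * FROM "{table}"', conn)
    return _with_roles(frame, target_col, key_col)


# Usage: the same small table imported through both paths.
csv_text = "user_id,age,label\n1,30,yes\n2,41,no\n"
from_file = import_from_file(io.StringIO(csv_text), target_col="label", key_col="user_id")

conn = sqlite3.connect(":memory:")
from_file.frame.to_sql("customers", conn, index=False)
from_db = import_from_table(conn, "customers", target_col="label")  # no primary key given
```

Keeping the column roles next to the data means later steps can drop the target and primary-key columns without asking the user again.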

S2. Perform type identification on the structured data set to obtain a target task type and a target feature type;

In step S2, the imported structured data set is input into the automatic identification module, which automatically identifies the feature types and the task type. The target task type is identified automatically by analyzing the data of the target variable; the target task types include binary classification, multi-classification and regression. The target variable and the user primary key are then removed, and the remaining columns are the feature variables. The feature variables are traversed in turn, and the target feature type of each feature variable is identified automatically according to its data distribution; the target feature types include discrete, integer, floating-point, text, and time types. In this way, the system can automatically analyze the distribution characteristics of the data and determine the type of the data (such as continuous, discrete, categorical, text, or time). At the same time, the system can automatically infer the appropriate task type, such as a binary classification, multi-classification, or regression task, from the data distribution of the target variable and the characteristics of the relevant algorithms. This automatic identification greatly simplifies the data processing flow, reduces the interference of human factors, and improves the efficiency of data processing. In addition, since the automatic identification method is based on the statistical characteristics of the data, its results are more objective and accurate, which helps to improve subsequent data analysis and modeling.

S3. Construct corresponding feature engineering according to the target feature type to obtain the target feature combination;

In step S3, feature engineering is applied to the structured data set. Different feature transformations are performed according to the target feature type, and feature selection is then performed on the transformed features. Automatic parameter tuning and cross-validation are carried out to obtain a weight score for each feature, and the features are sorted by score. Each candidate number of features is then traversed, with recursive feature elimination and cross-validation performed for each, to obtain the average cross-validation score and thereby the best feature combination. The main idea of recursive feature elimination is to repeatedly build a model, select the best feature, set the selected feature aside, and then repeat this process on the remaining features until all features have been traversed. The feature engineering used in this embodiment is the process of converting raw data into features.
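The search over feature counts in step S3 can be sketched as below. This is an illustrative sketch rather than the application's implementation: it assumes scikit-learn, a random forest as the ranking estimator, and synthetic data, and the function name `select_feature_combination` is hypothetical. Note that scikit-learn's `RFE` works by pruning the least important feature at each round.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score


def select_feature_combination(X, y, cv=5, random_state=0):
    """Return (boolean mask of kept features, mean CV score) for the best feature count."""
    estimator = RandomForestClassifier(n_estimators=20, random_state=random_state)
    best_mask, best_score = None, -np.inf
    # Traverse every candidate number of features, from 1 up to all of them.
    for n_keep in range(1, X.shape[1] + 1):
        # RFE refits the estimator repeatedly and drops the weakest feature
        # each round until n_keep features remain.
        rfe = RFE(estimator, n_features_to_select=n_keep).fit(X, y)
        # Average score over the cv folds for this feature subset.
        score = cross_val_score(estimator, X[:, rfe.support_], y, cv=cv).mean()
        if score > best_score:
            best_mask, best_score = rfe.support_.copy(), score
    return best_mask, best_score


# Synthetic data stands in for the transformed feature matrix.
X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)
mask, score = select_feature_combination(X, y)
```

scikit-learn also ships `RFECV`, which combines elimination and cross-validation in one estimator and avoids selecting features on the same rows used for scoring; the explicit loop is shown here only because it mirrors the wording of the step.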

S4. Construct a base model according to the target task type, and select a model fusion strategy corresponding to the base model;

In step S4, based on the automatically identified target task type, the algorithms corresponding to the target task type are selected to construct the base models, and the model fusion strategy corresponding to the base models is selected. The base model in this embodiment refers to the list of selected algorithms. The model fusion strategies include the Stacking layered ensemble method and the Voting ensemble strategy.

S5. Input the target feature combination into the base model for training to obtain a target base model;

In step S5, the base models are trained in parallel using the best feature combination and the target variable. A set of hyperparameter ranges is defined for each base model, and the optimal parameter combination is then searched using grid search (GridSearch). Cross-validation (5-fold) is used to evaluate the performance of each base model, and the base model with the best performance is selected as the target base model.
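A minimal sketch of the grid search in step S5 follows, assuming scikit-learn's `GridSearchCV`. The hyperparameter ranges, the two candidate models, and the name `tune_base_models` are illustrative assumptions; the application does not list its search ranges.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical search ranges: one grid per base model.
CANDIDATES = {
    "RF": (RandomForestClassifier(random_state=0),
           {"n_estimators": [20, 50], "max_depth": [3, None]}),
    "LR": (LogisticRegression(max_iter=1000),
           {"C": [0.1, 1.0, 10.0]}),
}


def tune_base_models(X, y, candidates=CANDIDATES, cv=5, n_jobs=None):
    """Grid-search every base model with cv-fold cross-validation."""
    tuned = {}
    for name, (estimator, grid) in candidates.items():
        # n_jobs=-1 would evaluate the grid points in parallel on all cores.
        search = GridSearchCV(estimator, grid, cv=cv, n_jobs=n_jobs)
        search.fit(X, y)
        tuned[name] = search
    return tuned


# Synthetic data stands in for the selected feature combination and target.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
tuned = tune_base_models(X, y)
# Pick the base model with the best mean cross-validation score.
best_name = max(tuned, key=lambda name: tuned[name].best_score_)
```

Each fitted `GridSearchCV` object keeps `best_params_`, `best_score_`, and a refitted `best_estimator_`, so the tuned models can be passed on to the fusion step directly.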

S6. Fuse the target base model according to the model fusion strategy to obtain a fusion model, and automatically deploy the target base model and the fusion model.

In step S6, the list of target base models is fused or integrated using the Stacking layered ensemble method and the Voting ensemble strategy. In Stacking, the second-layer model (the meta-model) uses the prediction results of the first-layer models (i.e., the target base models) as input features. The implementation steps are as follows: 1. Train the first-layer models and use them to predict the training data; 2. Use the prediction results of the first-layer models as features to train the second-layer model (the meta-model); 3. Use the test data to evaluate the performance indicators of the target base models, thereby obtaining the fusion model.

Voting is a simple ensemble strategy in which multiple models predict on the same data set and the final prediction is selected by voting (such as hard voting or soft voting). The implementation step is as follows: 1. Vote using the selected target base models and evaluate the performance indicators of the target base models, thereby obtaining the fusion model.

The target base models and the fusion model are then deployed automatically, and an inference service interface is provided so that users can call different models by model name. Flexible configuration options and extension interfaces are also provided. This realizes automatic deployment and efficient inference service for trained models, giving users a convenient and reliable model inference solution. An interactive interface between the user and the system is also provided, which allows the user to set the parameters and configuration of the system and to select a suitable algorithm and model structure. A visual interface lets the user view the effect of the model on a page, including the model task type, the corresponding evaluation indicators, feature importance, the ROC curve, and the P-R curve.
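The Stacking and Voting strategies of step S6 can be sketched with scikit-learn's `StackingClassifier` and `VotingClassifier`. This is a hedged illustration on synthetic data with three of the classifiers named in this document; the helper `make_base_models` is hypothetical. One difference from the steps as written: `StackingClassifier` trains the meta-model on out-of-fold predictions (`cv=5`) rather than on predictions for rows the base models were fitted on.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def make_base_models():
    """First-layer models; fresh instances so each ensemble fits its own copies."""
    return [
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("ada", AdaBoostClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ]


X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Stacking: base-model predictions become the input features of the meta-model.
stacking = StackingClassifier(
    estimators=make_base_models(),
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
# Voting: "soft" averages predicted probabilities; "hard" takes a majority vote.
voting = VotingClassifier(estimators=make_base_models(), voting="soft")

scores = {}
for name, model in {"stacking": stacking, "voting": voting}.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out test data
```

Comparing the two held-out scores, alongside those of the individual base models, is one way to decide which fused model to deploy.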

Compared with the prior art, this method has the following advantages: 1) Versatility and flexibility. It can automatically select suitable algorithms and models, and train and optimize them according to the characteristics of the data. This greatly reduces the workload of manual model selection and parameter tuning, while improving the versatility and flexibility of the model across tasks. 2) A lower barrier to entry. It provides a friendly interface for users who are not machine learning experts, so that they can use structured data for model training and prediction without a deep understanding of machine learning principles and details. 3) Improved efficiency. Data scientists and machine learning engineers no longer need to design and adjust models separately for different types of tasks. The automated machine learning method can handle multiple tasks at one time, saving considerable time and effort.

After removing duplicates from the target variable, the number of categories is counted:

If the data type identified for the target variable is a string and the number of categories counted after deduplication is greater than 2, it is identified as a multi-classification task type.

In actual application, the target task type is identified automatically by analyzing the data of the target variable, and the target task types include binary classification, multi-classification and regression. The identification criteria for the target task type are as follows: after the target variable is deduplicated, the number of categories is counted. If the number of categories is equal to 2, the task is identified as a binary classification task. If the data type identified for the target variable is a string and the number of categories counted after deduplication is greater than 2, the task is identified as a multi-classification task. If the data type identified for the target variable is an integer or a floating-point number, the task is identified as a regression task.
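The identification criteria above translate fairly directly into code. The sketch below assumes pandas; the function name `identify_task_type` and the returned labels are hypothetical, and the rules are checked in the order they are stated, so a numeric target with exactly two distinct values is treated as binary classification.

```python
import pandas as pd
from pandas.api import types as pdt


def identify_task_type(target: pd.Series) -> str:
    """Classify the task from the target column, following the stated rules in order."""
    n_classes = target.dropna().nunique()  # number of categories after deduplication
    if n_classes == 2:
        return "binary"
    is_string = pdt.is_object_dtype(target) or pdt.is_string_dtype(target)
    if is_string and n_classes > 2:
        return "multiclass"
    if pdt.is_integer_dtype(target) or pdt.is_float_dtype(target):
        return "regression"
    # Cases the stated rules do not cover (for example a single-valued string target).
    raise ValueError(f"cannot infer a task type for dtype {target.dtype}")
```

For example, a column of "yes"/"no" labels yields "binary", a column with three distinct string labels yields "multiclass", and a column of prices yields "regression". The rules as stated send an integer-coded target with more than two classes to regression, so such labels would need to be stored as strings to be treated as multi-classification.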

In actual application, the target variable and the user primary key are removed, and the remaining columns are the feature variables. The feature variables are then traversed in turn, and the target feature type of each feature variable is identified automatically according to its data distribution. The target feature types include discrete, integer, floating-point, text and time types. The identification criterion for the discrete type: the data type is object, bool or category, and the field length is less than or equal to 25. For the integer type: the data type name contains "int". For the floating-point type: the data type name contains "float". For the text type: the data type is object, bool or category, and the field length is greater than 25. For the time type: the data type is object or category, and the column can be processed by the to_datetime function in the pandas library.
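A sketch of these criteria with pandas is given below. Two points are assumptions, because the text leaves them open: "field length" is read as the longest string value in the column, and the time check runs before the discrete/text check since their dtype conditions overlap. The name `identify_feature_type` is hypothetical.

```python
import warnings

import pandas as pd

# Dtype names treated as string-like; "str"/"string" cover newer pandas string dtypes.
STRING_LIKE = {"object", "bool", "category", "str", "string"}


def identify_feature_type(col: pd.Series, max_discrete_len: int = 25) -> str:
    dtype_name = str(col.dtype).lower()
    if "int" in dtype_name:
        return "integer"
    if "float" in dtype_name:
        return "float"
    if dtype_name not in STRING_LIKE:
        raise ValueError(f"unsupported dtype: {col.dtype}")

    values = col.dropna().astype(str)
    # Time type: an object/category column that pandas.to_datetime can parse.
    if dtype_name != "bool" and len(values) > 0:
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")  # silence format-inference warnings
                pd.to_datetime(values, errors="raise")
            return "time"
        except (ValueError, TypeError, OverflowError):
            pass  # not parseable as datetimes

    # Assumption: "field length" means the longest string value in the column.
    max_len = int(values.str.len().max()) if len(values) > 0 else 0
    return "discrete" if max_len <= max_discrete_len else "text"


frame = pd.DataFrame({
    "age": [21, 35, 42],
    "score": [0.1, 0.5, 0.9],
    "color": ["red", "green", "blue"],
    "signup": ["2024-01-05", "2024-02-10", "2024-03-15"],
    "review": ["the delivery was slow but the product works well",
               "excellent build quality and very fast shipping overall",
               "not worth the price in my honest opinion, sadly"],
})
feature_types = {name: identify_feature_type(frame[name]) for name in frame.columns}
```

Short string codes that happen to parse as dates would be labeled "time" under this ordering, so a production system would likely need a stricter date check.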

In actual application, the feature data for each target feature type is preprocessed. First, the null rate of each feature is computed, and columns whose null rate exceeds 80% are removed. Second, missing values are handled. For discrete features, missing values are filled with "-99999"; for integer features, missing values are filled with 0; for floating-point features, missing values are filled with the mean of the feature column; for text features, missing values are filled with an empty string; for time features, missing values are filled with the value from the nearest preceding row of the feature column that is not null.
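The preprocessing rules above can be sketched with pandas as follows. The function name `preprocess` and the `feature_types` mapping (column name to one of "discrete", "integer", "float", "text", "time") are hypothetical; discrete and text columns are assumed to be plain string columns rather than pandas categoricals.

```python
import pandas as pd


def preprocess(df: pd.DataFrame, feature_types: dict, max_null_rate: float = 0.8) -> pd.DataFrame:
    """Drop mostly-empty columns, then fill missing values by feature type."""
    null_rate = df.isna().mean()  # fraction of missing values per column
    kept = [c for c in df.columns if null_rate[c] <= max_null_rate]
    out = df[kept].copy()
    for col in kept:
        kind = feature_types[col]
        if kind == "discrete":
            out[col] = out[col].fillna("-99999")          # sentinel category
        elif kind == "integer":
            out[col] = out[col].fillna(0)
        elif kind == "float":
            out[col] = out[col].fillna(out[col].mean())   # column mean
        elif kind == "text":
            out[col] = out[col].fillna("")
        elif kind == "time":
            out[col] = out[col].ffill()                   # previous non-null row
    return out


raw = pd.DataFrame({
    "city": ["a", None, "b", "a", "b", "a"],
    "age": pd.array([1, None, 3, 4, 5, 6], dtype="Int64"),
    "score": [1.0, None, 3.0, 2.0, 2.0, 4.0],
    "note": ["ok", None, "fine", "ok", "bad", "ok"],
    "ts": ["2024-01-01", None, "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-06"],
    "mostly_empty": [None, None, None, None, None, 1.0],  # 5 of 6 values missing
})
kinds = {"city": "discrete", "age": "integer", "score": "float",
         "note": "text", "ts": "time", "mostly_empty": "float"}
clean = preprocess(raw, kinds)
```

A forward fill leaves a missing value in place when the first row of a time column is empty, so that case would need its own rule.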

In actual application, feature engineering is constructed on the preprocessed features according to their target feature types. Discrete feature columns are One-Hot encoded using OneHotEncoder from the sklearn.preprocessing library. Floating-point features are normalized (standardized) using StandardScaler from the sklearn.preprocessing library. Integer features are binned using KBinsDiscretizer from the sklearn.preprocessing library, where the number of bins is calculated with Sturges' formula: num_bins=int(1+np.log2(n)), where int truncates to an integer, n is the number of observations as in Sturges' formula, and np.log2 is a function in NumPy (a numerical computing library widely used in Python) that calculates the base-2 logarithm. For time features, the year, month, day, hour, week, day of the week, whether it is a working day, whether it is the beginning of the month, and whether it is the end of the month are extracted as new feature columns. Text features are vectorized using TfidfVectorizer from sklearn.feature_extraction.text. Feature selection is then performed on the list of transformed features. It is based on a random forest, with automatic parameter tuning and 5-fold cross-validation on the transformed features, which yields a weight score for each feature; the features are sorted by score. Each candidate number of features is then traversed, RFE and 5-fold cross-validation are performed, and the average cross-validation score is stored, from which the best feature combination, that is, the target feature combination, is obtained. Both the feature weights (importance) and the feature selection process are visualized.

In practice, the algorithms used as base models are chosen according to the automatically identified task type. For binary classification tasks, five classifiers are selected as base models: RF (random forest), LightGBM (light gradient boosting machine), AdaBoost (adaptive boosting), XGBoost (extreme gradient boosting), and LR (logistic regression). For multi-class classification tasks, the same five classifiers are selected. For regression tasks, three regression algorithms are selected as base models: RF, LightGBM, and XGBoost.
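The mapping from task type to base models amounts to a lookup table. The sketch below records it with model names only, so that it runs without LightGBM or XGBoost installed; the task-type keys ("binary", "multiclass", "regression") and the function name are assumptions made for illustration.

```python
# Model names as listed in the text; the dictionary keys are assumed labels.
CLASSIFIERS = ("RF", "LightGBM", "AdaBoost", "XGBoost", "LR")
REGRESSORS = ("RF", "LightGBM", "XGBoost")

BASE_MODELS = {
    "binary": CLASSIFIERS,
    "multiclass": CLASSIFIERS,
    "regression": REGRESSORS,
}


def select_base_models(task_type: str) -> tuple:
    """Return the base-model names for an identified task type."""
    try:
        return BASE_MODELS[task_type]
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}") from None
```

In a full system each name would map to an estimator factory together with its hyperparameter range, which the text says is defined per model and per task type.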

In practice, the base models are trained on the selected feature list, each with 5-fold cross-validation, and the hyperparameter optimization unit automatically tunes the hyperparameters of each base model. A hyperparameter range is defined for every base model of every task type, so that model parameters and structure are adjusted automatically to the characteristics of the task, with the aim of matching or approaching the results of manual tuning across tasks. The optimal parameter combination is found by grid search (GridSearch); the target feature combination is input into the base models for training, every base model is evaluated, and the best base model is output. The performance metrics for binary classification tasks are accuracy, precision, recall, F1-score, area under the curve (AUC), the receiver operating characteristic (ROC) curve, the P-R curve, and the confusion matrix. Multi-class classification tasks are evaluated with the same metrics. The performance metrics for regression tasks are mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), and the coefficient of determination (R2 score).
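A grid search with 5-fold cross-validation over a set of base models can be sketched with scikit-learn's `GridSearchCV`. The search spaces below are deliberately tiny and hypothetical (the text does not give the actual ranges), only two of the five classifiers are shown, and two choices are assumptions of this sketch rather than statements from the text: scoring on a held-out split and picking the best base model by AUC.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical search spaces: estimator plus parameter grid per base model.
SEARCH_SPACES = {
    "RF": (RandomForestClassifier(random_state=0),
           {"n_estimators": [50, 100], "max_depth": [3, None]}),
    "LR": (LogisticRegression(max_iter=1000),
           {"C": [0.1, 1.0, 10.0]}),
}


def tune_and_evaluate(X_train, y_train, X_test, y_test):
    """Grid-search each base model with 5-fold CV, then score it on held-out data."""
    results = {}
    for name, (estimator, grid) in SEARCH_SPACES.items():
        search = GridSearchCV(estimator, grid, cv=5, scoring="roc_auc")
        search.fit(X_train, y_train)  # refits the best combination on all training data
        pred = search.predict(X_test)
        proba = search.predict_proba(X_test)[:, 1]
        results[name] = {
            "best_params": search.best_params_,
            "accuracy": accuracy_score(y_test, pred),
            "precision": precision_score(y_test, pred),
            "recall": recall_score(y_test, pred),
            "f1": f1_score(y_test, pred),
            "auc": roc_auc_score(y_test, proba),
            "confusion_matrix": confusion_matrix(y_test, pred),
        }
    # Assumption: the "optimal" base model is the one with the highest AUC.
    best = max(results, key=lambda n: results[n]["auc"])
    return best, results


# Synthetic binary classification data for demonstration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)
best_name, report = tune_and_evaluate(X_tr, y_tr, X_te, y_te)
```

For regression tasks the same loop would apply with regressors and with `mean_absolute_error`, `mean_squared_error`, `mean_absolute_percentage_error`, and `r2_score` from `sklearn.metrics`.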

As shown in FIG. 2, the present invention also provides an automatic machine learning system based on structured data, comprising: a first acquisition module, an identification module, a second acquisition module, a construction module, a training module, and a deployment module.

In practice, the system is provided with a first acquisition module, an identification module, a second acquisition module, a construction module, a training module, and a deployment module. The identification module is connected to the first acquisition module, the second acquisition module, and the construction module; the training module is connected to the second acquisition module, the construction module, and the deployment module. The first acquisition module acquires the structured data set, sets the target variable for the structured data, and sends the structured data set to the identification module. The identification module performs type identification on the structured data set to obtain the target task type and the target feature type, then sends the target task type to the construction module and the target feature type to the second acquisition module. The second acquisition module obtains the target feature combination according to the target feature type and sends it to the training module. The construction module constructs the base model according to the target task type and sends it to the training module. The training module inputs the target feature combination into the base model for training to obtain the target base model and sends it to the deployment module. The deployment module fuses the target base model according to the model fusion strategy to obtain the fusion model and automatically deploys both the target base model and the fusion model. The system greatly reduces the workload of manual model selection, improves the versatility and flexibility of the models across tasks, and can handle multiple tasks at once, saving time and effort and improving efficiency.
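The data flow between the six modules can be sketched as a small orchestration class. Everything here is hypothetical scaffolding: the class name `AutoMLSystem`, the callable fields standing in for modules, the stub lambdas, and the "voting" placeholder for the fusion strategy are assumptions for illustration. The only rules taken from the document are the task-type rules on the target variable, applied in the order the claims list them.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


def identify_task_type(target: Sequence[Any]) -> str:
    """Target-variable rules: 2 classes, then string with >2 classes, then numeric."""
    n_classes = len(set(target))
    if n_classes == 2:
        return "binary"
    if n_classes > 2 and all(isinstance(v, str) for v in target):
        return "multiclass"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in target):
        return "regression"
    raise ValueError("target variable matches none of the stated rules")


@dataclass
class AutoMLSystem:
    # Each field stands in for one module; the callables are hypothetical.
    identify_features: Callable  # identification module: features -> feature types
    build_features: Callable     # second acquisition module -> feature combination
    build_models: Callable       # construction module -> (base models, fusion strategy)
    train: Callable              # training module -> trained base models
    deploy: Callable             # deployment module -> fused and deployed result

    def run(self, columns: dict, target_name: str) -> dict:
        # First acquisition module: take the data set and set the target variable.
        target = columns[target_name]
        features = {k: v for k, v in columns.items() if k != target_name}
        # Identification module: task type goes to the construction module,
        # feature types go to the second acquisition module.
        task_type = identify_task_type(target)
        feature_types = self.identify_features(features)
        combination = self.build_features(features, feature_types)
        base_models, fusion = self.build_models(task_type)
        trained = self.train(base_models, combination, target)
        return self.deploy(trained, fusion)


# Stub modules, invented for the demonstration.
system = AutoMLSystem(
    identify_features=lambda feats: {k: type(v[0]).__name__ for k, v in feats.items()},
    build_features=lambda feats, types: sorted(feats),
    build_models=lambda task: (["RF"] if task == "regression" else ["RF", "LR"], "voting"),
    train=lambda models, combination, y: [f"trained-{m}" for m in models],
    deploy=lambda trained, fusion: {"base": trained,
                                    "fusion": f"{fusion}({len(trained)} models)"},
)
report = system.run({"age": [31, 45, 27], "label": ["yes", "no", "yes"]}, "label")
```

Passing the modules in as callables keeps the wiring separate from the behavior of each module, which mirrors the connection description in the text without committing to any particular implementation.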

Furthermore, an embodiment of the present application also discloses an electronic device. FIG. 3 is a structural diagram of an electronic device 20 according to an exemplary embodiment; the content of the figure shall not be regarded as any limitation on the scope of use of the present application.

FIG. 3 is a schematic structural diagram of an electronic device 20 provided in an embodiment of the present application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 is used to store a computer program, which is loaded and executed by the processor 21 to implement the relevant steps of the automatic machine learning method based on structured data disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in this embodiment may specifically be an electronic computer.

In this embodiment, the power supply 23 provides the working voltage for each hardware device on the electronic device 20. The communication interface 24 creates a data transmission channel between the electronic device 20 and external devices; it may follow any communication protocol applicable to the technical solution of the present application, which is not specifically limited here. The input/output interface 25 obtains input data from outside or outputs data to the outside; its specific interface type may be selected according to the needs of the application and is not specifically limited here.

In addition, the memory 22, as a carrier for resource storage, may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like. The resources stored on it may include an operating system 221, a computer program 222, and data 223, and the storage may be temporary or permanent.

The operating system 221 manages and controls the hardware devices on the electronic device 20 and the computer program 222, so that the processor 21 can operate on and process the data 223 in the memory 22; it may be Windows Server, NetWare, Unix, Linux, or the like. Besides the computer program that performs the automatic machine learning method based on structured data executed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs for other specific tasks. The data 223 may include data received by the automatic machine learning device from external devices, as well as data collected through its own input/output interface 25.

The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Furthermore, the present application also discloses a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the automatic machine learning method based on structured data disclosed above. For the specific steps of the method, reference may be made to the corresponding content of the foregoing embodiments, which is not repeated here.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may implement the described functions in different ways for each specific application, but such implementations should not be considered to go beyond the scope of the present application.

Finally, it should be noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.

Where a flowchart is used in the present application, it illustrates the operations performed by the system according to the embodiments of the present application. It should be understood that the operations are not necessarily performed exactly in the order shown. Instead, the steps may be processed in reverse order or simultaneously. Other operations may also be added to these processes, or one or more operations may be removed from them.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An automatic machine learning method based on structured data, comprising the steps of:

obtaining a structured data set, and setting a target variable for the structured data;

performing type identification on the structured data set to obtain a target task type and a target feature type;

constructing corresponding feature engineering according to the target feature type to obtain a target feature combination;

constructing a base model according to the target task type, and selecting a model fusion strategy corresponding to the base model;

inputting the target feature combination into the base model for training to obtain a target base model; and

fusing the target base model according to the model fusion strategy to obtain a fusion model, and automatically deploying the target base model and the fusion model.

2. The automatic machine learning method based on structured data of claim 1, wherein said performing type identification on the structured data set to obtain a target task type and a target feature type comprises:

de-duplicating the target variable in the structured data set and counting the number of categories;

if the number of categories is equal to 2, identifying the task as a binary classification task type;

if the data type of the target variable is a character string and the number of categories counted after de-duplication is greater than 2, identifying the task as a multi-class classification task type; and

if the data type of the target variable is an integer or a floating-point number, identifying the task as a regression task type.

3. The automatic machine learning method based on structured data of claim 1, wherein said performing type identification on the structured data set to obtain a target task type and a target feature type comprises:

removing the target variable from the structured data set to obtain a target feature set; and

traversing the target feature set to obtain the target feature type.

4. The automatic machine learning method based on structured data of claim 1, further comprising, before said obtaining a target feature combination according to the target feature type:

performing data preprocessing on the target feature type.

5. The automatic machine learning method based on structured data of claim 1, wherein said constructing corresponding feature engineering according to the target feature type to obtain a target feature combination specifically comprises:

performing feature transformation according to the target feature type to obtain target features;

performing feature selection on the target features to obtain a weight score for each target feature;

traversing each candidate number of target features, and performing recursive feature elimination and 5-fold cross-validation to obtain an average cross-validation score; and

obtaining the target feature combination according to the average cross-validation score.

6. The automatic machine learning method based on structured data of claim 1, wherein said constructing a base model according to the target task type comprises:

selecting a corresponding algorithm as the base model according to the target task type.

7. The automatic machine learning method based on structured data of claim 1, wherein said inputting the target feature combination into the base model for training to obtain a target base model specifically comprises:

inputting the target feature combination into the list of base models for parallel training to obtain the target base models.

8. An automatic machine learning system based on structured data, comprising: a first acquisition module, an identification module, a second acquisition module, a construction module, a training module, and a deployment module;

the first acquisition module is used for acquiring a structured data set and setting a target variable for the structured data;

the identification module is used for performing type identification on the structured data set to obtain a target task type and a target feature type;

the second acquisition module is used for constructing corresponding feature engineering according to the target feature type to obtain a target feature combination;

the construction module is used for constructing a base model according to the target task type and selecting a model fusion strategy corresponding to the base model;

the training module is used for inputting the target feature combination into the base model for training to obtain a target base model; and

the deployment module is used for fusing the target base model according to the model fusion strategy to obtain a fusion model, and automatically deploying the target base model and the fusion model.

9. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor, the memory storing a computer program executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.

10. A computer readable storage medium, characterized in that the storage medium is for storing a computer program for causing a computer to execute the method of any one of claims 1 to 7.