CN114943607A - Feature discovery method, attribute prediction method and device - Google Patents
- Publication number
- CN114943607A (application CN202210619787.4A)
- Authority
- CN
- China
- Prior art keywords
- feature
- attribute
- features
- derived
- attribute prediction
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/03—Credit; Loans; Processing thereof
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
- G06Q10/06311—Scheduling, planning or task assignment for a person or group
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0635—Risk analysis of enterprise or organisation activities
Landscapes
- Business, Economics & Management (AREA)
- Human Resources & Organizations (AREA)
- Engineering & Computer Science (AREA)
- Economics (AREA)
- Strategic Management (AREA)
- Entrepreneurship & Innovation (AREA)
- Theoretical Computer Science (AREA)
- Development Economics (AREA)
- Marketing (AREA)
- Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Game Theory and Decision Science (AREA)
- Operations Research (AREA)
- Quality & Reliability (AREA)
- Tourism & Hospitality (AREA)
- Educational Administration (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Technology Law (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Description
Technical Field
One or more embodiments of this specification relate to the field of computer technology, and in particular to a feature discovery method, an attribute prediction method, and corresponding apparatuses.
Background
Feature discovery is an important means of attribute prediction in business scenarios. In risk prevention and control, for example, as payment risk-control scenarios see increasingly frequent attack and defense against black- and gray-market operators, risk strategy personnel must spend a great deal of time analyzing cases and reconstructing the full course of each black- or gray-market operation. Through data analysis they summarize the risk characteristics, abstract them into concrete risk features, and apply those features to risk prevention and control.
However, manual feature discovery is inefficient.
Summary of the Invention
One or more embodiments of this specification describe a feature discovery method, an attribute prediction method, and apparatuses that can improve the efficiency of feature extraction.
According to a first aspect, a feature discovery method is provided, comprising:
acquiring at least one original feature generated in business operations, wherein the original feature is used to predict an attribute included in the operated business;
deriving from the at least one original feature to obtain at least one derived feature; and
evaluating the quality of the at least one derived feature to obtain an attribute prediction feature for predicting the attribute included in the operated business, wherein the quality of a derived feature characterizes how important that feature is for predicting the attribute.
In a possible implementation, the at least one original feature includes at least two original features;
deriving from the at least one original feature to obtain at least one derived feature comprises:
screening the at least two original features according to the degree to which each original feature covers the attribute, to obtain at least one screening feature; and
mapping the at least one screening feature to a dimension different from the dimension in which that screening feature currently lies, to obtain the at least one derived feature.
In a possible implementation, screening the at least two original features according to the degree to which each original feature covers the attribute, to obtain at least one screening feature, comprises:
determining, among the at least two original features, each original feature whose feature value is greater than a preset first effective-feature threshold as a screening feature.
In a possible implementation, the at least one original feature includes at least two original features;
deriving from the at least one original feature to obtain at least one derived feature comprises:
for any first original feature and any second original feature among the at least two original features, performing the following:
splitting the first original feature to obtain M first split features; and
splitting the second original feature to obtain N second split features, wherein M and N are both integers greater than 0;
combining the M first split features with the N second split features to obtain M×N primary derived features, wherein each primary derived feature is the combination of one first split feature and one second split feature; and
screening the M×N primary derived features according to the degree to which each primary derived feature covers the attribute, to obtain the at least one derived feature.
In a possible implementation, screening the M×N primary derived features according to the degree to which each primary derived feature covers the attribute, to obtain the at least one derived feature, comprises:
determining, among the M×N primary derived features, each primary derived feature whose combined feature value is greater than a preset second effective-feature threshold as a derived feature, wherein the combined feature value is the weighted average of the feature values of the first split feature and the second split feature that were combined into the corresponding primary derived feature.
In a possible implementation, evaluating the quality of the at least one derived feature to obtain an attribute prediction feature comprises:
calculating an attribute prediction capability value for each derived feature; and
determining, as an attribute prediction feature, each derived feature whose attribute prediction capability value is greater than a preset evaluation threshold.
In a possible implementation, the attribute includes a first attribute result to be predicted and a second attribute result opposite to the first attribute result;
calculating the attribute prediction capability value of each derived feature comprises:
for each derived feature, performing the following:
grouping the current derived feature at equal intervals to obtain k groups, wherein the equal-interval grouping includes at least one of equal-height (equal-frequency) and equal-width grouping, and k is an integer greater than 0;
calculating the attribute prediction capability value of each group using the following formula:
where $IV_i$ denotes the attribute prediction capability value of the i-th group, $y_i$ denotes the number of first attribute results in the i-th group, $y_s$ denotes the number of first attribute results in the current derived feature, $x_i$ denotes the number of second attribute results in the i-th group, and $x_s$ denotes the number of second attribute results in the current derived feature; and
summing the attribute prediction capability values of all groups to obtain the attribute prediction capability value of the current derived feature.
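The formula referenced above is not reproduced in the text. The symbol definitions (per-group counts of two opposite outcomes, normalized by their totals and summed over the groups) match the standard information value (IV) statistic, so the sketch below assumes $IV_i = \left(\frac{y_i}{y_s} - \frac{x_i}{x_s}\right)\ln\frac{y_i/y_s}{x_i/x_s}$. That form is an assumption, not taken from the source. The function names, the choice of equal-width grouping, and the smoothing constant `eps` are likewise illustrative.

```python
import math

def equal_width_bins(values, k):
    """Assign each value to one of k equal-width groups (indices 0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # fall back to 1.0 when all values are equal
    return [min(int((v - lo) / width), k - 1) for v in values]

def information_value(values, labels, k=5, eps=1e-6):
    """Sum the per-group IV_i; labels: 1 = first attribute result, 0 = second."""
    groups = equal_width_bins(values, k)
    y_s = sum(labels)            # total count of the first attribute result
    x_s = len(labels) - y_s      # total count of the second attribute result
    if y_s == 0 or x_s == 0:
        return 0.0               # only one outcome present: nothing to separate
    total = 0.0
    for i in range(k):
        y_i = sum(1 for g, l in zip(groups, labels) if g == i and l == 1)
        x_i = sum(1 for g, l in zip(groups, labels) if g == i and l == 0)
        if y_i + x_i == 0:
            continue             # an empty group contributes nothing
        p = max(y_i / y_s, eps)  # eps is an assumed guard against log(0)
        q = max(x_i / x_s, eps)
        total += (p - q) * math.log(p / q)
    return total
```

A derived feature whose groups all contain both outcomes in the same proportion scores 0, while one that separates the outcomes scores high, which is what makes a threshold on the summed value usable as a filter.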
In a possible implementation, evaluating the quality of the at least one derived feature to obtain an attribute prediction feature comprises:
determining the attribute prediction feature according to the accuracy of attribute prediction performed with the original features and with the derived features, respectively.
In a possible implementation, determining the attribute prediction feature according to the accuracy of attribute prediction performed with the original features and with the derived features, respectively, comprises:
training a first attribute prediction model using the at least two original features;
training a second attribute prediction model using at least one of the derived features;
performing attribute prediction on the same feature to be predicted with the first attribute prediction model and the second attribute prediction model, respectively, to obtain a first prediction result from the first attribute prediction model and a second prediction result from the second attribute prediction model;
calculating the distance between the first prediction result and the label value of the feature to be predicted, to obtain a first similarity value; and
calculating the distance between the second prediction result and the label value of the feature to be predicted, to obtain a second similarity value;
when the second similarity value is smaller than the first similarity value (that is, the second prediction result lies closer to the label value), determining the derived features used to train the second attribute prediction model as the attribute prediction features.
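The comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions: the distance is taken to be the absolute difference, it is averaged over several samples rather than computed for a single one, and the prediction scores are made-up numbers standing in for the outputs of two trained models. None of these choices comes from the source.

```python
def mean_distance(predictions, labels):
    """Average absolute distance between predictions and label values."""
    return sum(abs(p - y) for p, y in zip(predictions, labels)) / len(labels)

def keep_derived_features(first_preds, second_preds, labels):
    """True when the derived-feature model is closer to the labels."""
    first_similarity = mean_distance(first_preds, labels)    # original-feature model
    second_similarity = mean_distance(second_preds, labels)  # derived-feature model
    return second_similarity < first_similarity

labels = [1, 0, 1, 1]
first_preds = [0.6, 0.4, 0.5, 0.7]    # hypothetical scores from the first model
second_preds = [0.9, 0.1, 0.8, 0.9]   # hypothetical scores from the second model
print(keep_derived_features(first_preds, second_preds, labels))
```

Because the similarity value is a distance, the smaller value wins, so the derived features are kept only when the second model improves on the first.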
According to a second aspect, an attribute prediction method is provided, comprising:
acquiring at least two attribute prediction features for predicting an attribute included in the operated business, obtained by the feature discovery method of any implementation of the first aspect, together with the quality evaluation results of the at least two attribute prediction features;
dividing the at least two attribute prediction features into at least two levels according to their quality evaluation results; and
deploying the attribute prediction features of the at least two levels to different businesses so that the attribute is predicted in those businesses, wherein the more important a business is, the higher the level of the attribute prediction features deployed to it.
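One way to read the leveling and deployment steps is sketched below. The splitting rule (rank by quality, then cut into equally sized levels), the one-level-per-business assignment, and all names and numbers are assumptions made for illustration; the source only requires that more important businesses receive higher-level features.

```python
def assign_levels(quality_by_feature, num_levels=2):
    """Rank features by quality and split them into levels; level 1 is the highest."""
    ranked = sorted(quality_by_feature, key=quality_by_feature.get, reverse=True)
    size = -(-len(ranked) // num_levels)  # ceiling division
    return {name: idx // size + 1 for idx, name in enumerate(ranked)}

def deploy(levels, importance_by_business):
    """Give the most important business the highest level, and so on down."""
    businesses = sorted(importance_by_business,
                        key=importance_by_business.get, reverse=True)
    lowest = max(levels.values())
    plan = {}
    for rank, business in enumerate(businesses, start=1):
        level = min(rank, lowest)  # extra businesses share the lowest level
        plan[business] = [f for f, lv in levels.items() if lv == level]
    return plan

quality = {"a": 0.8, "b": 0.5, "c": 0.3, "d": 0.1}   # hypothetical quality scores
levels = assign_levels(quality)
plan = deploy(levels, {"payments": 3, "marketing": 1})  # hypothetical businesses
print(plan)
```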
According to a third aspect, a feature discovery apparatus is provided, comprising a feature acquisition module, a feature derivation module, and a feature evaluation module;
the feature acquisition module is configured to acquire at least one original feature generated in business operations, wherein the original feature is used to predict an attribute included in the operated business;
the feature derivation module is configured to derive from the at least one original feature acquired by the feature acquisition module to obtain at least one derived feature;
the feature evaluation module is configured to evaluate the quality of the at least one derived feature obtained by the feature derivation module, to obtain an attribute prediction feature for predicting the attribute included in the operated business, wherein the quality of a derived feature characterizes how important that feature is for predicting the attribute.
According to a fourth aspect, an attribute prediction apparatus is provided, comprising a prediction data acquisition module, a level division module, and an attribute prediction module;
the prediction data acquisition module is configured to acquire at least two attribute prediction features for predicting an attribute included in the operated business, obtained by the feature discovery apparatus of the third aspect, together with the quality evaluation results of the at least two attribute prediction features;
the level division module is configured to divide the at least two attribute prediction features into at least two levels according to the quality evaluation results obtained by the prediction data acquisition module;
the attribute prediction module is configured to deploy the attribute prediction features of the at least two levels produced by the level division module to different businesses so that the attribute is predicted in those businesses, wherein the more important a business is, the higher the level of the attribute prediction features deployed to it.
According to a fifth aspect, a computing device is provided, comprising a memory and a processor, wherein the memory stores executable code and the processor, when executing the executable code, implements the method of any one of the first and second aspects.
With the method and apparatus provided by the embodiments of this specification, feature discovery first acquires the original features that were generated in business operations and are used to predict an attribute included in the operated business, and then derives from them to obtain at least one derived feature. Quality evaluation of the derived features then yields the attribute prediction features for predicting the attribute in the operated business. The attribute prediction features are thus obtained by derivation from existing features rather than by spending large amounts of time analyzing and summarizing raw data, which can greatly improve the efficiency of feature discovery.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of this specification; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a feature discovery method provided by an embodiment of this specification;
FIG. 2 is a flowchart of a feature derivation method provided by an embodiment of this specification;
FIG. 3 is a flowchart of another feature derivation method provided by an embodiment of this specification;
FIG. 4 is a flowchart of a feature evaluation method provided by an embodiment of this specification;
FIG. 5 is a flowchart of a method for calculating a feature prediction capability value provided by an embodiment of this specification;
FIG. 6 is a flowchart of another feature evaluation method provided by an embodiment of this specification;
FIG. 7 is a flowchart of an attribute prediction method provided by an embodiment of this specification;
FIG. 8 is a schematic diagram of a feature discovery apparatus provided by an embodiment of this specification;
FIG. 9 is a schematic diagram of an attribute prediction apparatus provided by an embodiment of this specification.
Detailed Description
As noted above, feature discovery is an important means of attribute prediction in business scenarios. Attribute prediction, however, requires analyzing and summarizing raw data to obtain features capable of predicting the attribute. In risk prevention and control, for example, as payment risk-control scenarios see increasingly frequent attack and defense against black- and gray-market operators, risk strategy personnel analyze cases, reconstruct the full course of each operation, and use data analysis to summarize the risk characteristics. They then abstract those characteristics into concrete risk features, on the basis of which risky accounts can be predicted.
However, relying on operations strategy personnel to mine the features used for attribute prediction takes a great deal of time and yields little output. Moreover, how features are generated directly affects attribute prediction and the quality of the business. If the operations strategy personnel differ in experience, the features and evaluation criteria they produce for attribute identification will also differ, which makes standardized feature mining difficult and lowers the accuracy of attribute prediction.
For this reason, the present solution derives features from existing features, quickly and efficiently producing features with better performance and higher risk coverage. This improves the efficiency of feature discovery, and the derived features increase the coverage of attribute-related features, thereby improving the accuracy of attribute prediction.
As shown in FIG. 1, an embodiment of this specification provides a feature discovery method, which may include the following steps:
Step 101: acquire at least one original feature generated in business operations, wherein the original feature is used to predict an attribute included in the operated business;
Step 103: derive from the at least one original feature to obtain at least one derived feature;
Step 105: evaluate the quality of the at least one derived feature to obtain an attribute prediction feature for predicting the attribute included in the operated business, wherein the quality of a derived feature characterizes how important that feature is for predicting the attribute.
In this embodiment, feature discovery first acquires the original features that were generated in business operations and are used to predict an attribute included in the operated business, and then derives from them to obtain at least one derived feature. Quality evaluation of the derived features then yields the attribute prediction features for predicting the attribute in the operated business. The attribute prediction features are thus obtained by derivation from existing features rather than by spending large amounts of time analyzing and summarizing raw data, which can greatly improve the efficiency of feature discovery.
Moreover, after the derived features are obtained from the original features, their quality is evaluated. Attribute prediction features with better performance can therefore be selected according to how important each derived feature is for predicting the attribute, which improves the accuracy of attribute prediction when those features are used.
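The overall flow of steps 101, 103 and 105 can be sketched as a small pipeline. This is only an outline under stated assumptions: `derive` and `spread` are hypothetical stand-ins for whichever derivation and quality-evaluation methods an embodiment uses, and the range-based quality score is a placeholder rather than anything described in the source.

```python
def discover_features(original_features, derive, quality, threshold):
    """Derive candidates from original features and keep those above a quality threshold."""
    derived = derive(original_features)                                    # step 103
    scores = {name: quality(values) for name, values in derived.items()}   # step 105
    return {name: score for name, score in scores.items() if score > threshold}

# Hypothetical stand-ins for the derivation and quality functions.
def derive(features):
    out = {}
    for name, values in features.items():
        out[f"{name}_doubled"] = [2 * v for v in values]
        out[f"{name}_constant"] = [1.0 for _ in values]
    return out

def spread(values):
    return max(values) - min(values)  # placeholder quality score, not from the source

original = {"amount": [1.0, 2.0, 3.0]}                                     # step 101
selected = discover_features(original, derive, spread, threshold=3.0)
print(selected)
```

Passing the derivation and evaluation steps in as functions keeps the outline independent of which of the later embodiments is plugged in.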
Each step in FIG. 1 is described below with reference to specific embodiments.
First, in step 101, at least one original feature generated in business operations is acquired.
In this step, an original feature may be a feature used to predict an attribute included in the operated business. For example, the business operation may be merchant onboarding risk management, fraud risk management, online-merchant risk management, and so on. The original features generated in business operations may then be features that strategy operations personnel obtain during day-to-day risk attack and defense, based on reconstructing the chain of events in risk cases and on statistical analysis of historical risk samples. If the attribute prediction is the prediction of risky accounts, the original features may include the user's age, gender, credit rating, transaction amount, transaction time, and similar features.
In addition, an original feature may be a feature obtained by the feature discovery method of this solution in an earlier time period. That is, the method can take the features derived in one time period as the original features for feature discovery in the next time period.
Of course, in a possible implementation, the original features may include both features that operations strategy personnel obtained through analysis and summarization and features obtained through feature derivation in the previous time period. Deriving features in this way improves the efficiency of feature discovery, and the resulting features also incorporate, to some extent, the results of manual analysis, which improves their performance in attribute prediction.
Then, in step 103, derivation is performed on the at least one original feature to obtain at least one derived feature.
In this step, once the original features have been obtained, they may be derived by means such as spatial transformation and feature combination.
For example, when features are derived by spatial transformation with a lightweight model, there may be at least two original features. As shown in FIG. 2, step 103 may derive at least one derived feature from the at least one original feature through the following steps:
Step 201: screen the at least two original features according to the degree to which each original feature covers the attribute, to obtain at least one screening feature;
Step 203: map the at least one screening feature to a dimension different from the dimension in which that screening feature currently lies, to obtain at least one derived feature.
In this embodiment, features may be derived by means of a lightweight model. Specifically, the original features are first screened according to the degree to which they cover the attribute, yielding the screening features. The screening features are then mapped to a dimension different from the one in which they currently lie, yielding the derived features. Features derived in this way can have good attribute coverage and are well suited to predicting the attribute.
For example, when performing feature derivation in steps 201 and 203, a linear or non-linear machine learning model, such as Logistic Regression or a Decision Tree, may be used to process and refine the input original features into better-performing features that capture the information the features carry. The approach places no requirement on the type of input data, which may be numeric or character data, so it adapts well to application scenarios with various data types.
Step 201 is described below.
In step 201, when the at least two original features are screened according to the degree to which each covers the attribute, each original feature whose feature value is greater than a preset first effective-feature threshold may be determined as a screening feature. Screening the original features by feature value removes data that carries only weak information about the attribute, so that the derived features are generated from data with stronger representational power, which helps improve the precision of attribute prediction.
For example, in practical application scenarios many features are not effective features, and such features usually have a feature value of 0. Performing feature discovery on features that include these invalid ones would reduce processing efficiency because of the large data volume and would also lower the quality of the derived features. The first effective-feature threshold may therefore be 0 when the original features are screened. All features with a feature value of 0, that is, all invalid features, are thereby filtered out, and feature derivation proceeds more efficiently on the effective features.
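The screening rule with a threshold of 0 can be sketched as below. The feature names and values are invented for illustration, and treating each feature as carrying a single scalar feature value is an assumption about how the rule is applied.

```python
def screen_features(feature_values, threshold=0.0):
    """Keep features whose feature value exceeds the first effective-feature threshold."""
    return {name: v for name, v in feature_values.items() if v > threshold}

# Hypothetical feature values; "unused_flag" stands for an invalid feature.
feature_values = {"credit_score": 0.72, "unused_flag": 0.0, "txn_amount": 0.31}
print(screen_features(feature_values))
# {'credit_score': 0.72, 'txn_amount': 0.31}
```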
Step 203 is described below.
In step 203, when the at least one screening feature is mapped to a dimension different from the one in which it currently lies, the at least one derived feature may be obtained by applying a linear transformation to the at least one screening feature. With suitably chosen linear transformation coefficients, features carrying various degrees of information can be derived.
For example, in a possible implementation, a screening feature may be mapped to another dimension by the following formula:
Here, the output of the formula is the derived feature obtained by the mapping, $z_{v,t}$ denotes the screening feature, $\omega_G$ denotes the mapping weight applied to the screening feature, and $b_G$ denotes the mapping offset applied to the screening feature. In the formula above, $\omega_G$ and $b_G$ may be learnable parameters obtained through neural network training. Their dimensions may be $\omega_G \in \mathbb{R}^C$ and $b_G \in \mathbb{R}^C$, respectively.
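The mapping formula itself is not reproduced in the text. Given that the transformation is described as linear and that both parameters lie in $\mathbb{R}^C$, the sketch below assumes the form $\omega_G z_{v,t} + b_G$ applied to a scalar screening feature, which yields a C-dimensional derived feature. The formula, the function name, and the parameter values are assumptions for illustration only; in practice the parameters would be learned.

```python
def map_feature(z, omega_g, b_g):
    """Map a scalar screening feature z to a C-dimensional derived feature."""
    if len(omega_g) != len(b_g):
        raise ValueError("omega_g and b_g must have the same dimension C")
    return [w * z + b for w, b in zip(omega_g, b_g)]

omega_g = [0.5, -1.0, 2.0]   # hypothetical learned weights, C = 3
b_g = [0.0, 1.0, -0.5]       # hypothetical learned offsets
print(map_feature(2.0, omega_g, b_g))  # [1.0, -1.0, 3.5]
```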
In another possible implementation, when features are derived by feature combination, there may be at least two original features. As shown in FIG. 3, when step 103 derives at least one derived feature from the at least one original feature, the following steps may be performed for any first original feature and any second original feature among the at least two original features:
Step 301: split the first original feature to obtain M first split features;
Step 303: split the second original feature to obtain N second split features, wherein M and N are both integers greater than 0;
Step 305: combine the M first split features with the N second split features to obtain M×N primary derived features, wherein each primary derived feature is the combination of one first split feature and one second split feature;
Step 307: screen the M×N primary derived features according to the degree to which each primary derived feature covers the attribute, to obtain at least one derived feature.
In this embodiment, derived features may be generated by feature combination. Specifically, the first original feature is split into M first split features and the second original feature is split into N second split features. The M first split features and the N second split features are then combined to obtain M×N primary derived features. Finally, the primary derived features are screened according to the degree to which each covers the attribute, yielding the derived features. Combining the original features pairwise in this way makes it possible to find features with greater discriminative power.
For example, suppose the first original feature is age and the second original feature is gender. When step 301 splits the first original feature, ages 0 to 100 may be split into 5 ranges: 0-20, 21-40, 41-60, 61-80 and 81-100. When step 303 splits the second original feature, it may take the 2 values male and female. Combining the split age features with the split gender features pairwise then gives 5×2=10 features: males aged 0-20, males aged 21-40, males aged 41-60, males aged 61-80, males aged 81-100, females aged 0-20, females aged 21-40, females aged 41-60, females aged 61-80, and females aged 81-100. When these are further screened by attribute coverage, the combined features can be filtered according to their specific feature values.
其中,两两组合得到的特征值可以是加权平均得到,即组合后得到的每一个衍生特征的组合特征值可以为对应生成该衍生特征的第一拆分特征和第二拆分特征的特征值通过加权平均得到的值。而步骤307在根据每一个初级衍生特征对属性的覆盖程度对M×N个初级衍生特征进行筛选得到至少一个衍生特征时,可以将M×N个初级衍生特征中,组合特征值大于预设的第二有效特征阈值的初级衍生特征确定为衍生特征。如此,在通过组合得到初级衍生特征之后,进一步根据特征值进行筛选能够得到特征表征强度更高的特征,即属性浓度更高的特征,从而得到的衍生特征在进行属性预测时性能更佳。The eigenvalues obtained by the pairwise combination may be obtained by weighted average, that is, the combined eigenvalue of each derived feature obtained after the combination may be the eigenvalue of the first split feature and the second split feature corresponding to the derived feature. Values obtained by weighted average. In
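Steps 301 to 307 can be sketched as follows, using the age and gender example. This is an illustrative sketch only: the per-bucket feature values, the equal weights, the threshold of 0.4 and the helper names `cross_features` and `screen` are all assumptions, since the text does not fix how feature values are measured or weighted.

```python
from itertools import product
from typing import Dict, List, Tuple

# Split features with hypothetical feature values (e.g. a per-bucket attribute rate).
age_buckets: Dict[str, float] = {
    "0-20": 0.10, "21-40": 0.30, "41-60": 0.50, "61-80": 0.20, "81-100": 0.05,
}
genders: Dict[str, float] = {"male": 0.40, "female": 0.20}


def cross_features(
    first: Dict[str, float],
    second: Dict[str, float],
    w_first: float = 0.5,
    w_second: float = 0.5,
) -> Dict[Tuple[str, str], float]:
    """Steps 301-305: pair every first split feature with every second split feature.

    The combined value is the weighted average of the two split-feature values.
    """
    total = w_first + w_second
    return {
        (a, b): (w_first * va + w_second * vb) / total
        for (a, va), (b, vb) in product(first.items(), second.items())
    }


def screen(combined: Dict[Tuple[str, str], float], threshold: float) -> List[Tuple[str, str]]:
    """Step 307: keep primary derived features whose combined value exceeds the threshold."""
    return [key for key, value in combined.items() if value > threshold]


primary = cross_features(age_buckets, genders)
print(len(primary))          # 10 primary derived features (5 x 2)
print(screen(primary, 0.4))  # [('41-60', 'male')]
```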
Further, in step 105, the quality of the at least one derived feature is evaluated to obtain attribute prediction features for predicting the attributes contained in the operated business.
In this step, evaluating the quality of a derived feature mainly means evaluating how important the derived feature is for predicting the attribute, that is, evaluating its predictive ability. For example, in one possible implementation, the evaluation may be performed by calculating a predictive ability value. As shown in FIG. 4, the derived features may be evaluated through the following steps:
Step 401: Calculate the attribute prediction ability value of each derived feature;
Step 403: Among the attribute prediction ability values, determine the derived features whose attribute prediction ability value is greater than a preset evaluation threshold as attribute prediction features.
In this embodiment, the attribute prediction ability value of each derived feature is calculated first, and the derived features whose value exceeds the preset evaluation threshold are then determined as attribute prediction features. Accurately calculating the attribute prediction ability value of each derived feature makes it possible to select the better-performing derived features and thereby improve the accuracy of attribute prediction.
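As a small illustration of step 403, the sketch below keeps the derived features whose ability value exceeds a threshold and returns them strongest first. The feature names, scores and threshold are hypothetical; the ability values are assumed to have been computed already in step 401.

```python
from typing import Dict, List


def select_attribute_prediction_features(
    ability_by_feature: Dict[str, float], threshold: float
) -> List[str]:
    """Step 403: keep derived features whose ability value exceeds the threshold."""
    kept = [(name, value) for name, value in ability_by_feature.items() if value > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)  # strongest predictor first
    return [name for name, _ in kept]


# Hypothetical ability values, as would be produced by step 401.
scores = {"age_x_gender": 0.35, "credit_x_gender": 0.12, "age_x_region": 0.02}
print(select_attribute_prediction_features(scores, threshold=0.1))
# ['age_x_gender', 'credit_x_gender']
```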
In one possible implementation, the attribute may include a first attribute result to be predicted and a second attribute result opposite to the first. In that case, when step 401 calculates the attribute prediction ability values, as shown in FIG. 5, the value for each derived feature may be obtained through the following steps:
For each derived feature:
Step 501: Group the current derived feature equidistantly to obtain k groups, where equidistant grouping includes at least one of equal-height and equal-width grouping, and k is an integer greater than 0;
Step 503: Calculate the attribute prediction ability value of each group using the formula for attribute prediction ability;
The attribute prediction ability value is calculated as:
$$IV_i = \left(\frac{y_i}{y_s} - \frac{x_i}{x_s}\right)\ln\left(\frac{y_i / y_s}{x_i / x_s}\right)$$
where $IV_i$ denotes the attribute prediction ability value of the i-th group, $y_i$ denotes the number of first attribute results in the i-th group, $y_s$ denotes the number of first attribute results in the current derived feature as a whole, $x_i$ denotes the number of second attribute results in the i-th group, and $x_s$ denotes the number of second attribute results in the current derived feature as a whole;
Step 505: Sum the attribute prediction ability values of all groups to obtain the attribute prediction ability value of the current derived feature.
In this embodiment, to calculate the attribute prediction ability value of each derived feature, the current derived feature is first grouped equidistantly, for example by equal height or equal width, and the value for each group is then calculated with the above formula. Finally, the values of all groups are summed to obtain the attribute prediction ability value of the current derived feature.
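The grouping in step 501 can be sketched as below. This is an assumption-laden illustration: "equal-width" is taken to mean bins of equal range, and "equal-height" is taken to mean equal-frequency bins (each holding roughly the same number of samples), which the text does not spell out. The function names and the sample ages are hypothetical.

```python
from typing import List


def equal_width_bins(values: List[float], k: int) -> List[int]:
    """Assign each value to one of k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum value would fall into bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values: List[float], k: int) -> List[int]:
    """Assign each value to one of k bins holding (nearly) equal numbers of samples."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * k // len(values)
    return bins


ages = [18, 22, 25, 31, 40, 47, 52, 68, 90]
print(equal_width_bins(ages, 3))      # [0, 0, 0, 0, 0, 1, 1, 2, 2]
print(equal_frequency_bins(ages, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Equal-width bins can end up very unevenly populated on skewed data, while equal-frequency bins keep group sizes balanced, which helps avoid groups with very small counts.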
The attribute prediction ability value measures the amount of information carried by a variable; its magnitude determines how strongly the independent variable influences the target variable. For example, in an application scenario where users give feedback on information, the first attribute result may be customers who responded and the second attribute result may be customers who did not respond. Then $y_i/y_s$ is the proportion of responding customers in the i-th group among all responding customers across the k groups of the derived feature, and $x_i/x_s$ is the proportion of non-responding customers in the i-th group among all non-responding customers across the k groups.
Note that no group of a derived feature should have a first attribute result count of 0 or a second attribute result count of 0. When the count of an attribute result in a group is 0, the corresponding ln value becomes infinite (negative infinity when the first attribute result count is 0), and the IV value becomes positive infinity. Where possible, such a group should be turned directly into a rule and used as a precondition or supplementary condition of the model.
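Steps 503 and 505 can be sketched as follows, assuming the per-group value takes the standard information value form (y_i/y_s − x_i/x_s)·ln((y_i/y_s)/(x_i/x_s)), which is consistent with the symbol definitions and the zero-count behaviour described above. The function name and the sample counts are hypothetical, and raising an error for a zero-count group is one possible design choice, not something the text prescribes.

```python
import math
from typing import Sequence


def information_value(first_counts: Sequence[int], second_counts: Sequence[int]) -> float:
    """Sum the per-group values (y_i/y_s - x_i/x_s) * ln((y_i/y_s) / (x_i/x_s)).

    first_counts[i] is y_i and second_counts[i] is x_i for group i.
    """
    if len(first_counts) != len(second_counts):
        raise ValueError("both count lists must cover the same k groups")
    y_s, x_s = sum(first_counts), sum(second_counts)
    total = 0.0
    for i, (y_i, x_i) in enumerate(zip(first_counts, second_counts)):
        if y_i == 0 or x_i == 0:
            # The logarithm diverges here; such a group is better handled
            # as a separate rule than scored.
            raise ValueError(f"group {i} has a zero count; handle it as a rule")
        p_first, p_second = y_i / y_s, x_i / x_s
        total += (p_first - p_second) * math.log(p_first / p_second)
    return total


# Hypothetical counts of responding (y) and non-responding (x) customers in k = 3 groups.
responded = [10, 30, 60]
not_responded = [40, 40, 20]
print(round(information_value(responded, not_responded), 4))
```

Each term is non-negative because the difference and the logarithm always share the same sign, so a larger sum indicates that the groups separate the two attribute results more strongly.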
During modeling, the IV value is mainly used for feature selection. Because the IV value of a feature variable indicates the strength of its predictive ability, variables can be ranked by predictive ability simply by sorting their IV values from high to low and selecting accordingly.
In another possible implementation, step 105 may also perform the evaluation with a machine learning algorithm based on a distributed gradient decision tree, that is, determine the attribute prediction features according to the accuracy of attribute prediction performed with the original features and with the derived features respectively. For example, as shown in FIG. 6, determining the attribute prediction features may specifically include the following steps:
Step 601: Train a first attribute prediction model using at least one of the original features;
Step 603: Train a second attribute prediction model using at least one of the derived features;
Step 605: Use the first attribute prediction model and the second attribute prediction model respectively to perform attribute prediction on the same feature to be predicted, obtaining a first prediction result from the first model and a second prediction result from the second model;
Step 607: Calculate the distance between the first prediction result and the label value of the feature to be predicted, to obtain a first similarity value;
Step 609: Calculate the distance between the second prediction result and the label value of the feature to be predicted, to obtain a second similarity value;
Step 611: When the second similarity value is smaller than the first similarity value, determine the derived features used to train the second attribute prediction model as attribute prediction features.
In this embodiment, to determine the attribute prediction features according to prediction accuracy, models are first trained separately on the original features and on the derived features, giving the first and second attribute prediction models. Both models are then used to predict the attribute of the same feature to be predicted, giving the first and second prediction results. The distances between each prediction result and the actual label value are then compared. If the prediction obtained from the derived features is closer to the label value than the prediction obtained from the original features, the derived features predict the attribute well, and the derived features used to train the second model can be determined as attribute prediction features. Because every attribute prediction feature obtained this way has been selected through actual prediction, it can be expected to perform well.
In step 601, when the first attribute prediction model is trained with original features, at least one feature may be randomly selected from the original features for training. Likewise, in step 603, at least one feature may be randomly selected from the derived features to train the second attribute prediction model. Models may also be trained with different numbers of features to screen the attribute features further. For example, suppose the original features are A and B and the derived features are C, D, E and F, and the first attribute prediction model is trained with A and B. The second attribute prediction model may first be trained with C and D. If the comparison shows that the second model outperforms the first, C and D may be determined as attribute prediction features. If the second model is then trained with C, D and E and its performance declines, derived feature E is filtered out; if its performance improves, derived feature E may also be determined as an attribute prediction feature.
Model performance may be determined by predicting the same feature to be predicted: the closer the prediction is to the actual label, the better the model performs; conversely, the farther it is from the actual label, the worse the model performs.
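The incremental comparison described above can be sketched as a greedy loop. This is a simplified illustration, not the method as claimed: it adds derived features one at a time instead of starting from a pair, and it abstracts model training behind a caller-supplied function. The names `greedy_select` and `distance_of`, and the toy distances, are hypothetical; in practice `distance_of` would train a model (for example a gradient-boosted tree) on the given features and return the distance between its predictions and the labels.

```python
from typing import Callable, List, Sequence


def greedy_select(
    candidates: Sequence[str],
    baseline_distance: float,
    distance_of: Callable[[List[str]], float],
) -> List[str]:
    """Keep a derived feature only if adding it brings predictions closer to the labels.

    baseline_distance is the distance achieved by the model trained on original features.
    """
    selected: List[str] = []
    best = baseline_distance
    for feature in candidates:
        trial = selected + [feature]
        distance = distance_of(trial)
        if distance < best:  # smaller distance to the labels means better performance
            selected, best = trial, distance
    return selected


# Hypothetical distances standing in for real model training and evaluation.
toy = {("C",): 0.30, ("C", "D"): 0.25, ("C", "D", "E"): 0.28, ("C", "D", "F"): 0.22}
result = greedy_select(["C", "D", "E", "F"], 0.40, lambda fs: toy[tuple(fs)])
print(result)  # ['C', 'D', 'F']
```

In this toy run, adding E makes the model worse, so E is filtered out, while F improves on the C and D model and is kept.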
In yet another possible implementation, step 105 may also evaluate the quality of derived features with a heat map. For example, after step 103 crosses features pairwise using the feature combination method, a table like Table 1 below can be formed for feature x and feature y. The data distribution in each cell is then tallied and each cell is finally given a score; the higher the score, the better the crossed feature performs. A heat map of this kind makes it easy to see at a glance how well each feature performs, so the derived features can be screened directly to obtain attribute prediction features.
Table 1
As Table 1 shows, if the threshold is set to 1, the screened entries [feature x, feature y, value] are [1, 1, 1.3], [3, 2, 1.5], [4, 2, 2.3], [4, 3, 1.5], [3, 4, 1.5] and [3, 5, 1.4]. The attribute features obtained are then the features formed by combining feature x and feature y in these six entries.
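The threshold filter over the heat-map cells can be sketched as follows. The six cells above the threshold are the ones quoted in the text; the two low-scoring cells are hypothetical fillers added only so that the filter has something to discard, since the full table is not reproduced here. The function name `cells_above` is likewise an assumption.

```python
from typing import Dict, List, Tuple

# Score per (feature x bucket, feature y bucket).
scores: Dict[Tuple[int, int], float] = {
    (1, 1): 1.3, (3, 2): 1.5, (4, 2): 2.3, (4, 3): 1.5, (3, 4): 1.5, (3, 5): 1.4,
    (2, 1): 0.4, (1, 3): 0.9,  # hypothetical cells below the threshold
}


def cells_above(grid: Dict[Tuple[int, int], float], threshold: float) -> List[list]:
    """Return [x, y, score] for every cell whose score exceeds the threshold."""
    return [[x, y, s] for (x, y), s in grid.items() if s > threshold]


for cell in cells_above(scores, 1.0):
    print(cell)
```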
As shown in FIG. 7, an embodiment of this specification further provides an attribute prediction method, which may include the following steps:
Step 701: Acquire at least two attribute prediction features, obtained by the feature discovery method of the above embodiments, for predicting the attributes contained in the operated business, together with the quality evaluation results of the at least two attribute prediction features;
Step 703: Divide the at least two attribute prediction features into at least two levels according to their quality evaluation results; and
Step 705: Deploy the attribute prediction features of the at least two levels to different businesses, so that the attribute is predicted in each business, where the more important the business, the higher the level of the attribute prediction features deployed to it.
In this embodiment, to perform attribute prediction, at least two attribute prediction features obtained by the feature discovery method provided in the embodiments of this specification are acquired first, together with their quality evaluation results. The at least two attribute prediction features are then divided into at least two levels according to those results, and features of different levels are deployed to different businesses so that the attribute is predicted in each business.
For example, in predicting risky accounts, once the attribute prediction features are obtained, they can be assigned different risk levels according to their feature values. Take the attribute prediction features obtained by crossing user credit score with gender: males with credit 0-20, males with credit 21-40, males with credit 41-60, males with credit 61-80, males with credit 81-100, females with credit 0-20, females with credit 21-40, females with credit 41-60, females with credit 61-80, and females with credit 81-100. Different features can then be deployed to different scenarios: features for users with lower credit can be deployed to business scenarios that need close attention, and features for users with higher credit to business scenarios that need relatively little attention.
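Steps 703 and 705 can be sketched as below. This is only one possible reading: levels are formed by ranking features on their quality score and cutting the ranking into equal-sized tiers, and each tier is assigned to the business of matching importance. The function names, quality scores and business names are hypothetical, and the sketch assumes at least as many businesses as levels.

```python
from typing import Dict, List


def assign_levels(quality: Dict[str, float], n_levels: int) -> Dict[str, int]:
    """Step 703: rank features by quality and cut the ranking into tiers (0 = highest)."""
    ranked = sorted(quality, key=lambda name: quality[name], reverse=True)
    return {name: rank * n_levels // len(ranked) for rank, name in enumerate(ranked)}


def deploy(levels: Dict[str, int], businesses_by_importance: List[str]) -> Dict[str, List[str]]:
    """Step 705: the most important business receives the highest-level features."""
    plan: Dict[str, List[str]] = {b: [] for b in businesses_by_importance}
    for name, level in levels.items():
        plan[businesses_by_importance[level]].append(name)
    return plan


# Hypothetical quality scores (e.g. IV values) and business names, most important first.
quality = {"f1": 0.9, "f2": 0.7, "f3": 0.4, "f4": 0.1}
levels = assign_levels(quality, n_levels=2)
print(deploy(levels, ["risk_control", "marketing"]))
# {'risk_control': ['f1', 'f2'], 'marketing': ['f3', 'f4']}
```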
As shown in FIG. 8, an embodiment of this specification provides a feature discovery apparatus, comprising a feature acquisition module 801, a feature derivation module 802 and a feature evaluation module 803;
The feature acquisition module 801 is configured to acquire at least one original feature generated in business operation, where the original feature is used to predict an attribute contained in the operated business;
The feature derivation module 802 is configured to derive from the at least one original feature acquired by the feature acquisition module 801 to obtain at least one derived feature;
The feature evaluation module 803 is configured to evaluate the quality of the at least one derived feature obtained by the feature derivation module 802, to obtain attribute prediction features for predicting the attributes contained in the operated business, where the quality of a derived feature characterizes how important that feature is for predicting the attribute.
In one possible implementation, the at least one original feature acquired by the feature acquisition module 801 includes at least two original features;
When deriving at least one derived feature from the at least one original feature, the feature derivation module 802 is configured to perform the following operations:
Screen the at least two original features according to the degree to which each original feature covers the attribute, to obtain at least one screened feature;
Map the at least one screened feature to a dimension different from the one in which it currently lies, to obtain at least one derived feature.
In one possible implementation, when screening the at least two original features according to the degree to which each original feature covers the attribute to obtain at least one screened feature, the feature derivation module 802 is configured to perform the following operation:
Determine, among the at least two original features, the original features whose feature value is greater than a preset first effective-feature threshold as screened features.
In one possible implementation, the at least one original feature acquired by the feature acquisition module 801 includes at least two original features;
When deriving at least one derived feature from the at least one original feature, the feature derivation module 802 is configured to perform the following operations:
For any first original feature and any second original feature among the at least two original features, perform:
Split the first original feature to obtain M first split features; and
Split the second original feature to obtain N second split features, where M and N are both integers greater than 0;
Combine the M first split features with the N second split features to obtain M×N primary derived features, where each primary derived feature is formed by combining one first split feature with one second split feature;
Screen the M×N primary derived features according to the degree to which each primary derived feature covers the attribute, to obtain at least one derived feature.
In one possible implementation, when screening the M×N primary derived features according to the degree to which each primary derived feature covers the attribute to obtain at least one derived feature, the feature derivation module 802 is configured to perform the following operation:
Determine, among the M×N primary derived features, the primary derived features whose combined feature value is greater than a preset second effective-feature threshold as derived features, where the combined feature value is the weighted average of the feature values of the first split feature and the second split feature that were combined into the corresponding primary derived feature.
In one possible implementation, when evaluating the quality of the at least one derived feature to obtain attribute prediction features, the feature evaluation module 803 is configured to perform the following operations:
Calculate the attribute prediction ability value of each derived feature;
Among the attribute prediction ability values, determine the derived features whose attribute prediction ability value is greater than a preset evaluation threshold as attribute prediction features.
In one possible implementation, the attribute includes a first attribute result to be predicted and a second attribute result opposite to the first attribute result;
When calculating the attribute prediction ability value of each derived feature, the feature evaluation module 803 is configured to perform the following operations:
For each derived feature, perform:
Group the current derived feature equidistantly to obtain k groups, where equidistant grouping includes at least one of equal-height and equal-width grouping, and k is an integer greater than 0;
Calculate the attribute prediction ability value of each group using the following formula:
$$IV_i = \left(\frac{y_i}{y_s} - \frac{x_i}{x_s}\right)\ln\left(\frac{y_i / y_s}{x_i / x_s}\right)$$
where $IV_i$ denotes the attribute prediction ability value of the i-th group, $y_i$ denotes the number of first attribute results in the i-th group, $y_s$ denotes the number of first attribute results in the current derived feature as a whole, $x_i$ denotes the number of second attribute results in the i-th group, and $x_s$ denotes the number of second attribute results in the current derived feature as a whole;
Sum the attribute prediction ability values of all groups to obtain the attribute prediction ability value of the current derived feature.
In one possible implementation, when evaluating the quality of the at least one derived feature to obtain attribute prediction features, the feature evaluation module 803 is configured to perform the following operation:
Determine the attribute prediction features according to the accuracy of attribute prediction performed with the original features and with the derived features respectively.
In one possible implementation, when determining the attribute prediction features according to the accuracy of attribute prediction performed with the original features and with the derived features respectively, the feature evaluation module 803 is configured to perform the following operations:
Train a first attribute prediction model using at least one of the original features;
Train a second attribute prediction model using at least one of the derived features;
Use the first attribute prediction model and the second attribute prediction model respectively to perform attribute prediction on the same feature to be predicted, obtaining a first prediction result from the first model and a second prediction result from the second model;
Calculate the distance between the first prediction result and the label value of the feature to be predicted, to obtain a first similarity value; and
Calculate the distance between the second prediction result and the label value of the feature to be predicted, to obtain a second similarity value;
When the second similarity value is smaller than the first similarity value, determine the derived features used to train the second attribute prediction model as attribute prediction features.
As shown in FIG. 9, an embodiment of this specification further provides an attribute prediction apparatus, comprising a prediction data acquisition module 901, a level division module 902 and an attribute prediction module 903;
The prediction data acquisition module 901 is configured to acquire at least two attribute prediction features, obtained by the feature discovery apparatus of the above embodiments, for predicting the attributes contained in the operated business, together with the quality evaluation results of the at least two attribute prediction features;
The level division module 902 is configured to divide the at least two attribute prediction features into at least two levels according to the quality evaluation results obtained by the prediction data acquisition module 901;
The attribute prediction module 903 is configured to deploy the attribute prediction features of the at least two levels divided by the level division module 902 to different businesses, so that the attribute is predicted in each business, where the more important the business, the higher the level of the attribute prediction features deployed to it.
This specification further provides a computer-readable storage medium storing a computer program which, when executed in a computer, causes the computer to perform the method of any embodiment of this specification.
This specification further provides a computing device comprising a memory and a processor, where the memory stores executable code and the processor, when executing the executable code, implements the method of any embodiment of this specification.
It should be understood that the structures illustrated in the embodiments of this specification do not constitute a specific limitation on the feature discovery apparatus or the attribute prediction apparatus. In other embodiments of the specification, these apparatuses may include more or fewer components than illustrated, combine some components, split some components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Because the information exchange between the units of the above apparatuses, their execution processes and the like are based on the same concept as the method embodiments of this specification, details can be found in the description of the method embodiments and are not repeated here.
Those skilled in the art will appreciate that, in one or more of the above examples, the functions described in this specification may be implemented in hardware, software, firmware or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the objectives, technical solutions and beneficial effects described in this specification. It should be understood that the above is merely specific embodiments of the present invention and is not intended to limit its scope of protection; any modification, equivalent replacement, improvement or the like made on the basis of the technical solutions of the present invention shall fall within the scope of protection of the present invention.
Claims (13)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210619787.4A CN114943607A (en) | 2022-06-02 | 2022-06-02 | Feature discovery method, attribute prediction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114943607A true CN114943607A (en) | 2022-08-26 |
Family
ID=82908610
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210619787.4A Pending CN114943607A (en) | 2022-06-02 | 2022-06-02 | Feature discovery method, attribute prediction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114943607A (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004061556A2 (en) * | 2002-12-30 | 2004-07-22 | Fannie Mae | System and method of processing data pertaining to financial assets |
US7272593B1 (en) * | 1999-01-26 | 2007-09-18 | International Business Machines Corporation | Method and apparatus for similarity retrieval from iterative refinement |
US20180060738A1 (en) * | 2014-05-23 | 2018-03-01 | DataRobot, Inc. | Systems and techniques for determining the predictive value of a feature |
WO2018180970A1 (en) * | 2017-03-30 | 2018-10-04 | 日本電気株式会社 | Information processing system, feature value explanation method and feature value explanation program |
WO2019069507A1 (en) * | 2017-10-05 | 2019-04-11 | 日本電気株式会社 | Feature value generation device, feature value generation method, and feature value generation program |
CN110263821A (en) * | 2019-05-29 | 2019-09-20 | 阿里巴巴集团控股有限公司 | Transaction feature generates the generation method and device of the training of model, transaction feature |
US20190370833A1 (en) * | 2018-06-04 | 2019-12-05 | Zuora, Inc. | Systems and methods for predicting churn in a multi-tenant system |
CN111507831A (en) * | 2020-05-29 | 2020-08-07 | 长安汽车金融有限公司 | Credit risk automatic assessment method and device |
CN112328657A (en) * | 2020-11-03 | 2021-02-05 | 中国平安人寿保险股份有限公司 | Feature derivation method, feature derivation device, computer equipment and medium |
WO2021177593A1 (en) * | 2020-03-03 | 2021-09-10 | 한국과학기술원 | Machine learning-based future innovation prediction method and system therefor |
CN113610175A (en) * | 2021-08-16 | 2021-11-05 | 上海冰鉴信息科技有限公司 | Service strategy generation method and device and computer readable storage medium |
US11250368B1 (en) * | 2020-11-30 | 2022-02-15 | Shanghai Icekredit, Inc. | Business prediction method and apparatus |
CN114266643A (en) * | 2021-12-14 | 2022-04-01 | 上海孚厘科技有限公司 | Enterprise mining method, device, equipment and storage medium based on fusion algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108564286B (en) | Artificial intelligence financial risk-control credit assessment method and system based on big data credit investigation | |
CN112270545A (en) | Financial risk prediction method and device based on migration sample screening and electronic equipment | |
CN112001788B (en) | Credit card illegal fraud identification method based on RF-DBSCAN algorithm | |
CN112561685B (en) | Customer classification method and device | |
CN111178675A (en) | Electricity bill recovery risk prediction method, system, storage medium and computer equipment based on LR-Bagging algorithm | |
CN117093782B (en) | Electric power artificial intelligence model system and method | |
CN109242250A (en) | User behavior confidence level detection method based on entropy method and cloud model | |
CN111738331A (en) | User classification method and device, computer-readable storage medium and electronic device | |
JP2021033711A (en) | Anomaly detection device, anomaly detection method, and program | |
CN112801231A (en) | Decision model training method and device for business object classification | |
CN114140230A (en) | Loan credit line determining method, electronic device and storage medium thereof | |
CN113627997A (en) | Data processing method, device, electronic device and storage medium | |
CN115114851B (en) | Score card modeling method and device based on five-fold cross validation | |
CN119558865B (en) | Numerical control machine tool aftermarket service project management method and management system | |
CN119006145B (en) | Financial platform risk prediction method and system based on multi-source user behavior data | |
CN113837481B (en) | Financial big data management system based on block chain | |
JP7256766B2 (en) | Inference basis analysis device and inference basis analysis method | |
CN116911994B (en) | External trade risk early warning system | |
Zhao et al. | Network-based feature extraction method for fraud detection via label propagation | |
CN118195756A (en) | Data analysis method for resource allocation and electronic equipment | |
CN114943607A (en) | Feature discovery method, attribute prediction method and device | |
CN118094224A (en) | Data processing method, device and equipment | |
CN110472680B (en) | Object classification method, device and computer-readable storage medium | |
Wahyuningrum et al. | An Extended Consistent Fuzzy Preference Relation to Evaluating Website Usability | |
CN113822490B (en) | Asset collection method and device based on artificial intelligence and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20220826 |