CN110837841B - KPI degradation root cause identification method and device based on random forest

Info

Publication number: CN110837841B
Application number: CN201810938061.0A
Authority: CN
Inventors: 张国华; 何峥; 张琪斌; 陈香; 王明
Original assignee: Beijing Boco Inter Telecom Technology Co ltd
Current assignee: Beijing Boco Inter Telecom Technology Co ltd
Priority date: 2018-08-17
Filing date: 2018-08-17
Publication date: 2024-05-21
Anticipated expiration: 2038-08-17
Also published as: CN110837841A

Abstract

The invention discloses a KPI degradation root cause identification method based on a random forest, comprising the following steps: acquiring the basic data needed to establish a KPI degradation root cause analysis model, wherein the basic data comprises historical data of influencing factors and actual data of the KPI to be analyzed; selecting a set proportion of the basic data as a training set, training a number of decision trees on the training set, and constructing the KPI degradation root cause analysis model from the decision trees; and taking the rest of the basic data as a test set, and using the KPI degradation root cause analysis model to obtain the influence factor (importance score) of each influencing factor affecting each KPI. The invention also discloses a KPI degradation root cause identification device based on the random forest. The invention can improve the accuracy and efficiency of identifying the influencing factors.

Description

KPI degradation root cause identification method and device based on random forest

Technical Field

The invention relates to the technical field of data analysis and machine learning in the communication industry, in particular to a technology for identifying a degradation root cause.

Background

In the operation and management of a mobile communication network, certain key KPIs, such as call drop rate and call loss, require close attention. Beyond daily maintenance, operators want to know which factors influence these KPIs and how the KPIs relate to the network, so that network optimization tasks can later be assigned and assured.

Traditional root cause identification methods have low operating efficiency and calculation accuracy, and it is especially difficult for them to obtain an accurate root cause when there are thousands of input variables or when the training data has missing information.

Therefore, how to quickly and accurately identify the degradation root causes is a problem to be solved.

Disclosure of Invention

The invention provides a KPI degradation root cause identification method based on a random forest, which comprises the following steps:

acquiring the basic data needed to establish a KPI degradation root cause analysis model, wherein the basic data comprises historical data of influencing factors and actual data of the KPI to be analyzed;

selecting a set proportion of the basic data as a training set, training a number of decision trees on the training set, and constructing the KPI degradation root cause analysis model from the decision trees;

and taking the rest of the basic data as a test set, and using the KPI degradation root cause analysis model to obtain the influence factor (importance score) of each influencing factor affecting each KPI.

The influencing factors comprise a retainability class index, an access class index, a mobility class index, a resource class index and a system capacity class index, and the influencing factors serve as the sample features.

Further, training a number of decision trees on the training set specifically comprises the following steps:

randomly drawing N training samples, with replacement, from a training set of size N to serve as the training set of a decision tree;

and, where the feature dimension of each training sample is M, randomly selecting m (m ≤ M) features as a feature subset, and at each split of the tree selecting the optimal feature from the m features to split on, thereby obtaining a decision tree.

Further, constructing the KPI degradation root cause analysis model from the decision trees specifically comprises the following steps:

obtaining k decision trees to generate a random forest model, namely the KPI degradation root cause analysis model;

and averaging the results of all the decision trees to obtain the result of the random forest model, namely the analysis result of the KPI degradation root cause analysis model.

Specifically, each node of a decision tree corresponds to an influencing factor; the average decrease in decision tree impurity brought about by each influencing factor is calculated, and this impurity decrease is taken as the influence factor of that influencing factor.

Preferably, the KPI degradation root cause analysis model is fitted using a variance or least squares criterion.

Preferably, the influence factors obtained for the influencing factors of the KPI are ranked, and the influencing factors are output according to the ranking result.

The invention also discloses a KPI degradation root cause identification device based on the random forest, which comprises:

The data acquisition module is used for acquiring relevant basic data for establishing a KPI degradation root cause analysis model, wherein the basic data comprises historical data of influence factors and KPI actual data to be analyzed;

The model building module is used for selecting relevant basic data acquired by the data acquisition module with set proportion as a training set, training a certain number of decision trees according to the training set, and building the KPI degradation root cause analysis model by the decision trees;

And the influence factor determining module is used for taking the basic data remaining after the model building module selects the training set as a test set, and using the KPI degradation root cause analysis model built by the model building module to obtain the influence factor of each influencing factor affecting each KPI.

The model building module further comprises:

The training set selecting unit is used for selecting a set proportion of the basic data as a training set and, where the size of the training set is N, randomly drawing N training samples from it with replacement to serve as the training set of a decision tree;

The decision tree acquisition unit is used for, where the feature dimension of each sample is M, specifying a constant m < M, randomly selecting a subset of m features from the M features, and selecting the optimal feature from the m features to split on, so as to obtain a decision tree;

The model building unit is used for repeating the method by which the decision tree acquisition unit obtains a decision tree k times, building k decision trees to obtain the KPI degradation root cause analysis model;

And the calculating unit is used for averaging the results of the k decision trees built by the model building unit and taking the average as the result of the KPI degradation root cause analysis model.

The influence factor determination module further includes:

the test set acquisition unit is used for taking the basic data after the training set selection unit selects the training set as a test set;

the test data acquisition unit is used for acquiring historical data of influence factors in a certain time and actual data of the KPI indexes in a certain time in the test set acquisition unit;

The influence factor calculation unit is used for inputting the historical data and the actual data acquired by the test data acquisition unit into the KPI degradation root cause analysis model determined by the model building module, taking each node of the decision trees in the model as an influencing factor, and calculating the average decrease in decision tree impurity brought about by that influencing factor;

and the influence factor determining unit is used for determining the impurity decrease calculated by the influence factor calculation unit as the influence factor of the influencing factor on the KPI.

Preferably, the device further comprises a sorting module for sorting the determined influence factors according to a set rule;

and a main influencing factor determining module for determining the main influencing factors affecting the KPI according to the sorting result of the sorting module.

According to the above technical scheme, the random-forest-based KPI degradation root cause identification method disclosed by the embodiments of the invention establishes a random-forest-based KPI degradation root cause analysis model from the collected historical data of a plurality of influencing factors and the actual data of the KPI to be analyzed. The historical data and the KPI data are input into this model to obtain the influence factors of the influencing factors affecting the KPI, and the main influencing factors are ranked and output, thereby improving the accuracy and efficiency of the result.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a KPI degradation root cause identification method based on a random forest, which is provided by the embodiment of the application;

FIG. 2 is a flow chart of a method according to a second embodiment of the present application;

FIG. 3 is a flow chart of a method according to a third embodiment of the present application;

FIG. 4 is a flow chart of a method according to a fourth embodiment of the present application;

FIG. 5 is a schematic structural diagram of a KPI degradation root cause identification device based on a random forest according to a fifth embodiment of the application.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.

Referring to FIG. 1, a first embodiment of the present invention provides a KPI degradation root cause identification method based on a random forest.

Step S01: acquire the basic data needed to establish a KPI degradation root cause analysis model, wherein the basic data comprises historical data of influencing factors and actual data of the KPI to be analyzed.

Specifically, there may be dozens of influencing factors, such as total downtilt angle, azimuth angle, wireless utilization rate, average e-rab number, mechanical downtilt angle, latitude and longitude, and further types of influencing factors may be added as new data becomes available.

Step S02: select a set proportion of the basic data as a training set, train a number of decision trees on the training set, and construct the KPI degradation root cause analysis model from the decision trees.

The proportion of the training set can be set flexibly according to actual conditions.

For example, 40% of the basic data may be selected as the training set.

Step S03: take the rest of the basic data as a test set, and use the KPI degradation root cause analysis model to obtain the influence factor (importance score) of each influencing factor affecting each KPI.

That is, the data remaining after the training set is selected serves as the test set, and the KPI degradation root cause analysis model is used to obtain the influence factor of every influencing factor.

If 40% of the basic data is selected as the training set, the remaining 60% can be used as the test set.
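The split by a set proportion can be illustrated with a minimal sketch. The function name `split_basic_data`, the shuffling before the split and the fixed seed are assumptions made for the example; the text only requires that a set proportion (for example 40%) becomes the training set and the rest becomes the test set.

```python
import random

def split_basic_data(rows, train_ratio=0.4, seed=0):
    """Shuffle the rows and split them into (training set, test set)."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # assumed: rows are shuffled before splitting
    n_train = int(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]

# Ten stand-in rows: 40% go to the training set, the remaining 60% to the test set.
train_set, test_set = split_basic_data(range(10), train_ratio=0.4)
print(len(train_set), len(test_set))
```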

To make it easier to obtain the influence factors of the influencing factors of interest, the method preferably further comprises the following step:

Step S04: rank the influence factors obtained for the influencing factors of the KPI, and output the influencing factors together with their influence factors according to the ranking result.

The influence factors of the influencing factors are sorted from high to low. The ranking covers all the influencing factors affecting the KPI, and the main influencing factors are obtained from it.
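The ranking in step S04 amounts to a descending sort. In the sketch below the function name is hypothetical and the scores are made-up illustrative numbers, not results from the experiment; only the attribute names are taken from the text.

```python
def rank_influencing_factors(influence_factors):
    """Return (influencing factor, influence factor) pairs sorted from high to low."""
    return sorted(influence_factors.items(), key=lambda item: item[1], reverse=True)

# Made-up scores for illustration only.
scores = {"azimuth": 0.12, "wireless utilization": 0.55, "total downtilt": 0.33}
ranking = rank_influencing_factors(scores)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

The first entries of `ranking` are the main influencing factors.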

Thus, the embodiment of the invention discloses a KPI degradation root cause identification method based on a random forest. Historical data of the influencing factor indexes is collected, the impurity decrease at the decision tree nodes of the random forest is used as the influence factor of each influencing factor, and the influence factors are ranked, so that the degradation root cause of the KPI can be located quickly and accurately, greatly reducing the time spent on manual judgment and improving accuracy.

In order to better illustrate the present invention, a second embodiment is provided, and as shown in fig. 2, the model building process of the present invention is described in detail.

Step S201: randomly draw N training samples, with replacement, from a training set of size N to serve as the training set of a decision tree.

Sampling with replacement is one form of simple random sampling. The training samples in the population are numbered from 1 to N, and each number that is drawn is put back into the population before the next draw. Because the size of the population stays unchanged, each of the N numbers is equally likely to be drawn on any given draw.
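A minimal sketch of this sampling with replacement (commonly called a bootstrap sample) follows. The function name and the seed are assumptions for the example.

```python
import random

def bootstrap_sample(training_set, rng):
    """Draw len(training_set) samples with replacement."""
    n = len(training_set)
    # Every draw picks uniformly among all n samples, because nothing is removed.
    return [training_set[rng.randrange(n)] for _ in range(n)]

population = list(range(1, 11))        # samples numbered 1..N with N = 10
tree_training_set = bootstrap_sample(population, random.Random(42))
print(tree_training_set)
```

Because drawn samples are put back, some numbers typically appear more than once and others not at all, which is what makes the training sets of the individual trees differ.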

Step S202: where the feature dimension of each training sample is M, randomly select m (m ≤ M) features as a feature subset, and at each split of the tree select the optimal feature from the m features to split on, thereby obtaining a decision tree.

The optimal splitting feature is the one for which the data at each node after the split has the highest purity. That is, the samples contained in each branch node produced by splitting on the feature should belong to the same class as far as possible.

The feature dimension can be understood as the number of kinds of features of a training sample, with one dimension per feature.

Specifically, the column attributes of the training data may be sampled: from the M columns of attributes, m (m ≤ M) attributes are drawn without replacement.

Splitting the decision tree on the optimal feature means taking the optimal feature as the parent node and splitting completely on that feature according to a given rule; each node produced by the split is then taken as a parent node and split further, until no more splitting is possible.

Each node of the decision tree is therefore an influencing factor, and the influencing factor at a node separates the root causes of different KPI values. The optimal split is the one whose influencing factor separates the KPI root causes as completely as possible, which gives the resulting nodes higher purity.
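Choosing the optimal feature among a sampled subset can be sketched as follows. This is an illustration under stated assumptions rather than the patented procedure: impurity is taken to be the variance of the KPI values at a node, candidate thresholds are the observed feature values, and the names `variance` and `best_split` are hypothetical.

```python
def variance(values):
    """Population variance, used here as the node impurity (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(rows, targets, feature_ids):
    """Return (feature, threshold, impurity decrease) of the purest split, or None."""
    n = len(rows)
    parent = variance(targets)
    best = None
    for f in feature_ids:                               # only the sampled m features
        for threshold in sorted({row[f] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            if not left or not right:
                continue                                # not a real split
            weighted = (len(left) * variance(left) + len(right) * variance(right)) / n
            decrease = parent - weighted
            if best is None or decrease > best[2]:
                best = (f, threshold, decrease)
    return best

# Feature 0 separates the two KPI levels perfectly; feature 1 does not.
rows = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)]
targets = [10.0, 10.0, 20.0, 20.0]
print(best_split(rows, targets, feature_ids=[0, 1]))    # (0, 2.0, 25.0)
```

Splitting on feature 0 at 2.0 leaves both child nodes with zero variance, so the whole parent impurity of 25.0 is removed.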

For each influencing factor, the average decrease in decision tree impurity that it brings about is calculated, and this impurity decrease is taken as the influence factor of that influencing factor.

Step S203: repeat step S202 to obtain a random forest model made up of k decision trees, namely the KPI degradation root cause analysis model.

Step S204: average the results of all the decision trees to obtain the result of the random forest model, namely the analysis result of the KPI degradation root cause analysis model.

Step S205: fit the KPI degradation root cause analysis model using a variance or least squares criterion.
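The text does not spell out the impurity measure. A common choice for regression trees, assumed here, is the variance of the KPI values at a node. Under that assumption the variance criterion and a least squares fit coincide: the constant that minimises the squared error at a node is the node mean, and the resulting mean squared error is the variance. Writing $z_i$ for the KPI value of sample $i$ and $n_t$ for the number of samples at node $t$, which a split divides into children $t_L$ and $t_R$ (all symbols are introduced for this illustration):

```latex
\begin{align}
\mu_t &= \frac{1}{n_t} \sum_{i \in t} z_i, \\
I(t) &= \frac{1}{n_t} \sum_{i \in t} \left( z_i - \mu_t \right)^2, \\
\Delta I(t) &= I(t) - \frac{n_{t_L}}{n_t}\, I(t_L) - \frac{n_{t_R}}{n_t}\, I(t_R).
\end{align}
```

The split with the largest $\Delta I(t)$ leaves the child nodes purest, and summing $\Delta I$ over the nodes that split on a given influencing factor gives the impurity decrease attributed to that factor.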

To explain in detail how the influence factors of the influencing factors are determined from the model, a third embodiment of the present invention is given, as shown in FIG. 3.

Step S301: take the basic data remaining after the training set is selected as a test set.

Step S302: from the test set, acquire the historical data of the influencing factor indexes over a certain period and the actual data of the KPI indexes over a certain period.

Step S303: input the historical data and the actual data into the KPI degradation root cause analysis model determined in the model building process.

Step S304: take each node of the decision trees in the KPI degradation root cause analysis model as an influencing factor, and calculate the average decrease in decision tree impurity brought about by that influencing factor.

Step S305: determine the calculated impurity decrease as the influence factor of the influencing factor on the KPI.

In order to describe the implementation of the present invention in more systematic detail, a fourth embodiment is given below in conjunction with an example, as shown in fig. 4.

Step S401: acquire the basic data needed to establish a KPI degradation root cause analysis model, wherein the basic data comprises historical data of influencing factors and actual data of the KPI to be analyzed.

In the KPI degradation root cause identification method of the fourth embodiment, the experimental data is a subset (18510 rows in total) of the data of a certain region, and the e-rab establishment success rate is taken as the example KPI. Examination of the region's historical data shows that the data exhibits continuity, periodicity and correlation. The sample attributes are determined to be total downtilt, azimuth, wireless utilization, average e-rab number, mechanical downtilt, latitude, longitude, etc. (58 items in total). Sample data are shown in Table 1:

TABLE 1 sample data

Although the data volume in the experiment does not reach big data scale, the experimental data can be used for an algorithm correctness experiment, and can then be expanded to big data scale for an algorithm prediction rate experiment.

Step S402: randomly sample the rows of the training data. For a total sample size S, k training samples are drawn with replacement.

Step S403: sample the column attributes of the training data. From the M columns of attributes, m attributes are drawn without replacement, and the number m of attributes randomly selected at each node is determined from the number M of attributes in the sample data. Typically, m is 1/3 of M in a regression model. The information content of each of the m attributes is calculated, and the attribute with the largest information content is selected for branching. A random forest regression method is used here, so m is taken as M/3. Of course, the ratio of m to M may also be determined according to the actual situation, which will not be described further here.
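As an illustration of the column sampling in step S403, the sketch below draws m of the M attribute indices without replacement. The function name and the rounding of M/3 down to an integer are assumptions made for the example; the text only states that m is typically one third of M.

```python
import random

def sample_attribute_subset(num_attributes, subset_size, rng):
    """Draw subset_size distinct attribute (column) indices without replacement."""
    if not 1 <= subset_size <= num_attributes:
        raise ValueError("subset_size must satisfy 1 <= m <= M")
    return sorted(rng.sample(range(num_attributes), subset_size))

M = 58                        # number of sample attributes in the experiment
m = max(1, M // 3)            # assumed rounding of M/3; gives 19 here
subset = sample_attribute_subset(M, m, random.Random(7))
print(m, subset)
```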

Step S404: and establishing a decision tree. A number of decision trees are built using a fully split approach for the sampled data.

The decision trees are combined to form a random forest.

Each decision tree produces a result; when the random forest is used for regression prediction, the k trees give k predicted values $y_1, y_2, \ldots, y_k$.

Step S405: determine the result by averaging the predicted values of the decision trees. The final random forest output is the average $\bar{y}$ of the k decision tree predictions, which can be expressed as:

$$\bar{y} = \frac{1}{k} \sum_{i=1}^{k} y_i$$
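The averaging in step S405 is a plain arithmetic mean over the k tree predictions. A minimal sketch, with a hypothetical function name and illustrative numbers:

```python
def forest_prediction(tree_predictions):
    """Average the k decision tree predictions into the random forest output."""
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    return sum(tree_predictions) / len(tree_predictions)

# Three trees (k = 3) with illustrative predictions.
print(forest_prediction([0.5, 0.75, 1.0]))   # 0.75
```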

Step S406: substitute the data in the test set into the random forest. Each node in a decision tree is determined by impurity, using variance or least squares fitting.

Step S407: when training the decision trees, calculate how much each influencing factor reduces the impurity of the tree.

Step S408: take the amount by which each influencing factor reduces the impurity of the tree as the influence factor of that influencing factor on the KPI.

Step S409: sort the influence factors of the influencing factors from high to low.

Step S410: from the high-to-low ordering, obtain the ranking of the influencing factors of the KPI and the influence factors of the main influencing factors.
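Steps S402 to S410 can be tied together in a small, self-contained sketch. It is an illustration under stated assumptions, not the patented implementation: impurity is taken to be variance, each split's decrease is weighted by the share of samples reaching the node, the data is synthetic, and all function names are hypothetical. Only the attribute names come from the text.

```python
import random

def variance(values):
    """Population variance of a non-empty list, used as node impurity."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def grow_tree(rows, targets, m, rng, decrease_by_feature, n_total):
    """Grow a fully split regression tree. Every split adds its sample-weighted
    impurity decrease to decrease_by_feature[feature]. Leaves are floats."""
    leaf_value = sum(targets) / len(targets)
    if len(rows) < 2:
        return leaf_value
    parent, best = variance(targets), None
    for f in rng.sample(range(len(rows[0])), m):       # m attributes, no replacement
        for threshold in sorted({row[f] for row in rows})[:-1]:
            left = [i for i, row in enumerate(rows) if row[f] <= threshold]
            right = [i for i, row in enumerate(rows) if row[f] > threshold]
            children = (len(left) * variance([targets[i] for i in left])
                        + len(right) * variance([targets[i] for i in right]))
            decrease = parent - children / len(rows)
            if best is None or decrease > best[0]:
                best = (decrease, f, threshold, left, right)
    if best is None or best[0] <= 0.0:                 # no split improves purity
        return leaf_value
    decrease, f, threshold, left, right = best
    decrease_by_feature[f] += decrease * len(rows) / n_total

    def child(indices):
        return grow_tree([rows[i] for i in indices], [targets[i] for i in indices],
                         m, rng, decrease_by_feature, n_total)

    return {"feature": f, "threshold": threshold,
            "left": child(left), "right": child(right)}

def train_forest(rows, targets, k, m, seed=0):
    """Train k trees on bootstrap samples; return (trees, mean impurity decrease)."""
    rng = random.Random(seed)
    n = len(rows)
    totals = [0.0] * len(rows[0])
    trees = []
    for _ in range(k):
        picks = [rng.randrange(n) for _ in range(n)]   # N draws with replacement
        trees.append(grow_tree([rows[i] for i in picks], [targets[i] for i in picks],
                               m, rng, totals, n))
    return trees, [total / k for total in totals]

def predict_forest(trees, row):
    """Average the predictions of all trees for one row."""
    def predict_tree(node):
        while isinstance(node, dict):
            side = "left" if row[node["feature"]] <= node["threshold"] else "right"
            node = node[side]
        return node
    return sum(predict_tree(tree) for tree in trees) / len(trees)

# Synthetic stand-in data: the KPI depends strongly on column 0, weakly on
# column 1, and not at all on column 2.
names = ["wireless utilization", "average e-rab number", "azimuth"]
data_rng = random.Random(1)
rows = [tuple(data_rng.random() for _ in names) for _ in range(120)]
targets = [10.0 * row[0] + row[1] for row in rows]

trees, influence = train_forest(rows, targets, k=10, m=2, seed=0)
ranking = sorted(zip(names, influence), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

On this synthetic data the column that drives the KPI comes out on top of the ranking. A library implementation would normally be preferred in practice; the hand-written version is used here only to make the impurity bookkeeping visible.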

The invention also discloses a KPI degradation root cause identification device based on a random forest. A fifth embodiment of the invention, shown in FIG. 5, illustrates the structural features of the device.

The device comprises:

The data acquisition module 1 is used for acquiring relevant basic data for establishing a KPI degradation root cause analysis model, wherein the basic data comprises historical data of influence factors and KPI actual data to be analyzed;

The model building module 2 is used for selecting relevant basic data acquired by the data acquisition module with set proportion as a training set, training a certain number of decision trees according to the training set, and building the KPI degradation root cause analysis model by the decision trees.

Each node of a decision tree corresponds to an influencing factor; the average decrease in decision tree impurity brought about by each influencing factor is calculated, and this impurity decrease is taken as the influence factor of that influencing factor.

Specifically, the model building module further includes:

The training set selecting unit 21 is used for selecting a set proportion of the basic data as a training set and, where the size of the training set is N, randomly drawing N training samples from it with replacement to serve as the training set of a decision tree.

The decision tree acquisition unit 22 is used for, where the feature dimension of each sample is M, specifying a constant m < M, randomly selecting a subset of m features from the M features, and selecting the optimal feature from the m features to split on, so as to obtain a decision tree.

The model building unit 23 is used for repeating the method by which the decision tree acquisition unit obtains a decision tree k times, building k decision trees to obtain the KPI degradation root cause analysis model.

The calculating unit 24 is used for averaging the results of the k decision trees built by the model building unit and taking the average as the result of the KPI degradation root cause analysis model.

The influence factor determining module 3 is used for taking the basic data remaining after the model building module selects the training set as a test set, and using the KPI degradation root cause analysis model built by the model building module to obtain the influence factor of each influencing factor affecting each KPI.

The influence factor determination module 3 further comprises:

The test set obtaining unit 31 is configured to use the basic data after the training set selection unit selects the training set as a test set.

A test data acquisition unit 32, configured to acquire, in the test set acquisition unit, historical data of influencing factors in a certain time and actual data of KPI indicators in a certain time.

The influence factor calculating unit 33 is configured to input the historical data and the actual data acquired by the test data acquisition unit into the KPI degradation root cause analysis model determined by the model building module, take each node of the decision trees in the model as an influencing factor, and calculate the average decrease in decision tree impurity brought about by that influencing factor.

The influence factor determining unit 34 is configured to determine the impurity decrease calculated by the influence factor calculating unit as the influence factor of the influencing factor on the KPI.

Preferably, to make it easier to select the main influencing factors from among the influencing factors, the device may further comprise:

The sorting module 4, which sorts the determined influence factors according to a set rule.

The main influencing factor determining module 5, which is used for determining the main influencing factors affecting the KPI according to the sorting result of the sorting module.

It will be clear to those skilled in the art that, for convenience and brevity of description, the corresponding process in the above-described apparatus embodiment may refer to the specific working process of the foregoing method, which is not described herein again.

In this specification, the embodiments are described progressively: each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be understood by cross-reference. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be capable of operation in sequences other than those illustrated herein.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. The KPI degradation root cause identification method based on random forest is characterized by comprising the following steps:

Selecting basic data with set proportion as a training set, training a certain number of decision trees according to the training set, and constructing the KPI degradation root cause analysis model by the decision trees, wherein the method specifically comprises the following steps:

Obtaining k decision trees to generate a random forest model, namely a KPI degradation root cause analysis model;

Averaging the obtained results of all the decision trees to obtain the result of a random forest model, namely the analysis result of the KPI degradation root cause analysis model;

Taking the rest of the basic data as a test set, and obtaining influence factors influencing each influence factor of each KPI by using the KPI degradation root cause analysis model;

The influencing factors comprise a retainability class index, an access class index, a mobility class index, a resource class index and a system capacity class index, and the influencing factors serve as the sample features;

The influencing factors further comprise: total downtilt, azimuth, wireless utilization, average e-rab number, mechanical downtilt, latitude and longitude.

2. The method according to claim 1, wherein the method for training a number of decision trees according to a training set is specifically:

3. The method according to any one of claims 1-2, characterized in that:

4. A method according to claim 3, characterized in that:

and the KPI degradation root cause analysis model is fitted by adopting a variance or least square method.

5. A method according to claim 3, characterized in that:

ranking the influence factors obtained for the influencing factors of the KPI, and outputting the influencing factors together with their influence factors according to the ranking result.

6. A KPI degradation root cause identification apparatus based on a random forest, the apparatus comprising:

The data acquisition module is used for acquiring relevant basic data for establishing a KPI degradation root cause analysis model, wherein the basic data comprises historical data of influence factors and KPI actual data to be analyzed; wherein,

the influencing factors further comprise: total downtilt, azimuth, wireless utilization, average e-rab number, mechanical downtilt, latitude and longitude;

The model building module is used for selecting relevant basic data acquired by the data acquisition module with set proportion as a training set, training a certain number of decision trees according to the training set, and building the KPI degradation root cause analysis model by the decision trees; the model building module comprises: the model building unit is used for repeatedly building k decision trees k times according to the method for obtaining the decision tree by the decision tree obtaining unit to obtain a KPI degradation root cause analysis model; the calculating unit is used for calculating the average value of k decision tree results by adopting an average method according to the k decision tree results established by the decision tree establishing unit and taking the average value as the result of the KPI degradation root cause analysis model;

And the influence factor determining module is used for taking the basic data remaining after the model building module selects the training set as a test set, and using the KPI degradation root cause analysis model built by the model building module to obtain the influence factors of the influencing factors affecting each KPI.

7. The apparatus of claim 6, wherein the model building module further comprises:

And the decision tree acquisition unit is used for, where the feature dimension of each sample is M, specifying a constant m < M, randomly selecting a subset of m features from the M features, and selecting the optimal feature from the m features to split on, so as to acquire a decision tree.

8. The apparatus according to claim 7, wherein:

9. The apparatus of claim 8, wherein the influence factor determination module further comprises:

10. The apparatus according to any one of claims 6-9, wherein the apparatus further comprises:

The ordering module orders the determined influence factors according to a set rule;