CN105824811A

CN105824811A - Big data analysis method and device

Info

Publication number: CN105824811A
Application number: CN201510001942.6A
Authority: CN
Inventors: 黄庆荣; 谢志崇; 魏建荣; 彭家华; 郑志欢; 林恪; 陈钰铖
Original assignee: China Mobile Group Fujian Co Ltd
Current assignee: China Mobile Group Fujian Co Ltd
Priority date: 2015-01-04
Filing date: 2015-01-04
Publication date: 2016-08-03
Anticipated expiration: 2035-01-04
Also published as: CN105824811B

Abstract

An embodiment of the present invention discloses a big data analysis method, comprising: obtaining, based on an input first set of data and an input second set of data, at least two pieces of feature information satisfying a preset condition, where the first set of data and the second set of data are both data in a first communication network, the first set of data satisfies a first preset rule, and the second set of data satisfies a second preset rule; analyzing the first set of data and the second set of data according to the at least two pieces of feature information to determine a first type of rule and a second type of rule; and determining, according to the first type of rule and the second type of rule, target data satisfying the first preset rule in an input third set of data, where the third set of data is data in communication networks other than the first communication network. An embodiment of the invention also discloses a big data analysis device.

Description

A big data analysis method and device thereof

Technical field

The invention relates to communication technology, and in particular to a big data analysis method and a device thereof.

Background

With the commercialization of fourth-generation (4G) mobile communication technology, competition among the major operators has become increasingly fierce. Winning back high-value users from other networks and increasing the penetration of 4G terminals both play an important role in a mobile operator's development, so identifying high-value users on other networks is critical.

The industry already has methods that analyze and model user behavior to determine user attributes. However, existing methods generally focus on counting the number of other-network users rather than on identifying those users or the terminal types they use.

Summary of the invention

To solve the existing technical problems, embodiments of the present invention provide a big data analysis method and a device thereof that can, according to data rules derived from the operator's own network, determine target data satisfying a preset rule in data from other networks.

The technical solution of the embodiments of the present invention is implemented as follows. An embodiment of the present invention provides a big data analysis method, the method comprising:

obtaining, based on an input first set of data and an input second set of data, at least two pieces of feature information satisfying a preset condition; the first set of data and the second set of data are both data in a first communication network; the first set of data satisfies a first preset rule; the second set of data satisfies a second preset rule;

analyzing the first set of data and the second set of data according to the at least two pieces of feature information to determine a first type of rule and a second type of rule;

determining, according to the first type of rule and the second type of rule, target data satisfying the first preset rule in an input third set of data; the third set of data is data in communication networks other than the first communication network.

In the above solution, analyzing the first set of data and the second set of data according to the at least two pieces of feature information to determine the first type of rule and the second type of rule comprises:

using a logistic regression algorithm to analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine the first type of rule;

using a decision tree algorithm to analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine the second type of rule.

In the above solution, using the decision tree algorithm to analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine the second type of rule comprises:

using the decision tree algorithm to analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine N rules, where N is a positive integer greater than or equal to 2;

determining, among the N rules, the second type of rule, which satisfies a third preset rule.

In the above solution, determining, according to the first type of rule and the second type of rule, the target data satisfying the first preset rule in the input third set of data comprises:

analyzing the input third set of data according to the first type of rule and the second type of rule respectively to obtain first suspected target data and second suspected target data;

determining the target data satisfying the first preset rule based on the first suspected target data and the second suspected target data.

In the above solution, the second type of rule includes a first type of sub-rule; the first type of sub-rule satisfies the first preset rule;

correspondingly, analyzing the input third set of data according to the first type of rule and the second type of rule respectively to obtain the first suspected target data and the second suspected target data comprises:

analyzing the input third set of data according to the first type of rule to obtain the first suspected target data;

analyzing the input third set of data according to the first type of sub-rule to obtain the second suspected target data.

In the above solution, the second type of rule further includes a second type of sub-rule; the second type of sub-rule satisfies the second preset rule; the method further comprises:

analyzing the first suspected target data and the second suspected target data according to the second type of sub-rule to obtain suspected non-target data;

correspondingly, determining the target data based on the first suspected target data and the second suspected target data comprises:

determining the target data based on the first suspected target data, the second suspected target data and the suspected non-target data.

An embodiment of the present invention also provides a big data analysis device, the device comprising:

an acquisition unit, configured to obtain, based on an input first set of data and an input second set of data, at least two pieces of feature information satisfying a preset condition; the first set of data and the second set of data are both data in a first communication network; the first set of data satisfies a first preset rule; the second set of data satisfies a second preset rule;

an analysis unit, configured to analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine a first type of rule and a second type of rule;

a determination unit, configured to determine, according to the first type of rule and the second type of rule, target data satisfying the first preset rule in an input third set of data; the third set of data is data in communication networks other than the first communication network.

In the above solution, the analysis unit comprises:

a first analysis subunit, configured to use a logistic regression algorithm to analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine the first type of rule;

a second analysis subunit, configured to use a decision tree algorithm to analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine the second type of rule.

In the above solution, the second analysis subunit is further configured to use the decision tree algorithm to analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine N rules, where N is a positive integer greater than or equal to 2;

and is further configured to determine, among the N rules, the second type of rule, which satisfies a third preset rule.

In the above solution, the determination unit comprises:

a first determination subunit, configured to analyze the input third set of data according to the first type of rule and the second type of rule respectively to obtain first suspected target data and second suspected target data;

a second determination subunit, configured to determine the target data satisfying the first preset rule based on the first suspected target data and the second suspected target data.

In the above solution, the second type of rule includes a first type of sub-rule; the first type of sub-rule satisfies the first preset rule; correspondingly,

the first determination subunit is further configured to analyze the input third set of data according to the first type of rule to obtain the first suspected target data;

and is further configured to analyze the input third set of data according to the first type of sub-rule to obtain the second suspected target data.

In the above solution, the second type of rule further includes a second type of sub-rule; the second type of sub-rule satisfies the second preset rule;

the first determination subunit is further configured to analyze the first suspected target data and the second suspected target data according to the second type of sub-rule to obtain suspected non-target data;

correspondingly, the second determination subunit is further configured to determine the target data based on the first suspected target data, the second suspected target data and the suspected non-target data.

The big data analysis method and device provided by the embodiments of the present invention determine at least two pieces of feature information from the first set of data and the second set of data of the first communication network, and apply two different algorithms to that feature information to determine a first type of rule and a second type of rule, one for each algorithm. The first type of rule and the second type of rule are then used to analyze a third set of data from communication networks other than the first communication network, so as to determine target data satisfying a preset rule in the third set of data. The embodiments of the present invention can therefore determine, according to data rules derived from the operator's own network, target data satisfying a preset rule in data from other networks.

Description of drawings

Fig. 1 is a schematic flowchart of the big data analysis method according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of the big data analysis device according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of the analysis unit according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of the determination unit according to an embodiment of the present invention;

Fig. 5 is a schematic flowchart of a specific implementation of the big data analysis method according to an embodiment of the present invention.

Detailed description

To allow the features and technical content of the present invention to be understood in more detail, its implementation is described below with reference to the accompanying drawings. The drawings are for reference and illustration only and are not intended to limit the present invention.

Embodiment one

Fig. 1 is a schematic flowchart of the big data analysis method according to an embodiment of the present invention. As shown in Fig. 1, the method includes:

Step 101: Based on an input first set of data and an input second set of data, obtain at least two pieces of feature information satisfying a preset condition; the first set of data and the second set of data are both data in a first communication network; the first set of data satisfies a first preset rule; the second set of data satisfies a second preset rule;

In this embodiment, the first preset rule may be that the communication device type of the user corresponding to the data in the first communication network belongs to a first type, and the second preset rule may be that this communication device type does not belong to the first type. Thus, within the first communication network, every item in the first set of data corresponds to a communication device of the first type, and every item in the second set of data corresponds to a communication device that is not of the first type. Because data corresponding to different communication device types follows different feature rules, analyzing the feature rules of the first set of data and of the second set of data makes it possible to determine M pieces of feature information satisfying the preset condition, where M is a positive integer greater than or equal to 2. Analyzing data based on these M pieces of feature information allows characteristics such as the corresponding communication device type to be estimated. On this basis, the embodiment can use feature information from the first communication network to pick out, from a large amount of other-network data, the data whose communication device type belongs to the first type, which lays the foundation for the big data analysis.

In this embodiment, the feature information is specifically a set of key variable indicators that meet the preset condition. Different algorithms use these key variable indicators to analyze the big data in the first communication network, that is, the first set of data and the second set of data, which lays the foundation for deriving rules from that data.

In this embodiment, the preset condition includes, but is not limited to: a condition that the number of users is greater than or equal to a first user count, a condition that the communication device type of the communication counterpart is the first type, and the like.

Step 102: Analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine a first type of rule and a second type of rule;

In this embodiment, different algorithms are applied to the at least two pieces of feature information determined in the first communication network in order to analyze the first set of data and the second set of data, and thereby determine the first type of rule and the second type of rule based on the first communication network.

In practice, big data analysis usually applies more than one algorithm in order to improve the accuracy of the result. This embodiment therefore also uses two different algorithms to analyze the input first set of data and second set of data.

In this embodiment, the number of rules that the decision tree algorithm produces, N, varies with the number of pieces of feature information determined in step 101. The value of N is therefore constrained by the number of pieces of feature information.

In this embodiment, "the second type of rule" is a collective term for all of the N rules that satisfy the third preset rule; it does not refer to one specific rule.

Step 103: According to the first type of rule and the second type of rule, determine target data satisfying the first preset rule in an input third set of data; the third set of data is data in communication networks other than the first communication network.

In this embodiment, the first type of rule and the second type of rule determined in the first communication network are used to determine, within a large amount of data from communication networks other than the first communication network, the target data satisfying the first preset rule, that is, the data whose user's communication device type belongs to the first type. This achieves the goal of determining target data that satisfies a preset rule in other-network data based on data rules from the operator's own network.

In this embodiment, the first suspected target data is the data corresponding to the first type of rule, that is, the suspected target data satisfying the first preset rule that the first type of rule identifies in communication networks other than the first communication network. The second suspected target data is the data corresponding to the second type of rule, that is, the suspected target data satisfying the first preset rule that the second type of rule identifies in communication networks other than the first communication network.

In this embodiment, because the second type of rule is determined by a decision tree algorithm, it can identify both second suspected target data satisfying the first preset rule and suspected non-target data satisfying the second preset rule. That is, the second type of rule includes a first type of sub-rule and a second type of sub-rule: the first type of sub-rule identifies the second suspected target data satisfying the first preset rule, and the second type of sub-rule identifies the suspected non-target data satisfying the second preset rule. This embodiment therefore also removes the suspected non-target data from the first suspected target data and the second suspected target data to determine the final target data.

In this embodiment, the first type of sub-rule is a rule that satisfies the first preset rule, and the second type of sub-rule is a rule that does not satisfy the first preset rule, that is, a rule that satisfies the second preset rule. Because the second type of sub-rule does not satisfy the first preset rule, the suspected non-target data it identifies is a kind of interference data, and may also be called interference data.
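
The final selection in step 103 can be pictured as set arithmetic over user identifiers. The sketch below is only an illustration: the text says the suspected non-target data is removed from the first and second suspected target data, but it does not say whether the two candidate sets are merged by union or by intersection, so the union used here is an assumption, as are the function name and the sample identifiers.

```python
def final_targets(first_suspected, second_suspected, suspected_non_target):
    """Merge the two candidate sets (union is an assumption), then drop interference data."""
    candidates = set(first_suspected) | set(second_suspected)
    return candidates - set(suspected_non_target)


# Hypothetical other-network user identifiers.
logistic_hits = {"u1", "u2", "u3"}   # first suspected target data
tree_hits = {"u3", "u4"}             # second suspected target data
interference = {"u2"}                # suspected non-target data

targets = final_targets(logistic_hits, tree_hits, interference)
print(sorted(targets))
```

Swapping `|` for `&` would give the stricter variant in which a user must be flagged by both algorithms before being kept.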

To implement the above method, an embodiment of the present invention also provides a big data analysis device. As shown in Fig. 2, the device includes:

an acquisition unit 21, configured to obtain, based on an input first set of data and an input second set of data, at least two pieces of feature information satisfying a preset condition; the first set of data and the second set of data are both data in a first communication network; the first set of data satisfies a first preset rule; the second set of data satisfies a second preset rule;

an analysis unit 22, configured to analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine a first type of rule and a second type of rule;

a determination unit 23, configured to determine, according to the first type of rule and the second type of rule, target data satisfying the first preset rule in an input third set of data; the third set of data is data in communication networks other than the first communication network.

In the above solution, as shown in Fig. 3, the analysis unit 22 includes:

a first analysis subunit 221, configured to use a logistic regression algorithm to analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine the first type of rule;

a second analysis subunit 222, configured to use a decision tree algorithm to analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine the second type of rule.

In the above solution, the second analysis subunit 222 is further configured to use the decision tree algorithm to analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine N rules, where N is a positive integer greater than or equal to 2.

In the above solution, as shown in Fig. 4, the determination unit 23 includes:

a first determination subunit 231, configured to analyze the input third set of data according to the first type of rule and the second type of rule respectively to obtain first suspected target data and second suspected target data;

a second determination subunit 232, configured to determine the target data satisfying the first preset rule based on the first suspected target data and the second suspected target data.

The first determination subunit 231 is further configured to analyze the input third set of data according to the first type of rule to obtain the first suspected target data;

the first determination subunit 231 is further configured to analyze the first suspected target data and the second suspected target data according to the second type of sub-rule to obtain suspected non-target data;

correspondingly, the second determination subunit 232 is further configured to determine the target data based on the first suspected target data, the second suspected target data and the suspected non-target data.

The acquisition unit 21, the analysis unit 22 and the determination unit 23 can all run on a computer, and can be implemented by a central processing unit (CPU), a microprocessor (MPU), a digital signal processor (DSP) or a field-programmable gate array (FPGA) in that computer.

Embodiment two

First software, for example the IMESSAGE software, is messaging software built into first-type terminals for sending short messages between users. It allows short messages to be sent directly over GPRS, which saves users of first-type terminals the SMS charges. Users of first-type terminals who use the first software may therefore send far fewer SMS messages, producing an "SMS black hole" phenomenon. This embodiment relies on that phenomenon to identify, in other networks, the users whose terminal type is the first type.

This embodiment mainly uses communication data from the operator's existing business analysis system to analyze the communication behavior of own-network users of first-type terminals who use the first software, together with the characteristics of the people in their communication circles. It then identifies the other-network data, that is, the other-network users, that show the same communication behavior and whose communication circles have the same characteristics, and finally determines which other-network users have a first-type terminal. This supports the operator's efforts to win back high-value customers from other networks and informs its marketing strategy.

Specifically, this embodiment is based on a user communication circle model. It analyzes habitual characteristics, such as the voice communication circle and the SMS communication circle, of own-network customers who use the first software on first-type terminals. From the large population of other-network users it then identifies the group of first-type terminal users and estimates the probability that a given other-network user is a first-type terminal user, providing the operator with data of reference value.

Fig. 5 is a schematic flowchart of a specific implementation of the big data analysis method according to an embodiment of the present invention. Before the big data analysis is performed, the first set of data and the second set of data need to be determined. Specifically, a first set of data having a first data amount and a second set of data having the same first data amount are determined in the first communication network, where the user equipment type corresponding to each item in the first set of data is the first type, and the user equipment type corresponding to the second set of data is not the first type. As shown in Fig. 5, the method includes:

Step 501: From the first set of data and the second set of data, select M pieces of feature information by combining, for the users corresponding to each set, the feature rules of their communication circles, the feature rules of voice and SMS traffic within those circles, the feature rules of whether their contacts use first-type terminals, and so on; M is a positive integer greater than or equal to 2;

Here, the feature information is also referred to as key variable indicators.
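
To make step 501 more concrete, the following sketch derives three per-user indicators from a list of communication records. The record layout, the indicator names (`circle_size`, `sms_share`, `first_type_contact_share`) and the sample data are all hypothetical; the patent does not enumerate the actual M key variable indicators.

```python
from collections import defaultdict


def circle_features(records, first_type_users):
    """Build per-user indicators from (user, contact, kind) records.

    kind is "voice" or "sms"; first_type_users is the set of users known
    to hold a first-type terminal. All names here are illustrative.
    """
    contacts = defaultdict(set)
    sms_count = defaultdict(int)
    total_count = defaultdict(int)
    for user, contact, kind in records:
        contacts[user].add(contact)
        total_count[user] += 1
        if kind == "sms":
            sms_count[user] += 1

    features = {}
    for user, circle in contacts.items():
        features[user] = {
            # Number of distinct contacts in the communication circle.
            "circle_size": len(circle),
            # Share of the user's traffic that is SMS (low for "SMS black hole" users).
            "sms_share": sms_count[user] / total_count[user],
            # Share of contacts who use a first-type terminal.
            "first_type_contact_share": len(circle & first_type_users) / len(circle),
        }
    return features


records = [
    ("a", "b", "voice"),
    ("a", "c", "sms"),
    ("a", "b", "voice"),
    ("a", "d", "voice"),
]
feats = circle_features(records, first_type_users={"b", "c"})
print(feats["a"])
```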

Step 502: Use a logistic regression algorithm to analyze the first set of data and the second set of data according to the M pieces of feature information, and fit a first type of rule that satisfies the first preset rule;

Here, the first type of rule may be a logistic regression formula; the first preset rule is that the user terminal type is the first type.

In this embodiment, analyzing the first set of data and the second set of data and fitting the first type of rule that satisfies the first preset rule includes:

based on the M pieces of feature information, using the logistic regression algorithm to analyze the first set of data and the second set of data, and fitting the first type of rule that satisfies the first preset rule.
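
As an illustration of step 502, the sketch below fits a logistic regression formula by plain batch gradient descent, labelling the first set of data 1 and the second set 0. The two features, the training values, the learning rate and the epoch count are all made up for the example; the patent does not state how its formula is fitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Probability that a user with features x holds a first-type terminal."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical features: [sms_share, first_type_contact_share].
X = [[0.10, 0.8], [0.20, 0.7], [0.15, 0.9],   # first set of data (label 1)
     [0.60, 0.2], [0.70, 0.1], [0.50, 0.3]]   # second set of data (label 0)
y = [1, 1, 1, 0, 0, 0]

w, b = fit_logistic(X, y)
print(predict_proba(w, b, [0.10, 0.85]), predict_proba(w, b, [0.65, 0.15]))
```

In this toy data the fitted weight on `sms_share` comes out negative, which matches the "SMS black hole" intuition that first-type terminal users send fewer SMS messages.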

Step 503: Determine a third set of data, and calculate, according to the first type of rule, a probability for each item of data in the third set of data, so as to determine first suspected target data; the third set of data is the data corresponding to users in other communication networks who communicate with users in the first communication network.

Here, calculating the probability of each item of data in the third set of data according to the first type of rule, so as to determine the first suspected target data, further includes:

Calculating the probability of each item of data in the third set of data according to the first type of rule;

According to the data service requirements and the preset number of users corresponding to the logistic regression level of the logistic regression algorithm, determining, among the probabilities corresponding to the items of data in the third set of data, the data whose probability is greater than or equal to a preset threshold, and taking that data as the first suspected target data.
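The thresholding in Step 503 can be sketched as follows. The text does not specify how the preset threshold is derived from the preset number of users, so `threshold_for_user_count` is only one possible reading (take the probability of the k-th highest-scoring user); both function names and the sample probabilities are hypothetical.

```python
def select_suspected(probabilities, threshold):
    """Return the users whose probability is greater than or equal to the threshold."""
    return {user for user, p in probabilities.items() if p >= threshold}

def threshold_for_user_count(probabilities, target_count):
    # Assumption, not stated in the source: choose the threshold so that
    # roughly target_count users pass (ties may let a few more through).
    ranked = sorted(probabilities.values(), reverse=True)
    return ranked[target_count - 1]

# Hypothetical probabilities produced by the first type of rule
# for users in the third set of data.
probabilities = {"u1": 0.91, "u2": 0.40, "u3": 0.75, "u4": 0.62}

threshold = threshold_for_user_count(probabilities, target_count=2)
first_suspected = select_suspected(probabilities, threshold)
```

The comparison uses `>=` to match the "greater than or equal to the preset threshold" wording of the step.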

Step 504: Using the C5 decision tree algorithm, analyze the first set of data and the second set of data according to the M pieces of feature information, and determine m1 rules A and m2 rules B.

Step 505: Filter rules A and rules B according to the number of users and the confidence corresponding to each rule, so as to determine first-type sub-rules among rules A and second-type sub-rules among rules B.

Here, the first-type sub-rules satisfy the first preset rule; the second-type sub-rules satisfy the second preset rule; m1 and m2 are positive integers greater than or equal to 1.

Specifically, when the first set of data and the second set of data each contain 100,000 users, the rules with a confidence greater than 85% and more than 20,000 users are selected from rules A and determined as first-type sub-rules; the rules with a confidence greater than 90% and more than 18,000 users are selected from rules B and determined as second-type sub-rules.

In this embodiment, both the first-type sub-rules and the second-type sub-rules belong to the second type of rule.
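The filtering in Step 505 can be sketched with the example thresholds given above. The `Rule` record, its field names, and the sample rules are hypothetical; in practice the confidence and user count of each rule would come from the decision tree output.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    name: str
    confidence: float  # fraction of covered users the rule classifies correctly
    users: int         # number of users covered by the rule

def filter_rules(rules, min_confidence, min_users):
    """Keep rules strictly above both thresholds, as in the example of Step 505."""
    return [r for r in rules if r.confidence > min_confidence and r.users > min_users]

# Hypothetical decision tree output for two sets of 100,000 users each.
rules_a = [Rule("A1", 0.88, 25_000), Rule("A2", 0.92, 15_000), Rule("A3", 0.80, 40_000)]
rules_b = [Rule("B1", 0.93, 19_000), Rule("B2", 0.91, 18_000)]

first_type_sub_rules = filter_rules(rules_a, min_confidence=0.85, min_users=20_000)
second_type_sub_rules = filter_rules(rules_b, min_confidence=0.90, min_users=18_000)
```

Here only `A1` and `B1` survive: `A2` covers too few users, `A3` has too low a confidence, and `B2` covers exactly 18,000 users, which is not greater than the threshold.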

Step 506: Analyze the third set of data according to the first-type sub-rules, and determine second suspected target data.

Step 507: Determine the intersection of the first suspected target data and the second suspected target data as third suspected target data.

Step 508: Remove from the third suspected target data the data that conforms to the second-type sub-rules, and take the remaining third suspected target data as the target data.
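Steps 507 and 508 amount to a set intersection followed by an exclusion. The sketch below assumes each item of data is identified by a user ID and that a second-type sub-rule can be evaluated as a predicate on a user record; the function `determine_target`, the record field `first_type_contact_share`, and the sample values are all hypothetical.

```python
def determine_target(first_suspected, second_suspected, matches_second_type_sub_rule):
    """Intersect the two suspected sets (Step 507), then drop suspected non-targets (Step 508)."""
    third_suspected = first_suspected & second_suspected
    return {user for user in third_suspected if not matches_second_type_sub_rule(user)}

# Hypothetical per-user records for the third set of data.
records = {
    "u1": {"first_type_contact_share": 0.7},
    "u2": {"first_type_contact_share": 0.6},
    "u3": {"first_type_contact_share": 0.1},
    "u4": {"first_type_contact_share": 0.5},
    "u5": {"first_type_contact_share": 0.9},
}

first_suspected = {"u1", "u2", "u3", "u4"}    # from the logistic regression rule
second_suspected = {"u2", "u3", "u4", "u5"}   # from the first-type sub-rules

def matches_second_type_sub_rule(user):
    # Hypothetical second-type sub-rule: very few contacts use a first-type terminal.
    return records[user]["first_type_contact_share"] < 0.2

target = determine_target(first_suspected, second_suspected, matches_second_type_sub_rule)
```

With these values the intersection is `u2`, `u3`, and `u4`, and `u3` is removed as suspected non-target data, leaving `u2` and `u4`.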

According to the embodiment of the present invention, key variable indicators, that is, feature information, can be determined from the first set of data and the second set of data in the first communication network. The first set of data and the second set of data are then analyzed with a logistic regression algorithm and a decision tree algorithm respectively, to determine a first type of rule corresponding to the logistic regression algorithm and a second type of rule corresponding to the decision tree algorithm, where the second type of rule includes first-type sub-rules and second-type sub-rules. Subsequently, the third set of data in the other network is analyzed according to the first type of rule and the first-type sub-rules respectively, to determine the first suspected target data and the second suspected target data. Because the first type of rule and the first-type sub-rules both satisfy the first preset rule, while the second-type sub-rules satisfy the second preset rule, once the third suspected target data has been determined as the intersection of the first suspected target data and the second suspected target data, the data satisfying the second-type sub-rules is removed from it; that is, suspected non-target data is removed from the third suspected target data to finally obtain the target data. The target data is thus the data in the other network that is determined, according to the data rules of the local network, to satisfy the first preset rule.

Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, optical storage, and the like) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing equipment produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means, and the instruction means implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The above is only an implementation of the embodiments of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the embodiments of the present invention, and these improvements and refinements shall also be regarded as falling within the scope of protection of the embodiments of the present invention.

Claims

1. A big data analysis method, characterized in that the method comprises:

Acquiring, based on an input first set of data and an input second set of data, at least two pieces of feature information satisfying preset conditions; the first set of data and the second set of data are both data in a first communication network; the first set of data satisfies a first preset rule; the second set of data satisfies a second preset rule;

Analyzing the first set of data and the second set of data according to the at least two pieces of feature information to determine a first type of rule and a second type of rule;

Determining, according to the first type of rule and the second type of rule, target data satisfying the first preset rule in an input third set of data; the third set of data is data in a communication network other than the first communication network.

2. The method according to claim 1, characterized in that analyzing the first set of data and the second set of data according to the at least two pieces of feature information to determine the first type of rule and the second type of rule comprises:

Using a logistic regression algorithm to analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine the first type of rule;

Using a decision tree algorithm to analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine the second type of rule.

3. The method according to claim 2, wherein using the decision tree algorithm to analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine the second type of rule comprises:

Using the decision tree algorithm to analyze the first set of data and the second set of data according to the at least two pieces of feature information, and determining N rules; N is a positive integer greater than or equal to 2;

Determining, among the N rules, the second type of rule that satisfies a third preset rule.

4. The method according to any one of claims 1 or 3, characterized in that determining, according to the first type of rule and the second type of rule, the target data satisfying the first preset rule in the input third set of data comprises:

Analyzing the input third set of data according to the first type of rule and the second type of rule respectively, to obtain first suspected target data and second suspected target data;

Determining the target data satisfying the first preset rule based on the first suspected target data and the second suspected target data.

5. The method according to claim 4, wherein the second type of rule comprises a first-type sub-rule; the first-type sub-rule satisfies the first preset rule;

Correspondingly, analyzing the input third set of data according to the first type of rule and the second type of rule respectively, to obtain the first suspected target data and the second suspected target data, comprises:

Analyzing the input third set of data according to the first type of rule to obtain the first suspected target data;

Analyzing the input third set of data according to the first-type sub-rule to obtain the second suspected target data.

6. The method according to claim 5, wherein the second type of rule further comprises a second-type sub-rule; the second-type sub-rule satisfies the second preset rule; the method further comprises:

Analyzing the first suspected target data and the second suspected target data according to the second-type sub-rule to obtain suspected non-target data;

Correspondingly, determining the target data based on the first suspected target data and the second suspected target data comprises:

Determining the target data based on the first suspected target data, the second suspected target data, and the suspected non-target data.

7. A big data analysis device, characterized in that the device comprises:

An acquisition unit, configured to acquire, based on an input first set of data and an input second set of data, at least two pieces of feature information satisfying preset conditions; the first set of data and the second set of data are both data in a first communication network; the first set of data satisfies a first preset rule; the second set of data satisfies a second preset rule;

An analysis unit, configured to analyze the first set of data and the second set of data according to the at least two pieces of feature information, and determine a first type of rule and a second type of rule;

A determining unit, configured to determine, in an input third set of data, target data satisfying the first preset rule according to the first type of rule and the second type of rule; the third set of data is data in a communication network other than the first communication network.

8. The device according to claim 7, wherein the analysis unit comprises:

A first analysis subunit, configured to use a logistic regression algorithm to analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine the first type of rule;

A second analysis subunit, configured to use a decision tree algorithm to analyze the first set of data and the second set of data according to the at least two pieces of feature information to determine the second type of rule.

9. The device according to claim 8, wherein the second analysis subunit is further configured to use the decision tree algorithm to analyze the first set of data and the second set of data according to the at least two pieces of feature information, and determine N rules; N is a positive integer greater than or equal to 2;

And is further configured to determine, among the N rules, the second type of rule that satisfies a third preset rule.

10. The device according to any one of claims 7 to 9, wherein the determining unit comprises:

A first determining subunit, configured to analyze the input third set of data according to the first type of rule and the second type of rule respectively, to obtain first suspected target data and second suspected target data;

A second determining subunit, configured to determine the target data satisfying the first preset rule based on the first suspected target data and the second suspected target data.

11. The device according to claim 10, wherein the second type of rule comprises a first-type sub-rule; the first-type sub-rule satisfies the first preset rule; correspondingly,

The first determining subunit is further configured to analyze the input third set of data according to the first type of rule to obtain the first suspected target data;

And is further configured to analyze the input third set of data according to the first-type sub-rule to obtain the second suspected target data.

12. The device according to claim 11, wherein the second type of rule further comprises a second-type sub-rule; the second-type sub-rule satisfies the second preset rule;

The first determining subunit is further configured to analyze the first suspected target data and the second suspected target data according to the second-type sub-rule to obtain suspected non-target data;

Correspondingly, the second determining subunit is further configured to determine the target data based on the first suspected target data, the second suspected target data, and the suspected non-target data.