CN111090708B - Method and system for generating user characteristics based on data warehouse - Google Patents
Method and system for generating user characteristics based on data warehouse
- Publication number: CN111090708B (application CN201910962259.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- subset
- dimension
- dimensions
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Technical Field
The present disclosure relates generally to data processing, and more particularly to data processing based on data aggregation.
Background
In the era of big data, big-data correlation and traceability techniques are commonly used to mine massive transaction data in depth and to identify abnormal transaction networks through transaction-network structure discovery and abnormal-transaction discovery.
In the big data field, because transaction data are voluminous and complex, grouping operations and data aggregation are typically used to obtain results during data mining. Such processing is relatively simple and fast, but it also has a limitation: when the final result for a given feature is output, there is usually no intermediate evidence information, so evidence must be collected manually afterwards, which is labor-intensive and prone to omissions.
Under these circumstances, operators in the big data field have a strong need for more powerful tools that provide intermediate evidence information together with the quickly output final result of a feature, thereby improving the efficiency and precision of data processing in the big data field.
Summary of the Invention
To solve the above technical problems, the present disclosure provides a data-warehouse-based user feature generation solution. The solution can provide the necessary intermediate evidence information while quickly outputting the final result of a user feature, thereby improving the efficiency and precision of data processing.
In an embodiment of the present disclosure, a method for generating user features based on a data warehouse is provided, including: obtaining transaction detail data of multiple users and splitting the transaction detail data into multiple subsets by different dimensions; aggregating the data in a first-dimension subset by user; obtaining a second dimension for data filtering; associating the data in the second-dimension subset with the aggregated data in the first-dimension subset to generate data to be filtered; filtering the data to be filtered to obtain a user feature; and outputting at least one of the user feature, the data in the first-dimension subset, and the data in the second-dimension subset.
In another embodiment of the present disclosure, each of the multiple subsets includes key-value data pairs.
In yet another embodiment of the present disclosure, the data warehouse is based on Spark, Hadoop, MapReduce, Hive, or SQL.
In another embodiment of the present disclosure, splitting the transaction detail data into multiple subsets by different dimensions may use a slice or dice operation.
In another embodiment of the present disclosure, outputting at least one of the user feature, the data in the first-dimension subset, and the data in the second-dimension subset includes: when evidence is required, outputting the user feature and the data in the first-dimension subset or the data in the second-dimension subset.
In yet another embodiment of the present disclosure, outputting at least one of the user feature, the data in the first-dimension subset, and the data in the second-dimension subset includes: when evidence is required, outputting the user feature, the key data in the first-dimension subset, and the key data in the second-dimension subset.
In an embodiment of the present disclosure, a system for generating user features based on a data warehouse is provided, including: an acquisition module that obtains transaction detail data of multiple users, splits the transaction detail data into multiple subsets by different dimensions, and obtains a second dimension for data filtering; an aggregation and association module that aggregates the data in a first-dimension subset by user and associates the data in the second-dimension subset with the aggregated data in the first-dimension subset to generate data to be filtered; a filtering module that filters the data to be filtered to obtain a user feature; and an output module that outputs at least one of the user feature, the data in the first-dimension subset, and the data in the second-dimension subset.
In another embodiment of the present disclosure, each of the multiple subsets includes key-value data pairs.
In yet another embodiment of the present disclosure, the data warehouse is based on Spark, Hadoop, MapReduce, Hive, or SQL.
In another embodiment of the present disclosure, the acquisition module may split the transaction detail data into multiple subsets by different dimensions using a slice or dice operation.
In another embodiment of the present disclosure, the output module outputting at least one of the user feature, the data in the first-dimension subset, and the data in the second-dimension subset includes: when evidence is required, the output module outputs the user feature and the data in the first-dimension subset or the data in the second-dimension subset.
In yet another embodiment of the present disclosure, the output module outputting at least one of the user feature, the data in the first-dimension subset, and the data in the second-dimension subset includes: when evidence is required, the output module outputs the user feature, the key data in the first-dimension subset, and the key data in the second-dimension subset.
In an embodiment of the present disclosure, a computer-readable storage medium storing instructions is provided; when executed, the instructions cause a machine to perform the method described above.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Brief Description of the Drawings
The above summary and the following detailed description of the present disclosure will be better understood when read in conjunction with the accompanying drawings. It should be noted that the drawings are merely examples of the claimed invention. In the drawings, the same reference numerals represent the same or similar elements.
FIG. 1 shows a flowchart of a Spark-based user feature generation method according to an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of a Spark-based user feature generation method according to an embodiment of the present disclosure;
FIG. 3 shows a flowchart of a data-warehouse-based user feature generation method according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of an application scenario of a data-warehouse-based user feature generation method according to an embodiment of the present disclosure;
FIG. 5 shows a block diagram of a data-warehouse-based user feature generation system according to an embodiment of the present disclosure.
Detailed Description
To make the above objects, features, and advantages of the present disclosure more comprehensible, specific implementations of the present disclosure are described in detail below in conjunction with the accompanying drawings.
Many specific details are set forth in the following description to facilitate a full understanding of the present disclosure, but the present disclosure can also be implemented in ways other than those described here; therefore, the present disclosure is not limited by the specific embodiments disclosed below.
In the era of big data, the processing and analysis of massive data must be built on effective storage. The Hadoop distributed file system (HDFS) currently in use can effectively store very large data sets. On top of effective storage, the statistics and analysis of data are essentially computation over the data. Common computing tools in the big data field include MapReduce and Spark. Effective management of big data can rely on data querying in the Hadoop ecosystem. A data warehouse provides the basis for data mining: models are built through classification, prediction, and correlation analysis for pattern recognition and machine learning, thereby constructing expert systems.
Hadoop, as a distributed file system (HDFS), can store massive data, and MapReduce within it handles data computation, together forming a distributed framework for processing massive data. HDFS allows data to span different machines and devices and uses paths to manage data on different platforms.
As a computing engine, MapReduce allocates data resources for different types of data and handles data exchange and communication between different devices. Spark is the second generation of such engines. When computing statistics, data must be classified: the finer the classification and the more participants in the computation, the shorter the computation time. In big data computation, hundreds or thousands of machines read different parts of a target file at the same time and then compute statistics over each part. The computation process is thus classification and summarization built on top of HDFS.
Spark is a big data processing cluster centered on data computation, in which data exchange and disk reads and writes are very convenient, further increasing the volume of data that can be processed.
Hive, as a data warehouse tool, is built on top of HDFS. It maps structured data in HDFS to tables in a database, so that MapReduce computation results can be queried with SQL statements alone, and data in the file system can also be modified through SQL, improving work efficiency.
In the following description of the present disclosure, Spark is taken as an example to describe the data-warehouse-based user feature generation method and system. Those skilled in the art will understand that the technical solution of the present disclosure is not limited to Spark; it is in fact applicable to any data warehouse. More advanced data warehouse technologies can also be incorporated into the technical solution of the present disclosure and are not described further here.
The data-warehouse-based user feature generation method and system according to various embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
In a data warehouse, as described above, splitting a data set into subsets or groups and applying a different function to each subset or group is an important part of data analysis. The data in a data set is split into multiple subsets or groups according to one or more selected keys. The split operation is performed on a specific dimension or axis of the data object; for example, a DataFrame can be split along its rows (axis=0) or its columns (axis=1). A function is then applied to each subset or group to produce new values. Finally, the results of all these functions are aggregated or combined into the final feature result.
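The split-apply-combine pattern described above can be sketched in a few lines of PySpark. This is only an illustration, not code from the patent; the column names (`customer_id`, `amount`) and the aggregate functions are assumptions chosen for the example.

```python
# Minimal split-apply-combine sketch in PySpark (hypothetical column names).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("split-apply-combine").getOrCreate()

df = spark.createDataFrame(
    [("A111", 10.0), ("A111", 25.0), ("A112", 7.5)],
    ["customer_id", "amount"],
)

# Split by the selected key (customer_id), apply aggregate functions to each
# group, and combine the per-group results into the final feature table.
result = df.groupBy("customer_id").agg(
    F.count("*").alias("txn_count"),
    F.sum("amount").alias("total_amount"),
)
result.show()
```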
In such an aggregation computation, the output usually contains only the final feature and no intermediate data. In a big data scenario, when the final user feature is determined or judged to be "suspicious transaction user", evidence information related to that feature often needs to be provided as well, and searching for evidence after the computation has finished tends to be inefficient. Therefore, what is needed in this field is to output evidence information on demand together with the final user feature.
FIG. 1 is a flowchart of a Spark-based user feature generation method 100 according to an embodiment of the present disclosure. FIG. 2 shows a schematic diagram of a Spark-based user feature generation method according to an embodiment of the present disclosure. A Spark-based user feature generation method according to an embodiment of the present disclosure is described below with reference to FIG. 1 and FIG. 2.
In an embodiment of the present disclosure, the target user feature to be produced by the Spark-based user feature generation method 100 is the number of payments a customer made at the transaction location Chengdu.
At 102, transaction detail data of multiple users is obtained.
The transaction detail data of multiple users is obtained from a data store. The data store may be a data warehouse based on Spark, Hadoop, MapReduce, Hive, SQL, or the like.
The transaction detail data, as a data set, contains data in multiple dimensions. As shown in FIG. 2, the subsets split by different dimensions contain different data. For example, subset 1 has multiple fields centered on customer ID and transaction UUID, including customer ID, transaction UUID, transaction amount, payment IP, and transaction time. Subset 2 has two fields centered on payment IP: payment IP and transaction location.
For example, subset 1 shows that the user with customer ID A111 has two transactions (transaction UUIDs T100 and T101, with payment IPs 192.1.2.3 and 192.1.2.4, respectively). Subset 1 also shows that the user with customer ID A112 has two transactions (transaction UUIDs T102 and T103, with payment IPs 192.1.2.5 and 192.1.2.6, respectively).
Subset 2 shows that the two transactions with payment IPs 192.1.2.3 and 192.1.2.4 took place in Chengdu. The transaction with payment IP 192.1.2.5 also took place in Chengdu, while the transaction with payment IP 192.1.2.6 took place in Hangzhou.
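For concreteness, the two subsets of this worked example can be represented roughly as the following Spark DataFrames. This is a sketch only: the column names are assumptions, and the transaction amount and transaction time fields of subset 1 are omitted for brevity.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("user-feature-example").getOrCreate()

# Subset 1: split by the user dimension (customer ID, transaction UUID, payment IP).
subset1 = spark.createDataFrame(
    [
        ("A111", "T100", "192.1.2.3"),
        ("A111", "T101", "192.1.2.4"),
        ("A112", "T102", "192.1.2.5"),
        ("A112", "T103", "192.1.2.6"),
    ],
    ["customer_id", "txn_uuid", "pay_ip"],
)

# Subset 2: split by the payment-IP dimension (payment IP -> transaction location).
subset2 = spark.createDataFrame(
    [
        ("192.1.2.3", "Chengdu"),
        ("192.1.2.4", "Chengdu"),
        ("192.1.2.5", "Chengdu"),
        ("192.1.2.6", "Hangzhou"),
    ],
    ["pay_ip", "location"],
)
```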
At 104, the data is aggregated in Spark by the user dimension, and the aggregated data includes the transaction UUIDs.
When aggregating data, the data in one dimension is often used as the key to construct key-value pairs (key:value).
When aggregating the data in subset 1, the customer ID can be used as the key to construct a collection of values of the form (transaction UUID, payment IP, ...). In the embodiment shown in FIG. 2, for user A111, the transactions (transaction UUID: T100, payment IP: 192.1.2.3) and (transaction UUID: T101, payment IP: 192.1.2.4) are aggregated under that user; for user A112, the transactions (transaction UUID: T102, payment IP: 192.1.2.5) and (transaction UUID: T103, payment IP: 192.1.2.6) are aggregated under that user.
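Continuing the hypothetical DataFrames above, one possible sketch of this aggregation step uses the customer ID as the grouping key and collects the (transaction UUID, payment IP) pairs as the value set; the internal representation used by the patented system may differ.

```python
from pyspark.sql import functions as F

# Key: customer_id; value: list of (txn_uuid, pay_ip) structs,
# e.g. A111 -> [(T100, 192.1.2.3), (T101, 192.1.2.4)].
aggregated = subset1.groupBy("customer_id").agg(
    F.collect_list(F.struct("txn_uuid", "pay_ip")).alias("txns")
)
aggregated.show(truncate=False)
```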
At 106, data in another dimension is associated.
Because the target user feature to be produced is the number of payments the user made at the transaction location Chengdu, aggregating subset 1 by the user dimension alone cannot produce this feature. In this case, data in another dimension must be associated, namely the address mapping table (subset 2, composed of payment IP and transaction location).
Which additional dimension needs to be associated depends on the target user feature to be produced. When the filter condition is "the user's transaction location is Chengdu", the user-dimension transaction detail data (subset 1) does not contain transaction location data, so the relevant subsets must be associated in order to produce the target user feature.
Subset 2 is associated on top of the aggregated data of subset 1 to form the data to be filtered. For user A111, the transactions to be filtered are (transaction UUID: T100, payment IP: 192.1.2.3, location: Chengdu) and (transaction UUID: T101, payment IP: 192.1.2.4, location: Chengdu). For user A112, the transactions to be filtered are (transaction UUID: T102, payment IP: 192.1.2.5, location: Chengdu) and (transaction UUID: T103, payment IP: 192.1.2.6, location: Hangzhou).
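One way to form the data to be filtered is a join on the payment IP. The sketch below joins the flat subset 1 with subset 2; joining after exploding the aggregated structure from the previous step would yield the same rows. Again, this is an assumed implementation, not the patent's own code.

```python
# Associate the location dimension: each transaction row gains its transaction location.
to_filter = subset1.join(subset2, on="pay_ip", how="left")
to_filter.show()
# e.g. (A111, T100, 192.1.2.3, Chengdu), ..., (A112, T103, 192.1.2.6, Hangzhou)
```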
At 108, the associated data is filtered.
The associated data above is filtered with the condition "the user's transaction location is Chengdu". The result is: for user A111, the selected transactions are (transaction UUID: T100, payment IP: 192.1.2.3, location: Chengdu) and (transaction UUID: T101, payment IP: 192.1.2.4, location: Chengdu); for user A112, the selected transaction is (transaction UUID: T102, payment IP: 192.1.2.5, location: Chengdu).
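A sketch of the filtering step under the condition "transaction location is Chengdu", continuing the DataFrame built above:

```python
from pyspark.sql import functions as F

filtered = to_filter.filter(F.col("location") == "Chengdu")
filtered.show()
# Keeps T100 and T101 (A111) and T102 (A112); drops T103 (Hangzhou).
```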
At 110, the user feature and the transaction UUIDs are output.
For the transaction data selected at 108, the number of payments each user made at the transaction location Chengdu is computed: user A111 has 2 transactions in Chengdu, and user A112 has 1 transaction in Chengdu.
In a big data scenario, the transaction UUIDs can be output as evidence, namely:
For user A111, there are 2 transactions in Chengdu, with transaction UUIDs T100 and T101; for user A112, there is 1 transaction in Chengdu, with transaction UUID T102.
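A sketch of producing the feature and its evidence in a single aggregation, so that the transaction UUIDs never have to be retrieved again afterwards (column names are assumptions):

```python
from pyspark.sql import functions as F

feature_with_evidence = filtered.groupBy("customer_id").agg(
    F.count("*").alias("chengdu_payment_count"),         # the user feature
    F.collect_list("txn_uuid").alias("evidence_uuids"),   # the supporting evidence
)
feature_with_evidence.show(truncate=False)
# A111 -> count 2, evidence [T100, T101]; A112 -> count 1, evidence [T102]
```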
Of course, those skilled in the art will understand that for different big data scenarios, different related transaction details can be output as evidence, and multiple related detail records can be output as evidence as required.
In the big data field, abnormal transfer structures can be discovered by identifying different transaction structures. Connectivity is the most ubiquitous characteristic of today's networks and systems, and connections are neither uniformly distributed nor static, from social networks to transaction networks. Studying connectivity and its dynamics makes it possible to fully describe, and even predict, behavior in connected systems.
The different transaction structures found in abnormal transfer structure discovery include chain transaction structures, nested ring transaction structures, concentrated-transfer-in/dispersed-transfer-out structures, dispersed-transfer-in/concentrated-transfer-out structures, and so on. For such abnormal transaction structures, the data-warehouse-based user feature generation solution of the present disclosure can mine the abnormal user features and provide the corresponding evidence.
FIG. 3 is a flowchart of a data-warehouse-based user feature generation method 300 according to an embodiment of the present disclosure. FIG. 4 is a schematic diagram of an application scenario of a data-warehouse-based user feature generation method according to an embodiment of the present disclosure. A data-warehouse-based user feature generation method according to an embodiment of the present disclosure is described below with reference to FIG. 3 and FIG. 4.
In a data warehouse, multidimensional analysis can perform roll-up, drill-down, slice, dice, pivot, and other analysis operations on data organized in multidimensional form, in order to dissect the data and produce features, so that analysts, managers, and decision makers can observe the data in the database from multiple angles and aspects and gain a deep understanding of the information contained in the data. This multidimensional analysis approach fits human thinking patterns, reducing confusion and the likelihood of misinterpretation.
At 302, transaction detail data of multiple users is obtained and split into multiple subsets by different dimensions.
For a multidimensional data set stored in a data warehouse, the data set must be split before multidimensional analysis can be performed on it. It can be split along one dimension, which is a slice; the result of a slice is usually two-dimensional planar data. It can also be split along several dimensions, which is a dice; the result of a dice is usually a sub-cube of the data.
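As a rough illustration (not the patent's own code), a slice fixes a single dimension while a dice restricts several dimensions at once; the sketch below reuses the hypothetical subset 1 DataFrame introduced earlier.

```python
from pyspark.sql import functions as F

# Slice: fix one dimension, e.g. keep only the rows whose payment IP is 192.1.2.3.
slice_df = subset1.filter(F.col("pay_ip") == "192.1.2.3")

# Dice: restrict several dimensions at once (a set of users and a set of IPs),
# yielding a sub-cube of the original data.
dice_df = subset1.filter(
    F.col("customer_id").isin("A111", "A112")
    & F.col("pay_ip").isin("192.1.2.3", "192.1.2.5")
)
```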
In an embodiment of the present disclosure, the transaction detail data of multiple users is split by the user dimension into subset 1, which includes the fields customer ID, transaction UUID, transaction amount, payment IP, and transaction time, and split by the payment-IP dimension into subset 2, which includes payment IP and transaction location.
At 304, the data in the first-dimension subset is aggregated by user.
When aggregating the data in subset 1, the customer ID can be used as the key to construct a collection of values of the form (transaction UUID, payment IP, ...). For example, for user A111, the transactions (transaction UUID: T100, payment IP: 192.1.2.3) and (transaction UUID: T101, payment IP: 192.1.2.4) are aggregated under that user.
In a data warehouse, data dimensions are hierarchical; for example, a time dimension may consist of year, month, and day, and the level of a dimension reflects the degree of summarization of the data. The higher the level, the more summarized the data, the fewer the details, and the smaller the data volume; the lower the level, the less summarized the data, the richer the details, and the larger the data volume. Data aggregation can be called roll-up: more summarized data is observed by climbing the dimension hierarchy or by eliminating one or more dimensions.
As the inverse of data aggregation, drill-down observes the data in more detail by descending the dimension hierarchy or by introducing one or more dimensions. In addition, data from different viewpoints can be obtained through pivoting. A pivot operation is equivalent to rotating the coordinate axes of planar data; for example, a pivot may swap rows and columns, or rotate one dimension into another.
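A small sketch of roll-up versus drill-down on a time hierarchy; the transaction table, its timestamp column, and the amounts are assumptions added purely for illustration, and `spark` is the session created in the earlier sketch.

```python
from pyspark.sql import functions as F

# Hypothetical transaction table with a timestamp column.
txns = spark.createDataFrame(
    [("A111", "2019-10-11 09:00:00", 10.0), ("A111", "2019-11-02 12:00:00", 25.0)],
    ["customer_id", "txn_time", "amount"],
).withColumn("txn_time", F.to_timestamp("txn_time"))

# Roll-up: climb the hierarchy (day -> month) for a more summarized view.
by_month = txns.groupBy(F.date_trunc("month", "txn_time").alias("month")).agg(
    F.sum("amount").alias("monthly_amount")
)

# Drill-down: descend the hierarchy (back to day) or introduce a dimension (customer_id).
by_day_and_user = txns.groupBy(
    F.to_date("txn_time").alias("day"), "customer_id"
).agg(F.sum("amount").alias("daily_amount"))
```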
Those skilled in the art will understand that the data-warehouse-based user feature generation method of the present disclosure is not limited to data aggregation operations; it can also be extended to other analysis operations, which are not described further here.
At 306, a second dimension for data filtering is obtained.
Because the data set has been split into multiple subsets, the data is inevitably scattered across multiple dimensions, so when filtering the data, one or more additional dimensions relevant to the filter or screening condition must be obtained.
In an embodiment of the present disclosure, the filter condition is "the user's transaction location is Chengdu", so the dimension relevant to this condition must include transaction location data. The second dimension obtained is therefore the address mapping table (subset 2, composed of payment IP and transaction location).
At 308, the data in the second-dimension subset is associated with the aggregated data in the first-dimension subset to generate the data to be filtered.
For the obtained second dimension, the data in the second-dimension subset is associated with the aggregated data in the first-dimension subset, generating a data set that can be filtered. In this embodiment, the data to be filtered is: for user A111, the transactions to be filtered are (transaction UUID: T100, payment IP: 192.1.2.3, location: Chengdu) and (transaction UUID: T101, payment IP: 192.1.2.4, location: Chengdu); for user A112, the transactions to be filtered are (transaction UUID: T102, payment IP: 192.1.2.5, location: Chengdu) and (transaction UUID: T103, payment IP: 192.1.2.6, location: Hangzhou).
At 310, the data to be filtered is filtered to obtain the user feature.
In this embodiment, the selected data is: for user A111, the selected transactions are (transaction UUID: T100, payment IP: 192.1.2.3, location: Chengdu) and (transaction UUID: T101, payment IP: 192.1.2.4, location: Chengdu); for user A112, the selected transaction is (transaction UUID: T102, payment IP: 192.1.2.5, location: Chengdu).
For the transaction data selected at 310, the number of payments each user made at the transaction location Chengdu is computed: user A111 has 2 transactions in Chengdu, and user A112 has 1 transaction in Chengdu.
At 312, at least one of the user feature, the data in the first-dimension subset, and the data in the second-dimension subset is output.
In a big data scenario, related evidence needs to be output in addition to the user feature.
Thus, in this embodiment, the data that can be output is: for user A111, there are 2 transactions in Chengdu, with transaction UUIDs T100 and T101; for user A112, there is 1 transaction in Chengdu, with transaction UUID T102.
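Since the disclosure also mentions Hive and SQL as possible back ends, the same pipeline can be expressed as a single SQL query run through Spark; the view and column names below are the hypothetical ones from the earlier sketches, and the grouping and join order here fold steps 304 and 308 together while yielding the same feature and evidence.

```python
# Register the hypothetical DataFrames as temporary views so they can be queried with SQL.
subset1.createOrReplaceTempView("subset1")
subset2.createOrReplaceTempView("subset2")

report = spark.sql("""
    SELECT s1.customer_id,
           COUNT(*)                  AS chengdu_payment_count,  -- the user feature
           COLLECT_LIST(s1.txn_uuid) AS evidence_uuids          -- the evidence
    FROM subset1 s1
    JOIN subset2 s2 ON s1.pay_ip = s2.pay_ip
    WHERE s2.location = 'Chengdu'
    GROUP BY s1.customer_id
""")
report.show(truncate=False)
```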
In scenarios where the data is stored as key-value pairs, the related evidence can be output in the form of the keys in the corresponding dimensions.
Those skilled in the art will understand that in different scenarios, the user feature and the evidence data can be output as required: only the user feature may be output; the user feature and the necessary evidence may be output; or, when important matters are involved, the user feature, the necessary evidence, and auxiliary evidence may all be output.
The data-warehouse-based user feature generation method of the present disclosure can provide the necessary intermediate evidence information while quickly outputting the final result of the user feature, thereby improving the efficiency and precision of data processing. In this method, the feature-related evidence is produced at the same time as the user feature, so no manual data retrieval is needed at reporting time, which greatly improves the efficiency and completeness of the evidence.
FIG. 5 shows a block diagram of a data-warehouse-based user feature generation system 500 according to an embodiment of the present disclosure.
The data-warehouse-based user feature generation system 500 includes an acquisition module 502, a filtering module 504, an aggregation and association module 506, and an output module 508.
The acquisition module 502 obtains transaction detail data of multiple users and splits the transaction detail data into multiple subsets by different dimensions.
For a multidimensional data set stored in a data warehouse, the data set must be split before multidimensional analysis can be performed on it. It can be split along one dimension, which is a slice; the result of a slice is usually two-dimensional planar data. It can also be split along several dimensions, which is a dice; the result of a dice is usually a sub-cube of the data.
In an embodiment of the present disclosure, the transaction detail data of multiple users is split by the user dimension into subset 1, which includes the fields customer ID, transaction UUID, transaction amount, payment IP, and transaction time, and split by the payment-IP dimension into subset 2, which includes payment IP and transaction location.
Further, the acquisition module 502 obtains a second dimension for data filtering. Because the data set has been split into multiple subsets, the data is inevitably scattered across multiple dimensions, so when filtering the data, one or more additional dimensions relevant to the filter condition must be obtained.
In an embodiment of the present disclosure, the filter condition is "the user's transaction location is Chengdu", so the dimension relevant to this condition must include transaction location data. The second dimension obtained is therefore the address mapping table (subset 2, composed of payment IP and transaction location).
The aggregation and association module 506 aggregates the data in the first-dimension subset by user and associates the data in the second-dimension subset with the aggregated data in the first-dimension subset to generate the data to be filtered.
In a data warehouse, data dimensions are hierarchical, and the level of a dimension reflects the degree of summarization of the data. The higher the level, the more summarized the data, the fewer the details, and the smaller the data volume; the lower the level, the less summarized the data, the richer the details, and the larger the data volume. Data aggregation observes more summarized data by climbing the dimension hierarchy or by eliminating one or more dimensions.
When aggregating the data in subset 1, the aggregation and association module 506 uses the customer ID as the key and constructs a collection of values of the form (transaction UUID, payment IP, ...). For example, for user A111, the aggregated transactions are (transaction UUID: T100, payment IP: 192.1.2.3) and (transaction UUID: T101, payment IP: 192.1.2.4).
For the obtained second dimension, the aggregation and association module 506 associates the data in the second-dimension subset with the aggregated data in the first-dimension subset, generating a data set that can be filtered. In this embodiment, the data to be filtered is: for user A111, the transactions to be filtered are (transaction UUID: T100, payment IP: 192.1.2.3, location: Chengdu) and (transaction UUID: T101, payment IP: 192.1.2.4, location: Chengdu); for user A112, the transactions to be filtered are (transaction UUID: T102, payment IP: 192.1.2.5, location: Chengdu) and (transaction UUID: T103, payment IP: 192.1.2.6, location: Hangzhou).
The filtering module 504 filters the data to be filtered to obtain the user feature.
In this embodiment, the data selected by the filtering module 504 is: for user A111, the selected transactions are (transaction UUID: T100, payment IP: 192.1.2.3, location: Chengdu) and (transaction UUID: T101, payment IP: 192.1.2.4, location: Chengdu); for user A112, the selected transaction is (transaction UUID: T102, payment IP: 192.1.2.5, location: Chengdu).
For the selected transaction data, the filtering module 504 computes the number of payments each user made at the transaction location Chengdu: user A111 has 2 transactions in Chengdu, and user A112 has 1 transaction in Chengdu.
The output module 508 outputs at least one of the user feature, the data in the first-dimension subset, and the data in the second-dimension subset.
Thus, in this embodiment, the data that can be output is: for user A111, there are 2 transactions in Chengdu, with transaction UUIDs T100 and T101; for user A112, there is 1 transaction in Chengdu, with transaction UUID T102.
In scenarios where the data is stored as key-value pairs, the related evidence can be output in the form of the keys in the corresponding dimensions.
The data-warehouse-based user feature generation system of the present disclosure can provide the necessary intermediate evidence information while quickly outputting the final result of the user feature, thereby improving the efficiency and precision of data processing. In this system, the feature-related evidence is produced at the same time as the user feature, so no manual data retrieval is needed at reporting time, improving the efficiency and completeness of the evidence.
The steps and modules of the data-warehouse-based user feature generation method and system described above can be implemented in hardware, software, or a combination thereof. If implemented in hardware, the various illustrative steps, modules, and circuits described in connection with the present invention can be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic components, hardware components, or any combination thereof. A general-purpose processor may be a processor, microprocessor, controller, microcontroller, state machine, or the like. If implemented in software, the various illustrative steps and modules described in connection with the present invention may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Software modules implementing the various operations of the present invention may reside in storage media such as RAM, flash memory, ROM, EPROM, EEPROM, registers, hard disks, removable disks, CD-ROMs, and cloud storage. A storage medium can be coupled to a processor so that the processor can read and write information from and to the storage medium and execute the corresponding program modules to implement the steps of the present invention. Moreover, software-based embodiments may be uploaded, downloaded, or accessed remotely through appropriate communication means, including, for example, the Internet, the World Wide Web, an intranet, software applications, cables (including fiber-optic cables), magnetic communications, electromagnetic communications (including RF, microwave, and infrared communications), electronic communications, or other such communication means.
It should also be noted that the embodiments may be described as processes depicted as flowcharts, flow diagrams, structure diagrams, or block diagrams. Although a flowchart may describe the operations as a sequential process, many of these operations can be performed in parallel or concurrently. In addition, the order of the operations may be rearranged.
The disclosed methods, apparatuses, and systems should not be limited in any way. On the contrary, the present invention covers all novel and non-obvious features and aspects of the various disclosed embodiments, both individually and in various combinations and sub-combinations with one another. The disclosed methods, apparatuses, and systems are not limited to any specific aspect or feature or combination thereof, nor does any disclosed embodiment require that any one or more specific advantages be present or that any particular or all technical problems be solved.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the specific implementations described above; those implementations are merely illustrative rather than restrictive. Under the teaching of the present invention, those of ordinary skill in the art can make many further changes without departing from the spirit of the present invention and the scope protected by the claims, and all such changes fall within the protection scope of the present invention.
Claims (13)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910962259.7A CN111090708B (en) | 2019-10-11 | 2019-10-11 | Method and system for generating user characteristics based on data warehouse |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910962259.7A CN111090708B (en) | 2019-10-11 | 2019-10-11 | Method and system for generating user characteristics based on data warehouse |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN111090708A (en) | 2020-05-01 |
| CN111090708B (en) | 2023-07-14 |
Family
ID=70393007
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201910962259.7A Active CN111090708B (en) | Method and system for generating user characteristics based on data warehouse | 2019-10-11 | 2019-10-11 |

Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN111090708B (en) |
Families Citing this family (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113344104A (en) * | 2021-06-23 | 2021-09-03 | 支付宝(杭州)信息技术有限公司 | Data processing method, device, equipment and medium |
| CN117076537A (en) * | 2023-08-16 | 2023-11-17 | 平安银行股份有限公司 | Feature derivation method and device of transaction flow data and electronic equipment |
Citations (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6647383B1 * | 2000-09-01 | 2003-11-11 | Lucent Technologies Inc. | System and method for providing interactive dialogue and iterative search functions to find information |
| CN108205768A (en) * | 2016-12-20 | 2018-06-26 | 百度在线网络技术(北京)有限公司 | Database building method and data recommendation method and device, equipment and storage medium |
| CN110134722A (en) * | 2019-05-22 | 2019-08-16 | 北京小度信息科技有限公司 | Target user determines method, apparatus, equipment and storage medium |
Family Cites Families (8)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7089266B2 (en) * | 2003-06-02 | 2006-08-08 | The Board Of Trustees Of The Leland Stanford Jr. University | Computer systems and methods for the query and visualization of multidimensional databases |
| US8190555B2 (en) * | 2009-01-30 | 2012-05-29 | Hewlett-Packard Development Company, L.P. | Method and system for collecting and distributing user-created content within a data-warehouse-based computational system |
| US9235633B2 (en) * | 2011-12-06 | 2016-01-12 | International Business Machines Corporation | Processing data in a data warehouse |
| US9411874B2 (en) * | 2012-06-14 | 2016-08-09 | Melaleuca, Inc. | Simplified interaction with complex database |
| CN106682173B (en) * | 2016-12-28 | 2019-10-18 | 华南理工大学 | A social security big data OLAP preprocessing method and online analysis query method |
| CN107861981B (en) * | 2017-09-28 | 2020-09-01 | 北京奇艺世纪科技有限公司 | Data processing method and device |
| US10936627B2 (en) * | 2017-10-27 | 2021-03-02 | Intuit, Inc. | Systems and methods for intelligently grouping financial product users into cohesive cohorts |
| CN109299199A (en) * | 2018-10-15 | 2019-02-01 | 河北师范大学 | Data warehouse-based multi-dimensional analysis system and realization method of precursor chemicals |
- 2019-10-11: CN application CN201910962259.7A (patent CN111090708B, en, status: Active)
Also Published As

| Publication number | Publication date |
|---|---|
| CN111090708A (en) | 2020-05-01 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | TA01 | Transfer of patent application right | Effective date of registration: 20230112. Address after: 200120 Floor 15, No. 447, Nanquan North Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai. Applicant after: Alipay.com Co.,Ltd. Address before: 310000 801-11 section B, 8th floor, 556 Xixi Road, Xihu District, Hangzhou City, Zhejiang Province. Applicant before: Alipay (Hangzhou) Information Technology Co.,Ltd. |
| | GR01 | Patent grant | |