CN111626842A - Consumption behavior data analysis method and device - Google Patents

Publication number
CN111626842A
Authority
CN
China
Prior art keywords
consumer
consumers
preset time
consumption behavior
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010322724.3A
Other languages
Chinese (zh)
Inventor
王文刚
康晓中
郭豪
孙悦
郭晓鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Trusfort Technology Co ltd
Original Assignee
Beijing Trusfort Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Trusfort Technology Co ltd filed Critical Beijing Trusfort Technology Co ltd
Priority to CN202010322724.3A priority Critical patent/CN111626842A/en
Publication of CN111626842A publication Critical patent/CN111626842A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof

Landscapes

  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Engineering & Computer Science (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for analyzing consumption behavior data, which comprises the following steps: collecting consumption behavior data in a sliding window mode; constructing a relationship graph of the consumer and the merchant corresponding to each window according to the consumption behavior data acquired by each window; constructing a relation graph of a consumer and a merchant corresponding to a preset time period T; and calculating the similarity between the consumers according to the relationship graph of the consumers and the merchants corresponding to the preset time period T to obtain a similarity matrix of the consumers, and clustering the consumers by using the similarity matrix.

Description

Consumption behavior data analysis method and device
Technical Field
The present invention relates to data processing technologies, and in particular, to a method and an apparatus for analyzing consumption behavior data.
Background
The risk of abnormal consumption behaviors, such as illegal credit card cash-out, is increasingly prominent and has many adverse effects on society. Abnormal consumption methods such as cash-out are becoming more concealed, more technically sophisticated, and more professional, and they increasingly show the characteristics of organized, specialized groups.
At present, abnormal consumption behaviors such as cash-out are identified mainly by having business experts manually audit users' consumption behavior records, or by constructing expert rules based on the transaction characteristics of cards and merchants.
If business experts perform the identification directly, then on the one hand the results are highly subjective and unstable because different experts apply different judgment standards; on the other hand, the audit volume is usually large, so identification is very inefficient and unsuited to the large record volumes of current electronic banking.
Expert rules are simple in form, and their formulation depends heavily on expert experience, so they struggle to cover the variety of abnormal consumption behavior patterns and can lead to a relatively high miss rate.
Disclosure of Invention
The invention provides a consumption behavior data analysis method and device, which can improve the identification efficiency and accuracy of abnormal consumption behaviors.
One aspect of the present invention provides a method for analyzing consumption behavior data, including:
collecting consumption behavior data in a sliding window mode; the consumption behavior data collected by each window comprises at least one consumption behavior record; each consumption behavior record includes: consumer information, merchant information, and time of occurrence of a consumption action;
constructing a relationship graph of the consumer and the merchant corresponding to each window according to the consumption behavior data acquired by each window;
constructing a relation graph of a consumer and a merchant corresponding to a preset time period T;
calculating the similarity between the consumers according to the relation graph of the consumers and the merchants corresponding to the preset time period T to obtain a similarity matrix of the consumers, and clustering the consumers according to the similarity matrix.
In one embodiment, the length of each of the windows is the same, and the lengths of two adjacent windows overlap for a predetermined time period, which is shorter than the length of the window.
In one embodiment, the relationship graph between the consumer and the merchant is a bipartite graph;
the bipartite graph is an unweighted graph or a weighted graph;
the unweighted graph indicates that consumption behavior has occurred between the consumer and the merchant;
the weighted graph indicates that consumption behavior has occurred between the consumer and the merchant, and the number of times it has occurred.
In an implementation manner, the building a relationship graph between a consumer and a merchant corresponding to a preset time period T includes:
merging the unweighted bipartite graphs of all windows in the preset time period T to obtain an unweighted bipartite graph corresponding to the preset time period T; or,
performing the following processing on the weighted bipartite graph corresponding to each window contained in the preset time period T: comparing the consumption behavior records in the overlapping duration of the i-th window and the (i-1)-th window, determining the consumption behavior records that appear identically in both windows within the overlapping duration, and deleting the relationship corresponding to each such record from the weighted bipartite graph of the i-th window; and merging the processed weighted bipartite graphs of the windows to obtain the weighted bipartite graph corresponding to the preset time period T.
In an implementation manner, calculating the similarity between consumers according to the unweighted bipartite graph of the consumers and the merchants corresponding to the preset time period T includes:
the similarity w between two consumers is:
Figure BDA0002462049710000031
calculating the similarity between consumers according to the weighted bipartite graph of the consumers and the merchants corresponding to the preset time period T includes:
the similarity w between two consumers is determined as:
$$w = \sum_{i=1}^{n} C_i^2$$
where n is the number of merchants in the intersection of the merchants corresponding to the two consumers, i denotes the i-th merchant in the intersection, and $C_i$ denotes the number of times the two consumers are co-linked at the i-th merchant.
In an embodiment, after clustering the consumers, at least one cluster is obtained, and each cluster includes at least one consumer, the method further includes:
and in the preset time period T, counting the average value Meanall of the consumption behavior record numbers of all the consumers, counting the average consumption behavior record number Meani of each cluster, comparing the Meani of each cluster with the Meanall, and when a preset condition is met, determining that the cluster corresponding to the Meani is an abnormal cluster.
In one embodiment, the method further comprises:
counting the abnormal clusters in a plurality of preset time periods T;
counting the number of abnormal clusters to which the consumers belong aiming at each consumer in the preset time periods T;
determining the abnormal cluster percentage corresponding to the consumer according to the number of the abnormal clusters to which the consumer belongs and the total number of the clusters in the preset time period T;
and if the abnormal cluster percentage corresponding to the consumer reaches a preset threshold value, determining that the consumer is an abnormal consumer.
Another aspect of the present invention provides a consumption behavior data analysis device, including:
the data acquisition module is used for acquiring consumption behavior data in a sliding window mode; the consumption behavior data collected by each window comprises at least one consumption behavior record; each consumption behavior record includes: consumer information, merchant information, and time of occurrence of a consumption action;
the relational graph building module is used for building a relational graph of the consumer and the merchant corresponding to each window according to the consumption behavior data collected by each window; the system is also used for constructing a relation graph of the consumer and the merchant corresponding to the preset time period T;
and the data analysis module is used for calculating the similarity between the consumers according to the relationship graph of the consumers and the merchants corresponding to the preset time period T to obtain a similarity matrix of the consumers, and clustering the consumers according to the similarity matrix.
In one embodiment, the length of each of the windows is the same, and the lengths of two adjacent windows overlap for a predetermined time period, which is shorter than the length of the window.
In one embodiment, the relationship graph between the consumer and the merchant is a bipartite graph;
the bipartite graph is an unweighted graph or a weighted graph;
the unweighted graph indicates that consumption behavior has occurred between the consumer and the merchant;
the weighted graph indicates that consumption behavior has occurred between the consumer and the merchant, and the number of times it has occurred.
In an embodiment, the relationship graph building module is further configured to:
merging the unweighted bipartite graphs of all windows in the preset time period T to obtain an unweighted bipartite graph corresponding to the preset time period T; or,
performing the following processing on the weighted bipartite graph corresponding to each window contained in the preset time period T: comparing the consumption behavior records in the overlapping duration of the i-th window and the (i-1)-th window, determining the consumption behavior records that appear identically in both windows within the overlapping duration, and deleting the relationship corresponding to each such record from the weighted bipartite graph of the i-th window; and merging the processed weighted bipartite graphs of the windows to obtain the weighted bipartite graph corresponding to the preset time period T.
In an implementation manner, the data analysis module is further configured to calculate a similarity between consumers according to an unweighted bipartite graph of the consumer and the merchant corresponding to a preset time period T, including:
the similarity w between two consumers is:
Figure BDA0002462049710000051
In an implementation manner, the data analysis module is further configured to calculate the similarity between consumers according to the weighted bipartite graph of the consumers and the merchants corresponding to the preset time period T, wherein:
the similarity w between two consumers is determined as:
$$w = \sum_{i=1}^{n} C_i^2$$
where n is the number of merchants in the intersection of the merchants corresponding to the two consumers, i denotes the i-th merchant in the intersection, and $C_i$ denotes the number of times the two consumers are co-linked at the i-th merchant.
In an implementation manner, after clustering is performed on the consumers, at least one cluster is obtained, wherein each cluster comprises at least one consumer;
the data analysis module is further configured to: and in the preset time period T, counting the average value Meanall of the consumption behavior record numbers of all the consumers, counting the average consumption behavior record number Meani of each cluster, comparing the Meani of each cluster with the Meanall, and when a preset condition is met, determining that the cluster corresponding to the Meani is an abnormal cluster.
In an embodiment, the data analysis module is further configured to:
counting the abnormal clusters in a plurality of preset time periods T;
counting the number of abnormal clusters to which the consumers belong aiming at each consumer in the preset time periods T;
determining the abnormal cluster percentage corresponding to the consumer according to the number of the abnormal clusters to which the consumer belongs and the total number of the clusters in the preset time period T;
and if the abnormal cluster percentage corresponding to the consumer reaches a preset threshold value, determining that the consumer is an abnormal consumer.
Based on the scheme of the invention, a relationship graph of consumers and merchants is constructed from the collected consumption behavior data, a similarity matrix of the consumers is then generated, and abnormal behavior analysis is performed. The method does not depend on expert experience, so abnormal clusters and abnormal consumers can be identified quickly and accurately, with higher identification efficiency for abnormal consumption behaviors such as cash-out.
Drawings
Fig. 1 is a schematic flow chart illustrating a consumption behavior data analysis method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the unweighted bipartite graph of a window according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the unweighted bipartite graph of a window according to another embodiment of the present invention;
FIG. 4 is a schematic diagram of the weighted bipartite graph of a window according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the weighted bipartite graph of a window according to another embodiment of the present invention;
FIG. 6 is a schematic diagram of the unweighted bipartite graph of a preset time period T according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the weighted bipartite graph of a preset time period T according to another embodiment of the present invention;
FIG. 8 is a schematic diagram of an unweighted bipartite graph according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a weighted bipartite graph according to another embodiment of the present invention;
fig. 10 is a schematic structural diagram of a consumption behavior data analysis apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a consumption behavior data analysis method according to an embodiment of the present invention, which includes:
Step 101, collecting consumption behavior data in a sliding window mode.
The consumption behavior data concerned by the invention mainly comprises: consumers (information) and merchants (information), and the time at which the consumption action occurred (referred to as consumption time).
In the invention, each window corresponds to a duration, which is called the length of the window, for example, 2 hours, and the length of each window is the same; the length of two adjacent windows overlaps for a certain period of time which is less than the length of the window, for example 1 hour. By adopting the method for sampling, the accuracy of the sampled data can be improved.
The following describes an example of collecting consumption behavior data in a sliding window manner:
Assume that the length of each window is set to 2 hours and that two adjacent windows overlap by 1 hour. If data collection starts at 00:00 on a given day, the 1st window has a time span of 0-1 (from 00:00 to 01:59), the 2nd window has a time span of 1-2 (from 01:00 to 02:59), the 3rd window has a time span of 2-3 (from 02:00 to 03:59), and so on. The overlap of the 1st and 2nd windows is then hour 1 (from 01:00 to 01:59), the overlap of the 2nd and 3rd windows is hour 2 (from 02:00 to 02:59), and so on.
The consumption behavior data collected by each window comprises one or more consumption behavior records, each of which indicates that a consumer made a purchase at a merchant at a certain point in time. For example, the consumption behavior data collected in the 1st window covers all consumers' consumption behaviors at merchants from 00:00 to 01:59.
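By way of illustration only, the overlapping-window sampling described above could be sketched in Python as follows. The function names, the half-open interval convention, and the calendar date are assumptions made for this sketch and do not appear in the original text.

```python
from datetime import datetime, timedelta

def sliding_windows(start, end, length, overlap):
    """Yield (window_start, window_end) pairs; each window is half-open [start, end)."""
    if not timedelta(0) <= overlap < length:
        raise ValueError("overlap must be non-negative and shorter than the window length")
    step = length - overlap
    t = start
    # Only full-length windows are produced; a trailing partial window is skipped.
    while t + length <= end:
        yield t, t + length
        t += step

def records_in_window(records, w_start, w_end):
    """Return the records (consumer, merchant, timestamp) that fall inside the window."""
    return [r for r in records if w_start <= r[2] < w_end]

# Hypothetical day; 2-hour windows overlapping by 1 hour, as in the example above.
DAY = datetime(2020, 1, 1)
WINDOWS = list(sliding_windows(DAY, DAY + timedelta(hours=4),
                               timedelta(hours=2), timedelta(hours=1)))
for w_start, w_end in WINDOWS:
    print(w_start.time(), "to", w_end.time())
```

A record with timestamp 01:32 falls into both the 1st and the 2nd window in this sketch, which is the double counting that the weighted merge step later has to correct.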
Suppose that the consumption behavior data collected in the 1st window (0-1) is as shown in Table 1:
Consumer  Merchant  Time
A1        B2        00:24
A2        B1        00:30
A4        B2        00:35
A6        B4        00:40
A6        B4        00:45
A3        B2        01:32
A6        B4        01:40
TABLE 1
In Table 1, the first row indicates that consumer A1 made a purchase at merchant B2 at 00:24; the second row indicates that consumer A2 made a purchase at merchant B1 at 00:30; the third row indicates that consumer A4 made a purchase at merchant B2 at 00:35; and so on.
Suppose that the consumption behavior data collected in the 2nd window (1-2) is as shown in Table 2:
Consumer  Merchant  Time
A3        B2        01:32
A6        B4        01:40
A4        B2        01:50
A5        B3        02:10
A6        B4        02:30
A3        B2        02:32
A4        B2        02:40
TABLE 2
In Table 2, the first row indicates that consumer A3 made a purchase at merchant B2 at 01:32; the second row indicates that consumer A6 made a purchase at merchant B4 at 01:40; the third row indicates that consumer A4 made a purchase at merchant B2 at 01:50; and so on.
Step 102, constructing the relationship graph of consumers and merchants corresponding to each window according to the consumption behavior data collected by that window.
Preferably, the present invention employs a bipartite graph to characterize consumer and merchant relationships.
According to the collected consumption behavior data, the subjects involved in a consumption behavior fall into two types, consumers and merchants, so the network represented by the bipartite graph contains two types of nodes: consumer nodes and merchant nodes. When constructing the bipartite graph, each collected consumption behavior record is processed by connecting the consumer node and the merchant node involved in that record with an edge. Once every record collected in a window has been processed in this way, the bipartite graph corresponding to that window is complete. The bipartite graph corresponding to Table 1 is shown in Fig. 2, and the bipartite graph corresponding to Table 2 is shown in Fig. 3.
In the present invention, the weight of each edge is the number of consumption behaviors that occurred between a consumer node and a merchant node. Fig. 4 shows the weighted bipartite graph corresponding to Table 1, and Fig. 5 shows the weighted bipartite graph corresponding to Table 2. That is, the unweighted graph only indicates that consumption behavior occurred between a consumer and a merchant, whereas the weighted graph additionally indicates how many times it occurred.
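As a minimal sketch of this construction, the records of Table 1 can be turned into an unweighted edge set and a weighted edge counter. The data structures and function names are illustrative assumptions; the original text does not prescribe a representation.

```python
from collections import Counter

# Records from Table 1: (consumer, merchant, time).
TABLE_1 = [
    ("A1", "B2", "00:24"), ("A2", "B1", "00:30"), ("A4", "B2", "00:35"),
    ("A6", "B4", "00:40"), ("A6", "B4", "00:45"), ("A3", "B2", "01:32"),
    ("A6", "B4", "01:40"),
]

def unweighted_bipartite(records):
    """Return the set of (consumer, merchant) edges, ignoring how often each occurs."""
    return {(consumer, merchant) for consumer, merchant, _ in records}

def weighted_bipartite(records):
    """Map each (consumer, merchant) edge to its number of consumption records."""
    return Counter((consumer, merchant) for consumer, merchant, _ in records)

UNWEIGHTED = unweighted_bipartite(TABLE_1)
WEIGHTED = weighted_bipartite(TABLE_1)
print(sorted(UNWEIGHTED))
print(WEIGHTED[("A6", "B4")])  # A6 appears three times at B4 in Table 1
```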
Step 103, constructing the relationship graph of consumers and merchants corresponding to the preset time period T.
The preset time period T here may be a fixed period, for example, one day, one month, etc.; and may be any set period of time, such as 3 days, 5 days, etc. And subsequently, detecting abnormal consumption behaviors according to the consumption behavior data in the preset time period T.
There are two cases:
Case one: unweighted graph
The unweighted bipartite graphs of all windows in the preset time period T may be merged to obtain the unweighted bipartite graph corresponding to the preset time period T. For example, Fig. 6 is the bipartite graph obtained by merging Fig. 2 and Fig. 3.
Case two: weighted graph
Because two adjacent windows overlap for a fixed duration, a consumption behavior record that falls within that duration may be counted twice: once by the (i-1)-th window and once again by the i-th window. Therefore, the bipartite graph already constructed for the i-th window needs to be processed as follows: compare the consumption behavior records in the overlapping duration of the i-th window and the (i-1)-th window, determine the records that are identical in both windows (that is, the consumer, the merchant, and the consumption time are all the same), and delete the relationship corresponding to each such record from the bipartite graph of the i-th window (that is, subtract 1 from the weight of the edge between that consumer and that merchant). Finally, the processed weighted bipartite graphs of all windows in the preset time period T are merged to obtain the weighted bipartite graph corresponding to the preset time period T. Fig. 7 shows the weighted bipartite graph obtained by merging Fig. 4 and Fig. 5.
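The two merge cases can be sketched as follows, using the records of Table 1 and Table 2. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and identical records are detected by comparing whole records between adjacent windows, which is equivalent to restricting the comparison to the overlapping duration because a record with the same timestamp in both windows necessarily lies in the overlap.

```python
from collections import Counter

TABLE_1 = [
    ("A1", "B2", "00:24"), ("A2", "B1", "00:30"), ("A4", "B2", "00:35"),
    ("A6", "B4", "00:40"), ("A6", "B4", "00:45"), ("A3", "B2", "01:32"),
    ("A6", "B4", "01:40"),
]
TABLE_2 = [
    ("A3", "B2", "01:32"), ("A6", "B4", "01:40"), ("A4", "B2", "01:50"),
    ("A5", "B3", "02:10"), ("A6", "B4", "02:30"), ("A3", "B2", "02:32"),
    ("A4", "B2", "02:40"),
]

def merge_unweighted(windows):
    """Case one: the union of the edge sets of all windows."""
    merged = set()
    for records in windows:
        merged |= {(c, m) for c, m, _ in records}
    return merged

def merge_weighted(windows):
    """Case two: sum edge weights, subtracting 1 in window i for each record
    that also appears identically in window i-1."""
    merged = Counter()
    previous = set()
    for records in windows:
        graph = Counter((c, m) for c, m, _ in records)
        for c, m, _ in set(records) & previous:
            graph[(c, m)] -= 1
        for edge, weight in graph.items():
            if weight > 0:
                merged[edge] += weight
        previous = set(records)
    return merged

MERGED_EDGES = merge_unweighted([TABLE_1, TABLE_2])
MERGED_WEIGHTS = merge_weighted([TABLE_1, TABLE_2])
print(sorted(MERGED_WEIGHTS.items()))
```

With these two tables, the records (A3, B2, 01:32) and (A6, B4, 01:40) appear in both windows, so each is counted once in the merged weights.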
Step 104, calculating the similarity between consumers according to the relationship graph corresponding to the preset time period T to obtain a similarity matrix of the consumers, and clustering the consumers according to the calculated similarity matrix.
Case one: based on the unweighted bipartite graph. Only whether a consumer has transacted with a merchant is considered, not the number of transactions. The similarity between any two consumers is calculated as follows:
Figure BDA0002462049710000101
for example, fig. 8 shows an unweighted bipartite graph corresponding to a certain time period T, and the similarity between consumers is calculated according to formula 1 as follows:
$w_{A1,A2} = 1/2$
$w_{A1,A3} = 1/3$
$w_{A1,A4} = 2/3$
$w_{A2,A3} = 0$
$w_{A2,A4} = 1/3$
$w_{A3,A4} = 2/3$
the resulting similarity matrix between consumers is as follows:
      A1    A2    A3    A4
A1    0     1/2   1/3   2/3
A2    1/2   0     0     1/3
A3    1/3   0     0     2/3
A4    2/3   1/3   2/3   0
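The formula image for the unweighted case is not reproduced in this text, so the following is only a sketch under explicit assumptions. It assumes the similarity is the Jaccard index of the two consumers' merchant sets (shared merchants divided by all merchants visited by either), and it uses a hypothetical graph, since Fig. 8 is not available. Under these assumptions the sketch reproduces the values of the matrix above, but that does not confirm that this is the formula used in the original.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical consumer -> merchant sets (Fig. 8 is not reproduced here).
GRAPH = {
    "A1": {"B1", "B2"},
    "A2": {"B2"},
    "A3": {"B1", "B3"},
    "A4": {"B1", "B2", "B3"},
}

def jaccard(a, b):
    """Assumed similarity: |a & b| / |a | b|, or 0 when both sets are empty."""
    union = a | b
    return Fraction(len(a & b), len(union)) if union else Fraction(0)

def similarity_matrix(graph):
    """Symmetric matrix as nested dicts; the diagonal is 0, as in the matrix above."""
    names = sorted(graph)
    sim = {u: {v: Fraction(0) for v in names} for u in names}
    for u, v in combinations(names, 2):
        sim[u][v] = sim[v][u] = jaccard(graph[u], graph[v])
    return sim

SIM = similarity_matrix(GRAPH)
for name in sorted(SIM):
    print(name, [str(SIM[name][other]) for other in sorted(SIM)])
```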
After the similarity matrix of the consumers is obtained, the consumers are clustered using a clustering algorithm (for example, DBSCAN, that is, density-based spatial clustering of applications with noise, or expectation-maximization (EM) clustering based on a Gaussian mixture model (GMM)). The input of the clustering algorithm model is:
Figure BDA0002462049710000111
Case two: based on the weighted bipartite graph. The number of transactions between a consumer and a merchant, that is, the weight of each edge, is considered. The similarity between any two consumers is:
$$w = \sum_{i=1}^{n} C_i^2$$
where n is the number of merchants in the intersection of the merchants corresponding to the two consumers, i denotes the i-th merchant in the intersection, and $C_i$ denotes the number of times the two consumers are co-linked at the i-th merchant.
Here, the number of times two consumers are co-linked at the i-th merchant is the minimum of the weights of the edges between each of the two consumers and the i-th merchant. For example, in Fig. 9, the merchant intersection of A1 and A4 contains two merchants, B1 and B2. The weights of the edges from A1 and A4 to B1 are 3 and 1, respectively, so the number of times A1 and A4 are co-linked at merchant B1 is 1 (denoted B1(1)); the weights of the edges from A1 and A4 to B2 are 2 and 2, respectively, so the number of times A1 and A4 are co-linked at merchant B2 is 2 (denoted B2(2)). The weight between A1 and A4 is therefore $w_{A1,A4} = B1(1)^2 + B2(2)^2 = 1^2 + 2^2 = 5$. As another example, consumers A2 and A3 have no merchant intersection, so the weight between them is 0.
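The weighted similarity described above, the sum over shared merchants of the squared minimum edge weight, can be sketched as follows. The edge weights of A1 and A4 at B1 and B2 are taken from the text; all other weights are hypothetical values chosen for this sketch, since Fig. 9 is not reproduced here.

```python
# consumer -> {merchant: edge weight}.
# A1 and A4 at B1/B2 follow the text; the remaining weights are assumptions.
WEIGHTED_GRAPH = {
    "A1": {"B1": 3, "B2": 2},
    "A2": {"B2": 1},
    "A3": {"B1": 3, "B3": 6},
    "A4": {"B1": 1, "B2": 2, "B3": 6},
}

def weighted_similarity(a, b):
    """Sum of min(weight_a, weight_b) squared over the merchants both consumers share."""
    shared = a.keys() & b.keys()
    return sum(min(a[m], b[m]) ** 2 for m in shared)

W_A1_A4 = weighted_similarity(WEIGHTED_GRAPH["A1"], WEIGHTED_GRAPH["A4"])
W_A2_A3 = weighted_similarity(WEIGHTED_GRAPH["A2"], WEIGHTED_GRAPH["A3"])
print(W_A1_A4, W_A2_A3)
```

Consumers with no shared merchant produce an empty sum, which gives the similarity of 0 stated for A2 and A3.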
Fig. 9 shows a weighted bipartite graph corresponding to a certain time period T, and the similarity between consumers is calculated according to formula 2:
$w_{A1,A2} = 1$
$w_{A1,A3} = 9$
$w_{A1,A4} = 5$
$w_{A2,A3} = 0$
$w_{A2,A4} = 1$
$w_{A3,A4} = 37$
the resulting similarity matrix between consumers is as follows:
      A1    A2    A3    A4
A1    0     1     9     5
A2    1     0     0     1
A3    9     0     0     37
A4    5     1     37    0
After the similarity matrix of the consumers is obtained, the consumers are clustered using a clustering algorithm (for example, DBSCAN, that is, density-based spatial clustering of applications with noise, or expectation-maximization (EM) clustering based on a Gaussian mixture model (GMM)). The input of the clustering algorithm model is:
Figure BDA0002462049710000121
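To show how a similarity matrix can drive a grouping of consumers, the sketch below swaps in a much simpler technique than the DBSCAN or GMM-based EM clustering named in the text: consumers are linked whenever their similarity reaches a threshold, and each connected group forms one cluster. The threshold value and the function name are assumptions of this sketch, and the result is not claimed to match what DBSCAN or EM would produce.

```python
NAMES = ["A1", "A2", "A3", "A4"]
# Weighted similarity matrix from the text above, as nested dicts.
ROWS = [[0, 1, 9, 5], [1, 0, 0, 1], [9, 0, 0, 37], [5, 1, 37, 0]]
SIM = {u: dict(zip(NAMES, row)) for u, row in zip(NAMES, ROWS)}

def threshold_clusters(names, sim, threshold):
    """Stand-in clustering: connected components of the graph that links
    two consumers when their similarity is at least `threshold`."""
    unvisited = set(names)
    clusters = []
    for name in names:
        if name not in unvisited:
            continue
        unvisited.discard(name)
        stack, members = [name], []
        while stack:
            u = stack.pop()
            members.append(u)
            linked = [v for v in unvisited if sim[u][v] >= threshold]
            for v in linked:
                unvisited.discard(v)
                stack.append(v)
        clusters.append(sorted(members))
    return clusters

CLUSTERS = threshold_clusters(NAMES, SIM, threshold=5)  # hypothetical threshold
print(CLUSTERS)
```

Every consumer ends up in exactly one group, so this stand-in also satisfies the property that a consumer belongs to a single cluster within a time period T.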
Through the above process, the consumers can be clustered into several clusters, each containing at least one consumer. Consumers assigned to the same cluster have all transacted with the same merchant or the same few merchants and have a high degree of similarity to one another. It should be noted that, in the present invention, each consumer belongs to exactly one cluster within a time period T.
Step 105, identifying abnormal consumption behaviors based on the clustered consumer data.
In the invention, the identification of the abnormal consumption behavior can comprise identification of abnormal clusters and identification of abnormal consumers. Wherein:
Abnormal cluster identification comprises: within a preset time period T, counting the average number Meanall of consumption behavior records over all consumers and the average number Meani of consumption behavior records in each cluster, comparing the Meani of each cluster with Meanall, and, when a preset condition is met, determining that the cluster corresponding to that Meani is an abnormal cluster.
For example, in the time period T, assume that the i-th cluster (Di) contains CountDi consumers and that the total number of consumption records in Di is Si (that is, the sum of the consumption records generated by each consumer in Di during T). The average number of consumption records of Di is then:
Meani = Si/CountDi.
For example, assuming that the 1st cluster D1 contains 5 consumers whose total number of consumption records in the time period T is S1 = 125, the average number of consumption records of cluster D1 is Mean1 = 125/5 = 25.
The average number Meanall of consumption records over all consumers in the time period T is then computed (that is, the sum of the consumption records of all consumers in T divided by the number of consumers).
Meani is compared with Meanall, for example by determining whether Meani > α × Meanall,
where α is an empirical value. Taking α = 5 as an example, when the average number of consumption records (Meani) of the i-th cluster is greater than 5 times the overall average (Meanall), the i-th cluster is determined to be an abnormal cluster.
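The comparison Meani > α × Meanall can be sketched as follows. Cluster D1 follows the example in the text (5 consumers, 125 records); cluster D2, the consumer names, and the function name are hypothetical additions so that an overall mean can be computed.

```python
def abnormal_clusters(clusters, record_counts, alpha=5.0):
    """Return (Meanall, ids of clusters whose Meani exceeds alpha * Meanall).

    clusters: cluster id -> list of consumers in that cluster
    record_counts: consumer -> number of consumption records in the period T
    """
    mean_all = sum(record_counts.values()) / len(record_counts)
    flagged = []
    for cluster_id, members in clusters.items():
        mean_i = sum(record_counts[c] for c in members) / len(members)
        if mean_i > alpha * mean_all:
            flagged.append(cluster_id)
    return mean_all, flagged

# D1: 5 consumers with 25 records each (S1 = 125, Mean1 = 25), as in the text.
# D2: 95 hypothetical consumers with 1 record each.
CLUSTERS = {
    "D1": [f"x{i}" for i in range(5)],
    "D2": [f"y{i}" for i in range(95)],
}
COUNTS = {c: 25 for c in CLUSTERS["D1"]}
COUNTS.update({c: 1 for c in CLUSTERS["D2"]})

MEAN_ALL, FLAGGED = abnormal_clusters(CLUSTERS, COUNTS, alpha=5.0)
print(MEAN_ALL, FLAGGED)
```

In this sketch Meanall is 220/100 = 2.2, so the threshold is 11 and only D1, with a mean of 25, is flagged.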
Of course, abnormal clusters can be identified more accurately by applying additional rules. For example, it can further be determined whether the average transaction amount of the cluster is abnormal, whether the number of consumption records in the cluster whose amount is an integer multiple of a preset amount is abnormal, and so on. If multiple metrics for a cluster are all abnormal, the cluster may be identified as an abnormal cluster.
Anomalous consumers may further be identified based on the identified anomalous clusters.
In actual use of the model there are many time periods T, and the data analysis process described above can be performed for each of them, so that the abnormal clusters and normal clusters in each time period T, and the consumers contained in each, can be determined.
Abnormal consumer identification for any consumer x may include:
determining the total number CountR of clusters within the multiple time periods T and the total number Countannoml of abnormal clusters to which consumer x belongs. The abnormal cluster percentage corresponding to consumer x is Px = Countannoml/CountR; if Px > γ, consumer x is an abnormal consumer. Here γ is a value to be set according to the specific scenario. Taking γ = 80% as an example, if the abnormal clusters to which a consumer belongs account for more than 80%, the consumer is considered an abnormal consumer.
Further, the following analysis can also be performed. Let Pnum be the percentage of abnormal consumers in a cluster, and let k be a value to be set according to the specific scenario. Taking k = 70% as an example, if Pnum > k, that is, if more than 70% of the consumers in a cluster are abnormal consumers, the cluster is defined as a special abnormal cluster, and all consumers in the cluster are stored in the black-industry (illicit activity) database.
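The two percentage checks, Px > γ and Pnum > k, can be sketched as follows. The text does not fully specify how CountR is counted; this sketch assumes it is the number of clusters consumer x was assigned to across the periods (one per period in which x appears). The function names and the sample data are likewise hypothetical.

```python
def abnormal_consumers(periods, gamma=0.8):
    """periods: one dict per time period T, mapping consumer -> True if the
    cluster the consumer belonged to in that period was abnormal.
    Returns the consumers whose abnormal cluster percentage Px exceeds gamma."""
    total, abnormal = {}, {}
    for assignment in periods:
        for consumer, in_abnormal_cluster in assignment.items():
            total[consumer] = total.get(consumer, 0) + 1
            abnormal[consumer] = abnormal.get(consumer, 0) + int(in_abnormal_cluster)
    return {c for c in total if abnormal[c] / total[c] > gamma}

def is_special_abnormal_cluster(members, flagged, k=0.7):
    """True if the share Pnum of abnormal consumers in the cluster exceeds k."""
    return sum(m in flagged for m in members) / len(members) > k

# Five hypothetical periods: x is always in an abnormal cluster, y in four
# of five, z never.
PERIODS = [{"x": True, "y": True, "z": False} for _ in range(4)]
PERIODS.append({"x": True, "y": False, "z": False})

FLAGGED = abnormal_consumers(PERIODS, gamma=0.8)
print(FLAGGED)
print(is_special_abnormal_cluster(["x", "y"], FLAGGED, k=0.7))
```

Because the comparison is strict, consumer y, whose percentage is exactly 80%, is not flagged in this sketch.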
The scheme of the invention has the following advantages:
First, from the business perspective:
First, adjacent windows overlap, which improves the accuracy of the similarity statistics. Because the similarity matrix between consumers is generated from bipartite graphs, suppose a traditional sliding window sampling method without overlap is used: (0,1), (2,3), (4,5), …. If consumer 1 has a card-swipe or code-scan transaction with merchant A at 01:59:59 and consumer 2 has one with merchant A at 02:00:01, the behaviors of consumer 1 and consumer 2 are very close in time and should be counted as similar, but traditional sliding window sampling misses this case. Sampling with overlapping windows, (0,1), (1,2), (2,3), …, is therefore more accurate.
Second, the large time period T is divided into finer-grained windows in order to capture the temporal similarity of consumption behaviors. Abnormal group consumption is often completed within a short time, and judging whether a group of consumers is highly similar also depends on whether their behaviors are highly consistent within a short time. For example, in a cash-out scenario, if a batch of consumers all swipe cards at multiple merchants within one window, the similarity of this batch of consumers is very high. Conversely, suppose the large time period is not divided, taking T = 24 h (one day) as an example, and consumer 1 has a card-swipe or code-scan transaction with merchant A early in the period while consumer 2 has one with merchant A at 23:59:59. Intuitively, the similarity of these two behaviors is far lower than that of two behaviors occurring in the same small window: the two consumers' transactions at merchant A are almost 24 hours apart and in theory should not be related. Yet if the time period T is not divided into small windows, consumer 1 and consumer 2 are still counted as similar. This introduces significant noise and greatly reduces the accuracy of identifying truly abnormal consumers.
Second, from the engineering perspective: dividing a large time period T into several small windows allows the computation for each window to be parallelized, which speeds up the overall algorithm.
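As a rough illustration of the engineering point, the per-window work can be handed to a worker pool because no window depends on another. The sketch below uses Python's standard `concurrent.futures.ThreadPoolExecutor`; the function names and the `(consumer, merchant, timestamp)` record layout are assumptions, and a real CPU-bound workload would more likely use a process pool or a distributed engine, since threads in CPython mainly help with I/O-bound work.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def build_window_graph(records):
    """Build weighted consumer-merchant edge counts for one window.

    records: iterable of (consumer, merchant, timestamp) tuples.
    """
    return Counter((consumer, merchant) for consumer, merchant, _ in records)


def build_all_window_graphs(records_per_window, max_workers=4):
    """Build the graph of every window concurrently, preserving window order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(build_window_graph, records_per_window))


records_per_window = [
    [("u1", "A", 1), ("u2", "A", 2), ("u1", "A", 3)],  # window 0
    [("u2", "B", 61)],                                  # window 1
]
graphs = build_all_window_graphs(records_per_window)
print(graphs[0][("u1", "A")], graphs[1][("u2", "B")])
```

`pool.map` returns results in input order, so the list index still identifies the window when the per-window graphs are later merged.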
In order to implement the method, an embodiment of the present invention further provides a device for analyzing consumption behavior data, as shown in fig. 10, including:
a data acquisition module 10, configured to collect consumption behavior data in a sliding window mode; the consumption behavior data collected by each window comprises at least one consumption behavior record; each consumption behavior record includes: consumer information, merchant information, and time of occurrence of a consumption action;
a relationship graph building module 20, configured to build a relationship graph of the consumer and the merchant corresponding to each window according to the consumption behavior data collected by each window, and further configured to build a relationship graph of the consumer and the merchant corresponding to the preset time period T;
and a data analysis module 30, configured to calculate the similarity between consumers according to the relationship graph of the consumer and the merchant corresponding to the preset time period T to obtain a similarity matrix of the consumers, and to cluster the consumers according to the similarity matrix.
All windows have the same length, and two adjacent windows overlap by a preset duration that is smaller than the window length.
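The effect of the overlap can be checked with a small sketch. It rests on an interpretation that is an assumption here: the notation (0,1), (1,2), … from the description is read as two-hour windows that advance by one hour, so the overlap equals the window length minus the step. The helper names and the calendar date are hypothetical; the two timestamps are the 01:59:59 and 02:00:01 transactions from the description's example.

```python
from datetime import datetime, timedelta


def make_windows(start, end, length, step):
    """Return equal-length [window_start, window_end) pairs starting in [start, end).

    step < length gives overlapping windows; step == length gives none.
    """
    if step <= timedelta(0) or step > length:
        raise ValueError("step must be positive and no larger than length")
    windows = []
    t = start
    while t < end:
        windows.append((t, t + length))
        t += step
    return windows


def shared_windows(windows, t1, t2):
    """Return the windows that contain both timestamps."""
    return [(s, e) for s, e in windows if s <= t1 < e and s <= t2 < e]


day = datetime(2020, 4, 22)  # arbitrary date, chosen only for the example
t1 = day + timedelta(hours=1, minutes=59, seconds=59)  # consumer 1 at 01:59:59
t2 = day + timedelta(hours=2, seconds=1)               # consumer 2 at 02:00:01
two_hours, one_hour = timedelta(hours=2), timedelta(hours=1)

no_overlap = make_windows(day, day + timedelta(hours=6), two_hours, two_hours)
overlap = make_windows(day, day + timedelta(hours=6), two_hours, one_hour)

print(len(shared_windows(no_overlap, t1, t2)))  # no window holds both
print(len(shared_windows(overlap, t1, t2)))     # the 01:00-03:00 window holds both
```

Without overlap the two transactions land in different windows and are never compared; with a one-hour overlap they share the 01:00 to 03:00 window.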
The relationship graph of the consumer and the merchant is a bipartite graph;
the bipartite graph is an unweighted graph or a weighted graph;
the unweighted graph indicates that consumption behavior has occurred between the consumer and the merchant;
the weighted graph indicates that consumption behavior has occurred between the consumer and the merchant and the number of times it has occurred.
The relationship graph building module 20 is further configured to:
combining the unweighted bipartite graphs of all windows in the preset time period T to obtain a bipartite graph corresponding to the preset time period T; or,
performing the following processing on the weighted bipartite graph corresponding to each window contained in the preset time period T: comparing the consumption behavior records within the overlapping duration of the i-th window and the (i-1)-th window, identifying the consumption behavior records that appear identically in both windows within the overlapping duration, and deleting the relationships corresponding to those records from the weighted bipartite graph of the i-th window; and combining the processed weighted bipartite graphs of the windows to obtain the weighted bipartite graph corresponding to the preset time period T.
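The two merging options above could be sketched as follows. This is an illustration under stated assumptions rather than the patented procedure: records are modeled as `(consumer, merchant, timestamp)` tuples that uniquely identify a transaction, graphs are represented as an edge set (unweighted) or a `Counter` of edge counts (weighted), and the function names are hypothetical.

```python
from collections import Counter


def merge_unweighted(window_records):
    """Union the (consumer, merchant) edges of all windows in period T."""
    edges = set()
    for records in window_records:
        edges.update((consumer, merchant) for consumer, merchant, _ in records)
    return edges


def merge_weighted(window_records):
    """Sum edge counts over windows, counting each record only once.

    A record of window i that also appears in window i-1 lies in their
    overlap, so it is dropped from window i before the counts are added.
    """
    total = Counter()
    previous = set()
    for records in window_records:
        current = set(records)
        for consumer, merchant, _ in current - previous:
            total[(consumer, merchant)] += 1
        previous = current
    return total


# Windows [0, 120) and [60, 180) in minutes; the record at t=90 is in both.
w0 = [("u1", "A", 30), ("u2", "A", 90)]
w1 = [("u2", "A", 90), ("u2", "A", 150)]
print(sorted(merge_unweighted([w0, w1])))
print(merge_weighted([w0, w1]))
```

The unweighted merge needs no deduplication because a set union ignores repeats, whereas the weighted merge would double count the overlap record for `("u2", "A")` if it were not removed from the later window.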
The data analysis module 30 is further configured to calculate the similarity between consumers according to the unweighted bipartite graph of consumers and merchants corresponding to the preset time period T, including:
the similarity w between two consumers is:
Figure BDA0002462049710000161
The data analysis module 30 is further configured to calculate the similarity between consumers according to the weighted bipartite graph of consumers and merchants corresponding to the preset time period T, including:
determining the weight between two consumers
Figure BDA0002462049710000162
as:
Figure BDA0002462049710000163
where n is the number of merchants in the intersection of the merchants corresponding to the two consumers, i denotes the i-th merchant in the intersection, and C_i represents the number of times the two consumers co-occur at the i-th merchant;
according to the weight between the two consumers
Figure BDA0002462049710000164
the similarity w between the two consumers is determined as:
Figure BDA0002462049710000165
Clustering the consumers yields at least one cluster, each cluster including at least one consumer. The data analysis module 30 is further configured to: within the preset time period T, compute the average Meanall of the number of consumption behavior records over all consumers and the average number Meani of consumption behavior records for each cluster, compare the Meani of each cluster with Meanall, and, when a preset condition is met, determine that the cluster corresponding to that Meani is an abnormal cluster.
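The comparison of Meani with Meanall might look like the sketch below. The passage does not say what the preset condition is, so the rule used here (Meani at least `ratio` times Meanall, with `ratio = 2.0`) is purely a hypothetical stand-in, as are the function name and the data layout.

```python
def find_abnormal_clusters(clusters, record_counts, ratio=2.0):
    """Flag clusters whose mean record count is at least ratio * Meanall.

    clusters: dict mapping a cluster id to the list of consumer ids in it.
    record_counts: dict mapping a consumer id to its number of records in T.
    ratio: HYPOTHETICAL stand-in for the unspecified "preset condition".
    """
    mean_all = sum(record_counts.values()) / len(record_counts)
    abnormal = []
    for cluster_id, members in clusters.items():
        # Each cluster is stated to contain at least one consumer.
        mean_i = sum(record_counts[m] for m in members) / len(members)
        if mean_i >= ratio * mean_all:
            abnormal.append(cluster_id)
    return mean_all, abnormal


record_counts = {"u1": 20, "u2": 24, "u3": 2, "u4": 1, "u5": 3}
clusters = {"c1": ["u1", "u2"], "c2": ["u3", "u4", "u5"]}
mean_all, abnormal = find_abnormal_clusters(clusters, record_counts)
print(mean_all, abnormal)
```

Here Meanall is 10 records per consumer, cluster `c1` averages 22 and is flagged, and cluster `c2` averages 2 and is not. Any other preset condition can be substituted by changing the single comparison line.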
The data analysis module 30 is further configured to:
counting the abnormal clusters in a plurality of preset time periods T;
for each consumer, counting the number of abnormal clusters to which the consumer belongs within the plurality of preset time periods T;
determining the abnormal cluster percentage corresponding to the consumer according to the number of abnormal clusters to which the consumer belongs and the total number of clusters in the preset time periods T;
and if the abnormal cluster percentage corresponding to the consumer reaches a preset threshold, determining that the consumer is an abnormal consumer.
Based on the scheme provided by the embodiment of the invention, the analysis is performed on the collected consumption behavior data and does not rely on expert experience, so abnormal clusters and abnormal consumers can be identified quickly and accurately, and abnormal consumption behaviors such as cash-out are identified more efficiently.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (16)

1. A method for analyzing consumption behavior data, comprising:
collecting consumption behavior data in a sliding window mode; the consumption behavior data collected by each window comprises at least one consumption behavior record; each consumption behavior record includes: consumer information, merchant information, and time of occurrence of a consumption action;
constructing a relationship graph of the consumer and the merchant corresponding to each window according to the consumption behavior data acquired by each window;
constructing a relation graph of a consumer and a merchant corresponding to a preset time period T;
calculating the similarity between the consumers according to the relation graph of the consumers and the merchants corresponding to the preset time period T to obtain a similarity matrix of the consumers, and clustering the consumers according to the similarity matrix.
2. The method of claim 1,
all windows have the same length, and two adjacent windows overlap by a preset duration that is smaller than the window length.
3. The method of claim 2,
the relationship graph of the consumer and the merchant is a bipartite graph;
the bipartite graph is an unweighted graph or a weighted graph;
the unweighted graph indicates that consumption behavior has occurred between the consumer and the merchant;
the weighted graph indicates that consumption behavior has occurred between the consumer and the merchant and the number of times it has occurred.
4. The method according to claim 3, wherein the constructing a relationship graph between the consumer and the merchant corresponding to the preset time period T comprises:
combining the unweighted bipartite graphs of all windows in the preset time period T to obtain a bipartite graph corresponding to the preset time period T; or,
performing the following processing on the weighted bipartite graph corresponding to each window contained in the preset time period T: comparing the consumption behavior records within the overlapping duration of the i-th window and the (i-1)-th window, identifying the consumption behavior records that appear identically in both windows within the overlapping duration, and deleting the relationships corresponding to those records from the weighted bipartite graph of the i-th window; and combining the processed weighted bipartite graphs of the windows to obtain the weighted bipartite graph corresponding to the preset time period T.
5. The method according to claim 4, wherein calculating the similarity between consumers according to the unweighted bipartite graph of consumers and merchants corresponding to the preset time period T comprises:
the similarity w between two consumers is:
Figure FDA0002462049700000021
6. The method according to claim 4, wherein calculating the similarity between consumers according to the weighted bipartite graph of consumers and merchants corresponding to the preset time period T comprises:
determining the similarity w between two consumers as:
Figure FDA0002462049700000022
where n is the number of merchants in the intersection of the merchants corresponding to the two consumers, i denotes the i-th merchant in the intersection, and C_i indicates the number of times the two consumers co-occur at the i-th merchant.
7. The method of claim 5 or 6, wherein clustering the consumers results in at least one cluster, each cluster including at least one consumer, the method further comprising:
within the preset time period T, computing the average Meanall of the number of consumption behavior records over all consumers and the average number Meani of consumption behavior records for each cluster, comparing the Meani of each cluster with Meanall, and, when a preset condition is met, determining that the cluster corresponding to that Meani is an abnormal cluster.
8. The method of claim 7, further comprising:
counting the abnormal clusters in a plurality of preset time periods T;
for each consumer, counting the number of abnormal clusters to which the consumer belongs within the plurality of preset time periods T;
determining the abnormal cluster percentage corresponding to the consumer according to the number of abnormal clusters to which the consumer belongs and the total number of clusters in the preset time periods T;
and if the abnormal cluster percentage corresponding to the consumer reaches a preset threshold, determining that the consumer is an abnormal consumer.
9. An apparatus for analyzing consumption behavior data, comprising:
the data acquisition module is used for acquiring consumption behavior data in a sliding window mode; the consumption behavior data collected by each window comprises at least one consumption behavior record; each consumption behavior record includes: consumer information, merchant information, and time of occurrence of a consumption action;
the relational graph building module is used for building a relational graph of the consumer and the merchant corresponding to each window according to the consumption behavior data collected by each window; the system is also used for constructing a relation graph of the consumer and the merchant corresponding to the preset time period T;
and the data analysis module is used for calculating the similarity between the consumers according to the relationship graph of the consumers and the merchants corresponding to the preset time period T to obtain a similarity matrix of the consumers, and clustering the consumers according to the similarity matrix.
10. The apparatus of claim 9,
all windows have the same length, and two adjacent windows overlap by a preset duration that is smaller than the window length.
11. The apparatus of claim 10,
the relationship graph of the consumer and the merchant is a bipartite graph;
the bipartite graph is an unweighted graph or a weighted graph;
the unweighted graph indicates that consumption behavior has occurred between the consumer and the merchant;
the weighted graph indicates that consumption behavior has occurred between the consumer and the merchant and the number of times it has occurred.
12. The apparatus of claim 11,
the relationship graph building module is further configured to:
combining the unweighted bipartite graphs of all windows in the preset time period T to obtain a bipartite graph corresponding to the preset time period T; or,
performing the following processing on the weighted bipartite graph corresponding to each window contained in the preset time period T: comparing the consumption behavior records within the overlapping duration of the i-th window and the (i-1)-th window, identifying the consumption behavior records that appear identically in both windows within the overlapping duration, and deleting the relationships corresponding to those records from the weighted bipartite graph of the i-th window; and combining the processed weighted bipartite graphs of the windows to obtain the weighted bipartite graph corresponding to the preset time period T.
13. The apparatus of claim 12,
the data analysis module is further configured to calculate the similarity between consumers according to the unweighted bipartite graph of consumers and merchants corresponding to the preset time period T, including:
the similarity w between two consumers is:
Figure FDA0002462049700000041
14. The apparatus of claim 12,
the data analysis module is further configured to calculate the similarity between consumers according to the weighted bipartite graph of consumers and merchants corresponding to the preset time period T, including:
the similarity w between two consumers is determined as:
Figure FDA0002462049700000042
where n is the number of merchants in the intersection of the merchants corresponding to the two consumers, i denotes the i-th merchant in the intersection, and C_i indicates the number of times the two consumers co-occur at the i-th merchant.
15. The apparatus according to claim 13 or 14, wherein clustering of the consumers results in at least one cluster, each cluster comprising at least one consumer;
the data analysis module is further configured to: within the preset time period T, compute the average Meanall of the number of consumption behavior records over all consumers and the average number Meani of consumption behavior records for each cluster, compare the Meani of each cluster with Meanall, and, when a preset condition is met, determine that the cluster corresponding to that Meani is an abnormal cluster.
16. The apparatus of claim 8,
the data analysis module is further configured to:
counting the abnormal clusters in a plurality of preset time periods T;
for each consumer, counting the number of abnormal clusters to which the consumer belongs within the plurality of preset time periods T;
determining the abnormal cluster percentage corresponding to the consumer according to the number of abnormal clusters to which the consumer belongs and the total number of clusters in the preset time periods T;
and if the abnormal cluster percentage corresponding to the consumer reaches a preset threshold, determining that the consumer is an abnormal consumer.
CN202010322724.3A 2020-04-22 2020-04-22 Consumption behavior data analysis method and device Pending CN111626842A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010322724.3A CN111626842A (en) 2020-04-22 2020-04-22 Consumption behavior data analysis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010322724.3A CN111626842A (en) 2020-04-22 2020-04-22 Consumption behavior data analysis method and device

Publications (1)

Publication Number Publication Date
CN111626842A true CN111626842A (en) 2020-09-04

Family

ID=72260967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010322724.3A Pending CN111626842A (en) 2020-04-22 2020-04-22 Consumption behavior data analysis method and device

Country Status (1)

Country Link
CN (1) CN111626842A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112232888A (en) * 2020-11-06 2021-01-15 深圳市护家科技有限公司 Intelligent analysis system and method for consumer behaviors
CN113450142A (en) * 2021-06-09 2021-09-28 重庆锦禹云能源科技有限公司 Clustering analysis method and device for power consumption behaviors of power customers
CN113506113A (en) * 2021-06-02 2021-10-15 北京顶象技术有限公司 Credit card cash-registering group-partner mining method and system based on associated network
JP2022044185A (en) * 2020-09-07 2022-03-17 セカンドサイトアナリティカ株式会社 Information processing system and information processing method
CN115129988A (en) * 2022-06-29 2022-09-30 北京达佳互联信息技术有限公司 Information acquisition method, device, electronic device and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020077790A1 (en) * 2000-12-18 2002-06-20 Mikael Bisgaard-Bohr Analysis of retail transactions using gaussian mixture models in a data mining system
EP2840542A2 (en) * 2013-08-19 2015-02-25 Compass Plus (GB) Limited Method and system for detection of fraudulent transactions
CN107526667A (en) * 2017-07-28 2017-12-29 阿里巴巴集团控股有限公司 A kind of Indexes Abnormality detection method, device and electronic equipment
CN107545422A (en) * 2017-08-02 2018-01-05 中国银联股份有限公司 A kind of arbitrage detection method and device
US20180218369A1 (en) * 2017-02-01 2018-08-02 Google Inc. Detecting fraudulent data
CN109191107A (en) * 2018-06-29 2019-01-11 阿里巴巴集团控股有限公司 Transaction abnormality recognition method, device and equipment
WO2019046344A1 (en) * 2017-08-29 2019-03-07 Paypal, Inc. Rapid online clustering
CN109978538A (en) * 2017-12-28 2019-07-05 阿里巴巴集团控股有限公司 Determine fraudulent user, training pattern, the method and device for identifying risk of fraud
CN110348850A (en) * 2019-05-28 2019-10-18 深圳壹账通智能科技有限公司 The arbitrage risk checking method and device, electronic equipment of polymerization payment trade company
CN110517097A (en) * 2019-09-09 2019-11-29 平安普惠企业管理有限公司 Identify method, apparatus, equipment and the storage medium of abnormal user
CN110717828A (en) * 2019-09-09 2020-01-21 中国科学院计算技术研究所 A method and system for abnormal account detection based on frequent transaction mode

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2022044185A (en) * 2020-09-07 2022-03-17 セカンドサイトアナリティカ株式会社 Information processing system and information processing method
JP7316984B2 (en) 2020-09-07 2023-07-28 セカンドサイトアナリティカ株式会社 Information processing system and information processing method
CN112232888A (en) * 2020-11-06 2021-01-15 深圳市护家科技有限公司 Intelligent analysis system and method for consumer behaviors
CN112232888B (en) * 2020-11-06 2021-05-14 深圳市护家科技有限公司 Intelligent analysis system and method for consumer behaviors
CN113506113A (en) * 2021-06-02 2021-10-15 北京顶象技术有限公司 Credit card cash-registering group-partner mining method and system based on associated network
CN113450142A (en) * 2021-06-09 2021-09-28 重庆锦禹云能源科技有限公司 Clustering analysis method and device for power consumption behaviors of power customers
CN113450142B (en) * 2021-06-09 2023-04-18 重庆锦禹云能源科技有限公司 Clustering analysis method and device for power consumption behaviors of power customers
CN115129988A (en) * 2022-06-29 2022-09-30 北京达佳互联信息技术有限公司 Information acquisition method, device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200904)