CN106156026B - A method for online anomaly detection of virtual assets based on data flow - Google Patents
- Publication number
- CN106156026B CN201510130123.1A
- Authority
- CN
- China
- Prior art keywords
- data
- user
- abnormal
- behavior
- behavior pattern
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Debugging And Monitoring (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data-stream-based method for online anomaly detection of virtual assets, consisting of three parts: data processing, offline analysis, and online analysis. A stream of user operation logs flows into a data window, where it is preprocessed to extract a data summary. The data in the database is periodically mined with a pattern generation algorithm to obtain each user's normal and abnormal behavior patterns. The system analyzes the data in the sliding window in real time, extracts the current behavior pattern, and matches it against the normal and abnormal behavior patterns in the pattern library. By applying data stream techniques to anomaly detection for virtual assets and designing a technical framework for it, the invention lets the system detect anomalies in real time more quickly and effectively, and thus better protects users from losses.
Description
Technical Field
The invention belongs to the field of Internet technology and specifically relates to a data-stream-based method for online anomaly detection of virtual assets.
Background
The rapid development of the Internet has driven the growth of e-commerce, and trading in virtual assets has grown especially quickly. Virtual assets are items in the online world that are competitive, persistent, and can be exchanged or traded, including online banking, online accounts, online game equipment and weapons, virtual currency, and so on.
China has already begun research on eID-based management and preservation technology for virtual assets in cyberspace, aiming at standardized, unified management of virtual assets. A virtual asset preservation system records every operation on virtual assets comprehensively and accurately, but mining abnormal transaction behavior from these records still faces many challenges. Because online virtual asset transaction data is huge and grows very quickly, there is an urgent need to discover and predict abnormal behavior in this mass of data automatically, so that crimes that have occurred or may occur can be detected effectively.
The main goal of anomaly detection is to train and build a detection model from known abnormal data. Approaches include statistical, information-theoretic, spectral, and machine-learning-based techniques; the machine-learning-based techniques are in turn mainly clustering-based, classification-based, and sequential-pattern-based. Clustering-based detection can only be used for offline analysis: after all the data is clustered, clusters whose member count falls below a threshold are treated as anomalies. Its advantage is that it does not require labeled historical data. In essence, anomaly detection can also be regarded as a classification problem in which data is classified as normal or abnormal: classification-based techniques train a classifier on labeled historical data and then use it to classify newly arriving data. Sequential-pattern-based detection mines normal and abnormal behavior patterns from users' time-ordered operation data, then extracts a behavior pattern from a user's new data and matches it against the normal and abnormal patterns in the database to decide whether the current operation is abnormal.
Quan Yong et al. [1] proposed an anomaly detection method for e-commerce transaction logs based on the co-occurrence matrix. The algorithm models a user's transaction behavior with a co-occurrence matrix and builds a co-occurrence matrix space with PCA, from which the user's normal transaction patterns are obtained. In the detection stage, the co-occurrence matrix generated from the data under test is corrected and the user's transaction pattern is extracted; the distance between this pattern and the user's normal patterns is computed with the matrix 2-norm and used to decide whether the transaction behavior is abnormal.
Ji Bingshuai et al. [2] proposed another method for detecting abnormal user behavior in e-commerce. User behavior log data is first split, according to its characteristics, into a static attribute set and an operation sequence set. An axis-attribute-based Apriori algorithm and the GSP sequential pattern mining algorithm are then applied to these two kinds of data sets, respectively, and the user's normal behavior patterns are built from the results. Finally, an order-based pattern comparison method matches the user's current behavior pattern against the historical normal patterns to decide whether the transaction behavior is abnormal.
Zhao Xueliang [3] proposed an outlier detection method for data streams based on the sliding window model. It uses a simple sliding window to manage the replacement of old data by new data in the stream, and its data structure reduces the computation needed for neighbor-set statistics, which gives the algorithm good performance.
However, the first two methods [1, 2] both analyze virtual asset data offline. Offline analysis works on historical data: only after abnormal data is found is it traced back to locate the source of the anomaly, so timeliness is poor.
The outliers found by the third method [3] are anomalies within the current sliding window rather than global anomalies, and that work does not provide a framework for data-stream-based outlier detection.
[1] Quan Yong, Li Shudong, Jia Yan, et al. Anomaly detection of e-commerce transaction logs based on co-occurrence matrix [J]. China Electronic Commerce: Communication Market, 2013(4): 39-45.
[2] Ji Bingshuai, Li Hu, Han Weihong, et al. Research on abnormal user behavior detection for e-commerce [J]. Information Network Security, 2014(9): 80-85.
[3] Zhao Xueliang. Research on data stream outlier detection based on the sliding window model [D]. Chongqing University, 2012.
Summary of the Invention
To address the above problems, the invention provides a data-stream-based method for online anomaly detection of virtual assets that detects anomalies in real time and is suited to real-time detection of abnormal behavior in virtual asset operations.
The technical solution of the invention is as follows.
A data-stream-based method for online anomaly detection of virtual assets comprises the following steps:
(1) Data processing: the user operation log stream flows into a data window; a data summary is extracted by preprocessing the data in the window; data that has been processed flows straight out of the window and is written to permanent storage.
(2) Offline analysis: the data in the database is processed periodically, and a pattern generation algorithm mines each user's normal and abnormal behavior patterns.
(3) Online analysis: the system analyzes the data in the sliding window in real time, extracts the current behavior pattern, and matches it against the normal and abnormal behavior patterns in the pattern library; if the behavior is judged abnormal, an alarm is raised.
Step (2) comprises the following sub-steps:
1. Data storage: data flowing from the data window into permanent storage is labeled as normal behavior by default. When the real-time analysis module detects that a user's operation is abnormal, the label of the corresponding data in the database is adjusted. Labels are also adjusted through manual feedback: for example, if the system judges a user's behavior abnormal and raises an alarm that is later manually confirmed to be a false alarm, this information must be fed back to the database to correct the labels of the corresponding data. The massive volume of user operation data for virtual assets is generally stored in a NoSQL database such as Cassandra.
2. Pattern generation: the system periodically runs a pattern generation algorithm over the data in the offline analysis module's database to obtain a normal behavior pattern library and an abnormal behavior pattern library for each user. Several algorithms can be used for pattern generation, such as association rules, sequential patterns, spectral theory, and spatiotemporal sequence mining.
3. Pattern update: when patterns are recomputed from the data in the database, only the operation data recorded before each user's most recent logout is analyzed.
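The default-label scheme in sub-step 1 can be illustrated with a minimal sketch. It is not the patent's implementation: the `LabelStore` class, its method names, and the record IDs are hypothetical, and a plain dictionary stands in for the NoSQL database.

```python
from dataclasses import dataclass, field

NORMAL, ABNORMAL = "normal", "abnormal"


@dataclass
class LabelStore:
    """Hypothetical in-memory stand-in for the operation log database."""
    labels: dict = field(default_factory=dict)  # record ID -> label

    def store(self, record_id):
        # Records leaving the data window are written as normal by default,
        # so they never wait for the detector's verdict.
        self.labels[record_id] = NORMAL

    def mark_abnormal(self, record_ids):
        # Called when the real-time analysis module flags an operation.
        for record_id in record_ids:
            self.labels[record_id] = ABNORMAL

    def confirm_false_alarm(self, record_ids):
        # Manual feedback: a reviewer confirmed the alarm was wrong.
        for record_id in record_ids:
            self.labels[record_id] = NORMAL


store = LabelStore()
for rid in ("r1", "r2", "r3"):        # hypothetical record IDs
    store.store(rid)
store.mark_abnormal(["r2", "r3"])     # detector flags two records
store.confirm_false_alarm(["r3"])     # reviewer overrules one alarm
```

After this run only `r2` remains labeled abnormal, which is the state the next periodic pattern computation would see.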
Step (3) comprises the following sub-steps:
1) Extracting the data summary: only data between a user's login and logout is processed, and a timestamp is recorded only for the login operation. This saves memory without losing important information, and the data structure used is convenient for subsequent computation.
2) Extracting the current user behavior pattern: each time new operation data arrives for a user, the current behavior pattern is extracted from that user's data summary.
3) Behavior pattern matching: the extracted pattern is matched against the normal and abnormal behavior pattern libraries generated by the offline analysis module.
Further, sub-step 1) comprises the following steps:
Step 1: Create a new HashMap named dataProfile to hold the data summary.
Step 2: Read a record from the buffer and check whether its user ID field is empty. If it is empty, skip to Step 5; otherwise continue to the next step.
Step 3: Check whether dataProfile contains a record keyed by the current user ID. If not, add a record keyed by the current user ID; in this case the operation must be a login, and the login time is recorded. Otherwise continue to the next step.
Step 4: Check the type of the current operation. If it is a logout, delete the record keyed by the current user ID from dataProfile; for any other operation, append the current operation type and the corresponding product ID to the operation sequence in that record's value.
Step 5: Read the next record from the buffer and repeat the loop.
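Steps 1 to 5 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented code: a `dict` plays the role of the HashMap `dataProfile`, the record field names (`user_id`, `ip`, `time`, `op`, `item_id`) and the sample records are hypothetical, and the sketch assumes that the login record itself only creates the entry and is not appended to the operation sequence.

```python
def update_profile(data_profile, record):
    """Apply one log record to the in-memory data summary (Steps 2 to 4)."""
    user_id = record.get("user_id")
    if not user_id:
        # Step 2: records without a user ID are skipped.
        return
    if user_id not in data_profile:
        # Step 3: no entry yet, so this must be a login; keep its time.
        data_profile[user_id] = {
            "ip": record.get("ip"),
            "login_time": record.get("time"),
            "ops": [],
        }
        return
    if record.get("op") == "logout":
        # Step 4: the session is over, so the summary is dropped.
        del data_profile[user_id]
    else:
        # Step 4: any other operation extends the operation sequence.
        data_profile[user_id]["ops"].append((record.get("op"), record.get("item_id")))


def build_profile(records):
    data_profile = {}                 # Step 1: dict stands in for the HashMap
    for record in records:            # Step 5: move on to the next record
        update_profile(data_profile, record)
    return data_profile


# Hypothetical sample stream; only the IP address of user1 comes from the text.
records = [
    {"user_id": "user1", "ip": "220.79.15.21", "time": "19:00:05", "op": "login", "item_id": None},
    {"user_id": "user1", "ip": "220.79.15.21", "time": "19:00:20", "op": "browse", "item_id": "item42"},
    {"user_id": None, "ip": "10.0.0.9", "time": "19:00:21", "op": "browse", "item_id": "item7"},
    {"user_id": "user2", "ip": "10.0.0.5", "time": "19:00:30", "op": "login", "item_id": None},
    {"user_id": "user1", "ip": "220.79.15.21", "time": "19:00:41", "op": "add_to_cart", "item_id": "item42"},
    {"user_id": "user2", "ip": "10.0.0.5", "time": "19:00:50", "op": "logout", "item_id": None},
]
profile = build_profile(records)
```

Because `user2` logs out within the stream, only `user1` remains in `profile`, with a two-entry operation sequence and a single stored timestamp (the login time).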
Further, sub-step 3) comprises the following steps:
Step a: Match the extracted pattern against the abnormal behavior patterns in the abnormal behavior pattern library.
Step b: If a match is found, judge the behavior to be a known anomaly.
Step c: If no match is found, match against the normal behavior patterns. If a match is found, judge the behavior to be normal; if not, judge it to be an unknown anomaly.
Step d: Once an anomaly is confirmed, perform the following four operations: ① report to the front end in real time and raise an anomaly alarm; ② delete the user's record from the data summary; ③ add the user to an abnormal-user queue and stop running anomaly detection on that user until the user logs out, at which point the user is removed from the queue; ④ feed the anomaly back to the database and adjust the corresponding labels.
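The matching order in Steps a to c amounts to a three-way decision. The sketch below shows only that control flow; the pattern representation (tuples of operation names) and the exact-equality `exact_match` predicate are assumptions chosen for illustration, since the text leaves the matching criterion to the chosen pattern algorithm.

```python
def exact_match(current, stored):
    # Assumed matching rule for this sketch: patterns must be identical.
    return current == stored


def classify(pattern, abnormal_patterns, normal_patterns, matches=exact_match):
    """Return 'known_anomaly', 'normal', or 'unknown_anomaly'."""
    # Steps a and b: the abnormal library is consulted first.
    if any(matches(pattern, p) for p in abnormal_patterns):
        return "known_anomaly"
    # Step c: otherwise fall back to the normal library.
    if any(matches(pattern, p) for p in normal_patterns):
        return "normal"
    return "unknown_anomaly"


# Hypothetical pattern libraries for one user.
abnormal = [("login", "change_password", "transfer")]
normal = [("login", "browse", "add_to_cart")]

verdicts = [
    classify(("login", "change_password", "transfer"), abnormal, normal),
    classify(("login", "browse", "add_to_cart"), abnormal, normal),
    classify(("login", "transfer"), abnormal, normal),
]
```

Checking the abnormal library first means a pattern that appears in both libraries is still reported as a known anomaly.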
The beneficial effects of the invention are as follows. Data flowing from the data window into permanent storage is labeled as normal behavior by default, and the label of the corresponding data in the database is adjusted only when the real-time analysis module detects an abnormal operation. Data therefore does not have to wait in the data window until detection finishes and its label is decided; it can flow out immediately, which saves memory and prevents data from piling up in the window.
Because an abnormal operation can be identified before the user logs out, and the real-time analysis module reports a detected anomaly to the offline analysis module immediately so that the corresponding labels in the database are adjusted, all data recorded before a user's most recent logout is guaranteed to carry up-to-date labels.
Compared with the prior art, the invention applies data stream techniques to anomaly detection for virtual assets and designs a technical framework for data-stream-based online anomaly detection of virtual assets, so that the system can detect anomalies in real time more quickly and effectively and thus better protect users from losses.
Brief Description of the Drawings
Figure 1 is the framework diagram of the data-stream-based online anomaly detection of virtual assets according to the invention.
Figure 2 is the flowchart of the data summary extraction algorithm of the invention.
Figure 3 is the hardware deployment diagram of the invention.
Detailed Description
To make the invention easier to understand, it is further described below with reference to the drawings and an embodiment.
The invention provides a data-stream-based method for online anomaly detection of virtual assets. Its framework, shown in Figure 1, comprises an online analysis module and an offline analysis module. First, the user operation log stream flows into a data window; a data summary is extracted by preprocessing the data in the window, and data that has been processed flows straight out of the window and is written to permanent storage. In the offline analysis module, the data in the database is processed periodically, and a pattern generation algorithm mines each user's normal and abnormal behavior patterns. In the online analysis module, the system analyzes the data in the sliding window in real time, extracts the current behavior pattern, and matches it against the normal and abnormal behavior patterns in the pattern library to decide whether the behavior is abnormal. If it is judged abnormal, an alarm is raised.
Online analysis module: the online analysis module has three main tasks, namely extracting the data summary, extracting the current user behavior pattern, and behavior pattern matching. Table 1 is a simple example of a user operation log stream over a period of time; the stream contains 12 records, spans 50 seconds, and involves three users. The example shows only five fields (user, IP address, time, operation type, and related product ID); real data is far more complex. The purpose of extracting a data summary is to save as much valuable memory as possible without losing important information, while using a data structure that is convenient for later computation. The data summary extraction of the invention therefore follows two rules:
A. Only data between a user's login and logout is processed.
B. A timestamp is recorded only for the login operation.
Table 1 Simple example of a user operation log stream
Table 2 is a simple example of the user operation data summary generated from the sample data in Table 1. The data summary has four main fields: user ID, IP address, login time, and operation sequence. The data summary is stored in a List with one entry per user, and the operation sequence field is itself a List. When new operation data for a user enters the data window, its operation type and related product ID are extracted and appended to the operation sequence List.
Table 2 Simple example of a user operation data summary
The algorithm for extracting the data summary is shown in Figure 2. Its main steps are:
Step 1: Create a new HashMap named dataProfile to hold the data summary.
Step 2: Read a record from the buffer and check whether its user ID field is empty. If it is empty, skip to Step 5; otherwise continue to the next step.
Step 3: Check whether dataProfile contains a record keyed by the current user ID. If not, add a record keyed by the current user ID; in this case the operation must be a login, and the login time is recorded. Otherwise continue to the next step.
Step 4: Check the type of the current operation. If it is a logout, delete the record keyed by the current user ID from dataProfile; for any other operation, append the current operation type and the corresponding product ID to the operation sequence in that record's value.
Step 5: Read the next record from the buffer and repeat the loop.
Each time new operation data arrives for a user, the current behavior pattern is extracted from that user's data summary and matched against the normal and abnormal behavior pattern libraries generated by the offline analysis module. The matching proceeds as follows. The pattern is first matched against the abnormal behavior patterns in the abnormal behavior pattern library; if a match is found, the behavior is judged to be a known anomaly. If no match is found, the pattern is matched against the normal behavior patterns; if a match is found, the behavior is judged normal, and if not, it is judged to be an unknown anomaly. Once an anomaly is confirmed, four operations are performed: ① report to the front end in real time and raise an anomaly alarm; ② delete the user's record from the data summary; ③ add the user to an abnormal-user queue and stop running anomaly detection on that user until the user logs out, at which point the user is removed from the queue; ④ feed the anomaly back to the database and adjust the corresponding labels.
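The four operations performed after an anomaly is confirmed can be sketched as a small handler. Everything named here is hypothetical: the `AnomalyHandler` class, the `alert` and `relabel` callbacks (stand-ins for the front end and the database), and the use of a `set` in place of the abnormal-user queue, chosen because the only operations needed are membership tests and removal.

```python
class AnomalyHandler:
    """Hypothetical sketch of the four post-detection operations."""

    def __init__(self, data_profile, alert, relabel):
        self.data_profile = data_profile   # in-memory data summary
        self.alert = alert                 # callback standing in for the front end
        self.relabel = relabel             # callback standing in for the database
        self.flagged = set()               # plays the role of the abnormal-user queue

    def on_anomaly(self, user_id, verdict):
        self.alert(user_id, verdict)            # 1: real-time alarm
        self.data_profile.pop(user_id, None)    # 2: drop the user's summary
        self.flagged.add(user_id)               # 3: suspend further detection
        self.relabel(user_id)                   # 4: feed the label back

    def should_check(self, user_id, op):
        """Return False while a flagged user is still logged in."""
        if user_id in self.flagged:
            if op == "logout":
                # The logout ends the suspension for later sessions.
                self.flagged.discard(user_id)
            return False
        return True


alerts, relabeled = [], []
profile = {"user1": {"ops": [("browse", "item42")]}}
handler = AnomalyHandler(
    profile,
    alert=lambda user, verdict: alerts.append((user, verdict)),
    relabel=relabeled.append,
)
handler.on_anomaly("user1", "unknown_anomaly")
checks = [
    handler.should_check("user1", "browse"),   # still flagged
    handler.should_check("user1", "logout"),   # logout clears the flag
    handler.should_check("user1", "login"),    # next session is checked again
]
```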
Table 3 is an example of a simple behavior pattern extracted for user user1 from the data summary in Table 2. It indicates that user1 logged in at around 19:00 from IP address 220.79.15.21, that the session lasted less than 30 minutes, that the related products were priced in the 0-100 yuan range, and that the operation sequence was: login -> viewed a product with similarity 0.84 to the product added to the cart -> viewed the product added to the cart -> added to cart.
Table 3 Example of a simple behavior pattern extracted for user user1
Table 4 is an example of some of user1's normal behavior patterns in the behavior pattern library. It includes two association rules on IP address and time; the percentage of products of interest in each price range (in the example, 80% of the products user1 follows are priced at 0-100 yuan, 19% at 100-200 yuan, and 1% at 200-500 yuan); and three frequent patterns of operation sequences.
Table 4 Example of some normal behavior patterns of user user1
The pattern matching stage uses the following steps: ① first, the static attributes in the user's current behavior pattern (IP address and time, product price) are compared with all association rules in the normal behavior pattern library; if all of them match, the behavior is judged normal; ② otherwise, the operation sequence in the current behavior pattern is compared with all operation sequences in the normal behavior pattern library; if the similarity exceeds the configured threshold, the behavior is judged normal, and otherwise it is judged abnormal. In the example, static attribute matching fails on "IP address and time": in the normal patterns, logins from IP address 220.79.15.21 generally occur around 11:00, whereas this login occurred around 19:00, so matching proceeds to the operation sequences. Many methods exist for computing the similarity of operation sequences, and this is not the focus of the invention. In this example the Deep-Simi algorithm gives a similarity of 0.7 between the operation sequence of the current pattern and the first operation sequence of the example normal patterns; since the threshold is generally set between 0.4 and 0.6, the behavior is judged normal.
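The two-stage match can be sketched as follows. This is an illustration only: the text names Deep-Simi as the similarity measure but does not define it, so the standard-library `difflib.SequenceMatcher` ratio is swapped in as a stand-in, and the dictionary layout, operation names, and the 0.5 threshold (picked from the stated 0.4-0.6 range) are assumptions.

```python
from difflib import SequenceMatcher


def static_attributes_match(current, normal):
    # Stage 1: IP/time association rules and price range must all match.
    ip_time_ok = (current["ip"], current["login_hour"]) in normal["ip_hour_rules"]
    price_ok = current["price_range"] in normal["price_ranges"]
    return ip_time_ok and price_ok


def sequence_similarity(a, b):
    # Stand-in for Deep-Simi, which the text does not specify.
    return SequenceMatcher(None, a, b).ratio()


def is_normal(current, normal, threshold=0.5):
    if static_attributes_match(current, normal):
        return True
    # Stage 2: compare against every stored normal operation sequence.
    best = max(
        (sequence_similarity(current["ops"], seq) for seq in normal["op_sequences"]),
        default=0.0,
    )
    return best > threshold


# Hypothetical normal pattern library for user1, loosely following the example.
normal_user1 = {
    "ip_hour_rules": {("220.79.15.21", 11)},
    "price_ranges": {"0-100", "100-200", "200-500"},
    "op_sequences": [["login", "browse", "add_to_cart"]],
}
current = {
    "ip": "220.79.15.21",
    "login_hour": 19,                 # unusual hour, so stage 1 fails
    "price_range": "0-100",
    "ops": ["login", "browse", "browse", "add_to_cart"],
}
suspicious = dict(current, ops=["login", "transfer", "transfer", "transfer"])

result_current = is_normal(current, normal_user1)
result_suspicious = is_normal(suspicious, normal_user1)
```

As in the worked example, the login at 19:00 fails the static check, but the first session's operation sequence is close enough to a stored normal sequence to be accepted, while the second is not.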
Offline analysis module: this module mainly covers data storage and pattern generation. The massive volume of user operation data for virtual assets is generally stored in a NoSQL database such as Cassandra. Note that data flowing from the data window into permanent storage is labeled as normal behavior by default; only when the real-time analysis module detects that a user's operation is abnormal is the label of the corresponding data in the database adjusted. One benefit is that data in the data window does not have to wait until detection finishes and its label is decided before it flows out. This saves memory; otherwise a large amount of data would pile up in the data window. Labels in the database should also be adjusted through manual feedback: for example, if the system judges a user's behavior abnormal and raises an alarm that is later manually confirmed to be a false alarm, this information must be fed back to the database to correct the labels of the corresponding data.
The system periodically runs a pattern generation algorithm over the data in the offline analysis module's database to obtain a normal behavior pattern library and an abnormal behavior pattern library for each user. Several algorithms can be used for pattern generation, such as association rules, sequential patterns, spectral theory, and spatiotemporal sequence mining. When patterns are recomputed from the database, only the operation data recorded before each user's most recent logout needs to be analyzed. Some of the newest data in the database has not yet had its labels adjusted and still carries the default normal label, whereas all data recorded before a user's most recent logout is guaranteed to carry up-to-date labels: an abnormal operation is identified before the user logs out, and once the real-time analysis module detects an anomaly it immediately notifies the offline analysis module so that the labels of the corresponding data in the database are adjusted.
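The rule of using only data up to each user's most recent logout can be sketched as a filter applied before pattern generation. The function name, event fields, and sample events are hypothetical; the sketch assumes events are already ordered by time and that the logout event itself is kept.

```python
def events_for_pattern_update(events):
    """Keep each user's events up to and including their most recent logout."""
    last_logout = {}
    for index, event in enumerate(events):
        if event["op"] == "logout":
            last_logout[event["user_id"]] = index
    # Users who have never logged out contribute nothing yet, because
    # their labels may still be the unverified default.
    return [
        event
        for index, event in enumerate(events)
        if index <= last_logout.get(event["user_id"], -1)
    ]


# Hypothetical, time-ordered events.
events = [
    {"user_id": "user1", "op": "login"},
    {"user_id": "user2", "op": "login"},
    {"user_id": "user1", "op": "browse"},
    {"user_id": "user1", "op": "logout"},
    {"user_id": "user1", "op": "login"},    # open session, excluded
    {"user_id": "user2", "op": "browse"},   # user2 never logged out, excluded
]
usable = events_for_pattern_update(events)
```

Here only the first complete session of `user1` is passed on to pattern generation.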
The hardware deployment environment of the present invention is shown in FIG. 3. The hardware is highly scalable: when demand increases, it is only necessary to add cluster nodes.
Example 1
A method for online anomaly detection of virtual assets based on data flow, in which the hardware of the virtual asset management system is as follows:
Virtual asset data stream processing cluster: 2 nodes, each configured with a 4-core CPU, 32 GB memory, and a CentOS 6.5 64-bit operating system;
Behavior pattern computing cluster: 5 nodes, each configured with a 4-core CPU, 16 GB memory, and a CentOS 6.5 64-bit operating system;
Virtual asset operation log database: 1 node, configured with a 2-core CPU, 8 GB memory, a 2 TB hard disk, and a CentOS 6.5 64-bit operating system;
Behavior pattern library: 1 node, configured with a 2-core CPU, 8 GB memory, a 2 TB hard disk, and a CentOS 6.5 64-bit operating system.
The hardware configuration above can handle concurrent operations from on the order of 10,000 users. The virtual asset data stream processing cluster extracts data summaries in real time from the continuously incoming data and keeps the summaries in memory; processed data flows directly out of the sliding window and is stored in the virtual asset operation log database. The behavior pattern computing cluster periodically reads the data in the virtual asset operation log database, computes user behavior patterns, and updates the behavior pattern library whenever new patterns are obtained. Meanwhile, the virtual asset data stream processing cluster extracts the user's current behavior pattern from the information in the data summary, retrieves that user's normal and abnormal behavior patterns from the behavior pattern library, and matches the current pattern against each of them to determine whether the current operation is abnormal. If it is judged abnormal, the abnormal label must be fed back to the virtual asset operation log database.
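The matching step at the end of this workflow might look like the sketch below. The passage does not state how patterns are represented or compared, so everything here is an assumption: patterns are modeled as sets of action names, similarity is Jaccard similarity, and an operation is treated as abnormal when the current pattern is closer to the abnormal library than to the normal one.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def best_match(current, library):
    """Highest similarity between the current pattern and any library pattern."""
    return max((jaccard(current, p) for p in library), default=0.0)

def is_abnormal(current, normal_lib, abnormal_lib):
    # Assumed decision rule: abnormal if the abnormal library matches better.
    return best_match(current, abnormal_lib) > best_match(current, normal_lib)

# Hypothetical per-user pattern libraries.
normal_lib = [{"login", "view_balance", "logout"}]
abnormal_lib = [{"login", "change_password", "transfer_all"}]

suspicious = is_abnormal(
    {"login", "change_password", "transfer_all"}, normal_lib, abnormal_lib
)
routine = is_abnormal({"login", "view_balance"}, normal_lib, abnormal_lib)
```

A result of `True` would then trigger the feedback step, in which the abnormal label is written back to the operation log database.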
Compared with the prior art, the present invention applies data stream technology to anomaly detection for virtual assets and provides a data-flow-based technical framework for online anomaly detection of virtual assets, enabling the system to detect anomalies in real time more quickly and effectively and thus better protect users from losses.
The above is an exemplary description of the present invention. The implementation of the invention is clearly not limited to the manner described above: any improvement made using the technical solution of the invention, and any direct application of its concept and technical solution to other settings without improvement, falls within the scope of protection of the invention.
Claims (6)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201510130123.1A CN106156026B (en) | 2015-03-24 | 2015-03-24 | A method for online anomaly detection of virtual assets based on data flow |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN106156026A CN106156026A (en) | 2016-11-23 |
| CN106156026B CN106156026B (en) | 2020-02-18 |
Family
ID=58064356
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201510130123.1A Active CN106156026B (en) | 2015-03-24 | 2015-03-24 | A method for online anomaly detection of virtual assets based on data flow |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN106156026B (en) |
Families Citing this family (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108075906A (en) * | 2016-11-08 | 2018-05-25 | 上海有云信息技术有限公司 | A kind of management method and system for cloud computation data center |
| CN107335220B (en) * | 2017-06-06 | 2021-01-26 | 广州华多网络科技有限公司 | Negative user identification method and device and server |
| CN107402957B (en) * | 2017-06-09 | 2023-02-07 | 全球能源互联网研究院 | Construction of User Behavior Pattern Library and Method and System for Abnormal User Behavior Detection |
| CN108055281B (en) * | 2017-12-27 | 2021-05-18 | 百度在线网络技术(北京)有限公司 | Account abnormity detection method, device, server and storage medium |
| CN109308615B (en) * | 2018-08-02 | 2020-12-29 | 同济大学 | Method, system, storage medium and electronic terminal for real-time fraudulent transaction detection based on statistical sequence features |
| CN110363381B (en) * | 2019-05-31 | 2023-12-22 | 创新先进技术有限公司 | An information processing method and device |
| CN111143415B (en) * | 2019-12-26 | 2023-12-29 | 政采云有限公司 | A data processing method, device and computer-readable storage medium |
| CN113806523B (en) * | 2020-06-11 | 2023-07-21 | 中国科学院计算机网络信息中心 | A classification-based anomaly detection method and system |
| CN112000863B (en) * | 2020-08-14 | 2024-04-09 | 北京百度网讯科技有限公司 | User behavior data analysis method, device, equipment and medium |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101364104A (en) * | 2008-09-23 | 2009-02-11 | 西部矿业股份有限公司 | Multi entity monitoring decision support system and method for downhole entironment |
| CN102130800A (en) * | 2011-04-01 | 2011-07-20 | 苏州赛特斯网络科技有限公司 | Device and method for detecting network access abnormality based on data stream behavior analysis |
| CN102413013A (en) * | 2011-11-21 | 2012-04-11 | 北京神州绿盟信息安全科技股份有限公司 | Network abnormal behavior detection method and device |
| CN104090835A (en) * | 2014-06-27 | 2014-10-08 | 中国人民解放军国防科学技术大学 | eID (electronic IDentity) and spectrum theory based cross-platform virtual asset transaction audit method |
Non-Patent Citations (1)
| Title |
|---|
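| Research on Key Algorithms for Frequent Pattern Mining in Data Streams and Their Applications; Mao Yimin (毛伊敏); China Doctoral Dissertations Full-text Database, Information Science and Technology; 2012-12-15 (No. 12); dissertation Sections 1.2.4, 1.2.5, 5.3-5.4 * |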
Also Published As
| Publication number | Publication date |
|---|---|
| CN106156026A (en) | 2016-11-23 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |