CN105138661B - Hadoop-based network security log k-means cluster analysis system and method

Hadoop-based network security log k-means cluster analysis system and method

Info

Publication number
CN105138661B
Authority
CN
China
Prior art keywords
data, log, analysis, hadoop, log data
Prior art date
2015-09-02

Legal status
Expired - Fee Related
Application number
CN201510553636.3A
Other languages
Chinese (zh)
Other versions
CN105138661A
Inventor
高岭
苏蓉
高妮
王帆
杨建锋
雷艳婷
申元
Current Assignee
Northwest University
Original Assignee
Northwest University
Priority date / Filing date
2015-09-02

Application filed by Northwest University
Priority to CN201510553636.3A
Publication of CN105138661A: 2015-12-09
Application granted; publication of CN105138661B: 2018-10-30
Legal status: Expired - Fee Related

Classifications

    • G: Physics
    • G06: Computing or calculating; counting
    • G06F: Electric digital data processing
    • G06F16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F16/2457: Query processing with adaptation to user needs
    • G06F16/182: Distributed file systems
    • G06F16/25: Integrating or interfacing systems involving database management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A Hadoop-based network security log k-means cluster analysis system and method, comprising a log data acquisition subsystem, a log data hybrid-mechanism storage management subsystem, and a log data analysis subsystem. At the data storage layer, a hybrid storage mechanism in which Hadoop cooperates with a traditional data warehouse stores the log data, and the data access layer provides an interface for Hive operations; the data storage and computing layers receive instructions from the Hive engine and, through HDFS in cooperation with MapReduce, achieve efficient query and analysis of the data. When the log data are mined and analyzed, a MapReduce implementation of the k-means algorithm performs the cluster mining analysis. The collaboration architecture of Hadoop and the traditional data warehouse makes up for the traditional data warehouse's shortcomings in massive data processing and storage while still putting the existing warehouse to full use, and the MapReduce-based k-means cluster analysis enables timely security-level assessment of, and early warning on, the log data.

Description

Hadoop-based network security log k-means cluster analysis system and method

Technical Field

The invention belongs to the technical field of computer information processing, and in particular relates to a Hadoop-based network security log k-means cluster analysis system and method.

Background Art

With the explosion of data and the sharp increase in information volume, enterprises' existing traditional data warehouses can no longer keep up with the growth of data. Traditional data warehouses are usually built on high-performance appliances, which makes them costly and hard to scale, and they are only good at processing structured data; this limits their ability to mine the intrinsic value of massive heterogeneous data, which is the biggest difference between Hadoop and traditional data processing approaches. An enterprise's existing traditional data warehouse should be put to sensible use and, at the same time, integrated with a big data platform into a unified data analysis and processing architecture, so that monitoring and statistical analysis of network logs is achieved through the collaboration of Hadoop and the traditional data warehouse.

Hadoop is an open-source distributed computing platform managed by the Apache organization: a software framework capable of distributed processing of large amounts of data. With the Hadoop Distributed File System (HDFS) and MapReduce at its core, Hadoop provides users with a distributed infrastructure whose low-level system details are transparent. The high fault tolerance, elasticity, extensibility, availability, and throughput of HDFS allow users to deploy Hadoop on inexpensive hardware to form a distributed system, and the MapReduce distributed programming model lets users develop parallel applications without understanding the low-level details of distributed systems.

HDFS is the foundation of data storage management in distributed computing, developed to meet the need to access and process very large files in a streaming-data mode. Its characteristics provide failure-tolerant storage for massive data and bring great convenience to applications that process very large data sets. HDFS has a master/slave architecture with two types of nodes: the NameNode, also called the "metadata node", and the DataNode, also called the "data node"; these two node types serve as the Master and the Workers that execute the concrete tasks. Owing to the nature of distributed storage, an HDFS cluster has one NameNode and multiple DataNodes. The metadata node manages the file system's namespace; the data nodes are where the file system's data is actually stored.

The MapReduce parallel computing framework is a parallelized program execution system. It provides a parallel processing model and procedure consisting of two stages, Map and Reduce; it processes data supplied as key-value pairs and automatically handles data partitioning and scheduling. During program execution, the MapReduce framework is responsible for scheduling and allocating computing resources, partitioning the input and output data, scheduling the execution of the program, monitoring its execution state, synchronizing the computing nodes, and collecting and collating intermediate results.
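As a minimal illustration of this key-value model, the sketch below shows a Hadoop Mapper and Reducer pair that counts log records per source host; the class names and the assumption that the host is the first whitespace-separated field are illustrative only, not taken from the patent.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: one input record in, one (key, value) pair out.
public class HostCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Emit the source host (assumed first field of the log line) with a count of 1.
        String host = line.toString().split("\\s+")[0];
        ctx.write(new Text(host), ONE);
    }
}

// Reduce stage: all values sharing a key arrive together and are aggregated.
class HostCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text host, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        ctx.write(host, new IntWritable(sum));
    }
}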

Sqoop is a tool for fast, bulk data exchange between relational databases and the Hadoop platform. It can bulk-import data from a relational database into Hadoop's HDFS and Hive, and conversely export data from the Hadoop platform into a relational database.

Hive is a data warehouse built on top of Hadoop for managing structured and semi-structured data stored in HDFS. It allows data query and analysis programs to be written directly in HiveQL, an SQL-like query language serving as the programming interface, and provides the data extraction and transformation, storage management, and query analysis functions a data warehouse requires; under the hood, HiveQL statements are converted into corresponding MapReduce programs for execution.
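Hive's SQL-like interface can also be driven programmatically through its JDBC driver; below is a minimal sketch in which the HiveServer2 URL, the credentials, and the query are placeholder assumptions rather than details from the patent.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, and database are placeholders.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // The Hive engine compiles this HiveQL into MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                    "SELECT host, count(*) FROM syslog_incoming GROUP BY host");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}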

Summary of the Invention

To overcome the above deficiencies of the prior art, the purpose of the present invention is to provide a Hadoop-based network security log k-means cluster analysis system and method that, while making sensible use of an already-built traditional data warehouse, integrates a big data platform with it to establish a unified data storage and processing architecture, overcoming the traditional data warehouse's poor scalability, its limitation to structured data, and its inability to mine the intrinsic value of massive heterogeneous data.

To achieve the above object, the technical solution adopted by the present invention is a Hadoop-based network security log k-means cluster analysis system comprising a log data acquisition subsystem, a log data hybrid-mechanism storage management subsystem, and a log data analysis subsystem;

The log data acquisition subsystem collects the network security log data of all devices;

The log data hybrid-mechanism storage management subsystem manages and stores all log data;

The log data analysis subsystem performs fast query and analysis processing on all log data, and mines and analyzes the log data's latent value.

In a Linux environment, the log data acquisition subsystem configures a Syslogd centralized log server, collects and records device and system log data via Syslog, and manages the log data centrally.

The log data hybrid-mechanism storage management subsystem integrates the Hadoop platform and the traditional data warehouse, and includes an HDFS distributed file system module and a Hadoop platform/traditional data warehouse collaboration module.

The log data query and analysis subsystem mainly uses the Hive tool for simple statistical query analysis of the data. HiveQL query statements are written according to requirements and, driven by the Hive Driver, the lexical analysis, parsing, compilation, optimization, and query plan generation for each HiveQL statement are completed; the generated query plan is stored in HDFS and subsequently executed via MapReduce. For analysis of the latent information in the log data, its intrinsic value is mined by writing MapReduce programs that implement the corresponding algorithms.

The basic file access process of the HDFS distributed file system module is:

1) The application sends the file name to the NameNode through the HDFS client program;

2) After the NameNode receives the file name, it looks up the data blocks corresponding to the file name in the HDFS directory, finds from the block information the addresses of the DataNodes holding those blocks, and sends these addresses back to the client;

3) After the client receives the DataNode addresses, it performs data transfer operations with these DataNodes in parallel, and at the same time submits logs of the operation results to the NameNode.
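The same three-step flow is what the HDFS Java client API performs beneath an ordinary read; a minimal sketch, in which the file path is a placeholder:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        // open() asks the NameNode for block locations, then streams from the DataNodes.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/syslog/syslog_incoming"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}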

The Hadoop platform/traditional data warehouse collaboration module configures a MySQL database as Hive's metastore, used to store Hive's schema (table structure) information, and uses the Sqoop tool to transfer data between the traditional data warehouse and the Hadoop platform.

With a Hadoop cluster as the platform, the traditional data warehouse and the big data platform are integrated: the MySQL database serves as Hive's metastore for storing Hive's schema information, and the Sqoop tool transfers data between the traditional data warehouse and the big data platform. The architecture comprises a data source layer, a data storage layer, a computing layer, a data analysis layer, and a result display layer;

The data source layer collects the log data of all devices by configuring a Syslogd centralized log server, then imports the log data from the traditional data warehouse into the data storage layer through the Sqoop tool;

The data storage layer adopts a hybrid storage architecture in which Hadoop cooperates with the traditional data warehouse: data are imported into HDFS through the Sqoop data transfer tool, the metadata are processed, and the processed data are imported into the corresponding tables in Hive;

At the result display layer, the user issues a request to the data analysis layer;

The data analysis layer converts the user's request into the corresponding HiveQL statement and, driven by the Hive Driver, completes the execution;

The computing layer receives instructions from the Hive engine and, through the HDFS of the data storage layer in cooperation with MapReduce, processes and analyzes the data, finally returning the results to the result display layer.

A Hadoop-based network security log k-means cluster analysis method comprises the following steps:

1) Log data preprocessing: convert the Syslog_incoming_mes file containing the textual content of the log description information into a text vector file;

2) MapReduce-based implementation of the k-means algorithm: run the k-means clustering algorithm on the text vectors.

The log data preprocessing includes the following steps:

1) Remove function words: strip words without substantive meaning from the log description text;

2) Tag parts of speech: the system uses the english-left3words-distsim.tagger tagger to mark the words in each log description;

3) Extract useful words: after tagging, the system extracts nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD), and adjectives (JJ, JJR, JJS); these words carry substantive meaning and can accurately express the log information;

4) Build the frequent dictionary: count word frequencies over all records; high-frequency words are representative within the description domain, and words above a threshold become keyword elements; the selected frequent dictionary can effectively express the log information;

5) Generate the text vector file: compare each log description field against the frequent dictionary to obtain a keyspace composed of a string of 0s and 1s; the set of all keyspaces constitutes the text vector file, as the sketch below illustrates.
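A minimal sketch of step 5 in Java; the method name and the word-boundary matching are illustrative assumptions, not the patent's actual code.

import java.util.List;

public class KeyspaceBuilder {
    // Compare one log description against the frequent dictionary:
    // each dictionary word contributes 1 if present in the description, else 0.
    public static int[] toKeyspace(String description, List<String> frequentDict) {
        String text = " " + description.toLowerCase() + " ";
        int[] keyspace = new int[frequentDict.size()];
        for (int i = 0; i < frequentDict.size(); i++) {
            String word = " " + frequentDict.get(i).toLowerCase() + " ";
            keyspace[i] = text.contains(word) ? 1 : 0;
        }
        return keyspace;
    }
}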

The MapReduce-based implementation of the k-means algorithm includes the following steps:

1) Scan all points in the original data set and randomly select k points as the initial cluster centers;

2) Each Map node reads its locally stored data set and generates cluster sets with the k-means algorithm; in the Reduce stage, the cluster sets are merged into new global cluster centers, and this process repeats until the termination condition is satisfied;

3) Partition all data elements into clusters according to the finally generated cluster centers.

The advantages of the technical solution of the present invention are mainly reflected in the following:

1) Hadoop support is provided on top of the existing traditional data warehouse, establishing a unified data storage and processing architecture; this makes up for the traditional data warehouse's shortcomings in massive data processing and storage while still putting the existing warehouse to full use.

2) As data grows, more cluster resources are needed to process it. Hadoop is an easily extensible system: simply configuring a new node conveniently expands the cluster and raises its computing power.

3) For the heterogeneous data in massive network logs, the metadata are first processed with MapReduce and the processed data imported into the corresponding Hive tables; HiveQL statements are then written as required for simple query analysis, and the k-means algorithm implemented with MapReduce mines and analyzes the data. Query and analysis efficiency is improved, and the latent value of the data is tapped.

Brief Description of the Drawings

Fig. 1 is a functional block diagram of the system structure of the present invention.

Fig. 2 is an architecture diagram of the network log analysis system based on the collaboration of Hadoop and the traditional data warehouse according to the present invention.

Fig. 3 is the research framework of the log data k-means clustering algorithm of the present invention.

Detailed Description of the Embodiments

The technical solution of the present invention is described in detail below in conjunction with the embodiments and the accompanying drawings, but is not limited thereto.

Referring to Fig. 1, a Hadoop-based network security log k-means cluster analysis system includes a log data acquisition subsystem 11, a log data hybrid-mechanism storage management subsystem 12, and a log data analysis subsystem 13;

The log data acquisition subsystem 11 collects the network security log data of all devices;

The log data hybrid-mechanism storage management subsystem 12 manages and stores all log data;

The log data analysis subsystem 13 performs fast query and analysis processing on all log data, and mines and analyzes the log data's latent value.

The operation of each module of the system proceeds as follows:

Step 1: Log data acquisition. Configure the Syslogd centralized log server; using UDP as the transport protocol, the log output of all security devices is sent via the destination port to the log server running the Syslog software, and the Syslog server automatically receives the log data and writes it to log files;

Step 2: Use Sqoop to import the log information table syslog_incoming from MySQL into HDFS, using the command:

sqoop import --connect jdbc:mysql://219.245.31.39:3306/syslog --username sqoop --password sqoop --table syslog_incoming -m 1

Sqoop imports the table from MySQL through a MapReduce job that extracts the records row by row and writes them to HDFS. The NameNode in the cluster manages data placement and tells the client where blocks are to be stored; once it has the location information, the client starts writing. The data is split into blocks and stored as multiple replicas on different DataNodes: the client writes the data to the first node which, while still receiving, pushes the data on to the second node, the second pushes it to the third, and so on;
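The replication pipeline just described is triggered transparently by an ordinary client write; a minimal sketch of such a write through the HDFS Java API, where the path and the record content are placeholders:

import java.io.PrintWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() contacts the NameNode; the returned stream writes blocks to the
        // first DataNode, which forwards each block along the replication pipeline.
        try (FSDataOutputStream out = fs.create(new Path("/syslog/sample.log"));
             PrintWriter pw = new PrintWriter(out)) {
            pw.println("Sep 2 12:00:00 fw01 kernel: sample log record");
        }
    }
}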

Step 3: Write a MapReduce program to extract useful information from the raw log data imported into HDFS;
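Such an extraction step can be realized as a map-only job; the sketch below assumes a hypothetical comma-separated record layout (facility, priority, host, message) purely for illustration, not the patent's actual schema.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only extraction: keep the fields of interest as tab-separated output
// and silently skip malformed lines.
public class SyslogExtractMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] f = line.toString().split(",", -1); // assumed comma-separated export
        if (f.length < 4) {
            return; // malformed record
        }
        String out = f[0] + "\t" + f[1] + "\t" + f[2] + "\t" + f[3];
        ctx.write(NullWritable.get(), new Text(out));
    }
}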

Step 4: Use Sqoop to generate a Hive table matching the extracted table in the relational data source; the command directly generates the corresponding Hive table definition, after which the data stored in HDFS are loaded:

sqoop create-hive-table --connect jdbc:mysql://219.245.31.39:3306/syslog --table syslog_incoming --fields-terminated-by ','

Start Hive and load the data:

load data inpath 'syslog_incoming' into table syslog_incoming;

Step 5: According to business requirements, write the corresponding HiveQL statements or MapReduce programs to perform statistical analysis on the log data. The specific steps of the statistical analysis are: partition the data table according to business requirements; partitions are defined with the PARTITIONED BY clause when the table is created. Accordingly, the table records are defined as partitioned by priority level and by time (year, quarter, month); the following example defines the table records as partitioned by priority level:

hive> create table syslog_incoming_priority (facility string, data string, host string)
    > partitioned by (priority string)
    > row format delimited
    > fields terminated by '\t'
    > stored as textfile;

After defining the table structure, load the data into the partitioned table:

hive> insert into table syslog_incoming_priority
    > partition (priority)
    > select facility, data, host, priority
    > from syslog_incoming;

At the file system level, partitions are simply nested subdirectories under the table directory; the table directory structure then contains one subdirectory per priority level, with the data files stored in the leaf directories. Finally, HiveQL query statements are written as required, and the cluster converts them into MapReduce tasks for execution.

Step 6: Import the query and analysis results into MySQL through Sqoop; the front-end display interface presents them to the user through charts.

Referring to Fig. 3, a Hadoop-based network security log k-means cluster analysis method includes the following steps:

1) Log data preprocessing: convert the Syslog_incoming_mes file containing the textual content of the log description information into a text vector file;

2) MapReduce-based implementation of the k-means algorithm: run the k-means clustering algorithm on the text vectors.

The log data preprocessing includes the following steps:

1) Remove function words: strip words without substantive meaning from the log description text;

2) Tag parts of speech: the system uses the english-left3words-distsim.tagger tagger to mark the words in each log description;

3) Extract useful words: after tagging, the system extracts nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD), and adjectives (JJ, JJR, JJS); these words carry substantive meaning and can accurately express the log information;

4) Build the frequent dictionary: count word frequencies over all records; high-frequency words are representative within the description domain, and words above a threshold become keyword elements; the selected frequent dictionary can effectively express the log information;

5) Generate the text vector file: compare each log description field against the frequent dictionary to obtain a keyspace composed of a string of 0s and 1s; the set of all keyspaces constitutes the text vector file.

The MapReduce-based implementation of the k-means algorithm includes the following steps:

Initial cluster center selection. First, the data structure of a cluster is given; this class stores a cluster's basic information, such as the cluster id, the center coordinates, and the number of points belonging to the cluster. Its type is defined as follows:

import org.apache.hadoop.io.Writable;

public class Cluster implements Writable {
    private int clusterID;     // cluster id
    private long numOfPoints;  // number of points belonging to this cluster
    private Instance center;   // cluster center point information
    // write(DataOutput) and readFields(DataInput) serialize and restore these fields
}

Then k points are randomly drawn as the initial cluster centers. The selection procedure: initialize the cluster center set as empty, then scan the entire data set; if the current cluster center set holds fewer than k points, add the scanned point to the centers, otherwise replace a point in the center set with probability 1/(1+k). The cluster center information produced by this step is written into the Cluster-0 directory, and the files in this directory are added to MapReduce's distributed cache as globally shared data for the next iteration round.
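A single-pass sketch of this selection procedure as just described, in plain Java with illustrative names (the patent does not give this code):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class InitialCenterSelection {
    // One scan over the data: fill the first k points, then replace a random
    // existing center with probability 1/(1+k), as described above.
    public static List<double[]> select(Iterable<double[]> points, int k, Random rnd) {
        List<double[]> centers = new ArrayList<>();
        for (double[] p : points) {
            if (centers.size() < k) {
                centers.add(p);
            } else if (rnd.nextDouble() < 1.0 / (1 + k)) {
                centers.set(rnd.nextInt(k), p);
            }
        }
        return centers;
    }
}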

Iterative computation of the cluster centers. This stage runs multiple iterations; before each one starts, every map node must first read, in its setup() method, the cluster information produced by the previous iteration. The stage includes the following steps:

1) Read in the initial cluster centers: read the initial cluster center data stored in the shared cache, which all nodes share;

2) Implementation of the map method: the map method must find the nearest cluster center for each incoming data point and emit the cluster's id as the key with the data point as the value, indicating that the point belongs to the cluster with that id;
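A minimal sketch of this map step; the "count|coordinates" value encoding used here (and in the combiner sketch below) is an assumption of this illustration, not the patent's wording, and a full job would parse the centers in setup() from the distributed cache.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final List<double[]> centers = new ArrayList<>();

    @Override
    protected void setup(Context ctx) {
        // A full job would parse the previous iteration's centers here from
        // files placed in the distributed cache.
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] coords = line.toString().split(",");
        double[] p = new double[coords.length];
        for (int i = 0; i < coords.length; i++) {
            p[i] = Double.parseDouble(coords[i]);
        }
        // Find the nearest center (squared Euclidean distance).
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.size(); c++) {
            double d = 0;
            for (int i = 0; i < p.length; i++) {
                double t = p[i] - centers.get(c)[i];
                d += t * t;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        // Key: nearest cluster id; value: one point, encoded as "1|c1,c2,...".
        ctx.write(new IntWritable(best), new Text("1|" + line.toString()));
    }
}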

3) Implementation of the combiner: to reduce network transfer overhead, a combiner is used on the map side to merge the results the map side produces; this lightens both the map-to-reduce data transfer and the computation on the reduce side, and the types of the keys and values the combiner outputs must match those the map outputs. In the reduce program, the temporary centers of all points belonging to the same cluster are computed from those points' information, implemented here by simple averaging: the sum of all points in the cluster divided by the number of points the cluster currently contains;
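A sketch of such a combiner under the same assumed "count|coordinates" encoding, which is one common way to satisfy the matching-type constraint; it is an illustration, not the patent's code.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Locally merges partial sums so only one record per cluster leaves each
// map node; output types match the map output types, as required.
public class KMeansCombiner extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable clusterId, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        long count = 0;
        double[] sum = null;
        for (Text v : values) {
            String[] parts = v.toString().split("\\|");
            String[] coords = parts[1].split(",");
            if (sum == null) {
                sum = new double[coords.length];
            }
            for (int i = 0; i < coords.length; i++) {
                sum[i] += Double.parseDouble(coords[i]);
            }
            count += Long.parseLong(parts[0]);
        }
        StringBuilder sb = new StringBuilder().append(count).append('|');
        for (int i = 0; i < sum.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(sum[i]);
        }
        // The Reducer later divides the per-cluster sums by the counts to
        // obtain the new centers (simple averaging, as described above).
        ctx.write(clusterId, new Text(sb.toString()));
    }
}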

4) Implementation of the Reducer: the Reduce stage does almost the same as the combiner, further merging the combiner's outputs and emitting the result.

The cluster-center computation step is repeated until the resulting cluster centers no longer change.

Partition the data according to the final cluster centers. Once the final cluster centers are obtained, scan the whole data set and assign each data point to its nearest cluster center.
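This final partitioning reduces to a nearest-center lookup per point; a minimal sketch using squared Euclidean distance (the distance measure is an assumption here, as the patent does not name one):

public class NearestCenter {
    // Squared Euclidean distance; the square root is unnecessary for an argmin.
    static double dist2(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
            double t = a[i] - b[i];
            d += t * t;
        }
        return d;
    }

    // Index of the closest final cluster center for one data point.
    public static int assign(double[] point, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double d = dist2(point, centers[c]);
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }
}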

Example:

First, a Hadoop distributed cluster environment is built from five PCs: one master server, with the remaining four as slave servers. Hadoop is configured on each machine, and Sqoop, Hive, and MySQL are then installed and configured on the NameNode. This embodiment uses the log records of all security devices of the Shaanxi Li'an Electric Supermarket, a file of 16 GB. The logs are updated on a daily schedule as required, and query results are tallied during the update job.

The method enables fast statistical queries through Hive. Its advantages: the learning cost is low, and simple MapReduce statistics can be produced quickly through SQL-like statements, without developing dedicated MapReduce applications, which suits the statistical analysis of a data warehouse well. Partitioning speeds up queries over data shards and improves query efficiency. The k-means algorithm is implemented with MapReduce, the output of the k-means clustering is assessed for security level, and prompts are issued promptly for IPs with a high proportion of alarms and high danger levels, so that the latent value of the log data is mined.

Claims (4)

1. A Hadoop-based network security log k-means cluster analysis system, characterized by comprising a log data acquisition subsystem (11), a log data hybrid-mechanism storage management subsystem (12), and a log data analysis subsystem (13);

the log data acquisition subsystem (11) collects the network security log data of all devices;

the log data hybrid-mechanism storage management subsystem (12) manages and stores all log data;

the log data analysis subsystem (13) performs fast query and analysis processing on all log data, and mines and analyzes the log data's latent value;

in a Linux environment, the log data acquisition subsystem (11) configures a Syslogd centralized log server, collects and records device and system log data via Syslog, and manages the log data centrally;

the basic file access process of the HDFS distributed file system module is: 1) the application sends the file name to the NameNode through the HDFS client program; 2) after the NameNode receives the file name, it looks up the corresponding data blocks in the HDFS directory, finds from the block information the addresses of the DataNodes holding those blocks, and sends these addresses back to the client; 3) after receiving the DataNode addresses, the client performs data transfer operations with these DataNodes in parallel, and at the same time submits logs of the operation results to the NameNode;

the Hadoop platform/traditional data warehouse collaboration module configures a MySQL database as Hive's metastore for storing Hive's schema information, and transfers data between the traditional data warehouse and the Hadoop platform through the Sqoop tool;

the log data analysis subsystem (13) mainly uses the Hive tool for simple statistical query analysis of the data: HiveQL query statements are written according to requirements and, driven by the Hive Driver, lexical analysis, parsing, compilation, optimization, and query plan generation are completed; the generated query plan is stored in HDFS and subsequently executed via MapReduce; for analysis of the latent information in the log data, its intrinsic value is mined by writing MapReduce programs implementing the corresponding algorithms.

2. The Hadoop-based network security log k-means cluster analysis system according to claim 1, characterized in that the log data hybrid-mechanism storage management subsystem (12) integrates the Hadoop platform and the traditional data warehouse, and includes an HDFS distributed file system module and a Hadoop platform/traditional data warehouse collaboration module.

3. A Hadoop-based network security log k-means cluster analysis system, characterized in that, with a Hadoop cluster as the platform, the traditional data warehouse and the big data platform are integrated, the MySQL database serves as Hive's metastore for storing Hive's schema information, and the Sqoop tool transfers data between the traditional data warehouse and the big data platform; the system comprises a data source layer (21), a data storage layer (22), a computing layer (23), a data analysis layer (24), and a result display layer (25);

the data source layer (21) collects the log data of all devices by configuring a Syslogd centralized log server, then imports the log data from the traditional data warehouse into the data storage layer through the Sqoop tool;

the data storage layer (22) adopts a hybrid storage architecture in which Hadoop cooperates with the traditional data warehouse: data are imported into HDFS through the Sqoop data transfer tool, the metadata are processed, and the processed data are imported into the corresponding Hive tables;

at the result display layer (25), the user issues a request to the data analysis layer;

the data analysis layer (24) converts the user's request into the corresponding HiveQL statement and, driven by the Hive Driver, completes the execution;

the computing layer (23) receives instructions from the Hive engine and, through the HDFS of the data storage layer in cooperation with MapReduce, processes and analyzes the data, finally returning the results to the result display layer.

4. A method of cluster analysis using the system of claim 1, characterized by comprising the following steps:

1) log data preprocessing: convert the Syslog_incoming_mes file containing the textual content of the log description information into a text vector file;

2) MapReduce-based implementation of the k-means algorithm: run the k-means clustering algorithm on the text vectors;

the log data preprocessing comprises the following steps: 1) remove function words: strip words without substantive meaning from the log description text; 2) tag parts of speech: the system uses the english-left3words-distsim.tagger tagger to mark the words in each log description; 3) extract useful words: after tagging, the system extracts nouns (NN, NNS, NNP, NNPS), verbs (VB, VBP, VBN, VBD), and adjectives (JJ, JJR, JJS), words that carry substantive meaning and accurately express the log information; 4) build the frequent dictionary: count word frequencies over all records, where high-frequency words are representative within the description domain and words above a threshold become keyword elements; the selected frequent dictionary can effectively express the log information; 5) generate the text vector file: compare each log description field against the frequent dictionary to obtain a keyspace composed of a string of 0s and 1s, the set of all keyspaces constituting the text vector file;

the MapReduce-based implementation of the k-means algorithm comprises the following steps: 1) scan all points in the original data set and randomly select k points as the initial cluster centers; 2) each Map node reads its locally stored data set and generates cluster sets with the k-means algorithm, and in the Reduce stage the cluster sets are merged into new global cluster centers, this process repeating until the termination condition is satisfied; 3) partition all data elements into clusters according to the finally generated cluster centers.
CN201510553636.3A (priority and filing date 2015-09-02): Hadoop-based network security log k-means cluster analysis system and method; granted as CN105138661B; status: Expired - Fee Related

Priority Applications (1)

CN201510553636.3A (priority date 2015-09-02, filing date 2015-09-02): Hadoop-based network security log k-means cluster analysis system and method

Publications (2)

CN105138661A: published 2015-12-09
CN105138661B: granted 2018-10-30

Family

ID=54724008

Family Applications (1)

CN201510553636.3A (priority date 2015-09-02, filing date 2015-09-02): granted as CN105138661B, Expired - Fee Related

Country Status (1)

CN: CN105138661B (en)

Patent Citations (5)

CN102999633A (priority 2012-12-18, published 2013-03-27): Cloud cluster extraction method of network information
CN103399887A (priority 2013-07-19, published 2013-11-20): Query and statistical analysis system for mass logs
CN103425762A (priority 2013-08-05, published 2013-12-04): Telecom operator mass data processing method based on Hadoop platform
CN103544328A (priority 2013-11-15, published 2014-01-29): Parallel k-means clustering method based on Hadoop
CN104616205A (priority 2014-11-24, published 2015-05-13): Distributed log analysis based operation state monitoring method of power system

Non-Patent Citations (2)

付伟等: 一种基于Hadoop和K_means的Web日志分析方案的设计 (Design of a Web log analysis scheme based on Hadoop and K-means), 第十九届全国青年通信学术年会论文集, 2014-10-15
王帅等: 网络安全分析中的大数据技术应用 (Big data technology applications in network security analysis), 电信科学, 2015(7)

Also Published As

CN105138661A: published 2015-12-09

Legal Events

  • C06 / PB01: Publication
  • C10 / SE01: Entry into substantive examination (entry into force of request for substantive examination)
  • GR01: Patent grant
  • CF01: Termination of patent right due to non-payment of annual fee (granted publication date: 2018-10-30; termination date: 2020-09-02)