KR20250041463A

KR20250041463A - A system for predicting network load and detecting failures based on artificial intelligence and method using the same

Info

Publication number: KR20250041463A
Application number: KR1020230124217A
Authority: KR
Inventors: 강경훈; 김정대; 김부일
Original assignee: 주식회사 케이벨
Priority date: 2023-09-18
Filing date: 2023-09-18
Publication date: 2025-03-25

Abstract

The present invention relates to a system and method for artificial intelligence-based network load prediction and failure detection. The system comprises a collection server that collects network traffic generated by virtualized resources in a hybrid cloud network; a big data server that stores the collected network traffic and its analysis results in a database; a control server that provides the traffic collection results and network status information; and a learning server that predicts network load and classifies system and service abnormalities and failures through machine learning. The invention further relates to a virtual machine traffic collection device included in the system and its operating method, and to a traffic collection method for operating the device.

Description

{A system for predicting network load and detecting failures based on artificial intelligence and method using the same}

The present invention relates to a system for artificial intelligence-based network load prediction and failure detection and an operating method thereof, and more specifically to detecting abnormalities and failures of systems and services by using an artificial intelligence model to predict, in advance, the load of network traffic generated in a cloud computing environment.

Recently, with advances in ICT, the conventional computing environment, which depended on the hardware performance of each individual terminal, has been evolving into cloud computing, which provides services by utilizing all computing resources on the network.

Cloud computing is being realized across a variety of service areas, but many problems remain to be solved, such as data protection, resource management, availability assurance, and privacy protection.

In particular, as cloud computing based on virtual machines in virtualized environments spreads rapidly, more ICT resources are being migrated to the cloud or replaced by the services of cloud providers, and the traffic used by cloud services is increasing exponentially.

Such growth in traffic degrades service performance or causes quality problems, which leads to service delays and lower service utilization, and in turn to reduced productivity and revenue.

The cause of a performance degradation must therefore be identified quickly and addressed as soon as possible, and traffic collection and analysis have become more important than anything else for identifying that cause.

FIG. 1 is a conceptual diagram of a cloud control platform. As shown in FIG. 1, conventional network load prediction and abnormality/failure detection collects the network traffic of variously configured clouds into the cloud control platform, monitors it, and notifies the service manager when a predefined threshold is reached, that is, a threshold the user sets directly on the basis of past collection statistics. The network is thus controlled and managed system by system rather than in a service-centric manner.

However, system operating costs vary greatly depending on how these preset thresholds are chosen.

For example, if the threshold is set too low, notifications are raised even under low load, so the service manager must check the system status frequently, and the cost of operating the system rises through additional operating personnel and systems.

Conversely, if the threshold is set too high, the slack in system resources delays the recognition of abnormal system states and makes it difficult to guarantee service quality. Moreover, whenever the application service or the network environment of the service platform changes, considerable time and effort must be spent on testing and verification to reset the threshold.

Accordingly, there is a need for an automated method that adaptively predicts an appropriate load threshold even as the network environment changes, so that service quality can be guaranteed without the user adjusting the threshold.

In this regard, Patent Document 1 provides an artificial intelligence model training method, server, and computer program for IT service failure prediction, which enable more accurate failure prediction by determining the likelihood of an IT service failure using two failure prediction models trained on indicator information comprising various indicators of the IT service. However, it discloses only an algorithm that extracts a future critical load value from the indicator information of the IT service to determine the likelihood of a service failure; it does not disclose the operation of a system with concrete control functions.

Korean Laid-open Patent Publication No. 10-2023-0063413

To solve the above problems, an object of the present invention is to provide an intelligent load prediction and abnormality/failure detection method that introduces a deep neural network-based artificial intelligence technique to proactively predict the network load after a certain period of time. The method provides a function for reporting system abnormality/failure states caused by increased service load, and a function for reporting service abnormality/failure states based on the service manager's inspection results.

To achieve the above object, a system for artificial intelligence-based network load prediction and failure detection according to the present invention comprises: a collection server that collects network traffic and network metrics generated by virtual machines in a hybrid cloud network; a big data server that stores, in a database, the network traffic collected by the collection server and the analysis results of that traffic; a control server that provides the network traffic collection results and network status information to a service manager; and a learning server that uses the network traffic collected from the big data server to predict network load through artificial intelligence-based machine learning and thereby predicts and classifies system and service abnormalities and failures.

In addition, in the system for artificial intelligence-based network load prediction and failure detection according to the present invention, the network traffic is NetFlow, and the collection server comprises: a NetFlow collection unit that collects at least one of the source IP, destination IP, port number, IP protocol type, service type, and AS number contained in the NetFlow; a NetFlow analysis unit that uses the NetFlow to analyze at least one of event traces, IPs that generate excessive traffic, virus-infected PCs, and traffic types; and a NetFlow storage unit that stores, in a database, the NetFlow collected by the NetFlow collection unit and the results produced by the NetFlow analysis unit.

In addition, in the system for artificial intelligence-based network load prediction and failure detection according to the present invention, the collection server further comprises an API manager that collects, through APIs, network traffic generated by legacy servers in the hybrid cloud network and by the virtual machines of an OpenStack-based private cloud and of public clouds.

In addition, in the system for artificial intelligence-based network load prediction and failure detection according to the present invention, the control server comprises: an intelligent load prediction unit that predicts the network load for a certain period of time; a system abnormality/failure detection unit that uses the load prediction results to predict abnormality/failure states of the network; and a service abnormality/failure classification unit that classifies the expected service state on the basis of the abnormality/failure state.

In addition, in the system for artificial intelligence-based network load prediction and failure detection according to the present invention, the learning server comprises: a load prediction learning unit that predicts the load generated by the network traffic; and a service abnormality/failure learning unit that uses the predicted load to classify service abnormalities and failures resulting from abnormal operation of the network system after a predetermined time has elapsed.

In addition, in the system for artificial intelligence-based network load prediction and failure detection according to the present invention, the load prediction learning unit comprises a deep neural network-based artificial intelligence model having an input layer, a hidden layer, and an output layer, and an LSTM model that takes the result of the output layer as its input.
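The data flow described here, a feed-forward network whose output layer feeds an LSTM, can be illustrated with a minimal pure-Python sketch. Everything specific in it is an assumption made for illustration: the layer sizes, the ReLU activation, the single-unit LSTM, and the randomly initialised (untrained) weights are not taken from the specification, and the names `dense_relu` and `lstm_step` are hypothetical.

```python
import math
import random


def dense_relu(x, weights, biases):
    """Fully connected layer with ReLU: one output value per weight row."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x, h, c, params):
    """One LSTM time step with a single hidden unit (assumed size)."""
    z = x + [h]  # input vector concatenated with the previous hidden state

    def pre(gate):
        w, b = params[gate]
        return sum(wi * zi for wi, zi in zip(w, z)) + b

    i, f, o = sigmoid(pre("i")), sigmoid(pre("f")), sigmoid(pre("o"))
    g = math.tanh(pre("g"))
    c = f * c + i * g      # updated cell state
    h = o * math.tanh(c)   # updated hidden state
    return h, c


rng = random.Random(0)  # fixed seed; the weights are untrained


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


n_in, n_hidden, n_out = 6, 4, 2  # assumed layer widths
w_hidden, b_hidden = rand_matrix(n_hidden, n_in), [0.0] * n_hidden
w_out, b_out = rand_matrix(n_out, n_hidden), [0.0] * n_out
lstm_params = {gate: (rand_matrix(1, n_out + 1)[0], 0.0) for gate in "ifgo"}

# A short sequence of already-normalised feature vectors (made-up values).
sequence = [
    [0.0, 0.5, 0.5, 0.2, 0.3, 0.1],
    [0.0, 0.5, 0.6, 0.3, 0.4, 0.2],
    [0.0, 0.5, 0.7, 0.6, 0.7, 0.5],
]

h, c = 0.0, 0.0
for x in sequence:
    # Input layer -> hidden layer -> output layer, then into the LSTM.
    dnn_out = dense_relu(dense_relu(x, w_hidden, b_hidden), w_out, b_out)
    h, c = lstm_step(dnn_out, h, c, lstm_params)

prediction = h  # untrained value in (-1, 1); only the data flow is meaningful
```

Because the weights are random, the value of `prediction` carries no meaning; the sketch only shows how each time step passes through the dense layers before updating the recurrent state.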

In addition, in the system for artificial intelligence-based network load prediction and failure detection according to the present invention, the input layer consists of seven nodes, and the input nodes receive day-of-week information, hour information, minute information, the input flow rate, the input packet rate, and the input byte count.
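As an illustration of this input layout, the following minimal sketch builds one input vector from a timestamped traffic sample. The names `TrafficSample` and `to_feature_vector` are hypothetical. The text names six inputs for seven nodes and does not identify the seventh, so the sketch covers only the six that are listed; the scaling of the time fields is likewise an assumption.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TrafficSample:
    """One aggregated measurement interval (hypothetical structure)."""
    timestamp: datetime
    flows_per_sec: float    # input flow rate
    packets_per_sec: float  # input packet rate
    byte_count: int         # input byte count in the interval


def to_feature_vector(s: TrafficSample) -> list[float]:
    """Map a sample to the six inputs named in the text.

    Time fields are scaled to [0, 1]; traffic fields are passed through
    unscaled. Both choices are assumptions of this sketch.
    """
    return [
        s.timestamp.weekday() / 6.0,  # day of week, Monday = 0
        s.timestamp.hour / 23.0,      # hour of day
        s.timestamp.minute / 59.0,    # minute of hour
        s.flows_per_sec,
        s.packets_per_sec,
        float(s.byte_count),
    ]


sample = TrafficSample(datetime(2023, 9, 18, 12, 30), 120.0, 4500.0, 3_200_000)
features = to_feature_vector(sample)
```

In practice the traffic fields would probably also be normalised before training, but the specification does not say how, so they are left raw here.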

In addition, in the system for artificial intelligence-based network load prediction and failure detection according to the present invention, the service abnormality/failure learning unit consists of a two-stage deep neural network-based artificial intelligence model: the first-stage model comprises an input layer that receives status information from one of a plurality of systems, a hidden layer, and an output layer; and the second-stage model comprises an input layer that takes the result of the first-stage output layer as its input, a hidden layer, and an output layer.

In addition, in the system for artificial intelligence-based network load prediction and failure detection according to the present invention, the input layer of the first-stage deep neural network-based artificial intelligence model consists of three input nodes, which receive the system's input byte count, the system's output byte count, and the system's status information.

In addition, a method for artificial intelligence-based network load prediction and failure detection according to the present invention comprises: a first step of collecting NetFlow from a plurality of virtual machines in a hybrid cloud network; a second step of analyzing the collected NetFlow, classifying it into network traffic information and status information, and storing these in a database; a third step of predicting the network load from the traffic information; a fourth step of detecting system abnormalities and failures from the predicted load; and a fifth step of classifying abnormalities and failures of the network service according to the detected system state, wherein the third and fifth steps are performed by artificial intelligence models trained through machine learning.
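The order of the five steps can be sketched as a small pipeline. This is only a structural illustration under stated assumptions: all function names are hypothetical, the sample data is made up, and the moving-average predictor and the rule-based classifier are simple placeholders standing in for the machine-learned models of the third and fifth steps.

```python
from statistics import mean


def collect_flows():
    """Step 1: stand-in for NetFlow collection; returns bytes per interval."""
    return [100, 110, 105, 95, 100, 400]


def analyze_and_store(byte_counts, database):
    """Step 2: derive traffic information and persist it (here, a list)."""
    database.extend(byte_counts)
    return database


def predict_load(history, window=5):
    """Step 3: placeholder predictor (moving average), not a learned model."""
    return mean(history[-window:])


def detect_system_anomaly(predicted, observed, tolerance=0.5):
    """Step 4: flag the system when observed load leaves the predicted band."""
    return abs(observed - predicted) > tolerance * predicted


def classify_service_state(system_anomalous):
    """Step 5: placeholder rule standing in for the learned classifier."""
    return "service-anomaly" if system_anomalous else "normal"


db = analyze_and_store(collect_flows(), [])
history, latest = db[:-1], db[-1]
predicted = predict_load(history)
state = classify_service_state(detect_system_anomaly(predicted, latest))
```

With the made-up data above, the last interval lies far outside the band around the predicted load, so the sketch reports a service anomaly.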

In addition, the method for artificial intelligence-based network load prediction and failure detection according to the present invention further comprises, after the fifth step, a step of notifying the service manager, by alarm, of the result of a comparison with system and service abnormality/failure situations that occurred in the past.

As described above, the system and operating method for artificial intelligence-based network load prediction and failure detection have the advantage of helping to add resources, or remove unnecessary ones, in order to satisfy users' quality-of-service requirements.

In addition, the system and operating method for artificial intelligence-based network load prediction and failure detection can detect a decrease or increase in load caused by a system failure, allowing the service provider to respond quickly.

Furthermore, the system and operating method for artificial intelligence-based network load prediction and failure detection classify service abnormalities and failures through artificial intelligence-based machine learning, so that the service provider can be notified promptly if a similar situation arises in the future, enabling high-quality service and stable, low-cost system operation.

FIG. 1 is a conceptual diagram of a conventional network load prediction and abnormality/failure detection system.
FIG. 2 is a basic conceptual diagram of an SD-WAN-based hybrid cloud network according to an embodiment of the present invention.
FIG. 3 is a basic configuration diagram of a system for artificial intelligence-based network load prediction and failure detection according to an embodiment of the present invention.
FIG. 4 shows the connection configuration and functional block diagram of a collection server according to an embodiment of the present invention.
FIG. 5 is a functional block diagram of a control server according to an embodiment of the present invention.
FIG. 6 is a functional block diagram of an intelligent load prediction unit according to an embodiment of the present invention.
FIG. 7 is a functional block diagram of a service abnormality/failure classification unit according to an embodiment of the present invention.
FIG. 8 is a functional block diagram of a learning server according to an embodiment of the present invention.
FIG. 9 shows the artificial intelligence model structure of a load prediction model according to an embodiment of the present invention.
FIG. 10 shows the artificial intelligence model structure of a service abnormality/failure classification model according to an embodiment of the present invention.
FIG. 11 is a flowchart illustrating the operation of artificial intelligence-based network load prediction and failure detection according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice it. In describing the operating principles of the preferred embodiments, however, detailed descriptions of related known functions or configurations are omitted where they would unnecessarily obscure the gist of the invention.

The same reference numerals are used throughout the drawings for parts having similar functions and operations. Throughout the specification, when a part is said to be connected to another part, this includes not only direct connection but also indirect connection through another intervening function. Likewise, stating that something includes a component means that it may further include other components, not that other components are excluded, unless specifically stated otherwise.

FIG. 2 is a basic conceptual diagram of an SD-WAN-based hybrid cloud network.

Referring to FIG. 2, an SD-WAN (Software-Defined Wide Area Network)-based hybrid cloud network is a network architecture that combines hybrid cloud and SD-WAN technologies. It uses SD-WAN technology to provide connectivity, management, and optimization for network infrastructure that spans both private and public cloud environments.

Here, SD-WAN is virtualized or appliance-based software that uses centralized policy control and management to route traffic between remote locations smoothly, securely, and efficiently over existing branch router connections; virtualization decouples the networking and control functions from the physical hardware devices.

As shown in FIG. 2, the various kinds of network traffic arising in private clouds, public clouds, and legacy networks are typically collected together by an intelligent cloud control platform, and analysis of the collected information can in turn be used to control the protocol stack that makes up the cloud network.

The present invention is characterized by an SD-WAN-based intelligent hybrid cloud network control platform that collects and stores, on a big data basis, infrastructure (networking, computing, etc.) resources (NetFlow, syslog, SNMP, etc.) of the SD-WAN-based hybrid cloud network (20) environment; provides artificial intelligence-based load prediction and abnormality/failure detection for systems and services; and provides flow-based traffic monitoring, analysis, and control for the entire network.

The embodiment below is described with a focus on an SD-WAN-based hybrid cloud network, but the invention is not limited to it and can be applied to various cloud platforms.

FIG. 3 is a basic configuration diagram of a system for artificial intelligence-based network load prediction and failure detection according to an embodiment of the present invention. The system (10) for network load prediction and failure detection may include a collection server (100), a big data server (200), a control server (300), and a learning server (400).

Referring to FIG. 3, the collection server (100) collects and analyzes NetFlow, a type of virtualized traffic generated by the virtual machines in the hybrid cloud network (20), together with network metrics.

Here, NetFlow is a networking protocol designed by Cisco Systems to record the traffic flows sent and received within a network, and it is effective for monitoring network behavior, traffic patterns, and bandwidth usage. Network metrics are the various measurements observed on the network (for example, delay, loss, and throughput).

The collection server (100) includes an API manager and can collect, through APIs, network traffic generated by legacy servers in the hybrid cloud network (20), by virtual machines of an OpenStack-based private cloud, and by virtual machines of public clouds based on AWS, Azure, or GCP.

The big data server (200) stores the network traffic collected by the collection server (100) and the traffic analysis results in a database, and provides training or test data for machine learning.

The control server (300) provides traffic collection results and network status information to the network operator (30), and controls the cloud's control and interworking protocols.

The learning server (400) uses the network traffic information collected from the big data server (200) to predict network load through artificial intelligence-based machine learning, and on that basis provides artificial intelligence models that predict and classify system and service abnormalities and failures.

Each server is described in detail below.

FIG. 4 shows the connection configuration and functional block diagram of a collection server according to an embodiment of the present invention.

As shown in FIG. 4, the collection server (100) consists of a NetFlow collection unit (110), a NetFlow analysis unit (120), and a NetFlow storage unit (130).

First, taking a private cloud as an example of NetFlow generation: traffic produced by a plurality of virtual machines (VMs, which are configured on compute nodes in the private cloud platform and act as traffic generators, that is, the sources of traffic) passes through a switch (also called a "bridge", which connects to the input/output interfaces of the virtual machines and performs network management for sending, receiving, and switching traffic to and from other virtual machines) and is converted into NetFlow by a separate NetFlow generator. The generated NetFlow is delivered to the collection server (100) over an external network.

In this embodiment the NetFlow generator is described on the assumption that it is implemented in a virtual machine in order to collect and analyze traffic, but it may equally be implemented on a physical server rather than a virtual machine, and is not limited in this respect.

Referring to FIG. 4, the NetFlow collection unit (110) can collect the source IP, destination IP, port number, IP protocol type, timestamp, service type, AS number, packet and byte counts, and other fields contained in the NetFlow.
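The collected fields map naturally onto a record type. The sketch below is one possible representation, not the structure used by the invention: the class name `FlowRecord`, the field names, the split into source and destination ports and AS numbers, and the millisecond timestamps are all assumptions made for illustration.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address, IPv6Address, ip_address
from typing import Union

IPAddress = Union[IPv4Address, IPv6Address]


@dataclass(frozen=True)
class FlowRecord:
    """One collected flow (hypothetical field layout)."""
    src_ip: IPAddress
    dst_ip: IPAddress
    src_port: int
    dst_port: int
    protocol: int       # IP protocol number, e.g. 6 = TCP, 17 = UDP
    tos: int            # service type (type-of-service byte)
    src_as: int         # source autonomous system number
    dst_as: int         # destination autonomous system number
    first_seen_ms: int  # timestamp of the first packet in the flow
    last_seen_ms: int   # timestamp of the last packet in the flow
    packets: int
    octets: int         # byte count

    @property
    def duration_ms(self) -> int:
        return self.last_seen_ms - self.first_seen_ms


record = FlowRecord(
    src_ip=ip_address("10.0.0.5"),
    dst_ip=ip_address("192.0.2.10"),
    src_port=51234,
    dst_port=443,
    protocol=6,
    tos=0,
    src_as=64512,
    dst_as=64513,
    first_seen_ms=1_000,
    last_seen_ms=4_500,
    packets=42,
    octets=61_440,
)
```

Making the record immutable (`frozen=True`) is a design choice of the sketch: a collected flow is an observation and should not change after it has been stored.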

The NetFlow analysis unit (120) uses the information gathered by the NetFlow collection unit (110) to track events and to identify IPs that generate excessive traffic, virus-infected PCs, and the types of traffic that dominate the network. Collection and analysis are supported for multiple versions, including NetFlow v5, NetFlow v9, and IPFIX (Internet Protocol Flow Information Export).
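Two of the analyses mentioned here, finding the IPs that generate the most traffic and finding which traffic type dominates, reduce to simple aggregations over flow records. The following is a minimal sketch under assumptions: flows are represented as hypothetical `(source IP, IP protocol number, byte count)` tuples with made-up values, and "traffic type" is approximated by the IP protocol number.

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Return the n source IPs that sent the most bytes, largest first."""
    totals = Counter()
    for src_ip, _protocol, octets in flows:
        totals[src_ip] += octets
    return totals.most_common(n)


def bytes_by_protocol(flows):
    """Total bytes per IP protocol number (6 = TCP, 17 = UDP)."""
    totals = Counter()
    for _src_ip, protocol, octets in flows:
        totals[protocol] += octets
    return dict(totals)


# Made-up flows: (source IP, IP protocol number, byte count)
flows = [
    ("10.0.0.5", 6, 1_200_000),
    ("10.0.0.7", 17, 300_000),
    ("10.0.0.5", 6, 800_000),
    ("10.0.0.9", 6, 50_000),
]

heaviest = top_talkers(flows, n=2)
per_protocol = bytes_by_protocol(flows)
```

Detecting infected hosts would need additional signals beyond byte totals, which the text does not specify, so it is left out of the sketch.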

The NetFlow storage unit (130) stores, in a database, the NetFlow information collected by the NetFlow collection unit (110) and the traffic analysis results produced by the NetFlow analysis unit (120).

FIG. 5 is a functional block diagram of a control server according to an embodiment of the present invention.

Referring to FIG. 5, the control server (300) prevents in advance system failures that could lead to service failures; when an abnormality/failure state occurs, notifies the service manager (30) by alarm to support rapid failure analysis; and, when a system failure occurs, helps analyze the service abnormality/failure situation on the basis of the service manager's (30) experience.

As shown in FIG. 5, the control server (300) consists of an intelligent load prediction unit (310) that predicts the network load over a certain period of time; a system abnormality/failure detection unit (320) that uses the predicted load value to predict the abnormality/failure state of the network traffic currently arriving; and a service abnormality/failure classification unit (330) that learns from the system abnormality/failure detection results and the service manager's operation results to classify the state expected when a system abnormality/failure occurs.

The intelligent load prediction unit (310) and the service abnormality/failure classification unit (330) are described in detail below with reference to FIGS. 6 and 7.

The system abnormality/failure detection unit (320) detects traffic for which the data throughput handled by each system constituting the service falls outside the prescribed predicted range. Such a deviation may be caused by a failure of the system itself, abnormal operation of an external interworking system, or a sudden surge of service requests.

The deviation can arise in various situations. If the system itself fails, incoming throughput may increase because requests that cannot be processed are retransmitted in large numbers, or it may decrease because the system is recognized as failed. Conversely, a failure of an external interworking system may cause incoming throughput to drop suddenly, or to rise because of a malfunction. Throughput may also rise because of external factors such as a DoS attack.

To diagnose these conditions, the required throughput range can be set dynamically from the output of the load prediction model described above. The incoming throughput is compared with the resulting thresholds, and the manager is notified by an alarm.
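The comparison described here can be sketched as follows. The symmetric tolerance around the predicted value, the `LoadBand` type, and the returned status strings are assumptions chosen for illustration; the specification states only that the range is derived from the load prediction and compared with the incoming throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadBand:
    """Acceptable throughput range derived from a predicted load."""
    lower: float
    upper: float


def band_from_prediction(predicted: float, tolerance: float = 0.5) -> LoadBand:
    """Build a range of +/- `tolerance` (a fraction) around the prediction.

    The symmetric-fraction rule is an assumption of this sketch.
    """
    if predicted < 0 or tolerance < 0:
        raise ValueError("predicted load and tolerance must be non-negative")
    return LoadBand(predicted * (1 - tolerance), predicted * (1 + tolerance))


def check_throughput(observed: float, band: LoadBand) -> str:
    """Classify observed throughput against the predicted range."""
    if observed > band.upper:
        return "above_range"   # e.g. retransmission storm or DoS-like surge
    if observed < band.lower:
        return "below_range"   # e.g. upstream failure or system marked as failed
    return "normal"
```

Any status other than `"normal"` would correspond to the case in which the manager is alerted.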

FIG. 6 is a functional block diagram of an intelligent load prediction unit according to an embodiment of the present invention. The intelligent load prediction unit (310) comprises a load prediction model unit (311), a load prediction judgment unit (312), and a load notification unit (313).

Referring to FIG. 6, when processing requests exceed the maximum data throughput that each network system constituting the service (for example, a virtual machine that implements the service) can handle, the intelligent load prediction unit (310) can cause the data to be discarded, or buffered and processed with a delay, according to the applicable network or service policy.

In addition, the intelligent load prediction unit (310) should be able to predict the future system load from the recent trend in system load. Based on the prediction, quality degradation can be prevented by increasing system capacity or by installing additional systems to distribute the load, and the predicted value can be compared with a threshold set in the system so that the service manager is notified by an alarm.

The intelligent load prediction unit (310) uses a load prediction model obtained through artificial intelligence-based machine learning. The model is trained on learning data covering a certain period, stored in the big data server (200), to predict the data 15 or 30 minutes ahead.
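One way to prepare training data for such a look-ahead prediction is to pair a window of past samples with the value observed a fixed number of steps later. The sketch below assumes equally spaced samples (for example, one sample every 5 minutes, so that 3 steps correspond to 15 minutes); the sampling interval and the function name `make_training_pairs` are assumptions, not details given in the specification.

```python
from typing import Sequence


def make_training_pairs(
    series: Sequence[float], lookback: int, horizon: int
) -> list[tuple[list[float], float]]:
    """Pair each window of `lookback` samples with the value `horizon` steps
    after the last sample of that window."""
    if lookback < 1 or horizon < 1:
        raise ValueError("lookback and horizon must be at least 1")
    pairs = []
    last_start = len(series) - lookback - horizon
    for start in range(last_start + 1):
        window = list(series[start:start + lookback])
        target = series[start + lookback + horizon - 1]
        pairs.append((window, target))
    return pairs
```

With 5-minute samples, `horizon=3` would yield 15-minute-ahead targets and `horizon=6` would yield 30-minute-ahead targets.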

Referring to FIG. 6, the load prediction model unit (311) uses the load prediction model to predict the future system load from the recent trend in system load.

The load prediction judgment unit (312) uses the trained model and the currently incoming metrics to predict the future system load from the recent trend in system load, compares the prediction with the threshold set in the system, and notifies the service manager by an alarm through the load notification unit (313).

In addition, the predicted load value can be used to set a system abnormality/failure threshold, which is compared with the load actually collected after a certain period of time in order to predict abnormal operation of the system. The service abnormality/failure classification model is trained on failure types compiled from the analysis results of service managers who received system abnormality/failure alerts, together with information about the service system. When a system failure occurs, this model classifies the service abnormality or failure and thereby helps the manager determine whether the system has actually failed.

The load notification unit (313) may additionally determine, according to the degree of load judged by the load prediction judgment unit (312), whether to notify the service manager immediately by an alarm or to notify only when the load is likely to cause a system or service abnormality or failure.

FIG. 7 is a functional block diagram of a service abnormality/failure classification unit according to an embodiment of the present invention.

Referring to FIG. 7, the service abnormality/failure classification unit (331) classifies service abnormalities and failures. When the system abnormality/failure detection unit (320) raises an alarm indicating an abnormality/failure state, the classification is performed by the model machine-learned in the service abnormality/failure learning unit (420) shown in FIG. 8.

The service abnormality/failure status notification unit (332) notifies the service manager, by an alarm, of the abnormality or failure classified by the service abnormality/failure classification unit (331).

FIG. 8 is a functional block diagram of a learning server according to an embodiment of the present invention, and FIG. 9 shows the artificial intelligence model structure of a load prediction model according to an embodiment of the present invention. FIG. 10 shows the artificial intelligence model structure of a service abnormality/failure classification model according to an embodiment of the present invention.

In general, changes in the network environment caused by limited computing resources and by unexpected events that are hard to foresee lead to changes in load, that is, network traffic, which is one aspect of service quality. A means of predicting and responding to such changes in advance is therefore required first. However, it is very difficult to predict a constantly changing load simply by measuring traffic volume and applying fixed quantitative thresholds, and techniques that predict load through artificial intelligence-based machine learning have recently been introduced for this purpose.

Depending on the service manager's analysis, a service abnormality may turn out to be a failure, or it may simply arise because service requests have dropped well below their usual level. A model is therefore needed that learns the service manager's operation results and the state of the surrounding systems at the time a predicted alarm occurred, and that makes a judgment by comparison with past events when a system alarm occurs. Even if such a model offers little benefit at first, it can reduce the service manager's fatigue as time passes.

As illustrated in FIG. 8, the learning server (400) comprises a load prediction learning unit (410) and a service abnormality/failure learning unit (420).

The load prediction learning unit (410) predicts the input traffic volume. Although traffic in a network service is measured separately as input traffic and output traffic, in practice it is the input traffic that affects the system in most cases.

As illustrated in FIG. 9, the load prediction learning unit (410) comprises a deep neural network (DNN) structure of three layers, namely an input layer (411), a hidden layer (412), and an output layer (413), and an LSTM model (414) that takes as input the results from a certain period in the past.

The input nodes of the input layer (411) receive seven input parameters: day-of-week information, hour information, minute information, number of input flows, number of input packets, number of input bytes, and input byte ratio (number of input bytes / total number of bytes). The parameters are not limited to these, and input parameters may be added, removed, or changed.
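The seven input parameters can be assembled into a feature vector as in the sketch below. The function name `build_features`, the numeric encoding of the day of the week (Monday = 0, as returned by Python's `datetime.weekday()`), and the handling of a zero total byte count are assumptions of this example; the specification lists the parameters but does not prescribe their encoding.

```python
from datetime import datetime


def build_features(
    timestamp: datetime,
    in_flows: int,
    in_packets: int,
    in_bytes: int,
    total_bytes: int,
) -> list[float]:
    """Return the seven inputs in the order listed in the description."""
    # Guard against division by zero when no traffic was observed (assumption).
    in_byte_ratio = in_bytes / total_bytes if total_bytes > 0 else 0.0
    return [
        float(timestamp.weekday()),  # day of week, Monday = 0
        float(timestamp.hour),       # hour information
        float(timestamp.minute),     # minute information
        float(in_flows),             # number of input flows
        float(in_packets),           # number of input packets
        float(in_bytes),             # number of input bytes
        in_byte_ratio,               # input bytes / total bytes
    ]
```

In practice the raw counts would usually be scaled before being fed to a neural network; scaling is omitted here because the specification does not describe it.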

The input parameters of the input layer (411) are processed through the hidden layer (412), and the predicted incoming traffic value is finally output through the output layer (413).

The LSTM (Long Short-Term Memory) model (414) receives the predicted traffic output from the output layer (413) as a time series at regular intervals and predicts the current traffic using the past information it has stored.

The load prediction learning unit (410) divides the collected learning data into 80% training data and 20% test data, performs model training and testing, and can repeat training until the test accuracy reaches or exceeds a certain value.
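The 80/20 split and the repeat-until-accurate loop can be sketched as follows. The split keeps the original order of the samples, which is an assumption of this example (the specification does not say whether data is shuffled), and `train_step` and `evaluate` are hypothetical callables standing in for the actual model training and testing. The `max_rounds` cap is also an addition so that the loop always terminates.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def split_80_20(samples: Sequence[T]) -> tuple[list[T], list[T]]:
    """Use the first 80% of samples for training and the rest for testing."""
    cut = len(samples) * 8 // 10
    return list(samples[:cut]), list(samples[cut:])


def train_until(
    train_step: Callable[[], None],
    evaluate: Callable[[], float],
    target_accuracy: float,
    max_rounds: int = 100,
) -> int:
    """Repeat training until test accuracy reaches the target.

    Returns the number of rounds performed.
    """
    for round_number in range(1, max_rounds + 1):
        train_step()
        if evaluate() >= target_accuracy:
            return round_number
    return max_rounds
```

Keeping the samples in chronological order avoids testing on data that precedes the training data, which is a common precaution for time-series models.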

As illustrated in FIG. 10, the service abnormality/failure learning unit (420) comprises a first DNN (421) and a second DNN (422), each having a three-layer deep neural network (DNN) structure of an input layer, a hidden layer, and an output layer.

The input nodes of the input layer of the first DNN (421) receive three parameters from one of a plurality of cloud network systems (such as a legacy server in the hybrid cloud network, a virtual machine of an OpenStack-based private cloud, or a virtual machine of a public cloud based on AWS, Azure, or GCP): the number of input bytes of that system, the number of output bytes of that system, and the abnormality/failure state of that system. The type and number of these input parameters are not limited, and parameters may be added, removed, or changed.

The input layer of the second DNN (422) receives the output values of the first DNN (421) for each of the plurality of systems. These are processed through the hidden layer, and the classification result is finally output through the output layer.
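The data flow through the two-stage structure can be illustrated with a small forward-pass sketch in plain Python. Everything beyond the three-layer shape and the three per-system inputs is an assumption of this example: the layer sizes, the ReLU and softmax functions, the reuse of one first-stage network for every system, and the random, untrained weights. It shows only how per-system outputs are combined, not a trained classifier.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


class ThreeLayerNet:
    """Input layer -> one hidden layer -> output layer, untrained weights."""

    def __init__(self, rng, n_in, n_hidden, n_out):
        self.hidden = self._init_layer(rng, n_in, n_hidden)
        self.output = self._init_layer(rng, n_hidden, n_out)

    @staticmethod
    def _init_layer(rng, n_in, n_out):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        return weights, [0.0] * n_out

    def forward(self, inputs):
        return dense(relu(dense(inputs, *self.hidden)), *self.output)


def classify_service(per_system_inputs, first, second):
    """Run the first network per system, concatenate, then classify."""
    stage_one = []
    for system_inputs in per_system_inputs:  # [in_bytes, out_bytes, state]
        stage_one.extend(first.forward(system_inputs))
    return softmax(second.forward(stage_one))
```

For three systems and a first-stage output of two values each, the second network would be built with `3 * 2` input nodes; its softmax output can be read as scores over the service abnormality/failure classes.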

The service abnormality/failure learning unit (420) likewise divides the collected learning data into 80% training data and 20% test data, performs model training and testing, and can repeat training until the test accuracy reaches or exceeds a certain value.

FIG. 11 is a flowchart illustrating the operation of artificial intelligence-based network load prediction and fault detection according to an embodiment of the present invention.

Referring to FIG. 11, the method comprises a first step of collecting NetFlow from a plurality of virtual machines within the hybrid cloud network (20); a second step of analyzing the collected NetFlow, classifying the traffic information and state information of the network, and storing them in a database; a third step of predicting the network load from the traffic information; a fourth step of detecting system abnormalities and failures from the predicted load; and a fifth step of classifying network service abnormalities and failures according to the detected system state.

The third and fifth steps are performed by an artificial intelligence model obtained through machine learning. After the fifth step, the method may further include a step of notifying the service manager, by an alarm, of the result of a comparison with system and service abnormality/failure situations that occurred in the past.

The present invention has been described above with reference to specific embodiments. However, a person of ordinary skill in the art to which the present invention pertains will be able to make various applications and modifications within the scope of the present invention based on the above description.

10: System for network load prediction and fault detection
20: Hybrid cloud network
30: Service manager
100: Collection server
110: NetFlow collection unit
120: NetFlow analysis unit
130: NetFlow storage unit
200: Big data server
300: Control server
310: Intelligent load prediction unit
311: Load prediction model unit
312: Load prediction judgment unit
313: Load notification unit
320: System abnormality/failure detection unit
330: Service abnormality/failure classification unit
400: Learning server
410: Load prediction learning unit
411: Input layer
412: Hidden layer
413: Output layer
414: LSTM model
420: Service abnormality/failure learning unit
421: First DNN
422: Second DNN

Claims

A system for artificial intelligence-based network load prediction and fault detection, comprising:
a collection server that collects network traffic and network metrics generated from virtual machines within a hybrid cloud network;
a big data server that stores, in a database, the network traffic collected by the collection server and the analysis results of the network traffic;
a control server that provides the network traffic collection results and network status information to a service manager; and
a learning server that predicts network load through artificial intelligence-based machine learning using the network traffic stored in the big data server, and thereby predicts and classifies system and service abnormalities and failures.

The system of claim 1, wherein
the network traffic is NetFlow, and
the collection server further comprises: a NetFlow collection unit that collects at least one of the source IP, destination IP, port number, IP protocol type, service type, and AS number included in the NetFlow;
a NetFlow analysis unit that uses the NetFlow to perform at least one of tracking events and analyzing IPs that generate excessive traffic, virus-infected PCs, or traffic types; and
a NetFlow storage unit that stores, in a database, the NetFlow collected by the NetFlow collection unit and the results analyzed by the NetFlow analysis unit.

The system of claim 2, wherein
the collection server further comprises an API manager that collects, through an API, network traffic generated from legacy servers and from the virtual machines of OpenStack-based private clouds and public clouds within the hybrid cloud network.

The system of claim 1, wherein
the control server further comprises: an intelligent load prediction unit that predicts network load for a certain period of time;
a system abnormality/failure detection unit that predicts the abnormality/failure state of the network using the result of the load prediction; and
a service abnormality/failure classification unit that classifies the expected service state based on the abnormality/failure state.

The system of claim 1, wherein
the learning server further comprises: a load prediction learning unit that predicts the load generated from the network traffic; and
a service abnormality/failure learning unit that uses the load to classify service abnormalities and failures caused by abnormal operation of the network system after a predetermined period of time has elapsed.

The system of claim 5, wherein
the load prediction learning unit comprises a deep neural network-based artificial intelligence model including an input layer, a hidden layer, and an output layer, and an LSTM model that takes the result of the output layer as input.

The system of claim 6, wherein
the input layer is composed of seven input nodes, which respectively receive day-of-week information, hour information, minute information, number of input flows, number of input packets, number of input bytes, and input byte ratio.

The system of claim 5, wherein
the service abnormality/failure learning unit is composed of a two-stage deep neural network-based artificial intelligence model,
the first-stage deep neural network-based artificial intelligence model includes an input layer that receives state information from one of a plurality of cloud networks, a hidden layer, and an output layer, and
the second-stage deep neural network-based artificial intelligence model includes an input layer that receives the result of the output layer of the first stage, a hidden layer, and an output layer.

The system of claim 8, wherein
the input layer of the first-stage deep neural network-based artificial intelligence model is composed of three input nodes that respectively receive the number of input bytes of the system, the number of output bytes of the system, and the state information of the system.

A method for artificial intelligence-based network load prediction and fault detection, comprising:
a first step of collecting NetFlow from a plurality of virtual machines within a hybrid cloud network;
a second step of analyzing the collected NetFlow, classifying the traffic information and state information of the network, and storing them in a database;
a third step of predicting network load from the traffic information;
a fourth step of detecting system abnormalities and failures from the predicted load; and
a fifth step of classifying network service abnormalities and failures according to the detected system state,
wherein the third and fifth steps are performed by an artificial intelligence model obtained through machine learning.

The method of claim 10,
further comprising, after the fifth step, a step of notifying a service manager, by an alarm, of the result of a comparison with system and service abnormality/failure situations that occurred in the past.