
WO2019223062A1 - Method and system for processing system exceptions - Google Patents

Method and system for processing system exceptions

Info

Publication number
WO2019223062A1
Authority
WO
WIPO (PCT)
Prior art keywords
processing
monitoring node
operation information
central server
communication end
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2018/093707
Other languages
French (fr)
Chinese (zh)
Inventor
陈天豪
杨海勇
谢晓华
袁少雄
金鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Publication of WO2019223062A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0654Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0631Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/0659Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
    • H04L41/0661Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities by reconfiguring faulty entities

Definitions

  • the present application relates to the field of computer technology, and in particular, to a method and system for processing system exceptions.
  • the system is mainly monitored by monitoring tools. Once a system abnormality occurs, the alarm information is usually transmitted to the relevant operation and maintenance personnel by mail or phone, and then the operation and maintenance personnel handle the system abnormality.
  • many system failures occur repeatedly and the same processing methods exist. Existing system exception handling methods will cause a lot of tedious and repetitive work, reducing the operation and maintenance efficiency of the system.
  • the embodiments of the present application provide a method and a system for processing system abnormalities, so as to solve the problem of low operation and maintenance efficiency of current network equipment when a system abnormality occurs.
  • a first aspect of the embodiments of the present application provides a method for processing a system exception, including:
  • the communication terminal collects its system operation information in real time and reports the system operation information to the monitoring node;
  • the monitoring node generates an alarm email according to the system operation information and sends the alarm email to a central server, where the alarm email records the communication terminal on which a system abnormality occurred and the operating data indicating that terminal's system abnormality;
  • the central server outputs the operation data in the alarm email to a preset processing solution database for matching, and obtains a processing script matching the operation data;
  • the central server pushes the processing script to the communication end where the system abnormality occurs, and the processing script is automatically executed after being received by the communication end where the system abnormality occurs, and is used to handle the system abnormality.
  • a second aspect of the embodiments of the present application provides a system for processing system exceptions, including a central server, and multiple communication terminals and multiple monitoring nodes deployed in a distributed manner.
  • the communication end is used to collect its system operation information in real time and report the system operation information to the monitoring node;
  • the monitoring node is configured to generate an alarm email according to the system operation information and send the alarm email to the central server, where the alarm email records the communication terminal on which a system abnormality occurred and the operating data indicating that terminal's system abnormality;
  • the central server is configured to output the operation data in the alarm email to a preset processing solution database for matching, and obtain a processing script matching the operation data;
  • the central server is further configured to push the processing script to the communication end where the system abnormality occurs, and the processing script is automatically executed after being received by the communication end where the system abnormality occurs, and is used to process the system abnormality.
  • a central server is deployed in the existing network, and multiple monitoring nodes are deployed in a distributed manner.
  • the original communication terminal in the network collects its system operation information in real time and reports it to the monitoring node.
  • based on the system operation information, the monitoring node generates an alarm email for the communication end whose system is abnormal and sends it to the central server, so that the central server matches the corresponding processing script in the preset processing solution database and pushes it to the communication end for automatic processing.
  • from the occurrence of a system abnormality to its recovery, the entire process is completed automatically among the communication end, the monitoring node, and the central server, which realizes automatic system operation and maintenance, ensures its timeliness, and saves the time and energy of operation and maintenance personnel.
  • FIG. 1 is a schematic diagram of a network topology architecture of a system exception handling system according to an embodiment of the present application
  • FIG. 2 is an interaction flowchart of a method for processing a system exception according to an embodiment of the present application
  • FIG. 3 is an implementation flowchart of a method for processing a system exception according to another embodiment of the present application.
  • FIG. 4 is a flowchart of a method for processing a system exception according to another embodiment of the present application.
  • FIG. 5 is an implementation flowchart of a method for processing a system exception according to another embodiment of the present application.
  • FIG. 6 is an implementation flowchart of a method for processing a system exception according to another embodiment of the present application.
  • FIG. 7 is a flowchart of implementing a method for processing a system exception according to another embodiment of the present application.
  • FIG. 8 is a flowchart of implementing a method for processing a system exception according to another embodiment of the present application.
  • FIG. 9 is an interaction flowchart of a method for processing a system exception according to another embodiment of the present application.
  • FIG. 10 is a schematic diagram of a network node according to an embodiment of the present application.
  • FIG. 1 is a schematic diagram of a network topology architecture of a system exception handling system provided by an embodiment of the present application. For convenience of explanation, only a part related to this embodiment is shown.
  • the system has a central server deployed in the network, and multiple communication terminals and multiple monitoring nodes are deployed in a distributed manner.
  • the communication end may be various network nodes that have been deployed in the network, such as network devices such as servers, gateways, and routers, and terminal devices such as computers, smart home appliances, and smart phones.
  • the system installed and running on the communication end is operated and maintained. Once a system abnormality occurs, the system abnormality is automatically restored based on the system abnormality processing method provided in the embodiment of the present application.
  • the communication end collects system operation information in real time while its system is running and reports the system operation information to the monitoring node.
  • the central server and the monitoring node are the devices deployed in the network in order to realize automatic recovery of system abnormalities in the embodiments of the present application.
  • the monitoring nodes are deployed in the network in a distributed manner, and their equipment can be servers with high data processing capabilities.
  • the monitoring nodes generate alarm emails based on the system operation information reported by the communication end, and send the alarm emails to the central server. Each alarm email records the communication end where the system abnormality occurred and the operating data used to indicate the abnormality of that communication end's system.
  • one or more communication terminals may be deployed under a monitoring node, and each monitoring node is responsible for monitoring the system operation of the communication terminal deployed under it.
  • Only one central server can be set in a network area, and the central server can simultaneously communicate with all monitoring nodes and communication terminals deployed in the network area.
  • the processing solution database is set up on the central server. After receiving the alarm email reported by the monitoring node, the central server outputs the operation data in the alarm email to the processing solution database for matching, obtains the corresponding processing script, and pushes the processing script to the communication end where the system exception occurred. After receiving the processing script, the communication end automatically executes it, thereby realizing automatic recovery from the system exception.
  • each monitoring node is located under the same gateway as the communication end that reports system operation information to the monitoring node, so that the monitoring node can accurately and timely obtain the system operation information of the communication end.
  • the communication end reports its system operation information to the monitoring node under the same gateway, which is more assured in communication reliability and communication rate, and relatively improves the reporting efficiency of the system operation information, and facilitates operation and maintenance management.
  • FIG. 2 shows an interaction flow of a method for processing a system exception provided by an embodiment of the present application.
  • a communication entity involved in the interaction includes the foregoing central server, a monitoring node, and a communication end.
  • the method for processing the system exception includes:
  • the communication terminal collects its system operation information in real time, and reports the system operation information to the monitoring node.
  • a program for collecting system operation information is loaded in the communication terminal in advance, and the communication terminal implements real-time collection of system operation information through the preloaded program during the system operation.
  • the collected system operation information includes, but is not limited to, business data processed by the system, system operation logs, basic resource usage of the communication end, database operation performance of the communication end, and middleware performance.
  • the communication terminal reports the system operation information to the monitoring node in a regular or real-time manner.
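  • As a minimal sketch of the collection step described above, the following Python snippet gathers one small snapshot of basic resource usage and serializes it for reporting. The field names (`terminal`, `collected_at`, `disk_used_ratio`) and the JSON encoding are assumptions made for illustration; the source does not specify a report format or transport.

```python
import json
import shutil
import socket
import time


def collect_operation_info(path="/"):
    """Gather a small snapshot of system operation information.

    Only disk usage is sampled here; the field names are hypothetical.
    """
    usage = shutil.disk_usage(path)
    ratio = usage.used / usage.total if usage.total else 0.0
    return {
        "terminal": socket.gethostname(),
        "collected_at": time.time(),
        "disk_used_ratio": ratio,
    }


def encode_report(info):
    """Serialize the snapshot so it can be reported to the monitoring node."""
    return json.dumps(info, sort_keys=True).encode("utf-8")
```

  • A real collector would also cover the other categories listed above, such as operation logs, database performance, and middleware performance.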
  • the communication end needs to determine a monitoring node that needs to report system operation information in advance. As shown in Figure 3:
  • the communication terminal obtains a monitoring node list, and the monitoring node list records each gateway in the system and the monitoring nodes deployed under each of the gateways.
  • the monitoring node list is issued by the central server to each communication end, which records the information of each gateway in the system and the monitoring nodes deployed under each gateway.
  • each gateway and each monitoring node can be represented by its IP address.
  • the monitoring node list is maintained by the central server. When the content recorded in it is changed, the central server re-delivers it to the communication end. After receiving the new monitoring node list, the communication end updates the locally stored monitoring node list.
  • each monitoring node is located under the same gateway as the communication end that reports system operation information to it. Therefore, in this embodiment, the communication end first looks up, in the monitoring node list, the gateway where the communication end is located.
  • the communication terminal determines the found monitoring node deployed under the gateway as the monitoring node that needs to report the system operation information.
  • after finding the gateway where it is located in the monitoring node list, the communication end selects any monitoring node deployed under that gateway and determines it as the monitoring node to which the communication end needs to report system operation information.
  • the communication end reports its system operation information to a monitoring node under the same gateway, which gives better communication reliability and communication rate, improves the efficiency of reporting system operation information, and is convenient for operation and maintenance management.
  • the communication end records all the monitoring nodes deployed under its gateway. For example, if five monitoring nodes are deployed under the gateway where the communication terminal is located, then in addition to configuring one of them as the monitoring node to which it reports system operation information, the communication terminal also records the address information and node identifiers of the other four monitoring nodes.
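  • The selection logic described above can be sketched as follows, assuming the monitoring node list is a mapping from a gateway IP address to the IP addresses of the monitoring nodes under it. The structure `MONITORING_NODE_LIST`, the addresses, and the function name are hypothetical; the source only states that gateways and nodes can be represented by IP addresses.

```python
import random

# Hypothetical monitoring node list: gateway IP -> monitoring node IPs under it.
MONITORING_NODE_LIST = {
    "10.0.1.1": ["10.0.1.10", "10.0.1.11", "10.0.1.12"],
    "10.0.2.1": ["10.0.2.10"],
}


def select_monitoring_node(node_list, own_gateway, rng=random):
    """Pick any node under the terminal's own gateway and record the rest.

    Returns (primary, backups): the node to report to, and the other
    nodes under the same gateway kept for later use.
    """
    nodes = node_list.get(own_gateway)
    if not nodes:
        raise LookupError(f"no monitoring node under gateway {own_gateway}")
    primary = rng.choice(nodes)
    backups = [node for node in nodes if node != primary]
    return primary, backups
```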
  • after the communication end reports the system operation information to the monitoring node, it normally receives a response confirming that the monitoring node received the information successfully. If no response arrives within a certain period, the communication end treats the report as failed and, according to the information recorded in S304, selects another monitoring node under its gateway to report the system operation information to.
  • the embodiment corresponding to FIG. 4 considers the possibility of failure of a monitoring node or a communication link, and establishes a backup reporting mechanism for the smooth reporting of system operation information, which effectively guarantees the timeliness of system operation and maintenance.
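  • A minimal sketch of this backup reporting mechanism is shown below. The `send` callable is a hypothetical transport that returns True when the node acknowledges receipt within the timeout; the source does not define the protocol or the timeout value.

```python
def report_with_fallback(info, primary, backups, send, timeout=3.0):
    """Report to the primary node, then to each backup until one acknowledges.

    `send(node, info, timeout)` is a hypothetical transport callable that
    returns True when `node` acknowledges receipt within `timeout` seconds.
    Returns the node that acknowledged, or None if every attempt failed.
    """
    for node in [primary, *backups]:
        try:
            if send(node, info, timeout):
                return node
        except OSError:
            # A connection error is treated like a missing acknowledgement.
            continue
    return None
```

  • Returning the acknowledging node lets the caller keep using it for later reports instead of retrying the failed node each time; that choice is an illustration, not something the source prescribes.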
  • the monitoring node generates an alarm email according to the system operation information and sends the alarm email to the central server. The alarm email records the communication terminal on which a system abnormality occurred and the operating data indicating that terminal's system abnormality.
  • the monitoring node analyzes the system operation information reported by each communication terminal to monitor for program abnormalities or business data abnormalities in the system run by each communication terminal. Based on the monitoring result, it generates an alarm email containing the information related to the system abnormality and sends it to the central server.
  • the alarm email mainly describes the device identification or network address of the communication end where the system abnormality occurs, and writes the operating data used to indicate the system abnormality of the communication end.
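  • As an illustration of such an alarm email, the sketch below builds a message with Python's standard `email.message.EmailMessage`. The subject format, the `key: value` body layout, and the example addresses are assumptions; the source only requires that the email identify the abnormal communication end and carry the operating data. Actually sending the message (for example over SMTP) is omitted.

```python
from email.message import EmailMessage


def build_alarm_email(terminal_address, operating_data, sender, recipient):
    """Build an alarm email naming the abnormal terminal and its operating data.

    The body uses one hypothetical "key: value" line per indicator.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"[ALARM] system abnormality on {terminal_address}"
    lines = [f"server_address: {terminal_address}"]
    lines += [f"{key}: {value}" for key, value in operating_data.items()]
    msg.set_content("\n".join(lines))
    return msg
```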
  • the monitoring node may establish a normal operation model of the system in advance, so that system operation information can be fed into the model to determine whether the system on the corresponding communication end is operating normally.
  • As shown in FIG. 5:
  • the monitoring node collects the system operation information of different communication terminals.
  • the monitoring node can collect the system operation information of each communication end in advance and store it in the operation information set for subsequent modeling and analysis.
  • S502 The monitoring node performs clustering on the collected system operation information to obtain multiple cluster sets.
  • the monitoring node uses a clustering algorithm, such as the CURE clustering algorithm, to cluster the collected system operation information to obtain multiple cluster sets, and the system operation information in each cluster set has the same or similar data characteristics.
  • the monitoring node marks a cluster set for indicating that the system operates normally in the multiple cluster sets.
  • the monitoring nodes mark clusters that indicate the normal operation of the system in the multiple cluster sets generated.
  • the system operation information in these cluster sets can indicate that no system abnormality has occurred on the corresponding communication end.
  • the implementation of S503 is shown in FIG. 6:
  • S601 The monitoring node arranges the multiple cluster sets in descending order according to the size of the cluster.
  • after clustering, the amount of system operation information gathered in each cluster set differs, so the monitoring node first sorts the generated cluster sets in descending order of cluster size, that is, by the amount of system operation information gathered in each cluster set.
  • the monitoring node reads a preset scaling parameter, where the scaling parameter indicates the proportion of communication terminals whose systems are operating normally among all communication terminals at a given time.
  • the scaling parameter is determined from empirical values or previous system operating conditions. It indicates the proportion of communication ends whose systems are operating normally among all communication ends at a given time, that is, the percentage of system operation information indicating normal operation within the entire operation information set.
  • the monitoring node marks the clusters arranged in the top N positions as clusters for indicating that the system operates normally based on the preset scaling parameters.
  • after obtaining the preset scaling parameter, the monitoring node marks the cluster sets ranked in the top N positions as cluster sets indicating normal system operation, according to the preset scaling parameter, the amount of system operation information reported by the communication ends in the current statistical period, and the amount of system operation information in each cluster set.
  • the ratio of the sum of the system operation information in the labeled cluster set to the sum of the system operation information in all cluster sets is approximately equal to a preset ratio parameter.
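  • The marking of the top N cluster sets can be sketched as below. The stopping rule (stop when adding the next cluster set would move the covered share further from the preset parameter) is one possible reading of "approximately equal" and is an assumption, as are the function and parameter names. The clustering itself (for example CURE) is assumed to have been done already.

```python
def mark_normal_clusters(clusters, normal_ratio):
    """Mark the largest cluster sets as indicating normal system operation.

    clusters: list of cluster sets, each a list of operation-information samples.
    normal_ratio: preset scaling parameter in (0, 1].
    Returns the marked cluster sets in descending order of size.
    """
    ordered = sorted(clusters, key=len, reverse=True)
    total = sum(len(cluster) for cluster in ordered)
    target = normal_ratio * total
    marked, covered = [], 0
    for cluster in ordered:
        # Stop once adding the next cluster set would not bring the covered
        # share any closer to the preset ratio than stopping here.
        if marked and abs(covered + len(cluster) - target) >= abs(covered - target):
            break
        marked.append(cluster)
        covered += len(cluster)
    return marked
```

  • For example, with cluster sets of sizes 50, 30, 15, and 5 and a scaling parameter of 0.8, the two largest sets are marked, covering 80 of the 100 samples.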
  • the system operation information is filtered according to empirical values and the clustering algorithm, and the system operation information indicating normal system operation is identified for use in the subsequent modeling of normal system operation.
  • the monitoring node generates a normal operation model of the system based on the marked cluster sets; the monitoring node uses this model to determine whether the system operation information reported by a communication terminal indicates that the terminal's system is running normally.
  • the monitoring node obtains the system operation information in it, and performs modeling to generate the normal operation model of the system.
  • the normal operation model of the system can be established based on neural networks.
  • the system operation information represented by the clusters is used as input samples, and the system operation status represented by the system operation information, that is, the system normal or system abnormality is used as the output result for model training.
  • the model is used to determine whether the system operation information reported by the communication end can indicate that the communication end system is operating normally.
  • the central server outputs the operation data in the alarm email to a preset processing scheme database for matching, and obtains a processing script matching the operation data.
  • the central server can set a scheduled task in its background to periodically obtain the alarm emails generated by the monitoring nodes.
  • the operating data used to indicate the abnormality of the communication end system can be attached to the email as a text attachment, or it can be reflected in the form of the email body.
  • the central server parses the relevant text content in the alarm email sent by the monitoring node. This includes segmenting the text content, finding the character strings that characterize the system's operating indicators, and reading the data corresponding to each character string, so as to convert the text information into a data table that characterizes the operating status of the system.
  • the key names in this data table are character strings that can characterize system operation indicators, and the key values are the corresponding data for each character string.
  • the character string capable of characterizing the system operation index includes, but is not limited to, a server number, a server address, a time when a system abnormality occurs, a description of a system abnormality, an operating parameter when the system is abnormal, and the like.
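  • A minimal sketch of this parsing step follows, assuming the alarm email body carries one `key: value` line per operating indicator. That body layout and the regular expression are assumptions; the source does not fix a text format. The email is parsed with Python's standard `email` package.

```python
import re
from email import message_from_string
from email.policy import default

# Hypothetical line format: "<indicator name>: <value>".
LINE_PATTERN = re.compile(r"^\s*([A-Za-z_][\w ]*?)\s*:\s*(.+?)\s*$")


def parse_alarm_email(raw_email):
    """Convert the plain-text body of an alarm email into a data table (dict).

    Keys are the strings that characterize operating indicators; values are
    the corresponding data, kept as text.
    """
    msg = message_from_string(raw_email, policy=default)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    table = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            table[match.group(1)] = match.group(2)
    return table
```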
  • a processing solution database is created in the central server before S3 is executed, as shown in FIG. 7:
  • S701 The central server enters a configuration mode.
  • after the configuration mode is triggered, the central server presents a configurable page to the operation and maintenance user, who can configure the processing solution database on that page.
  • the central server receives the characteristic parameters and corresponding processing scripts input by the operation and maintenance user for describing system abnormalities.
  • the central server stores feature parameters input by the operation and maintenance user in association with corresponding processing scripts, and the feature parameters are used by the central server to match with the operation data.
  • the central server stores the characteristic parameters entered by the operation and maintenance user on the configurable page to describe a certain type of system abnormality, and stores it with the processing script used to recover the type of system abnormality.
  • the character strings used to characterize system operation indicators in the operating data, together with their corresponding data, are matched against the character strings used to characterize system operation indicators in the characteristic parameters, so that the type of system exception can be determined, and the processing script used to handle this type of system exception can then be determined.
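  • The matching step can be sketched as a lookup over stored pairs of characteristic parameters and processing scripts. The in-memory list `PROCESSING_PLANS`, the script names, and the rule of preferring the most specific match are assumptions for illustration; the source does not describe the database schema or a tie-breaking rule.

```python
# Hypothetical processing solution database: (characteristic parameters, script).
PROCESSING_PLANS = [
    ({"description": "thread blocked"}, "restart_worker.sh"),
    ({"description": "disk full", "mount": "/var"}, "clean_var_logs.sh"),
]


def match_processing_script(operating_data, plans):
    """Return the script whose characteristic parameters all match the data.

    When several entries match, the one with the most parameters wins;
    returns None when nothing matches.
    """
    best_script, best_size = None, 0
    for features, script in plans:
        if all(operating_data.get(key) == value for key, value in features.items()):
            if len(features) > best_size:
                best_script, best_size = script, len(features)
    return best_script
```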
  • S4 The central server pushes the processing script to the communication end where the system abnormality occurs, and the processing script is automatically executed after being received by the communication end where the system abnormality occurs, and is used to handle the system abnormality.
  • after the central server finds in its processing solution database a processing script that matches the system exception of the communication end, it pushes the processing script to the communication end according to the information about that communication end given in the alarm email, such as its network address. After receiving the script pushed by the central server, the communication end automatically executes the processing script to recover from the system exception.
  • on the communication end, the push of a processing script is configured in advance as a trigger condition: once it is detected that the central server has pushed a processing script, a thread creation action is triggered, and an execution thread for the processing script is automatically created locally.
  • after the execution thread is created, the communication end sets its priority to the highest. The other threads previously created on the communication end then have lower priority than the execution thread, so the communication end immediately runs the execution thread to execute the processing script, thereby realizing timely recovery from system exceptions.
  • the communication end records the execution process, generates an execution log, and feeds back to the central server the result of the successful system exception processing and the execution log.
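  • A rough sketch of this behavior on the communication end is given below. Several points are assumptions: the pushed script is treated as Python source and run with the current interpreter, `report` is a hypothetical callback that feeds the result back to the central server, and thread priority is not set because Python's `threading` module does not expose thread priorities.

```python
import subprocess
import sys
import threading


def run_script(script_source):
    """Default runner (an assumption): run the pushed script as Python code."""
    proc = subprocess.run(
        [sys.executable, "-c", script_source],
        capture_output=True, text=True, timeout=60,
    )
    return proc.returncode, proc.stdout + proc.stderr


def on_script_pushed(script_source, report, runner=run_script):
    """Create an execution thread for a pushed script and report the outcome.

    `report` is a hypothetical callback that sends the processing result and
    the execution log back to the central server.
    """
    def worker():
        code, output = runner(script_source)
        report({"success": code == 0, "execution_log": output})

    thread = threading.Thread(target=worker, name="exception-handler")
    thread.start()
    return thread
```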
  • the central server performs statistical analysis on the received execution logs at preset time intervals, and generates a prediction report on the system operating status according to the results of the statistical analysis.
  • the central server records, according to the received execution logs, the communication ends that successfully handled a system abnormality. At intervals, it statistically analyzes the system abnormalities of the communication ends and the corresponding processing results, and generates a prediction report on the system operating status based on the results of the statistical analysis. This helps operation and maintenance personnel understand how the system is running, improve system functions, and improve system stability.
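  • The statistical analysis could start from a simple summary such as the one below. The record fields (`terminal`, `exception_type`, `success`) and the chosen statistics are assumptions; the source does not say which statistics feed the prediction report, and this sketch only summarizes past records rather than predicting anything.

```python
from collections import Counter


def summarize_execution_logs(records):
    """Summarize feedback records collected during one statistical interval.

    Each record is a dict with hypothetical keys "terminal",
    "exception_type", and "success".
    """
    by_type = Counter(record["exception_type"] for record in records)
    by_terminal = Counter(record["terminal"] for record in records)
    successes = sum(1 for record in records if record["success"])
    return {
        "total": len(records),
        "success_rate": successes / len(records) if records else None,
        "most_common_types": by_type.most_common(3),
        "most_affected_terminals": by_terminal.most_common(3),
    }
```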
  • the monitoring node analyzes the system operation information reported by the communication terminal to confirm that the communication terminal has a system abnormality, so it sends an alarm email to the central server.
  • the central server analyzes the text of the alarm email, and imports the parsing result into the processing solution database for matching to find the processing script for processing thread blocking, and pushes the processing script to the communication terminal according to the address of the communication terminal in the alarm email.
  • the communication end automatically executes the processing script to complete system exception recovery.
  • FIG. 10 is a schematic diagram of a network node according to an embodiment of the present application.
  • the network node may be a central server, a communication end, or a monitoring node in FIG. 1.
  • the network node 10 of this embodiment includes a processor 100, a memory 101, and computer-readable instructions 102 stored in the memory 101 and executable on the processor 100.
  • when the processor 100 executes the computer-readable instructions 102, the steps in the embodiments of the method for recovering from a system abnormality corresponding to each network node are implemented.
  • for example, for a communication end, the processor 100 executes step S1 shown in FIG. 2; for a monitoring node, the processor 100 executes step S2 shown in FIG. 2; and for a central server, the processor 100 executes steps S3 and S4 shown in FIG. 2.
  • the computer-readable instructions 102 may be divided into one or more modules/units, which are stored in the memory 101 and executed by the processor 100 to complete this application.
  • the one or more modules / units may be an instruction segment of a series of computer-readable instructions capable of performing a specific function, and the instruction segment is used to describe an execution process of the computer-readable instruction 102 in its corresponding network node.
  • the processor 100 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 101 may be an internal storage unit of a corresponding network node, such as a hard disk or a memory at a communication end.
  • the memory 101 may also be an external storage device of a corresponding network node, such as a plug-in hard disk equipped on a communication end, a smart memory card (Smart Media Card, SMC), Secure Digital (SD) card, Flash Card, etc.
  • the memory 101 may further include both an internal storage unit of a corresponding network node and an external storage device.
  • the memory 101 is configured to store the computer-readable instructions and other programs and data required by the server.
  • the memory 101 may also be used to temporarily store data that has been output or is to be output.
  • the computer-readable instructions may be stored in a computer-readable storage medium. When the computer-readable instructions are executed by the processor, the steps of the foregoing method embodiments can be implemented.
  • the computer-readable instructions include computer-readable instruction codes, and the computer-readable instruction codes may be in a source code form, an object code form, an executable file, or some intermediate form.
  • the computer-readable medium may include: any entity or device capable of carrying the computer-readable instructions, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media.
  • ROM Read-Only Memory
  • RAM Random Access Memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present application is applied to the technical field of computers, and provides a method and a system for processing system exceptions. The method comprises: a communication terminal collecting system operating information thereof in real time, and reporting the system operating information to a monitoring node; the monitoring node generating a warning mail according to the system operating information and sending the warning mail to a central server, wherein the communication terminal having system exceptions and operation data for showing system exceptions of the communication terminal are recorded in the warning mail; the central server outputting the operating data in the warning mail to a pre-set processing solution database for matching, and acquiring processing scripts matched with the operating data; and the central server pushing the processing scripts to the communication terminal having system exceptions, wherein the processing scripts are executed automatically after being received by the communication terminal having system exceptions, and are used for processing the system exceptions. In the present application, automatic system operation and maintenance are realized, and timeliness of system operation and maintenance is also guaranteed.

Description

Method and system for processing system exceptions

This application claims priority to Chinese patent application No. 201810496049.9, entitled "Method and System for Processing System Exceptions", filed with the Chinese Patent Office on May 22, 2018, the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of computer technology, and in particular to a method and system for processing system exceptions.

Background

With the continuous development of network technology, network devices such as servers and gateways have been put into use on a large scale, and both the capacity and the topological complexity of networks keep growing. As a result, various system exceptions inevitably occur while a network system is running.

At present, systems are monitored mainly by monitoring tools. Once a system exception occurs, alarm information is usually passed to the relevant operation and maintenance personnel by email or telephone, and those personnel then handle the exception. However, many system faults recur and share the same handling method, so the existing approach generates a large amount of tedious, repetitive work and lowers the efficiency of system operation and maintenance.

Technical problem

The embodiments of the present application provide a method and system for processing system exceptions, so as to solve the problem that operation and maintenance of current network devices is inefficient when a system exception occurs.

Technical solution

A first aspect of the embodiments of the present application provides a method for processing system exceptions, including:

a communication end collecting its system operation information in real time and reporting the system operation information to a monitoring node;

the monitoring node generating an alarm email according to the system operation information and sending the alarm email to a central server, where the alarm email records the communication end on which a system exception has occurred and operation data indicating the system exception of that communication end;

the central server outputting the operation data in the alarm email to a preset processing scheme database for matching, and obtaining a processing script that matches the operation data;

the central server pushing the processing script to the communication end on which the system exception has occurred, where the processing script is executed automatically after being received by that communication end and is used to handle the system exception.

A second aspect of the embodiments of the present application provides a system for processing system exceptions, including a central server and, deployed in a distributed manner, multiple communication ends and multiple monitoring nodes, where:

the communication end is configured to collect its system operation information in real time and report the system operation information to the monitoring node;

the monitoring node is configured to generate an alarm email according to the system operation information and send the alarm email to the central server, where the alarm email records the communication end on which a system exception has occurred and operation data indicating the system exception of that communication end;

the central server is configured to output the operation data in the alarm email to a preset processing scheme database for matching, and obtain a processing script that matches the operation data;

the central server is further configured to push the processing script to the communication end on which the system exception has occurred, where the processing script is executed automatically after being received by that communication end and is used to handle the system exception.

Beneficial effect

In the embodiments of the present application, a central server is deployed in the existing network and multiple monitoring nodes are deployed in a distributed manner. The communication ends already present in the network collect their system operation information in real time and report it to a monitoring node. Based on the system operation information, the monitoring node generates an alarm email for a communication end on which a system exception has occurred and sends it to the central server, so that the central server matches the corresponding processing script in the preset processing scheme database and pushes it to the communication end for automatic handling. From the occurrence of a system exception to recovery from it, the whole process is completed automatically among the communication end, the monitoring node and the central server. This automates system operation and maintenance, keeps it timely, and saves the time and effort of operation and maintenance personnel.

Brief description of the drawings

FIG. 1 is a schematic diagram of the network topology of a system for processing system exceptions according to an embodiment of the present application;

FIG. 2 is an interaction flowchart of a method for processing system exceptions according to an embodiment of the present application;

FIG. 3 is an implementation flowchart of a method for processing system exceptions according to another embodiment of the present application;

FIG. 4 is an implementation flowchart of a method for processing system exceptions according to another embodiment of the present application;

FIG. 5 is an implementation flowchart of a method for processing system exceptions according to another embodiment of the present application;

FIG. 6 is an implementation flowchart of a method for processing system exceptions according to another embodiment of the present application;

FIG. 7 is an implementation flowchart of a method for processing system exceptions according to another embodiment of the present application;

FIG. 8 is an implementation flowchart of a method for processing system exceptions according to another embodiment of the present application;

FIG. 9 is an interaction flowchart of a method for processing system exceptions according to another embodiment of the present application;

FIG. 10 is a schematic diagram of a network node according to an embodiment of the present application.

Embodiments of the invention

To make the objectives, features and advantages of the present application clearer and easier to understand, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the embodiments described below are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.

FIG. 1 is a schematic diagram of the network topology of the system for processing system exceptions provided by an embodiment of the present application. For ease of description, only the parts related to this embodiment are shown.

Referring to FIG. 1, the system has a central server deployed in the network, together with multiple communication ends and multiple monitoring nodes deployed in a distributed manner.

The communication ends may be network nodes that were already deployed in the network, for example network devices such as servers, gateways and routers, and terminal devices such as computers, smart home appliances and smartphones. In the embodiments of the present application, operation and maintenance is performed on the system installed and running on a communication end. Once a system exception occurs, it is recovered from automatically using the method for processing system exceptions provided by the embodiments of the present application. A communication end collects system operation information in real time while its system is running and reports that information to a monitoring node.

The central server and the monitoring nodes are the devices deployed in the network in the embodiments of the present application to achieve automatic recovery from system exceptions. The monitoring nodes are deployed in the network in a distributed manner and may take the form of servers with relatively high data processing capability. A monitoring node generates an alarm email according to the system operation information reported by a communication end and sends the alarm email to the central server. The alarm email records the communication end on which a system exception has occurred and the operation data indicating the system exception of that communication end. One or more communication ends may be deployed under one monitoring node, and each monitoring node is responsible for monitoring the system operation of the communication ends deployed under it. A network area may contain only one central server, and the central server can communicate simultaneously with all monitoring nodes and communication ends deployed in that network area. A processing scheme database is set up on the central server. After receiving an alarm email reported by a monitoring node, the central server outputs the operation data in the alarm email to the processing scheme database for matching, obtains the corresponding processing script, and pushes the processing script to the communication end on which the system exception has occurred. After receiving the processing script, the communication end executes it automatically, thereby recovering from the system exception automatically.

On the basis of the embodiment shown in FIG. 1, each monitoring node is further located under the same gateway as the communication ends that report system operation information to it, so that the monitoring node can obtain the system operation information of those communication ends promptly and accurately. Reporting to a monitoring node under the same gateway also gives better communication reliability and communication rate, improves the efficiency of reporting system operation information, and simplifies operation and maintenance management.

Next, based on the embodiment shown in FIG. 1, the method for processing system exceptions provided by the embodiments of the present application is described in detail. FIG. 2 shows the interaction flow of the method. The communicating entities involved in this interaction flow are the central server, the monitoring node and the communication end described above.

As shown in FIG. 2, the method for processing system exceptions includes:

S1: The communication end collects its system operation information in real time and reports the system operation information to the monitoring node.

In the embodiments of the present application, a program for collecting system operation information is loaded on the communication end in advance, and the communication end uses this preloaded program to collect system operation information in real time while its system is running. The collected system operation information includes, but is not limited to, business data processed by the system, system operation logs, usage of the basic resources of the communication end, database performance on the communication end, and middleware performance. After collecting the system operation information, the communication end reports it to the monitoring node either periodically or in real time.
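A minimal sketch of the collection side of S1 follows. The application does not specify a data format or transport, so the JSON payload, the field names and the helper names `collect_operation_info` and `build_report` are assumptions made for illustration; only one resource metric (disk usage) is sampled, and the actual transmission to the monitoring node is left out.

```python
import json
import shutil
import socket
import time


def collect_operation_info(path="."):
    """Collect a small snapshot of system operation information.

    Hypothetical helper: samples disk usage as one example of
    "usage of the basic resources of the communication end".
    """
    usage = shutil.disk_usage(path)
    ratio = usage.used / usage.total if usage.total else 0.0
    return {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "disk_used_ratio": ratio,
    }


def build_report(info):
    """Serialize the snapshot into bytes for reporting (assumed JSON format)."""
    return json.dumps(info, sort_keys=True).encode("utf-8")


if __name__ == "__main__":
    print(build_report(collect_operation_info()))
```

A periodic reporter would call these two functions on a timer, while a real-time reporter would call them whenever a new sample is available.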

As an embodiment of the present application, before S1 the communication end needs to determine in advance the monitoring node to which it reports system operation information, as shown in FIG. 3:

S301: The communication end obtains a monitoring node list, which records each gateway in the system and the monitoring nodes deployed under each gateway.

The monitoring node list is issued by the central server to each communication end. It records information about each gateway in the system and the monitoring nodes deployed under each gateway; in the list, each gateway and each monitoring node may be represented by its IP address. The list is maintained by the central server. When its recorded content changes, the central server reissues it to the communication ends, and each communication end updates its locally stored monitoring node list after receiving the new one.

S302: The communication end looks up the gateway under which it is located in the monitoring node list.

As described in the embodiment above, each monitoring node is preferably located under the same gateway as the communication ends that report system operation information to it. In this embodiment, the communication end therefore first looks up its own gateway in the monitoring node list.

S303: The communication end determines a monitoring node deployed under the found gateway as the monitoring node to which the system operation information is to be reported.

After finding its gateway in the monitoring node list, the communication end selects any one of the monitoring nodes deployed under that gateway and determines it as the monitoring node to which the communication end reports system operation information.
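The lookup in S301 to S303 can be sketched as follows. The representation of the monitoring node list as a mapping from gateway IP address to a list of monitoring node IP addresses, the function name `select_monitoring_node`, and the example addresses (taken from the documentation address ranges) are all assumptions; the application only states that gateways and monitoring nodes may be represented by their IP addresses.

```python
def select_monitoring_node(node_list, my_gateway):
    """Return one monitoring node deployed under the caller's gateway.

    node_list: dict mapping a gateway IP to the monitoring node IPs under it
               (assumed shape of the list issued by the central server).
    my_gateway: IP of the gateway the communication end sits under; how the
                communication end learns this value is outside this sketch.
    """
    monitors = node_list.get(my_gateway)
    if not monitors:
        raise LookupError(f"no monitoring node recorded under gateway {my_gateway}")
    # S303 allows any node under the gateway; this sketch simply takes the first.
    return monitors[0]


if __name__ == "__main__":
    node_list = {
        "192.0.2.1": ["192.0.2.11", "192.0.2.12"],
        "198.51.100.1": ["198.51.100.11"],
    }
    print(select_monitoring_node(node_list, "192.0.2.1"))
```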

In the embodiment corresponding to FIG. 3, the communication end reports its system operation information to a monitoring node under the same gateway, which gives better communication reliability and communication rate, improves the efficiency of reporting system operation information, and simplifies operation and maintenance management.

After S303, as shown in FIG. 4, the method further includes:

S304: The communication end records all monitoring nodes deployed under the found gateway.

The communication end records every monitoring node deployed under its gateway. For example, if five monitoring nodes are deployed under the gateway of the communication end, then in addition to configuring one of them as the monitoring node to which system operation information is reported, the communication end also records the address information, node identifiers and so on of the other four monitoring nodes.

S305: If a failure to report the system operation information is detected, the communication end selects another monitoring node under the found gateway as the monitoring node to which the system operation information is to be reported.

After reporting system operation information to a monitoring node, the communication end normally receives a response from the monitoring node confirming that the information was received successfully. If no such response is received within a certain time, the communication end treats this report as failed. In that case, the communication end uses the information recorded in S304 to select another monitoring node under its gateway and reports the system operation information to it. The embodiment corresponding to FIG. 4 takes into account the possible failure of a monitoring node or a communication link and establishes a backup reporting mechanism so that system operation information is reported successfully, which effectively safeguards the timeliness of system operation and maintenance.
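The backup reporting mechanism of S304 and S305 can be sketched as a loop over the recorded monitoring nodes. The function name `report_with_failover` and the `send` callable are hypothetical: `send` stands in for whatever transport the deployment uses and is assumed to return `True` when the monitoring node acknowledges receipt, or to raise `OSError` (which includes `TimeoutError`) when no response arrives in time.

```python
def report_with_failover(payload, monitors, send, timeout=2.0):
    """Report to the first monitoring node that acknowledges the payload.

    monitors: monitoring nodes recorded in S304, preferred node first.
    send: hypothetical callable send(node, payload, timeout) -> bool.
    Returns the node that acknowledged, or None if every node failed.
    """
    for node in monitors:
        try:
            if send(node, payload, timeout):
                return node
        except OSError:
            # No acknowledgement in time or link failure: treat the report
            # as failed and fall through to the next recorded node (S305).
            continue
    return None


if __name__ == "__main__":
    def demo_send(node, payload, timeout):
        # Simulated transport: the first node never answers.
        if node == "192.0.2.11":
            raise TimeoutError("no acknowledgement")
        return True

    print(report_with_failover(b"{}", ["192.0.2.11", "192.0.2.12"], demo_send))
```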

S2: The monitoring node generates an alarm email according to the system operation information and sends the alarm email to the central server, where the alarm email records the communication end on which a system exception has occurred and operation data indicating the system exception of that communication end.

The monitoring node analyzes the system operation information reported by each communication end in order to watch for program exceptions or business data exceptions in the systems running on the communication ends. Based on the monitoring result, it generates an alarm email from the information related to the system exception and sends it to the central server. The alarm email mainly describes the device identifier or network address of the communication end on which the system exception has occurred and contains the operation data indicating the system exception of that communication end.
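One way to assemble such an alarm email is sketched below with Python's standard `email.message.EmailMessage`. The application does not define the email layout, so the `key: value` body format, the field names, the function name `build_alarm_email` and the example addresses are assumptions; sending the message (for example through an SMTP server) is omitted.

```python
from email.message import EmailMessage


def build_alarm_email(device_id, address, operation_data, sender, recipient):
    """Build an alarm email naming the abnormal communication end.

    operation_data: dict of indicator name -> value describing the exception
                    (assumed representation of the "operation data").
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"System exception on {device_id}"
    lines = [f"device_id: {device_id}", f"address: {address}"]
    lines += [f"{key}: {value}" for key, value in operation_data.items()]
    # The operation data is placed in the body here; the description also
    # allows carrying it as a text attachment.
    msg.set_content("\n".join(lines))
    return msg


if __name__ == "__main__":
    email_msg = build_alarm_email(
        "srv-01", "192.0.2.10", {"cpu_usage": "97"},
        "monitor@example.com", "center@example.com",
    )
    print(email_msg)
```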

As an embodiment of the present application, before S2 the monitoring node may establish a normal-operation model of the system in advance, so that system operation information can be fed into the model to determine whether the system on the corresponding communication end is running normally, as shown in FIG. 5:

S501: Within a preset time period, the monitoring node collects the system operation information of different communication ends.

The monitoring node may collect the system operation information of each communication end over a period of time in advance and store it in an operation information set for subsequent modeling and analysis.

S502: The monitoring node clusters the collected system operation information to obtain multiple clusters.

The monitoring node uses a clustering algorithm, for example the CURE clustering algorithm, to cluster the collected system operation information into multiple clusters. The system operation information within each cluster has the same or similar data characteristics.

S503: Among the multiple clusters, the monitoring node marks the clusters that indicate normal system operation.

According to preset empirical values, the monitoring node marks, among the generated clusters, those that indicate normal system operation. The system operation information in these clusters shows that no system exception has occurred on the corresponding communication ends.

As an embodiment of the present application, S503 is implemented as shown in FIG. 6:

S601: The monitoring node sorts the multiple clusters in descending order of cluster size.

After clustering, the clusters contain different amounts of system operation information. The monitoring node therefore first sorts the generated clusters in descending order of size, that is, by the amount of system operation information gathered in each cluster.

S602: The monitoring node reads a preset ratio parameter, which indicates the proportion of communication ends whose systems are normal among all communication ends at the same moment.

The ratio parameter is determined from empirical values or from past system operation. It indicates the proportion of communication ends whose systems are normal among all communication ends at the same moment, and hence the proportion of the whole operation information set that consists of system operation information indicating normal operation.

S603: Based on the preset ratio parameter, the monitoring node marks the top N clusters as the clusters that indicate normal system operation.

After obtaining the preset ratio parameter, the monitoring node marks the top N clusters as clusters indicating normal system operation, according to the preset ratio parameter, the amount of system operation information reported by the communication ends in the current statistical period, and the amount of system operation information in each cluster. The ratio of the total amount of system operation information in the marked clusters to the total amount in all clusters is approximately equal to the preset ratio parameter.
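The marking rule of S601 to S603 can be sketched as follows. The application only requires that the share covered by the top N clusters be "approximately equal" to the ratio parameter; choosing N so that the cumulative share is closest to the ratio is one possible reading and is an assumption of this sketch, as are the function name `mark_normal_clusters` and the representation of clusters as a mapping from cluster identifier to size.

```python
def mark_normal_clusters(cluster_sizes, ratio):
    """Return the ids of the top-N clusters taken as "normal operation".

    cluster_sizes: dict mapping cluster id -> number of operation records.
    ratio: preset ratio parameter in [0, 1].
    N is chosen so the cumulative share of the top-N clusters is as close
    as possible to ratio (an assumed interpretation of "approximately").
    """
    total = sum(cluster_sizes.values())
    if total == 0:
        return []
    # S601: sort clusters by size, largest first.
    ordered = sorted(cluster_sizes, key=cluster_sizes.get, reverse=True)
    best_n, best_gap, running = 0, abs(0.0 - ratio), 0
    for n, cluster_id in enumerate(ordered, start=1):
        running += cluster_sizes[cluster_id]
        gap = abs(running / total - ratio)
        if gap < best_gap:
            best_n, best_gap = n, gap
    # S603: the first best_n clusters are marked as normal.
    return ordered[:best_n]


if __name__ == "__main__":
    sizes = {"a": 60, "b": 25, "c": 10, "d": 5}
    print(mark_normal_clusters(sizes, 0.85))
```

With the sizes above and a ratio parameter of 0.85, clusters `a` and `b` together hold 85 of the 100 records, so those two clusters are marked.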

In the embodiment corresponding to FIG. 6, the system operation information is screened using empirical values and a clustering algorithm, and the system operation information that indicates normal system operation is identified for use in the subsequent modeling of the normal-operation model.

S604: The monitoring node generates a normal-operation model of the system based on the marked clusters. The monitoring node uses this model to determine whether the system operation information reported by a communication end indicates that the system of that communication end is running normally.

For the marked clusters, the monitoring node obtains the system operation information they contain and uses it for modeling to generate the normal-operation model. The model may be built on a neural network: the system operation information in the clusters is used as the input samples, and the system condition that the information represents, that is, system normal or system abnormal, is used as the output for model training. The trained model can be used to determine whether the system operation information reported by a communication end indicates that the system of that communication end is running normally.

S3: The central server outputs the operation data in the alarm email to a preset processing scheme database for matching, and obtains a processing script that matches the operation data.

In the embodiments of the present application, the central server may set up a scheduled task in its background to fetch the alarm emails generated by the monitoring nodes at regular intervals. In an alarm email, the operation data indicating the system exception of the communication end may be attached as a text attachment or included in the email body. The central server parses the relevant text content of the alarm email sent by the monitoring node. This includes segmenting the text into words, finding the strings that characterize system operation indicators, and reading the data corresponding to each such string, so that the text is converted into a data table describing the operating condition of the system. In this data table, the key names are the strings that characterize system operation indicators and the key values are the data corresponding to each string. The strings that characterize system operation indicators include, but are not limited to, the server number, the server address, the time at which the system exception occurred, a description of the system exception, and the operating parameters at the time of the exception.
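A minimal sketch of this parsing step is given below. It assumes the alarm text carries one `key: value` pair per line and that the set of indicator strings is known in advance; both the line format and the names `KNOWN_INDICATORS` and `parse_alarm_text` are illustrative assumptions, and real alarm text would likely need proper word segmentation rather than a split on the first colon.

```python
# Hypothetical indicator names; the description lists server number, server
# address, occurrence time, exception description and operating parameters.
KNOWN_INDICATORS = frozenset(
    {"server_id", "server_address", "occurred_at", "description", "cpu_usage"}
)


def parse_alarm_text(text, indicators=KNOWN_INDICATORS):
    """Convert alarm text into a data table of indicator -> value.

    Lines that do not start with a known indicator string are ignored.
    """
    table = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as timestamps
        # that contain colons stay intact.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in indicators:
            table[key] = value.strip()
    return table


if __name__ == "__main__":
    sample = (
        "server_id: srv-01\n"
        "occurred_at: 2018-05-22 10:00:00\n"
        "greeting: hello\n"
        "cpu_usage: 97\n"
    )
    print(parse_alarm_text(sample))
```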

A processing plan database is created in the central server. The database is created no later than the execution of S3, as shown in FIG. 7:

S701: The central server enters a configuration mode.

After the configuration mode is triggered, the central server presents a configurable page to the operation and maintenance user, on which the user can configure the processing plan database.

S702: In the configuration mode, the central server receives, from the operation and maintenance user, characteristic parameters describing a system abnormality and the corresponding processing script.

On the configurable page, the operation and maintenance user enters the characteristic parameters describing a system abnormality and the corresponding processing script. For each type of system abnormality, the character strings that characterize system operation indicators take different values, and these strings together with their values make up the characteristic parameters that describe the abnormality.

S703: The central server stores the characteristic parameters entered by the operation and maintenance user in association with the corresponding processing script; the central server uses the characteristic parameters for matching against the operating data.

The central server stores the characteristic parameters that the operation and maintenance user entered on the configurable page to describe a given type of system abnormality in association with the processing script used to recover from that type of abnormality. Then, after the operating data indicating the system abnormality of the communication end has been parsed out of the alarm email, the values of the character strings characterizing system operation indicators in the operating data are matched against the values of the corresponding character strings in the characteristic parameters. This determines the type of the system abnormality and, in turn, the processing script for handling that type.
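Steps S701 to S703 and the subsequent matching can be sketched as a small in-memory store. The class name `ProcessingPlanDB`, its methods, and the exact-equality matching rule are assumptions made for illustration; the application does not specify the storage format or how values are compared.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProcessingPlanDB:
    """Hypothetical store associating characteristic parameters with scripts."""

    entries: list = field(default_factory=list)

    def register(self, feature_params: dict, script: str) -> None:
        """S702/S703: store characteristic parameters with their script."""
        self.entries.append((dict(feature_params), script))

    def match(self, operating_data: dict) -> Optional[str]:
        """Return the first script whose parameters all equal the parsed values."""
        for params, script in self.entries:
            if all(operating_data.get(k) == v for k, v in params.items()):
                return script
        return None


db = ProcessingPlanDB()
db.register({"exception_description": "thread blocked"}, "restart_workers.sh")
```

A real deployment would probably need range or pattern matching for numeric operating parameters; exact equality is used here only to keep the sketch short.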

S4: The central server pushes the processing script to the communication end at which the system abnormality has occurred. The processing script is automatically executed after being received by that communication end and is used to handle the system abnormality.

After the central server has identified, in its processing plan database, a processing script that matches the system abnormality of the communication end, it pushes the script to that communication end using the information about the abnormal communication end given in the alarm email, for example its network address. On receiving the script pushed by the central server, the communication end automatically executes it to recover from the system abnormality.
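The receiving side of S4 can be sketched as follows. The transport used for the push is not described in the application and is omitted here. The function `receive_and_execute` is hypothetical, and it assumes the pushed script is Python source, which the application does not state.

```python
import os
import subprocess
import sys
import tempfile


def receive_and_execute(script_text: str, timeout_s: float = 30.0) -> dict:
    """Run a pushed script in a child process and return a simple execution log."""
    # Write the received script to a temporary file so it can be executed.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script_text)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    finally:
        os.unlink(path)
    return {
        "success": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

Because the script runs automatically on receipt, a practical system would need to authenticate the sender before executing anything; that check is outside this sketch.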

Further, at the communication end, timely recovery from the system abnormality is ensured by setting the priority of the processing script, as shown in FIG. 8:

S801: After detecting the processing script pushed by the central server, the communication end automatically creates an execution thread for the processing script.

At the communication end, the push of a processing script is configured in advance as a trigger condition: once the communication end detects that the central server has pushed a processing script, a thread-creation action is triggered and an execution thread for that script is automatically created locally.

S802: The communication end sets the priority of the execution thread to the highest, so that the processing script is automatically executed first.

After the execution thread has been created, the communication end sets its priority to the highest. Every other thread previously created on the communication end then has a lower priority than the execution thread, so the communication end runs the execution thread immediately to execute the processing script, achieving timely recovery from the system abnormality.
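Python's `threading` module does not expose thread priorities, so the following sketch substitutes a different technique: a priority queue drained by a worker thread, in which recovery work is always dequeued before ordinary work that was submitted earlier. The names `PriorityExecutor`, `RECOVERY`, and `NORMAL` are hypothetical; this illustrates the ordering effect of S801 and S802, not the thread-priority mechanism itself.

```python
import itertools
import queue
import threading

RECOVERY = 0   # highest priority (lowest number is dequeued first)
NORMAL = 10

_seq = itertools.count()  # tie-breaker that preserves submission order


class PriorityExecutor:
    """Runs queued tasks on a worker thread, highest priority first."""

    def __init__(self):
        self._q = queue.PriorityQueue()
        self.done = []  # names of completed tasks, in execution order

    def submit(self, priority, name, fn):
        self._q.put((priority, next(_seq), name, fn))

    def run_all(self):
        def worker():
            while True:
                try:
                    _, _, name, fn = self._q.get_nowait()
                except queue.Empty:
                    return
                fn()
                self.done.append(name)

        t = threading.Thread(target=worker)
        t.start()
        t.join()
```

On platforms and runtimes that do support thread priorities, setting the execution thread to the maximum priority would follow the text more directly.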

Further, as an embodiment of the present application, after the system abnormality has been handled, as shown in FIG. 9:

S5: After the system abnormality has been handled successfully, the communication end feeds back the execution log of the processing script to the central server.

While executing the processing script, the communication end records the execution process and generates an execution log, and it feeds back to the central server both the result indicating that the system abnormality was handled successfully and the execution log.

S6: At preset time intervals, the central server performs statistical analysis on the received execution logs and generates a prediction report on the system operating status according to the results of the statistical analysis.

Based on the received execution logs, the central server records the communication ends whose system abnormalities were handled successfully. At regular intervals it performs statistical analysis on the system abnormalities of the communication ends and the corresponding processing results, and generates a prediction report on the system operating status from the results. The report helps operation and maintenance personnel understand how the system is running, improve system functions, and improve system stability.
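The statistical analysis in S6 is not specified further in the application. The following sketch shows only a plausible first step, aggregating the received execution logs by abnormality type and by communication end. The log fields `host`, `exception_type`, and `success` are hypothetical, and no prediction model is implied.

```python
from collections import Counter


def summarize_logs(logs: list) -> dict:
    """Aggregate execution logs into counts a report could be built from."""
    by_type = Counter(entry["exception_type"] for entry in logs)
    by_host = Counter(entry["host"] for entry in logs)
    succeeded = sum(1 for entry in logs if entry["success"])
    return {
        "total": len(logs),
        "success_rate": succeeded / len(logs) if logs else None,
        "by_type": by_type.most_common(),   # most frequent abnormality first
        "by_host": by_host.most_common(),   # most affected communication end first
    }
```

Running this at each preset interval and comparing consecutive summaries would show which abnormality types or communication ends are trending upward, which is one way such a report could inform operation and maintenance personnel.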

Based on the method described above, automatic recovery from system abnormalities can be achieved. For example, suppose a communication end in the system suffers a thread-blocking abnormality. The monitoring node parses the system operation information reported by the communication end, confirms that the communication end has a system abnormality, and therefore sends an alarm email to the central server. The central server parses the text of the alarm email and feeds the parsing result into the processing plan database for matching, so as to find the processing script for handling thread blocking. Using the communication end address in the alarm email, it pushes the processing script to the communication end, which automatically executes the script and completes recovery from the system abnormality.

It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.

FIG. 10 is a schematic diagram of a network node according to an embodiment of the present application. Here, the network node may be the central server, a communication end, or a monitoring node in FIG. 1. As shown in FIG. 10, the network node 10 of this embodiment includes a processor 100, a memory 101, and computer-readable instructions 102 stored in the memory 101 and executable on the processor 100. When the processor 100 executes the computer-readable instructions 102, it implements the steps of the above embodiments of the method for recovering from a system abnormality that correspond to the respective network node. For example, for a communication end, the processor 100 executes step S1 shown in FIG. 2; for a monitoring node, the processor 100 executes step S2 shown in FIG. 2; and for the central server, the processor 100 executes steps S3 and S4 shown in FIG. 2.

Exemplarily, the computer-readable instructions 102 may be divided into one or more modules/units, which are stored in the memory 101 and executed by the processor 100 to carry out the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 102 in the corresponding network node.

The processor 100 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.

The memory 101 may be an internal storage unit of the corresponding network node, such as the hard disk or internal memory of a communication end. The memory 101 may also be an external storage device of the corresponding network node, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the communication end. Further, the memory 101 may include both an internal storage unit and an external storage device of the corresponding network node. The memory 101 is used to store the computer-readable instructions and other programs and data required by the server. The memory 101 may also be used to temporarily store data that has been output or is to be output.

All or part of the processes in the methods of the above embodiments may also be implemented by computer-readable instructions that instruct the relevant hardware. The computer-readable instructions may be stored in a computer-readable storage medium and, when executed by a processor, implement the steps of each of the above method embodiments. The computer-readable instructions include computer-readable instruction code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer-readable instructions, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content covered by the computer-readable medium may be expanded or narrowed as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.

The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the scope of protection of the present application.

Claims (20)

1. A method for processing a system abnormality, comprising:
a communication end collecting system operation information in real time and reporting the system operation information to a monitoring node;
the monitoring node generating an alarm email according to the system operation information and sending the alarm email to a central server, wherein the alarm email records the communication end at which a system abnormality has occurred and operating data indicating the system abnormality of that communication end;
the central server outputting the operating data in the alarm email to a preset processing plan database for matching, and obtaining a processing script that matches the operating data; and
the central server pushing the processing script to the communication end at which the system abnormality has occurred, wherein the processing script is automatically executed after being received by that communication end and is used to handle the system abnormality.

2. The method for processing a system abnormality according to claim 1, wherein before the communication end collects its system operation information in real time and reports the system operation information to the monitoring node, the method further comprises:
the communication end obtaining a monitoring node list, wherein the monitoring node list records each gateway in the system and the monitoring nodes deployed under each gateway;
the communication end finding, in the monitoring node list, the gateway under which it is located; and
the communication end determining a monitoring node deployed under the found gateway as the monitoring node to which the system operation information is to be reported.

3. The method for processing a system abnormality according to claim 2, wherein after the communication end determines a monitoring node deployed under the found gateway as the monitoring node to which the system operation information is to be reported, the method further comprises:
the communication end recording all monitoring nodes deployed under the found gateway; and
if a failure to report the system operation information is detected, the communication end selecting another monitoring node under the found gateway as the monitoring node to which the system operation information is to be reported.

4. The method for processing a system abnormality according to claim 1, wherein before the monitoring node generates the alarm email according to the system operation information, the method further comprises:
within a preset time period, the monitoring node collecting the system operation information of different communication ends;
the monitoring node clustering the collected system operation information to obtain a plurality of clusters;
the monitoring node marking, among the plurality of clusters, the clusters that indicate normal operation of the system; and
the monitoring node generating a normal system operation model based on the marked clusters, wherein the normal system operation model is used by the monitoring node to determine whether the system operation information reported by a communication end indicates that the system of that communication end is operating normally.

5. The method for processing a system abnormality according to claim 4, wherein the monitoring node marking, among the plurality of clusters, the clusters that indicate normal operation of the system comprises:
the monitoring node sorting the plurality of clusters in descending order of cluster size;
the monitoring node reading a preset proportion parameter, wherein the proportion parameter indicates the proportion of communication ends whose systems are normal among all communication ends at the same moment; and
the monitoring node marking, based on the preset proportion parameter, the top N clusters as clusters indicating normal operation of the system.

6. The method for processing a system abnormality according to claim 5, wherein the monitoring node marking, based on the preset proportion parameter, the top N clusters as clusters indicating normal operation of the system comprises:
the monitoring node marking the top N clusters as clusters indicating normal operation of the system according to the preset proportion parameter, the number of pieces of system operation information reported by the communication ends in the current statistical period, and the number of pieces of system operation information in each cluster.

7. The method for processing a system abnormality according to claim 1, wherein before the central server outputs the characteristic data in the alarm email to the preset processing plan database for matching and obtains the processing script corresponding to the characteristic data, the method further comprises:
the central server entering a configuration mode;
in the configuration mode, the central server receiving, from an operation and maintenance user, characteristic parameters describing a system abnormality and the corresponding processing script; and
the central server storing the characteristic parameters entered by the operation and maintenance user in association with the corresponding processing script, wherein the characteristic parameters are used by the central server for matching against the operating data.

8. The method for processing a system abnormality according to claim 1, further comprising:
the communication end, after the system abnormality has been handled successfully, feeding back the execution log of the processing script to the central server;
the central server performing statistical analysis on the received execution logs at preset time intervals; and
the central server generating a prediction report on the system operating status according to the results of the statistical analysis.

9. The method for processing a system abnormality according to claim 1, further comprising:
after detecting the processing script pushed by the central server, the communication end automatically creating an execution thread for the processing script; and
the communication end setting the priority of the execution thread to the highest, so that the processing script is automatically executed first.

10. The method for processing a system abnormality according to claim 1, wherein the alarm email records a device identifier or a network address of the communication end at which the system abnormality has occurred.

11. A system for processing a system abnormality, comprising a central server and a plurality of communication ends and a plurality of monitoring nodes deployed in a distributed manner, wherein:
the communication end is configured to collect its system operation information in real time and report the system operation information to the monitoring node;
the monitoring node is configured to generate an alarm email according to the system operation information and send the alarm email to the central server, wherein the alarm email records the communication end at which a system abnormality has occurred and operating data indicating the system abnormality of that communication end;
the central server is configured to output the operating data in the alarm email to a preset processing plan database for matching, and obtain a processing script that matches the operating data; and
the central server is further configured to push the processing script to the communication end at which the system abnormality has occurred, wherein the processing script is automatically executed after being received by that communication end and is used to handle the system abnormality.

12. The processing system according to claim 11, wherein the monitoring node and the communication end that reports the system operation information to it are located under the same gateway.

13. The processing system according to claim 11, wherein before the communication end collects its system operation information in real time and reports the system operation information to the monitoring node, the communication end is further configured to:
obtain a monitoring node list, wherein the monitoring node list records each gateway in the system and the monitoring nodes deployed under each gateway;
find, in the monitoring node list, the gateway under which it is located; and
determine a monitoring node deployed under the found gateway as the monitoring node to which the system operation information is to be reported.

14. The processing system according to claim 13, wherein after the communication end determines a monitoring node deployed under the found gateway as the monitoring node to which the system operation information is to be reported, the communication end is further configured to:
record all monitoring nodes deployed under the found gateway; and
if a failure to report the system operation information is detected, select another monitoring node under the found gateway as the monitoring node to which the system operation information is to be reported.

15. The processing system according to claim 11, wherein before the monitoring node generates the alarm email according to the system operation information, the monitoring node is further configured to:
collect, within a preset time period, the system operation information of different communication ends;
cluster the collected system operation information to obtain a plurality of clusters;
mark, among the plurality of clusters, the clusters that indicate normal operation of the system; and
generate a normal system operation model based on the marked clusters, wherein the normal system operation model is used by the monitoring node to determine whether the system operation information reported by a communication end indicates that the system of that communication end is operating normally.

16. The processing system according to claim 15, wherein the monitoring node is specifically configured to:
sort the plurality of clusters in descending order of cluster size;
read a preset proportion parameter, wherein the proportion parameter indicates the proportion of communication ends whose systems are normal among all communication ends at the same moment; and
mark, based on the preset proportion parameter, the top N clusters as clusters indicating normal operation of the system.

17. The processing system according to claim 16, wherein the monitoring node is specifically configured to:
mark the top N clusters as clusters indicating normal operation of the system according to the preset proportion parameter, the number of pieces of system operation information reported by the communication ends in the current statistical period, and the number of pieces of system operation information in each cluster.

18. The processing system according to claim 11, wherein before the central server outputs the characteristic data in the alarm email to the preset processing plan database for matching and obtains the processing script corresponding to the characteristic data, the central server is further configured to:
enter a configuration mode;
in the configuration mode, receive, from an operation and maintenance user, characteristic parameters describing a system abnormality and the corresponding processing script; and
store the characteristic parameters entered by the operation and maintenance user in association with the corresponding processing script, wherein the characteristic parameters are used by the central server for matching against the operating data.

19. The processing system according to claim 11, wherein:
the communication end is further configured to feed back the execution log of the processing script to the central server after the system abnormality has been handled successfully;
the central server is further configured to perform statistical analysis on the received execution logs at preset time intervals; and
the central server is further configured to generate a prediction report on the system operating status according to the results of the statistical analysis.

20. The processing system according to claim 11, wherein:
after detecting the processing script pushed by the central server, the communication end is further configured to automatically create an execution thread for the processing script; and
the communication end is further configured to set the priority of the execution thread to the highest, so that the processing script is automatically executed first.
PCT/CN2018/093707 2018-05-22 2018-06-29 Method and system for processing system exceptions Ceased WO2019223062A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810496049.9 2018-05-22
CN201810496049.9A CN108737182A (en) 2018-05-22 2018-05-22 The processing method and system of system exception

Publications (1)

Publication Number Publication Date
WO2019223062A1 true WO2019223062A1 (en) 2019-11-28

Family

ID=63938832

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/093707 Ceased WO2019223062A1 (en) 2018-05-22 2018-06-29 Method and system for processing system exceptions

Country Status (2)

Country Link
CN (1) CN108737182A (en)
WO (1) WO2019223062A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109828884B (en) * 2018-12-14 2023-04-14 深圳壹账通智能科技有限公司 Add-on service data processing method, system, computer equipment and storage medium
CN111756778B (en) * 2019-03-26 2024-06-18 京东科技控股股份有限公司 Method, device and storage medium for pushing server disk cleaning script
CN110175679A (en) * 2019-05-29 2019-08-27 深圳前海微众银行股份有限公司 Method and device for monitoring model training
CN111447329A (en) * 2020-03-31 2020-07-24 携程旅游信息技术(上海)有限公司 Method, system, device and medium for monitoring state server in call center
CN114077525A (en) * 2020-08-17 2022-02-22 鸿富锦精密电子(天津)有限公司 Abnormal log processing method and device, terminal equipment, cloud server and system
CN113747171B (en) * 2021-08-06 2024-04-19 天津津航计算技术研究所 Self-recovery video decoding method
CN113676356A (en) * 2021-08-27 2021-11-19 创新奇智(青岛)科技有限公司 Alarm information processing method and device, electronic equipment and readable storage medium
CN115473787B (en) * 2022-08-25 2024-12-06 上海东普信息科技有限公司 Distributed application operation and maintenance method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104184819A (en) * 2014-08-29 2014-12-03 城云科技(杭州)有限公司 Multi-hierarchy load balancing cloud resource monitoring method
CN104699759A (en) * 2015-02-10 2015-06-10 上海新炬网络信息技术有限公司 Method for maintaining automatic operation of database
US20160057009A1 (en) * 2014-08-21 2016-02-25 Netapp, Inc. Configuration of peered cluster storage environment organized as disaster recovery group
CN105721304A (en) * 2016-04-05 2016-06-29 网宿科技股份有限公司 Adaptive routing adjustment method and system and service device
WO2017044772A1 (en) * 2015-09-09 2017-03-16 Convida Wireless, Llc Methods for enabling context-aware coap messaging
CN107632918A (en) * 2017-08-30 2018-01-26 中国工商银行股份有限公司 Calculate the monitoring system and method for storage device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101561878B (en) * 2009-05-31 2012-11-21 河海大学 Unsupervised anomaly detection method and system based on improved CURE clustering algorithm
CN103532795B (en) * 2013-10-30 2017-01-04 蓝盾信息安全技术股份有限公司 Monitoring system and method for detecting WEB service system availability
CN105337765B (en) * 2015-10-10 2018-10-12 上海新炬网络信息技术股份有限公司 Automatic fault diagnosis and repair system for distributed Hadoop clusters
CN106789141B (en) * 2015-11-24 2020-12-11 阿里巴巴集团控股有限公司 Gateway equipment fault handling method and device
CN107135156A (en) * 2017-06-07 2017-09-05 努比亚技术有限公司 Call chain collecting method, mobile terminal and computer-readable recording medium

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113495820A (en) * 2020-04-03 2021-10-12 北京沃东天骏信息技术有限公司 Method and device for collecting and processing abnormal information and abnormal monitoring system
CN113765685A (en) * 2020-06-05 2021-12-07 腾讯科技(深圳)有限公司 Abnormity management method and device
CN111915452A (en) * 2020-08-28 2020-11-10 平安国际智慧城市科技股份有限公司 A supervisory system, method, device, supervisory processing device, and storage medium
CN112214409B (en) * 2020-10-13 2023-11-24 中国工商银行股份有限公司 An operation and maintenance method and device for testing environment
CN112214409A (en) * 2020-10-13 2021-01-12 中国工商银行股份有限公司 A method and device for operation and maintenance in a test environment
CN112561385A (en) * 2020-12-24 2021-03-26 平安银行股份有限公司 Risk monitoring method and system
CN115827394A (en) * 2021-12-07 2023-03-21 湖南博轩智能科技股份有限公司 A monitoring and alarming method based on Zabbix
CN115202864A (en) * 2022-05-30 2022-10-18 江铃汽车股份有限公司 Production line equipment monitoring method and device, readable storage medium and computer equipment
CN115225534A (en) * 2022-07-26 2022-10-21 雷沃工程机械集团有限公司 Method for monitoring running state of monitoring server
CN116156191A (en) * 2022-11-25 2023-05-23 天翼数字生活科技有限公司 A provincial dispatching system, method, equipment and storage medium based on national standards
CN116347273A (en) * 2023-02-28 2023-06-27 国网山东省电力公司滨州供电公司 A Data Concentration Platform for Optical Transmission Network Management
CN117727150A (en) * 2023-11-23 2024-03-19 中国铁道科学研究院集团有限公司 High-speed railway perimeter intrusion alarm information processing method and system based on multi-sensing technology fusion
CN117458722A (en) * 2023-12-26 2024-01-26 西安民为电力科技有限公司 Data monitoring method and system based on power energy management system
CN117458722B (en) * 2023-12-26 2024-03-08 西安民为电力科技有限公司 Data monitoring method and system based on power energy management system

Also Published As

Publication number Publication date
CN108737182A (en) 2018-11-02

Similar Documents

Publication Publication Date Title
WO2019223062A1 (en) Method and system for processing system exceptions
CN103684828B (en) Method and apparatus for processing telecommunication equipment faults
CN107888397B (en) Method and device for determining fault type
CN112769605B (en) Heterogeneous multi-cloud operation and maintenance management method and hybrid cloud platform
CN105159964B (en) A log monitoring method and system
CN113328872A (en) Fault repair method, device and storage medium
CN102523137B (en) Fault monitoring method, device and system
CN110232010A (en) Alarm method, alarm server and monitoring server
CN107294764A (en) Intelligent supervision method and intelligent monitoring system
CN119030860A (en) Fault node positioning method, device, electronic device and non-volatile storage medium
CN118427557A (en) Fault root cause determination method, device, equipment, storage medium and program product
CN113806191A (en) A data processing method, device, equipment and storage medium
CN111913824A (en) Method for determining data link fault reason and related equipment
CN108241744A (en) Log reading method and apparatus
CN116074215A (en) Network quality detection method, device, equipment and storage medium
CN104765672A (en) Error code monitoring method, device and equipment
CN113110977A (en) Safety monitoring method based on block chain system
CN114422324B (en) Alarm information processing method, device, electronic equipment and storage medium
CN101350733B (en) System for acquiring net element performance data base on preposition data server and implementing method thereof
CN114327967A (en) Equipment repair method and device, storage medium, and electronic device
CN115705259A (en) Fault processing method, related device and storage medium
CN108108289A (en) Cluster resource statistics method, system, device and readable storage system
CN116594837A (en) Monitoring device, monitoring method and electronic equipment
CN117614853A (en) Alarm monitoring method, system, equipment and medium in cloud native environment
CN117376092A (en) Fault root cause positioning method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 18919520
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
32PN EP: public notification in the EP bulletin as address of the addressee cannot be established
    Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19.03.2021)
122 EP: PCT application non-entry in European phase
    Ref document number: 18919520
    Country of ref document: EP
    Kind code of ref document: A1