CN117370053A - Information system service operation-oriented panoramic monitoring method and system - Google Patents
- Publication number: CN117370053A (application CN202311191727.8A)
- Authority: CN (China)
- Prior art keywords: data, fault, module, unit, monitoring
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06F11/079: Root cause analysis, i.e. error or fault diagnosis
- G06F11/0709: Error or fault processing not based on redundancy, the processing taking place on a specific hardware platform or in a specific software environment, in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
- G06F11/0766: Error or fault reporting or storing
- G06F18/2433: Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Computer Hardware Design (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Debugging And Monitoring (AREA)
Description
Technical field
The invention relates to the technical field of operation and maintenance management of enterprise information systems, and in particular to a panoramic monitoring method and system for information system business operations.
Background art
Information system monitoring is the process of collecting information system operating data by technical means and then analyzing and managing it. Its purpose is to track the operating status of the system in real time, to discover risks and problems the system may face in a timely manner, and to ensure that the system runs stably and efficiently. It covers user experience monitoring, business application monitoring, service monitoring, platform component monitoring and infrastructure monitoring. Current operation and maintenance monitoring of information systems has the following shortcomings:
(1) Limited monitoring scope: monitoring focuses mainly on technical indicators such as server indicators and network indicators. Business data, service call links and user experience are not monitored sufficiently. Monitoring is largely confined to isolated "points", so the operating effect of the system cannot be evaluated comprehensively.
(2) Isolated monitoring systems: different monitoring systems are related to one another, but because data and information cannot be shared effectively among them, a global view of the system is hard to obtain during problem diagnosis and decision-making, which produces "information islands".
(3) Lack of standardization: monitoring systems generally adopt their own monitoring schemes and indicator systems, and there is no unified monitoring data standard or interface specification. As a result, data from different monitoring systems is difficult to interoperate and compare, which hinders the integration of monitoring.
Summary of the invention
The purpose of this section is to outline some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section, in the abstract and in the title of the invention to avoid obscuring their purpose; such simplifications or omissions shall not be used to limit the scope of the invention.
The present invention is proposed in view of the existing problems described above.
Accordingly, the present invention provides a panoramic monitoring method and system for information system business operations that can solve the problems mentioned in the background art.
To solve the above technical problems, the present invention provides the following technical solution: a panoramic monitoring method for information system business operations, comprising:
constructing a panoramic monitoring indicator system for the information system based on business requirements, the business requirements including detecting anomalies before users perceive them, automatic alarm correlation and precise fault location;
obtaining target monitoring data and preprocessing the target monitoring data;
combining the preprocessed data with the panoramic monitoring indicator system to perform alarm analysis and fault location, thereby completing panoramic monitoring of the information system business operation.
A panoramic monitoring system for information system business operations, characterized by comprising a monitoring data collection unit, a server unit, a communication unit, a logic analysis unit, an alarm unit, and a terminal display and query unit, wherein:
the monitoring data collection unit is configured to collect business performance data, operation and maintenance business data, and resource and relationship data, to divide the data by a fixed period into current-period data and historical-period data, and to transmit the data to the server unit for structured storage;
the server unit is configured to receive the data transmitted by the monitoring data collection unit and store it in structured form, and, according to instructions received from other units, to retrieve the corresponding structured data and transmit it to the requesting unit through the communication unit;
the communication unit is configured to maintain the connections between the units of the system;
the logic analysis unit is configured to preset performance indicators, business application indicators and alarm indicators, to send data analysis instructions to the server unit through the communication unit, and to analyze and judge the monitoring data by combining the data returned by the server unit with the preset performance indicators, business application indicators and alarm indicators;
the alarm unit is configured to issue alarms according to the analysis and judgment results of the logic analysis unit;
the terminal display and query unit is configured to display the analysis and judgment results of the logic analysis unit together with the alarms issued by the alarm unit, and to provide users with a resource query service by retrieving the corresponding data from the server unit through the communication unit and displaying it.
As a preferred solution of the panoramic monitoring system for information system business operations according to the present invention, the monitoring data collection unit includes a data collection module and a data processing module.
The data collection module includes a first data collection part, a second data collection part and a third data collection part. Each of the three parts has a one-way connection to the data processing module and transmits the data it collects directly to the data processing module.
The first data collection part collects system external data through APM (application performance monitoring) tools, the second data collection part through NPM (network performance monitoring) tools, and the third data collection part through other link monitoring tools. After receiving the data transmitted by the data collection module, the data processing module aggregates it into three types of data (topology data, link data and indicator data), structures it, and sends it to the server unit through the communication unit; the server unit stores the structured data it receives.
The structuring includes dividing the data into basic monitoring data, business link data, system business data and log data.
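The structuring step can be illustrated with a minimal Python sketch. It assumes each collected record carries a `kind` tag; the tags, the `CATEGORY_BY_KIND` table and the `structure_records` function are hypothetical names introduced here for illustration and do not come from the patent.

```python
from collections import defaultdict

# Hypothetical mapping from a record's "kind" tag to one of the four
# structured categories named in the text. The tags are assumptions.
CATEGORY_BY_KIND = {
    "cpu_usage": "basic_monitoring",
    "memory_usage": "basic_monitoring",
    "packet_loss": "basic_monitoring",
    "trace": "business_link",
    "span": "business_link",
    "topology": "business_link",
    "online_users": "system_business",
    "visit_count": "system_business",
    "log": "log",
}


def structure_records(records):
    """Group raw records into the four structured categories."""
    structured = defaultdict(list)
    for record in records:
        # Records with an unknown tag are kept aside rather than dropped.
        category = CATEGORY_BY_KIND.get(record["kind"], "unclassified")
        structured[category].append(record)
    return dict(structured)


sample = [
    {"kind": "cpu_usage", "source": "NPM", "value": 0.42},
    {"kind": "trace", "source": "APM", "value": "order-service -> db"},
    {"kind": "online_users", "source": "APM", "value": 1200},
    {"kind": "log", "source": "other", "value": "ERROR timeout"},
]
structured = structure_records(sample)
```

A lookup table keeps the classification rule separate from the grouping logic, so new record kinds can be added without touching the function.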
As a preferred solution of the panoramic monitoring system for information system business operations according to the present invention, the logic analysis unit includes an indicator preset module, an indicator comparison module, a fault analysis and judgment module, a fault location module and a fault early warning module.
The indicator preset module is used to preset performance indicators, business application indicators and alarm indicators. The indicator comparison module is connected to the server unit through the communication unit and sends comparison instructions to the communication unit; the comparison instructions include a first comparison instruction, a second comparison instruction and a third comparison instruction.
When the server unit receives the first comparison instruction, it transmits the basic monitoring data to the indicator comparison module, which compares the data against the indicators preset in the indicator preset module, generates a first comparison result, and transmits the first comparison result to the fault analysis and judgment module.
When the server unit receives the second comparison instruction, it transmits the business link data and the system business data to the indicator comparison module, which compares the data against the indicators preset in the indicator preset module, generates a second comparison result, and transmits the second comparison result to the fault analysis and judgment module.
When the server unit receives the third comparison instruction, it transmits the log data to the indicator comparison module, which compares the data against the indicators preset in the indicator preset module, generates a third comparison result, and transmits the third comparison result to the fault analysis and judgment module.
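The mapping from comparison instruction to data set can be sketched as a small lookup. This is an illustrative sketch only: the instruction keys, the category names and the `handle_comparison_instruction` function are assumptions, and the dictionary `store` stands in for the server unit's structured storage.

```python
# Which structured data set each comparison instruction pulls from the
# server unit, following the text. Key and category names are assumptions.
DATA_FOR_INSTRUCTION = {
    "first": ["basic_monitoring"],
    "second": ["business_link", "system_business"],
    "third": ["log"],
}


def handle_comparison_instruction(instruction, store):
    """Return the data the server unit would send for one instruction."""
    try:
        categories = DATA_FOR_INSTRUCTION[instruction]
    except KeyError:
        raise ValueError(f"unknown comparison instruction: {instruction!r}")
    # A category with no stored data is returned as an empty list.
    return {name: store.get(name, []) for name in categories}


store = {
    "basic_monitoring": [{"cpu_usage": 0.42}],
    "business_link": [{"success_rate": 0.99}],
    "system_business": [{"online_users": 1200}],
    "log": ["ERROR timeout"],
}
second_payload = handle_comparison_instruction("second", store)
```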
As a preferred solution of the panoramic monitoring system for information system business operations according to the present invention, the logic analysis unit further operates as follows.
The fault analysis and judgment module receives the indicator comparison results transmitted by the indicator comparison module and analyzes and judges them; these results are referred to below as the first, second and third judgment results.
When the first judgment result is received, a fault is determined to have occurred if the basic monitoring data in the first judgment result deviates from the preset basic monitoring data by ten percent, or if basic monitoring data is missing from the first judgment result.
When the second judgment result is received, a fault is determined to have occurred if the business link data and the system business data in the second judgment result are missing.
When the third judgment result is received, a fault is determined to have occurred if the log data in the third judgment result is missing or has increased compared with the previous period.
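The three judgment rules can be written down as a short Python sketch. Several points are assumptions made for illustration, not statements of the patent: "deviates by ten percent" is read as a relative deviation of at least 10%, either data set being empty counts as "missing" for the second rule, and log growth is measured by entry count. All function names are hypothetical.

```python
DEVIATION_LIMIT = 0.10  # "ten percent" from the text, read as a relative deviation


def basic_data_faulty(observed, preset):
    """First rule: a value is missing or deviates >= 10% from its preset."""
    for name, expected in preset.items():
        value = observed.get(name)
        if value is None:
            return True  # missing basic monitoring data
        # Presets equal to zero are skipped, since a relative deviation
        # is undefined for them (an assumption of this sketch).
        if expected != 0 and abs(value - expected) / abs(expected) >= DEVIATION_LIMIT:
            return True
    return False


def link_or_business_faulty(link_data, business_data):
    """Second rule: business link data or system business data is missing."""
    return not link_data or not business_data


def log_faulty(current_logs, previous_logs):
    """Third rule: log data is missing or grew versus the previous period."""
    return not current_logs or len(current_logs) > len(previous_logs)
```

For example, with a preset CPU value of 100, an observed value of 80 is a 20% deviation and is flagged, while 95 is a 5% deviation and is not.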
As a preferred solution of the panoramic monitoring system for information system business operations according to the present invention, the logic analysis unit further operates as follows.
The fault location module obtains its location model by neural network training: it receives the fault data contained in the log data of the server unit and trains a fault location neural network model with the fault data as input and the fault type and fault location as output.
After training is complete, the fault location module saves the model to the server unit. If the fault analysis and judgment module determines that a fault has occurred, it sends a fault-to-be-located instruction to the fault location module; on receiving that instruction, the fault location module sends an instruction to the server unit through the communication unit to retrieve the fault location neural network model.
Once the model has been retrieved successfully, the real-time data in the server unit is fed directly into the fault location neural network model as input to obtain the fault type and fault location, which are then transmitted to the fault early warning module.
The fault early warning module confirms the alarm level for the fault type and fault location and then transmits the alarm level to the alarm unit.
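The flow from a detected fault to an alarm level can be outlined as follows. This is a sketch under stated assumptions: the trained neural network is replaced by a stub callable (`stub_model`) that is not a neural network, and the alarm level table, the level names and the `locate_and_grade` function are hypothetical, since the text does not specify how levels are assigned.

```python
def locate_and_grade(model, realtime_data, level_table, default_level="minor"):
    """Run a fault location model and confirm an alarm level for its output."""
    fault_type, fault_location = model(realtime_data)
    # Prefer an exact (type, location) entry, then a type-wide entry.
    level = level_table.get(
        (fault_type, fault_location),
        level_table.get((fault_type, None), default_level),
    )
    return {"type": fault_type, "location": fault_location, "level": level}


def stub_model(data):
    """Stand-in for the trained model: maps real-time data to (type, location)."""
    if data.get("packet_loss", 0.0) > 0.05:
        return ("network", "core-switch-1")
    return ("application", "order-service")


# Hypothetical level table: every network fault is treated as critical.
levels = {("network", None): "critical"}
result = locate_and_grade(stub_model, {"packet_loss": 0.2}, levels)
```

Passing the model in as a callable keeps the grading step independent of how the model was trained or stored.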
As a preferred solution of the panoramic monitoring system for information system business operations according to the present invention, the alarm unit provides alarm trigger strategy configuration and alarm notification strategy configuration, and provides multi-channel alarms for information system monitoring. Alarm trigger strategies are configured independently, and trigger conditions include single-indicator triggers and multi-indicator combination triggers. Single-indicator trigger rules include missing-data alarms, threshold alarms, character comparison, trend alarms, state reversal, floating-threshold alarms and sudden-change alarms. Multi-indicator triggers raise joint alarms across several indicators based on "AND" and "OR" rules. The alarm notification strategy configuration includes the alarm resource scope and the notification strategy. The alarm resource scope can be set by resource type, resource unit, machine room, group or resource instance. The notification strategy includes recipients, notification method, notification time, repetition rules and escalation rules, and also supports deduplication, grouping, inhibition, silencing and routing of alarms. Alarms are issued according to the analysis and judgment results of the logic analysis unit.
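Two of the single-indicator rules, the "AND"/"OR" combination and alarm deduplication can be sketched in a few lines of Python. The rule constructors, the `Deduplicator` class, the metric names and the thresholds are all hypothetical choices made for this illustration; the text names the rule types but does not define their parameters.

```python
def threshold_rule(metric, limit):
    """Single-indicator rule: fires when the metric exceeds a fixed threshold."""
    return lambda sample: sample.get(metric) is not None and sample[metric] > limit


def missing_rule(metric):
    """Single-indicator rule: fires when the metric is absent from the sample."""
    return lambda sample: sample.get(metric) is None


def all_of(*rules):
    """Multi-indicator "AND": fires only if every rule fires."""
    return lambda sample: all(rule(sample) for rule in rules)


def any_of(*rules):
    """Multi-indicator "OR": fires if at least one rule fires."""
    return lambda sample: any(rule(sample) for rule in rules)


class Deduplicator:
    """Drops repeat alarms with the same key inside a quiet window (seconds)."""

    def __init__(self, window):
        self.window = window
        self.last_sent = {}

    def should_send(self, key, now):
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window:
            return False
        self.last_sent[key] = now
        return True


cpu_high = threshold_rule("cpu", 0.9)
loss_high = threshold_rule("packet_loss", 0.05)
both = all_of(cpu_high, loss_high)
either = any_of(cpu_high, loss_high)
sample = {"cpu": 0.95, "packet_loss": 0.01}
```

With this sample, `either` fires because CPU usage is above its threshold, while `both` does not because packet loss is below its threshold.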
As a preferred solution of the panoramic monitoring system for information system business operations according to the present invention, the terminal display and query unit displays the data collected by the system, displays alarm information in combination with the alarm unit, and queries resources according to user needs.
A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method described above when executing the computer program.
A computer-readable storage medium on which a computer program is stored, characterized in that the computer program implements the steps of the method described above when executed by a processor.
Beneficial effects of the present invention: the invention proposes a panoramic monitoring method and system for information system business operations. A panoramic monitoring indicator system is constructed based on business requirements, which include detecting anomalies before users perceive them, automatic alarm correlation and precise fault location; target monitoring data is obtained and preprocessed; and the preprocessed data is combined with the indicator system for alarm analysis and fault location, completing panoramic monitoring of the information system business operation. The invention thereby achieves anomaly detection ahead of user perception, automatic alarm correlation and precise fault location.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:
Figure 1 is a method flow chart of a panoramic monitoring method and system for information system business operations provided by an embodiment of the present invention;
Figure 2 is a schematic diagram of the system structure of a panoramic monitoring method and system for information system business operations provided by an embodiment of the present invention;
Figure 3 is a schematic diagram of the management of external APM/NPM probes in a panoramic monitoring method and system for information system business operations provided by an embodiment of the present invention;
Figure 4 is the technical architecture of information system panoramic monitoring in a panoramic monitoring method and system for information system business operations provided by an embodiment of the present invention;
Figure 5 is a schematic flow chart of information system panoramic monitoring in a panoramic monitoring method and system for information system business operations provided by an embodiment of the present invention;
Figure 6 is an internal structure diagram of a computer device for a panoramic monitoring method and system for information system business operations provided by an embodiment of the present invention.
Detailed description
To make the above objects, features and advantages of the present invention easier to understand, specific embodiments of the invention are described in detail below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Many specific details are set forth in the following description to provide a full understanding of the present invention. The invention can, however, also be implemented in ways other than those described here, and those skilled in the art can make similar generalizations without departing from the substance of the invention. The invention is therefore not limited to the specific embodiments disclosed below.
Furthermore, "one embodiment" or "an embodiment" as used herein refers to a specific feature, structure or characteristic that may be included in at least one implementation of the present invention. The phrase "in one embodiment" appearing in different places in this specification does not always refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive with other embodiments.
The present invention is described in detail with reference to schematic diagrams. When the embodiments are described in detail, cross-sectional views showing device structure may, for ease of explanation, be partially enlarged without following the general scale. The schematic diagrams are only examples and shall not limit the scope of protection of the invention. In addition, actual fabrication should include the three-dimensional dimensions of length, width and depth.
In the description of the present invention, orientations or positional relationships indicated by terms such as "upper", "lower", "inner" and "outer" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore shall not be construed as limiting the invention. The terms "first", "second" and "third" are used for descriptive purposes only and shall not be construed as indicating or implying relative importance.
Unless otherwise expressly specified and limited, the terms "installed", "coupled" and "connected" in the present invention are to be understood broadly. A connection may, for example, be fixed, detachable or integral; it may be mechanical, electrical or direct; it may be indirect through an intermediate medium; or it may be an internal communication between two elements. Those of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific circumstances.
Embodiment 1
Referring to Figures 1-6, the first embodiment of the present invention provides a panoramic monitoring system for information system business operations, including:
In a preferred embodiment, a panoramic monitoring system for information system business operations includes a monitoring data collection unit 100, a server unit 200, a communication unit 300, a logic analysis unit 400, an alarm unit 500, and a terminal display and query unit 600.
The monitoring data collection unit 100 is used to collect business performance data, operation and maintenance business data, and resource and relationship data, to divide the data by a fixed period into current-period data and historical-period data, and to transmit the data to the server unit 200 for structured storage.
The server unit 200 is used to receive the data transmitted by the monitoring data collection unit 100 and store it in structured form, and, according to instructions received from other units, to retrieve the corresponding structured data and transmit it to the requesting unit through the communication unit 300.
The communication unit 300 is used to maintain the connections between the units of the system.
The logic analysis unit 400 is used to preset performance indicators, business application indicators and alarm indicators, to send data analysis instructions to the server unit 200 through the communication unit 300, and to analyze and judge the monitoring data by combining the data returned by the server unit 200 with the preset performance indicators, business application indicators and alarm indicators.
The alarm unit 500 is used to issue alarms according to the analysis and judgment results of the logic analysis unit 400.
The terminal display and query unit 600 is used to display the analysis and judgment results of the logic analysis unit 400 together with the alarms issued by the alarm unit 500, and to provide users with a resource query service by retrieving the corresponding data from the server unit 200 through the communication unit 300 and displaying it.
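The fixed-period split performed by the monitoring data collection unit 100 can be sketched as follows. The record layout (a `ts` field holding a timestamp in seconds), the alignment of periods to multiples of the period length and the `split_by_period` function are assumptions of this sketch; the text only states that data is divided into current-period and historical-period data.

```python
def split_by_period(records, now, period_seconds):
    """Split timestamped records into current-period and historical data."""
    # Periods are assumed to start at multiples of period_seconds.
    period_start = now - (now % period_seconds)
    current = [r for r in records if r["ts"] >= period_start]
    historical = [r for r in records if r["ts"] < period_start]
    return current, historical


records = [
    {"ts": 600, "cpu_usage": 0.35},
    {"ts": 950, "cpu_usage": 0.40},
]
# With now=1000 and a 300-second period, the current period starts at 900.
current, historical = split_by_period(records, now=1000, period_seconds=300)
```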
监测数据采集单元100包括数据收集模块101以及数据处理模块102,The monitoring data collection unit 100 includes a data collection module 101 and a data processing module 102.
数据收集模块101包括第一数据收集部101a、第二数据收集部101b以及第三数据收集部101c,第一数据收集部101a、第二数据收集部101b以及第三数据收集部101c同时与数据处理模块102单向连接,直接将采集到的数据传输至数据处理模块102中;The data collection module 101 includes a first data collection part 101a, a second data collection part 101b, and a third data collection part 101c. The first data collection part 101a, the second data collection part 101b, and the third data collection part 101c simultaneously process data. The module 102 has a one-way connection and directly transmits the collected data to the data processing module 102;
第一数据收集部101a通过APM工具收集系统外部数据,第二数据收集部101b通过NPM工具收集系统外部数据,第三数据收集部101c通过其他链路监控工具收集系统外部数据,数据处理模块102在接收到数据收集模块101传输至的数据后,将数据聚合成拓扑数据、链路数据和指标数据三类数据,并通过通信单元300发送给服务器单元200,服务器单元200对接收到的数据进行转储;The first data collection part 101a collects system external data through APM tools, the second data collection part 101b collects system external data through NPM tools, the third data collection part 101c collects system external data through other link monitoring tools, and the data processing module 102 After receiving the data transmitted to the data collection module 101, the data is aggregated into three types of data: topology data, link data and indicator data, and is sent to the server unit 200 through the communication unit 300. The server unit 200 converts the received data. store;
Structuring includes dividing the data into basic monitoring data, business link data, system business data, and log data.
Basic monitoring data may include indicators such as port rate, CPU usage, memory usage, number of sessions, number of new connections, packet loss ratio, and latency. Business link data may include information such as call request success rate, average call response, calls per minute, topology data, the trace call list, and span monitoring details. System business data may include data such as the number of registered users, online users, active users, and new users, the number of visits, business volume, the distribution of business operation stages, the number of business accesses, and the number of business objects. The fourth category is log data.
The logic analysis unit 400 includes an indicator preset module 401, an indicator comparison module 402, a fault analysis and judgment module 403, a fault location module 404, and a fault warning module 405.
The indicator preset module 401 presets performance indicators, business application indicators, and alarm indicators. The indicator comparison module 402 is directly connected to the server unit 200 through the communication unit 300 and sends comparison instructions to the communication unit 300; the comparison instructions include a first comparison instruction, a second comparison instruction, and a third comparison instruction;
When the server unit 200 receives the first comparison instruction, it transmits the basic monitoring data to the indicator comparison module 402, which judges the data against the indicators preset in the indicator preset module 401 and transmits the judgment result to the fault analysis and judgment module 403;
When the server unit 200 receives the second comparison instruction, it transmits the business link data and the system business data to the indicator comparison module 402, which judges the data against the indicators preset in the indicator preset module 401 and transmits the judgment result to the fault analysis and judgment module 403;
When the server unit 200 receives the third comparison instruction, it transmits the log data to the indicator comparison module 402, which judges the data against the indicators preset in the indicator preset module 401 and transmits the judgment result to the fault analysis and judgment module 403.
The logic analysis unit 400 further includes the following:
The fault analysis and judgment module 403 receives the indicator comparison results transmitted by the indicator comparison module 402 and analyzes and judges them;
When the first judgment result is received, a fault is determined to have occurred if the basic monitoring data in the first judgment result deviates from the preset basic monitoring data by ten percent, or if the basic monitoring data in the first judgment result is missing;
When the second judgment result is received, a fault is determined to have occurred if the business link data and the system business data in the second judgment result are missing;
When the third judgment result is received, a fault is determined to have occurred if the log data in the third judgment result is missing or has increased compared with the previous period.
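The three judgment rules of module 403 can be illustrated with a short sketch. Several points are assumptions made here for illustration, not statements from the application: the deviation is treated as a relative deviation of at least ten percent, either of the two business data sets being missing counts as a fault, and "log data increased" is read as a larger log entry count than in the previous period. All function names are hypothetical.

```python
def basic_data_fault(observed, preset, tolerance=0.10):
    """First rule: data missing, or relative deviation from the preset >= tolerance."""
    if observed is None:
        return True
    if preset == 0:
        # Assumption: any nonzero reading deviates from a zero preset.
        return observed != 0
    return abs(observed - preset) / abs(preset) >= tolerance


def business_data_fault(link_data, business_data):
    """Second rule: fault when link data or system business data is missing (assumed: either)."""
    return not link_data or not business_data


def log_data_fault(current_count, previous_count):
    """Third rule: fault when log data is missing or has grown since the previous period."""
    if current_count is None:
        return True
    return previous_count is not None and current_count > previous_count


# A CPU usage reading of 89 against a preset of 100 deviates by 11 percent.
cpu_fault = basic_data_fault(observed=89.0, preset=100.0)
```

Returning plain booleans keeps each rule easy to combine before the result is handed to the fault location module 404.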
The logic analysis unit 400 further includes the following:
The fault location module 404 obtains a specific location model through neural network training. It receives the fault data contained in the log data in the server unit 200 and trains a fault location neural network model with the fault data as input and the fault type and fault location as output;
After training of the fault location neural network model is complete, the fault location module 404 saves the model to the server unit 200. If the fault analysis and judgment module 403 determines that a fault has occurred, it sends a fault-to-be-located instruction to the fault location module 404. On receiving this instruction, the fault location module 404 sends an instruction to retrieve the fault location neural network model to the server unit 200 through the communication unit 300;
Once the fault location neural network model has been retrieved successfully, the real-time data in the server unit 200 is fed directly into the model as input to obtain the fault type and fault location information, which is then transmitted to the fault warning module 405;
The fault warning module 405 determines the alarm level for the fault type and fault location information and transmits the alarm level to the alarm unit 500.
The alarm unit 500 provides alarm trigger policy configuration and alarm notification policy configuration, and supplies multi-channel alarms for information system monitoring. Alarm trigger policies are configured independently, and trigger conditions include single-indicator triggers and multi-indicator combination triggers. Single-indicator trigger rules include missing-data alarms, threshold alarms, character comparison, trend alarms, state reversal, floating-threshold alarms, and sudden-change alarms. Multi-indicator triggers raise joint alarms across several indicators based on "AND" and "OR" rules. Alarm notification policy configuration covers the alarm resource scope and the notification policy: the resource scope can be set by resource type, resource unit, machine room, group, or resource instance, and the notification policy includes recipients, notification method, notification time, repetition rules, and escalation rules. The unit also supports deduplication, grouping, suppression, silencing, and routing of alarms, and raises alarms based on the analysis and judgment results of the logic analysis unit 400.
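The single-indicator and multi-indicator trigger conditions can be sketched as composable predicates. This is only one possible shape: the indicator names, the limits, and the helper names (`threshold`, `missing`, `all_of`, `any_of`) are hypothetical, and only two of the listed single-indicator rule types are shown.

```python
from typing import Callable, Dict, Optional

Sample = Dict[str, Optional[float]]
Rule = Callable[[Sample], bool]


def threshold(indicator: str, limit: float) -> Rule:
    """Threshold alarm: fires when the indicator is present and exceeds `limit`."""
    def rule(sample: Sample) -> bool:
        value = sample.get(indicator)
        return value is not None and value > limit
    return rule


def missing(indicator: str) -> Rule:
    """Missing-data alarm: fires when the indicator is absent from the sample."""
    def rule(sample: Sample) -> bool:
        return sample.get(indicator) is None
    return rule


def all_of(*rules: Rule) -> Rule:
    """Multi-indicator "AND" combination."""
    return lambda sample: all(rule(sample) for rule in rules)


def any_of(*rules: Rule) -> Rule:
    """Multi-indicator "OR" combination."""
    return lambda sample: any(rule(sample) for rule in rules)


# Hypothetical policies built from the primitives above.
overload = all_of(threshold("cpu_usage", 0.9), threshold("memory_usage", 0.85))
degraded = any_of(missing("latency_ms"), threshold("latency_ms", 500.0))

fired = overload({"cpu_usage": 0.95, "memory_usage": 0.9})
```

Because every rule has the same signature, combined rules can themselves be nested inside further "AND" or "OR" combinations.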
The terminal display and query unit 600 displays the data collected by the system, displays alarm information in conjunction with the alarm unit 500, and queries resources according to user needs.
Each of the above unit modules may be embedded in, or independent of, the processor of a computer device in hardware form, or may be stored in the memory of the computer device in software form so that the processor can call and execute the operations corresponding to each module.
In summary, the present invention proposes a panoramic monitoring system for information system business operation. A panoramic monitoring indicator system for the information system is built based on business requirements, which include detecting anomalies before users perceive them, automatic alarm correlation, and precise fault location; target monitoring data is acquired and preprocessed; and the preprocessed data is combined with the panoramic monitoring indicator system to perform alarm analysis and fault location, completing panoramic monitoring of information system business operation. This achieves anomaly detection ahead of user perception, automatic alarm correlation, and precise fault location.
Embodiment 2
Referring to Figures 1-6, an embodiment of the present invention provides a panoramic monitoring method for information system business operation, including:
building a panoramic monitoring indicator system for the information system based on business requirements, where the business requirements include detecting anomalies before users perceive them, automatic alarm correlation, and precise fault location;
acquiring target monitoring data and preprocessing the target monitoring data;
combining the preprocessed data with the panoramic monitoring indicator system to perform alarm analysis and fault location, completing panoramic monitoring of information system business operation.
In this embodiment of the application, an APM/NPM link tool data integration specification is proposed: an external system uses APM tools, NPM tools, or other link monitoring tools to aggregate APM and NPM data into three categories, namely topology data (topo), link data (trace), and indicator data (metric), and sends them to a message queue through an adapter program; a dump service then dumps the received data.
Topology data (Topo) is the call topology, over a period of time, among the business system's application services, databases, middleware, or external system application services. It includes transaction topology data, service topology data, and network topology data.
It should be noted that link data (Trace) consists of users' specific call requests, APM links, and NPM links, including detailed data such as each link's start point, end point, and related call stack information.
It should be noted that indicator data (Metric) is the data obtained by aggregating, computing, and summarizing data such as services, service instances, and call-chain traces.
It should be noted that APM/NPM external probes are managed in a unified way: to bring the APM and NPM probes of various vendors under unified management, a unified probe management support service is provided, and each probe vendor must develop and implement the corresponding adaptation functions in accordance with the unified management requirements.
In this embodiment of the application, a southbound interface service specification is proposed: the southbound interface receives and aggregates external data such as information system business data, system integration interfaces, or monitoring indicators, and data transmission is provided through API interfaces and message queues. Real-time data processing scenarios use API interfaces based on the HTTP protocol and following RESTful conventions; batch data reception and subscription scenarios use a message service and support common message queue middleware such as Kafka and RabbitMQ.
In this embodiment of the application, a panoramic monitoring indicator system for the information system is established. It covers physical device indicators (host devices, network devices, security devices, storage devices, machine room equipment, special-purpose devices, and so on), base platform indicators (cloud platform, data middle platform, VMware, OpenStack, Kubernetes, on-cloud resources, off-cloud databases, off-cloud middleware, and so on), information system indicators (business, services, service links, service topology, and so on), and log data. It standardizes the objects to be collected and accessed, the indicators collected, the collection methods, and the collection frequency.
It should be noted that a corresponding indicator library and collection methods are established according to the monitoring indicator system. Collection falls into three modes: Agent collection, protocol collection, and third-party data access.
It should be noted that Agent collection includes plug-in collection, script collection, probes, and so on. It supports pluggable, plug-in-based collection and the integration of third-party collection plug-ins, so that the collection scope and collection methods can be extended.
It should be noted that protocol collection covers data collection over common protocols such as SNMP, SSH, WMI, RESTful, JMX, SMI-S, IPMI, HTTP, HTTPS, and JDBC.
Furthermore, third-party data access is supported: data collected by third parties can be brought in through the southbound interface and the APM and NPM interface specifications.
In this embodiment of the application, the business link model and the resource model are also maintained and managed in a unified way based on the CMDB. The business link model includes business scenarios, business activities, and service endpoints; information system business scenarios are managed through the business link model and its associations. Combined with resource models such as information systems, platform components, and infrastructure, together with their associations, this ultimately forms a full-scenario data model of nodes, links, and system integrations at every level, from the infrastructure layer, platform layer, and service layer up to the application layer, that supports panoramic monitoring of the information system.
In this embodiment of the application, centralized alarm management is also provided: alarm trigger policy configuration and alarm notification policy configuration are provided, and multi-channel alarms are supplied for information system monitoring. Alarm trigger policies can be configured independently for each resource model, and trigger conditions include single-indicator triggers and multi-indicator combination triggers. Single-indicator trigger rules include missing-data alarms, threshold alarms, character comparison, trend alarms, state reversal, floating-threshold alarms, and sudden-change alarms. Multi-indicator triggers can raise joint alarms across several indicators based on "AND" and "OR" rules. Alarm notification rules cover the alarm resource scope and the notification policy: the resource scope can be set by resource type, resource unit, machine room, group, resource instance, and so on, and the notification policy includes recipients, notification method, notification time, repetition rules, escalation rules, and so on. Deduplication, grouping, suppression, silencing, and routing of alarms are also supported. Alarm causes are analyzed through an association rule engine so that faults can be located quickly and precisely, interference from invalid alarms is reduced, and alarm storms are effectively withstood.
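Two of the alarm handling functions listed above, deduplication and silencing, can be sketched as a small filter placed in front of notification delivery. The class name `AlarmFilter`, the fingerprint of resource plus rule, and the 300-second window are assumptions chosen for illustration; the application does not specify how these functions are implemented.

```python
class AlarmFilter:
    """Drops alarms for silenced resources and repeats inside a time window."""

    def __init__(self, dedup_window=300.0):
        self.dedup_window = dedup_window  # seconds; hypothetical default
        self._last_sent = {}              # (resource, rule) -> last forwarded timestamp
        self._silenced = set()            # resources whose alarms are muted

    def silence(self, resource):
        """Mute every alarm raised for `resource`."""
        self._silenced.add(resource)

    def accept(self, resource, rule, timestamp):
        """Return True if the alarm should be forwarded to notification."""
        if resource in self._silenced:
            return False
        key = (resource, rule)
        last = self._last_sent.get(key)
        if last is not None and timestamp - last < self.dedup_window:
            return False  # duplicate of an alarm forwarded recently
        self._last_sent[key] = timestamp
        return True


alarm_filter = AlarmFilter(dedup_window=300.0)
alarm_filter.silence("web-02")
decisions = [
    alarm_filter.accept("db-01", "cpu_high", 0.0),    # first occurrence
    alarm_filter.accept("db-01", "cpu_high", 60.0),   # repeat inside the window
    alarm_filter.accept("db-01", "cpu_high", 400.0),  # window has elapsed
    alarm_filter.accept("web-02", "cpu_high", 0.0),   # silenced resource
]
```

Filtering before notification is one way to limit the volume of messages during an alarm storm.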
In this embodiment of the application, business scenario orchestration is also provided. Information systems have numerous business scenarios that are adjusted from time to time, and the business scenario orchestration function addresses this need. Business scenario orchestration means that operations and maintenance personnel, working from the business scenario materials compiled by the business departments, create business scenarios, orchestrate processes, and associate custom indicators, which keeps the business links up to date. It includes scenario classification maintenance, business scenario maintenance, business flow link maintenance, and so on. The result is full-link system monitoring that is oriented toward business scenarios and expressed in business language.
In this embodiment of the application, indicator dashboard orchestration is also provided. Business indicators and display layouts need to be adjusted dynamically for different periods and different scenarios, and the indicator dashboard orchestration function addresses this need. Indicator dashboard orchestration is the work by which staff of a business department, based on the business operation monitoring situation compiled by that department, arrange the business indicators to be shown on the business monitoring dashboard and the form in which they are presented. It includes business scenario selection, business indicator maintenance, and dashboard and layout arrangement.
In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in Figure 6. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected through a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment in which the operating system and the computer program in the non-volatile storage medium run. The communication interface is used for wired or wireless communication with external terminals; wireless communication may be implemented through Wi-Fi, a carrier network, NFC (near field communication), or other technologies. When executed by the processor, the computer program implements a panoramic monitoring method for information system business operation. The display screen may be a liquid crystal display or an electronic ink display. The input device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
In one embodiment, a computer-readable storage medium is provided on which a computer program is stored. When executed by a processor, the computer program implements the following steps:
building a panoramic monitoring indicator system for the information system based on business requirements, where the business requirements include detecting anomalies before users perceive them, automatic alarm correlation, and precise fault location;
acquiring target monitoring data and preprocessing the target monitoring data;
combining the preprocessed data with the panoramic monitoring indicator system to perform alarm analysis and fault location, completing panoramic monitoring of information system business operation.
In summary, the present invention proposes a panoramic monitoring method for information system business operation. A panoramic monitoring indicator system for the information system is built based on business requirements, which include detecting anomalies before users perceive them, automatic alarm correlation, and precise fault location; target monitoring data is acquired and preprocessed; and the preprocessed data is combined with the panoramic monitoring indicator system to perform alarm analysis and fault location, completing panoramic monitoring of information system business operation. This achieves anomaly detection ahead of user perception, automatic alarm correlation, and precise fault location.
Embodiment 3
Referring to Figures 1-6, an embodiment of the present invention provides a panoramic monitoring method and system for information system business operation, including:
As shown in Figure 5, the specific steps for carrying out panoramic monitoring of an information system are as follows:
First, compile the information related to the information system: infrastructure, platform components, business scenarios, business activities, service endpoints, the physical topology diagram, the microservice list, system integration relationships, and so on. Some of this information can be obtained through automatic discovery, for example network-layer topology discovery, automatic discovery of databases and middleware on hosts, and automatic discovery of microservices and service endpoints. Infrastructure includes resources such as host servers, network devices, security devices, and storage devices. Platform components include resources such as databases and middleware. Compiling business scenarios, business activities, and service endpoints mainly involves recording the name and description of each business scenario and identifying the business indicators to be monitored in each scenario, the business activities that make up the scenario, the root service endpoint of each business activity, and the business call links that support business operation; the outputs are a business scenario list, a business monitoring indicator list, business flow charts, and business call chains. The physical topology diagram describes the topological relationships among the resources on which the information system is deployed. The microservice list records the services deployed by the information system and their associations with business scenarios and business activities. System integration relationships describe how different business scenarios integrate with external systems, including key information such as the business scenarios and business activities associated with each external system, the integration method, the integration direction, and the integration content.
Based on the compiled resources and associations of the information system, the relevant models are maintained in the system; see the resource configuration component in Figure 4. The main resource models include information systems, physical machines, storage devices, network devices, security devices, virtual machines, cloud components, databases, middleware, business scenarios, business activities, service endpoints, and related sub-models. The topological relationships among resources are maintained as well, and dynamic relationships such as service call links are maintained automatically through collection.
Access to information system monitoring data relies on the collection control component; see Figure 4. The component provides collection task configuration, collection task scheduling and execution, and return of collection results. Besides direct agent-based and agentless collection through its own unified Agent, it can also bring in data from third-party tools through the southbound interface. The monitoring data collected includes basic monitoring data (indicators such as port rate, CPU usage, memory usage, number of sessions, number of new connections, packet loss ratio, and latency), business link data (information such as call request success rate, average call response, calls per minute, topology data, the trace call list, and span monitoring details), system business data (data such as the number of registered users, online users, active users, and new users, the number of visits, business volume, the distribution of business operation stages, the number of business accesses, and the number of business objects), and log data.
For access to application and network link data from tools such as APM and NPM, see Figure 2. There are two data access methods: for general-purpose APM and NPM tools, a system-side adaptation service actively and periodically pulls the business link data; for specific link tools, the tool vendor develops an adaptation service according to the specification, and that service periodically pushes the business link data. All incoming data is sent to the message service queue, and the dump service then forwards it to the monitoring alarm service and the resource configuration service for alarm monitoring and data model updates.
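The flow from adaptation service to message queue to dump service can be sketched in a single process using the standard library `queue` module in place of a real message broker. The function names `adapter_push` and `dump_service`, and the two list-based consumers standing in for the monitoring alarm service and the resource configuration service, are hypothetical.

```python
import queue

# In-process stand-in for the message service queue.
message_queue = queue.Queue()


def adapter_push(record):
    """Adaptation service side: publish one link or indicator record."""
    message_queue.put(record)


def dump_service(handlers):
    """Drain the queue and forward each record to every downstream handler.

    Returns the number of records forwarded.
    """
    forwarded = 0
    while True:
        try:
            record = message_queue.get_nowait()
        except queue.Empty:
            return forwarded
        for handler in handlers:
            handler(record)
        forwarded += 1


# Hypothetical consumers: alarm monitoring and data model updates.
alarm_inbox, model_inbox = [], []
adapter_push({"kind": "trace", "link": "gateway -> order-service"})
adapter_push({"kind": "metric", "name": "call_success_rate", "value": 0.97})
count = dump_service([alarm_inbox.append, model_inbox.append])
```

Placing a queue between the adapters and the consumers means the pull-based and push-based access methods look identical to the downstream services.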
To bring the APM and NPM probes of various vendors under unified management in the system, a unified probe management support service is provided on the system side, and each probe vendor must develop and implement the corresponding adaptation functions in accordance with the unified management requirements; see Figure 3. On the system side, the following interface services are provided to each vendor's probe management adaptation service: a service for registering the probe management adaptation service (the adaptation service uses it to register its own information with the system, which supplies the push address used when probe operation policies are delivered to the vendor); a service for registering deployed probes (the adaptation service uses it to register deployed probe instances with the system); a service for obtaining probe operation policies (the adaptation service uses it to fetch a probe's operation policy from the system); and a service for receiving probe operating status data (the adaptation service for a probe uses it to submit self-monitoring data to the system). On the vendor side, the following interface service must be provided: a service that receives the probe operation policies delivered by the system.
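The four system-side interface services and the vendor-side policy delivery can be modelled as an in-memory registry. This is a rough sketch only: the class name, method names, and data shapes are assumptions, and a real implementation would expose these operations over a network interface rather than as direct method calls.

```python
class ProbeManagementService:
    """In-memory stand-in for the system-side unified probe management service."""

    def __init__(self):
        self.adapters = {}  # vendor -> push address for policy delivery
        self.probes = {}    # probe id -> vendor
        self.policies = {}  # vendor -> operation policy
        self.status = {}    # probe id -> latest self-monitoring data

    def register_adapter(self, vendor, push_address):
        """Interface 1: a vendor adaptation service registers itself."""
        self.adapters[vendor] = push_address

    def register_probe(self, vendor, probe_id):
        """Interface 2: the adaptation service registers a deployed probe."""
        if vendor not in self.adapters:
            raise KeyError(f"adapter for {vendor!r} is not registered")
        self.probes[probe_id] = vendor

    def get_policy(self, vendor):
        """Interface 3: the adaptation service fetches its operation policy."""
        return self.policies.get(vendor, {})

    def report_status(self, probe_id, data):
        """Interface 4: the adaptation service submits self-monitoring data."""
        if probe_id not in self.probes:
            raise KeyError(f"probe {probe_id!r} is not registered")
        self.status[probe_id] = data

    def deliver_policy(self, vendor, policy):
        """Store a policy and return the vendor-side address it would be pushed to."""
        self.policies[vendor] = policy
        return self.adapters[vendor]


service = ProbeManagementService()
service.register_adapter("vendor-a", "http://adapter.example/policy")
service.register_probe("vendor-a", "probe-1")
target = service.deliver_policy("vendor-a", {"sample_interval_s": 60})
service.report_status("probe-1", {"alive": True})
```

Requiring the adaptation service to register before its probes mirrors the order in which the interfaces are listed above.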
The collected monitoring data is processed by the monitoring management component and the data analysis component, which generate alarm information and use the data analysis capabilities for alarm correlation analysis and fault location; see Figure 4. The incoming monitoring data is processed using Flink stream computing, and the alarm rules determine whether an alarm is generated. Alarm information is deduplicated and denoised, and alarm notifications are then sent according to the alarm notification rules.
Finally, the panoramic monitoring information of the information system is presented visually through views such as comprehensive business monitoring and panoramic operation monitoring. Based on the business scenario information actually compiled, the scenario orchestration and indicator dashboard orchestration functions provided by the system are used to configure the relevant business scenario monitoring views and indicator dashboards dynamically, meeting the business monitoring needs of different systems.
应说明的是,以上实施例仅用以说明本发明的技术方案而非限制,尽管参照较佳实施例对本发明进行了详细说明,本领域的普通技术人员应当理解,可以对本发明的技术方案进行修改或者等同替换,而不脱离本发明技术方案的精神和范围,其均应涵盖在本发明的权利要求范围当中。It should be noted that the above embodiments are only used to illustrate the technical solution of the present invention rather than to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention can be carried out. Modifications or equivalent substitutions without departing from the spirit and scope of the technical solution of the present invention shall be included in the scope of the claims of the present invention.
Those skilled in the art will understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code. The solutions in the embodiments of the present application may be implemented in various computer languages, for example the object-oriented programming language Java and the interpreted scripting language JavaScript.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present application.
Obviously, those skilled in the art can make various changes and variations to the present application without departing from its spirit and scope. Thus, if these changes and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to include them.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311191727.8A CN117370053A (en) | 2023-09-14 | 2023-09-14 | Information system service operation-oriented panoramic monitoring method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311191727.8A CN117370053A (en) | 2023-09-14 | 2023-09-14 | Information system service operation-oriented panoramic monitoring method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117370053A (en) | 2024-01-09 |
Family
ID=89401222
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311191727.8A Pending CN117370053A (en) | 2023-09-14 | 2023-09-14 | Information system service operation-oriented panoramic monitoring method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117370053A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118055427A (en) * | 2024-04-16 | 2024-05-17 | 中国电信股份有限公司浙江分公司 | Method and device for automatic network optimization of private network base station |
- 2023-09-14: CN application CN202311191727.8A (CN117370053A), active, Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109586999B (en) | A container cloud platform state monitoring and early warning system, method and electronic device | |
CN103236948B (en) | A kind of telecommunications network alarm method and system | |
CN112311617A (en) | A configuration data monitoring and alarming method and system | |
CN114090366B (en) | Method, device and system for monitoring data | |
CN110232010A (en) | A kind of alarm method, alarm server and monitoring server | |
CN112214382A (en) | Alarm method and device | |
CN108989136A (en) | Business end to end performance monitoring method and device | |
CN109947616A (en) | An automatic monitoring operation and maintenance system of cloud operating system based on OpenStack technology | |
CN112636942B (en) | Method and device for monitoring service host node | |
US9577900B1 (en) | Application centric network experience monitoring | |
WO2023134285A1 (en) | Risk management method and risk management apparatus | |
CN111124609A (en) | Data acquisition method and device, data acquisition equipment and storage medium | |
CN107222346A (en) | A kind of clustered node health status method for early warning and system | |
CN107094086A (en) | A kind of information acquisition method and device | |
CN116302826A (en) | Intelligent operation and maintenance monitoring platform, method, storage medium and electronic equipment | |
CN117370053A (en) | Information system service operation-oriented panoramic monitoring method and system | |
WO2023273461A1 (en) | Robot operating state monitoring system, and method | |
CN109615218A (en) | Nuclear power information system performance monitoring system and method | |
CN114490237A (en) | Operation and maintenance monitoring method and device based on multiple data sources | |
CN118827393A (en) | eBPF-based application observation link topology construction method and related equipment | |
CN117424843A (en) | Management method, management device and ATE test system | |
CN116701106A (en) | System monitoring method, device, equipment and storage medium | |
CN115762090A (en) | Financial-level system intelligent monitoring and early warning method and system based on convolutional neural network | |
CN109120439B (en) | Distributed cluster alarm output method, apparatus, device and readable storage medium | |
CN114625763A (en) | Information analysis method and device for database, electronic equipment and readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||