WO2019109961A1 - Fault Diagnosis Method and Apparatus - Google Patents
- Publication number
- WO2019109961A1 (PCT/CN2018/119426)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- resource
- fault
- vnf service
- vnf
- service
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/40—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using virtualisation of network functions or resources, e.g. SDN or NFV entities
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0604—Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time
- H04L41/0613—Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time based on the type or category of the network elements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/50—Network service management, e.g. ensuring proper service fulfilment according to agreements
- H04L41/5003—Managing SLA; Interaction between SLA and QoS
- H04L41/5009—Determining service level performance parameters or violations of service level contracts, e.g. violations of agreed response time or mean time between failures [MTBF]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/50—Network service management, e.g. ensuring proper service fulfilment according to agreements
- H04L41/5003—Managing SLA; Interaction between SLA and QoS
- H04L41/5009—Determining service level performance parameters or violations of service level contracts, e.g. violations of agreed response time or mean time between failures [MTBF]
- H04L41/5012—Determining service level performance parameters or violations of service level contracts, e.g. violations of agreed response time or mean time between failures [MTBF] determining service availability, e.g. which services are available at a certain point in time
- H04L41/5016—Determining service level performance parameters or violations of service level contracts, e.g. violations of agreed response time or mean time between failures [MTBF] determining service availability, e.g. which services are available at a certain point in time based on statistics of service availability, e.g. in percentage or over a given time
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
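The embodiments of this application relate to a fault diagnosis method and apparatus. The method specifically includes: determining that a virtual network function (VNF) service fault has occurred; obtaining a diagnosis rule for the VNF service fault; and matching resource association data associated with the VNF service fault against the diagnosis rule to determine the cause of the VNF service fault. In this solution, matching the resource association data associated with the VNF service fault against the diagnosis rule establishes the relationship between NFV service-layer faults and NFVI underlying-resource faults, so that faults at the NFV service layer can be quickly located and handled.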
Description
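This application claims priority to Chinese Patent Application No. 201711297407.5, filed with the China National Intellectual Property Administration on December 8, 2017 and entitled "Fault Diagnosis Method and Apparatus", which is incorporated herein by reference in its entirety.
This application relates to the field of communication networks, and in particular, to a method and apparatus for fault diagnosis in a network.
To improve the deployment flexibility of communication networks and reduce operating costs, network functions virtualization (NFV) technology has developed rapidly. With NFV, network functions are performed by virtualized devices and are decoupled from specific hardware. NFV has become a major driver for cloud service providers (CSPs), and fault diagnosis for NFV has accordingly become a key concern as NFV is deployed commercially at scale.
Currently, NFV fault diagnosis is performed only within a virtual network function (VNF) or within the underlying network functions virtualization infrastructure (NFVI) layer. For example, when a fault occurs at the NFVI layer and an alarm is raised, the alarm is sent through the virtualised infrastructure manager (VIM) interface to the VNF manager (VNFM), which then notifies the element management system (EMS).
However, the prior art cannot determine the relationship between NFV service-layer faults and NFVI underlying-resource faults, and therefore cannot quickly locate and handle faults at the NFV service layer.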
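Summary
The embodiments of the present invention provide a diagnosis method and apparatus that quickly locate and handle faults at the NFV service layer by determining the relationship between NFV service-layer faults and NFVI underlying-resource faults.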
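According to a first aspect, an embodiment of this application provides a diagnosis method. The method specifically includes: determining that a virtual network function (VNF) service fault has occurred; obtaining a diagnosis rule for the VNF service fault; and matching resource association data associated with the VNF service fault against the diagnosis rule to determine the cause of the VNF service fault.
In this solution, matching the resource association data associated with the VNF service fault against the diagnosis rule establishes the relationship between NFV service-layer faults and NFVI underlying-resource faults, so that NFV service-layer faults can be quickly located and handled.
In an optional implementation, "obtaining a diagnosis rule for the VNF service fault" may include: calculating the diagnosis rule for the VNF service fault by using a first algorithm based on historically recorded resource association data associated with the VNF service fault.
In another optional implementation, the "first algorithm" may include a frequent itemset mining algorithm.
In still another optional implementation, the "resource association data" may include at least one of the following: resource key performance indicator (KPI) statistics, resource alarm information, and resource log information.
In yet another optional implementation, the "resource KPI statistics" may include at least one of the following values of the resource KPI sampling data within a statistical period: the cumulative sum, the average value, the maximum value, and the real-time value.
In yet another optional implementation, the "resource association data" may be obtained by periodic polling or by subscription.
In yet another optional implementation, "determining that the VNF service fault has occurred" may include: detecting the VNF service fault by using a dynamic threshold or static threshold method.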
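According to a second aspect, an embodiment of this application provides a diagnosis apparatus. The apparatus specifically includes: a processing module, configured to determine that a virtual network function (VNF) service fault has occurred; and a communication module, configured to obtain a diagnosis rule for the VNF service fault. The processing module is further configured to match resource association data associated with the VNF service fault against the diagnosis rule to determine the cause of the VNF service fault.
In this solution, matching the resource association data associated with the VNF service fault against the diagnosis rule establishes the relationship between NFV service-layer faults and NFVI underlying-resource faults, so that NFV service-layer faults can be quickly located and handled.
The processing module is specifically configured to detect the VNF service fault by using a dynamic threshold or static threshold method. The communication module is specifically configured to calculate the diagnosis rule for the VNF service fault by using a first algorithm based on historically recorded resource association data associated with the VNF service fault.
It should be noted that the modules in the second aspect can implement the functions performed in the method design of the first aspect. The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the foregoing functions. Details are not repeated here.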
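According to a third aspect, an embodiment of the present invention provides a computer storage medium storing instructions that, when run on a computer, cause the computer to perform the method according to the first aspect or any possible design of the first aspect.
According to a fourth aspect, an embodiment of the present invention provides a computer storage medium storing instructions that, when run on a computer, cause the computer to perform the method according to the second aspect or any possible design of the second aspect.
According to a fifth aspect, an embodiment of the present invention provides a computer program product containing instructions that, when the program is executed by a computer, cause the computer to perform the method according to the first aspect or any possible design of the first aspect.
According to a sixth aspect, an embodiment of the present invention provides a computer program product containing instructions that, when the program is executed by a computer, cause the computer to perform the method according to the second aspect or any possible design of the second aspect.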
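FIG. 1 is a schematic diagram of a system architecture for fault diagnosis according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a method for obtaining resource data by subscription according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a method for obtaining resource data by polling according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a method for generating association data for a service fault according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a method for diagnosing a service fault according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a diagnosis apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of another diagnosis apparatus according to an embodiment of the present invention.
To facilitate understanding of the present invention, it is further explained below with reference to the accompanying drawings and specific embodiments.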
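FIG. 1 is a schematic diagram of a system architecture for fault diagnosis according to an embodiment of the present invention. As shown in FIG. 1, the system mainly includes a VNF service layer and an NFVI resource layer. A VNF implements a service function, and each VNF also corresponds to an element management system (EMS) that manages the VNF. The NFVI layer includes hardware resources and virtual resources. The hardware resources are at the bottom layer and may include computing hardware, storage hardware, network hardware, and other resources. The virtual resources are built on top of the hardware resources and include virtual computing resources, virtual storage resources, virtual network resources, and the like, which together form a virtual resource pool. The VNF service layer and the NFVI resource layer each have their own management system: the VNF manager (VNFM) and the virtualised infrastructure manager (VIM), respectively.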
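In addition, the system shown in FIG. 1 includes a database (DB) for storing the data required for fault diagnosis. The VNF and EMS at the service layer can access the database directly or through the VNFM. Data from the NFVI resource layer, such as KPI data and alarm information, can be reported to the VNFM through the VIM and stored in the database.
Based on the foregoing system, an embodiment of the present invention provides a fault diagnosis method, which mainly includes obtaining source data, generating association data, and diagnosing service faults. The method may be performed by the VNF or by the EMS; the following embodiments use the VNF as an example.
There may be multiple data sources, for example, NFVI resource alarm information, NFVI resource KPI data, and NFVI logs, and multiple ways to obtain them, for example, by subscription or by polling.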
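FIG. 2 is a schematic diagram of a method for obtaining resource data by subscription according to an embodiment of the present invention. As shown in FIG. 2, the method specifically includes the following steps:
S201: The VNF service subscribes, through the VNFM, to resource alarm information in the NFVI layer.
The VNF requests a subscription to NFVI-layer alarms. The VNF sends a subscription request message to the VNFM, where the parameters of the subscription request message include a VNF identifier and an alarm identifier. After receiving the subscription message, the VNFM subscribes to the alarm information from the NFVI.
While running, the underlying resources generate alarm information, for example, excessive CPU usage, insufficient memory, or network congestion. After the VNF has subscribed to the alarm information, any alarm generated by the NFVI layer during operation is sent to the VNFM, which then forwards it to the service layer.
S202: The VNF service receives an NFVI resource alarm message.
When a resource in the NFVI raises a fault alarm, the VNFM receives the resource alarm message sent by the NFVI layer and forwards it to the VNF service that subscribed to the alarm information. The resource alarm message contains related information such as the resource identifier, the alarm identifier, and the identifier of the subscribing VNF.
S203: The VNF service stores the received alarm information in the database.
The VNF service stores the received resource alarm information in the database. The stored resource alarm information includes fields such as alarm time, resource identifier, alarm identifier, and alarm name.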
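The subscribe-and-forward flow of S201 to S203 can be sketched as follows. This is a minimal in-process illustration under stated assumptions, not an ETSI-conformant interface: `AlarmBroker` is a hypothetical stand-in for the VNFM, callbacks replace real notification messages, a Python list stands in for the database, and the alarm identifiers `CPU_HIGH` and `MEM_LOW` are invented examples.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable


@dataclass(frozen=True)
class ResourceAlarm:
    # Fields mirror the stored alarm record described in S203.
    alarm_time: datetime
    resource_id: str
    alarm_id: str
    alarm_name: str


class AlarmBroker:
    """Hypothetical stand-in for the VNFM relaying NFVI alarms to subscribed VNFs."""

    def __init__(self) -> None:
        # alarm_id -> [(vnf_id, callback), ...]
        self._subscriptions: dict[str, list[tuple[str, Callable[[ResourceAlarm], None]]]] = {}

    def subscribe(self, vnf_id: str, alarm_id: str,
                  callback: Callable[[ResourceAlarm], None]) -> None:
        # S201: the subscription request carries a VNF identifier and an alarm identifier.
        self._subscriptions.setdefault(alarm_id, []).append((vnf_id, callback))

    def on_nfvi_alarm(self, alarm: ResourceAlarm) -> list[str]:
        # S202: forward the alarm to every VNF subscribed to this alarm identifier
        # and return the identifiers of the VNFs that were notified.
        notified = []
        for vnf_id, callback in self._subscriptions.get(alarm.alarm_id, []):
            callback(alarm)
            notified.append(vnf_id)
        return notified


# S203: a plain list stands in for the database alarm table.
alarm_table: list[ResourceAlarm] = []
broker = AlarmBroker()
broker.subscribe("vnf-1", "CPU_HIGH", alarm_table.append)
notified = broker.on_nfvi_alarm(
    ResourceAlarm(datetime(2017, 12, 8, 10, 0), "vm-7", "CPU_HIGH", "CPU usage too high")
)
```

Alarms whose identifier has no subscriber are simply not forwarded, which keeps the VNF from storing data it never asked for.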
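FIG. 3 is a schematic diagram of a method for obtaining resource data by polling according to an embodiment of the present invention. As shown in FIG. 3, the method may specifically include the following steps:
S301: The VNF service generates a KPI sampling task for NFVI-layer resources.
The service layer usually obtains resource-layer KPI data by periodic sampling; for example, the sampling period may be 10 seconds or 1 minute.
S302: The VNF service obtains related resource information.
After the task is generated, it is executed at each sampling time. Specifically, the VNF service layer obtains information about the virtual and physical resources on which the service runs, such as the virtual machine (VM) and the host.
S303: The VNF service requests sampled KPI data for resources in the NFVI layer.
The service layer sends a sampling request message to the VNFM; after receiving it, the VNFM requests the relevant data from the NFVI. The sampling request message includes the VNF identifier, VM information, host information, and the KPI identifier. The NFVI resources may have many KPIs, for example, the current network speed, the volume of disk data accessed, and the traffic processed on a VM during the previous period.
S304: The VNF service generates KPI statistics from the sampled resource KPI data.
After receiving the sampled KPI data, the service layer calculates KPI statistics over a statistical period. The statistical period may be N times the sampling period; for example, with a sampling period of 10 seconds and N = 6, the statistical period is 1 minute.
Various statistics may be computed, for example:
Cumulative sum (SUM): the accumulated value of the resource KPI data sampled during the statistical period.
Average (AVG): the mean of the resource KPI data sampled during the statistical period (the cumulative sum of the sampled values divided by the number of samples).
Maximum (MAX): the largest resource KPI sample during the statistical period.
Real-time value (REAL): the last resource KPI sample in the statistical period.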
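The four statistics, and the N-times-sampling-period relationship from S304, can be sketched as below. The names `KpiStatistics`, `kpi_statistics`, and `split_into_periods` are hypothetical, and dropping an incomplete trailing period is an assumed design choice rather than something the text specifies.

```python
from dataclasses import dataclass

SAMPLING_PERIOD_S = 10                         # e.g. 10-second sampling
N = 6                                          # samples per statistical period
STATISTICAL_PERIOD_S = SAMPLING_PERIOD_S * N   # 60 s, i.e. 1 minute


@dataclass(frozen=True)
class KpiStatistics:
    total: float     # SUM: cumulative sum of the samples in the period
    average: float   # AVG: SUM divided by the number of samples
    maximum: float   # MAX: largest sample in the period
    realtime: float  # REAL: last sample in the period


def kpi_statistics(samples: list[float]) -> KpiStatistics:
    if not samples:
        raise ValueError("a statistical period needs at least one sample")
    total = sum(samples)
    return KpiStatistics(total, total / len(samples), max(samples), samples[-1])


def split_into_periods(samples: list[float], n: int = N) -> list[list[float]]:
    # Group consecutive samples into complete periods of n samples; an
    # incomplete trailing period is left out until it fills up.
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]


cpu_samples = [10.0, 40.0, 30.0, 20.0, 50.0, 30.0, 60.0]
periods = split_into_periods(cpu_samples)
stats = kpi_statistics(periods[0])
```

Here the seventh sample starts a new period that is not yet complete, so only the first six samples are summarized (SUM 180, AVG 30, MAX 50, REAL 30).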
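S305: The VNF service stores the resource KPI statistics in the database.
The database can store the resource KPI statistics. The stored resource KPI statistics table may include fields such as the statistical period, KPI identifier, KPI name, and KPI statistics.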
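One possible way to persist the statistics of S305 is a relational table, sketched here with Python's built-in `sqlite3` module. The table and column names are hypothetical, and storing SUM, AVG, MAX, and REAL as separate columns (one row per KPI per period) is an assumption; the text only lists the statistical period, KPI identifier, KPI name, and KPI statistics as fields.

```python
import sqlite3

# Hypothetical schema; column names and the per-statistic columns are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS resource_kpi_stats (
    stat_period TEXT NOT NULL,  -- start of the statistical period, ISO 8601
    kpi_id      TEXT NOT NULL,
    kpi_name    TEXT NOT NULL,
    stat_sum    REAL,
    stat_avg    REAL,
    stat_max    REAL,
    stat_real   REAL,
    PRIMARY KEY (stat_period, kpi_id)
);
"""


def store_kpi_stats(conn: sqlite3.Connection, period: str, kpi_id: str,
                    kpi_name: str, stats: dict[str, float]) -> None:
    # S305: one row per KPI per statistical period.
    conn.execute(
        "INSERT INTO resource_kpi_stats VALUES (?, ?, ?, ?, ?, ?, ?)",
        (period, kpi_id, kpi_name,
         stats["sum"], stats["avg"], stats["max"], stats["real"]),
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
store_kpi_stats(conn, "2017-12-08T10:00:00", "vm_cpu_usage", "VM CPU usage (%)",
                {"sum": 180.0, "avg": 30.0, "max": 50.0, "real": 30.0})
rows = conn.execute("SELECT kpi_id, stat_avg, stat_max FROM resource_kpi_stats").fetchall()
```

The composite primary key rejects a duplicate row for the same KPI and period, which guards against a polling task accidentally storing one period twice.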
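Through the subscription and periodic polling methods described above, the VNF service continuously obtains alarm information and KPI statistics for the underlying resources, providing a diagnostic data source for subsequent service faults.
FIG. 4 is a schematic diagram of a method for generating association data for a service fault according to an embodiment of the present invention. As shown in FIG. 4, after a service fault is detected, association data linking the fault to the related resource faults and resource alarms is established. The method specifically includes the following steps:
S401: The VNF service determines that the service data of the running service indicates a fault.
The VNF service evaluates the service data of the running service by using a dynamic threshold or static threshold method. If the service data exceeds the dynamic or static threshold, the VNF service is determined to be faulty.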
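Static and dynamic thresholds for S401 might look like the following sketch. The text does not define the dynamic threshold; a band of `k` standard deviations around the mean of recent history is assumed here for illustration. The band is two-sided so that a sharp drop (like the 4G-traffic example in S505) is detected as well as a spike; `k = 3.0` and the sample traffic values are hypothetical.

```python
from statistics import mean, stdev


def exceeds_static_threshold(value: float, threshold: float) -> bool:
    # Static threshold: a fixed, operator-configured limit.
    return value > threshold


def outside_dynamic_threshold(value: float, history: list[float], k: float = 3.0) -> bool:
    # Dynamic threshold (assumed scheme): a band of k sample standard deviations
    # around the mean of recent history, so the limit follows the normal level.
    if len(history) < 2:
        return False  # too little history to estimate a baseline
    mu, sigma = mean(history), stdev(history)
    return abs(value - mu) > k * sigma


traffic_history = [100.0, 102.0, 98.0, 101.0, 99.0]  # e.g. per-minute 4G traffic
```

With this history the band is roughly 100 ± 4.7, so a reading of 40 is flagged while 103 is not.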
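S402: The VNF service reads the NFVI resource KPI statistics stored in the database.
Based on the detected service fault, the VNF service obtains from the database the NFVI resource KPI statistics within the association duration. The association duration is the length of time over which data may be associated with the service fault; for example, it may be configured as several minutes or tens of minutes. In this way, the database need not retain all data thereafter; it only needs to retain resource data that may be related to service faults. Different types of resource data, such as resource alarm information and resource KPI statistics, can be configured with different association durations.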
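Selecting data inside the association duration (S402 and S404) amounts to a time-window filter. The sketch below assumes the window ends at the fault time and looks back by a per-data-type duration; the 30-minute and 10-minute values, the dictionary keys, and the record layout are illustrative assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical per-type association durations; the text only says they are
# configurable (several minutes to tens of minutes) and may differ by data type.
ASSOCIATION_DURATION = {
    "resource_alarm": timedelta(minutes=30),
    "resource_kpi_stats": timedelta(minutes=10),
}


def within_association_window(records: list[dict], fault_time: datetime,
                              data_type: str) -> list[dict]:
    # Keep records in [fault_time - duration, fault_time]; the look-back-only
    # window is an assumption.
    start = fault_time - ASSOCIATION_DURATION[data_type]
    return [r for r in records if start <= r["time"] <= fault_time]


fault_time = datetime(2017, 12, 8, 10, 30)
records = [
    {"time": datetime(2017, 12, 8, 10, 25), "id": "a"},
    {"time": datetime(2017, 12, 8, 10, 5), "id": "b"},
    {"time": datetime(2017, 12, 8, 10, 31), "id": "c"},
]
```

Record "b" falls inside the 30-minute alarm window but outside the 10-minute KPI window, and "c" arrives after the fault, so it is excluded from both.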
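S403: The VNF service identifies resource KPI faults.
The VNF service determines which resource KPI statistics are abnormal by using a dynamic threshold, a static threshold, or a similar method, for example, CPU usage exceeding its threshold.
S404: The VNF service reads the resource alarm information stored in the database.
Similarly, based on the detected service fault, the VNF service obtains from the database the resource alarm information table within the association duration.
S405: The VNF service determines the association data of the service fault.
The VNF service uses the abnormal resource KPI statistics determined in step S403 and the resource alarm information obtained in step S404 as the association data of the VNF service fault.
S406: The VNF service stores the association data of the service fault in the database.
The VNF service stores the association data in the service fault association table in the database. The association table may include the service fault time, service fault identifier, associated resource KPI statistics, associated resource alarm information, and the like.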
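Steps S403 to S406 can be sketched as building one row of the service fault association table. The class, field, and item-prefix names below are hypothetical; only static thresholds are used for the KPI check to keep the example short, and the alarms passed in are assumed to be already filtered to the association window.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ServiceFaultAssociation:
    # One row of the service fault association table (S406); names are illustrative.
    fault_time: datetime
    fault_id: str
    abnormal_kpis: list[str] = field(default_factory=list)
    alarm_ids: list[str] = field(default_factory=list)

    def items(self) -> frozenset[str]:
        # Flatten the row into one "transaction" of resource-anomaly items,
        # a convenient shape for frequent itemset mining later on.
        return frozenset([f"KPI:{k}" for k in self.abnormal_kpis]
                         + [f"ALARM:{a}" for a in self.alarm_ids])


def build_association(fault_time: datetime, fault_id: str,
                      kpi_values: dict[str, float], kpi_thresholds: dict[str, float],
                      windowed_alarms: list[dict]) -> ServiceFaultAssociation:
    # S403: keep only KPI statistics above their static threshold;
    # KPIs without a configured threshold are never flagged.
    abnormal = [kpi for kpi, value in kpi_values.items()
                if value > kpi_thresholds.get(kpi, float("inf"))]
    # S404/S405: every alarm inside the association window is associated with the fault.
    return ServiceFaultAssociation(fault_time, fault_id, abnormal,
                                   [a["alarm_id"] for a in windowed_alarms])


row = build_association(
    datetime(2017, 12, 8, 10, 30), "TRAFFIC_DROP",
    {"vm_cpu_usage": 95.0, "disk_io": 10.0},
    {"vm_cpu_usage": 90.0, "disk_io": 50.0},
    [{"alarm_id": "NIC_DOWN"}],
)
```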
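In this way, the database stores association data between service faults and resource KPIs and resource alarms. Each time a service fault subsequently occurs while the VNF service is running, the process of FIG. 4 is repeated, continuously adding association data to the service fault association table and providing rich historical data for subsequent fault diagnosis.
FIG. 5 is a schematic diagram of a method for diagnosing a service fault according to an embodiment of the present invention. As shown in the figure, the underlying-resource cause of a VNF service fault is diagnosed from the historical data in the database by using a suitable algorithm. The method specifically includes the following steps:
S501: The VNF service initiates service fault diagnosis.
When a service fault occurs while the VNF service is running, the VNF service initiates service fault diagnosis to determine the cause of the fault.
S502: The VNF service reads the service fault association table from the database.
The VNF service reads the historically recorded contents of the service fault association table from the database, including the resource KPI statistics associated with service faults and the resource alarm information associated with service faults.
S503: The VNF service determines diagnosis rules from the service fault association table.
Based on the association data in the service fault association table, the VNF service calculates the diagnosis rules for the service fault by using a first algorithm. Many algorithms can serve as the first algorithm, for example, a frequent itemset mining algorithm. With frequent itemset mining, the correlation between a VNF service fault and the corresponding resource KPI faults or resource alarms, that is, the diagnosis rule, can be derived from the association data of that VNF service fault over historical time.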
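Step S503 names frequent itemset mining without fixing a specific algorithm. The sketch below is a deliberately simple, brute-force variant: it counts every itemset of up to `max_size` resource-anomaly items per fault type and keeps those meeting a minimum support count and a minimum confidence (the fraction of that fault's occurrences in which the itemset appeared). It is not an optimized Apriori or FP-Growth implementation, and the thresholds, item names, and fault identifiers are assumptions.

```python
from collections import Counter
from itertools import combinations


def mine_diagnosis_rules(history, min_support=2, min_confidence=0.6, max_size=2):
    """Derive diagnosis rules from historical (fault_id, items) transactions.

    Returns {fault_id: [(itemset, confidence), ...]}, best rule first.
    """
    fault_counts = Counter(fault_id for fault_id, _ in history)
    itemset_counts = Counter()
    for fault_id, items in history:
        # Brute-force enumeration of every itemset up to max_size in this transaction.
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                itemset_counts[(fault_id, frozenset(combo))] += 1

    rules = {}
    for (fault_id, itemset), count in itemset_counts.items():
        confidence = count / fault_counts[fault_id]
        if count >= min_support and confidence >= min_confidence:
            rules.setdefault(fault_id, []).append((itemset, confidence))
    for fault_rules in rules.values():
        # Highest confidence first; ties broken by larger (more specific) itemsets.
        fault_rules.sort(key=lambda rule: (-rule[1], -len(rule[0]), sorted(rule[0])))
    return rules


history = [
    ("TRAFFIC_DROP", {"ALARM:NIC_DOWN", "KPI:packet_loss"}),
    ("TRAFFIC_DROP", {"ALARM:NIC_DOWN"}),
    ("TRAFFIC_DROP", {"ALARM:NIC_DOWN", "KPI:vm_cpu_usage"}),
    ("LATENCY_HIGH", {"KPI:vm_cpu_usage"}),
]
rules = mine_diagnosis_rules(history)
```

With the default thresholds only the network-card alarm survives for `TRAFFIC_DROP` (present in 3 of 3 occurrences); `LATENCY_HIGH` has a single occurrence and so falls below the support threshold.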
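S504: The VNF service obtains from the database the association data of the service fault within the current association duration.
The VNF service obtains from the database the association data, including resource KPI statistics and resource alarm information, associated with the service fault to be diagnosed in the current period.
S505: The VNF service determines the diagnosis result based on the diagnosis rules and the association data.
The resource KPI statistics and resource alarm information related to the service fault are matched against the diagnosis rules to determine the diagnosis result, that is, to identify which resource alarm or abnormal resource KPI statistic is the most likely root cause of the service fault. For example, a VNF service fault that manifested as a sharp drop in users' 4G traffic was ultimately traced to an alarm raised by a hardware resource (a network interface card).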
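Matching in S505 could then reduce to checking which rules are fully contained in the current association data and ranking the matches. In this sketch a rule is an `(itemset, confidence)` pair; ranking by confidence and then by itemset size is an assumed heuristic, and all identifiers are hypothetical.

```python
def diagnose(current_items, rules_for_fault):
    """Match current association data against diagnosis rules, best match first.

    current_items: set of resource-anomaly items (e.g. "ALARM:NIC_DOWN") collected
    within the current association duration.
    rules_for_fault: list of (itemset, confidence) pairs for this fault type.
    A rule matches when every item in its itemset is present in current_items.
    """
    matches = [(itemset, conf) for itemset, conf in rules_for_fault
               if itemset <= current_items]
    # The most confident, then most specific, rule is reported as the likely root cause.
    return sorted(matches, key=lambda m: (-m[1], -len(m[0]), sorted(m[0])))


rules_for_traffic_drop = [
    (frozenset({"ALARM:NIC_DOWN"}), 1.0),
    (frozenset({"KPI:vm_cpu_usage"}), 0.7),
    (frozenset({"KPI:disk_io"}), 0.9),
]
current = frozenset({"KPI:vm_cpu_usage", "ALARM:NIC_DOWN"})
result = diagnose(current, rules_for_traffic_drop)
```

The disk-I/O rule is skipped because that anomaly is absent from the current data, and the network-card alarm ranks first, mirroring the 4G-traffic example above.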
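By collecting KPI data and alarm information from the underlying resources and associating them with VNF service faults, the embodiments of the present invention enable service faults to be located quickly, which greatly improves fault recovery capability and system reliability.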
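In the embodiments of the present invention, the diagnosis apparatus may be divided into functional modules according to the foregoing method examples. When integrated modules are used, FIG. 6 is a schematic structural diagram of a diagnosis apparatus according to an embodiment of the present invention. The diagnosis apparatus 600 includes a storage module 601, a processing module 602, and a communication module 603. The processing module 602 is configured to control and manage the actions of the diagnosis apparatus; for example, the processing module 602 supports the diagnosis apparatus in performing processes S501 and S503 in FIG. 5 and/or other processes of the techniques described herein. The communication module 603 is configured to obtain the diagnosis rule of a VNF service fault. The diagnosis apparatus may further include the storage module 601, configured to store resource KPI statistics and the like.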
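The processing module 602 may be a processor or a controller, for example, a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It can implement or execute the various example logical blocks, modules, and circuits described with reference to the disclosure of the embodiments of the present invention. The processor may also be a combination that implements computing functions, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The communication module 603 may be a communication interface, a transceiver, a transceiver circuit, or the like, where "communication interface" is a collective term that may include one or more interfaces. The storage module 601 may be a memory.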
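When the processing module 602 is a processor, the communication module 603 is a communication interface, and the storage module 601 is a memory, the diagnosis apparatus in this embodiment of the present invention may be the diagnosis apparatus shown in FIG. 7.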
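Referring to FIG. 7, FIG. 7 is a schematic structural diagram of another diagnosis apparatus according to an embodiment of the present invention. The diagnosis apparatus 700 includes a processor 702, a communication interface 703, and a memory 701. The communication interface 703, the processor 702, and the memory 701 may be connected to each other through a communication connection.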
上述实施例中,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。需要说明的是,本发明实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。诊断装置为了实现上述功能,其包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,本发明能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本发明的范围。
专业人员应该还可以进一步意识到,结合本文中所公开的实施例描述的各示例的模块及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本发明的范围。
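The steps of the methods or algorithms described with reference to the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of both. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.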
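The foregoing specific embodiments describe the objectives, technical solutions, and beneficial effects of the present invention in further detail. It should be understood that they are merely specific embodiments of the present invention and are not intended to limit its protection scope. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.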
Claims (16)
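- A virtual network function (VNF) service fault diagnosis method, comprising: determining that a VNF service fault has occurred; obtaining diagnosis rules for the VNF service fault; and matching resource correlation data associated with the VNF service fault against the diagnosis rules to determine a cause of the VNF service fault.
- The method according to claim 1, wherein the obtaining diagnosis rules for the VNF service fault comprises: computing, using a first algorithm, the diagnosis rules for the VNF service fault from historically recorded resource correlation data associated with the VNF service fault.
- The method according to claim 2, wherein the first algorithm comprises a frequent itemset mining algorithm.
- The method according to any one of claims 1 to 3, wherein the resource correlation data comprises at least one of the following: resource KPI statistics, resource alarm information, and resource log information.
- The method according to claim 4, wherein the resource KPI statistics comprise at least one of the following for resource KPI samples within a statistical period: a cumulative sum, an average value, a maximum value, and a real-time value.
- The method according to claim 4, wherein the resource correlation data is obtained through periodic polling or through subscription.
- The method according to any one of claims 1 to 6, wherein the determining that the VNF service fault has occurred comprises: detecting the VNF service fault using a dynamic threshold or static threshold method.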
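- A diagnostic apparatus, comprising: a processing module, configured to determine that a virtual network function (VNF) service fault has occurred; and a communication module, configured to obtain diagnosis rules for the VNF service fault; wherein the processing module is further configured to match resource correlation data associated with the VNF service fault against the diagnosis rules to determine a cause of the VNF service fault.
- The apparatus according to claim 8, wherein the communication module is specifically configured to compute, using a first algorithm, the diagnosis rules for the VNF service fault from historically recorded resource correlation data associated with the VNF service fault.
- The apparatus according to claim 9, wherein the first algorithm comprises a frequent itemset mining algorithm.
- The apparatus according to any one of claims 8 to 10, wherein the resource correlation data comprises at least one of the following: resource KPI statistics, resource alarm information, and resource log information.
- The apparatus according to claim 11, wherein the resource KPI statistics comprise at least one of the following for resource KPI samples within a statistical period: a cumulative sum, an average value, a maximum value, and a real-time value.
- The apparatus according to claim 11, wherein the resource correlation data is obtained through periodic polling or through subscription.
- The apparatus according to any one of claims 8 to 13, wherein the processing module is specifically configured to detect the VNF service fault using a dynamic threshold or static threshold method.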
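- A computer-readable storage medium, comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 7.
- A computer program product comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 7.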
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201711297407.5 | 2017-12-08 | | |
| CN201711297407.5A CN109905261A (zh) | 2017-12-08 | 2017-12-08 | Fault diagnosis method and apparatus |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019109961A1 (zh) | 2019-06-13 |
Family ID: 66750799
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2018/119426 Ceased WO2019109961A1 (zh) | Fault diagnosis method and apparatus | 2017-12-08 | 2018-12-05 |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN109905261A (zh) |
| WO (1) | WO2019109961A1 (zh) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
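| CN111934910A (zh) * | 2020-07-14 | 2020-11-13 | China United Network Communications Group Co., Ltd. | Fault processing method, device, and storage medium |
| CN114357239A (zh) * | 2022-01-07 | 2022-04-15 | Chongqing Unisinsight Technology Co., Ltd. | Automated view library service assurance system and method |
| CN114723197A (zh) * | 2021-01-04 | 2022-07-08 | China Mobile Communications Co., Ltd. Research Institute | Fault root cause localization method and apparatus, and electronic device |
| WO2023056723A1 (zh) * | 2021-10-08 | 2023-04-13 | Suzhou Inspur Intelligent Technology Co., Ltd. | Fault diagnosis method and apparatus, electronic device, and storage medium |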
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
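| CN112217691A (zh) * | 2020-02-19 | 2021-01-12 | Du Yiping | Cloud-platform-based network diagnosis processing method and apparatus |
| CN113596891B (zh) * | 2021-07-28 | 2023-07-14 | China United Network Communications Group Co., Ltd. | Fault localization method, apparatus, server, storage medium, and system |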
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
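| CN104170323A (zh) * | 2014-04-09 | 2014-11-26 | Huawei Technologies Co., Ltd. | Fault processing method, apparatus, and system based on network function virtualization |
| WO2015100611A1 (zh) * | 2013-12-31 | 2015-07-09 | Huawei Technologies Co., Ltd. | Network function virtualization (NFV) fault management apparatus, device, and method |
| CN105187249A (zh) * | 2015-09-22 | 2015-12-23 | Huawei Technologies Co., Ltd. | Fault recovery method and apparatus |
| CN106301828A (zh) * | 2015-05-21 | 2017-01-04 | ZTE Corporation | Method and apparatus for processing virtualized network function service faults |
| EP3252600A1 (en) * | 2015-01-28 | 2017-12-06 | Nec Corporation | Virtual network function management device, system, healing method, and program |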
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
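| CN101937447B (zh) * | 2010-06-07 | 2012-05-23 | Huawei Technologies Co., Ltd. | Alarm association rule mining method, rule mining engine, and system |
| CN105165054B (zh) * | 2014-01-21 | 2019-05-24 | Huawei Technologies Co., Ltd. | Network service fault handling method, service management system, and system management module |
| WO2017031698A1 (zh) * | 2015-08-25 | 2017-03-02 | Huawei Technologies Co., Ltd. | Method, apparatus, and system for acquiring VNF information |
| CN106130809B (zh) * | 2016-09-07 | 2019-06-25 | Southeast University | IaaS cloud platform network fault localization method and system based on log analysis |
| CN107248927B (zh) * | 2017-05-02 | 2020-06-09 | Huawei Technologies Co., Ltd. | Fault localization model generation method, fault localization method, and apparatus |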
- 2017-12-08 CN CN201711297407.5A patent/CN109905261A/zh active Pending
- 2018-12-05 WO PCT/CN2018/119426 patent/WO2019109961A1/zh not_active Ceased
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
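| CN111934910A (zh) * | 2020-07-14 | 2020-11-13 | China United Network Communications Group Co., Ltd. | Fault processing method, device, and storage medium |
| CN111934910B (zh) * | 2020-07-14 | 2023-03-24 | China United Network Communications Group Co., Ltd. | Fault processing method, device, and storage medium |
| CN114723197A (zh) * | 2021-01-04 | 2022-07-08 | China Mobile Communications Co., Ltd. Research Institute | Fault root cause localization method and apparatus, and electronic device |
| WO2023056723A1 (zh) * | 2021-10-08 | 2023-04-13 | Suzhou Inspur Intelligent Technology Co., Ltd. | Fault diagnosis method and apparatus, electronic device, and storage medium |
| CN114357239A (zh) * | 2022-01-07 | 2022-04-15 | Chongqing Unisinsight Technology Co., Ltd. | Automated view library service assurance system and method |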
Also Published As
| Publication number | Publication date |
|---|---|
| CN109905261A (zh) | 2019-06-18 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2019109961A1 (zh) | | Fault diagnosis method and apparatus |
| Chen et al. | | CauseInfer: Automated end-to-end performance diagnosis with hierarchical causality graph in cloud environment |
| US8224942B1 (en) | | Network failure detection |
| US10069900B2 (en) | | Systems and methods for adaptive thresholding using maximum concentration intervals |
| US9836952B2 (en) | | Alarm causality templates for network function virtualization |
| CN107092544B (zh) | | Monitoring method and apparatus |
| US20170104658A1 (en) | | Large-scale distributed correlation |
| US9710122B1 (en) | | Customer support interface |
| US20150149609A1 (en) | | Performance monitoring to provide real or near real time remediation feedback |
| CN105634785B (zh) | | Fault reporting method and system, and related apparatus |
| US20200327045A1 (en) | | Test System and Test Method |
| US10616078B1 (en) | | Detecting deviating resources in a virtual environment |
| WO2017114152A1 (zh) | | Service dial-test method, apparatus, and system |
| US9590885B1 (en) | | System and method of calculating and reporting of messages expiring from a queue |
| US10291493B1 (en) | | System and method for determining relevant computer performance events |
| EP4158480A1 (en) | | Actionability metric generation for events |
| CN110362455A (zh) | | Data processing method and data processing apparatus |
| CN104104734A (zh) | | Log analysis method and apparatus |
| CN115913911A (zh) | | Network fault detection method, device, and storage medium |
| US20170185475A1 (en) | | Analytics-Based Dynamic Adaptation of Client-Server Mobile Applications |
| US20180095819A1 (en) | | Incident analysis program, incident analysis method, information processing device, service identification program, service identification method, and service identification device |
| WO2024018257A1 (en) | | Early detection of irregular patterns in mobile networks |
| US11036561B2 (en) | | Detecting device utilization imbalances |
| CN105515909B (zh) | | Data acquisition test method and apparatus |
| EP3764232B1 (en) | | Business transactions impact analysis |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18886704; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 18886704; Country of ref document: EP; Kind code of ref document: A1 |