HK1238037B

HK1238037B - Systems and methods for maintaining network service levels

Info

Publication number: HK1238037B
Application number: HK17112052.1A
Authority: HK
Inventors: Kakadia Deepak; Naeem Muhammad
Original assignee: Google LLC
Priority date: 2015-10-09
Filing date: 2017-11-17
Publication date: 2021-08-20

Description

System and method for maintaining network service levels

Background Art

Information is transmitted over computer networks. The information is represented as bits grouped into packets. The packets are passed from network device to network device, such as switches and routers, to propagate the information through the network. Each packet is transmitted from its source toward a destination specified by header information in that packet. The source and destination of a packet may be in different portions of the network, each portion operated by a different party. There may be multiple possible routes between the source and destination.

A wide area network ("WAN") such as the Internet can include multiple sub-networks known as autonomous systems ("AS"). An autonomous system is a portion of the network that appears to other portions of the network as though it has unified administration of a single routing policy, and that presents to those other portions a consistent picture of reachable network destinations, for example, as the network address spaces reachable through the AS. In some cases, an autonomous system can be identified by an autonomous system number ("ASN") that is unique within the network. Typically, the operator of an autonomous system has agreements with third parties that allow data to be carried on one or more autonomous systems controlled by the respective third party, usually under a "settlement" agreement for transit billed by usage or as a "settlement-free" peering agreement. Data can then be transmitted from one autonomous system to another at a peering point, a multi-homed network device, an Internet exchange point ("IXP"), or the like, within the confines of the agreements between the autonomous system operators.

Summary of the Invention

In some aspects, the present disclosure relates to a method for maintaining network service levels. The method includes identifying a first plurality of network incidents occurring over a first portion of a measurement period, and identifying a second plurality of network incidents occurring over a second portion of the measurement period that follows the first portion. The method includes determining a plurality of remaining incident tolerance limits based on the impact of the first and second pluralities of network incidents on corresponding sets of incident tolerance limits for the measurement period. The method includes generating severity metric values for at least a subset of the second network incidents, based on aggregate impact characteristics of one or more of the second plurality of network incidents weighted by the remaining incident tolerance limit associated with each of the second network incidents in the subset. The method then includes selecting at least one of the incidents in the subset of the second network incidents for remediation.

In some aspects, the present disclosure relates to a system for maintaining network service levels. The system includes a computer-readable memory storing records of network incidents, and one or more processors configured to access the computer-readable memory and to execute instructions that, when executed by a processor, cause the processor to use the records of network incidents stored in the computer-readable memory to identify a first plurality of network incidents occurring over a first portion of a measurement period, and to further identify a second plurality of network incidents occurring over a second portion of the measurement period that follows the first portion. The instructions, when executed, further cause the processor to: determine a plurality of remaining incident tolerance limits based on the impact of the first and second pluralities of network incidents on corresponding sets of incident tolerance limits for the measurement period; generate severity metric values for at least a subset of the second network incidents, based on aggregate impact characteristics of one or more of the second plurality of network incidents weighted by the remaining incident tolerance limit associated with each of the second network incidents in the subset; and select at least one of the incidents in the subset of the second network incidents for remediation.

Brief Description of the Drawings

The above and related objects, features, and advantages of the present disclosure will be more fully understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram of an example network environment;

FIGS. 2A and 2B are block diagrams illustrating how communications may be redirected around a network failure;

FIG. 3A is an example table representing service level incident records;

FIG. 3B is an example table representing an aggregation of service level incident records;

FIG. 4 is a flowchart illustrating an example method for maintaining network service levels;

FIG. 5 is a Venn diagram illustrating the intersection of filters used to prioritize incidents;

FIG. 6 is a block diagram of a network device suitable for use in the various implementations described; and

FIG. 7 is a block diagram of a computing system suitable for use in the various implementations described.

For clarity, not every component is labeled in every drawing. The drawings are not intended to be drawn to scale. Like reference numbers and designations in the various drawings indicate like elements.

Detailed Description

Computing devices communicate over network routes that can span multiple autonomous systems ("AS"). An AS network, or a portion of an AS network referred to as a subnet, provides services to various network customers in a variety of contexts, including but not limited to data service networks, access networks, transmission networks, and multi-tenant networks (e.g., computing "clouds," hosted computing services, and network as a service). Network administrators make commitments to their customers guaranteeing a certain level of service to be provided by the network. These service level agreements ("SLAs") define one or more service level objectives ("SLOs") for network uptime and quality (e.g., bandwidth, latency, etc.). Generally, an SLA is a contractual limit on the tolerance for incidents, which will inevitably occur, that interrupt or degrade network service. However, it can be difficult to determine whether a network incident, or a group of incidents, has enough impact on the service level objectives to violate an SLA until after the SLA has already been violated.

As described herein, a network administrator can use a service monitor to track service level incidents ("SLIs"). For example, an administrator can use an SLO-correlation tool that identifies incidents in which a software-defined network ("SDN") controller refuses to allocate resources to a new communication flow, for example, because network capacity is insufficient to support the requirements of the new flow at the time of the request. Any one such refusal is unlikely to result in an SLA violation. However, repeated refusals may lead to an SLA violation. In some implementations, an SLI occurs whenever a communication flow encounters network congestion. Some communication protocols implement a congestion notification protocol, and the service monitor can detect, or be notified of, flows in which the congestion notification protocol indicates congestion. For example, the Transmission Control Protocol ("TCP") has header bits reserved for explicit congestion notification ("ECN"), and the monitor can record an SLI whenever a flow includes packets with the ECN bits set to indicate congestion. As another example, in some implementations, an SLA includes minimum requirements for the values of one or more metrics of communication quality averaged over a fixed or sliding period of time. Communication through a network can be measured using one or more metrics, including, for example, bandwidth, throughput, and goodput, as described in more detail below. Service level incidents can include, for example, incidents in which a network link is unavailable, a network flow is rejected or interrupted, or a network flow encounters network congestion, and/or incidents in which the value of a metric of network communication quality exceeds or falls below a threshold.
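As a rough illustration of the ECN-based case, the sketch below inspects the flags byte of a TCP header and records an SLI when the ECN-Echo (ECE) flag is set. The `SliLog` class, the flow identifiers, and the record fields are hypothetical names chosen for this example and do not come from the disclosure; only the flag bit positions follow the standard TCP header layout.

```python
from dataclasses import dataclass, field

# Standard TCP flag masks in byte 13 of the TCP header.
TCP_FLAG_SYN = 0x02  # connection setup
TCP_FLAG_ECE = 0x40  # ECN-Echo: receiver saw a congestion mark
TCP_FLAG_CWR = 0x80  # Congestion Window Reduced (defined for completeness)


@dataclass
class SliLog:
    """Hypothetical in-memory log of service level incidents."""
    records: list = field(default_factory=list)

    def observe(self, flow_id: str, tcp_header: bytes) -> bool:
        """Record an SLI if the TCP header signals congestion via ECE."""
        if len(tcp_header) < 20:
            raise ValueError("TCP header must be at least 20 bytes")
        flags = tcp_header[13]
        # On SYN segments ECE negotiates ECN support, so skip those.
        if flags & TCP_FLAG_ECE and not flags & TCP_FLAG_SYN:
            self.records.append({"flow": flow_id, "type": "congestion"})
            return True
        return False


def make_header(flags: int) -> bytes:
    """Build a minimal 20-byte TCP header with the given flags byte."""
    header = bytearray(20)
    header[12] = 5 << 4  # data offset: 5 words, no options
    header[13] = flags
    return bytes(header)


log = SliLog()
log.observe("flow-a", make_header(0x10))                 # plain ACK
log.observe("flow-b", make_header(0x10 | TCP_FLAG_ECE))  # ACK with ECE set
log.observe("flow-c", make_header(TCP_FLAG_SYN | TCP_FLAG_ECE))  # negotiation
```

Only `flow-b` produces a record here. A production monitor would more likely receive such signals from flow telemetry than from parsing raw headers itself.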

SLA violations can be predicted, and prevented before they occur, by closely monitoring SLIs. For example, as described herein, a monitor or network analyzer can gather SLI records from one or more monitoring tools, filter out some of the records according to various criteria, and identify significant SLIs from the remaining records. In some implementations, each SLI is assigned an importance weight based, for example, on the impact of the corresponding incident. For example, in some implementations, an SLA includes a level of tolerance for network failures over a period of time. If a service level incident occurs near the end of the period, it can be weighted higher or lower depending on whether previous incidents have already consumed that tolerance. That is, if a particular SLA allows seven hours of downtime per month, a few seconds of downtime near the end of a month can be weighted lower if the month has had minimal downtime, and weighted higher if the month has had extensive downtime approaching or exceeding the seven tolerated hours. The monitor or network analyzer can then identify the one or more specific incidents for which corrective action would have the highest benefit in preventing an SLA violation. The causes of these specific incidents can be identified for remediation. Predicting and preventing SLA violations in this way can improve operation of the network by maintaining network service levels.
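The seven-hour example can be made concrete with a small sketch. The passage above does not give a weighting formula, so the one used here (impact multiplied by the ratio of the full tolerance to the remaining tolerance) is an assumption for illustration only, as are the function names and the sample incident durations.

```python
from typing import Sequence


def remaining_tolerance(budget_s: float, durations_s: Sequence[float]) -> float:
    """Tolerance left in the measurement period, clamped at zero."""
    return max(budget_s - sum(durations_s), 0.0)


def severity(impact_s: float, remaining_s: float, budget_s: float) -> float:
    """Weight an incident's impact by how little tolerance remains.

    Assumed weighting: budget / remaining, which is 1.0 for an untouched
    budget and grows as the budget is consumed.
    """
    floor = 1e-6 * budget_s  # avoids division by zero once the budget is spent
    return impact_s * budget_s / max(remaining_s, floor)


BUDGET_S = 7 * 3600  # seven hours of tolerated downtime per month

late_incident_s = 30.0                 # the same late outage in both months
quiet_month = [60.0, 30.0]             # earlier incidents, in seconds
rough_month = [3 * 3600.0, 13_000.0]   # earlier incidents, in seconds

low = severity(
    late_incident_s,
    remaining_tolerance(BUDGET_S, quiet_month + [late_incident_s]),
    BUDGET_S,
)
high = severity(
    late_incident_s,
    remaining_tolerance(BUDGET_S, rough_month + [late_incident_s]),
    BUDGET_S,
)
```

With these inputs the same 30-second outage scores far higher in the month whose tolerance is nearly exhausted, which is the ordering the passage describes. Ranking incidents by such a score is one way to pick candidates for remediation.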

FIG. 1 is a block diagram of an example network environment 100. In broad overview, FIG. 1 depicts multiple terminal nodes 120 configured for communication with various master nodes 160 through a network 110. While the Internet is a good example of a large network 110, this description is equally applicable to other networks. As shown, the terminal nodes 120 access the network 110 via a network portion 112, which may be, for example, a portion of the network 110, an access network (e.g., an Internet service provider ("ISP")), a transmission network, or any other network facilitating communication between the terminal nodes 120 and the master nodes 160. As shown in FIG. 1, the network 110 includes network portions 114 and 116. Each of the network portions 112, 114, and 116 is an illustrative network region and may be part of the same autonomous system, may be a distinct autonomous system, or may include multiple autonomous systems. The network portion 114 includes various network nodes 140 through which data is passed between the terminal nodes 120 and the master nodes 160. Some implementations may benefit from use of a software-defined network ("SDN") in which data forwarding devices are remotely controlled by one or more network controllers. Accordingly, while not required in some implementations, the network portion 114 is illustrated as an SDN in which the network nodes 140 are data forwarding devices remotely controlled by one or more network controllers 146. The network portion 116 includes the master nodes 160, representing, for example, host devices within one or more data centers or service centers.

FIG. 1 also illustrates a network monitor 180 situated in the network 110 with the ability to monitor, directly or indirectly, service level incidents ("SLIs") that occur within the scope of the network 110. The network monitor 180 uses one or more storage devices 188 to maintain records of the SLIs. Although only one network monitor 180 is shown, some implementations use multiple network monitors 180 distributed throughout the network 110. In some such implementations, the distributed network monitors 180 share the one or more storage devices 188. A network analyzer 190 accesses and analyzes the SLI records stored in the one or more storage devices 188. In some implementations, the network monitor 180 is, or includes, the network analyzer 190. In some implementations, the network analyzer 190 is separate and distinct from the network monitor 180.

The terminal nodes 120 and master nodes 160 shown in FIG. 1 are participants in various data communications through the network environment 100. A terminal node 120 and a master node 160 can each be, for example, a computing system 910 as shown in FIG. 7 and described below. For example, a master node 160 can be a computing system that provides a service, and a terminal node 120 can be a computing system that consumes the service. A master node 160 can transmit data to a terminal node 120, which then serves as a sink for the transmitted data. Likewise, a terminal node 120 can transmit data to a master node 160, which then serves as a sink for the transmitted data. The terminal nodes 120 and master nodes 160 can alternate between sending and receiving data. For example, a terminal node 120 can send a request for data to a master node 160, and the master node 160 can respond to the request by providing the data. In some cases, multiple terminal nodes 120 and/or multiple master nodes 160 can participate in an exchange of data. A master node 160 can act as an intermediary between multiple terminal nodes 120, for example, as a communication facilitator. Each terminal node 120 and master node 160 can fill any number of roles. In each such capacity, however, the terminal nodes 120 and master nodes 160 participate in communications transmitted via the network environment 100. A communication between a terminal node 120 and a master node 160 can be structured as a flow of data packets, for example, in the form of data packets according to an Internet Protocol such as IPv4 or IPv6. A flow can use, for example, an Open Systems Interconnection ("OSI") layer-4 transport protocol, such as the Transmission Control Protocol ("TCP") or the Stream Control Transmission Protocol ("SCTP"), carried over IP via the networks 110, 112, 114, and 116.

A terminal node 120 can be a laptop, desktop, tablet, electronic pad, personal digital assistant, smartphone, video game device, television, television auxiliary box (also known as a "set-top box"), kiosk, portable computer, or any other such device. A terminal device 120 is capable of presenting content to a user or facilitating the presentation of content to a user. In some implementations, a terminal device 120 runs an operating system that manages execution of software applications on the terminal device 120. In some such implementations, the operating system is provided with the terminal device 120 by the manufacturer or a distributor. Applications execute within a computing context controlled by the operating system, i.e., "on top" of the operating system. Applications can be installed natively with the operating system, or installed later, for example, by a distributor or user. In some implementations, the operating system and/or the applications are embedded within the terminal device 120, for example, encoded in read-only memory.

A master node 160 can be a computer that provides a service to other master nodes 160 or to terminal nodes 120. For example, a master node 160 can be an e-mail server, a file server, a data cache, a name server, a content server, a data relay, a web-page server, or any other network service host. In some implementations, one or more master nodes 160 are part of a content delivery network ("CDN"). Although shown only in the network portion 116, master nodes 160 can be distributed throughout the network environment 100.

The network environment 100 includes network portions 110, 112, 114, and 116 through which the terminal nodes 120 and master nodes 160 exchange information. The network portions 110, 112, 114, and 116 can be under unified control, for example, as parts of the same AS network, or can be under separate control. Each network portion 110, 112, 114, and 116 is composed of various network devices (e.g., network nodes 140) linked together to form one or more communication paths (e.g., data links 142) between participating devices. For example, network nodes 140 are illustrated in the network portion 114 with interconnecting links 142 forming a data plane. Each network node 140 includes at least one network interface for transmitting and receiving data over a connected data plane link 142, collectively forming the network. In some implementations, the network device 730 shown in FIG. 6 and described below is suitable for use as a network node 140.

The network environment 100, including the various network portions 110, 112, 114, and 116, can be composed of multiple networks, each of which can be any of the following: a local area network (LAN), such as a company intranet; a metropolitan area network (MAN); a wide area network (WAN); an inter-network, such as the Internet; or a peer-to-peer network, such as an ad hoc WiFi peer-to-peer network. The data links between devices can be any combination of wired links (e.g., fiber optic, mesh, coaxial, Cat-5, Cat-5e, Cat-6, etc.) and/or wireless links (e.g., radio, satellite, or microwave based). The network portions 112 and 114 are illustrated as parts of the larger network portion 110, consistent with examples in which the network monitor 180 is responsible for the network 110; however, the network portions 110, 112, 114, and 116 can each be public, private, or any combination of public and private networks. The networks can be any type and/or form of data network and/or communication network.

In some implementations, one or more of the network portions 110, 112, 114, or 116 are implemented using network function virtualization ("NFV"). In an NFV network, some network functionality normally implemented in a network device 140 is implemented as software executing on a processor (e.g., a general-purpose processor). In some implementations, the virtualized network functionality includes one or more of load balancing, access control, firewall, intrusion detection, and routing. Other network functionality can also be virtualized in this manner. In some implementations, the virtualized network functionality includes functionality for reporting network metrics, network interruptions, and other indicators of SLIs to the network monitor 180.

In some implementations, one or more of the network portions 110, 112, 114, and 116 are software-defined networks ("SDNs") in which data forwarding devices (e.g., network nodes 140) are controlled by a remote network controller 146 that is separate from the data forwarding devices, for example, as shown with respect to network portion 114. In some such implementations, the SDN network nodes 140 are controlled by one or more SDN controllers 146 via control plane links 148, which are distinct from, and thus out-of-band relative to, the data plane links 142. In some implementations, the SDN network nodes 140 are controlled via the in-band data plane links 142 or via a hybrid combination of in-band data plane links 142 and out-of-band control plane links 148. In some implementations of an SDN network, multi-packet data transmissions flow through the network on assigned routes. When an SDN data forwarding device receives a packet for an unrecognized flow, the data forwarding device assigns a route to the new flow or requests that a controller assign one. Each subsequently received packet of the flow is then forwarded along the same route by the data forwarding device. In some implementations, the SDN controller 146 selects a route for a new flow based on criteria associated with the flow, for example, based on requirements associated with the OSI layer-7 application protocol identified in the flow. For example, a flow for Voice over IP ("VoIP") may require low network latency, whereas a flow for the File Transfer Protocol ("FTP") may tolerate higher latency, and the controller 146 accordingly prioritizes directing VoIP traffic over low-latency routes. In some implementations, if the controller 146 cannot identify a route suitable for the new flow, it rejects the flow. Rejecting the flow may constitute a service level incident. In some implementations, the controller 146 reports the rejection of the new flow to the network monitor 180, for example, via link 186. In some implementations, the link 186 is part of the control plane. In some implementations, the link 186 is part of the data plane. In some implementations, the SDN controller 720 shown in FIG. 6 and described below is suitable for use as the network controller 146.
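The route selection and rejection behavior described above might be sketched as follows. This is a simplified illustration under stated assumptions, not the controller's actual logic: the `Route` fields, the per-application requirement table, and the numeric thresholds are all hypothetical.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    name: str
    latency_ms: float
    free_mbps: float


# Hypothetical per-application requirements: (max latency ms, needed Mb/s).
REQUIREMENTS = {"voip": (50.0, 0.1), "ftp": (500.0, 10.0)}


def select_route(app: str, routes: list) -> Optional[Route]:
    """Return the lowest-latency route meeting the flow's needs, or None."""
    max_latency_ms, needed_mbps = REQUIREMENTS[app]
    suitable = [
        r for r in routes
        if r.latency_ms <= max_latency_ms and r.free_mbps >= needed_mbps
    ]
    return min(suitable, key=lambda r: r.latency_ms) if suitable else None


def admit(app: str, routes: list, sli_log: list) -> Optional[Route]:
    """Assign a route, or reject the flow and log the rejection as an SLI."""
    route = select_route(app, routes)
    if route is None:
        sli_log.append({"type": "flow_rejected", "app": app})
    return route


routes = [Route("long-haul", 120.0, 40.0), Route("metro", 20.0, 5.0)]
sli_log = []
voip_route = admit("voip", routes, sli_log)  # only the metro route is fast enough
ftp_route = admit("ftp", routes, sli_log)    # only long-haul has the capacity
congested = [Route("long-haul", 120.0, 2.0)]
rejected = admit("ftp", congested, sli_log)  # no suitable route: logged as an SLI
```

In this sketch the rejection is appended to a local list; in the arrangement described above, the controller would instead report it to the network monitor 180 over link 186.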

The network environment 100 includes one or more network monitors 180. In some implementations, a network monitor 180 is a hardware device including one or more computing processors, memory devices, network interfaces, and connective circuitry. For example, in some implementations, a network monitor 180 is a computing device such as the computing device 910 shown in FIG. 7 and described below. Each network monitor 180 is situated in, or in communication with, the network 110 and has the ability to monitor, directly or indirectly, service level incidents ("SLIs") that occur within the scope of the network 110. In some implementations, a network controller 146 reports service level incidents to the network monitor 180. In some implementations, communication participants, such as master nodes 160, report service level incidents to the network monitor 180. In some implementations, the network monitor 180 detects service level incidents itself. For example, in some such implementations, the network monitor 180 periodically transmits probe packets (e.g., Internet Control Message Protocol ("ICMP") packets) and uses characteristics of the network responses to the probe packets to determine network status. The network monitor 180 records information representing each SLI in one or more storage devices 188. In some implementations, each SLI is represented as a record in a data structure stored in the one or more storage devices 188 for analysis. In some implementations, each SLI is represented as an entry in a database, for example, as a set of entries or table rows in a relational database. The record for each SLI can be stored in any appropriate format. In some implementations, the one or more storage devices 188 are internal to, or co-located with, the network monitor 180. In some implementations, the one or more storage devices 188 are external to the network monitor 180, for example, as a separate data server, network attached storage ("NAS"), or storage area network ("SAN"). In some implementations, the network monitor 180 further includes a network analyzer 190.

The storage device 188 is data storage either within the network monitor 180 or external to, but accessible by, the network monitor 180. The storage device 188 can include any device, or collection of devices, suitable for storing computer-readable data. Suitable data storage devices include volatile or non-volatile storage, network attached storage ("NAS"), and storage area networks ("SAN"). A data storage device can incorporate one or more mass storage devices, which can be co-located or distributed. Devices suitable for storing data include semiconductor memory devices such as EPROM, EEPROM, SDRAM, and flash memory devices. Devices suitable for storing data also include magnetic disks, e.g., internal hard disks or removable disks, magneto-optical disks, optical discs, and other such higher-capacity format disc drives. Data storage devices can be virtualized. Data storage devices can be accessed via an intermediary server and/or via a network. Data storage devices can structure data as a collection of files, data blocks, or chunks. Data storage devices can provide error recovery using, for example, redundant storage and/or error recovery data (e.g., parity bits). The storage device 188 can host a database, e.g., a relational database. In some implementations, data is recorded as entries in one or more database tables in a database stored in the data storage. In some such implementations, the data is accessed using a query language such as Structured Query Language ("SQL") or a variant such as PostgreSQL. The storage device 188 can host a file storage system. Data can be stored structured as a knowledge base. Data can be stored in an encrypted form. Access to stored data can be restricted by one or more authentication systems.

In some implementations, an SLI may occur when communication through the network drops below a particular quality level. For example, an SLA may include minimum or maximum thresholds for the average values of one or more network communication quality metrics, such as throughput, bandwidth, and latency. Throughput is the amount of information, e.g., the number of bits, transmitted over a portion of the network in a fixed period of time. Bandwidth is the maximum potential throughput, where the limitation is either physical or artificial (e.g., policy driven). Congestion occurs when network devices attempt to get more throughput than the available bandwidth can accommodate. Goodput is the throughput of information content, exclusive of other traffic such as network configuration data, protocol control information, or repeated transmission of lost packets. Latency is the amount of time that elapses between when a sender transmits a packet and when the intended receiver processes the packet, i.e., the delay attributable to transmission. Lag is the result of delay, e.g., the perception of delays from the perspective of a communication participant. For example, lag may occur when latency exceeds some tolerance threshold, e.g., when the delay becomes noticeable to an end user or fails to meet the quality of service ("QoS") requirements of a communication protocol. Although lag may also occur when packets are lost or corrupted in transmission, it is generally treated as synonymous with latency. Latency (and lag) may be measured in terms of a one-way transmission or as the round-trip time for a packet transmission and a subsequent response or acknowledgment. In some instances, latency is measured as a function of path length, i.e., the number of intermediary network devices ("hops") in a route. Each hop may contribute to the overall latency of the route, so a path with a lower hop count is expected to have less latency and fewer opportunities for forwarding failures. Packet delay variation (i.e., transmission jitter) is variation in latency over time, e.g., where packets arrive in bursts or with inconsistent delay. Transmission errors may cause poor goodput, high latency or lag, and undesirable delay variation. Metrics of transmission error include counts of packet retransmissions, ratios of packet retransmissions to first transmissions, and congestion-related transmissions such as packets with the Explicit Congestion Notification ("ECN") flag set. An SLI may be recorded for each such transmission error, or when the value of one or more metrics of transmission error exceeds or falls below a corresponding threshold.
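By way of illustration only, the comparison of a transmission-error metric or an averaged quality metric against SLA thresholds might be sketched as follows. The function names and the numeric thresholds are assumptions chosen for this example and are not taken from the described implementations.

```python
from statistics import mean
from typing import Optional


def retransmission_ratio(retransmissions: int, first_transmissions: int) -> float:
    """Ratio of packet retransmissions to first transmissions."""
    if first_transmissions <= 0:
        raise ValueError("first_transmissions must be positive")
    return retransmissions / first_transmissions


def breaches_threshold(value: float,
                       minimum: Optional[float] = None,
                       maximum: Optional[float] = None) -> bool:
    """Return True when a metric falls below its minimum or exceeds its maximum."""
    if minimum is not None and value < minimum:
        return True
    if maximum is not None and value > maximum:
        return True
    return False


# Hypothetical samples: round-trip latencies in milliseconds over one interval.
latencies_ms = [20.0, 30.0, 40.0]
latency_incident = breaches_threshold(mean(latencies_ms), maximum=25.0)
throughput_incident = breaches_threshold(80.0, minimum=100.0)  # Mbit/s, assumed
```

In such a sketch, an SLI record would be written whenever one of the checks returns true.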

The network analyzer 190 is responsible for analysis of the SLI records identified by the network monitor 180. In some implementations, the network analyzer 190 is a component or module of the network monitor 180. In some implementations, the network analyzer 190 is a hardware device that includes one or more computing processors, memory devices, network interfaces, and connective circuitry. For example, in some implementations, the network analyzer 190 is a computing device, such as the computing device 910 shown in FIG. 7 and described below. The network analyzer 190 reads the SLI records from the storage device 188 and prioritizes the service level incidents they represent. In some implementations, the network analyzer 190 identifies one or more particular service level incidents represented by the SLI records as high priority. A network administrator can then investigate the high-priority incidents further and take action to address the root cause of the incidents.

FIGS. 2A and 2B are block diagrams illustrating how communication may be redirected around a network failure 214. In some instances, an SLI may be the result of a seemingly unrelated network failure. For example, an SLI on a network link may be the result of oversubscription of that link caused by a failure on a different network link elsewhere in the network. FIGS. 2A and 2B illustrate this example.

In broad overview, FIGS. 2A and 2B illustrate a network environment 200 that includes three distinct network regions 240_(A), 240_(B), and 240_(C). The illustrated network environment 200 includes a data path 210 between regions 240_(A) and 240_(C), a data path 220 between regions 240_(A) and 240_(B), and a data path 230 between regions 240_(B) and 240_(C). The illustrated data paths 210, 220, and 230 are shown as single lines representing any form of network path, including, for example, a single direct link, a plurality of links, a link aggregation, a network fabric, a network of links, and so forth. In FIG. 2A, data can flow 216 directly from network region 240_(A) to network region 240_(C) via data path 210. In FIG. 2B, however, a failure 214 along data path 210 blocks the flow 216 from region 240_(A) to region 240_(C).

In FIG. 2B, data from region 240_(A) to region 240_(C) flows through region 240_(B) to circumvent the failure 214. That is, the direct data flow 216 is replaced with a data flow 226 from region 240_(A) to region 240_(B) and a data flow 236 from region 240_(B) to region 240_(C). If the composite capacity of paths 220 and 230 (i.e., the alternative path) is equal to or greater than the capacity of the failed path 210, then the failure 214 will not itself result in a service level incident ("SLI"). However, redirecting traffic from region 240_(A) to region 240_(C) through region 240_(B) using the alternative path may impact other traffic along one or both of the component paths 220 and 230, which could result in an SLI. For example, if each of paths 210, 220, and 230 has the same capacity and is at the same utilization rate, then the failure 214 would double the utilization rate of the alternative path (paths 220 and 230). If that rate was initially greater than 50%, doubling it will exceed the capacity and result in an SLI.
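The utilization arithmetic in this example can be sketched as follows. This is a minimal illustration under the stated simplification that each path has the same capacity and carries the same initial load; the function names and the capacity units are assumptions for the example only.

```python
def rerouted_utilization(capacity: float, load: float, rerouted_load: float) -> float:
    """Utilization of a path after redirected traffic is added to its own load."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return (load + rerouted_load) / capacity


def exceeds_capacity(capacity: float, load: float, rerouted_load: float) -> bool:
    """True when the combined load no longer fits, i.e., an SLI is expected."""
    return rerouted_utilization(capacity, load, rerouted_load) > 1.0


# Hypothetical numbers: every path has capacity 100 (arbitrary units).
# At 40% initial utilization, the alternative path ends up at 80%.
ok_case = exceeds_capacity(capacity=100, load=40, rerouted_load=40)
# At 60% initial utilization, the alternative path would need 120%.
sli_case = exceeds_capacity(capacity=100, load=60, rerouted_load=60)
```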

If the resulting load on a network path is close to capacity, there may be an SLI for later-added traffic. Even if the initial utilization of one of the paths was comparatively low, there may be an SLI if the resulting combined load exceeds the capacity of the data path. For example, if the combined load on network path 230 is close to full utilization after traffic is redirected around the failure 214, then later-added traffic from region 240_(B) to region 240_(C) will fail along network path 230. The resulting SLI record will be associated with traffic along network path 230. However, the actual root cause is the failure 214 along network path 210.

The situation illustrated in FIGS. 2A and 2B is simplified. In practice, traffic routed around a failure will impact a variety of alternative paths and trigger service level incidents in potentially unexpected locations. Analysis of the SLI records in the aggregate, however, can help identify the underlying cause. Accordingly, in some implementations, the network analyzer 190 looks for sets of service level incidents on various paths that indicate, in the aggregate, where remedial action is warranted. In some implementations, the network analyzer 190 identifies correlated incidents associated with different services, such that the cause is more likely to be the network than a cause more directly related to the services themselves.

In some instances, a service level incident ("SLI") occurs when the contents of a packet in a flow cannot be propagated through a network or a portion of a network, or when the contents of a packet in a flow are not propagated in a manner satisfying one or more network quality metrics. For example, an SLI may occur where network resources cannot be allocated for a flow, when a network flow experiences congestion, or when the value of one or more network communication metrics exceeds or falls below a corresponding threshold. A service level agreement ("SLA") may allow for a certain number of incidents, or a certain aggregate impact of incidents, during a particular measurement period. For example, an SLA may tolerate interruption or rejection of up to 1% of flows on a weekly basis. In some implementations, a period has fixed start and end times, e.g., a week may be defined as midnight on Sunday morning to 11:59 PM on the following Saturday night. In some implementations, a period is a sliding window of time, e.g., a week may be defined as a window of 7 consecutive days or any window of 168 hours. In some implementations, the measurement period may be either a discrete block of time or a sliding window of time, as specified by the SLA. The incident tolerance for a measurement period is reduced with each incident occurrence. That is, when an incident occurs, the remaining incident tolerance for the measurement period encompassing the incident is reduced by the impact of the incident. If an SLA allows for, or tolerates, 10 hours of downtime per month, then 2 hours of downtime leaves a remaining incident tolerance of 8 hours for the remainder of the month. If an SLI exceeds the incident tolerance for an SLA, then the SLI is an SLA violation.
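The downtime example above can be sketched as a simple running budget. This is an illustrative sketch only; the class name `ToleranceBudget` and its fields are hypothetical and are not part of the described system.

```python
from dataclasses import dataclass


@dataclass
class ToleranceBudget:
    """Incident tolerance for one SLA over one measurement period (hypothetical)."""
    allowed_hours: float          # tolerance granted by the SLA for the period
    consumed_hours: float = 0.0   # impact of incidents recorded so far

    def record_incident(self, impact_hours: float) -> float:
        """Reduce the remaining tolerance by the incident's impact and return it."""
        self.consumed_hours += impact_hours
        return self.remaining_hours

    @property
    def remaining_hours(self) -> float:
        return self.allowed_hours - self.consumed_hours

    @property
    def violated(self) -> bool:
        """True once incidents have exceeded the tolerance, i.e., an SLA violation."""
        return self.consumed_hours > self.allowed_hours


budget = ToleranceBudget(allowed_hours=10.0)
remaining = budget.record_incident(2.0)  # 2 hours of downtime leaves 8 hours
```

For a sliding measurement window, the same budget would be recomputed from only those incidents whose event times fall inside the current window.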

In some implementations, of two comparable incidents that are similar in importance or network impact, where one SLI has a greater impact on the remaining tolerance limit for an SLA than the other SLI, the SLI with the greater impact on the remaining tolerance limit is prioritized over the other SLI. In some implementations, an SLI that results in the remaining tolerance limit for an SLA falling below a threshold is treated as a more severe incident than a comparable SLI that does not result in the remaining tolerance limit falling below the threshold. In some implementations, where one SLI has a greater impact on the remaining tolerance limit for an SLA than another SLI, the SLI with the greater impact is prioritized over the other SLI even if other factors, e.g., importance or network impact, might suggest prioritizing the other SLI. In some implementations, multiple factors are used to identify which SLI to prioritize. For example, FIG. 5, discussed below, illustrates an example of using multiple filters to identify a set of priority incidents.

FIGS. 3A and 3B are example tables representing service level incidents. In some implementations, service level incidents are represented by network incident records. In some implementations, a network incident record includes only enough information to identify the corresponding incident. In some implementations, a network incident record includes additional information, e.g., information that may be helpful in later diagnostics. In some implementations, a network incident record includes at least the time and date of the incident occurrence, routing information for the incident occurrence, and a description or classification of the service impacted by the incident occurrence.

FIG. 3A illustrates an example of a table 300 of service level incident ("SLI") records, in which each SLI is represented by a respective row 372 containing data entries for an impacted flow (e.g., a flow that could not be assigned a route, or for which an assigned route became unavailable). As shown, each row 372 includes data entries for a source 312 and a destination 316 of the impacted flow, a service level objective ("SLO") 322 and a service identifier 332 for the service supported by the impacted flow, and an event time 352 for when the flow was impacted. The source 312 and destination 316 are represented by identifiers for the participating ends of the flow, e.g., network addresses, network names, address ranges, domain names, or any other such identifiers. The SLO 322 is represented by an identifier for the SLO, e.g., a name or number. In some implementations, the SLO 322 name is a descriptive character string. In some implementations, the SLO 322 name is a group classification identifier. In some implementations, the SLO 322 name is a descriptive characteristic of the objective, e.g., a maximum incident tolerance level. The service identifier 332 identifies the service, or service group, impacted by the SLI. For example, if the flow is associated with a particular service hosted by one or more master nodes 160, the service identifier 332 may be a character string identifying the service. The event time 352 is a timestamp indicating when the SLI occurred, or when the SLI record was entered (which may generally correspond to when the SLI occurred, but might not be the precise moment). Although shown as a single table 300, the information represented in FIG. 3A may be stored as multiple tables or in a non-relational database structure. The table 300 is provided as an example of how service level incidents may be represented in the data store 188; alternative data structures are used in some implementations.
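As a non-limiting illustration, a row 372 of table 300 might be modeled in memory as a small record type. The field names below mirror the columns described above, while the class name and the sample values (addresses, SLO name, service name) are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SliRecord:
    """One service level incident, mirroring a row 372 of table 300 (sketch)."""
    source: str           # source 312: address, name, range, or domain
    destination: str      # destination 316
    slo: str              # SLO 322 identifier, e.g., a name or number
    service_id: str       # service identifier 332
    event_time: datetime  # event time 352


# Hypothetical sample record; all values are made up for illustration.
record = SliRecord(
    source="10.0.1.0/24",
    destination="10.0.3.0/24",
    slo="gold",
    service_id="video-streaming",
    event_time=datetime(2021, 1, 4, 13, 30),
)
```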

FIG. 3B illustrates an example table 305 representing aggregations of service level incident records. In the example table 305, each set of SLI records is represented by a respective row 374 containing aggregate data entries corresponding to SLI records for various impacted flows (e.g., as represented in the table 300 shown in FIG. 3A). As shown, each row 374 includes: data entries for a source region 314 and a destination region 318 of the flows impacted by the represented set of incidents; an aggregate SLO level 324 and an aggregate service or service class identifier 334 for the services supported by the impacted flows; a count 340 of SLI records in the set; and an event time range start 354 and end 356 for when the flows were impacted. The source region 314 and destination region 318 are represented by identifiers for the participating ends of the flows, e.g., network addresses, network names, address ranges, domain names, or any other such identifiers. The SLO level 324 is represented by a generalized identifier for the SLO, e.g., a name or number. In some implementations, the represented set of SLIs may have the same SLO, in which case the SLO level 324 may be equivalent to the SLO 322. In some implementations, the represented set of SLIs may have a shared SLO characteristic, and the shared SLO characteristic is used as the SLO level 324. In some implementations, the SLO level 324 is a generalization of the objectives for the flows represented by the set. Likewise, the service or service class identifier 334 identifies the service, or service group, impacted by the represented SLIs. In some implementations, the table 305 may be sorted by the count 340. The event time range start 354 and end 356 are timestamps bounding the event times 352 of the represented set of incidents. Although shown as a single table 305, the information represented in FIG. 3B may be stored as multiple tables or in a non-relational database structure. The table 305 is provided as an example of how sets of service level incidents may be represented in the data store 188; alternative data structures are used in some implementations.

In some implementations, the data represented in the table 305 shown in FIG. 3B is generated by applying one or more filters or aggregation queries to the data represented in the table 300 shown in FIG. 3A. For example, in some implementations, a query is used to identify similar SLIs occurring within a particular time range along various network corridors, where a network corridor is a group of network paths between two sets of end nodes or regions. A network corridor may be characterized, for example, by parallel network paths, shared network paths, collaborative network paths, link aggregations, and other such redundancies. The ends of a network corridor may be geographic regions, network service regions, co-located computing devices, data centers, proximate address ranges, and so forth. In some implementations, the network analyzer 190 represents a network corridor using a pair of network address sets, where each of the sets of network addresses identifies or defines the end nodes for a respective end of the network corridor.

A query, or set of queries, can be used to identify SLI records for frequently occurring incidents impacting different services along a particular network corridor, which may indicate a problem in that network corridor. In some implementations, the network analyzer 190 identifies sets of SLI records with a count 340 above some minimum threshold, which may be, for example, a preconfigured number or percentage.
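An aggregation query of this general kind might be sketched as follows, using Python's standard `sqlite3` module and an in-memory database. The table name, column names, region labels, and the function `aggregate_incidents` are assumptions made for the example; the described implementations do not prescribe a particular schema or database engine.

```python
import sqlite3


def aggregate_incidents(rows, min_count):
    """Group SLI rows by corridor (source region, destination region) and
    return only the corridors whose incident count meets min_count."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sli (src_region TEXT, dst_region TEXT, "
        "slo TEXT, service TEXT, event_time TEXT)"
    )
    conn.executemany("INSERT INTO sli VALUES (?, ?, ?, ?, ?)", rows)
    query = """
        SELECT src_region, dst_region,
               COUNT(*)                AS incident_count,
               COUNT(DISTINCT service) AS distinct_services,
               MIN(event_time)         AS range_start,
               MAX(event_time)         AS range_end
        FROM sli
        GROUP BY src_region, dst_region
        HAVING COUNT(*) >= ?
        ORDER BY incident_count DESC
    """
    result = conn.execute(query, (min_count,)).fetchall()
    conn.close()
    return result


# Hypothetical incident rows; ISO-8601 strings sort chronologically as text.
sample_rows = [
    ("A", "C", "gold", "search", "2021-01-01T00:00"),
    ("A", "C", "gold", "mail", "2021-01-01T01:00"),
    ("A", "C", "silver", "search", "2021-01-01T02:00"),
    ("A", "B", "gold", "mail", "2021-01-01T03:00"),
]
frequent = aggregate_incidents(sample_rows, min_count=2)
```

The count of distinct services per corridor is included because incidents that affect several unrelated services along one corridor point more strongly to a network cause.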

FIG. 4 is a flowchart illustrating an example method 400 for maintaining network service levels. In broad overview of the method 400, the network analyzer 190 identifies, at stage 410, a first plurality of network incidents occurring over a first portion of a measurement period. At stage 420, the network analyzer 190 identifies a second plurality of network incidents occurring over a second portion of the measurement period, the second portion occurring after the first portion. At stage 430, the network analyzer 190 determines a plurality of remaining incident tolerance limits based on the impact of the first and second pluralities of network incidents on corresponding sets of incident tolerance limits for the measurement period. At stage 440, the network analyzer 190 generates severity metric values for at least a subset of the second network incidents based on aggregate impact characteristics of one or more of the second plurality of network incidents, weighted by the remaining incident tolerance limit associated with each second network incident in the subset. And at stage 450, the network analyzer 190 selects at least one of the incidents in the subset of the second network incidents for remediation.

Referring to FIG. 4 in more detail, at stage 410, the network analyzer 190 identifies a first plurality of network incidents occurring over a first portion of a measurement period. The network analyzer 190 identifies the network incidents by accessing records stored by the network monitor 180 in the data store 188. In some implementations, the network analyzer 190 queries the data store 188 to identify and/or retrieve SLI records for incidents occurring during the first portion of the measurement period. In some implementations, the network analyzer 190 uses a query (e.g., an SQL query) to identify the records while simultaneously filtering out or aggregating the records according to criteria limiting the scope of the analysis. For example, the criteria may eliminate records for incidents that occurred in apparent isolation or for unrelated services. In some implementations, the criteria restrict the identified records to a particular network corridor. In some implementations, the criteria identify clusters of incidents, e.g., in time, geography, or network topology. In some implementations, the query returns record sets only for incidents that occurred more than a threshold number of times during the first portion of the measurement period. In some implementations, multiple queries or filters are used to identify incident records for inclusion in the first plurality of network incidents. FIG. 5, presented below, illustrates a Venn diagram for identifying a set of priority incidents 540 using a combination of queries or filters 510, 520, and 530.

Still referring to stage 410 of FIG. 4, the first portion of the measurement period provides historical context for the analysis of events in a later portion of the measurement period, e.g., the second portion of the measurement period. The first portion of the measurement period may be, for example, a time period beginning at the start of the measurement period and ending at a percentage of the measurement period, e.g., halfway through or 67% of the way through the measurement period. The first portion may be, for example, a time period beginning at the start of the measurement period and ending at the time of the access to the data store 188 in stage 410. In some implementations, the first portion of time begins at the start of the measurement period and ends at the start of the second portion of time, which in turn ends at the end of the measurement period or at the time of the last service level incident. In some such implementations, the end of the first portion of time is an offset from the end of the measurement period, e.g., such that the second portion of time is a fixed length, such as the last six hours of the measurement period, or such that the second portion of time is a preconfigured percentage of the measurement period, such as the last 10% of the measurement period. In some implementations, the end of the first portion of time is selected so that the second portion of time is the shorter of a fixed length of time or a preconfigured percentage of the measurement period. That is, for example, in some implementations, the first portion of time ends six hours before the end of the measurement period or at 90% of the measurement period, whichever yields the shorter second portion (where six hours and 90% are example numbers; other lengths may also be suitable).
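The "shorter of a fixed length or a preconfigured percentage" rule might be sketched as follows. The function name and default values are assumptions for the example; six hours and 10% are simply the example numbers used above.

```python
from datetime import datetime, timedelta


def split_measurement_period(start: datetime,
                             end: datetime,
                             fixed_length: timedelta = timedelta(hours=6),
                             fraction: float = 0.10):
    """Split [start, end) into a first (historical) portion and a second
    (recent) portion whose length is the shorter of fixed_length and
    fraction * (end - start)."""
    period = end - start
    second_length = min(fixed_length, period * fraction)
    boundary = end - second_length
    return (start, boundary), (boundary, end)


# A one-week period: 10% would be 16.8 hours, so the 6-hour limit applies.
first, second = split_measurement_period(
    datetime(2021, 1, 3, 0, 0), datetime(2021, 1, 10, 0, 0)
)
```

For a short measurement period, such as a single day, the percentage limit becomes the shorter of the two and determines the boundary instead.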

At stage 420, the network analyzer 190 identifies a second plurality of network incidents occurring over a second portion of the measurement period, the second portion occurring after the first portion of the measurement period. The network analyzer 190 identifies the network incidents by accessing records stored by the network monitor 180 in the data store 188. In some implementations, the network analyzer 190 queries the data store 188 to identify and/or retrieve SLI records for incidents occurring during the second portion of the measurement period. In some implementations, the network analyzer 190 uses a query (e.g., an SQL query) to identify the records according to criteria limiting the scope of the analysis. For example, the criteria may select records for incidents related to, or associated with, the incidents identified in stage 410. In some implementations, the network analyzer 190 uses the same queries and filters used in stage 410, applied to the second portion of the measurement period.

In some implementations, the second portion of the measurement period is contiguous with the first portion of the measurement period, as described above. The second portion of the measurement period may be, for example, a time period beginning at the end of the first portion and ending at the end of the measurement period. The second portion may be, for example, a time period beginning at the end of the first portion and ending at the time of the access to the data store 188 in stage 420. In some implementations, the second portion overlaps with, or encompasses, the first portion. In general, the first portion of the measurement period provides context for analysis of network performance during the second portion of the measurement period. Service level incidents occurring during the second portion of the measurement period can then be identified in comparison to the context provided by the first portion, or by the first and second portions, of the measurement period.

At stage 430, the network analyzer 190 determines a plurality of remaining incident tolerance limits based on the impact of the first and second pluralities of network incidents on corresponding sets of incident tolerance limits for the measurement period. For each SLA impacted by a service level incident in the first and second pluralities of network incidents, the network analyzer 190 identifies the corresponding incident tolerance limit and the impact on that limit, e.g., the resulting remaining incident tolerance limit for the measurement period.

At stage 440, the network analyzer 190 generates severity metric values for at least a subset of the second network incidents based on aggregate impact characteristics of one or more of the second plurality of network incidents, weighted by the remaining incident tolerance limit associated with each second network incident in the subset. Each SLI may be assigned a score, representing the severity of the incident, according to one or more metrics. In some implementations, the metrics account for a count of corresponding incidents during the measurement period. In some implementations, the metrics include a priority value associated with the impacted network path. In some implementations, the metrics assign different values to different services, e.g., assigning a higher severity score to incidents impacting higher-priority services. The score, i.e., the severity metric value, is then adjusted, i.e., weighted, by a factor representing the corresponding remaining incident tolerance limit for the SLA impacted by the incident. In some implementations, the weighting factor increases as the remaining incident tolerance limit approaches zero. In some implementations, the severity metrics include an incident frequency for the measurement period.
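One possible weighting of this kind is sketched below. The inverse-fraction formula, the floor value, and the way the base score combines incident count with service priority are all assumptions chosen for illustration; the description above requires only that the weighting factor increase as the remaining tolerance approaches zero.

```python
def tolerance_weight(remaining: float, limit: float, floor: float = 0.05) -> float:
    """Weight that grows as the remaining tolerance shrinks (assumed formula).

    The remaining fraction is clamped to `floor` so that an exhausted or
    overdrawn tolerance yields a large but finite weight.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    fraction = max(remaining / limit, floor)
    return 1.0 / fraction


def severity(incident_count: int, service_priority: float,
             remaining: float, limit: float) -> float:
    """Base score (count times priority, an assumption) scaled by the weight."""
    base_score = incident_count * service_priority
    return base_score * tolerance_weight(remaining, limit)


# Same base score, different remaining tolerance (hours out of a 10-hour limit).
relaxed = severity(incident_count=4, service_priority=2.0, remaining=5.0, limit=10.0)
urgent = severity(incident_count=4, service_priority=2.0, remaining=1.0, limit=10.0)
```

With these hypothetical inputs, the incident set whose SLA has little tolerance left receives the higher severity value even though both sets have the same count and priority.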

At stage 450, the network analyzer 190 selects at least one of the incidents in the subset of the second network incidents for remediation. In some implementations, the network analyzer 190 identifies one or more incidents having severity metric values greater than a threshold. In some implementations, the network analyzer 190 identifies one or more incidents having severity metric values in an upper percentile (e.g., the upper 75th percentile or the upper 90th percentile). In some implementations, the network analyzer 190 identifies the one or more incidents having the highest severity metric values among the incidents that occurred during the second portion of the measurement period. In some implementations, remediation of incidents with high severity metric values is more likely to improve overall network conditions than remediation of lower-ranked incidents.
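The three selection strategies described for stage 450 can be sketched as follows. The incident identifiers and score values are made up, and the rank-based percentile rule (keep the top share of incidents, at least one) is only one of several reasonable ways to interpret "upper percentile".

```python
def select_above_threshold(scores, threshold):
    """Incidents whose severity metric value exceeds a fixed threshold."""
    return [inc for inc, s in scores.items() if s > threshold]

def select_upper_percentile(scores, pct):
    """Incidents at or above the pct-th percentile, by rank (assumed rule)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Integer ceiling of n * (100 - pct) / 100, keeping at least one incident.
    keep = max(1, (len(ranked) * (100 - pct) + 99) // 100)
    return ranked[:keep]

def select_highest(scores):
    """The single incident with the highest severity metric value."""
    return max(scores, key=scores.get)

scores = {"inc-1": 10.8, "inc-2": 8.8, "inc-3": 2.0, "inc-4": 5.5}

over_threshold = select_above_threshold(scores, 8.0)
top_quartile = select_upper_percentile(scores, 75)
worst = select_highest(scores)
```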

FIG. 5 is a Venn diagram illustrating the intersection of filters used to prioritize incidents. In some implementations of method 400, the network analyzer 190 selects incidents in the subset of the second network incidents for remediation based on multiple incident filters. As shown in FIG. 5, in some implementations, the network analyzer 190 identifies a ranked set of high-priority incidents by identifying the intersection of filters 510, 520, and 530. The first filter 510 identifies incidents with the highest frequency of occurrence, for example, using one or more queries of the service level incident records stored in storage 188. In some implementations, the query selects aggregations of similar incidents and ranks them by the count of incidents in each aggregation. The second filter 520 identifies incidents associated with the largest clusters. For example, in some implementations, incidents are clustered by shared or similar attributes. In some implementations, incidents are clustered by affected network links, routes, end nodes, or end node regions, such that the resulting clusters group together incidents that affect the same network corridor. The clustering filter 520 identifies the largest clusters and allows the network analyzer 190 to prioritize incidents associated with them. The third filter 530 identifies incidents with the highest weighted impact scores. In some implementations, incidents are assigned values for one or more impact metrics that measure the impact of the incident on network quality. The values are weighted by one or more factors, including, for example, the remaining tolerance level for a corresponding service level objective or service level agreement. In some implementations, the network analyzer 190 identifies a set of priority incidents based on the intersection 540 of filters 510, 520, and 530. In some implementations, other filters are used. In some implementations, additional filters are used.
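The filter intersection of FIG. 5 can be sketched with plain set operations. The incident records, field names ("kind", "corridor", "weighted_impact"), and the choice of a single top aggregation and a single largest cluster are all illustrative assumptions rather than details from the specification.

```python
from collections import Counter

incidents = [
    {"id": "a", "kind": "loss",    "corridor": "us-eu",   "weighted_impact": 9.0},
    {"id": "b", "kind": "loss",    "corridor": "us-eu",   "weighted_impact": 7.5},
    {"id": "c", "kind": "loss",    "corridor": "us-asia", "weighted_impact": 8.0},
    {"id": "d", "kind": "latency", "corridor": "us-eu",   "weighted_impact": 3.0},
    {"id": "e", "kind": "jitter",  "corridor": "eu-asia", "weighted_impact": 1.0},
]

def most_common_group(records, field):
    """Ids of the records that share the most common value of `field`."""
    top_value = Counter(r[field] for r in records).most_common(1)[0][0]
    return {r["id"] for r in records if r[field] == top_value}

def top_by_impact(records, k):
    """Ids of the k records with the highest weighted impact score."""
    ranked = sorted(records, key=lambda r: r["weighted_impact"], reverse=True)
    return {r["id"] for r in ranked[:k]}

frequency_filter = most_common_group(incidents, "kind")      # like filter 510
cluster_filter = most_common_group(incidents, "corridor")    # like filter 520
impact_filter = top_by_impact(incidents, 3)                  # like filter 530

# Intersection (like region 540), ranked by weighted impact.
impact = {r["id"]: r["weighted_impact"] for r in incidents}
priority = sorted(frequency_filter & cluster_filter & impact_filter,
                  key=impact.get, reverse=True)
```

Only incidents that pass all three filters survive, so a frequent but low-impact incident, or a high-impact incident outside the busiest corridor, drops out of the priority set.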

In some implementations, the network analyzer 190 generates a report identifying the selected network incidents. In some implementations, the report is provided to one or more system operators via email, SMS text message, automated phone call, instant message, or any other available communication medium.

FIG. 6 is a block diagram of an example network device 730. According to one illustrative implementation, the example network device 730 is suitable for implementing the intermediate network devices described herein. The computing system 910 described below with reference to FIG. 7 may also be suitable for use as the network device 730. For example, with network function virtualization ("NFV"), some network functions normally implemented in hardware circuitry are implemented as software executing on a processor (e.g., a general-purpose processor). In broad overview, the network device 730 includes a control module 744 and a memory 736, e.g., for storing device configuration and routing data. The network device 730 includes a forwarding engine 734 that uses the device configuration and routing data stored in the memory 736 to manage data traffic at the network interfaces 738. In some implementations, the network device 730 is implemented for use in a software-defined network ("SDN"), in which the network device 730 is controlled by an external SDN controller 720, e.g., via a control plane link 712. The SDN controller 720 includes a control module 742 and a memory 726. The computing system 910 described below with reference to FIG. 7 may also be suitable for use as the SDN controller 720. In some implementations, one or more functional components of the network device 730 or the SDN controller 720 are implemented as software components executed by a general-purpose processor.

Referring to FIG. 6 in more detail, the network device 730 includes a set of network interfaces 738. Each network interface 738 can be connected by one or more links to one or more external devices, forming a network (e.g., the network 110 shown in FIG. 1). External devices send data packets to the network device 730 over these links, and the packets arrive via an ingress interface (e.g., network interface 738a). The network device 730 forwards received data packets to an appropriate next hop via an egress interface (e.g., network interface 738c). In some implementations, the forwarding engine 734 determines which network interface 738 to use for forwarding each received data packet.

The forwarding engine 734 uses the configuration and routing data in the memory 736 to manage data traffic at the network interface ports 738. The configuration and routing data in the memory 736 are controlled by the control module 744. In some implementations, the forwarding engine 734 updates packet headers before forwarding packets to an egress network interface 738. For example, the forwarding engine 734 can update ECN, TTL, or checksum information in a packet header. In some implementations, an incoming packet contains routing instructions embedded in its header, and the forwarding engine 734 forwards the packet based on the embedded instructions.
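As an illustration of the kind of header update mentioned above, the following sketch decrements the TTL of a 20-byte IPv4 header and recomputes the header checksum (the standard 16-bit one's-complement sum). The function names are hypothetical, and raising an exception on an expired TTL is a stand-in for whatever a real forwarding engine would do (typically dropping the packet); the specification does not prescribe any of this.

```python
import struct

def ipv4_checksum(header: bytes) -> int:
    """16-bit one's-complement checksum over an IPv4 header."""
    total = sum(word for (word,) in struct.iter_unpack("!H", header))
    while total >> 16:                      # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def decrement_ttl(header: bytes) -> bytes:
    """Return a copy of the header with TTL - 1 and a fresh checksum."""
    hdr = bytearray(header)
    if hdr[8] <= 1:                         # byte 8 is the TTL field
        raise ValueError("TTL expired")     # placeholder for drop handling
    hdr[8] -= 1
    hdr[10:12] = b"\x00\x00"                # zero the checksum before summing
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return bytes(hdr)

# Sample header: TTL 0x40, protocol UDP, 192.168.0.1 -> 192.168.0.199.
original = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
forwarded = decrement_ttl(original)
```

A header with a correct checksum sums to zero under `ipv4_checksum`, which gives a quick validity check on both the original and the forwarded header.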

The memory 736 can be any device suitable for storing computer-readable data. Examples include, but are not limited to, semiconductor memory devices such as EPROM, EEPROM, SRAM, and flash memory devices. In some implementations, the memory 736 of the network device 730 includes memory dedicated to storing patterns for identifying packet flows, e.g., as ternary content-addressable memory ("TCAM"). In some implementations, the memory 736 of the network device 730 includes memory dedicated to buffering packet flows as they traverse the network device 730. The network device 730 can have any number of memory devices 736.
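The ternary matching that a TCAM performs can be approximated in software as an ordered list of value/mask pairs, where mask bits set to 0 mean "don't care". This is only a conceptual sketch: a hardware TCAM compares all entries in parallel, and the entries, labels, and 32-bit address keys below are invented for illustration.

```python
def tcam_lookup(entries, key):
    """Return the label of the first entry whose masked value matches key.

    entries: list of (value, mask, label), ordered from highest priority.
    """
    for value, mask, label in entries:
        if (key & mask) == (value & mask):
            return label
    return None

entries = [
    (0x0A010000, 0xFFFF0000, "flow-10.1/16"),  # more specific pattern first
    (0x0A000000, 0xFF000000, "flow-10/8"),
    (0x00000000, 0x00000000, "default"),       # all bits "don't care"
]

match_specific = tcam_lookup(entries, 0x0A010203)  # 10.1.2.3
match_broad = tcam_lookup(entries, 0x0A630001)     # 10.99.0.1
match_default = tcam_lookup(entries, 0xC0A80001)   # 192.168.0.1
```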

The control module 744 manages the performance of the network device 730. In some implementations, the control module 744 receives instructions from an external control device. For example, in a software-defined network ("SDN"), the control module 744 can receive control instructions from an SDN controller 720 external to the network device 730. In some implementations, the control module 744 processes routing information packets (i.e., control plane packets) and updates the memory 736 with modifications to the routing tables used by the forwarding engine 734. In some implementations, the control module 744 reads data arriving at an egress interface 738 into a buffer stored in the memory 736. The control module 744 can be implemented using a general-purpose processor or special-purpose logic circuitry, e.g., an application-specific integrated circuit ("ASIC").

FIG. 7 is a block diagram of an example computing system 910. According to one illustrative implementation, the example computing system 910 is suitable for implementing the computerized components described herein. In broad overview, the computing system 910 includes at least one processor 950 for performing actions in accordance with instructions, and one or more memory devices 970 or 975 for storing instructions and data. The illustrated example computing system 910 includes one or more processors 950 in communication, via a bus 915, with memory 970; at least one network interface controller 920 with a network interface 922 for connecting to a network device 924 (e.g., for access to a network); and other components 980, e.g., input/output ("I/O") components 930. Generally, the processor 950 executes instructions received from memory. The illustrated processor 950 incorporates, or is directly connected to, cache memory 975. In some instances, instructions are read from memory 970 into the cache 975 and executed by the processor 950 from the cache 975.

In more detail, the processor 950 can be any logic circuitry that processes instructions, e.g., instructions fetched from the memory 970 or the cache 975. In many embodiments, the processor 950 is a microprocessor unit or a special-purpose processor. The computing device 910 can be based on any processor, or set of processors, capable of operating as described herein. The processor 950 can be a single-core or multi-core processor. The processor 950 can be multiple distinct processors. In some implementations, the processor 950 is implemented as circuitry on one or more "chips."

The memory 970 can be any device suitable for storing computer-readable data. The memory 970 can be a device with fixed storage or a device for reading removable storage media. Examples include all forms of non-volatile memory, media, and memory devices; semiconductor memory devices (e.g., EPROM, EEPROM, SDRAM, and flash memory devices); magnetic disks; magneto-optical disks; and optical discs (e.g., CD-ROM, DVD-ROM, or Blu-ray discs). The computing device 910 can have any number of memory devices 970.

The cache 975 is generally a form of computer memory placed in close proximity to the processor 950 for fast access times. In some implementations, the cache 975 is part of the processor 950 or on the same chip as the processor 950. In some implementations, there are multiple levels of cache 975, e.g., L2 and L3 cache layers.

The network interface controller 920 manages data exchanges via the network interface 922 (sometimes referred to as a network interface port). The network interface controller 920 handles the physical and data link layers of the OSI model for network communication. In some implementations, some of the network interface controller's tasks are handled by one or more of the processors 950. In some implementations, the network interface controller 920 is incorporated into the processor 950, e.g., as circuitry on the same chip. In some implementations, the computing system 910 has multiple network interfaces 922 controlled by a single controller 920. In some implementations, the computing system 910 has multiple network interface controllers 920. In some implementations, each network interface 922 is a connection point for a physical network link (e.g., a Cat-5 Ethernet link). In some implementations, the network interface controller 920 supports wireless network connections, and the interface 922 is a wireless (e.g., radio) receiver/transmitter (e.g., for any of the IEEE 802.11 protocols, near field communication ("NFC"), Bluetooth, BLE, ANT, or any other wireless protocol). In some implementations, the network interface controller 920 implements one or more network protocols, such as Ethernet. Generally, the computing device 910 exchanges data with other computing devices via physical or wireless links through the network interface 922. The network interface 922 can link directly to another device or link to another device via an intermediary device, e.g., a network device such as a hub, bridge, switch, or router, that connects the computing device 910 to a data network such as the Internet.

The computing system 910 may include, or provide interfaces for, one or more input or output ("I/O") components 930. Input devices include, without limitation, keyboards, microphones, touch screens, foot pedals, sensors, MIDI devices, and pointing devices such as a mouse or trackball. Output devices include, without limitation, video displays, speakers, refreshable Braille terminals, lights, MIDI devices, and 2-D or 3-D printers.

The other components 980 may include an I/O interface, external serial device ports, and any additional coprocessors. For example, the computing device 910 may include an interface (e.g., a Universal Serial Bus ("USB") interface) for connecting input devices, output devices, or additional memory devices (e.g., a portable flash drive or external media drive). In some implementations, the computing device 910 includes an additional device 980 such as a coprocessor. For example, a math coprocessor can assist the processor 950 with high-precision or complex calculations.

Implementations of the subject matter and the operations described in this specification can be implemented in digital electronic circuitry, or in computer software embodied on a tangible medium, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification can be implemented as one or more computer programs embodied on a tangible medium, i.e., one or more modules of computer program instructions, encoded on one or more computer storage media for execution by, or to control the operation of, a data processing apparatus. A computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. The computer storage medium can also be, or be included in, one or more separate components or media (e.g., multiple CDs, disks, or other storage devices). The computer storage medium is tangible and stores data, such as computer-executable instructions, in a non-transitory form.

A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled languages, interpreted languages, declarative languages, and procedural languages, and can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, libraries, subprograms, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special-purpose logic circuitry, e.g., a field programmable gate array ("FPGA") or an application-specific integrated circuit ("ASIC"). Such a special-purpose circuit may be referred to as a computer processor even if it is not a general-purpose processor.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular implementations of particular inventions. Certain features that are described in this specification in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or a variation of a sub-combination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

References to "or" may be construed as inclusive, so that any terms described using "or" may indicate any of a single, more than one, and all of the described terms. The labels "first," "second," "third," and so forth are not necessarily meant to indicate an ordering and are generally used merely to distinguish between like or similar items or elements.

Thus, particular implementations of the subject matter have been described. Other implementations are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking or parallel processing may be utilized.

Claims

1. A method for maintaining network service levels, the method comprising:

identifying a first plurality of network incidents occurring during a first portion of a measurement period, the first plurality of network incidents being designated as first network incidents;

identifying a second plurality of network incidents occurring after the first portion of the measurement period and during a second portion of the measurement period, the second plurality of network incidents being designated as second network incidents;

determining a plurality of remaining incident tolerance limits based on an impact of the first network incidents and the second network incidents on a corresponding set of incident tolerance limits for the measurement period, wherein an incident tolerance limit is a tolerance level for network incidents over a time period, and wherein the remaining incident tolerance limits are obtained by reducing the incident tolerance limit as each incident occurs;

generating a severity metric value for each second network incident, wherein each second network incident is assigned a score representing the severity of the incident based on one or more metrics, and wherein the severity metric value is weighted based on the remaining incident tolerance limit associated with that second network incident; and

selecting at least one of the second network incidents for remediation based on a comparison of the severity metric values.

2. The method of claim 1, wherein the identified first plurality of network incidents and second plurality of network incidents are represented by network incident records stored in a computer-readable storage medium.

3. The method of claim 2, wherein the network incident records include at least:

(i) information on the time and date of the incident,

(ii) routing information for the occurrence of the incident, and

(iii) a description or classification of the services affected by the incident.

4. The method of claim 1, further comprising limiting a subset of the second network incidents to network incidents that meet the selection criteria.

5. The method of claim 4, further comprising selecting the subset of the second network incidents based on a count of network incidents satisfying the selection criteria exceeding a threshold.

6. The method of claim 1, further comprising limiting a subset of the second network incidents to network incidents affecting network flows, each network flow having at least one corresponding terminal node in a shared geographical area.

7. The method of claim 1, further comprising selecting a subset of the second network incidents, wherein selecting the subset of the second network incidents comprises: identifying network incidents affecting network flows between a first terminal node and a second terminal node, based on the first terminal node having a network address within a first set of network addresses and the second terminal node having a network address within a second set of network addresses.

8. The method of claim 7, further comprising: selecting the first set of network addresses and the second set of network addresses based on a shared network link between nodes addressed in the first set of network addresses and nodes addressed in the second set of network addresses.

9. The method of claim 1, wherein the measurement period is a rolling time window.

10. A system for maintaining network service levels, the system comprising:

a computer-readable storage device storing records of network incidents; and

one or more processors configured to access the computer-readable storage device and execute instructions that, when executed by the processor, cause the processor to:

identify, using the records of network incidents stored in the computer-readable storage device, a first plurality of network incidents occurring during a first portion of a measurement period, the first plurality of network incidents being designated as first network incidents;

determine a plurality of remaining incident tolerance limits based on an impact of the first plurality of network incidents and the second plurality of network incidents on a corresponding set of incident tolerance limits for the measurement period, wherein an incident tolerance limit is a tolerance level for network incidents over a time period, and wherein the remaining incident tolerance limits are obtained by reducing the incident tolerance limit as each incident occurs.

11. The system of claim 10, wherein the identified first plurality of network incidents and second plurality of network incidents are represented by network incident records stored in the computer-readable storage medium.

12. The system of claim 11, wherein the network incident records include at least:

(i) information on the time and date of the incident,

(iii) a description or classification of the services affected by the incident.

13. The system of claim 10, wherein the instructions, when executed by the processor, cause the processor to limit the subset of the second network incidents to network incidents that satisfy the selection criteria.

14. The system of claim 13, wherein the instructions, when executed by the processor, cause the processor to select the subset of the second network incidents based on a count of network incidents satisfying the selection criteria exceeding a threshold.

15. The system of claim 10, wherein the instructions, when executed by the processor, cause the processor to limit the subset of the second network incidents to network incidents affecting network flows, each network flow having at least one corresponding terminal node in a shared geographical area.

16. The system of claim 10, wherein the instructions, when executed by the processor, cause the processor to select a subset of the second network incidents, wherein selecting the subset of the second network incidents comprises: identifying network incidents affecting network flows between a first terminal node and a second terminal node, based on the first terminal node having a network address within a first set of network addresses and the second terminal node having a network address within a second set of network addresses.

17. The system of claim 16, wherein, when executed by the processor, the instructions cause the processor to select the first set of network addresses and the second set of network addresses based on a shared network link between nodes addressed in the first set of network addresses and nodes addressed in the second set of network addresses.

18. The system of claim 10, wherein the measurement period is a rolling time window.