
HK1227195B - Method to route packets in a distributed direct interconnect network - Google Patents


Info

Publication number
HK1227195B
HK1227195B (application HK17100823.4A)
Authority
HK
Hong Kong
Prior art keywords
node
packets
nodes
routing
network
Prior art date
Application number
HK17100823.4A
Other languages
Chinese (zh)
Other versions
HK1227195A1 (en)
Inventor
Dan Oprea
Andrei CATANA
Udo NEUSTADTER
Original Assignee
Rockport Networks Inc. (洛克波特网络股份有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rockport Networks Inc. (洛克波特网络股份有限公司)
Publication of HK1227195A1 (en)
Publication of HK1227195B (en)

Description

Method for routing packets in a distributed direct interconnect network

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is related to PCT Patent Application No. PCT/CA2014/000652, filed on August 29, 2014, entitled "Method and Apparatus to Manage the Direct Interconnect Switch Wiring and Growth in Computer Networks," the disclosure of which is incorporated herein by reference.

Technical Field

The present invention relates to computer networks for interconnecting servers in data centers and cloud data centers. In particular, the present invention relates to methods for routing packets in direct interconnect networks implemented on torus or high-radix topologies, such as "dragonfly" wiring structures.

Background Art

The term data center (DC) generally refers to a facility used to house large computer systems (often mounted in equipment racks) and their associated components, all interconnected by extensive structured cabling. Cloud data center (CDC) is a term used to refer to large, often redundant, facilities that similarly store an entity's data.

A network switch is a computer networking device that links network devices for communication/processing purposes. In other words, a switch is a telecommunications device capable of receiving a message from any device connected to it and transmitting that message to the specific device for which it is intended. A network switch is also commonly referred to as a multi-port bridge that processes and routes data. Here, "port" refers to the interface (the outlet for a cable or plug) between the switch and the computer/server/CPU to which it is attached.

Today, DCs and CDCs typically use a set of Layer 2 switches to implement data center networking. Layer 2 switches process and route data at Layer 2 (the Data Link Layer), which is the protocol layer that transfers data between nodes (e.g., servers) on the same local area network, or between adjacent nodes in a wide area network. However, the key problem to be solved is how to build a high-capacity computer network that can carry very large aggregate bandwidth (thousands of TB) over a very large number of ports (thousands), requires minimal structure and space (i.e., minimizes the need for large rooms housing large numbers of cabinets with card racks), is easily scalable, and can help minimize power consumption.

Conventional network topology implementations are based on completely independent switches organized in a hierarchical tree structure, as shown in Figure 1. Core switch 2 is a very high-speed, low-port-count switch with very large switching capacity. The second tier is implemented using aggregation switches 4 (medium-capacity switches with a large number of ports), while the third tier uses lower-speed, large-port-count (e.g., forty/forty-eight), low-capacity edge switches 6. Edge switches are typically Layer 2, aggregation switches Layer 2 and/or Layer 3, and core switches typically Layer 3. In the example provided, this implementation provides any server 8 to any server 8 connectivity in at most six hops (three hops up to core switch 2 and three hops down to destination server 8). Such hierarchies are also typically duplicated for redundancy and reliability purposes. For example, referring to Figure 1, without duplication, if the rightmost edge switch 6 failed there would be no connectivity to the rightmost servers 8. At a minimum, core switch 2 is duplicated, since a failure of core switch 2 would cause a total data-center connectivity failure. For obvious reasons, this approach has significant limitations in addressing the challenges of future CDCs. For example, because each switch is completely self-contained, it adds complexity, significant floor-space utilization, complex cabling prone to human error, manual switch configuration/provisioning, and increased energy costs.

However, many attempts have been made to improve switching scalability, reliability, capacity, and latency in data centers. For example, efforts have been made to implement more complex switching solutions using a unified control plane (e.g., the QFabric system switch from Juniper Networks; see, for example, http://www.juniper.net/us/en/productservices/switching/qfabric-system/), but such systems still use and maintain a traditional hierarchical architecture. Furthermore, given the exponential increase in the number of system users and in the data to be stored, accessed, and processed, processing power has become the most important factor in determining the performance requirements of a computer network system. While server performance has continued to improve, a single server is not powerful enough to meet the need; this is why the use of parallel processing has become crucial. As a result, what was once predominantly north-south traffic flow has now become primarily east-west traffic flow, in many cases as much as 80%. Despite this change in traffic flow, network architectures have not evolved to be optimal for this model. It is therefore still the topology of the communication network (which interconnects the compute nodes (servers)) that determines the speed of interaction between CPUs during parallel-processing communications.

This need for increased east-west traffic has led to the creation of newer, flatter network architectures (e.g., torus/ring networks). A torus interconnect system is a network topology for connecting network nodes (servers) in a mesh-like fashion in parallel computer systems. A torus topology can have nodes arranged in 2, 3, or more (N) dimensions that can be visualized as an array, wherein processors/servers are connected to their nearest neighbor processors/servers, and wherein processors/servers on opposite edges of the array are connected. In this way, each node has 2N connections in an N-dimensional torus configuration (Figure 2 provides an example of a 3-D torus interconnect). Because each node in a torus topology is connected to adjacent nodes via short cabling, there is low network latency during parallel processing. Indeed, a torus topology provides access to any node (server) in a minimum number of hops. For example, a four-dimensional torus implementing a 3x3x3x4 structure (108 nodes) requires an average of 2.5 hops to provide any-point-to-any-point connectivity. Figure 4 provides an example of a 6x6 2-D torus, showing the minimum number of hops required to travel from corner node 1.6 to all other 35 nodes. As shown, the number of hops required to reach any destination from node 1.6 can be plotted as a bell curve with a peak at 3 hops (10 nodes) and tails at 5 hops (4 nodes) and 1 hop (4 nodes), respectively.
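The hop distribution described above for Figure 4 can be checked directly. The following is a minimal sketch (not from the patent) that counts the nodes at each hop distance from a corner node of a 6x6 2-D torus, assuming wraparound in both dimensions; the function names are illustrative.

```python
from collections import Counter

def torus_distance(a, b, dims):
    """Minimal hop count between nodes a and b on a torus of the given dimensions."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def hop_histogram(src, dims):
    """Number of nodes at each hop distance from src in a 2-D torus."""
    hist = Counter()
    for x in range(dims[0]):
        for y in range(dims[1]):
            if (x, y) != src:
                hist[torus_distance(src, (x, y), dims)] += 1
    return dict(hist)

# Corner node of a 6x6 2-D torus (wraparound makes every node equivalent):
hist = hop_histogram((0, 0), (6, 6))
```

The resulting histogram reproduces the bell curve in the text: 10 nodes at 3 hops, with 4 nodes each at the 1-hop and 5-hop tails.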

Unfortunately, large torus network embodiments have not yet seen practical commercial deployment in DCs or CDCs, because large embodiments may take months to build, the cabling can be complex (2N connections for each node), and they can be costly and cumbersome to modify when expansion is necessary. However, where the need for processing power has outweighed the commercial disadvantages, implementations of torus topologies in supercomputers have been very successful. In this regard, IBM's Blue Gene supercomputer provides an example of a 3-D torus interconnect network, where 64 cabinets house 65,536 nodes (131,072 CPUs) to provide petaflops of processing power (see Figure 3), while Fujitsu's PRIMEHPC FX10 supercomputer system is an example of a 6-D torus interconnect housed in 1,024 racks comprising 98,304 nodes. Although the above examples deal with torus topologies, the considerations apply equally to other flat network topologies.

The present invention deals more specifically with the important problem of data packet traversal and routing from node to node in a torus or high-radix network structure. In this regard, it is the routing that determines the actual path a packet of data takes through the network from source to destination. For purposes herein, latency refers to the time a packet takes to reach its destination in the network, and is usually measured from when the header arrives at the input of the source node to when it arrives at the input of the destination node. Hop count refers to the number of links or nodes traversed between the source and the destination, and serves as an approximation for latency. Throughput is the data rate that the network accepts per input port/node, measured in bits/sec.

A useful goal when routing is to distribute traffic evenly among nodes (load balancing) so as to avoid the development of hotspots (paths or node regions where usage/demand has exceeded a desired or acceptable threshold) and to minimize contention (which occurs when two or more nodes attempt to send messages or packets over the same wire or path at the same time), thereby improving network latency and throughput. The routes chosen for this purpose affect the number of hops from node to node and, when routing is not optimized, can potentially even affect energy consumption.

The topology over which the network operates also clearly affects latency, because the topology determines the average minimum hop count and distance between nodes. For example, in a torus there are not only several paths a packet can take to reach its destination (i.e., a torus has "path diversity"), but also multiple minimal-length paths between any source and destination pair. As an example, Figure 5 shows three minimal routes (3 hops) that a packet can take to travel from node S 11 to node D 12 in a 2-D torus mesh, along with a longer fourth route of 5 hops. The path computation performed by routing is done essentially on the basis of topology alone, based on the packet's source-destination pair, either as source routing or dynamically on a hop-by-hop basis.

Routing methodologies that exploit path diversity achieve better fault tolerance and better load balancing in the network. However, routing methodologies do not always achieve these goals, and can generally be divided into three categories: deterministic, oblivious, and adaptive. Deterministic routing refers to the fact that the route between a given pair of nodes is determined in advance, regardless of the current state of the network (i.e., regardless of network traffic). Dimension-order routing (DOR) is an example of deterministic routing, in which all messages from node A to node B will always traverse the same path. Specifically, a message traverses dimension by dimension (X-Y routing), reaching the coordinate matching its destination in one dimension before switching to the next dimension. As an example, Figure 6 can be used to illustrate DOR: a packet first travels along the first dimension (X) as far as required, from node 1 to 5 to 9, and then travels along the second dimension (Y) to destination node 10. Although such routing is generally easy to implement and deadlock-free (deadlock referring to a situation where an endless cycle exists along the path from source to destination), path diversity is not exploited and load balancing is therefore poor.
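DOR as described above can be sketched in a few lines. This is an illustrative sketch, not the patent's implementation: nodes are identified by (x, y) coordinates on a torus (an assumption, since Figure 6 labels nodes by number), and the route always resolves dimension 0 fully before touching dimension 1, taking the shorter way around the wraparound.

```python
def dor_path(src, dst, dims):
    """Dimension-order (X-Y) route on a torus: finish dimension 0 before dimension 1."""
    path = [src]
    cur = list(src)
    for dim, size in enumerate(dims):
        while cur[dim] != dst[dim]:
            forward = (dst[dim] - cur[dim]) % size         # hops if we go in + direction
            step = 1 if forward <= size - forward else -1  # take the shorter way around
            cur[dim] = (cur[dim] + step) % size
            path.append(tuple(cur))
    return path
```

Because the rule is fixed, every call with the same (src, dst) pair returns the identical path, which is exactly why DOR exploits no path diversity.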

"Oblivious" routing algorithms are those in which routing decisions are made randomly regardless of the current state of the network (deterministic routing is a subset of oblivious routing). While this means oblivious routing can be simple to implement, it also cannot adapt to traffic and network conditions. A well-known example of an oblivious routing method is the Valiant algorithm (known to those skilled in the art). In this method, a packet sent from node A to node B is first sent from A to a randomly chosen intermediate node X (one hop), and then from X to B. Referring again to Figure 6, the Valiant algorithm could randomly select node 2 as the intermediate node from source node 1 to destination node 10, meaning that the path 1-2-3-7-11-10, for example, could be used for routing purposes. This generally randomizes any traffic pattern, and because all patterns then appear uniformly random, the network load is well balanced. In fact, the Valiant algorithm has been shown to balance load for almost any traffic pattern on almost any topology. One problem, however, is that Valiant routing is generally non-minimal (a minimal route is a path requiring the fewest hops between source and destination), which often results in a significant increase in hop count; this further increases network latency and can potentially increase energy consumption. There are exceptions, however. In a network with minimal congestion, non-minimal routing significantly increases latency (and possibly power consumption) as additional nodes are traversed; in a network experiencing congestion, on the other hand, non-minimal routing can actually help avoid busy nodes or hotspots and thereby result in lower latency.

If minimal routing is desired or necessary, the Valiant algorithm can be modified to restrict its random decision to minimal routes/shortest paths by requiring that the intermediate node lie within the minimal quadrant. As an example, referring to Figure 6, the minimal quadrant for node 1 (with node 10 as destination) would contain nodes 2, 5, and 0, and would yield the usual hop count of 3.
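The minimal-quadrant restriction can be sketched as follows (illustrative only, not the patent's code; nodes are (x, y) torus coordinates rather than the numbering of Figure 6): an intermediate node is "in the minimal quadrant" exactly when routing through it does not lengthen the path.

```python
import random

def dist(a, b, dims):
    """Minimal hop count on a torus."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def minimal_quadrant(src, dst, dims):
    """Intermediate nodes lying on some shortest path from src to dst."""
    return [(x, y)
            for x in range(dims[0]) for y in range(dims[1])
            if (x, y) not in (src, dst)
            and dist(src, (x, y), dims) + dist((x, y), dst, dims)
                == dist(src, dst, dims)]

def valiant_minimal_hops(src, dst, dims):
    """Valiant with the intermediate restricted to the minimal quadrant."""
    mid = random.choice(minimal_quadrant(src, dst, dims))
    return dist(src, mid, dims) + dist(mid, dst, dims)
```

However the intermediate is drawn, the total hop count stays equal to the minimal source-to-destination distance, preserving Valiant's randomization without the non-minimal hop penalty.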

The Valiant algorithm provides a level of randomization in path selection that reduces the likelihood of hotspot development. However, there is a further difficulty affecting this efficiency: Valiant routing is deadlock-free only when used in combination with DOR routing, which by itself fails at load balancing and hotspot avoidance.

Adaptive routing algorithms are those in which routing decisions are based on the state of the network or of network traffic. They generally involve flow-control mechanisms, and buffer occupancy is often used in this regard. Adaptive routing can employ global node information (where the cost-performance tradeoff is reasonable), or it can use information from local nodes only, including, for example, queue occupancy as a measure of network congestion. The problem with using information from local nodes alone is that it can sometimes lead to suboptimal choices. Adaptive routing may also be restricted to minimal paths, or it can be fully adaptive (i.e., with no restriction to shortest paths) by employing non-minimal paths, with the attendant potential for livelock (a situation similar to deadlock in which a packet's progress never reaches the destination, often the result of resource starvation). This can sometimes be overcome by allowing a given number of misroutes per packet and by giving higher priority to packets that have been misrouted many times. Another problem with adaptive routing is that it can make it difficult to preserve data packet ordering: packets need to arrive at the destination in the original order, or else a packet-reordering mechanism must be implemented.
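A locally adaptive port choice of the kind described above can be sketched as below. This is a hedged illustration under stated assumptions, not the patent's mechanism: congestion is signalled by a per-port queue depth, minimal-path ports are preferred, and a non-minimal "escape" port is used only when every minimal port is above a threshold (the misroute that the text notes must be bounded to avoid livelock).

```python
def pick_port(minimal_ports, all_ports, queue_depth, threshold=8):
    """Prefer the least-occupied port on a minimal path; misroute only if all
    minimal ports exceed the congestion threshold. (Bounding such misroutes per
    packet, and prioritizing much-misrouted packets, guards against livelock.)"""
    best = min(minimal_ports, key=lambda p: queue_depth[p])
    if queue_depth[best] < threshold:
        return best
    return min(all_ports, key=lambda p: queue_depth[p])   # non-minimal escape
```

With only local queue depths as input, the choice can of course be suboptimal globally, which is precisely the drawback the paragraph above identifies.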

Finally, it is important to mention that routing can be implemented through source tables or local tables. With a source table, the entire route is specified at the source and can be embedded in the packet header. Latency is thereby minimized, because the route does not have to be looked up at each node or routed hop by hop. A source table can also specify multiple routes per destination, in order to manage failures and, when routes are selected randomly (i.e., obliviously), to improve load balancing. With local node tables, on the other hand, smaller routing tables are employed; however, the next step a packet will take is determined at each node, and this adds per-hop latency.

The present invention seeks to overcome the deficiencies in the prior art and to improve upon known methods of routing packets in torus or high-radix network topologies.

Summary of the Invention

In one aspect, the present invention provides a novel method for routing data packets over a torus mesh or high-radix topology that provides low latency and increased throughput while avoiding hotspots and deadlock.

In another aspect, the present invention provides a method for achieving traffic decorrelation (randomization), increased efficiency in bandwidth allocation, load balancing, and traffic engineering in a torus mesh or high-radix structure.

In one embodiment, the present invention provides a computer-implemented method of routing packets from a source node to a destination node in a direct interconnect network, the method comprising the steps of: discovering all nodes in the network topology and all output ports on each node; including the nodes and output ports discovered in the network topology in a topology database, so as to allow said nodes and ports to be included in shortest-path routing calculations; calculating, based on the nodes and output ports contained in the topology database, the shortest paths from each output port on each node to every other node in the network topology; generating at each node a source routing database containing the shortest paths from each output port on that node to all other nodes in the network topology; receiving packets at the source node; sending the received packets to the output ports of the source node in round-robin or weighted round-robin fashion, whereby each of the received packets is thereafter segmented into flits at the output port of the source node and distributed along multiple shortest paths from that output port to the destination node, such that the packets are thereby distributed along multiple alternate routes in the network topology; and reassembling and reordering the packets at the destination node, such that the packets conform to their original form and order.
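The spraying and reordering steps of the embodiment can be sketched as below. This is a simplified, hedged sketch with illustrative names: flit segmentation and the per-port shortest-path database are elided, and whole packets carrying sequence numbers stand in for the flit streams, so only the round-robin port selection at the source and the reordering at the destination are shown.

```python
from itertools import cycle

class Source:
    """Sprays packets over the output ports in round-robin order."""
    def __init__(self, ports):
        self._next_port = cycle(ports)   # e.g. ["x+", "x-", "y+", "y-"]
        self._seq = 0

    def send(self, payload):
        pkt = {"seq": self._seq, "payload": payload}   # sequence number for reordering
        self._seq += 1
        return next(self._next_port), pkt

class Destination:
    """Buffers out-of-order arrivals and releases payloads in original order."""
    def __init__(self):
        self._pending = {}
        self._expect = 0

    def receive(self, pkt):
        self._pending[pkt["seq"]] = pkt["payload"]
        released = []
        while self._expect in self._pending:
            released.append(self._pending.pop(self._expect))
            self._expect += 1
        return released
```

Because successive packets leave on different ports and therefore on different shortest paths, they may arrive out of order; the sequence numbers let the destination restore the original order regardless of the paths taken.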

In another embodiment, the present invention provides a method of routing packets wherein the flits are forwarded to the destination node using wormhole switching.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:

Figure 1 is a high-level view of a conventional data center network implementation;

Figure 2 is a diagram of a 3-dimensional torus interconnect having 8 nodes;

Figure 3 is a diagram showing the hierarchy of the IBM Blue Gene processing units employing a torus architecture;

Figure 4 is a high-level diagram of the 36 nodes of a 6x6 2-D folded torus, showing the minimum number of hops required to reach any of the other 35 nodes starting from corner node (1.6);

Figure 5 is a diagram of a 4x4 2-D torus implementation showing several routes from node S to node D starting on each port (x+, x-, y+, y-) of node S;

Figure 6 is a diagram of a simple network topology used to help show how various routing algorithms can route packets from node to node;

Figure 7 is a high-level diagram of a high-speed network interface card, with all its ports, of a system according to the present invention;

Figure 8 is a high-level diagram of a host network interface (HNI) card showing the key functional blocks of a system according to the present invention;

Figure 9 is a high-level view of the packet segmentation required to implement packet transfer across the torus mesh;

Figure 10 shows a diagram of the routes and hop counts from each port on node S (2.4) to each port on node D (5.2) in a 36-node 2-D torus structure;

Figure 11 shows a high-level diagram presenting the software components TD, SR, WS, and FC according to the present invention;

Figure 12 shows a flowchart of the packet segmentation process according to the present invention;

Figure 13 shows pseudocode implementing flit generation through packet segmentation;

Figure 14 shows a flowchart of the processing steps for preparing source routes from each port according to the present invention;

Figure 15 shows pseudocode implementing the path calculation according to the present invention;

Figure 16 shows the format of the source routing table generated by the path calculation according to the present invention;

Figure 17 shows a flowchart of the process of transmitting flits on each port according to the present invention;

Figure 18 shows the wormhole switching process for transmitting flits on each port according to the present invention;

Figure 19 shows pseudocode implementing the flit transfer described in the flowchart of Figure 17 according to the present invention;

Figure 20 shows a flowchart for discovering nodes or links, which triggers the topology discovery process for topology updates;

Figure 21 shows a flowchart describing the topology database update;

Figure 22 shows a flowchart of the process of node and/or link failure detection and recovery and the process of topology update; and

Figure 23 shows pseudocode implementing link and/or node failure detection according to the present invention.

DETAILED DESCRIPTION

The present invention utilizes a torus mesh or high-radix wiring to implement a direct interconnect switch for data center applications. In particular, the invention is especially useful with the torus system described in related PCT Patent Application No. PCT/CA2014/000652, filed on August 29, 2014, entitled "Method and Apparatus to Manage the Direct Interconnect Switch Wiring and Growth in Computer Networks," the disclosure of which is incorporated herein by reference. Such an architecture can provide a high-performance network interconnect having tens of thousands of servers in a single switching domain.

The software that controls the switch follows the software-defined networking model. The switch control plane (switch OS) preferably runs on a separate server, but interacts closely with the distributed processing running on each PCIe card. Referring to Figure 11, the key software components include topology discovery (TD), packet routing across the topology (SR), wormhole switching (WS), the flow control (FC) mechanism, and switch state monitoring and switch management. The TD component is responsible for essentially discovering the presence (or absence) of nodes in the torus mesh (and thereby allowing the possible paths to be updated based on node presence); the SR component is responsible for source route computation (all the paths available from every port of a source node to any destination node); the WS component is responsible for steering flits across the torus based on a path vector included in the flit header, with minimal latency (a cut-through model at every node); the FC component is responsible for torus link utilization and controls messages to the sources in order to regulate traffic via an input queuing mechanism; the switch state monitoring component is responsible for traffic balancing, traffic rerouting under failure conditions, and support for switch growth; and the switch management component is responsible for maintaining a complete image of the switch, including all interconnected servers, all interconnected I/O links to the outside world, and connectivity status.

本发明主要涉及本发明的新颖拓扑发现(TD组件)和路由协议(SR组件)-尤其可用于实现在PCT专利申请号PCT/CA2014/000652中所公开的新颖交换机架构上的关键功能。虽然通过环形结构路由数据分组的传统的实施方式做出隐式假定,即业务跨越目的地节点被随机地分发,但是由于跨越环形传播并且有能力完全禁用跨越环形网格传送分组的能力的热点发展,实际的实施例已经实际上在最大容量下的单个源/单个目的地连续分组流的情况下显示出了交换机故障。本发明因此更具体地参考用于诸如环形和高基数网络的直接互连网络拓扑的路由优化和实现,以便对于任何类型的业务解决性能问题和热点发展(无论是从任何源到任何目的地的大型流(高度相关的)还是到随机分布式目的地的业务(典型的互联网业务))。与如上面先前所描述的在环形中路由分组的已知方法相反,本发明描述了发现并更新拓扑中的所有节点的方法,以及向每个目的地节点路由分组的方法,其中最短路径在每个节点处在每个输出端口(例如,x+、x-、y+、y-、z+、z-、w+和w-)上被计算,并且由此分组被以循环方式或加权循环方式沿着从每个输出端口起的路径分发(即,促进多个替换的路径)以增加带宽和业务分发从而避免热点。流控(FC)机制被用来将用于分组路由的输出端口选择从简单循环连续的源节点输出端口选择(即,首先端口x+,接下来x-,然后y+、y-、z+、z-、w+和w-)改变为由FC标识为忙碌的链路被跳过的加权循环,或者那些端口上的业务量被降低至最大链路容量的低百分比(例如,10%)。The present invention is primarily directed to the novel topology discovery (TD component) and routing protocol (SR component) - particularly useful for implementing key functionality on the novel switch architecture disclosed in PCT Patent Application No. PCT/CA2014/000652. While conventional implementations of routing data packets through a ring structure make the implicit assumption that traffic is randomly distributed across destination nodes, practical embodiments have in fact demonstrated switch failures in the presence of single-source/single-destination continuous packet flows at maximum capacity, due to the development of hotspots that propagate across the ring and can completely disable the transfer of packets across the ring mesh. The present invention therefore more specifically relates to routing optimization and implementation for direct interconnect network topologies such as rings and high-radix networks, in order to address performance issues and hotspot development for any type of traffic, whether large flows from any source to any destination (highly correlated) or traffic to randomly distributed destinations (typical Internet traffic).
In contrast to the known methods of routing packets in a ring described above, the present invention describes a method of discovering and updating all nodes in the topology, and a method of routing packets to each destination node, wherein the shortest path is calculated at each node for each output port (e.g., x+, x-, y+, y-, z+, z-, w+, and w-), and whereby packets are distributed along the paths from each output port in a round-robin or weighted round-robin manner (i.e., promoting multiple alternative paths) to increase bandwidth and traffic distribution and thereby avoid hot spots. A flow control (FC) mechanism is used to change the output-port selection for packet routing from a simple round-robin sequential selection of the source node's output ports (i.e., port x+ first, next x-, then y+, y-, z+, z-, w+, and w-) to a weighted round-robin in which links identified as busy by FC are skipped, or in which the traffic on those ports is reduced to a low percentage (e.g., 10%) of the maximum link capacity.
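The weighted round-robin port selection described above can be sketched as follows. This is a minimal illustration, assuming hypothetical port names and a 10% credit rate for busy links; the patent does not specify the exact mechanism:

```python
from itertools import cycle

PORTS = ["x+", "x-", "y+", "y-", "z+", "z-", "w+", "w-"]

def make_port_selector(busy=None, busy_weight=0.1):
    """Yield output ports round-robin; ports flagged busy by flow
    control only accumulate a fraction of a send credit per pass,
    so traffic on them drops to roughly busy_weight of capacity."""
    busy = busy or set()
    credits = {p: 0.0 for p in PORTS}
    for p in cycle(PORTS):
        credits[p] += busy_weight if p in busy else 1.0
        if credits[p] >= 1.0:
            credits[p] -= 1.0
            yield p

# With x+ marked busy by FC, it is skipped on most passes.
sel = make_port_selector(busy={"x+"})
first_20 = [next(sel) for _ in range(20)]
```

Setting `busy_weight=0.0` would skip busy links entirely (the first variant described above), while a small positive value implements the reduced-percentage variant.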

数据分组重排序被实现在节点目的地处。这样的方法能够特别利用PCT专利申请号PCT/CA2014/000652中讨论并且在本文中进一步讨论的节点(服务器)发现的新颖方法。本发明因此解决针对诸如环形或高基数网络的直接互连网络拓扑的分组路由优化的问题,以便对于任何类型的业务解决性能问题和热点发展,而不管那是从任何源到任何目的地的大型流(高度相关的)还是到随机分布式目的地的业务(典型的互联网业务)。Data packet reordering is implemented at the node destination. Such an approach can particularly leverage the novel approach to node (server) discovery discussed in PCT Patent Application No. PCT/CA2014/000652 and further discussed herein. The present invention thus addresses the problem of packet routing optimization for directly interconnected network topologies such as ring or high-radix networks, in order to address performance issues and hotspot development for any type of traffic, whether it is large flows from any source to any destination (highly correlated) or traffic to randomly distributed destinations (typical Internet traffic).

首先，图7显示了具有如下所有端口的高速网络接口卡10的高级别框图:用于访问外部世界的I/O端口13(N-S接口)、4D环形互连端口14、用于访问计算机服务器的PCIe端口15,以及控件19。本领域的技术人员将理解,本发明能够与具有更少或更多端口的HNI卡一起并且与其它环形或高基数拓扑(例如诸如“蜻蜓”)相关联地工作。First, FIG. 7 shows a high-level block diagram of a high-speed network interface card 10 having all of the following ports: I/O ports 13 (N-S interface) for accessing the outside world, 4D ring interconnect ports 14, a PCIe port 15 for accessing the computer server, and controls 19. Those skilled in the art will appreciate that the present invention can work with HNI cards having fewer or more ports and in association with other ring or high-radix topologies (such as, for example, a "dragonfly").

图8利用关键功能块示出了根据本发明的系统的HNI卡10的高级别框图,所述关键功能块包括具有RAM 37和ROM 38的处理器31、交换块32、物理接口装置36、环形互连端口14、I/O端口13以及PCIe端口15。这个实施例的物理实现基于具有附连存储器的多核处理器以及用于实现环形互连的PHY装置。使用处理器的支持多个队列操纵的硬件加速辅助能力来在软件中实现交换功能。本领域的技术人员将理解,本发明也将在物理实现方式不同的情况下(即,在交换功能加速用FPGA或ASIC实现的情况下)工作,或者,处理器不必驻留在与用于交换功能的卡相同的卡上。FIG. 8 shows a high-level block diagram of the HNI card 10 of a system according to the present invention in terms of its key functional blocks, including a processor 31 with RAM 37 and ROM 38, a switching block 32, physical interface devices 36, ring interconnect ports 14, I/O ports 13, and a PCIe port 15. The physical implementation of this embodiment is based on a multi-core processor with attached memory and PHY devices implementing the ring interconnect. The switching function is implemented in software using the processor's hardware-acceleration assist capabilities supporting multiple-queue manipulation. Those skilled in the art will appreciate that the present invention will also work with different physical implementations (i.e., where the switching-function acceleration is implemented with an FPGA or ASIC), or where the processor does not reside on the same card as that used for the switching function.

图10显示了根据本发明的用于在说明分组路由时使用的简单2D环形布线图。如图所示,该结构是其中每个连接的长度自始至终相等的折叠2D环形。此图中的每个节点表示经由被容纳在服务器中的PCIe交换卡(如上描述的)互连的服务器。该图示出了如在源节点S处计算的在各个维度(x+、x-、y+、y-)上开始从源节点S(2.4)的端口到目的地节点D(5.2)的最短最大程度不相交路径(无公共链路或节点)。在这方面,图10将三个计算的路径和路由向量示出为具有5跳的最小跳计数(从(S)2.4起:(x+、x+、x+、y-、y-);从(S)2.4起:(y-、y-、x+、x+、x+);并且从(S)2.4起:(x-、x-、y-、y-、x-)),然而另一路径被计算为具有7跳的非最小跳计数(从(S)2.4起:(y+、y+、y+、x+、x+、x+、y+))。FIG. 10 shows a simple 2D ring wiring diagram used to illustrate packet routing according to the present invention. As shown, the structure is a folded 2D ring in which every connection is of equal length throughout. Each node in this diagram represents a server interconnected via a PCIe switch card (as described above) housed in the server. The diagram shows the shortest maximally disjoint paths (with no common links or nodes), as calculated at the source node S, starting in each dimension (x+, x-, y+, y-) from the ports of source node S (2.4) to destination node D (5.2). In this regard, FIG. 10 shows three calculated paths and routing vectors with the minimum hop count of 5 hops (from (S)2.4: (x+, x+, x+, y-, y-); from (S)2.4: (y-, y-, x+, x+, x+); and from (S)2.4: (x-, x-, y-, y-, x-)), whereas another path is calculated with a non-minimum hop count of 7 hops (from (S)2.4: (y+, y+, y+, x+, x+, x+, y+)).
Assuming a large volume of traffic between the source node S and node D, the packets/flits will be sent to the destination node in a round-robin or weighted round-robin manner along the shortest paths starting at each output port of source node S, thereby using multiple paths to increase bandwidth and distributing the traffic to avoid hotspot development (a weighted round-robin selection of output ports at the source node will increase the number of packets along a particular path). A second alternative route can also be calculated and stored at each node, whereby, apart from the first hop, the route provides an alternative, separate path to help overcome network congestion and hotspots when necessary in the event of a link or node failure. This method distributes traffic across the ring or high-radix network for load-balancing purposes, but can introduce packet reordering. For example, because the path starting on y+ (FIG. 10) is 2 hops longer than the minimum path (5 hops), all packets sent on this path will arrive with additional delay that may create packet reordering problems. According to the present invention, the packets must therefore be reordered at the destination node. This is within the skill of those in the art.
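Reordering at the destination node, as required above, might look like the following per-source sketch; the sequence-number field is an assumption for illustration and is not part of the patent's header format:

```python
import heapq

class ReorderBuffer:
    """Release packets from one source in their original sequence
    order, even when multiple disjoint paths deliver them out of
    order (e.g., a 7-hop path arriving after a 5-hop path)."""
    def __init__(self):
        self.next_seq = 0
        self.pending = []  # min-heap of (seq, payload)

    def receive(self, seq, payload):
        heapq.heappush(self.pending, (seq, payload))
        released = []
        # Release every packet that is now contiguous with the
        # last in-order packet delivered to the host.
        while self.pending and self.pending[0][0] == self.next_seq:
            released.append(heapq.heappop(self.pending)[1])
            self.next_seq += 1
        return released

buf = ReorderBuffer()
out = []
# Packets 1 and 3 took a longer path and arrive late.
for seq, data in [(0, "p0"), (2, "p2"), (1, "p1"), (4, "p4"), (3, "p3")]:
    out.extend(buf.receive(seq, data))
```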

参考示出了分组分段的高级视图的图9,重要的是注意,遍历直接互连环形或高基数的分组在路由期间被分段成更小大小的分组(微片)。分组分段是跨越环形网格所需要的,以便避免“线端阻塞”(使用公共资源(链路和缓冲器)的分组网络中的常见情况)。微片的大小由业务类型确定以便使开销税最小化,并且实际值范围在64个字节与512个字节之间。重要的是注意,本发明能够处理以太网分组、Infiniband分组、PCIe流或NVMe事务等等。跨越环形结构的微片在透明地承载业务。外部接口I/O端口13特定于业务的类型并且终止用户业务,从而经由“微片”跨越环形将数据透明地传送到目的地PCIe端口15。重要的是注意,本发明能够处理跨越环形(经由以太网微片)的以太网分组。本文的关键优点是微片的可变大小,因为以太网分组的微片具有分组划界机制。因此,微片大小从64个字节(最小以太网分组大小)到512个字节(净荷和报头)是可变的。Referring to FIG. 9, which shows a high-level view of packet segmentation, it is important to note that packets traversing the direct interconnect ring or high-radix network are segmented into smaller packets (flits) during routing. Packet segmentation is required across the ring mesh in order to avoid "head-of-line blocking" (a common situation in packet networks that share common resources (links and buffers)). The flit size is determined by the traffic type in order to minimize overhead, and practical values range between 64 bytes and 512 bytes. It is important to note that the present invention can handle Ethernet packets, Infiniband packets, PCIe streams, NVMe transactions, and so on. Flits carry traffic transparently across the ring structure. The external-interface I/O port 13 is specific to the traffic type and terminates the user traffic, transparently transporting the data across the ring to the destination PCIe port 15 via "flits". It is important to note that the present invention can handle Ethernet packets across the ring (via Ethernet flits). A key advantage here is the variable flit size, since flits of Ethernet packets carry a packet-delimitation mechanism. The flit size is therefore variable from 64 bytes (the minimum Ethernet packet size) to 512 bytes (payload and header).
If a packet is larger than 512 bytes, the system segments the original packet into 512-byte flits: a head flit, body flits, and a tail flit.

图12提供了呈现分组分段的方法的流程图。一旦从主机(即,PCIe接口)接收到新分组(步骤402),HNI处理器就提取报头以便让处理器分析该分组的目的地(步骤404)和路由路径。在这个阶段,分组进入分段循环。字节计数器被初始化(步骤412)并且分组结束-EOP-分析器(步骤405)检测EOP,并且如果计数器小于或者等于476个字节(即,净荷小于或者等于476个字节),则分组被发送到微片生成器(步骤407)、报头生成器(步骤408)以及优先级判定块(步骤409)。微片此后被插入到指派的出口队列中(步骤410)以用于在环形网格上被发送到下一跳,并且我们注意看是否已接收到新分组(402)。字节计数器在每个单独的微片之后递增(步骤411),分组分段被发送到微片生成器(步骤407)并且过程遵循上面所描述的步骤,直到分组中的最后微片到达并被发送为止。作为关键观察结果,环形链路上的微片优选地是后面有8个“空闲”字节的512个字节(476个字节的净荷和36个字节的报头)。如果分组比476个字节短,则微片(通常是尾微片)将具有不到512个字节,并且将使用“空闲”字节完成分组划界。FIG. 12 provides a flow chart presenting the method of packet segmentation. Once a new packet is received from the host (i.e., the PCIe interface) (step 402), the HNI processor extracts the header so that the processor can analyze the packet's destination (step 404) and routing path. At this stage, the packet enters the segmentation loop. A byte counter is initialized (step 412) and the end-of-packet (EOP) analyzer (step 405) detects the EOP; if the counter is less than or equal to 476 bytes (i.e., the payload is less than or equal to 476 bytes), the packet is sent to the flit generator (step 407), the header generator (step 408), and the priority decision block (step 409). The flit is thereafter inserted into the assigned egress queue (step 410) to be sent to the next hop on the ring mesh, and the process checks whether a new packet has been received (step 402). The byte counter is incremented after each individual flit (step 411), the packet segment is sent to the flit generator (step 407), and the process follows the steps described above until the last flit of the packet arrives and is sent. As a key observation, a flit on a ring link is preferably 512 bytes (476 bytes of payload and 36 bytes of header) followed by 8 "idle" bytes. If the packet is shorter than 476 bytes, the flit (usually the tail flit) will have fewer than 512 bytes, and the "idle" bytes will be used to complete the packet delimitation.
Figure 13 provides pseudo code to implement flit generation through packet segmentation.
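The segmentation loop of FIG. 12 can be sketched roughly as follows; the header fields shown are simplified assumptions for illustration, not the actual 36-byte ring header:

```python
FLIT_PAYLOAD = 476  # 512-byte flit minus the 36-byte header

def segment(packet: bytes, dest: int, pkt_id: int):
    """Split a packet into head/body/tail flits; the byte_count
    field records each flit's offset in the original packet so
    the destination can reassemble it in order."""
    flits = []
    for offset in range(0, len(packet), FLIT_PAYLOAD):
        chunk = packet[offset:offset + FLIT_PAYLOAD]
        if offset == 0:
            kind = "single" if len(packet) <= FLIT_PAYLOAD else "head"
        elif offset + FLIT_PAYLOAD >= len(packet):
            kind = "tail"  # may be shorter than a full flit
        else:
            kind = "body"
        flits.append(({"dest": dest, "pkt_id": pkt_id,
                       "byte_count": offset, "kind": kind}, chunk))
    return flits

# A 1000-byte packet becomes 476 + 476 + 48 bytes of payload.
flits = segment(b"x" * 1000, dest=5, pkt_id=42)
```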

在目的地处,通过剥去环形报头并且对于微片在原始分组中的偏移使用“字节计数”,微片被组装成分组。本领域的技术人员将知道为目的地节点实现每源分组队列,创建分组分段的有序列表,并且一旦被按照原始顺序完全组装就释放分组。图14示出了根据本发明的呈现路由计算的流程图。本发明还在图15处提供伪代码用于计算在源节点的每个输出端口上最大不相交的(无公共链路和节点)从源节点到任何目的地节点的所有路径(源路由)。一检测到更新(即,每当已经添加/检测到节点或者存在与其相关联的临界问题/故障时)拓扑更新功能412(其发现活动服务器/节点及其活动端口)就将启动路由/路径计算(414)。At the destination, the flits are assembled into packets by stripping off the ring header and using the "byte count" as the flit's offset within the original packet. Those skilled in the art will know how to implement per-source packet queues at the destination node, create an ordered list of the packet segments, and release a packet once it has been fully assembled in its original order. FIG. 14 shows a flow chart presenting the route calculation according to the present invention. The present invention also provides, at FIG. 15, pseudocode for calculating all maximally disjoint paths (with no common links or nodes) from the source node to any destination node on each output port of the source node (source routes). As soon as an update is detected (i.e., whenever a node has been added/detected or a critical problem/failure associated with it exists), the topology update function 412 (which discovers the active servers/nodes and their active ports) starts the route/path calculation (414).
As described in step 419 (find the first destination node in the topology list), followed by steps 420 and 421 (search all destination nodes in the topology list), the shortest path is calculated according to the present invention using Dijkstra's algorithm between the current node of the topology and each destination node (step 415). In step 422, all path costs affected by step 417 are reset to their initial values. The calculated shortest path from source to destination is stored in the source routing database, and it is removed from further shortest-path calculations (in order to provide disjoint paths) by increasing the cost of all links belonging to the stored path to infinity (417) - this has the practical effect of removing those nodes and links from the topology database so that they are no longer used for disjoint path selection. The process continues until all output ports have been used and all possible paths between the source node and the destination node are stored in the database. The results are then summarized in the output routing table (418) as routing vectors for each node, where routing starts with the shortest path first. The shortest-path calculation itself can be implemented using Dijkstra's algorithm (a well-known method used in routing-protocol implementations). What is particularly novel here is that the shortest path is determined at each output port and used accordingly.
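The successive shortest-path computation described above (Dijkstra, then raising the cost of each used link to infinity per step 417) can be illustrated as follows. The 4-node ring graph is a toy example, not the 4D torus:

```python
import heapq

def dijkstra(adj, src, dst):
    """Shortest path src->dst over adj[u][v] = cost; None if
    unreachable (infinite-cost links are never taken)."""
    dist, prev, done = {src: 0.0}, {}, set()
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == dst:
            break
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in done:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def disjoint_paths(adj, src, dst, k=8):
    """Repeat Dijkstra; after each found path, raise the cost of
    its links to infinity so later paths share no links."""
    paths = []
    for _ in range(k):
        p = dijkstra(adj, src, dst)
        if p is None:
            break
        paths.append(p)
        for a, b in zip(p, p[1:]):
            adj[a][b] = adj[b][a] = float("inf")
    return paths

# 4-node ring: two link-disjoint routes exist between nodes 0 and 2.
ring = {0: {1: 1, 3: 1}, 1: {0: 1, 2: 1},
        2: {1: 1, 3: 1}, 3: {0: 1, 2: 1}}
routes = disjoint_paths(ring, 0, 2)
```

Resetting the raised costs to their initial values afterwards corresponds to step 422.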

图16呈现路由输出表的格式。每个路由具有表示特定路径的跳数以及要在每跳上选择的输出端口的列表/向量的附连数目。特别地,图16呈现从源节点号1000到目的地节点0(节点1000和节点0的使用仅仅用于示例目的)的具有8个端口(x+、x-、y+、y-、z+、z-、w+和w-)的4-D环形(8x8x8x8)中的路由的输出表的示例快照。应当注意的是,8个路由的第一组中的8个路由(即,x+、y+、z+、w+、x-、y-、z-和w-)的组中的第一输出选择例如对应于源节点的每个输出端口。8个路径完全不相交并且具有不同的长度:5跳、7跳或9跳。最重要的特征是跨越启用业务分发和可预测性能的环形的路径分发。图16中的8个路径的第二组表示路由路径的第二替代方案。该方法可以计算许多替换的路由,但是距离(跳数)将变得越来越大。由于实际的原因,该方法可以根据需要通常仅存储8个不同路径的第一选项。FIG. 16 presents the format of the routing output table. Each route has an attached number representing the hop count of the particular path and a list/vector of the output ports to be selected at each hop. In particular, FIG. 16 presents an example snapshot of the output table for routes from source node number 1000 to destination node 0 (the use of node 1000 and node 0 is for example purposes only) in a 4-D ring (8x8x8x8) with 8 ports (x+, x-, y+, y-, z+, z-, w+, and w-). It should be noted that the first output selection in the first group of 8 routes (i.e., x+, y+, z+, w+, x-, y-, z-, and w-) corresponds to each output port of the source node. The 8 paths are completely disjoint and have different lengths: 5 hops, 7 hops, or 9 hops. The most important feature is the path distribution across the ring, which enables traffic distribution and predictable performance. The second group of 8 paths in FIG. 16 represents a second alternative for the routing paths. The method can calculate many alternative routes, but the distance (hop count) becomes larger and larger. For practical reasons, the method typically stores only the first option of 8 distinct paths, as needed.
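The shape of such an output routing table might be sketched as follows; the stored vectors are borrowed loosely from the 2D example of FIG. 10, and the destination key and lookup helper are illustrative assumptions:

```python
# Illustrative per-node output routing table: for each destination,
# disjoint route vectors (one starting on each output port),
# shortest first. Values here are examples, not real table contents.
routing_table = {
    "5.2": [
        {"hops": 5, "ports": ["x+", "x+", "x+", "y-", "y-"]},
        {"hops": 5, "ports": ["y-", "y-", "x+", "x+", "x+"]},
        {"hops": 7, "ports": ["y+", "y+", "y+", "x+", "x+", "x+", "y+"]},
        # ...one entry per output port (8 in a 4D torus)
    ],
}

def route_vector(dest, choice):
    """Return the choice-th stored route vector for dest, wrapping
    round-robin over the stored alternatives."""
    routes = routing_table[dest]
    return routes[choice % len(routes)]["ports"]
```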

图17示出了根据本发明的表示分组交换中使用的步骤的流程图。FIG. 17 shows a flow chart representing the steps used in packet switching according to the present invention.

随着新分组被推送到输出队列中,分组传送状态机启动跨越环形的传送。在这方面,讨论的(in question)输出端口“k”(端口1至端口8,即上面所讨论的输出端口:x+、y+、z+、w+、x-、y-、z-和w-)被用来遵循虫孔模型传送微片(如图18中所示):头微片和后续微片递增(体微片),将被沿着从端口k起的同一路径传送,直到如图17中所示到达分组结束(尾微片)为止。一旦到达分组结束,计数器k就递增以指示下一个输出端口并且具有下一个分组标识符的下一个分组(被指定为下一个有序分组)使用这个虫孔交换模型在逐微片基础上在环形上被传送。一旦检测到分组结束,端口号就被再次递增(即,k=k+1)并且下一个分组将被传送等等。这个实施例将在循环模型中在节点的所有输出端口14上分发源自一个节点的分组,以便使长的数据流与特定目的地去相关并且在环形网格中避免“热点”发展。本领域的技术人员将理解,能够根据需要修改这个分发方法以必要时针对不同类型的业务分发提高性能。As new packets are pushed into the output queues, the packet transfer state machine starts the transfer across the ring. In this regard, the output port "k" in question (port 1 through port 8, i.e., the output ports discussed above: x+, y+, z+, w+, x-, y-, z-, and w-) is used to transfer flits following the wormhole model (as shown in FIG. 18): the head flit and the subsequent (body) flits are transferred along the same path from port k until the end of packet (the tail flit) is reached, as shown in FIG. 17. Once the end of packet is reached, the counter k is incremented to indicate the next output port, and the next packet with the next packet identifier (designated as the next in-order packet) is transferred over the ring on a flit-by-flit basis using this wormhole switching model. Once the end of packet is detected, the port number is incremented again (i.e., k=k+1) and the next packet is transferred, and so on. This embodiment distributes the packets originating from one node over all of the node's output ports 14 in a round-robin model, in order to decorrelate long data flows from particular destinations and to avoid the development of "hot spots" in the ring mesh. Those skilled in the art will appreciate that this distribution method can be modified as needed to improve performance for different types of traffic distribution where necessary.
For example, the round-robin method can be replaced with a weighted round-robin method or any other user-defined method. The method of the present invention has been found to perform better than Valiant oblivious distribution, and to be very consistent and predictable in its results. This is very significant from a network performance point of view. FIG. 19 shows pseudocode implementing the flit transfer as described and shown in the flow chart at FIG. 17.
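The per-packet round-robin port selection of FIG. 17 can be sketched as follows; flit contents and the port ordering are illustrative, and the sketch models only the selection logic, not the per-hop wormhole forwarding:

```python
# All flits of one packet follow the same path (wormhole model);
# the port counter k advances once per packet, not per flit.
PORTS = ["x+", "y+", "z+", "w+", "x-", "y-", "z-", "w-"]

def transmit(packets):
    """packets: list of flit lists. Returns (port, flit) send
    events, round-robin over output ports per packet."""
    events, k = [], 0
    for flits in packets:
        port = PORTS[k % len(PORTS)]  # same path for the whole packet
        for flit in flits:
            events.append((port, flit))
        k += 1  # end of packet detected: next packet, next port
    return events

# Packet 0 goes entirely on x+, packet 1 entirely on y+.
ev = transmit([["h0", "b0", "t0"], ["h1", "t1"]])
```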

图20示出了根据本发明的呈现拓扑发现的方法的流程图。如上面所指出的,拓扑发现组件(TD)对于基本上发现节点(和节点端口)存在(或不存在)于环形网格中是重要的,所述节点/端口然后可以被用在以上路由分发中。在这方面,TD允许如PCT专利申请号PCT/CA2014/000652中特别公开的本发明的新颖网络以简化且高效方式缩放。TD基于“Hello”协议和数据库交换协议。特别地,在每个节点和每个端口上发送并接收“Hello”消息。一旦消息作为“Hello”被接收到,并且“Hello”分组中的节点ID被管理框架验证,则本地节点信息就被更新以指示到邻居的新链路的可用性。节点信息的收集在每个节点上被维护在邻居数据库中并且通过数据库交换协议与邻居同步。每个节点将它对邻居数据库的改变发送到邻居节点,从而逐步传播整个拓扑。在特定超时之后,如果对邻居数据库的改变发生了,则为路径计算生成功能调用。对于“Hello”消息交换,邻居到邻居永远以定义的时间间隔继续,以便检测拓扑中的改变。FIG. 20 shows a flow chart presenting the method of topology discovery according to the present invention. As noted above, the topology discovery (TD) component is important for essentially discovering the presence (or absence) of nodes (and node ports) in the ring mesh; those nodes/ports can then be used in the route distribution above. In this regard, TD allows the novel network of the present invention, as particularly disclosed in PCT Patent Application No. PCT/CA2014/000652, to scale in a simplified and efficient manner. TD is based on a "Hello" protocol and a database exchange protocol. In particular, "Hello" messages are sent and received on every node and every port. Once a message is received as a "Hello", and the node ID in the "Hello" packet is validated by the management framework, the local node information is updated to indicate the availability of a new link to the neighbor. The collected node information is maintained on each node in a neighbor database and synchronized with the neighbors through the database exchange protocol. Each node sends its changes to the neighbor database to its neighbor nodes, thereby progressively propagating the entire topology. After a certain timeout, if changes to the neighbor database have occurred, a function call for path calculation is generated. The "Hello" message exchange between neighbors continues indefinitely at defined time intervals in order to detect changes in the topology.
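A minimal sketch of the Hello handling described above, assuming a simple dict-based neighbour database and an illustrative validity check in place of the management framework:

```python
def process_hello(neighbor_db, valid_ids, port, hello):
    """Update the local neighbour database from a Hello received on
    a port. Returns True if the topology changed (which would
    trigger a database exchange and, after a timeout, a route
    recalculation); False for keep-alives and invalid node IDs."""
    node_id = hello["node_id"]
    if node_id not in valid_ids:  # management-framework validation
        return False
    if neighbor_db.get(port) != node_id:
        neighbor_db[port] = node_id  # new or changed link to neighbour
        return True
    return False  # no change; Hello acts as a keep-alive

db = {}
changed = process_hello(db, valid_ids={7, 9}, port="x+",
                        hello={"node_id": 7})
```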

图21示出了根据本发明的呈现针对拓扑发现的DB交换更新的流程图。FIG. 21 shows a flow chart presenting DB exchange updates for topology discovery according to the present invention.

这基本上涉及负责在每个节点中维护交换机的完整映像的TD组件,包括所有经互连的服务器以及输入/输出链路和连接性状态。This essentially involves the TD component being responsible for maintaining a complete image of the switch in each node, including all interconnected servers and input/output links and connectivity status.

图22提供了呈现本发明的系统如何发现节点/链路故障检测的流程图以及重新路由以避开已出故障的节点的过程。随着链路故障被检测到602,直通块被编程使得去往已出故障的链路的业务被重新路由到CPU 603。业务源被通知以避开已出故障的链路发送业务604。这触发拓扑更新事件605。FIG. 22 provides a flow chart presenting how the system of the present invention performs node/link failure detection and the process of rerouting to avoid the failed node. As a link failure is detected 602, the pass-through block is programmed so that traffic destined for the failed link is rerouted to the CPU 603. The traffic sources are notified to send traffic avoiding the failed link 604. This triggers a topology update event 605.

随着链路恢复被检测到606,直通块被编程使得由于已出故障的链路而去往CPU的业务被重新路由回到同一链路607。As link recovery is detected 606 , the pass-through block is programmed so that traffic destined for the CPU due to the failed link is rerouted back to the same link 607 .

业务源被通知以链路故障之前的同一方式发送业务608。这触发拓扑更新事件609。在链路切断的情况下将发生类似的过程。图23提供了针对链路/节点故障检测和恢复的伪代码。The traffic source is informed to send traffic in the same manner as before the link failure 608. This triggers a topology update event 609. A similar process will occur in the case of a link cut. Figure 23 provides pseudo code for link/node failure detection and recovery.

本发明的一个实施例是用于新节点插入的方法。为了插入节点,互连现有节点的布线被中断(链路切断)。随着环形布线中断被检测到602,业务重新路由API被调用604。链路成本被设置为无限大并且拓扑被更新(为所有交换机节点计算最短(较低成本)路径)。这定义将被用来控制分组流以到达任何目的地的路由。新节点插入触发邻居发现。管理软件检查新节点ID并且与所提供的数据库进行比较。如果节点是有效的,则软件启动拓扑更新和路由计算的过程。管理数据库将随着新节点被更新并且节点状态将被设置为活动的。One embodiment of the present invention is a method for new node insertion. To insert a node, the cabling interconnecting existing nodes is interrupted (link cut). As the ring cabling interruption is detected 602, the traffic rerouting API is called 604. The link cost is set to infinity and the topology is updated (the shortest (lower-cost) paths are calculated for all switch nodes). This defines the routes that will be used to steer the packet flows to reach any destination. The new node insertion triggers neighbor discovery. The management software checks the new node ID and compares it against the provisioned database. If the node is valid, the software starts the topology update and route calculation process. The management database is updated with the new node and the node status is set to active.

尽管已经描述了本发明的特定实施例,然而对于本领域的技术人员而言将显而易见的是,可以在以下权利要求的范围内做出对实施例的变化和修改。While specific embodiments of the present invention have been described, it will be apparent to those skilled in the art that changes and modifications of the embodiments may be made within the scope of the following claims.

Claims (2)

1.一种在直接互连网络中从源节点向目的地节点路由分组的计算机实现的方法,该方法包括以下步骤:1. A computer-implemented method for routing packets from a source node to a destination node in a directly interconnected network, the method comprising the following steps: 在网络拓扑中发现所有节点以及每个节点上的所有输出端口;Discover all nodes and all output ports on each node in the network topology; 将在所述网络拓扑中发现的节点和输出端口包括在拓扑数据库中以便允许所述节点和端口被包括在最短路径路由计算中;The nodes and output ports found in the network topology are included in the topology database so that the nodes and ports can be included in the shortest path route calculation; 基于包含在所述拓扑数据库中的那些节点和输出端口来计算从每个节点上的每个输出端口到所述网络拓扑中的每个其它节点的最短路径;The shortest path from each output port on each node to each other node in the network topology is calculated based on the nodes and output ports contained in the topology database. 在每个节点上生成源路由数据库,其包含从所述每个节点上的每个输出端口到所述网络拓扑中的所有其它节点的最短路径;A source routing database is generated on each node, which contains the shortest paths from each output port on each node to all other nodes in the network topology; 在所述源节点处接收分组;Receive packets at the source node; 以循环方式或加权循环方式将所接收到的分组发送到所述源节点的输出端口,由此所述接收到的分组中的每一个此后在所述源节点的所述输出端口处被分段成更小的分组并且沿着从所述源节点上的所述输出端口到所述目的地节点的多个最短路径而被分发,使得所述分组从而沿着所述网络拓扑中的多个替换的路由而被分发;以及The received packets are sent to the output port of the source node in a round-robin or weighted round-robin manner, whereby each of the received packets is subsequently segmented into smaller packets at the output port of the source node and distributed along multiple shortest paths from the output port of the source node to the destination node, such that the packets are distributed along multiple alternative routes in the network topology; and 在所述目的地节点处对所述分组进行重组和重排序,使得所述分组符合它们的原始形式和次序。At the destination node, the packets are reorganized and reordered so that they conform to their original form and order. 2.根据权利要求1所述的计算机实现的方法,其中通过使用虫孔交换将所述更小的分组转发到所述目的地节点。2. 
The computer-implemented method of claim 1, wherein the smaller packets are forwarded to the destination node using wormhole switching.
HK17100823.4A 2014-02-13 2015-02-13 Method to route packets in a distributed direct interconnect network HK1227195B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US61/939,487 2014-02-13

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
HK42020004075.6A Division HK40013842B (en) 2014-02-13 2017-01-23 Method to route packets in a distributed direct interconnect network

Related Child Applications (1)

Application Number Title Priority Date Filing Date
HK42020004075.6A Addition HK40013842B (en) 2014-02-13 2017-01-23 Method to route packets in a distributed direct interconnect network

Publications (2)

Publication Number Publication Date
HK1227195A1 HK1227195A1 (en) 2017-10-13
HK1227195B true HK1227195B (en) 2020-12-11
