
HK40081215B - Reconfigurable computing pods using optical networks with one-to-many optical switches - Google Patents


Info

Publication number
HK40081215B
Authority
HK
Hong Kong
Prior art keywords
segment
building block
dimension
workload
building blocks
Application number
HK42023069405.1A
Other languages
Chinese (zh)
Other versions
HK40081215A (en)
Inventor
Jeremiah Willcock (耶利米·威尔库克)
Original Assignee
Google LLC
Application filed by Google LLC
Publication of HK40081215A
Publication of HK40081215B


Description

Reconfigurable computing pods using optical networks with one-to-many optical switches

This application is a divisional of the following application:

Original application number: 202010617949.1

Original application filing date: July 1, 2020

Original application title: Reconfigurable computing pods using optical networks with one-to-many optical switches

Background

Some computing workloads, such as machine learning training, require a large number of processing nodes to complete efficiently. The processing nodes can communicate with each other over an interconnect network. For example, in machine learning training, the processing nodes can communicate with each other to converge on an optimal deep learning model. The interconnect network is critical to the speed and efficiency with which the processing units converge.

Because machine learning and other workloads vary in size and complexity, the rigid structure of supercomputers that include many processing nodes can limit their availability, scalability, and performance. For example, if some processing nodes of a supercomputer with a rigid interconnect network connecting a specific arrangement of processing nodes fail, the supercomputer may be unable to replace those nodes, resulting in reduced availability and performance. Some specific arrangements can also result in higher performance than others, independent of any failed nodes.

Summary

This specification describes technologies relating to reconfigurable superpods of compute nodes, in which an optical network is used to generate workload clusters from a superpod.

In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include receiving request data that specifies requested compute nodes for a computing workload. The request data specifies a target arrangement of the compute nodes. A subset of building blocks is selected from a superpod that includes a set of building blocks, each of which includes compute nodes arranged in m dimensions. Each building block is connected to an optical network that includes two or more optical circuit switching (OCS) switches for each of the m dimensions. For each of the m dimensions, each building block includes one or more segments of compute nodes that are interconnected along the dimension. Each segment includes a first compute node on a first end of the segment and a second compute node on a second end of the segment opposite the first end. For each of the m dimensions, a first portion of the first compute nodes is connected to a first OCS switch of the two or more OCS switches for the dimension, one or more additional portions of the first compute nodes are connected to respective additional OCS switches of the two or more OCS switches for the dimension, and the second compute node of each segment is connected to the input of a respective one-to-many optical switch that has an input and multiple outputs. A first output is connected to the first OCS switch and, for each additional portion of the first compute nodes, a respective additional output is connected to the additional OCS switch for that additional portion. A logical arrangement of the subset of compute nodes that matches the target arrangement of compute nodes is determined. For each of the m dimensions, the logical arrangement defines connections between the segments of each building block and the corresponding segments of one or more other building blocks. A workload cluster of compute nodes is generated that includes the subset of building blocks, connected to each other based on the logical arrangement. Generating the workload cluster includes configuring, for each dimension of the workload cluster, respective routing data for each of the two or more OCS switches for the dimension. The respective routing data for each dimension of the workload cluster specifies how data of the computing workload is routed between compute nodes along that dimension of the workload cluster. Generating the workload cluster also includes configuring at least a portion of the one-to-many switches based on the logical arrangement such that the second compute node of each segment of compute nodes is connected to the same OCS switch as the corresponding first compute node of the corresponding segment to which that second compute node is connected in the logical arrangement. The compute nodes of the workload cluster are then caused to execute the computing workload.

Other implementations of this and other aspects include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices. A system of one or more computers can be so configured by virtue of software, firmware, hardware, or a combination of them installed on the system that in operation cause the system to perform the actions. One or more computer programs can be so configured by virtue of having instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
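To make the segment terminology concrete, the following minimal Python sketch enumerates the segments of a building block along one dimension and identifies each segment's first and second compute nodes. The function name `segments` and the zero-based coordinate convention are illustrative assumptions, not taken from the specification.

```python
from itertools import product


def segments(shape, dim):
    """Enumerate the segments of a building block along dimension `dim`.

    `shape` is the block's size per dimension, e.g. (4, 4, 4). Each returned
    segment is a list of node coordinates running along `dim`: element 0 is the
    segment's first compute node and element -1 is its second compute node.
    """
    other_dims = [d for d in range(len(shape)) if d != dim]
    result = []
    # One segment per combination of coordinates in the other dimensions.
    for fixed in product(*(range(shape[d]) for d in other_dims)):
        seg = []
        for i in range(shape[dim]):
            coord = [0] * len(shape)
            coord[dim] = i
            for d, value in zip(other_dims, fixed):
                coord[d] = value
            seg.append(tuple(coord))
        result.append(seg)
    return result


x_segments = segments((4, 4, 4), dim=0)
print(len(x_segments))                      # 16
print(x_segments[0][0], x_segments[0][-1])  # (0, 0, 0) (3, 0, 0)
```

Under this convention, a 4×4×4 building block has 16 segments along each dimension, and only the two end nodes of each segment need optical links to the OCS switches for that dimension.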

These and other implementations can each optionally include one or more of the following features. In some aspects, configuring at least a portion of the one-to-many switches based on the logical arrangement, such that the second compute node of each segment of compute nodes is connected to the same OCS switch as the corresponding first compute node of the corresponding segment to which that second compute node is connected in the logical arrangement, can include: for a first building block in the subset, identifying a second building block in the subset that is adjacent to the first building block along a particular dimension; and, for each segment of the first building block along the particular dimension, identifying the corresponding segment of the second building block, identifying the OCS switch to which the first compute node of that corresponding segment is connected, and configuring the one-to-many switch to which the segment is connected so that the second compute node of the segment is connected to the identified OCS switch.

In some aspects, identifying the corresponding segment of the second building block includes identifying the segment of the second building block that lies, in the logical arrangement, on the same logical axis along the particular dimension as the segment of the first building block.

In some aspects, the one or more additional portions of the first compute nodes consist of a single additional portion, and each one-to-many optical switch is a one-to-two optical switch having one input and two outputs. The first portion of the first compute nodes includes half of the first compute nodes, and the additional portion includes the other half.
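The configuration procedure above (find the adjacent block, find the corresponding segment on the same logical axis, find the OCS switch its first compute node is wired to, and set the one-to-many switch accordingly) can be sketched as follows. This is a hypothetical illustration: the function name, the wiring dictionary, and the wraparound from the last block back to the first are all assumptions made for the sketch.

```python
def configure_one_to_many(ring, first_node_ocs):
    """Choose the output of each segment's one-to-many switch along one dimension.

    ring: building-block IDs in logical order along the dimension; block
        ring[i] is followed by ring[i + 1], and the last wraps to the first
        (a torus-style wraparound, assumed for this sketch).
    first_node_ocs: first_node_ocs[b][s] is the index of the OCS switch that
        the first compute node of segment s of block b is wired to.

    Returns {(block, segment): ocs_index}, the switch output to select so that
    the segment's second compute node reaches the same OCS switch as the first
    compute node of the corresponding segment in the next block.
    """
    settings = {}
    for i, block in enumerate(ring):
        neighbor = ring[(i + 1) % len(ring)]
        for s in range(len(first_node_ocs[block])):
            # Corresponding segment: same logical axis, i.e. same index s.
            settings[(block, s)] = first_node_ocs[neighbor][s]
    return settings


# 1x2 case: each block's first compute nodes are split in half across OCS 0 and OCS 1.
wiring = {"A": [0, 0, 1, 1], "B": [1, 1, 0, 0]}
print(configure_one_to_many(["A", "B"], wiring))
```

Because the output index is just an OCS switch index, the same sketch covers one-to-N switches with N greater than two.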

In some aspects, the request data specifies different types of compute nodes, and selecting the subset of building blocks includes selecting, for each type of compute node specified by the request data, a building block that includes one or more compute nodes of the specified type.

In some aspects, the respective routing data for each dimension of the superpod can include an OCS switch routing table for each of the two or more OCS switches for the dimension. In some aspects, each building block can include either a three-dimensional torus of compute nodes or a mesh of compute nodes.
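As a rough illustration of per-OCS routing tables for one dimension, the sketch below cross-connects the second compute node of each segment to the first compute node of the corresponding segment in the next block, on the OCS switch where that first compute node is wired. It assumes the one-to-many switches have already steered each second compute node to that OCS switch. The port labels, the function name, the wraparound, and the treatment of a cross-connect as a bidirectional port pairing are all hypothetical.

```python
from collections import defaultdict


def build_routing_tables(ring, first_node_ocs):
    """Build hypothetical per-OCS routing tables for one dimension.

    Ports are labeled (block, segment, end), where end is "first" or "second".
    For each pair of adjacent blocks in `ring`, the second compute node of
    segment s in the upstream block is cross-connected to the first compute
    node of segment s in the downstream block, on whichever OCS switch that
    first compute node is wired to. Each cross-connect is stored in both
    directions, treating it as a bidirectional port pairing (an assumption).
    """
    tables = defaultdict(dict)
    for i, block in enumerate(ring):
        neighbor = ring[(i + 1) % len(ring)]  # wraparound is assumed
        for s, ocs in enumerate(first_node_ocs[neighbor]):
            upstream = (block, s, "second")
            downstream = (neighbor, s, "first")
            tables[ocs][upstream] = downstream
            tables[ocs][downstream] = upstream
    return dict(tables)


tables = build_routing_tables(["A", "B"], {"A": [0, 1], "B": [1, 0]})
for ocs, table in sorted(tables.items()):
    print(ocs, table)
```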

In some aspects, the superpod includes multiple workload clusters, each of which includes a different subset of the building blocks and executes a workload different from that of every other workload cluster.

Some aspects include receiving data indicating that a given building block of the workload cluster has failed and replacing the given building block with an available building block. Replacing the given building block can include updating routing data of one or more OCS switches of the optical network to stop routing data between the given building block and one or more other building blocks of the workload cluster, and updating routing data of the one or more OCS switches to route data between the available building block and the one or more other building blocks of the workload cluster. In some aspects, the target arrangement of compute nodes includes an n-dimensional arrangement of compute nodes, where n is greater than or equal to two.

The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages. Using optical networks to dynamically configure clusters of compute nodes for workloads results in higher availability of the compute nodes, because other compute nodes can easily be substituted for failed or offline compute nodes. Workload clusters can be configured from a superpod that includes compute nodes connected to an optical network. The flexibility in arranging compute nodes results in higher performance and in more efficient allocation of the appropriate number and arrangement of compute nodes, optimized (or improved) for each workload. With a superpod that includes multiple types of compute nodes connected by an optical network, workload clusters can be generated that include not only the appropriate number and arrangement of compute nodes but also the appropriate types of compute nodes for each workload, rather than being limited to, for example, compute nodes that are physically close to each other in a data center or other location.

Using optical networks to configure workload clusters also provides fault isolation and better security for the workloads. For example, some conventional supercomputers route traffic between the various computers that make up the supercomputer; if one of those computers fails, the communication path is lost. With optical networks, data can be rerouted quickly and/or an available compute node can replace (e.g., substitute for) a failed compute node. For example, another compute node in a superpod can be connected to the other compute nodes of a workload cluster by reconfiguring optical circuit switching (OCS) switches. In addition, the physical isolation between workloads provided by the OCS switches (e.g., physical isolation of different optical paths) provides better security between the various workloads executing in the same superpod than managing the separation with vulnerable software.

Using optical networks to connect building blocks can also reduce the latency of transmitting data between building blocks relative to packet-switched networks. For example, packet switching adds latency because each packet must be received by a switch, buffered, and sent out again on another port. Using OCS switches to connect building blocks provides a true end-to-end optical path with no packet switching or buffering in between.

One-to-many optical switches can be included in the optical network to increase the size of a superpod for a given size of OCS switch. This, in turn, allows larger workload clusters to be generated from the compute nodes of a superpod for a given OCS switch size. Similarly, for a superpod of a given size, using one-to-many optical switches can reduce the number of OCS ports used on each OCS switch.

Various features and advantages of the foregoing subject matter are described below with respect to the figures. Additional features and advantages are apparent from the subject matter described herein and from the claims.

Brief Description of the Drawings

Figure 1 is a block diagram of an environment in which an example processing system generates workload clusters of compute nodes and executes computing workloads using the workload clusters.

Figure 2 illustrates an example logical superpod and example workload clusters generated from a portion of the building blocks in the superpod.

Figure 3 illustrates an example building block and example workload clusters generated using the building block.

Figure 4 illustrates an example optical link from a compute node to an optical circuit switching (OCS) switch.

Figure 5 illustrates a logical compute tray used to form a building block.

Figure 6 illustrates a sub-block of an example building block with one dimension omitted.

Figure 7 illustrates an example building block.

Figure 8 illustrates an OCS fabric topology for a superpod.

Figure 9 illustrates components of an example superpod.

Figure 10 is a flowchart of an example process for generating a workload cluster and executing a computing workload using the workload cluster.

Figure 11 is a flowchart of an example process for reconfiguring an optical network to replace a failed building block.

Figure 12 illustrates a portion of an example superpod that includes building blocks and 1×2 optical switches.

Figure 13 illustrates an example workload cluster.

Figure 14 is a flowchart of an example process for generating a workload cluster and executing a computing workload using the workload cluster.

Like reference numbers and designations in the various drawings indicate like elements.

Detailed Description

In general, the systems and techniques described herein can configure an optical network fabric to generate workload clusters of compute nodes from a superpod that includes multiple building blocks of compute nodes connected via an optical network. For example, a superpod can include a set of interconnected building blocks. Each building block can include multiple compute nodes in an m-dimensional arrangement, e.g., a two-dimensional or three-dimensional arrangement.

A user can specify a target arrangement of compute nodes for a particular workload. For example, a user can provide a machine learning workload and specify a target arrangement of compute nodes for performing the machine learning computations. The target arrangement can define the number of compute nodes in each of n dimensions, e.g., where n is greater than or equal to two. That is, the target arrangement can define the size and shape of the workload cluster. For example, some machine learning models and computations perform better on non-square topologies.

Cross-sectional bandwidth can also limit the overall computation; for example, compute nodes that are waiting for data transfers leave computation cycles idle. Depending on how the work is distributed across the compute nodes and how much data must be transferred over the network in each dimension, the shape of the workload cluster can affect the performance of the compute nodes in the cluster.

For workloads with all-to-all compute node data traffic, a cube-shaped workload cluster reduces the number of hops between compute nodes. If a workload has a lot of local communication and then transfers data to a neighboring set of compute nodes along a particular dimension, and the workload calls for many of these neighbor communications chained together, the workload can benefit from an arrangement with more compute nodes in that dimension than in the other dimensions. Enabling users to specify the arrangement of compute nodes in a workload cluster therefore lets them choose an arrangement that yields better performance for their workload.
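The claim that a cube shape reduces hops for all-to-all traffic can be checked with a small calculation. The sketch below assumes a torus with wraparound links and minimal (shortest-path) routing, and compares two 4096-node shapes; it is an illustration under those assumptions, not a model of any particular interconnect.

```python
from fractions import Fraction


def ring_avg_hops(k):
    """Mean shortest-path distance between two uniformly random nodes of a
    k-node ring (self-pairs included)."""
    return Fraction(sum(min(d, k - d) for d in range(k)), k)


def torus_avg_hops(shape):
    """Mean hop count on a torus with minimal routing: per-dimension ring
    distances add, so their averages add as well."""
    return sum(ring_avg_hops(k) for k in shape)


# Two 4096-node shapes: a cube and an elongated arrangement.
print(torus_avg_hops((16, 16, 16)))  # 12
print(torus_avg_hops((64, 8, 8)))    # 20
```

The elongated shape can still be the better choice for the neighbor-communication pattern described above, because most of its traffic then stays within short hops along one dimension.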

If the superpod includes different types of compute nodes, the request can also specify how many compute nodes of each type to include in the workload cluster. This allows users to specify an arrangement of compute nodes that performs better for a particular workload.

A workload scheduler can select the building blocks for a workload cluster based, for example, on the availability of the building blocks, the health of the building blocks (e.g., working or failed), and/or the priority of workloads in the superpod (e.g., the priority of workloads being executed, or to be executed, by compute nodes of the superpod). The workload scheduler can provide data identifying the selected building blocks and the target arrangement of the building blocks to an optical circuit switching (OCS) manager. The OCS manager can then configure one or more OCS switches of the optical network to connect the building blocks together to form the workload cluster. The workload scheduler can then execute the computing workload on the compute nodes of the workload cluster.
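One way a workload scheduler might combine availability, health, and priority is sketched below. The `BuildingBlock` record, the preference for free blocks, and the policy of preempting only blocks running strictly lower-priority workloads are hypothetical choices made for this illustration; the specification only states that these factors can be considered.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BuildingBlock:
    block_id: str
    healthy: bool
    running_priority: Optional[int] = None  # None means the block is free


def select_blocks(pool: List[BuildingBlock], count: int, priority: int) -> Optional[List[str]]:
    """Pick `count` healthy blocks for a workload with the given priority.

    Free blocks are preferred; after that, blocks running strictly
    lower-priority workloads (which would be preempted) are used, lowest
    priority first. Returns None if there are not enough candidates.
    """
    healthy = [b for b in pool if b.healthy]
    free = [b for b in healthy if b.running_priority is None]
    preemptible = sorted(
        (b for b in healthy
         if b.running_priority is not None and b.running_priority < priority),
        key=lambda b: b.running_priority,
    )
    candidates = free + preemptible
    if len(candidates) < count:
        return None
    return [b.block_id for b in candidates[:count]]


pool = [
    BuildingBlock("a", True),
    BuildingBlock("b", False),      # failed
    BuildingBlock("c", True, 5),
    BuildingBlock("d", True, 1),
]
print(select_blocks(pool, 2, priority=3))  # ['a', 'd']
```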

If a building block in a workload cluster fails, it can be replaced quickly with another building block simply by reconfiguring the OCS switches. For example, the workload scheduler can select an available building block in the superpod to replace the failed building block and instruct the OCS manager to perform the replacement. The OCS manager can then reconfigure the OCS switches so that the selected building block is connected to the other building blocks of the workload cluster and the failed building block is no longer connected to them.
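The replacement step can be sketched as a rewrite of per-OCS routing tables: every cross-connect that touched the failed block is redirected to the replacement block, which both stops routing to the failed block and starts routing to the new one. The table layout (port labels of the form (block, segment, end)) and the simplifying assumption that the replacement block is wired to the same OCS switches, segment for segment, as the failed block are hypothetical.

```python
def replace_block(tables, failed, replacement):
    """Return new per-OCS routing tables in which every cross-connect that
    touched `failed` now touches `replacement` instead.

    tables: {ocs_index: {port: port}}, with ports labeled (block, segment, end).
    Assumes the replacement block is wired to the same OCS switches,
    segment for segment, as the failed block.
    """
    def swap(port):
        block, segment, end = port
        return (replacement, segment, end) if block == failed else port

    return {
        ocs: {swap(src): swap(dst) for src, dst in table.items()}
        for ocs, table in tables.items()
    }


tables = {0: {("A", 0, "second"): ("B", 0, "first"),
              ("B", 0, "first"): ("A", 0, "second")}}
print(replace_block(tables, failed="B", replacement="C"))
```

Building a new table rather than mutating the old one makes it easy to compare the before and after configurations before pushing them to the switches.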

Figure 1 is a block diagram of an environment 100 in which an example processing system 130 generates workload clusters of compute nodes and executes computing workloads using the workload clusters. The processing system 130 can receive computing workloads 112 from user devices 110 over a data communication network 120, e.g., a local area network (LAN), a wide area network (WAN), the Internet, a mobile network, or a combination thereof. Example workloads 112 include software applications, machine learning models (e.g., training and/or using machine learning models), video encoding and decoding, and digital signal processing workloads.

A user can also specify a requested cluster 114 of compute nodes for a workload 112. For example, the user can specify a target shape and size for the requested cluster, i.e., the number of compute nodes in each of multiple dimensions. For example, if the compute nodes are arranged in three dimensions x, y, and z, the user can specify the number of compute nodes in each dimension. The user can also specify one or more types of compute nodes to include in the cluster. As described below, the processing system 130 can include different types of compute nodes.

As described below, the processing system 130 can use building blocks to generate workload clusters that match the target shape and size of the cluster. Each building block can include multiple compute nodes arranged in m dimensions, e.g., three dimensions or another appropriate number of dimensions. Thus, the user can specify the target shape and size in terms of the number of building blocks in each of the multiple dimensions. For example, the processing system 130 can provide the user device 110 with a user interface that enables the user to select, for each dimension, up to a maximum number of building blocks.
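Translating a target shape expressed in compute nodes into a number of building blocks per dimension is simple integer arithmetic, sketched below under the assumption that every dimension of the target must be an exact multiple of the building block's size in that dimension. The function name is hypothetical.

```python
def blocks_per_dimension(target_nodes, block_shape):
    """Number of building blocks needed along each dimension so that the
    workload cluster has exactly `target_nodes` compute nodes per dimension."""
    if len(target_nodes) != len(block_shape):
        raise ValueError("target and building block must have the same number of dimensions")
    counts = []
    for want, per_block in zip(target_nodes, block_shape):
        if want % per_block:
            raise ValueError(f"{want} nodes is not a multiple of the block size {per_block}")
        counts.append(want // per_block)
    return tuple(counts)


# An 8x16x8 target built from 4x4x4 building blocks needs 2x4x2 = 16 blocks.
print(blocks_per_dimension((8, 16, 8), (4, 4, 4)))  # (2, 4, 2)
```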

The user device 110 can provide the workload 112 and data specifying the requested cluster 114 to the processing system 130. For example, the user device 110 can send request data that includes the workload 112 and the data specifying the requested cluster 114 to the processing system 130 over the network 120.

The processing system 130 includes a cell scheduler 140 and one or more cells 150. A cell 150 is a group of one or more superpods. For example, the illustrated cell 150 includes four superpods 152-158. Each superpod 152-158 includes a set of building blocks 160, also referred to herein as a pool of building blocks. In this example, each superpod 152-158 includes 64 building blocks 160. However, the superpods 152-158 can include other numbers of building blocks 160, e.g., 20, 50, 100, or another appropriate number. The superpods 152-158 can also include different numbers of building blocks 160 from one another. For example, superpod 152 can include 64 building blocks while superpod 154 includes 100 building blocks.

As described in more detail below, each building block 160 can include multiple compute nodes logically arranged in two or more dimensions. For example, a building block 160 can include 64 compute nodes arranged along three dimensions, with four compute nodes in each dimension. This arrangement is referred to herein as a 4×4×4 building block, with four compute nodes along the x dimension, four along the y dimension, and four along the z dimension. Other numbers of dimensions (e.g., two) and other numbers of compute nodes in each dimension are also possible, e.g., 3×1, 2×2×2, 6×2, 2×3×4, etc.

A building block can also consist of a single compute node. However, as described below, generating a workload cluster involves configuring optical links between building blocks to connect them together. Thus, although smaller building blocks (e.g., building blocks with a single compute node) can provide more flexibility in generating workload clusters, they can require more OCS switch configurations and more optical network components (e.g., cables and switches). The number of compute nodes in a building block can therefore be chosen as a trade-off between the desired flexibility of the workload clusters on the one hand, and the connections needed to join building blocks into workload clusters and the number of OCS switches required on the other.
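The trade-off can be quantified roughly by counting external optical links per building block. The sketch below assumes, as a simplification, that every segment along every dimension contributes two optical links (one from each end node) and that nothing else in the block needs an optical link. Under that assumption, a single-node block needs 6 links per node while a 4×4×4 block needs 96 links for 64 nodes, i.e. 1.5 links per node.

```python
from math import prod


def optical_links(block_shape):
    """Count segment-end optical links for one building block, assuming each
    segment along each dimension contributes two links (one per end node).
    For a dimension of length 1, both ends are the same node, which still
    uses two links along that dimension."""
    nodes = prod(block_shape)
    links = 0
    for length in block_shape:
        segments = nodes // length  # lines of nodes running along this dimension
        links += 2 * segments
    return links


for shape in [(1, 1, 1), (2, 2, 2), (4, 4, 4)]:
    n = optical_links(shape)
    print(shape, n, n / prod(shape))  # links per block and per compute node
```

For a 4096-node superpod, this works out to 24,576 links with single-node blocks versus 6,144 links with 4×4×4 blocks.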

Each compute node of a building block 160 can include an application-specific integrated circuit (ASIC), e.g., a tensor processing unit (TPU) for machine learning workloads, a graphics processing unit (GPU), or another type of processing unit. For example, each compute node can be a single processor chip that includes a processing unit.

In some implementations, all of the building blocks 160 in a superpod have the same compute nodes. For example, superpod 152 can include 64 building blocks, each having 64 TPUs in a 4×4×4 arrangement for executing machine learning workloads. A superpod can also include different types of compute nodes. For example, superpod 154 can include 60 building blocks that have TPUs and 4 building blocks that have special-purpose processing units for performing tasks other than machine learning workloads. In this way, the workload cluster for a workload can include different types of compute nodes. A superpod can include multiple building blocks of each type of compute node it contains, for redundancy and/or to allow multiple workloads to run in the superpod.

In some implementations, all of the building blocks 160 in a superpod have the same arrangement, e.g., the same size and shape. For example, each building block 160 of superpod 152 can have a 4×4×4 arrangement. A superpod can also have building blocks with different arrangements. For example, superpod 154 can have 32 building blocks in a 4×4×4 arrangement and 32 building blocks in a 16×8×16 arrangement. Building blocks with different arrangements can have the same or different compute nodes; for example, building blocks with TPUs can have a different arrangement than building blocks with GPUs.

A superpod can have building blocks at different hierarchical levels. For example, superpod 152 can include base-level building blocks with a 4×4×4 arrangement as well as intermediate-level building blocks with more compute nodes, e.g., an 8×8×8 arrangement made up of eight base-level building blocks. In this way, a large workload cluster can be generated from intermediate-level building blocks with fewer link configurations than would be needed to connect base-level building blocks into a cluster of the same size. Also having base-level building blocks in the superpod provides flexibility for smaller workload clusters that may not need as many compute nodes as an intermediate-level building block contains.

The superplatforms 152-158 within a unit 150 can have the same or different types of compute nodes in their building blocks. For example, a unit 150 can include one or more superplatforms with TPU building blocks and one or more superplatforms with GPU building blocks. The size and shape of the building blocks can also be the same or different across the superplatforms 152-158 of a unit 150.

Each unit 150 also includes shared data storage 162 and shared auxiliary computing components 164. Each superplatform 152-158 in the unit 150 can use the shared data storage 162, e.g., to store data generated by the workloads executing in superplatforms 152-158. The shared data storage 162 can include hard disk drives, solid-state drives, flash memory, and/or other suitable data storage devices. The shared auxiliary computing components 164 can include CPUs (e.g., general-purpose CPU machines), GPUs, and/or other accelerators (e.g., for video decoding, image decoding, etc.) shared within the unit 150. The auxiliary computing components 164 can also include storage devices, memory devices, and/or other computing components that the compute nodes can share over a network.

The unit scheduler 140 can select a unit 150 and/or a superplatform 152-158 of a unit 150 for each workload received from a user device 110. The unit scheduler 140 can select a superplatform based on the target arrangement specified for the workload, the availability of the building blocks 160 in superplatforms 152-158, and the health of those building blocks. For example, the unit scheduler 140 can select, for a workload, a superplatform that includes at least enough available and healthy building blocks to generate a workload cluster with the target arrangement. If the request data specifies a type of compute node, the unit scheduler 140 can select a superplatform that has at least enough available and healthy building blocks with that type of compute node.
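The selection criteria above can be illustrated with a minimal Python sketch. The class names, fields, and `node_type` labels are hypothetical and not taken from the patent. The sketch only checks whether a superplatform has enough healthy, unassigned building blocks, optionally of a requested type, to cover a target arrangement expressed as building blocks per dimension.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class BuildingBlock:
    block_id: int
    node_type: str          # hypothetical label, e.g. "TPU" or "GPU"
    healthy: bool = True
    assigned: bool = False  # True if already allocated to a workload cluster

    @property
    def usable(self) -> bool:
        return self.healthy and not self.assigned


@dataclass
class Superplatform:
    name: str
    blocks: List[BuildingBlock] = field(default_factory=list)

    def usable_count(self, node_type: Optional[str] = None) -> int:
        # Count healthy, available blocks, optionally filtered by node type.
        return sum(
            1 for b in self.blocks
            if b.usable and (node_type is None or b.node_type == node_type)
        )


def candidate_superplatforms(superplatforms: List[Superplatform],
                             blocks_per_dim: Tuple[int, ...],
                             node_type: Optional[str] = None) -> List[Superplatform]:
    """Return the superplatforms that can host the target arrangement."""
    needed = math.prod(blocks_per_dim)  # total building blocks required
    return [sp for sp in superplatforms if sp.usable_count(node_type) >= needed]
```

For example, a superplatform with four healthy TPU blocks qualifies for a 2×2×1 arrangement of blocks. One with an unhealthy block among its four does not.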

As described below, each superplatform 152-158 can also include a workload scheduler and an OCS manager. When the unit scheduler 140 selects a superplatform of a unit 150, the unit scheduler 140 can provide the workload and data specifying the requested cluster to the workload scheduler of that superplatform. As described in more detail below, the workload scheduler can select, from the superplatform's building blocks, a set of building blocks to connect to form the workload cluster. The selection is based on the availability and health of the building blocks and, optionally, on the priorities of the workloads in the superplatform. For example, suppose the workload scheduler receives a request for a workload cluster that needs more building blocks than the superplatform has healthy and available. In that case, the workload scheduler can reassign building blocks of lower-priority workloads to the requested workload cluster. The workload scheduler can provide data identifying the selected building blocks to the OCS manager. The OCS manager can then configure one or more OCS switches to connect the building blocks together to form the workload cluster. The workload scheduler can then execute the workload on the compute nodes of the workload cluster.
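A minimal sketch of the workload scheduler's selection step follows. It uses free healthy blocks first and falls back to reassigning blocks held by lower-priority workloads. The tuple layout is hypothetical, as are the convention that a larger number means higher priority and the lowest-priority-first reassignment order. The patent text does not specify these details.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical record: (block_id, healthy, owner_priority).
# owner_priority is None for a free block; larger numbers mean higher priority.
Block = Tuple[int, bool, Optional[int]]


def select_blocks(blocks: Sequence[Block], count: int,
                  priority: int) -> Optional[List[int]]:
    """Pick `count` blocks for a request with the given priority.

    Free healthy blocks are used first. If there are not enough, blocks held by
    strictly lower-priority workloads are reassigned, lowest priority first.
    Returns None if the request cannot be satisfied.
    """
    free = [b for b in blocks if b[1] and b[2] is None]
    if len(free) >= count:
        return [b[0] for b in free[:count]]
    preemptible = sorted(
        (b for b in blocks if b[1] and b[2] is not None and b[2] < priority),
        key=lambda b: b[2],
    )
    chosen = free + preemptible[: count - len(free)]
    if len(chosen) < count:
        return None
    return [b[0] for b in chosen]
```

The returned IDs correspond to the "data identifying the selected building blocks" that the workload scheduler would hand to the OCS manager.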

In some implementations, the unit scheduler 140 balances load across the units 150 and superplatforms 152-158, e.g., when selecting a superplatform 152-158 for a workload. For example, when choosing between two or more superplatforms that have enough building block capacity for a workload, the unit scheduler 140 can select the superplatform with the greatest capacity, e.g., the most available and healthy building blocks. Alternatively, it can select a superplatform in the unit that has the greatest total capacity.
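The first load-balancing policy described above (choose the eligible superplatform with the most available, healthy building blocks) can be sketched in a few lines. The capacity dictionary and the tie-breaking rule are illustrative assumptions.

```python
from typing import Dict, Optional


def pick_least_loaded(capacity: Dict[str, int], needed: int) -> Optional[str]:
    """Among superplatforms with at least `needed` usable blocks, pick the one
    with the most usable blocks. Ties go to the first one in dict order."""
    eligible = {name: cap for name, cap in capacity.items() if cap >= needed}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)
```

The alternative policy, preferring the unit with the greatest total capacity, would apply the same `max` over per-unit totals before choosing a superplatform within that unit.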

In some implementations, the unit scheduler 140 can also determine the target arrangement for a workload. For example, the unit scheduler 140 can determine a target arrangement of building blocks based on the workload's estimated computational requirements and the throughput of one or more types of available compute nodes. In this example, the unit scheduler 140 can provide the determined target arrangement to the workload scheduler of the superplatform.

Figure 2 shows an example logical superplatform 210 and example workload clusters 220, 230, and 240 generated from subsets of the building blocks in superplatform 210. In this example, superplatform 210 includes 64 building blocks, each with a 4×4×4 arrangement. Although many of the examples in this document are described in terms of 4×4×4 building blocks, the same techniques can be applied to other building block arrangements.

In superplatform 210, the hatched building blocks are assigned to workloads, as described below. Building blocks shown in solid white are healthy, available building blocks. Building blocks shown in solid black are unhealthy building blocks that cannot be used to generate workload clusters, e.g., due to a failure.

Workload cluster 220 has an 8×8×4 arrangement and includes four of the 4×4×4 building blocks of superplatform 210. That is, workload cluster 220 has eight compute nodes along the x dimension, eight along the y dimension, and four along the z dimension. Because each building block has four compute nodes along each dimension, workload cluster 220 includes two building blocks along the x dimension, two along the y dimension, and one along the z dimension.

The four building blocks of workload cluster 220 are shown with diagonal hatching to indicate their positions in superplatform 210. As shown, the building blocks of workload cluster 220 are not adjacent to each other. As described in more detail below, the use of an optical network makes it possible to generate a workload cluster from any combination of building blocks in superplatform 210, regardless of their relative positions in superplatform 210.

Workload cluster 230 is an 8×8×8 cluster that includes eight building blocks of superplatform 210. Specifically, the workload cluster includes two building blocks along each dimension, which gives workload cluster 230 eight compute nodes along each dimension. The building blocks of workload cluster 230 are shown with vertical hatching to indicate their positions in superplatform 210.

Workload cluster 240 is a 16×8×16 cluster that includes 32 building blocks of superplatform 210. In particular, workload cluster 240 includes four building blocks along the x dimension, two along the y dimension, and four along the z dimension. This gives the workload cluster 16 compute nodes along the x dimension, eight along the y dimension, and 16 along the z dimension. The building blocks of workload cluster 240 are shown with cross-hatching to indicate their positions in superplatform 210.
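The arithmetic behind workload clusters 220, 230, and 240 can be checked with a short Python sketch. It divides each cluster dimension by the building-block dimension. The function name is illustrative only.

```python
import math
from typing import Tuple


def blocks_per_dimension(cluster_shape: Tuple[int, ...],
                         block_shape: Tuple[int, ...]) -> Tuple[int, ...]:
    """Number of building blocks needed along each dimension."""
    if len(cluster_shape) != len(block_shape):
        raise ValueError("shapes must have the same number of dimensions")
    counts = []
    for c, b in zip(cluster_shape, block_shape):
        if c % b:
            raise ValueError(f"{c} nodes is not a multiple of block size {b}")
        counts.append(c // b)
    return tuple(counts)


for shape in [(8, 8, 4), (8, 8, 8), (16, 8, 16)]:
    per_dim = blocks_per_dimension(shape, (4, 4, 4))
    print(shape, per_dim, math.prod(per_dim))
# (8, 8, 4) (2, 2, 1) 4
# (8, 8, 8) (2, 2, 2) 8
# (16, 8, 16) (4, 2, 4) 32
```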

Workload clusters 220, 230, and 240 are just a few examples of the clusters of superplatform 210 that can be generated for workloads. Many other workload cluster arrangements are possible. Although the example workload clusters 220, 230, and 240 are rectangular, other shapes are also possible.

The shapes of workload clusters, including workload clusters 220, 230, and 240, are logical shapes rather than physical shapes. The optical network is configured so that the building blocks communicate along each dimension as if the workload cluster were physically connected in its logical configuration. However, the physical building blocks and their compute nodes can be physically arranged in a data center in various ways. The building blocks of workload clusters 220, 230, and 240 can be selected from any of the healthy, available building blocks. The only constraint on the physical relationships among the building blocks of superplatform 210 is that they are all connected to the optical network of superplatform 210. For example, as described above and shown in Figure 2, workload clusters 220, 230, and 240 include building blocks that are not physically adjacent.
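Because the cluster shape is purely logical, any selected physical building-block IDs can be assigned to logical grid positions. The sketch below illustrates this with a hypothetical helper and made-up block IDs. Physical location plays no role in the mapping.

```python
import itertools
from typing import Dict, Sequence, Tuple


def assign_logical_positions(block_ids: Sequence[int],
                             blocks_per_dim: Tuple[int, ...]) -> Dict[Tuple[int, ...], int]:
    """Map logical grid positions to arbitrary physical building-block IDs.

    Any healthy, available block attached to the optical network can fill
    any logical position; positions are enumerated in row-major order.
    """
    positions = list(itertools.product(*(range(n) for n in blocks_per_dim)))
    if len(block_ids) != len(positions):
        raise ValueError("need exactly one block per logical position")
    return dict(zip(positions, block_ids))


# Hypothetical, non-adjacent blocks forming a 2x2x1 grid of building blocks.
layout = assign_logical_positions([3, 17, 40, 62], (2, 2, 1))
```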

In addition, the logical arrangement of a workload cluster is not constrained by the physical arrangement of the superplatform's building blocks. For example, the building blocks can be physically arranged in eight rows and eight columns with only one building block along the z dimension. Even so, configuring the optical network to create the corresponding logical arrangement lets a workload cluster include multiple building blocks along the z dimension.

Figure 3 shows an example building block 310 and example workload clusters 320, 330, and 340 generated using building block 310. Building block 310 is a 4×4×4 building block with four compute nodes along each dimension. In this example, each dimension of building block 310 includes 16 segments of four compute nodes each. For example, there are 16 compute nodes on the top of building block 310. Each of these 16 compute nodes starts a segment along the y dimension. The segment includes that compute node and three other compute nodes, ending with the corresponding compute node on the bottom of building block 310. For example, one segment along the y dimension includes compute nodes 301-304.

Each segment of compute nodes lies along a logical axis. For example, compute nodes 301-304 lie along one logical axis, and the four compute nodes to the right of compute nodes 301-304 lie along a different logical axis. Compute nodes 305-308 lie along yet another logical axis. As shown in Figure 3, a 4×4×4 building block has 16 logical axes along each dimension. As described below, one or more OCS switches for a logical axis can connect the compute nodes of different building blocks that lie on the same logical axis.
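The segments (logical axes) of a building block can be enumerated by fixing every coordinate except one and walking the remaining coordinate from end to end. The following Python sketch, with an illustrative function name and (x, y, z) coordinates, confirms that a 4×4×4 block has 16 segments of four nodes along each dimension.

```python
import itertools
from typing import List, Tuple

Coord = Tuple[int, ...]


def segments(shape: Tuple[int, ...], dim: int) -> List[List[Coord]]:
    """List the segments (logical axes) of a block along dimension `dim`.

    Each segment fixes every coordinate except `dim` and walks `dim` from one
    end of the block to the other.
    """
    others = [range(n) for i, n in enumerate(shape) if i != dim]
    result = []
    for fixed in itertools.product(*others):
        seg = []
        for k in range(shape[dim]):
            coord = list(fixed)
            coord.insert(dim, k)
            seg.append(tuple(coord))
        result.append(seg)
    return result
```

The index of a segment in the returned list can serve as its logical-axis identifier. Segments with the same index in different building blocks lie on the same logical axis.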

The compute nodes within building block 310 can be connected to one another by internal links 318 made of a conductive material, e.g., copper cables. The internal links 318 can connect the compute nodes in each segment of each dimension. For example, one internal link 318 connects compute node 301 to compute node 302, another connects compute node 302 to compute node 303, and another connects compute node 303 to compute node 304. The compute nodes in every other segment can be connected in the same way to provide internal data communication among the compute nodes of building block 310.

Building block 310 also includes external links 311-316 that connect building block 310 to an optical network, which in turn connects building block 310 to other building blocks. In this example, building block 310 includes 16 external input links 311 for the x dimension. That is, building block 310 has an external input link 311 for each of the 16 segments along the x dimension. Similarly, building block 310 includes:
- an external output link 312 for each segment along the x dimension;
- an external input link 313 and an external output link 314 for each segment along the y dimension; and
- an external input link 315 and an external output link 316 for each segment along the z dimension.

Some building block arrangements can have more than three dimensions; a torus, for example, can have any number of dimensions. Building block 310 can therefore include similar external links for each of its dimensions.
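The resulting number of external links per building block can be computed directly. The sketch below assumes, as in the example above, exactly one input and one output link per segment per dimension. The function name is illustrative.

```python
import math
from typing import Dict, Tuple


def external_link_counts(shape: Tuple[int, ...]) -> Dict[int, Dict[str, int]]:
    """One external input and one external output link per segment per dimension."""
    total_nodes = math.prod(shape)
    counts = {}
    for dim, length in enumerate(shape):
        num_segments = total_nodes // length  # segments along this dimension
        counts[dim] = {"in": num_segments, "out": num_segments}
    return counts


counts = external_link_counts((4, 4, 4))
total = sum(c["in"] + c["out"] for c in counts.values())  # 96 for a 4x4x4 block
```

The same function applies unchanged to arrangements with more than three dimensions.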

Each external link 311-316 can be a fiber-optic link that connects a compute node at the end of its segment to the optical network. For example, each external link 311-316 can connect its compute node to an OCS switch of the optical network. As described below, the optical network can include one or more OCS switches for each dimension along which building block 310 has segments. That is, the external links 311 and 312 for the x dimension can be connected to different OCS switches than the external links 313 and 314 for the y dimension. The OCS switches can be configured to connect building blocks to other building blocks to form workload clusters, as described in more detail below.

Building block 310 is in the form of a 4×4×4 mesh. Other arrangements are also possible for 4×4×4 building blocks (or building blocks of other sizes). For example, building block 310 can be in the form of a three-dimensional torus with wraparound torus links, similar to workload cluster 320. Workload cluster 320 can also be generated from a single mesh building block 310 by configuring the optical network to provide the wraparound torus links 321-323.

The torus links 321-323 provide wraparound data communication between the two ends of each segment. For example, torus link 321 connects the compute node at one end of each segment along the x dimension to the corresponding compute node at the other end of that segment. Torus link 321 can include a link that connects compute node 325 to compute node 326. Similarly, torus link 322 can include a link that connects compute node 325 to compute node 327.
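The difference between a mesh building block and a torus is captured by how a node's neighbor is computed at the end of a segment. The minimal sketch below uses illustrative names and (x, y, z) coordinates. In a mesh, stepping past the end of a segment leaves the block. In a torus, the step wraps to the other end, which is the connection the torus links provide.

```python
from typing import Optional, Tuple

Coord = Tuple[int, ...]


def mesh_neighbor(coord: Coord, dim: int, step: int, shape: Coord) -> Optional[Coord]:
    """Neighbor in a mesh: None past either end of the segment."""
    c = list(coord)
    c[dim] += step
    return tuple(c) if 0 <= c[dim] < shape[dim] else None


def torus_neighbor(coord: Coord, dim: int, step: int, shape: Coord) -> Coord:
    """Neighbor in a torus: stepping off one end wraps to the other end."""
    c = list(coord)
    c[dim] = (c[dim] + step) % shape[dim]
    return tuple(c)
```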

The torus links 321-323 can be conductive cables, e.g., copper cables, or optical links. For example, the optical links of torus links 321-323 can connect their corresponding compute nodes to one or more OCS switches, which can be configured to route data from one end of each segment to the other end. Building block 310 can include an OCS switch for each dimension. For example, torus link 321 can be connected to a first OCS switch that routes data between the two ends of each segment along the x dimension. Similarly, torus link 322 can be connected to a second OCS switch that routes data between the two ends of each segment along the y dimension. Torus link 323 can be connected to a third OCS switch that routes data between the two ends of each segment along the z dimension.

Workload cluster 330 includes two building blocks 338 and 339 that form a 4×8×4 cluster. Each of building blocks 338 and 339 can be the same as building block 310 or workload cluster 320. The two building blocks are connected along the y dimension using external links 337. For example, one or more OCS switches can be configured to route data between the y-dimension segments of building block 338 and the y-dimension segments of building block 339.

In addition, one or more OCS switches can be configured to provide wraparound links 331-333 between the two ends of each segment along all three dimensions. In this example, wraparound link 333 connects one end of each y-dimension segment of building block 338 to one end of the corresponding y-dimension segment of building block 339. This provides full wraparound communication for the y-dimension segments formed by combining building blocks 338 and 339.
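The wraparound in workload cluster 330 can be illustrated by treating the two blocks as a single 4×8×4 torus and splitting each cluster-wide coordinate into a (block position, local coordinate) pair. This is a minimal sketch with (x, y, z) coordinates. Mapping block position y = 0 to building block 338 and y = 1 to building block 339 is an assumption for illustration.

```python
from typing import Tuple

Coord = Tuple[int, ...]


def torus_neighbor(coord: Coord, dim: int, step: int, shape: Coord) -> Coord:
    """Neighbor in a torus: stepping off one end wraps to the other end."""
    c = list(coord)
    c[dim] = (c[dim] + step) % shape[dim]
    return tuple(c)


def to_block_local(coord: Coord, block_shape: Coord) -> Tuple[Coord, Coord]:
    """Split a cluster-wide coordinate into (block position, local coordinate)."""
    block_pos = tuple(g // b for g, b in zip(coord, block_shape))
    local = tuple(g % b for g, b in zip(coord, block_shape))
    return block_pos, local


cluster_shape, block_shape = (4, 8, 4), (4, 4, 4)
last = (1, 7, 2)                                  # last node of a y segment
nxt = torus_neighbor(last, 1, 1, cluster_shape)   # wraps around to y == 0
print(to_block_local(last, block_shape), to_block_local(nxt, block_shape))
# ((0, 1, 0), (1, 3, 2)) ((0, 0, 0), (1, 0, 2))
```

The hop from the last node of the second block back to the first node of the first block is the kind of connection a wraparound link such as 333 provides.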

Workload cluster 340 includes eight building blocks 348 (one of which is not shown) that form an 8×8×8 cluster. Each building block 348 can be the same as building block 310. Building blocks that are adjacent along the x dimension are connected using external links 345A-345C. Similarly, building blocks adjacent along the y dimension are connected using external links 344A-344C, and building blocks adjacent along the z dimension are connected using external links 346A-346C. For example, separate sets of one or more OCS switches can be configured to route data between x-dimension segments, between y-dimension segments, and between z-dimension segments. Additional external links, not shown in Figure 3, connect the hidden building block to its adjacent building blocks along each dimension. In addition, one or more OCS switches can be configured to provide wraparound links 341-343 between the two ends of each segment along all three dimensions.

Figure 4 shows an example optical link 400 from a compute node to an OCS switch. The compute nodes of a superplatform can be installed in trays in data center racks. Each compute node can include six high-speed electrical links. Two of these electrical links can be connected on the compute node's circuit board. The other four can be routed to external electrical connectors, e.g., Octal Small Form Factor Pluggable (OSFP) connectors, that connect to a port 410, e.g., an OSFP port. In this example, port 410 is connected to an optical module 420 by electrical contacts 412. If needed, the optical module 420 can convert the electrical links to optical links to extend the reach of the external links, e.g., beyond one kilometer (km), to provide data communication between compute nodes in a large data center. The type of optical module can vary based on the required distance between the building blocks and the OCS switches and on the required speed and bandwidth of the links.

Optical module 420 is connected to circulator 430 by fiber-optic cables 422 and 424. Fiber-optic cable 422 can include one or more fiber-optic cables for transmitting data from optical module 420 to circulator 430. Fiber-optic cable 424 can include one or more fiber-optic cables for receiving data from circulator 430. For example, fiber-optic cables 422 and 424 can include bidirectional fibers or unidirectional TX/RX fiber pairs. Circulator 430 can reduce the number of fiber-optic cables (e.g., from two pairs to a single pair of fiber-optic cables 432) by converting from unidirectional to bidirectional fibers. This aligns well with a single OCS port 445 of OCS switch 440, which typically accommodates a pair of optical paths (two fibers) that are switched together. In some implementations, circulator 430 can be integrated into optical module 420 or omitted from optical link 400.

Figures 5-7 show how a 4×4×4 building block can be formed from multiple compute trays. Similar techniques can be used to form building blocks of other sizes and shapes.

Figure 5 shows a logical compute tray 500 used to form a 4×4×4 building block. The basic hardware unit of the 4×4×4 building block is a single compute tray 500 with a 2×2×1 topology. In this example, compute tray 500 has two compute nodes along the x dimension, two along the y dimension, and one along the z dimension. For example, compute nodes 501 and 502 form one x-dimension segment, and compute nodes 503 and 504 form another. Similarly, compute nodes 501 and 503 form one y-dimension segment, and compute nodes 502 and 504 form another.

Each compute node 501-504 is connected to two other compute nodes by internal links 510, e.g., copper cables or traces on a printed circuit board. Each compute node is also connected to four external ports. Compute node 501 is connected to external ports 521, compute node 502 to external ports 522, compute node 503 to external ports 523, and compute node 504 to external ports 524. As described above, external ports 521-524 can be OSFP ports or other ports that connect the compute nodes to OCS switches. The ports can accommodate electrical copper modules or optical modules attached to fiber-optic cables.

The external ports 521-524 of each compute node 501-504 comprise one x-dimension port, one y-dimension port, and two z-dimension ports. This is because each compute node 501-504 is already connected by internal links 510 to another compute node in the x dimension and another in the y dimension. Having two z-dimension external ports allows each compute node 501-504 to also connect to two compute nodes along the z dimension.
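The port split above follows from each node's position in the 2×2×1 tray. The sketch below assumes one link in each direction of each dimension, which gives six links per node in three dimensions. Under that assumption it counts how many of a node's links stay inside the tray and how many must go to external ports. The function name is illustrative.

```python
from typing import Dict, Tuple


def link_budget(coord: Tuple[int, ...], tray_shape: Tuple[int, ...]):
    """Split a node's per-direction links into internal and external ones.

    Assumes one link in each direction (-/+) of every dimension, i.e. six
    links per node in three dimensions.
    """
    internal: Dict[int, int] = {}
    external: Dict[int, int] = {}
    for dim, (c, n) in enumerate(zip(coord, tray_shape)):
        inside = sum(1 for step in (-1, 1) if 0 <= c + step < n)
        internal[dim] = inside
        external[dim] = 2 - inside
    return internal, external


internal, external = link_budget((0, 0, 0), (2, 2, 1))
# internal: one x link and one y link; external: one x, one y, and two z ports
```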

Figure 6 shows an example sub-block 600 of a building block, with one dimension (the z dimension) omitted. In particular, sub-block 600 is a 4×4×1 block formed from a 2×2 arrangement of compute trays, e.g., a 2×2 arrangement of compute trays 500 of Figure 5. Sub-block 600 includes four compute trays 620A-620D in a 2×2 arrangement. Each compute tray 620A-620D can be the same as compute tray 500 of Figure 5, including four compute nodes 622 in a 2×2×1 arrangement.

The compute nodes 622 of compute trays 620A-620D can be connected using internal links 631-634, e.g., copper cables. For example, two compute nodes 622 of compute tray 620A are connected along the y dimension to two compute nodes 622 of compute tray 620B using internal links 632.

Two compute nodes 622 of each compute tray 620A-620D are also connected to external links 640 along the x dimension. Similarly, two compute nodes of each compute tray 620A-620D are connected to external links 641 along the y dimension. In particular, the compute nodes at the end of each x-dimension segment and at the end of each y-dimension segment are connected to external links 640. These external links 640 can be, e.g., fiber-optic cables that use the optical link 400 of Figure 4 to connect the compute nodes to OCS switches, thereby connecting the building block that includes those compute nodes to the OCS switches.

A 4×4×4 building block can be formed by connecting four sub-blocks 600 together along the z dimension. For example, the compute nodes 622 of each compute tray 620A-620D can be connected by internal links to one or two corresponding compute nodes of the compute trays on the other sub-blocks 600 arranged along the z dimension. The compute nodes at the end of each z-dimension segment can include external links 640 that connect to OCS switches, similar to the external links at the ends of the x-dimension and y-dimension segments.

Figure 7 shows an example building block 700. Building block 700 includes four sub-blocks 710A-710D connected along the z dimension. Each sub-block 710A-710D can be the same as sub-block 600 of Figure 6. Figure 7 shows some of the connections between sub-blocks 710A-710D along the z dimension.

In particular, building block 700 includes internal links 730-733 along the z dimension between corresponding compute nodes 716 of the compute trays 715 of sub-blocks 710A-710D. For example, internal links 730 connect the z-dimension segment of compute nodes labeled 0. Similarly, internal links 731 connect the z-dimension segment of compute nodes labeled 1, internal links 732 connect the segment of compute nodes labeled 8, and internal links 733 connect the segment of compute nodes labeled 9. Although not shown, similar internal links connect the segments of compute nodes labeled 2-7 and A-F.

Building block 700 also includes external links 720 at the end of each segment along the z dimension. External links 720 are shown only for the segments of compute nodes 0, 1, 8, and 9, but each of the other segments (for compute nodes 2-7 and A-F) also includes external links 720. The external links can connect the segments to OCS switches, similar to the external links at the ends of the x-dimension and y-dimension segments.

Figure 8 shows an OCS topology 800 for a superplatform. In this example, the superplatform includes 64 building blocks 805 (building blocks 0-63), each a 4×4×4 building block. The OCS topology includes a separate OCS switch for each segment of each dimension of the building blocks. Each 4×4×4 building block 805 includes 16 segments along each of the x, y, and z dimensions. The OCS topology therefore includes 16 OCS switches for each of the three dimensions, for a total of 48 OCS switches that can be configured to generate various workload clusters.
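The switch count in this topology can be reproduced with a small calculation. The sketch also counts the external links that land on each switch, assuming one input and one output link per building block for that segment. This counts links, not physical OCS ports. With circulators as in Figure 4, a link pair may occupy a single port.

```python
import math
from typing import Tuple


def ocs_requirements(block_shape: Tuple[int, ...], num_blocks: int) -> Tuple[int, int]:
    """Switch count for one OCS per segment per dimension, plus the number of
    external links (one input + one output per block) landing on each switch."""
    total_nodes = math.prod(block_shape)
    switches = sum(total_nodes // length for length in block_shape)
    links_per_switch = 2 * num_blocks
    return switches, links_per_switch


switches, links = ocs_requirements((4, 4, 4), 64)  # 48 switches, 128 links each
```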

In other words, the OCS topology includes an OCS switch for each logical axis of the building blocks. The segments of the building blocks that lie on the same logical axis are connected to the same OCS switch. In this way, when a workload cluster is created, the OCS switch for a logical axis can be configured to connect together the segments of compute nodes along that logical axis, so that those compute nodes can communicate with one another through the switch. For example, suppose building block A is to be logically placed to the right of building block B in a workload cluster. The OCS switch for a logical axis along the x dimension can then be configured to route data between building block A's segment on that logical axis and building block B's segment on that logical axis.
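The per-axis switch configuration can be sketched as a cross-connect table. The port naming and the rule "each block's output connects to the next block's input, with an optional wraparound from last to first" are illustrative assumptions, not the patent's stated switch programming interface. In this sketch, every OCS switch for the same dimension would receive the same table, because corresponding segments of the same blocks share each logical axis.

```python
from typing import Dict, List, Tuple

Port = Tuple[str, int]  # ("out" | "in", building-block ID), hypothetical naming


def axis_cross_connects(ordered_blocks: List[int],
                        wraparound: bool = True) -> Dict[Port, Port]:
    """Cross-connects for one logical-axis OCS.

    `ordered_blocks` lists building-block IDs in their logical order along the
    dimension (e.g. B then A if A is logically to the right of B). Each block's
    output link is connected to the next block's input link; with wraparound,
    the last block's output is connected back to the first block's input.
    """
    connections: Dict[Port, Port] = {}
    n = len(ordered_blocks)
    for i, block in enumerate(ordered_blocks):
        if i + 1 < n:
            connections[("out", block)] = ("in", ordered_blocks[i + 1])
        elif wraparound:
            connections[("out", block)] = ("in", ordered_blocks[0])
    return connections
```

With a single block and wraparound enabled, the table connects the block's output back to its own input. This corresponds to turning one mesh building block into a torus, as with workload cluster 320.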

For the x dimension, OCS fabric topology 800 includes 16 OCS switches, including OCS switch 810. For each segment along the x dimension, each building block 805 has an external input link 811 and an external output link 812 that connect to the OCS switch 810 for that segment. External links 811 and 812 can be the same as or similar to the optical link 400 of Figure 4.

For the y dimension, OCS fabric topology 800 includes 16 OCS switches, including OCS switch 820. For each segment along the y dimension, each building block 805 has an external input link 821 and an external output link 822 that connect to the OCS switch 820 for that segment. External links 821 and 822 can be the same as or similar to the optical link 400 of Figure 4.

For the z dimension, OCS fabric topology 800 includes 16 OCS switches, including OCS switch 830. For each segment along the z dimension, each building block 805 has an external input link 831 and an external output link 832 that connect to the OCS switch 830 for that segment. External links 831 and 832 can be the same as or similar to the optical link 400 of Figure 4.

In other examples, multiple segments can share the same OCS switch. Whether they do can depend on the OCS radix (port count) and/or the number of building blocks in the superpod. For example, if an OCS switch has enough ports for all x-dimension segments of all building blocks in the superpod, all of those segments can connect to that one switch. In another example, two segments of each dimension can share an OCS switch if the switch has enough ports. Connecting the corresponding segments of all building blocks to the same OCS switch has an advantage: a single routing table then governs data communication between the compute nodes of those segments. Using a separate OCS switch for each segment or each dimension can also simplify troubleshooting and diagnosis. For example, if data communication fails on a particular segment or dimension, the potentially faulty OCS switch is easier to identify than it would be if multiple OCS switches served that segment or dimension.

Figure 9 illustrates components of an example superpod 900. For example, superpod 900 can be one of the superpods of the processing system 130 of Figure 1. Superpod 900 includes 64 4×4×4 building blocks 960, which can be used to generate workload clusters that execute computing workloads (e.g., machine learning workloads). As described above, each 4×4×4 building block 960 includes 64 compute nodes, with four compute nodes arranged along each of the three dimensions. For example, building block 960 can be the same as or similar to building block 310, workload cluster 320, or building block 700 described above.

Superpod 900 includes an optical network 970 with 48 OCS switches 930, 940, and 950. These switches connect to the building blocks 960 through 96 external links 931, 932, and 933 per building block. Each external link can be an optical fiber link that is similar to or the same as the optical link 400 of Figure 4.

Like OCS fabric topology 800 of Figure 8, optical network 970 includes an OCS switch for each segment of each dimension of the building blocks. For the x dimension, optical network 970 includes 16 OCS switches 950, one for each segment along the x dimension. For each building block 960, optical network 970 also includes an input external link and an output external link for each x-dimension segment. These external links connect the compute nodes on a segment to the OCS switch 950 for that segment. Each building block 960 has 16 segments along the x dimension. Optical network 970 therefore includes 32 external links 933 per building block (i.e., 16 input links and 16 output links), which connect the block's x-dimension segments to the corresponding OCS switches 950.

For the y dimension, optical network 970 includes 16 OCS switches 930, one for each segment along the y dimension. For each building block 960, optical network 970 also includes an input external link and an output external link for each y-dimension segment. These external links connect the compute nodes on a segment to the OCS switch 930 for that segment. Each building block 960 has 16 segments along the y dimension. Optical network 970 therefore includes 32 external links 931 per building block (i.e., 16 input links and 16 output links), which connect the block's y-dimension segments to the corresponding OCS switches 930.

For the z dimension, optical network 970 includes 16 OCS switches 940, one for each segment along the z dimension. For each building block 960, optical network 970 also includes an input external link and an output external link for each z-dimension segment. These external links connect the compute nodes on a segment to the OCS switch 940 for that segment. Each building block 960 has 16 segments along the z dimension. Optical network 970 therefore includes 32 external links 932 per building block (i.e., 16 input links and 16 output links), which connect the block's z-dimension segments to the corresponding OCS switches 940.
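The link and switch counts above follow directly from the building-block geometry. The quick check below assumes one input link and one output link per segment, as described. The superpod-wide link total is derived arithmetic and is not a figure stated in the text.

```python
SIDE = 4                        # compute nodes per dimension in a 4x4x4 block
DIMENSIONS = 3                  # x, y, z
SEGMENTS_PER_DIM = SIDE * SIDE  # 16 segments along each dimension
LINKS_PER_SEGMENT = 2           # one input link + one output link
NUM_BLOCKS = 64                 # building blocks in superpod 900

links_per_dim_per_block = SEGMENTS_PER_DIM * LINKS_PER_SEGMENT  # 32
links_per_block = links_per_dim_per_block * DIMENSIONS          # 96
ocs_switches = SEGMENTS_PER_DIM * DIMENSIONS                    # 48
total_external_links = links_per_block * NUM_BLOCKS             # 6144

print(links_per_dim_per_block, links_per_block, ocs_switches, total_external_links)
```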

Workload scheduler 910 can receive request data that includes a workload and data specifying a requested cluster of building blocks 960 for executing the workload. The request data can also include a priority for the workload. The priority can be expressed as a level, such as high, medium, or low. It can also be a number, such as a value in the range 1-100 or another appropriate range. For example, workload scheduler 910 can receive the request data from a user device or a cell scheduler, such as user device 110 or cell scheduler 140 of Figure 1. As described above, the request data can specify a target n-dimensional arrangement of compute nodes, such as a target arrangement of building blocks that include compute nodes.

Workload scheduler 910 can select a set of building blocks 960 to generate a workload cluster that matches the target arrangement specified by the request data. For example, workload scheduler 910 can identify a set of available, healthy building blocks in superpod 900. An available, healthy building block is one that is not executing another workload or part of a workload cluster, and that has not failed.

For example, workload scheduler 910 can maintain and update status data, such as a database, that indicates the status of each building block 960 in the superpod. The availability status of a building block 960 indicates whether it is assigned to a workload cluster. Its health status indicates whether it is working or has failed. Workload scheduler 910 can identify building blocks 960 that are not assigned to a workload and whose health status is working. The workload scheduler updates a building block 960's status data in two cases. One is when the block is assigned to a workload (e.g., used to generate a workload cluster for that workload). The other is when its health status changes (e.g., from working to failed or vice versa).

From the identified building blocks 960, workload scheduler 910 can select as many building blocks as the target arrangement requires. If the request data specifies one or more types of compute nodes, workload scheduler 910 can choose identified building blocks that have the requested type(s) of compute nodes. For example, the request data might specify a 2×2 arrangement with two TPU building blocks and two GPU building blocks. In that case, workload scheduler 910 can select two available, healthy building blocks with TPUs and two available, healthy building blocks with GPUs.
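The status tracking and type-aware selection described above can be sketched as follows. The `BlockStatus` record, the `select_blocks` and `assign` helpers, and the lowest-id-first tie-breaking are all hypothetical choices made for illustration. They are not the scheduler's actual data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class BlockStatus:
    """Hypothetical status record kept by the workload scheduler."""
    block_id: int                   # logical identifier, e.g. 0-63
    node_type: str                  # e.g. "TPU" or "GPU"
    healthy: bool = True            # working (True) or failed (False)
    workload: Optional[str] = None  # None means the block is available


def select_blocks(status: Dict[int, BlockStatus],
                  wanted: Dict[str, int]) -> Optional[List[int]]:
    """Pick available, healthy blocks of each requested node type.

    `wanted` maps a node type to a block count. Returns None when the
    request cannot be met, leaving the status data untouched.
    """
    chosen: List[int] = []
    for node_type, count in wanted.items():
        candidates = sorted(s.block_id for s in status.values()
                            if s.healthy and s.workload is None
                            and s.node_type == node_type)
        if len(candidates) < count:
            return None
        chosen.extend(candidates[:count])
    return chosen


def assign(status: Dict[int, BlockStatus], block_ids: List[int], workload: str) -> None:
    """Record the allocation so later requests skip these blocks."""
    for block_id in block_ids:
        status[block_id].workload = workload


# Blocks 0-3 hold TPUs and 4-7 hold GPUs; block 1 has failed, block 5 is busy.
status = {i: BlockStatus(i, "TPU" if i < 4 else "GPU") for i in range(8)}
status[1].healthy = False
status[5].workload = "job-a"
picked = select_blocks(status, {"TPU": 2, "GPU": 2})
print(picked)  # [0, 2, 4, 6]
assign(status, picked, "job-b")
```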

Workload scheduler 910 can also select building blocks 960 based on two priorities: that of each workload currently running in the superpod and that of the workload in the request data. Superpod 900 may not have enough available, healthy building blocks to generate a workload cluster for the requested workload. In that case, workload scheduler 910 can check whether any workload executing in superpod 900 has a lower priority than the requested workload. If so, workload scheduler 910 can reallocate building blocks from the workload clusters of one or more lower-priority workloads to the requested workload's cluster. For example, it can terminate a lower-priority workload, delay it, or shrink its workload cluster to free building blocks for the higher-priority workload.
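One possible preemption policy is sketched below. It assumes that higher numbers mean higher priority and that victims are chosen lowest priority first. It also models only the "terminate" option. The function name `plan_preemption` and the data layout are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# workload name -> (priority, building blocks it holds); higher number = higher priority
Running = Dict[str, Tuple[int, Set[int]]]


def plan_preemption(free: Set[int], running: Running,
                    needed: int, priority: int) -> Optional[List[str]]:
    """Choose lower-priority workloads whose blocks cover the shortfall.

    Returns [] if enough blocks are already free, a list of workloads to
    preempt (lowest priority first), or None if even preempting every
    lower-priority workload would not free enough blocks.
    """
    shortfall = needed - len(free)
    if shortfall <= 0:
        return []
    victims: List[str] = []
    for name, (prio, blocks) in sorted(running.items(), key=lambda kv: kv[1][0]):
        if prio >= priority:
            break  # never preempt equal- or higher-priority work
        victims.append(name)
        shortfall -= len(blocks)
        if shortfall <= 0:
            return victims
    return None


running = {"a": (20, {0, 1}), "b": (50, {2, 3}), "c": (90, {4, 5})}
print(plan_preemption({10}, running, needed=4, priority=60))  # ['a', 'b']
print(plan_preemption({10}, running, needed=4, priority=30))  # None
```

Delaying a workload or shrinking its cluster, which the text also allows, would need a different policy, such as reclaiming only some of a victim's blocks.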

Workload scheduler 910 can move a building block from one workload cluster to another simply by reconfiguring the optical network (e.g., by configuring the OCS switches as described below). After reconfiguration, the building block is connected to the building blocks of the higher-priority workload instead of those of the lower-priority workload. Similarly, if a building block of a higher-priority workload fails, workload scheduler 910 can reconfigure the optical network to move a building block from a lower-priority workload's cluster into the higher-priority workload's cluster.

Workload scheduler 910 can generate per-job configuration data 912 and provide it to OCS manager 920 of superpod 900. Per-job configuration data 912 can specify the building blocks 960 selected for the workload and how they are arranged. For example, a 2×2 arrangement has four spots for building blocks, and the per-job configuration data can specify which selected building block 960 goes in each spot.

Per-job configuration data 912 can identify the selected building blocks 960 by a logical identifier for each building block. For example, each building block 960 can have a unique logical identifier. In a particular example, the 64 building blocks 960 are numbered 0-63, and these numbers serve as the unique logical identifiers.
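The per-job configuration data could take a shape like the one below. It assumes the arrangement is given as a count of building blocks per dimension and that each spot maps to a logical identifier in 0-63. The `PerJobConfig` class and its validation rules are illustrative assumptions, not a format defined in the patent.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, Tuple

Spot = Tuple[int, int, int]  # position of a building block in the arrangement


@dataclass
class PerJobConfig:
    """Hypothetical shape for per-job configuration data 912."""
    arrangement: Tuple[int, int, int]                         # blocks per dimension
    placement: Dict[Spot, int] = field(default_factory=dict)  # spot -> logical id

    def validate(self, num_blocks: int = 64) -> None:
        """Check that each spot holds exactly one distinct, known block."""
        spots = set(product(*(range(n) for n in self.arrangement)))
        if set(self.placement) != spots:
            raise ValueError("every spot needs exactly one building block")
        ids = list(self.placement.values())
        if len(set(ids)) != len(ids):
            raise ValueError("a building block can fill only one spot")
        if any(not 0 <= b < num_blocks for b in ids):
            raise ValueError("unknown logical identifier")


# A 2x2 arrangement (one block deep in z) has four spots.
cfg = PerJobConfig((2, 2, 1), {(0, 0, 0): 17, (1, 0, 0): 3,
                               (0, 1, 0): 42, (1, 1, 0): 8})
cfg.validate()
```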

OCS manager 920 uses per-job configuration data 912 to configure OCS switches 930, 940, and/or 950 so that they form a workload cluster matching the specified arrangement. Each OCS switch 930, 940, and 950 includes a routing table for routing data between its physical ports. For example, suppose the output external link of an x-dimension segment of a first building block is to be connected to the input external link of the corresponding x-dimension segment of a second building block. The routing table of the OCS switch 950 for that x-dimension segment then indicates that data is routed between the two physical ports to which these links are connected.

OCS manager 920 can maintain port data that maps each port of each OCS switch 930, 940, and 950 to each logical port of each building block. For each x-dimension segment of a building block, the port data can specify which physical port of OCS switch 950 the external input link connects to and which one the external output link connects to. The port data can include the same information for each dimension of each building block 960 of superpod 900.

OCS manager 920 can use this port data to configure the routing tables of OCS switches 930, 940, and/or 950 and so generate the workload cluster for a workload. For example, suppose a first building block is to be connected to a second building block in a 2×1 arrangement, with the first block to the left of the second along the x dimension. OCS manager 920 would update the routing tables of the x-dimension OCS switches 950 to route data between the x-dimension segments of the two blocks. Every x-dimension segment of the building blocks needs to be connected, so OCS manager 920 can update the routing table of every OCS switch 950.

For each x-dimension segment, OCS manager 920 can update the routing table of the OCS switch 950 for that segment. In particular, it can map the physical port that the first building block's segment connects to onto the physical port that the second building block's segment connects to. Each x-dimension segment has both an input link and an output link. OCS manager 920 can therefore update the routing table to connect the first block's input link to the second block's output link, and the first block's output link to the second block's input link.
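The cross-connection described above can be sketched on a simplified model. In this model, a routing table is a symmetric port-to-port map, and block `b` uses physical ports `2b` (input) and `2b + 1` (output) on every x-dimension switch. Both the port layout and the helper names `connect` and `connect_x_neighbors` are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical port data for ONE x-dimension OCS switch:
# logical block id -> (physical port of input link, physical port of output link)
PortData = Dict[int, Tuple[int, int]]
# Routing table modeled as a symmetric port -> port map (a circuit, not packets).
RoutingTable = Dict[int, int]


def connect(table: RoutingTable, ports: PortData, left: int, right: int) -> None:
    """Cross-connect the x-segments of `left` and `right` on one switch."""
    left_in, left_out = ports[left]
    right_in, right_out = ports[right]
    # The output of each block feeds the input of the other block.
    pairs = ((left_out, right_in), (right_out, left_in))
    if any(p in table for pair in pairs for p in pair):
        raise ValueError("a port is already mapped to another block")
    for a, b in pairs:
        table[a] = b
        table[b] = a


def connect_x_neighbors(tables: Dict[int, RoutingTable],
                        port_data: Dict[int, PortData],
                        left: int, right: int) -> None:
    """Apply `connect` on every x-dimension switch (one per x segment)."""
    for switch_id, table in tables.items():
        connect(table, port_data[switch_id], left, right)


# 16 x-dimension switches; block b uses ports 2b (in) and 2b+1 (out).
port_data = {s: {b: (2 * b, 2 * b + 1) for b in range(64)} for s in range(16)}
tables: Dict[int, RoutingTable] = {s: {} for s in range(16)}
connect_x_neighbors(tables, port_data, left=0, right=1)
print(tables[0])  # {1: 2, 2: 1, 3: 0, 0: 3}
```

All conflicts are checked before any mapping is written, so a rejected request leaves a switch's table unchanged.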

OCS manager 920 can update the routing tables by obtaining the current routing table from each OCS switch, modifying the relevant tables, and sending the updated tables back to the corresponding switches. Alternatively, OCS manager 920 can send the OCS switches update data describing the changes, and each switch can apply those changes to its own routing table.

Once the OCS switches are configured with the updated routing tables, the workload cluster exists. Workload scheduler 910 can then cause the compute nodes of the workload cluster to execute the workload, for example by providing the workload to those compute nodes.

After the workload completes, workload scheduler 910 can set the status of each building block in the workload cluster back to available. It can also instruct OCS manager 920 to remove the connections between those building blocks. OCS manager 920 then updates the routing tables to remove the mappings between the physical OCS switch ports that were routing data between the building blocks.
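Teardown is the reverse of the connection step. The sketch below uses the same simplified symmetric port-to-port routing-table model. It drops every circuit that touches a released block's ports and marks the block available. The helper names and the string status values are hypothetical.

```python
from typing import Dict, Iterable, Tuple

RoutingTable = Dict[int, int]          # symmetric port -> port map
PortData = Dict[int, Tuple[int, int]]  # block id -> (input port, output port)


def disconnect(table: RoutingTable, ports: Tuple[int, int]) -> None:
    """Drop every circuit that touches one block's input or output port."""
    for port in ports:
        peer = table.pop(port, None)
        if peer is not None:
            table.pop(peer, None)  # remove the reverse direction too


def release(status: Dict[int, str], tables: Dict[int, RoutingTable],
            port_data: Dict[int, PortData], blocks: Iterable[int]) -> None:
    """Tear down a workload cluster and mark its blocks available again."""
    for block in blocks:
        for switch_id, table in tables.items():
            disconnect(table, port_data[switch_id][block])
        status[block] = "available"


# One switch whose table joins block 0 (ports 0/1) and block 1 (ports 2/3).
port_data = {0: {0: (0, 1), 1: (2, 3)}}
tables = {0: {1: 2, 2: 1, 3: 0, 0: 3}}
status = {0: "job-a", 1: "job-a"}
release(status, tables, port_data, [0, 1])
print(tables, status)  # {0: {}} {0: 'available', 1: 'available'}
```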

Using OCS switches this way to configure the optical fabric topology and generate workload clusters lets a superpod host multiple workloads dynamically and securely. Workload scheduler 910 can create and tear down workload clusters as new workloads arrive and existing ones complete. The routing between segments provided by the OCS switches also gives better isolation between workloads executing in the same superpod than conventional supercomputers provide. For example, the OCS switches physically decouple workloads from each other with an air gap between them. Conventional supercomputers rely on software to isolate workloads, which is more susceptible to data compromise.

Figure 10 is a flowchart of an example process 1000 for generating a workload cluster and executing a computing workload with it. The operations of process 1000 can be performed by a system that includes one or more data processing apparatus, such as processing system 130 of Figure 1.

The system receives request data that specifies a requested cluster of compute nodes (1010). For example, the request data can come from a user device. It can include a computing workload and data specifying a target n-dimensional arrangement of the compute nodes, such as a target n-dimensional arrangement of building blocks that include the compute nodes.

In some implementations, the request data can also specify the type of compute nodes in the building blocks. A superpod can include building blocks with different types of compute nodes. For example, a superpod can include 90 building blocks that each contain a 4×4×4 arrangement of TPUs. The same superpod can also include 10 special-purpose building blocks that each contain a 2×1 arrangement of special-purpose compute nodes. The request data can specify how many building blocks of each compute-node type are needed and how those building blocks are arranged.

The system selects a subset of building blocks for the requested cluster from a superpod that includes a set of building blocks (1020). As described above, each building block in the superpod can have a three-dimensional arrangement of compute nodes (e.g., a 4×4×4 arrangement). The system can select as many building blocks as the target arrangement requires. As described above, it can select building blocks that are healthy and available for the requested cluster.

The subset can be a proper subset of the building blocks, that is, a subset that does not include every member of the set. For example, generating a workload cluster that matches the target arrangement of compute nodes may require only some of the building blocks.

The system generates a workload cluster that includes the compute nodes of the selected subset (1030). The workload cluster can have an arrangement of building blocks that matches the target arrangement in the request data. For example, if the request data specifies a 4×8×4 arrangement of compute nodes, the workload cluster can include two building blocks arranged like workload cluster 330 of Figure 3.
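The 4×8×4 example can be turned into a building-block arrangement by dividing each dimension by the 4×4×4 block shape. The sketch below assumes the target is an exact multiple of the block shape in every dimension and rejects any other target. The function name `blocks_for` is hypothetical.

```python
from math import prod
from typing import Tuple

BLOCK_SHAPE = (4, 4, 4)  # compute nodes per building block along x, y, z


def blocks_for(target_nodes: Tuple[int, int, int]) -> Tuple[int, ...]:
    """Return the building-block arrangement that tiles a target node shape."""
    shape = []
    for want, side in zip(target_nodes, BLOCK_SHAPE):
        if want <= 0 or want % side:
            raise ValueError(f"{want} nodes cannot be tiled by blocks of {side}")
        shape.append(want // side)
    return tuple(shape)


arrangement = blocks_for((4, 8, 4))
print(arrangement, prod(arrangement))  # (1, 2, 1) 2
```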

To generate the workload cluster, the system can configure routing data for each dimension of the cluster. As described above, the superpod can include an optical network with one or more OCS switches for each dimension of the building blocks. The routing data for a dimension can include the routing tables of the OCS switches for that dimension. As described above with reference to Figure 9, those routing tables can be configured to route data between the compute nodes of the appropriate segments along each dimension.

The system causes the compute nodes of the workload cluster to execute the computing workload (1040). For example, the system can provide the computing workload to those compute nodes. While the workload executes, the configured OCS switches route data between the building blocks of the workload cluster. The compute nodes are not physically connected in the target arrangement, but the configured OCS switches route data between them as if they were.

For example, the compute nodes of each segment of a dimension can send data through the OCS switch to the other compute nodes of that segment in different building blocks. They communicate as if all of the segment's compute nodes were physically connected in a single physical segment. This differs from a packet-switched network: the workload cluster has a true end-to-end optical path between corresponding segments, with no packet switching or buffering in between. Packet switching adds latency because the switch must receive each packet, buffer it, and transmit it again on another port.

After the computing workload completes, the system can release the building blocks for other workloads. For example, it can set the building blocks' status to available and update the routing data so that it no longer routes data between the building blocks of the workload cluster.

Figure 11 is a flowchart of an example process 1100 for reconfiguring an optical network to replace a failed building block. The operations of process 1100 can be performed by a system that includes one or more data processing apparatus, such as processing system 130 of Figure 1.

The system causes the compute nodes of a workload cluster to execute a computing workload (1110). For example, the system can use process 1000 of Figure 10 to generate the workload cluster and have its compute nodes execute the workload.

The system receives data indicating that a building block of the workload cluster has failed (1120). For example, if one or more compute nodes of a building block fail, another component (e.g., a monitoring component) can determine that the building block has failed and notify the system.

The system identifies an available building block (1130). For example, it can find an available, healthy building block in the same superpod as the workload cluster's other building blocks. The system can base this on the building-block status data that it maintains.

The system replaces the failed building block with the identified available building block (1140). To do so, it can update the routing data of one or more OCS switches in the optical network that connects the building blocks. For example, it can update their routing tables to remove the connections between the failed building block and the other building blocks of the workload cluster. It can also update those routing tables to connect the identified building block to the other building blocks of the workload cluster.

The system can place the identified building block in the logical spot that the failed building block occupied. As described above, an OCS switch's routing table can map the physical port connected to a segment of one building block to the physical port connected to the corresponding segment of another building block. The system can therefore perform the replacement by updating each such mapping to point to the identified building block's segment instead of the failed building block's segment.

For example, suppose the input external link of a particular x-dimension segment of the failed building block is connected to a first port of an OCS switch. Suppose the input external link of the corresponding segment of the identified available building block is connected to a second port of the same switch. Suppose also that the routing table maps the first port to a third port, which is connected to the corresponding x-dimension segment of another building block. To perform the replacement, the system can update the routing table to map the second port, rather than the first port, to the third port. The system repeats this for each segment of the failed building block.
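The first/second/third-port example can be expressed on the simplified symmetric port-to-port routing-table model. The port numbers 1, 2, and 3 below stand for the first, second, and third ports of the example. The helper name `replace_port` is hypothetical.

```python
from typing import Dict

RoutingTable = Dict[int, int]  # symmetric port -> port map on one OCS switch


def replace_port(table: RoutingTable, failed_port: int, spare_port: int) -> None:
    """Re-point the circuit on `failed_port` to `spare_port`."""
    if spare_port in table:
        raise ValueError("spare port is already in use")
    peer = table.pop(failed_port)   # the third port in the example
    table.pop(peer)                 # forget the reverse mapping as well
    table[spare_port] = peer
    table[peer] = spare_port


# First port = 1 (failed block), second port = 2 (spare block), third port = 3.
table = {1: 3, 3: 1}
replace_port(table, failed_port=1, spare_port=2)
print(table)  # {2: 3, 3: 2}
```

A full replacement would call this once for the input-link port and once for the output-link port of every segment of the failed building block, each on the switch that serves that segment's logical axis.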

As described above, the optical network fabric for a superpod can include one or more OCS switches for each logical axis of the building blocks. A 4×4×4 building block has 16 logical axes along each dimension. The optical network fabric can therefore include 48 OCS switches, which can be configured to connect building blocks in various logical arrangements.

Each segment of each building block has an input connection and an output connection to the OCS switch for the segment's logical axis. In a 64-building-block superpod, the OCS switch for each axis therefore has 128 ports. So if one OCS switch is used per logical axis, each switch needs at least 128 ports. One-to-many switches can help when the OCS switches connecting the building blocks have a given size (e.g., a given port count). They can increase the number of building blocks a superpod can include, reduce the number of ports used on each OCS switch, or both.
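For the one-switch-per-logical-axis case, the port requirement grows linearly with the number of building blocks: two ports per block. The helpers below (hypothetical names) check the 128-port figure and invert it to find the largest superpod a given switch radix can serve. The one-to-many case, which spreads a logical axis across several OCS switches, is not modeled here.

```python
def ports_per_axis_switch(num_blocks: int, links_per_segment: int = 2) -> int:
    """Ports needed when a single OCS switch serves one logical axis."""
    return num_blocks * links_per_segment


def max_blocks_per_switch(radix: int, links_per_segment: int = 2) -> int:
    """Largest superpod that one switch per logical axis can serve at `radix`."""
    return radix // links_per_segment


print(ports_per_axis_switch(64))    # 128
print(max_blocks_per_switch(128))   # 64
print(max_blocks_per_switch(100))   # 50
```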

Figure 12 illustrates part of an example superpod 1200 that includes building blocks 1211-1214 and 1×2 optical switches 1261-1264. A 1×2 optical switch is an example of a one-to-many optical switch, with one input and two outputs. Despite the terms input and output, light can pass through the switch in either direction, that is, from an output to the input or from the input to an output. As described below, other one-to-many switches with different numbers of outputs can be used. Examples include a 1×3 optical switch (one input and three outputs), a 1×4 optical switch (one input and four outputs), or another appropriate one-to-many optical switch.

For clarity, this example shows only the connections of segments 1221-1224, which lie on the same logical axis along the x dimension, to two OCS switches 1271 and 1272. However, the segments of each building block on each logical axis can be connected to the two OCS switches for that logical axis. For example, segments 1231-1234 can be connected to two OCS switches (not shown), segments 1241-1244 can be connected to two OCS switches (not shown), and segments 1251-1254 can be connected to two OCS switches (not shown). The segments along each other logical axis in the x dimension, each other logical axis in the y dimension, and each logical axis in the z dimension can likewise be connected to the two OCS switches for their logical axis.

Superpod 1200 can include other building blocks having the same configuration as building blocks 1211-1214. For example, the superpod can include 64 building blocks, of which only four are shown in Figure 12. The segments of these building blocks can be connected to the corresponding OCS switches for their logical axes.

One side of each segment 1221-1224 is connected, e.g., using a fiber optic cable, to the input of the corresponding 1×2 switch 1261-1264 for that segment. One output of each 1×2 switch 1261-1264 is connected, e.g., using a fiber optic cable, to OCS switch 1271, and the other output of each 1×2 switch 1261-1264 is connected to the other OCS switch 1272. Each segment of each building block can be connected on one side to a 1×2 switch for that segment. The 1×2 switch for a segment can be selectively set to connect that side of the segment to either OCS switch 1271 or OCS switch 1272 (for the logical axis shown). A superpod of 64 building blocks, each having a 4×4×4 arrangement, can include 3,072 1×2 switches, one for each segment of each building block.
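The 3,072 figure is the number of building blocks multiplied by the number of segments per building block. The following Python sketch (with a hypothetical function name) shows the arithmetic, assuming one segment per logical axis and one one-to-many switch per segment:

```python
def one_to_many_switch_count(num_blocks, block_shape):
    """Count one-to-many switches, assuming one per segment of every block."""
    x, y, z = block_shape
    # One segment per logical axis: y*z along x, x*z along y, x*y along z.
    segments_per_block = y * z + x * z + x * y
    return num_blocks * segments_per_block


print(one_to_many_switch_count(64, (4, 4, 4)))  # 3072
```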

The other side of each segment is connected directly to either OCS switch 1271 or OCS switch 1272. For example, the other sides of segments 1221 and 1223 are connected to OCS switch 1271, while the other sides of segments 1222 and 1224 are connected to OCS switch 1272. In this way, building blocks 1211-1214 use fewer ports on each of OCS switches 1271 and 1272 than they would if a single OCS switch served the logical axis and both the input and output connections of each segment were connected to it. For example, there are six connections to OCS switch 1271 for the four building blocks shown. If OCS switch 1271 were the only switch for the logical axis that includes segments 1221-1224, there would be eight connections to it, two for each of the four segments 1221-1224. OCS switch 1272 similarly has six connections instead of eight.
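The six-versus-eight comparison can be generalized with a small Python sketch. It assumes that the directly wired segment ends of one logical axis are spread as evenly as possible over that axis's k OCS switches and that each 1×k switch has one output cabled to each of those switches; `ports_per_ocs_switch` is a hypothetical name, and k = 1 models the single-switch case in which both ends of every segment land on one switch.

```python
import math


def ports_per_ocs_switch(num_blocks, k):
    """Ports used on each of the k OCS switches serving one logical axis."""
    fixed_ends = math.ceil(num_blocks / k)  # directly wired segment ends on this switch
    fanout_outputs = num_blocks             # one output from every 1×k switch
    return fixed_ends + fanout_outputs


print(ports_per_ocs_switch(4, 1), ports_per_ocs_switch(4, 2))    # 8 6
print(ports_per_ocs_switch(64, 1), ports_per_ocs_switch(64, 2))  # 128 96
```

Under these assumptions, a 64-building-block superpod with 1×2 switches needs 96 ports per OCS switch rather than 128.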

The workload scheduler can create workload clusters from building blocks 1211-1214 (and/or other building blocks of superpod 1200) by configuring OCS switches 1271 and 1272 (and the OCS switches for the other logical axes) as well as the 1×2 switch for each segment. As described above, the workload scheduler can select building blocks for a workload cluster and configure the routing tables of the OCS switches to route data between the segments of the building blocks in the workload cluster.

In this example, where each segment along a given logical axis is connected to a pair of OCS switches, the workload scheduler configures routing tables for the pair of OCS switches of each logical axis so that the segments of the building blocks in the workload cluster on that logical axis can communicate with each other. Likewise, the workload scheduler can set each 1×2 switch to connect its segment to one of the two OCS switches, depending on which segment of another building block that segment will communicate with.
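One way to picture the 1×2 switch decision is a lookup: a segment's 1×2 switch is set to whichever output reaches the OCS switch that holds the directly wired end of the neighboring segment. The Python sketch below encodes the Figure 12 wiring as a hypothetical table; the table names and the 1-based output numbering (O1, O2) are assumptions for illustration.

```python
# Hypothetical model of the Figure 12 wiring for one logical axis: the OCS
# switch wired directly to the left (fixed) end of each segment.
FIXED_END_SWITCH = {1221: 1271, 1222: 1272, 1223: 1271, 1224: 1272}

# OCS switch cabled to output O1 and O2 of every 1×2 switch on this axis.
FANOUT_OUTPUTS = (1271, 1272)


def select_output(neighbor_segment):
    """Return the 1×2 output (1 for O1, 2 for O2) that reaches the OCS switch
    holding the neighbor segment's directly wired end."""
    target_switch = FIXED_END_SWITCH[neighbor_segment]
    return FANOUT_OUTPUTS.index(target_switch) + 1


print(select_output(1223))  # 1 -> O1, toward OCS switch 1271
print(select_output(1222))  # 2 -> O2, toward OCS switch 1272
```

The first call corresponds to placing building block 1211 to the left of building block 1213, and the second to placing it to the left of building block 1212.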

For example, suppose building block 1211 is to be logically arranged to the left of building block 1213, as shown in Figure 12. In this example, segment 1221 needs to be able to communicate with segment 1223, segment 1231 with segment 1233, segment 1241 with segment 1243, and segment 1251 with segment 1253.

Specifically, the compute nodes at the right ends of segments 1221, 1231, 1241, and 1251 should be connected to the compute nodes at the left ends of segments 1223, 1233, 1243, and 1253, respectively. For example, the compute node at the right end of segment 1221 should be connected to the compute node at the left end of segment 1223. The compute nodes at the right ends of segments 1221, 1231, 1241, and 1251 are connected to the inputs of their respective 1×2 switches (switch 1261 for segment 1221). The compute nodes at the left ends of segments 1223, 1233, 1243, and 1253 are connected to OCS switch 1271. Therefore, to connect the right ends of segments 1221, 1231, 1241, and 1251 to the left ends of segments 1223, 1233, 1243, and 1253, respectively, the workload scheduler can configure the 1×2 switch for each of segments 1221, 1231, 1241, and 1251 to connect its input to the output (O1) that is connected to OCS switch 1271. For example, 1×2 switch 1261 would be configured so that its input is routed to the output connected to OCS switch 1271. Because light can propagate through the 1×2 switch in both directions, data can then be routed in both directions between the compute node at the right end of segment 1221 and the compute node at the left end of segment 1223 via 1×2 switch 1261 and OCS switch 1271.

If the workload cluster has only two building blocks along the x dimension (e.g., a 2×4×4 arrangement), the compute nodes at the right ends of segments 1223, 1233, 1243, and 1253 should also be connected to the compute nodes at the left ends of segments 1221, 1231, 1241, and 1251, respectively. For example, the compute node at the right end of segment 1223 should be connected to the compute node at the left end of segment 1221. The compute nodes at the right ends of segments 1223, 1233, 1243, and 1253 are connected to the inputs of their respective 1×2 switches (switch 1263 for segment 1223). The compute nodes at the left ends of segments 1221, 1231, 1241, and 1251 are connected to OCS switch 1271. Therefore, to connect the right ends of segments 1223, 1233, 1243, and 1253 to the left ends of segments 1221, 1231, 1241, and 1251, respectively, the workload scheduler can configure the 1×2 switch for each of segments 1223, 1233, 1243, and 1253 to connect its input to the output (O1) that is connected to OCS switch 1271. For example, 1×2 switch 1263 would be configured so that its input is routed to the output (O1) connected to OCS switch 1271.

The workload scheduler can also configure OCS switch 1271 to route data between corresponding segments. For example, the workload scheduler can configure the routing table of OCS switch 1271 to route data between segments 1221 and 1223. Specifically, the workload scheduler can configure the routing table to route data received at the port connected to output O1 of 1×2 switch 1261 to the port connected to the compute node at the left end of segment 1223. Similarly, the workload scheduler can configure the routing table to route data received at the port connected to output O1 of 1×2 switch 1263 to the port connected to the compute node at the left end of segment 1221. The workload scheduler can configure routing tables in a similar manner for each of the other segments 1231, 1241, and 1251 and their corresponding segments 1233, 1243, and 1253.
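The routing-table entries for this ring can be derived mechanically. The following Python sketch is illustrative only: it models each OCS routing table as a dictionary from input port to output port, and the port names `("fanout", s)` (the port cabled to an output of 1×2 switch `s`) and `("fixed", seg)` (the port cabled to the left end of segment `seg`) are hypothetical conventions, not a real switch API.

```python
def build_ring_routes(ring, fixed_end_switch, fanout_switch_of):
    """Build per-OCS-switch routing tables for segments joined in a ring.

    ring: segment ids in logical order; the right end of ring[i] is joined
        to the left end of ring[i + 1], wrapping around at the end.
    fixed_end_switch: segment id -> OCS switch wired directly to its left end.
    fanout_switch_of: segment id -> 1×2 switch attached to its right end.
    """
    tables = {}
    for i, seg in enumerate(ring):
        neighbor = ring[(i + 1) % len(ring)]
        # The 1×2 switch of `seg` must be set toward this OCS switch.
        ocs = fixed_end_switch[neighbor]
        in_port = ("fanout", fanout_switch_of[seg])
        tables.setdefault(ocs, {})[in_port] = ("fixed", neighbor)
    return tables


# Building block 1211 to the left of 1213, two blocks along x (Figure 12).
print(build_ring_routes([1221, 1223],
                        {1221: 1271, 1223: 1271},
                        {1221: 1261, 1223: 1263}))
```

Both entries land on OCS switch 1271 because the left ends of segments 1221 and 1223 are both wired to it, which is why 1×2 switches 1261 and 1263 are both set to O1.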

In another example, suppose building block 1211 is to be logically arranged to the left of building block 1212 instead of building block 1213. In this example, segment 1221 needs to be able to communicate with segment 1222, segment 1231 with segment 1232, segment 1241 with segment 1242, and segment 1251 with segment 1252.

Specifically, the compute nodes at the right ends of segments 1221, 1231, 1241, and 1251 should be connected to the compute nodes at the left ends of segments 1222, 1232, 1242, and 1252, respectively. The compute nodes at the right ends of segments 1221, 1231, 1241, and 1251 are connected to the inputs of their respective 1×2 switches (switch 1261 for segment 1221). However, the compute nodes at the left ends of segments 1222, 1232, 1242, and 1252 are connected to OCS switch 1272. Therefore, to connect the right ends of segments 1221, 1231, 1241, and 1251 to the left ends of segments 1222, 1232, 1242, and 1252, respectively, the workload scheduler can configure the 1×2 switch for each of segments 1221, 1231, 1241, and 1251 to connect its input to the output (O2) that is connected to OCS switch 1272. For example, 1×2 switch 1261 would be configured so that its input is routed to the output (O2) connected to OCS switch 1272.

If the workload cluster has only two building blocks along the x dimension (e.g., a 2×4×4 arrangement), the compute nodes at the right ends of segments 1222, 1232, 1242, and 1252 should also be connected to the compute nodes at the left ends of segments 1221, 1231, 1241, and 1251, respectively. For example, the compute node at the right end of segment 1222 should be connected to the compute node at the left end of segment 1221. The compute nodes at the right ends of segments 1222, 1232, 1242, and 1252 are connected to the inputs of their respective 1×2 switches (switch 1262 for segment 1222). The compute nodes at the left ends of segments 1221, 1231, 1241, and 1251 are connected to OCS switch 1271. Therefore, to connect the right ends of segments 1222, 1232, 1242, and 1252 to the left ends of segments 1221, 1231, 1241, and 1251, respectively, the workload scheduler can configure the 1×2 switch for each of segments 1222, 1232, 1242, and 1252 to connect its input to the output (O1) that is connected to OCS switch 1271. For example, 1×2 switch 1262 would be configured so that its input is routed to the output (O1) connected to OCS switch 1271.

The workload scheduler can also configure OCS switches 1271 and 1272 to route data between corresponding segments. For example, the workload scheduler can configure the routing table of OCS switch 1272 to route data received from the outputs (O2) of the 1×2 switches connected to the compute nodes at the right ends of segments 1221, 1231, 1241, and 1251 to the compute nodes at the left ends of segments 1222, 1232, 1242, and 1252, respectively. Specifically, the workload scheduler can configure the routing table to route data received at the port connected to output O2 of 1×2 switch 1261 to the port connected to the compute node at the left end of segment 1222. Similarly, the workload scheduler can configure the routing table of OCS switch 1271 to route data received at the port connected to output O1 of 1×2 switch 1262 to the port connected to the compute node at the left end of segment 1221. The workload scheduler can configure routing tables in a similar manner for each of the other segments 1231, 1241, and 1251 and their corresponding segments 1232, 1242, and 1252.

As noted above, other one-to-many switches can be used in place of 1×2 switches. For example, a superpod can include three OCS switches for each logical axis. In this example, one side of each segment of each building block can be connected to the input of a 1×3 switch having three outputs. The three outputs of the 1×3 switch for a segment can be connected to the three OCS switches for that segment's logical axis. The 1×3 switches and OCS switches can be configured in a manner similar to that described above to connect the segments of a building block to the corresponding segments of other building blocks. Compared to using 1×2 switches and two OCS switches per logical axis, using 1×3 switches and three OCS switches per logical axis enables a larger superpod for OCS switches of a given size and/or uses fewer ports on each OCS switch. However, it also requires more OCS switches for each logical axis of the building blocks. Other one-to-many switches, such as 1×4 or 1×5 switches, can also be used, with the number of OCS switches per logical axis equal to the number of outputs of the one-to-many switch.
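To make this trade-off concrete, the following Python sketch finds the largest number of building blocks that one logical axis can serve with k OCS switches of a fixed port count. It assumes that each OCS switch carries one output of every 1×k switch plus an even share (rounded up) of the directly wired segment ends, with k = 1 meaning a single OCS switch and no one-to-many switch. The 128-port budget and the function name are illustrative only.

```python
import math


def max_building_blocks(port_budget, k):
    """Largest number of building blocks whose logical axis fits on k OCS
    switches of `port_budget` ports each (fixed ends spread evenly)."""
    n = 0
    # Ports per switch, ceil(n / k) + n, grows with n, so stop at the first overflow.
    while math.ceil((n + 1) / k) + (n + 1) <= port_budget:
        n += 1
    return n


for k in (1, 2, 3, 4):
    print(f"1x{k}: up to {max_building_blocks(128, k)} building blocks")
```

Under these assumptions, 128-port OCS switches support 64, 85, 96, and 102 building blocks for k = 1, 2, 3, and 4, respectively. The gain per extra output shrinks while the number of OCS switches per axis grows linearly with k.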

Figure 13 illustrates an example workload cluster 1300. Workload cluster 1300 is an 8×8×8 cluster composed of eight building blocks: building blocks 1311-1317 and one building block, below building block 1315, that is not shown. Each building block is a 4×4×4 building block having 16 segments of compute nodes along the 16 logical axes of the x dimension, 16 segments along the 16 logical axes of the y dimension, and 16 segments along the 16 logical axes of the z dimension. For this example, assume that the superpod from which workload cluster 1300 is created includes two OCS switches for each logical axis and a respective 1×2 switch for each segment of each building block.
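The eight-block count follows from dividing the target arrangement by the building-block shape in each dimension. A minimal Python sketch of that tiling, with a hypothetical function name and grid-coordinate convention:

```python
from itertools import product


def block_grid(target, block):
    """Return the building-block grid needed to tile `target` compute nodes
    with blocks of shape `block`, plus every grid position."""
    if any(t % b for t, b in zip(target, block)):
        raise ValueError("target arrangement must be a multiple of the block shape")
    dims = tuple(t // b for t, b in zip(target, block))
    return dims, list(product(*(range(d) for d in dims)))


dims, positions = block_grid((8, 8, 8), (4, 4, 4))
print(dims, len(positions))  # (2, 2, 2) 8
```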

The workload scheduler can create workload cluster 1300 by configuring the OCS switches and 1×2 switches to connect the segments of each building block to the corresponding segments of other building blocks. For example, building block 1311 is logically above building block 1312. The workload scheduler can configure the OCS switch for each logical axis in the y dimension so that it routes data from the top compute node of each y-dimension segment of building block 1312 to the bottom compute node of the corresponding segment of building block 1311. For example, the workload scheduler can configure the OCS switch for logical axis 1330 (the leftmost, frontmost segment along the y dimension) so that it routes data between compute node 1331 of building block 1312 and compute node 1332 of building block 1311. The workload scheduler can also configure the 1×2 switch for each y-dimension segment of building blocks 1311 and 1312 to connect those segments to the appropriate OCS switches, as described above with respect to Figure 12.

Similarly, building block 1311 is logically to the left of building block 1313. The workload scheduler can configure the OCS switch for each logical axis in the x dimension so that it routes data from the rightmost compute node of each x-dimension segment of building block 1311 to the leftmost compute node of the corresponding segment of building block 1313. For example, the workload scheduler can configure the OCS switch for logical axis 1320 (the topmost, frontmost segment along the x dimension) so that it routes data between compute node 1321 of building block 1311 and compute node 1322 of building block 1313. The workload scheduler can also configure the 1×2 switch for each x-dimension segment of building blocks 1311 and 1313 to connect those segments to the appropriate OCS switches, as described above with respect to Figure 12.

Similarly, building block 1314 is logically in front of building block 1317 along the z dimension. The workload scheduler can configure the OCS switch for each logical axis in the z dimension so that it routes data from the rearmost compute node of each z-dimension segment of building block 1314 to the frontmost compute node of the corresponding segment of building block 1317. For example, the workload scheduler can configure the OCS switch for logical axis 1340 (the topmost, rightmost segment along the z dimension) so that it routes data between compute node 1341 of building block 1314 and compute node 1342 of building block 1317. The workload scheduler can also configure the 1×2 switch for each z-dimension segment of building blocks 1314 and 1317 to connect those segments to the appropriate OCS switches, as described above with respect to Figure 12.

The workload scheduler can configure the OCS switches for each logical axis and the 1×2 switch for each segment so that the segments of each building block communicate with the corresponding segments of adjacent building blocks. The segment corresponding to a given segment is the segment of another building block that lies on the same logical axis as the given segment.

Figure 14 is a flowchart of an example process 1400 for generating a workload cluster and performing a computational workload using the workload cluster. The operations of process 1400 can be performed by a system that includes one or more data processing apparatus. For example, the operations of process 1400 can be performed by processing system 130 of Figure 1.

The system receives request data specifying the compute nodes requested for a computational workload (1410). For example, the request data can be received from a user device. The request data can include the computational workload and data specifying a target n-dimensional arrangement of the compute nodes. For example, the request data can specify a target n-dimensional arrangement of the building blocks that include the compute nodes.

The system selects, from a superpod that includes a set of building blocks, a subset of the building blocks for the requested cluster (1420). As described above, the superpod can include a set of building blocks each having a three-dimensional arrangement of compute nodes (e.g., a 4×4×4 arrangement). The system can select a number of building blocks that matches the number defined by the target arrangement. As described above, the system can select building blocks that are healthy and available for the requested cluster.
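A minimal Python sketch of this selection step follows. It assumes that each building block is described by a record with hypothetical `id`, `healthy`, and `allocated` fields and that the target arrangement is a whole number of 4×4×4 building blocks in each dimension. It simply takes the first eligible blocks and is not a placement policy from the source.

```python
import math


def select_blocks(blocks, target, block_shape=(4, 4, 4)):
    """Return ids of enough healthy, unallocated building blocks for `target`.

    blocks: list of dicts with hypothetical keys "id", "healthy", "allocated".
    target: requested arrangement of compute nodes, e.g. (8, 8, 8).
    """
    if any(t % b for t, b in zip(target, block_shape)):
        raise ValueError("target must be a whole number of building blocks per dimension")
    needed = math.prod(t // b for t, b in zip(target, block_shape))
    candidates = [b["id"] for b in blocks if b["healthy"] and not b["allocated"]]
    if len(candidates) < needed:
        raise RuntimeError(f"need {needed} building blocks, only {len(candidates)} available")
    return candidates[:needed]


# Hypothetical 64-block superpod: blocks 0-9 in use, every fifth block unhealthy.
pod = [{"id": i, "healthy": i % 5 != 0, "allocated": i < 10} for i in range(64)]
print(select_blocks(pod, (8, 8, 8)))
```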

In the superpod, each building block can be connected to an optical network that includes two or more OCS switches for each of m dimensions. For example, the optical network can include two OCS switches for each logical axis of each of the m dimensions. For each of the m dimensions, each building block can include one or more segments of compute nodes interconnected along that dimension. For example, each building block can include a segment along each logical axis of that dimension.

Each segment can include a first compute node at a first end of the segment and a second compute node at a second end of the segment, opposite the first end. If the segment includes more than two compute nodes, the additional compute nodes can lie within the segment between the first and second compute nodes.

For a first portion of the segments, the first compute node is connected to a first OCS switch of the two or more OCS switches for the dimension. For each of one or more additional portions of the segments, the first compute node is connected to a respective additional OCS switch of the two or more OCS switches for the dimension. As described above, the optical network can include two OCS switches for each logical axis of the building blocks. For a given logical axis, the first compute nodes of some building blocks can be connected to the first of the two OCS switches, and the first compute nodes of the other building blocks can be connected to the second of the two OCS switches. These connections can be direct connections, without any intervening one-to-many switch.

The portions can be assigned to the OCS switches so that the OCS switches are balanced. That is, if the optical network includes two OCS switches for each logical axis, half (or approximately half) of the segments on that logical axis can be assigned to the first OCS switch, and the other half (or approximately half) can be assigned to the second OCS switch.
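Round-robin assignment is one simple way to achieve this balance. The Python sketch below uses a hypothetical helper. It assigns the directly wired end of each segment on one logical axis to the axis's OCS switches in turn, so that no switch receives more than one segment more than any other.

```python
from collections import Counter


def assign_fixed_ends(segment_ids, ocs_switches):
    """Spread the directly wired ends of one logical axis's segments across
    that axis's OCS switches in round-robin order (hypothetical helper)."""
    return {s: ocs_switches[i % len(ocs_switches)] for i, s in enumerate(segment_ids)}


# Reproduces the Figure 12 wiring for segments 1221-1224.
print(assign_fixed_ends([1221, 1222, 1223, 1224], (1271, 1272)))
# For 64 building blocks, each of two switches receives 32 directly wired ends.
print(Counter(assign_fixed_ends(range(64), ("ocs_a", "ocs_b")).values()))
```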

The second compute node of each segment is connected to the input of a respective one-to-many optical switch having an input and multiple outputs. For example, a 1×2 optical switch has one input and two outputs. A first output of the one-to-many optical switch can be connected to the first OCS switch, and each additional output is connected to a respective additional OCS switch. For example, if the optical network includes two OCS switches for each logical axis and a 1×2 optical switch connected to each segment, the second output of the 1×2 switch for a segment can be connected to the second OCS switch for that segment's logical axis.

The system determines a logical arrangement of the selected subset that matches the target arrangement of compute nodes (1430). The logical arrangement can be an in-memory model of the layout of the building blocks. For each of the m dimensions, the logical arrangement can define the connections between the segments of each building block and the corresponding segments of one or more other building blocks. For example, the logical arrangement can specify which building block occupies which position in the target arrangement of compute nodes. In a particular example, if the target arrangement is an 8×8×8 arrangement similar to the workload cluster of Figure 13, the logical arrangement can specify which building block is at the top-front-left position, which is at the top-front-right position, which is at the bottom-front-left position, which is at the bottom-front-right position, and so on.
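As an illustration of such an in-memory model, the following Python sketch maps each position in the grid of building blocks to one selected building block. The `(x, y, z)` grid-coordinate convention, the block ids, and the function name are assumptions for the example.

```python
from itertools import product


def logical_arrangement(block_ids, grid_dims):
    """Map each (x, y, z) position in the building-block grid to a block id."""
    positions = list(product(*(range(d) for d in grid_dims)))
    if len(positions) != len(block_ids):
        raise ValueError("number of building blocks must match the grid")
    return dict(zip(positions, block_ids))


layout = logical_arrangement([11, 12, 13, 14, 16, 17, 18, 19], (2, 2, 2))
print(layout[(0, 0, 0)], layout[(1, 1, 1)])  # 11 19
```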

Based on these positions, the segments of building blocks that are on the same logical axis and adjacent along that axis are connected to each other. For example, if one building block is logically arranged above another, each y-dimension segment of the top building block is connected to the corresponding segment of the bottom building block on the same logical axis.
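The connections implied by a logical arrangement can be listed by stepping one position along each dimension. The Python sketch below assumes wraparound at the end of each dimension, as in the two-building-block case above in which the last block along x also connects back to the first. It also assumes, as a hypothetical choice, that a dimension containing only one building block produces no links. The block names are placeholders.

```python
from itertools import product


def neighbor_links(layout, grid_dims):
    """List (dimension, from_block, to_block) for logically adjacent blocks.

    Wraps around at the end of each dimension, so every line of blocks along
    a dimension forms a ring (with two blocks, A->B and B->A both appear).
    """
    links = []
    for pos, block in sorted(layout.items()):
        for dim, size in enumerate(grid_dims):
            if size == 1:
                continue  # assumption: no neighbor along a single-block dimension
            nxt = list(pos)
            nxt[dim] = (pos[dim] + 1) % size
            links.append((dim, block, layout[tuple(nxt)]))
    return links


grid = (2, 2, 2)
layout = {pos: f"B{i}" for i, pos in enumerate(product(*(range(d) for d in grid)))}
links = neighbor_links(layout, grid)
print(len(links))  # 24: 8 blocks x 3 dimensions
```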

The system generates a workload cluster of compute nodes that includes the subset of building blocks, connected to each other according to the logical arrangement (1440). The system can generate the workload cluster using constituent operations 1450 and 1460.

For each dimension of the workload cluster, the system configures respective routing data for each of the two or more OCS switches for that dimension (1450). The routing data for each dimension specifies how data of the computational workload is routed between compute nodes along that dimension of the workload cluster.

For example, if the optical network includes two OCS switches for each logical axis of each dimension, the system can configure the OCS switches for each logical axis to route data between the segments along that logical axis. The routing data can cause the OCS switches to route data between adjacent segments on the same logical axis.

The system configures at least some of the one-to-many switches based on the logical arrangement, so that the second compute node of each segment is connected to the same OCS switch as the first compute node of the corresponding segment to which that second compute node is connected in the logical arrangement (1460). For example, if a first building block is above a second building block, a segment of the first building block needs to be connected to the corresponding segment of the second building block on the same logical axis. If the first compute node of that segment of the first building block is connected to the first OCS switch for that logical axis, the one-to-many switch for the corresponding segment of the second building block can be configured so that its input is routed to the first OCS switch.

The system causes the compute nodes of the workload cluster to perform the computational workload (1470). For example, the system can provide the computational workload to the compute nodes of the workload cluster. While the computational workload is being performed, the configured OCS switches and one-to-many optical switches can route data between the building blocks of the workload cluster. Although the compute nodes are not physically connected in the target arrangement, the configured OCS switches and one-to-many optical switches can route data between the compute nodes of the building blocks as if the compute nodes were physically connected in the target arrangement.

Embodiments of the subject matter and the operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a computer storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. Moreover, while a computer storage medium is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially generated propagated signal. The computer storage medium can also be, or be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices).

The operations described in this specification can be implemented as operations performed by a data processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.

The term "data processing apparatus" encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, a system on a chip, or multiple ones, or combinations, of the foregoing. The apparatus can include special-purpose logic circuitry, e.g., an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit). In addition to hardware, the apparatus can also include code that creates an execution environment for the computer program in question. Examples include code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them. The apparatus and execution environment can realize various different computing model infrastructures, such as web services, distributed computing, and grid computing infrastructures.

A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages and declarative or procedural languages. It can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in any of the following:

- a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document);
- a single file dedicated to the program in question;
- multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).

A computer program can be deployed to be executed on one computer, or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special-purpose logic circuitry, e.g., an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general-purpose and special-purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include one or more mass storage devices for storing data, e.g., magnetic, magneto-optical, or optical disks, or be operatively coupled to receive data from them, transfer data to them, or both. However, a computer need not have such devices. Moreover, a computer can be embedded in another device. Examples include a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive. Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices. By way of example, these include:

- semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices;
- magnetic disks, e.g., internal hard disks or removable disks;
- magneto-optical disks;
- CD-ROM and DVD-ROM disks.

The processor and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer with two kinds of device: a display device for displaying information to the user, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor; and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, e.g., visual, auditory, or tactile feedback. Input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user. For example, it can send web pages to a web browser on the user's client device in response to requests received from the web browser.

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes one or more of the following, in any combination:

- a back-end component, e.g., a data server;
- a middleware component, e.g., an application server;
- a front-end component, e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described in this specification.

The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network ("LAN") and a wide area network ("WAN"), an inter-network (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).

A computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data (e.g., an HTML page) to a client device, e.g., to display data to, and receive user input from, a user interacting with the client device. Data generated at the client device (e.g., a result of the user interaction) can be received from the client device at the server.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed. They should be read instead as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, features may be described above as acting in certain combinations, and may even be initially claimed as such. Even so, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Nor should it be understood as requiring that all illustrated operations be performed to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments. The described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

Claims (20)

1. A method performed by one or more data processing apparatus, the method comprising:
identifying a target arrangement of compute nodes for a computational workload;
selecting, from a set of building blocks, a subset of the building blocks for the computational workload, each building block in the set comprising compute nodes arranged in m dimensions, wherein:
each building block is connected to an optical network that includes two or more optical switches for each of the m dimensions and a set of one-to-many switches for each building block;
for each of the m dimensions:
each building block includes one or more segments of compute nodes interconnected along the dimension, each segment including a first compute node on a first end of the segment and a second compute node on a second end of the segment opposite the first end;
for a first subset of the set of building blocks, the first compute node of each segment along the dimension is connected to a first optical switch of the two or more optical switches for the dimension;
for each of one or more second subsets of the set of building blocks, the first compute node of each segment along the dimension is connected to a respective additional optical switch of the two or more optical switches for the dimension; and
the second compute node of each segment of each building block is connected to an input of a respective one-to-many optical switch for the building block and segment, each one-to-many optical switch having the input and a plurality of outputs, wherein a first output of the plurality of outputs is connected to the first optical switch for the dimension and each additional output of the plurality of outputs is connected to a corresponding respective additional optical switch for the dimension;
determining a logical arrangement of the subset of the building blocks that matches the target arrangement of compute nodes; and
generating a workload cluster of compute nodes that includes the subset of the building blocks, the generating comprising:
for each dimension of the workload cluster, configuring respective routing data for each of the two or more optical switches for the dimension, the respective routing data for each dimension of the workload cluster specifying how data of the computational workload is routed between compute nodes along the dimension of the workload cluster; and
configuring at least a portion of the one-to-many switches based on the logical arrangement such that the second compute node of each segment is connected to the same optical switch as the corresponding first compute node of the corresponding segment to which the second compute node is connected in the logical arrangement.

2. The method of claim 1, wherein each optical switch comprises an optical circuit switch (OCS).

3. The method of claim 1, further comprising causing the workload cluster to execute the computational workload.

4. The method of claim 1, wherein the set of one-to-many switches for each building block comprises a respective one-to-many switch for each segment of each dimension of the building block.

5. The method of claim 1, wherein each segment of each dimension comprises a logical axis along the dimension.

6. The method of claim 1, wherein configuring at least a portion of the one-to-many switches based on the logical arrangement such that the second compute node of each segment is connected to the same optical switch as the corresponding first compute node of the corresponding segment to which the second compute node is connected in the logical arrangement comprises:
for a first building block in the subset, identifying a second building block in the subset that is logically adjacent to the first building block along a particular dimension;
for each segment of the first building block along the particular dimension:
identifying a corresponding segment of the second building block;
identifying the optical switch to which the first compute node of the corresponding segment of the second building block is connected; and
configuring the corresponding one-to-many switch, of the at least a portion of the one-to-many switches, to which the segment is connected to connect the second compute node of the segment to the identified optical switch.

7. The method of claim 1, wherein the one-to-many optical switch is a one-to-two optical switch having one input and two outputs.

8. The method of claim 1, wherein the set of building blocks includes multiple workload clusters, each workload cluster including a different subset of the building blocks.

9. The method of claim 1, further comprising:
receiving data indicating that a given building block in the workload cluster has failed; and
replacing the given building block with an available building block.

10. The method of claim 9, wherein replacing the given building block with an available building block comprises:
updating routing data of one or more optical switches of the optical network to stop routing data between the given building block and one or more other building blocks in the workload cluster; and
updating the routing data of the one or more optical switches of the optical network to route data between the available building block and the one or more other building blocks in the workload cluster.

11. A system for configuring a superpod, comprising:
a set of building blocks, each building block comprising compute nodes arranged in m dimensions; and
an optical network that includes two or more optical switches for each of the m dimensions and a set of one-to-many switches for each building block, wherein:
each building block is connected to the optical network;
for each of the m dimensions:
each building block includes one or more segments of compute nodes interconnected along the dimension, each segment including a first compute node on a first end of the segment and a second compute node on a second end of the segment opposite the first end;
for a first subset of the set of building blocks, the first compute node of each segment along the dimension is connected to a first optical switch of the two or more optical switches for the dimension;
for each of one or more second subsets of the set of building blocks, the first compute node of each segment along the dimension is connected to a respective additional optical switch of the two or more optical switches for the dimension; and
the second compute node of each segment of each building block is connected to an input of a respective one-to-many optical switch for the building block and segment, each one-to-many optical switch having the input and a plurality of outputs, wherein a first output of the plurality of outputs is connected to the first optical switch for the dimension and each additional output of the plurality of outputs is connected to a corresponding respective additional optical switch for the dimension.

12. The system of claim 11, further comprising an OCS manager implemented on one or more computers, the OCS manager being configured to:
determine a logical arrangement of a subset of the building blocks that matches a target arrangement of the compute nodes for a computational workload; and
generate a workload cluster of compute nodes that includes the subset of the building blocks, the generating comprising:
for each dimension of the workload cluster, configuring respective routing data for each of the two or more optical switches for the dimension, the respective routing data for each dimension of the workload cluster specifying how data of the computational workload is routed between compute nodes along the dimension of the workload cluster; and
configuring at least a portion of the one-to-many switches based on the logical arrangement such that the second compute node of each segment is connected to the same optical switch as the corresponding first compute node of the corresponding segment to which the second compute node is connected in the logical arrangement.

13. The system of claim 11, wherein each optical switch comprises an optical circuit switch (OCS).

14. The system of claim 11, wherein the set of one-to-many switches for each building block comprises a respective one-to-many switch for each segment of each dimension of the building block.

15. The system of claim 11, wherein each segment of each dimension comprises a logical axis along the dimension.

16. The system of claim 12, wherein configuring at least a portion of the one-to-many switches based on the logical arrangement such that the second compute node of each segment is connected to the same optical switch as the corresponding first compute node of the corresponding segment to which the second compute node is connected in the logical arrangement comprises:
for a first building block in the subset, identifying a second building block in the subset that is logically adjacent to the first building block along a particular dimension;
for each segment of the first building block along the particular dimension:
identifying a corresponding segment of the second building block;
identifying the optical switch to which the first compute node of the corresponding segment of the second building block is connected; and
configuring the corresponding one-to-many switch, of the at least a portion of the one-to-many switches, to which the segment is connected to connect the second compute node of the segment to the identified optical switch.

17. The system of claim 11, wherein the one-to-many optical switch is a one-to-two optical switch having one input and two outputs.

18. The system of claim 11, wherein the set of building blocks includes multiple workload clusters, each workload cluster including a different subset of the building blocks.

19. A non-transitory computer storage medium encoded with a computer program, the program comprising instructions that, when executed by one or more data processing apparatus, cause the one or more data processing apparatus to perform operations comprising:
identifying a target arrangement of compute nodes for a computational workload;
selecting, from a set of building blocks, a subset of the building blocks for the computational workload, each building block in the set comprising compute nodes arranged in m dimensions, wherein:
each building block is connected to an optical network that includes two or more optical switches for each of the m dimensions and a set of one-to-many switches for each building block;
for each of the m dimensions:
each building block includes one or more segments of compute nodes interconnected along the dimension, each segment including a first compute node on a first end of the segment and a second compute node on a second end of the segment opposite the first end;
for a first subset of the set of building blocks, the first compute node of each segment along the dimension is connected to a first optical switch of the two or more optical switches for the dimension;
for each of one or more second subsets of the set of building blocks, the first compute node of each segment along the dimension is connected to a respective additional optical switch of the two or more optical switches for the dimension; and
the second compute node of each segment of each building block is connected to an input of a respective one-to-many optical switch for the building block and segment, each one-to-many optical switch having the input and a plurality of outputs, wherein a first output of the plurality of outputs is connected to the first optical switch for the dimension and each additional output of the plurality of outputs is connected to a corresponding respective additional optical switch for the dimension;
determining a logical arrangement of the subset of the building blocks that matches the target arrangement of compute nodes; and
generating a workload cluster of compute nodes that includes the subset of the building blocks, the generating comprising:
for each dimension of the workload cluster, configuring respective routing data for each of the two or more optical switches for the dimension, the respective routing data for each dimension of the workload cluster specifying how data of the computational workload is routed between compute nodes along the dimension of the workload cluster; and
configuring at least a portion of the one-to-many switches based on the logical arrangement such that the second compute node of each segment is connected to the same optical switch as the corresponding first compute node of the corresponding segment to which the second compute node is connected in the logical arrangement.

20. The non-transitory computer storage medium of claim 19, wherein the set of one-to-many switches for each building block comprises a respective one-to-many switch for each segment of each dimension of the building block.
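The configuration step recited in claims 1 and 6 can be sketched for a single dimension as below. This is a minimal, hypothetical model, not the patented implementation. `OpticalFabric`, `configure_dimension`, the string block IDs, and the wrap-around logical ring along the dimension are all illustrative assumptions. The model works in three steps:

- `block_subset` records which OCS each block's segments have their first compute nodes wired to, i.e., which subset of building blocks the block belongs to.
- For each segment's second compute node, the one-to-many switch is steered to the OCS that serves the first node of the corresponding segment in the logically adjacent block.
- That OCS is then given a cross-connect between the two nodes.

```python
from dataclasses import dataclass, field

# (building block id, segment index); illustrative port labels, not a real switch API
Port = tuple[str, int]


@dataclass
class OpticalFabric:
    """Hypothetical model of the optical network for ONE dimension of the superpod."""
    num_ocs: int                   # two or more OCSs serve this dimension
    block_subset: dict[str, int]   # block id -> OCS that its segments' first nodes attach to
    num_segments: int              # segments per building block along this dimension
    # (block, segment) -> selected output of that segment's one-to-many switch
    one_to_many: dict[Port, int] = field(default_factory=dict)
    # OCS index -> cross-connects {(src block, seg) second node: (dst block, seg) first node}
    ocs_routes: dict[int, dict[Port, Port]] = field(default_factory=dict)


def configure_dimension(fabric: OpticalFabric, logical_order: list[str]) -> None:
    """Close a logical ring of building blocks along this dimension (assumed torus wiring)."""
    if len(set(logical_order)) != len(logical_order):
        raise ValueError("a building block may appear only once in the logical order")
    fabric.one_to_many = {}
    fabric.ocs_routes = {i: {} for i in range(fabric.num_ocs)}
    for i, block in enumerate(logical_order):
        # Logically adjacent block along the dimension; the last block wraps to the first.
        neighbor = logical_order[(i + 1) % len(logical_order)]
        # The OCS wired to the first compute node of every segment of the neighbor.
        target_ocs = fabric.block_subset[neighbor]
        for seg in range(fabric.num_segments):
            # Steer this segment's second compute node to that same OCS ...
            fabric.one_to_many[(block, seg)] = target_ocs
            # ... and cross-connect it to the first node of the neighbor's corresponding segment.
            fabric.ocs_routes[target_ocs][(block, seg)] = (neighbor, seg)


if __name__ == "__main__":
    # Blocks A and C hang off OCS 0, blocks B and D off OCS 1 (assumed example wiring).
    fabric = OpticalFabric(num_ocs=2, block_subset={"A": 0, "B": 1, "C": 0, "D": 1},
                           num_segments=2)
    configure_dimension(fabric, ["A", "B", "C"])
    print(fabric.one_to_many)
    # Block B fails: substitute the available block D and reconfigure.
    configure_dimension(fabric, ["A", "D", "C"])
    print(fabric.ocs_routes)
```

The sketch rebuilds the whole dimension from the logical order. Under that assumption, replacing a failed block (claims 9 and 10) amounts to substituting an available block in `logical_order` and reconfiguring, which drops the failed block's cross-connects. A production controller would more plausibly update only the switches that change.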
HK42023069405.1A 2019-07-01 2023-03-02 Reconfigurable computing pods using optical networks with one-to-many optical switches HK40081215B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US16/458,947 2019-07-01

Publications (2)

Publication Number Publication Date
HK40081215A 2023-05-19
HK40081215B 2025-04-25

Similar Documents

Publication Publication Date Title
CN115002584B (en) Reconfigurable computing platform using optical network with one-to-many optical switch
CN112889032B (en) Reconfigurable computing platform using optical network
CN110495137A (en) Expansible data center network topology on distribution switch
JP6809360B2 (en) Information processing equipment, information processing methods and programs
Adda et al. Routing and fault tolerance in Z-fat tree
HK40081215B (en) Reconfigurable computing pods using optical networks with one-to-many optical switches
US20240154906A1 (en) Creation of cyclic dragonfly and megafly cable patterns
HK40108484A (en) Reconfigurable computing pods using optical networks
HK40081215A (en) Reconfigurable computing pods using optical networks with one-to-many optical switches
HK40053729B (en) Reconfigurable computing pods using optical networks
HK40044238B (en) Reconfigurable computing pods using optical networks with one-to-many optical switches
HK40044238A (en) Reconfigurable computing pods using optical networks with one-to-many optical switches
HK40053729A (en) Reconfigurable computing pods using optical networks
Sem-Jacobsen et al. Dynamic fault tolerance with misrouting in fat trees
Rashidi et al. FRED: A Wafer-scale Fabric for 3D Parallel DNN Training
US20240154903A1 (en) Cyclic dragonfly and megafly
CN107370652A (en) A kind of computer node dynamic interconnection platform and platform network-building method
Azeez Reliable low latency I/O in torus-based interconnection networks
KR20200124837A (en) Technology of flexiblex interconnect topology and packet controlling method in host network with silicon-photonics interface for high-performance computing