WO2025029948A2 - Multi-node server unit - Google Patents
- Publication number
- WO2025029948A2 (PCT/US2024/040429)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- server unit
- devices
- compute nodes
- compute
- switchboard
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H05—ELECTRIC TECHNIQUES NOT OTHERWISE PROVIDED FOR
- H05K—PRINTED CIRCUITS; CASINGS OR CONSTRUCTIONAL DETAILS OF ELECTRIC APPARATUS; MANUFACTURE OF ASSEMBLAGES OF ELECTRICAL COMPONENTS
- H05K7/00—Constructional details common to different types of electric apparatus
- H05K7/20—Modifications to facilitate cooling, ventilating, or heating
- H05K7/20709—Modifications to facilitate cooling, ventilating, or heating for server racks or cabinets; for data centers, e.g. 19-inch computer racks
- H05K7/208—Liquid cooling with phase change
- H05K7/20809—Liquid cooling with phase change within server blades for removing heat from heat source
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
- G06F1/16—Constructional details or arrangements
- G06F1/20—Cooling means
-
- H—ELECTRICITY
- H05—ELECTRIC TECHNIQUES NOT OTHERWISE PROVIDED FOR
- H05K—PRINTED CIRCUITS; CASINGS OR CONSTRUCTIONAL DETAILS OF ELECTRIC APPARATUS; MANUFACTURE OF ASSEMBLAGES OF ELECTRICAL COMPONENTS
- H05K7/00—Constructional details common to different types of electric apparatus
- H05K7/14—Mounting supporting structure in casing or on frame or rack
- H05K7/1485—Servers; Data center rooms, e.g. 19-inch computer racks
- H05K7/1487—Blade assemblies, e.g. blade cases or inner arrangements within a blade
-
- H—ELECTRICITY
- H05—ELECTRIC TECHNIQUES NOT OTHERWISE PROVIDED FOR
- H05K—PRINTED CIRCUITS; CASINGS OR CONSTRUCTIONAL DETAILS OF ELECTRIC APPARATUS; MANUFACTURE OF ASSEMBLAGES OF ELECTRICAL COMPONENTS
- H05K7/00—Constructional details common to different types of electric apparatus
- H05K7/20—Modifications to facilitate cooling, ventilating, or heating
- H05K7/2029—Modifications to facilitate cooling, ventilating, or heating using a liquid coolant with phase change in electronic enclosures
- H05K7/203—Modifications to facilitate cooling, ventilating, or heating using a liquid coolant with phase change in electronic enclosures by immersion
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2200/00—Indexing scheme relating to G06F1/04 - G06F1/32
- G06F2200/20—Indexing scheme relating to G06F1/20
- G06F2200/201—Cooling arrangements using cooling fluid
Definitions
- Section 1 As feature sizes and transistor sizes have decreased for integrated circuits (ICs), the amount of heat generated by a single chip, such as a microprocessor, has increased. Chips that once were air cooled have evolved to chips needing more heat dissipation than can be provided by air alone. In some cases, immersion cooling of chips in a tank containing a coolant liquid is employed to maintain IC chips at appropriate operating temperatures. The chips can be part of a larger assembled unit, such as a server used in a data center.
- One type of immersion cooling is two-phase immersion cooling, in which heat from a semiconductor die is high enough to boil the coolant liquid in the tank. The boiling creates a coolant-liquid vapor in the tank, which is condensed by cooled coils or pipes back to liquid form. Heat from the semiconductor dies can then be sunk into the liquid-to-gas and gas-to-liquid phase transitions of the coolant liquid.
- Section 2 One challenge with operating a data center is keeping the data center's servers and other computing components cool enough to operate quickly and efficiently. Dissipating the heat generated by the servers becomes more challenging as the number of servers grows and the operating temperatures of the servers increase. Unfortunately, air cooling is usually not sufficient to cool the large number of servers in a modern data center.
- There are two types of immersion cooling: single-phase immersion cooling and two-phase immersion cooling.
- In single-phase immersion cooling, the servers are immersed or submerged in liquid coolant that circulates between a heat exchanger and an immersion tank that contains the servers and immersion fluid. As it circulates, the liquid coolant moves heat generated by the servers to a cold-water circuit or other heat sink thermally coupled to the heat exchanger. The coolant remains in the liquid phase as it circulates, hence the name "single-phase" immersion cooling.
- In two-phase immersion cooling, the servers are submerged in liquid coolant with a relatively low boiling point (e.g., at or about 50 °C).
- the liquid coolant absorbs heat generated by the servers; this causes the liquid coolant to evaporate.
- the coolant vapor rises from the surface of the liquid coolant, moving heat away from the servers immersed in the liquid coolant.
- the coolant vapor is cooled by a heat exchanger, such as a condenser coil, and returns to the liquid phase in the immersion tank holding the servers and the liquid coolant.
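- As a rough worked example of this heat balance (a sketch only: the latent heat of vaporization is an assumed value typical of engineered fluorocarbon immersion coolants, and the heat load is an arbitrary example; neither figure is specified by this disclosure), the rate of vapor generation can be estimated as follows.
```python
# Rough estimate of coolant boil-off rate in two-phase immersion cooling.
# The latent heat value is an assumed, typical figure for fluorocarbon
# immersion coolants (~90 kJ/kg) and is NOT taken from this disclosure.

HEAT_LOAD_W = 2800.0          # example heat load dissipated into the coolant (watts)
LATENT_HEAT_J_PER_KG = 90e3   # assumed latent heat of vaporization

boil_off_kg_per_s = HEAT_LOAD_W / LATENT_HEAT_J_PER_KG
print(f"Approximate vapor generation: {boil_off_kg_per_s * 1000:.1f} g/s")
# The condenser coils must condense vapor at roughly this same rate to keep
# the coolant level in the tank steady.
```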
- the present disclosure relates to a high-performance server unit and associated methods.
- the server unit can be cooled in an immersion-cooling system (single phase or two phase).
- the server unit can include multiple reconfigurable host computers, each host computer comprising a compute node of one or more central processing units (CPUs) communicatively coupled to one or more input/output devices (I/O devices).
- the I/O devices can include, but are not limited to, peripheral cards (such as graphical processing units (GPUs), memory cards, accelerator modules, and custom modules that can include application specific integrated circuits (ASICs), digital signal processors (DSPs), and/or field-programmable gate arrays (FPGAs)).
- the coupling of I/O devices to compute nodes is reconfigurable by firmware and/or hardware and can be configured at any time by a user.
- the server unit can be used for artificial intelligence (AI) workloads, such as inference computations, data mining, deep learning, natural language processing, and other types of AI workloads, as well as non-AI workloads.
- Some implementations relate to server units which can be installed in immersion-cooling systems.
- Example server units can comprise a chassis, a plurality of compute nodes mechanically coupled to the chassis, wherein each compute node of the plurality of compute nodes is configured to process a computational workload independently of other compute nodes of the plurality of compute nodes, a plurality of slots and/or sockets mechanically coupled to the chassis to receive I/O devices, and a switchboard mechanically coupled to the chassis.
- the switchboard can comprise a first plurality of cabling connectors to receive cables that communicatively couple the plurality of compute nodes to the switchboard, a second plurality of cabling connectors to receive cables that couple the plurality of slots and/or sockets to the switchboard, and a plurality of switches to configure and reconfigure communicative couplings between the first plurality of cabling connectors and the second plurality of cabling connectors to assign and reassign portions of the I/O devices to each compute node of the plurality of compute nodes.
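- A conceptual sketch of the reconfigurable coupling just described is shown below; the class and method names are hypothetical illustrations, not an API defined by this disclosure.
```python
# Minimal sketch of assigning and reassigning I/O-device slots to compute
# nodes through a switchboard. Names are hypothetical, not a real API.
from dataclasses import dataclass, field

@dataclass
class Switchboard:
    # maps an I/O slot index to the compute node it is currently assigned to
    assignments: dict[int, int] = field(default_factory=dict)

    def assign(self, slot: int, node: int) -> None:
        """Assign (or reassign) one I/O slot to a compute node."""
        self.assignments[slot] = node

    def devices_for(self, node: int) -> list[int]:
        return [s for s, n in self.assignments.items() if n == node]

sb = Switchboard()
for slot in range(16):          # e.g., 16 I/O device slots
    sb.assign(slot, slot % 4)   # spread evenly across 4 compute nodes
sb.assign(15, 0)                # later, reassign one slot to node 0
print(sb.devices_for(0))        # -> [0, 4, 8, 12, 15]
```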
- Embodiments of the present technology include a gasket for a server chassis that holds a server and is installed in an immersion tank filled with liquid coolant.
- the gasket is formed of an elastically deformable material and has at least one fastener for securing the gasket to the server chassis.
- the gasket prevents the server chassis from rubbing against a metal surface of the immersion tank as the server chassis and the gasket are inserted into the immersion tank.
- the gasket can have a chamfered corner.
- the gasket can be elastically compressed and may be made of plastic, rubber, silicone, and/or polytetrafluoroethylene. It may include at least one compressible spring feature configured to be compressed between the server chassis and an immersion tank.
- the gasket can be made of a sponge-like material.
- the gasket can define a hollow, compressible lumen having a vent or hole to allow fluid flow into and out of the hollow, compressible lumen.
- the vent or hole can be disposed at an end of the gasket.
- FIG. 1.1 depicts an example of a two-phase immersion-cooling system that can be used to cool semiconductor dies.
- FIG. 1.2 depicts a tank for cooling semiconductor dies in assembled units and also illustrates an assembled unit installed in the tank.
- FIG. 1.3 is a block diagram of a server unit that can be installed as an assembled unit in the tank of FIG. 1.2.
- FIG. 1.4A and FIG. 1.4B depict further details of the server unit of FIG. 1.3.
- FIG. 1.5A depicts a first arrangement of printed circuit boards (PCBs).
- FIG. 1.5B depicts a second arrangement of the PCBs.
- FIG. 1.5C depicts a third arrangement of the PCBs.
- FIG. 1.6 depicts an oversized full-height full-length PCB that can be used in the server unit of FIG. 1.4B.
- FIG. 1.7 depicts a switchboard and compute nodes that can be used in the server unit of FIG. 1.3.
- FIG. 1.8A depicts further details of the switchboard of FIG. 1.7.
- FIG. 1.8B depicts I2C communication channels in management circuitry of the switchboard of FIG. 1.8A.
- FIG. 1.8C depicts UART communication channels in the management circuitry of the switchboard of FIG. 1.8A.
- FIG. 1.9A depicts an example of a compute node that can be used in the server unit of FIG. 1.4A.
- FIG. 1.9B depicts details of the compute-node management circuitry for the compute node of FIG. 1.9A.
- FIG. 1.10 depicts a process to recover firmware images for components of the compute-node management circuitry of FIG. 1.9B.
- FIG. 1.11 depicts reset functionality for the compute-node management circuitry of FIG. 1.9B.
- FIG. 1.12 depicts I2C and I3C communication channels provided by the compute-node management circuitry of FIG. 1.9B.
- FIG. 1.13 depicts UART and USB communication channels provided by the compute-node management circuitry of FIG. 1.9B.
- FIG. 1.14 depicts non-maskable interrupt functionality for the compute-node management circuitry of FIG. 1.9B.
- FIG. 1.15 depicts error-handling functionality for the compute-node management circuitry of FIG. 1.9B.
- FIG. 1.16 depicts thermal monitoring of voltage regulators by the compute-node management circuitry of FIG. 1.9B.
- FIG. 1.17 depicts processor-hot (PROCHOT) monitoring functionality of the compute-node management circuitry of FIG. 1.9B.
- FIG. 1.18 depicts power distribution circuitry for the server unit of FIG. 1.4A and FIG. 1.4B.
- FIG. 2.1 shows an inventive sleeve or gasket for protecting a server when it is installed in and removed from an immersion tank.
- FIG. 2.2A shows a front view of the sleeve of FIG. 2.1 attached to one side of a server chassis.
- FIG. 2.2B shows a top view of a server chassis with sleeves attached to both sides.
- FIG. 2.2C shows a detailed cross-section of a sleeve attached to one side of a server chassis.
- FIG. 2.3A shows a cross-sectional view of a server chassis with a sleeve attached to one edge installed in an immersion tank.
- FIG. 2.3B shows a close-up view of the server chassis and the sleeve between guide rails of the immersion tank.
- FIG. 2.3C shows a server chassis with sleeves on both edges installed in an alignment plate at the bottom of the immersion tank.
- FIG. 2.3D is a close-up view of the server and sleeves installed in the alignment plate at the bottom of the immersion tank.
- FIG. 2.4 shows different views of an alternative sleeve for an immersion-cooled server chassis.
- Section 1 Multi-Node Server Unit
- a user may pay for more computing power than is needed for their particular workload, and any unused PCIe devices assigned to their compute node become stranded, i.e., unusable for other workloads by the same user or a different user.
- the inefficient use of the PCIe devices may, in turn, lead to a higher overall total cost of ownership for the user.
- One option to address stranding of resources is to virtualize the hardware. However, there is an undesirable software overhead associated with virtualization of hardware.
- a networked architecture may be implemented where the computing resources provided by one server are shared with another server, thus allowing for more efficient utilization of computing resources.
- networked architectures are often complex and expensive to implement.
- such networked architectures are often prone to issues related to, for example, the compatibility of network protocols for data transmission and security of data.
- Such networked architectures can also experience unfavorable bandwidth limitations.
- For each server unit in a data center, there preferably should be little server overhead in terms of hardware utilization.
- I/O devices that are not used or needed by one compute node should be available to the other compute node(s) that can be dedicated to other AI or non-AI workloads. In this manner, the server units and data center can operate closer to 100% utilization of computing hardware resources, reducing server overhead and data center operating costs.
- a compute node in the server can be assigned an appropriate number of I/O devices (such as GPUs, accelerator modules, etc.) such that the configured host computer (compute node and communicatively coupled I/O device(s)) can handle an intended workload (such as a particular AI inference workload) without having stranded I/O devices.
- the number of I/O devices and compute nodes communicatively coupled together can be changed by a user as needed to readapt the server unit to handle a different type of AI workload or non-AI workload to more efficiently utilize the available computing resources.
- a single server unit can handle different types of workloads ranging from complex AI workloads to significantly simpler non-AI workloads. Furthermore, multiple teams of developers, engineers, and support staff are not needed to create, produce, and support different types of server designs for different types of workloads. Instead, a single team of developers, engineers, and support personnel can be employed to develop a single server that can be adapted by customers to meet a wide range of customer needs.
- FIG. 1.1 depicts several aspects of an immersion-cooling system 1160 for dissipating heat from one or more semiconductor die packages 1105 via immersion cooling.
- Each die package 1105 can include one or more semiconductor dies that produce heat when the semiconductor dies are in operation.
- the die packages 1105 can be part of a larger assembled unit, such as a server.
- FIG. 1.1 depicts a two-phase immersion-cooling system at a time when the semiconductor dies are operating and generating enough heat to boil coolant liquid 1164 in an immersion-cooling tank 1107.
- Although the immersion-cooling system 1160 in the illustrated example of FIG. 1.1 is a two-phase immersion-cooling system, the invention can also be implemented for single-phase immersion-cooling systems.
- the immersion-cooling system 1160 includes a tank 1107 filled, at least in part, with coolant liquid 1164.
- the immersion-cooling system 1160 can further include at least one chiller 1180 that flows a heat-transfer fluid through at least one condenser coil 1170 or pipes located in the head space 1109 of the tank 1107.
- the condenser coil(s) 1170 or pipes can condense coolant liquid vapor 1166 into droplets 1168 that return to the coolant liquid 1164.
- the packages 1105 can be mounted on one or more printed circuit boards (PCBs) 1157 that are immersed, at least in part, in the coolant liquid 1164.
- a foam or froth 1167 can form in the tank 1107 above the coolant liquid 1164 when the system is in operation.
- the froth 1167 can comprise mostly bubbles 1165 that collect across the surface of the coolant liquid 1164 as the coolant liquid boils.
- the immersion-cooling system 1160 may house and provide coolant liquid 1164 to tens, hundreds, or even thousands of packages 1105.
- the packages 1105 can be included in assembled units, such as server units.
- the total amount of electrical power drawn by the packages 1105 in one tank 1107 of an immersion-cooling system can be from 100,000 Watts to 600,000 Watts. In some cases, the total amount of electrical power can be over 600,000 Watts.
- the tank 1107 of an immersion-cooling system 1160 can be small (e.g., the size of a floor-unit air conditioner, approximately 1 meter in height, 0.5 meter in width, and 0.5 meter in depth or length). In some implementations, the tank 1107 of an immersion-cooling system can be large (e.g., the size of an automotive van, approximately 2.5 meters in height, 2.5 meters in width, and 4 meters in depth or length). In some cases, the tank 1107 of an immersion-cooling system 1160 can be larger than an automotive van.
- the immersion-cooling system 1160 can also include a controller 1102 (e.g., a microcontroller, programmable logic controller, microprocessor, field-programmable gate array, logic circuitry, memory, or some combination thereof) to manage operation of at least the immersion-cooling system 1160.
- the controller 1102 can perform various system functions such as monitoring temperatures of system components, the coolant liquid level, technician access to the interior of the tank, chiller operation, etc.
- the controller 1102 can further issue commands to control system operation, such as executing a start-up sequence, executing a shut-down sequence, assigning workloads among the packages 1105 and/or assembled units, changing the coolant liquid level, changing the temperature of the heat-transfer fluid circulated by the chiller 1180, etc.
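- A minimal sketch of the kind of supervisory loop such a controller 1102 might run is shown below; the sensor reads, thresholds, and actions are illustrative assumptions, not part of this disclosure.
```python
# Hypothetical supervisory loop for an immersion-cooling system controller.
# Sensor reads, thresholds, and actions are placeholders, not a real API.
import random

COOLANT_LEVEL_MIN = 0.80      # assumed fraction of nominal fill level
COMPONENT_TEMP_MAX_C = 60.0   # assumed alarm threshold

def read_coolant_level() -> float:
    return random.uniform(0.75, 1.0)                        # placeholder sensor read

def read_component_temps() -> list[float]:
    return [random.uniform(30.0, 70.0) for _ in range(8)]   # placeholder reads

def supervise_once() -> None:
    if read_coolant_level() < COOLANT_LEVEL_MIN:
        print("alarm: coolant liquid level low")
    if max(read_component_temps()) > COMPONENT_TEMP_MAX_C:
        # e.g., lower the temperature of the heat-transfer fluid circulated
        # by the chiller 1180, or reassign workloads among assembled units
        print("action: reduce chiller setpoint")

supervise_once()   # in practice this would run periodically
```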
- the controller 1102 can include (or be included in) a baseboard management controller (BMC) 1104. That is, the BMC 1104 may monitor and control all aspects of system operation for the immersion-cooling system 1160 in addition to monitoring and controlling workloads of the semiconductor dies 1150 in the packages 1105 and/or workloads among the assembled units cooled by the immersion-cooling system 1160.
- the immersion-cooling system 1160 can also include a network interface controller (NIC) 1103 to allow the system to communicate over a network, such as a local area network or wide area network.
- FIG. 1.2 depicts an example tank 1207 for an immersion-cooling system 1260 that can cool a plurality of assembled units 1210.
- the tank 1207 can be one of many installed in a data center, for example.
- the assembled units 1210 can each be servers, dedicated processing units, data-mining units, high bandwidth memory units, or other types of processing and/or memory units.
- the tank 1207 can be formed to include compartments 1205 that are not filled with coolant liquid 1164.
- the compartments 1205 can be outside a main, enclosed volume of the tank that is at least partially filled with coolant liquid 1164, for example.
- the compartments 1205 can be used to house server components that need not be cooled by liquid (e.g., signal routing electronics and connectors).
- Condensing pipes 1270 can be located above and to the side of the assembled units 1210 so that the assembled units can be lowered into and removed from the tank 1207 through an opening 1220 in the tank 1207 without contacting the condensing pipes 1270. There can be many more condensing pipes than the number shown in FIG. 1.2.
- the opening 1220 can be covered and sealed by an access door 1212 when the immersion-cooling system is in operation.
- the access door 1212 can comprise glass so that server units 1211 can be viewed inside the tank during operation of the system (e.g., to view indicator LEDs on the server units 1211).
- a single tank 1207 can be sized to contain 10 or more assembled units 1210 (e.g., at least from 10 to 70 assembled units 1210). In some cases, a tank 1207 can be sized to contain more than 70 units. Power to the assembled units 1210 can be provided by one or more busbars 1230 that run(s) along a base of the tank 1207 in the ±x direction. The assembled units 1210 can be placed in one or more rows (e.g., extending in the ±x direction in the illustration) within the tank 1207. In a data center, there can be a plurality of tanks 1207 each containing a plurality of assembled units 1210. The immersion-cooling tanks 1207 can be operated continuously for days, weeks, months, or even longer in some cases, before substantial servicing of any of the tanks is needed.
- 3. Overview of Server Architecture
- FIG. 1.3 is a simplified diagram of the computing architecture for an example assembled server unit 1211 of the disclosed implementations.
- the server unit 1211 can comprise a plurality of computing hardware components mounted to a chassis.
- the size of the chassis can comply with a two-rack-unit (2U or 2RU) form factor in one dimension.
- the hardware components include multiple compute nodes 1310-1, 1310-2, ... 1310-N, at least one switchboard 1320, and multiple I/O devices 1330-1, 1330-2, ... 1330-M (which may also be referred to as peripheral cards or peripheral devices).
- the compute nodes 1310 can be communicatively coupled to the switchboard 1320 with first cables 1315 and the I/O devices 1330 can be communicatively coupled to the switchboard 1320 with second cables 1325.
- the switchboard 1320, I/O devices 1330, first cables 1315, and second cables 1325 are all Peripheral Component Interconnect Express® (PCIe) compliant, though compliance with other communication standards such as Compute Express Link (CXL) and Open Compute Project Accelerator Module (OAM) is possible.
- Each compute node 1310-1 can comprise one or more CPUs communicatively coupled to each other.
- Each CPU of the compute nodes 1310 can dissipate up to 350 watts of power, so that the compute nodes 1310 of a single server unit 1211 can dissipate up to 2800 watts of power or more.
- the CPU(s) of a compute node 1310-1 work together (in collaboration with communicatively-coupled I/O devices 1330) on a particular computational task or workload whereas other CPUs and other I/O devices in the server do not work on the same task or workload.
- each compute node could have only one CPU in some implementations.
- a server unit 1211 can have fewer than four compute nodes 1310 and other implementations can have more than four compute nodes 1310. There may be from 2 to 8 compute nodes 1310 in a server unit 1211.
- the I/O devices 1330 can include GPUs, though other types of I/O devices (such as accelerators and custom cards) can be included in a server unit 1211.
- a server unit 1211 can include one or more redundant array of independent disks (RAID) cards and/or one or more solid-state drive (SSD) cards among the I/O devices 1330.
- all the I/O devices 1330 can be the same (e.g., all GPUs).
- An example GPU card that can be used in the server unit 1211 is the Nvidia Titan V graphics card or the NVIDIA A100 Tensor Core GPU card, available from Nvidia Corporation of Santa Clara, California.
- the I/O devices 1330 can be a mix of device types (e.g., a mix of GPUs, accelerators, RAID cards, custom cards, and SSD cards or other types of I/O devices).
- a server unit 1211 can have fewer than 16 I/O devices or more than 16 I/O devices. There may be from 0 to 32 I/O devices 1330 in a server unit 1211.
- the I/O devices 1330 can be communicatively coupled to the compute nodes 1310 through the second cables 1325, switchboard 1320, and first cables 1315.
- Each I/O device can plug into a socket or slot in the server unit 1211.
- the socket or slot that receives the I/O device can connect to a cabling connector 1423 (male or female) to which one of the second cables 1325 can be plugged.
- the cables 1315, 1325 and cabling connectors 1420, 1421, 1422, 1423 can be used to manually reconfigure communicative couplings between the compute nodes 1310 and I/O devices 1330.
- the first and second cables 1315, 1325 can be connected between cabling connectors 1422, 1420 and 1421, 1423, respectively, by a user to obtain a server configuration desired by the user.
- software and firmware resident, at least in part, in the switchboard 1320 or a controller in the server unit that communicates with the switchboard 1320 can be executed to communicatively couple any desired portions of the I/O devices 1330-1, 1330-2, ... 1330-M to each of the compute nodes 1310-1, 1310-2, ... 1310-N.
- one or all of the I/O devices can be communicatively coupled to any one of the compute nodes and any remaining I/O devices can be divided into groups and communicatively coupled to the remaining compute nodes as desired by the user.
- At least a portion of the couplings between compute nodes 1310 and I/O devices 1330 can be reconfigured at any time (e.g., when the affected compute nodes and I/O devices are idle) with user instructions provided to the switchboard 1320 through a programming interface (such as orchestrator software) without physical changes to the cables 1315, 1325.
- the orchestrator software can be installed and executed on a remote controller that communicates with firmware and/or software executing in the server unit 1211 to alter communicative couplings between one or more of the compute nodes 1310 and at least a portion of the I/O devices 1330.
- the executing firmware and/or software can be used to communicate with and set signal routing switches in the switchboard 1320.
- the orchestrator software can reside on another server in the data center and be used to manage operation of a plurality of server units 1211 in the data center, for example. More extensive changes to the couplings between compute nodes 1310 and I/O devices 1330 on a server unit 1211 can be made by changing physical connections of the first cables 1315 between first cabling connectors 1422, 1420 and/or changing physical connections of the second cables 1325 between cabling connectors 1421, 1423.
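- As an illustration of the orchestrator-driven reconfiguration described above, the sketch below shows one hypothetical form such a reassignment request could take; the payload schema, identifiers, and delivery mechanism are assumptions, not an API defined by this disclosure.
```python
# Hypothetical orchestrator-side request to reassign I/O devices to compute
# nodes by reprogramming the switchboard. The payload schema and identifiers
# are illustrative assumptions only.
import json

reassignment = {
    "server_unit": "tank-03/slot-12",        # assumed identifier
    "assignments": {                          # compute node -> I/O slot indices
        "node-1": [0, 1, 2, 3, 4, 5, 6, 7],   # e.g., 8 GPUs for a large AI job
        "node-2": [8, 9, 10, 11],
        "node-3": [12, 13],
        "node-4": [14, 15],
    },
}

print(json.dumps(reassignment, indent=2))
# The orchestrator would deliver a request like this to firmware executing in
# the server unit, which in turn sets the signal-routing switches on the
# switchboard; no physical cable changes are needed.
```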
- the inventors have recognized that peer-to-peer communications between the I/O devices 1330 can enhance processing power of the server unit 1211. Accordingly, the I/O devices 1330 can communicate signals with each other through the switchboard 1320 via the second cables 1325 without communication signals traveling to and from any of the compute nodes 1310. Such communication can reduce system latency (compared to communicating through a compute node), accelerate computational speed, and reduce computational burden on the compute nodes. In some implementations, at least a portion of the I/O devices 1330 can be adapted for direct card-to-card communication via third cables 1335 (such as an I/O device data bus). Each such I/O device can include transceivers 1332 that can transmit data to other I/O devices and receive data from other I/O devices.
- the transceivers 1332 may support universal asynchronous receive/transmit (UART) communications and/or inter-integrated circuit (I2C) communications, for example.
- In some implementations, the transceivers 1332 can include serializer/deserializer (SERDES) circuitry supporting PCIe Gen 5 communications directly between I/O devices 1330.
- Direct card-to-card communication can further reduce system latency (compared to communicating through the switchboard) and speed up server computations.
- FIG. 1.4A and FIG. 1.4B depict an example layout for an assembled server unit 1211 of FIG. 1.2 and FIG. 1.3.
- the front side of the server unit 1211 is depicted in FIG. 1.4A and the back side of the server unit is depicted in FIG. 1.4B.
- the body 1407 of the server unit 1211 can have a depth (in the x dimension) of approximately or exactly 89 mm (2RU) for the illustrated implementation, though the server unit is not limited to only a 2 RU depth.
- the body 1407 of the server unit 1211 is a portion of the server unit behind a face or faceplate 1406 of the server unit 1211 that installs into a rack or other assembly that supports the server unit 1211.
- the body 1407 of the server unit 1211 comprises a portion of the chassis in which the compute nodes 1310, switchboard 1320, and I/O devices 1330 are mounted.
- the server unit 1211 can be adapted for installation in a tank 1207 of an immersion-cooling system.
- the server unit 1211 comprises a chassis 1405 that can include lift rings 1408 (such as handles, hooks, eye bolts, etc.) for lowering and lifting the server unit 1211 into and out of an immersion-cooling tank 1207.
- Hardware components can be assembled on PCBs which are mounted to the chassis 1405.
- the chassis 1405 can be formed from a metal such as steel that is treated to be corrosion resistant or an aluminum alloy.
- the chassis 1405 and/or normally exposed metal conductors and solder and at least some components on the PCBs can be treated to prevent corrosion when in contact with coolant liquid 1164 in the tank 1207.
- the compute nodes 1310 (four in this example) can all be mounted on a first side of the chassis 1405, as illustrated in FIG. 1.4A.
- the I/O devices 1330 (eight visible in the drawings though 16 are present in this example) can all be mounted on a second, opposite side of the chassis 1405, as illustrated in FIG. 1.4B.
- the switchboard 1320 can be located at the top of the chassis 1405 (e.g., in a top one-third of the server unit 1211) and may be immersed, at least in part, or may not be immersed in coolant liquid 1164 when the server unit is installed for operation in the immersion-cooling tank 1207.
- the dimension D2 of the server unit 1211 may be 900 mm, for example, and the top third of the server unit would be the top 300 mm of the server unit 1211.
- the top of the chassis 1405 and top of the server unit 1211 would be farthest from a base of the immersion-cooling tank 1207 when installed in the tank 1207 compared to a bottom or base of the chassis 1405 and server unit 1211.
- the compute nodes 1310 and I/O devices 1330 can be mechanically coupled to the chassis 1405 and located in a lower two-thirds of the server unit 1211 and can all be immersed in coolant liquid 1164 when the server unit is installed and operating in the immersion-cooling tank 1207.
- the server unit 1211 can include electrical contacts 1409 (e.g., pins or sockets).
- the electrical contacts 1409 are arranged to make power connections to mating electrical contacts on a busbar 1230 at the base of the immersion-cooling tank 1207 when the server unit is lowered into position and installed in the immersion-cooling tank 1207 for operation.
- the server unit 1211 can essentially be plugged into the busbar 1230 at the base of the immersion-cooling tank 1207.
- Guide rails in the immersion-cooling tank 1207 can guide the server unit 1211 into place such that it can be easily installed and plugged into the busbar 1230 at the base of the tank.
- the guide rails can comprise right-angle bars or T bars that extend along the z direction and guide at least two edges of the server unit 1211 that also extend along the z direction, for example.
- the first power-distribution board 1410 can receive power from the busbar 1230 (which may be at 48 volts or at a voltage in a range from 40 volts to 56 volts) and provide and/or convert the power to one or more voltage values compatible with the compute nodes 1310 and/or I/O devices 1330 (e.g., 48 volts, ±12 volts, ±5 volts, and/or ±3.3 volts).
- the first power-distribution board 1410 can include fan-out connections, fuses, buffers, power converters, power amplifiers, or some combination thereof to distribute power to a plurality of components within the server unit 1211.
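- As a worked example using figures given in this disclosure (a nominal 48 V busbar that may range from 40 V to 56 V, and compute nodes that can dissipate up to roughly 2800 W per server unit), the busbar current drawn by one server unit can be estimated as sketched below; the actual current would be higher once I/O device power is included.
```python
# Rough busbar-current estimate for one server unit, using figures from this
# disclosure: compute nodes dissipating up to ~2800 W and a nominal 48 V
# busbar (40 V to 56 V range). I/O device power is not included here.
COMPUTE_NODE_POWER_W = 2800.0

for busbar_v in (40.0, 48.0, 56.0):
    amps = COMPUTE_NODE_POWER_W / busbar_v
    print(f"{busbar_v:>4.0f} V busbar -> ~{amps:.0f} A for the compute nodes alone")
# At the nominal 48 V this is roughly 58 A, which the first power-distribution
# board 1410 receives and converts to the lower voltages used on the boards.
```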
- the server unit 1211 can further include one or more second power-distribution boards 1436.
- the second power-distribution boards 1436 can also be mounted at a periphery of the chassis 1405 (e.g., along one or two sides of the chassis) and can convert power at a first voltage (received from the busbar 1230 and/or first power-distribution board 1410) to power at one or more second voltages.
- the second power-distribution board(s) 1436 can also include fan-out connections, fuses, buffers, power amplifiers, power converters, or some combination thereof to distribute power to a plurality of components within the server unit 1211.
- FIG. 1.4B there are two second power-distribution boards 1436 mounted on vertical sidewalls of the chassis 1405.
- the two second power-distribution boards 1436 can be identical to and interchangeable with each other to simplify design, reduce component count, and reduce manufacturing complexity.
- the first power-distribution board 1410 and second power-distribution boards 1436 can be located in a lower two-thirds of the server unit such that they are immersed, fully or mostly, in coolant liquid 1164 when installed in the tank 1207.
- the chassis can be sized to comply, in part, with standard rack unit (RU) sizes.
- the chassis can be sized in the x direction to comply with the Electronic Industries Alliance (EIA)-310 standard of 1 RU or 2 RU size (which commonly references the height of the unit), depending on the number of compute nodes 1310 and I/O devices 1330 included in the server unit 1211.
- a 2 RU dimension would be approximately or exactly 89 mm in the x direction in the drawing (normally referred to as height of the unit).
- the other dimensions of the server unit 1211 can be D1 that is approximately or exactly 800 mm (y direction in the drawing) and D2 that is approximately or exactly 900 mm (z direction in the drawing) and may or may not comply with an industry standard.
- D1 can be any value in a range from 700 mm to 900 mm and D2 can be any value in a range from 800 mm to 1000 mm.
- the switchboard 1320 can be mechanically coupled to the chassis 1405 and located near the top of the assembled server unit 1211.
- the center of the switchboard 1320 can be located in the upper half, upper third, or upper quarter of the server unit 1211.
- There can be an opening 1402 in the chassis 1405 such that cables can pass through the chassis and connect to cabling connectors 1420, 1421 located on opposing sides of the switchboard 1320.
- There can be a second plurality of cabling connectors 1421 on the second side of the switchboard 1320 (shown in FIG. 1.4B).
- the cabling connectors 1420, 1421 can all be the same or there can be different cabling connectors used on the switchboard 1320.
- An example cabling connector that can be used is the Mini Cool Edge IO (MCIO) connector.
- some or all of the cables between the switchboard 1320 and the compute nodes 1310 and I/O devices 1330 may be above the level of coolant liquid 1164 in the immersion-cooling tank 1207 when the server unit is in operation. Keeping cables out of the coolant liquid 1164 can reduce contamination of the coolant liquid by the cables and extend the service life of the coolant liquid 1164. Further, some or all of the cabling connectors 1420, 1421 on the switchboard can be above the level of coolant liquid 1164 in the immersion-cooling tank 1207 when the server unit 1211 is in operation, which can also help reduce contamination of the coolant liquid.
- the compute nodes 1310 can be arranged within the assembled server unit 1211 in a two-dimensional array (e.g., extending in the y and z dimensions as shown in FIG. 1.4A).
- Each compute node can include one or more compute-node connectors 1422 for making cable connections to the first plurality of cabling connectors 1420 on the switchboard 1320.
- two or more arrays of compute nodes can be stacked in the third dimension (x direction in the drawing) to provide up to eight or more compute nodes in the server unit 1211.
- the server unit 1211 may have a larger x dimension (e.g., up to 3 RU or more).
- the two-dimensional array of compute nodes 1310 shown in FIG. 1.4A can make it easy for a user to install and visually check the cabling connections between the compute nodes 1310 and switchboard 1320 as well as change the cables to reconfigure the assignment of compute nodes 1310 to I/O devices 1330.
- the cables between the 16 cabling connectors 1420 on the switchboard 1320 and the 16 compute-node cabling connectors 1422 can be spread over the y and z dimensions with no obstructed viewing of cables and connectors.
- the cabling connectors 1422 of different compute nodes 1310 are not stacked side-by-side in the x dimension which could make it difficult to trace cable connections.
- the two-dimensional array of compute nodes 1310 can also improve immersion cooling of the compute nodes.
- PCBs 1505 can be stacked like pages in a book, i.e., in a one-dimensional linear array such that the smallest dimension of each PCB (the thickness of the PCB) and the array extend in the same direction, as depicted in FIG. 1.5A.
- the planar surface 1507 of each PCB 1505 in the linear array is oriented perpendicular to a line 1510 running approximately or exactly through the center of each PCB 1505 in the array and extending in the direction of the linear array.
- the planar surface 1507 of the PCB 1505 is the surface on which electronic components are mounted. The electronic components are not shown in FIG. 1.5A through FIG. 1.5C.
- FIG. 1.5B depicts another way to array PCBs 1505 in one dimension.
- the planar surface 1507 of each PCB 1505 in the linear array is oriented parallel to the line 1510 running approximately or exactly through the center of each PCB 1505 in the array and extending in the direction of the linear array.
- the size of the array of PCBs 1505 viewed from any direction spans a significantly smaller area than the largest area spanned by the array of FIG. 1.5B.
- the compute nodes 1310 and I/O devices 1330 are arranged side-by-side and end-to-end in two-dimensional planar arrays, like that shown in FIG. 1.5C.
- the planar surface of each PCB 1505 in the array is oriented approximately or exactly parallel (e.g., within ±20 degrees of parallel) to a plane 1520 extending in the two directions (y and z in the drawing) spanned by the two-dimensional planar array.
- the planar array arrangement of FIG. 1.5C can be beneficial for immersion cooling because it can provide unimpeded fluid access to semiconductor dies and their heat dissipative elements on each of the PCBs 1505.
- the arrangement of I/O devices 1330 for the server unit 1211 of FIG. 1.4B comprises two, two-dimensional planar arrays arranged beside each other in the x direction of the drawing.
- Each two-dimensional planar array comprises eight I/O devices 1330 for a total of 16 I/O devices 1330 in the implementation of FIG. 1.4B.
- the compute nodes 1310 can be oriented within the server unit 1211 such that the planar surfaces of their PCBs extend approximately or exactly (e.g., within ±20 degrees) in directions across the broad area of the server unit 1211 (in the y and z dimensions in the illustrated example).
- the broad faces of heat-dissipating elements 1413 that are thermally coupled to CPUs of the compute nodes 1310 also extend approximately or exactly in the same (y and z) dimensions allowing bubbles generated by the heat-dissipating elements 1413 (when immersed in the coolant liquid) to rise freely without being trapped under the compute nodes 1310 or being impeded by other closely-spaced compute nodes when the server unit 1211 is installed in the immersion-cooling tank 1207 and operating.
- Such free-rising bubbles 1165 are depicted in FIG. 1.1, where the broad faces of the packages 1105 extend in the x and z dimensions for that illustration.
- the planes of the PCBs of the compute nodes 1310 and the broad faces of heat-dissipating elements 1413 on the compute nodes 1310 extend approximately or exactly in vertical and horizontal dimensions (e.g., within ±20 degrees) when the server unit is installed in the immersion-cooling tank 1207.
- cabling connections to the compute nodes 1310 on one side of the chassis 1405 and to the I/O devices 1330 on the opposite side of the chassis can be made entirely within the chassis 1405, without cables passing outside the perimeter of the chassis 1405.
- the perimeter of the chassis 1405 is defined as six planar faces that bound the chassis’ x, y, and z dimensions.
- the I/O devices 1330 can be arrayed in two dimensions and/or three dimensions on an opposite side of the server unit 1211 from the compute nodes 1310, as depicted in FIG. 1.4B.
- Each of the I/O devices 1330 can include or couple to an I/O device connector 1423, and the I/O device connector 1423 can couple (via second cables 1325) to one of the second plurality of cabling connectors 1421 of the switchboard 1320.
- each I/O device 1330-5, for example, can plug into a slot 1437 or socket that is mechanically coupled to the chassis 1405.
- the slot 1437 or socket can be located on a corresponding device connector board 1432-5 on which is mounted the cabling connector 1423 for the I/O device 1330-5.
- Each device connector board 1432-5 can be mechanically coupled to the chassis 1405 and service one, two, or more I/O devices 1330 that can plug into a corresponding one, two, or more slots 1437 or sockets mounted on the device connector board 1432-5.
- the device connector board 1432-5 can include one, two, or more cabling connectors 1423 for cabling connections to the second plurality of connectors 1421 of the switchboard 1320.
- the I/O devices 1330 can have a size up to full-height full-length (FHFL) I/O devices or larger and can be stacked in the x dimension.
- the two stacked I/O devices at each device location can be serviced by the same device connector board 1432-5 or two different device connector boards.
- the assembled server unit 1211 comprises a highly dense computing package comprising four dual-processor compute nodes 1310 and 16 FHFL I/O devices 1330 (of increased size) along with signal routing hardware to support reconfigurable communicative couplings between compute nodes and I/O devices, all within a 2 RU package having a total volume of less than 0.07 m³.
- any one or all of the I/O devices 1330 can be full-height half-length (FHHL) or half-height full-length (HHFL), for example. Other sizes are also possible.
- each of the FHFL I/O devices 1330 is slightly oversized.
- the PCB 1610 shape and size for the oversized I/O devices 1330 is depicted in FIG. 1.6.
- the PCB height is increased by more than 7% along the full length of the top of the PCB compared to a standard FHFL PCB for conventional I/O devices.
- the increased height provides a first added board area 1612 to accommodate more components, board traces, and/or portions of components.
- the second added board area 1614 extends along the majority of the length of the PCB 1610 at the base of the PCB to also accommodate more components, board traces, and/or portions of components.
- a lower edge 1617 of the second added board area 1614 steps up (by 0.15 mm in this example) compared to a lower edge 1615 of the PCB in the pin area 1620.
- the step-up of the lower edge 1617 of the second added board area 1614 can prevent the PCB 1610 from contacting other hardware or components when an VO device comprising the PCB 1610 is inserted into a bus slot, for example.
- the full height of the increased sized PCB for the I/O devices 1330 can be from 112 mm to 120 mm.
- the PCB 1610 can further include one or more “keep-out” areas 1660 where components above a specified height are not permitted on the assembled I/O device.
- the preclusion of components above a specified height from the keep-out areas 1660 can prevent contacting of components on a first I/O device with components on a second I/O device when the first and second I/O devices are mounted immediately adjacent to each other in a server, for example.
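- A simple illustrative check of the keep-out constraint described above is sketched below; the board regions, component placements, and heights are hypothetical example values, not values from this disclosure.
```python
# Hypothetical check that components placed within a keep-out area 1660 do
# not exceed the allowed height. Regions and components are examples only.
from dataclasses import dataclass

@dataclass
class KeepOut:
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    max_height_mm: float

@dataclass
class Component:
    name: str
    x: float
    y: float
    height_mm: float

keep_outs = [KeepOut(0.0, 300.0, 100.0, 115.0, max_height_mm=2.0)]  # assumed region
components = [
    Component("U7", 150.0, 108.0, height_mm=3.2),   # inside keep-out, too tall
    Component("C12", 40.0, 50.0, height_mm=1.1),    # outside keep-out, fine
]

for comp in components:
    for zone in keep_outs:
        inside = zone.x_min <= comp.x <= zone.x_max and zone.y_min <= comp.y <= zone.y_max
        if inside and comp.height_mm > zone.max_height_mm:
            print(f"violation: {comp.name} ({comp.height_mm} mm) exceeds "
                  f"{zone.max_height_mm} mm limit in a keep-out area")
```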
- FIG. 1.7 is a block diagram that depicts one example implementation of cabling connections between the compute nodes 1310, switchboard 1320, and I/O device connectors 1423, which couple to I/O devices 1330 (not shown in FIG. 1.7).
- the switchboard 1320 comprises four switches 1430-1, 1430-2, 1430-3, 1430-4 in the illustrated example of FIG. 1.7, though more or fewer switches can be used in other implementations.
- each compute node has five compute-node connectors 1422.
- the first cables 1315 can connect four of the compute-node connectors 1422 on each compute node 1310-1, 1310-2, 1310-3, 1310-4 to the first plurality of connectors 1420 on the switchboard 1320 (16 first cables 1315 and 16 switchboard cabling connectors 1420 in this example).
- the first cables 1315 are arranged such that each one of the compute nodes 1310 connects (through the first plurality of cabling connectors 1420 and traces in the switchboard’s PCB 1326) to one communication port on each of the four switches 1430.
- Four additional communication ports from each of the four switches 1430 connect to four of the second plurality of connectors 1421 on the switchboard 1320 (e.g., through traces in the switchboard’s PCB 1326).
- a ninth communication port 1329 on each of the switches 1430 can be used to connect one switch to another one of the switches, which allows for peer-to-peer communication and signal routing directly between the connected switches.
- switch 1430-1 directly connects to switch 1430-2 through their ninth communication ports
- switch 1430-3 directly connects to switch 1430-4 through their ninth communication ports.
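- The cabling topology just described (four switches, each with nine communication ports: four to the compute nodes, four to I/O connectors, and one to a peer switch) can be captured in the small illustrative sketch below; all names are hypothetical.
```python
# Sketch of the switchboard topology described above: 4 switches, each with
# 4 ports to compute nodes, 4 ports to I/O-device connectors, and 1 port to
# a peer switch (1<->2 and 3<->4). Structure and names are illustrative.
switches = {
    f"switch-{i}": {
        "compute_ports": [f"node-{n}" for n in range(1, 5)],   # one per compute node
        "io_ports": [f"io-conn-{4 * (i - 1) + j}" for j in range(4)],
        "peer_port": f"switch-{i + 1 if i % 2 else i - 1}",
    }
    for i in range(1, 5)
}

for name, sw in switches.items():
    used = len(sw["compute_ports"]) + len(sw["io_ports"]) + 1
    assert used == 9, "each switch uses nine communication ports"
    print(name, "peers with", sw["peer_port"])
```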
- Second cables 1325 communicatively couple the second plurality of connectors 1421 on the switchboard 1320 to the I/O device connectors 1423.
- a signal from a first I/O device arriving at a first connector 1423-1 can pass through a first switch 1430-1 to a second switch 1430-2 and to a second connector 1421-8 without going to any of the compute nodes 1310.
- the fifth compute-node connector 1422 can be used to couple a CPU of a compute node 1310-1 to a network interface card (NIC), for example, so that the CPU(s) of each compute node can communicate with other CPUs in the server unit 1211 and/or remote devices via a local area network or wide area network.
- An example of a NIC that can be used in a server unit 1211 is the Nvidia Bluefield 3 NIC available from Nvidia Corporation of Santa Clara, California.
- the communicative couplings between compute nodes 1310 and I/O devices 1330 can be reconfigured.
- each one of the compute nodes 1310 can access any one, and up to all, of the 16 I/O device connectors 1423 (and coupled I/O devices 1330) through only one switch to reduce latency in the system.
- any sized portion of the I/O devices 1330 can be assigned to any one of the compute nodes 1310 using cabling and/or switch settings of the switchboard 1320.
- the remaining I/O devices 1330 can be assigned in any sized portions to the remaining compute nodes.
- an integer number M of the I/O devices 1330 can be assigned to a first compute node 1310-1
- an integer number N of the I/O devices 1330 can be assigned to a second compute node 1310-2
- an integer number P of the I/O devices 1330 can be assigned to a third compute node 1310-3
- the assignment of I/O devices 1330 to each compute node can be done entirely by the switches 1430, using only the settings of the switches 1430.
- the assignment of I/O devices 1330 can be done via software instructions (e.g., via orchestrator software as discussed above) to configure the settings of the switches 1430.
- a user of the system can assign the I/O devices 1330 to the compute nodes 1310 as desired by the user.
- the assignment can be done from a remote location in the data center or outside of the data center (e.g., via the internet).
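- As a hypothetical sketch of how a requested partition of the 16 I/O device connectors among the compute nodes might be planned (the actual switch programming would be performed through the switchboard and its firmware, which is not shown here):
```python
# Hypothetical helper that turns a requested partition of the 16 I/O device
# connectors among compute nodes into a per-switch view. In this sketch each
# switch serves four consecutive I/O connectors; names are illustrative.

def plan_assignments(requested: dict[str, int]) -> dict[int, str]:
    total = sum(requested.values())
    if total > 16:
        raise ValueError("only 16 I/O connectors are available")
    plan, slot = {}, 0
    for node, count in requested.items():
        for _ in range(count):
            plan[slot] = node
            slot += 1
    return plan

plan = plan_assignments({"node-1": 8, "node-2": 4, "node-3": 2, "node-4": 2})
for switch in range(4):
    served = {s: plan.get(s, "unassigned") for s in range(4 * switch, 4 * switch + 4)}
    print(f"switch-{switch + 1}: {served}")
```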
- FIG. 1.8A and FIG. 1.8B depict further details of the switchboard 1320 of FIG. 1.4A, FIG. 1.4B, and FIG. 1.7.
- the switchboard comprises the first plurality of cabling connectors 1420, the second plurality of cabling connectors 1421, and four switches 1430 to reconfigure communicative couplings between the first plurality of cabling connectors 1420 and the second plurality of cabling connectors 1421.
- the switchboard 1320 can further comprise management circuitry 1461 to facilitate operation of the switchboard.
- the first plurality of cabling connectors 1420 and the second plurality of cabling connectors 1421 are all MCIO connectors, though other types of connectors (e.g., RJ45 connectors) can be used for some implementations.
- the switches 1430 can be programmable.
- the switchboard 1320 can further include at least one configuration input 1450 (e.g., at least one terminal or connector to receive one or more digital signals) by which settings within each switch 1430-1, 1430-2, 1430-3, 1430-4 can be configured to establish desired communicative couplings between the first plurality of cabling connectors 1420 and the second plurality of cabling connectors 1421 (and thereby establish assignments of VO devices 1330 to compute nodes 1310).
- the switches 1430 are located under, and in thermal contact with, two heat sinks 1440 that dissipate heat from the switches 1430.
- An example switch that can be used for the switchboard 1320 is the Broadcom Atlas2 144-port PCIe switch (model number PEX89144), available from Broadcom Inc. of San Jose, California.
- 16 ports of the chip can be grouped together as a communication port of the switch to provide 16 signal lines to one of the cabling connectors 1420, 1421, such that each switch can service eight cabling connectors (with 16 signal lines to each of the 8 cabling connectors) and connect to another switch with 16 signal lines for a switch- to-switch communication port 1329.
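- The lane budget described above can be verified with simple arithmetic, as sketched below using only the figures given (eight cabling connectors per switch with 16 signal lines each, plus a 16-line switch-to-switch port on a 144-port switch).
```python
# Port budget check for one switch, using the figures given above:
# 8 cabling connectors x 16 signal lines each, plus 16 lines for the
# switch-to-switch communication port, should use all 144 ports.
connectors_per_switch = 8
lines_per_connector = 16
switch_to_switch_lines = 16

total = connectors_per_switch * lines_per_connector + switch_to_switch_lines
assert total == 144
print(f"{total} ports used per 144-port switch")
```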
- the switches 1430 and traces in the switchboard's PCB can be used to route signals between the first plurality of cabling connectors 1420 and the second plurality of cabling connectors 1421.
- the management circuitry 1461 can comprise a switchboard board management controller (BMC) 1460 and service processor 1470, among other components.
- the switchboard BMC 1460 may comprise, for example, a microcontroller, a programmable logic device, a complex programmable logic device, a microprocessor, a field-programmable gate array, logic circuitry, memory, or some combination of these devices, in which there can be none, one, or more of each device present in the combination.
- An example device that can be used for the switchboard BMC 1460 is the AST2600 server management processor available from ASPEED of Hsinchu City, Taiwan.
- the switchboard BMC 1460 can be communicatively coupled to one or more memory devices 1480 (such as one or more double data rate (DDR) synchronous dynamic random access memory devices and/or one or more embedded multimedia card (eMMC) memory devices). These memory devices can store code and data for operation of the switchboard 1320.
- the switchboard BMC 1460 can also be communicatively coupled to a service processor 1470 and at least one complex programmable logic device (CPLD) 1475, which is shown in the further detailed drawing of FIG. 1.8B.
- the switchboard BMC 1460 can manage local resources (such as controlling and monitoring the state of each of the connected I/O devices 1330). In some implementations, the switchboard BMC 1460 may be responsible for assigning computing tasks from each compute node to I/O devices communicatively coupled to the compute node by the switches 1430. The switchboard BMC 1460 may coordinate assignment of computing tasks among the I/O devices 1330 with a compute-node BMC associated with the compute node that is communicatively coupled to the I/O devices 1330, and perform such management for each of the compute nodes 1310-1, 1310-2, 1310-3, 1310-4.
- the switchboard BMC 1460 may indicate to a compute-node’s BMC (described further below) when one or more I/O devices 1330 coupled to the compute node has (or have) completed tasks and/or is (or are) ready to accept a new computing task, thus allowing the compute node to route a new compute task to an available I/O device.
- the management circuitry 1461 can further monitor the states of compute nodes 1310, switches 1430, and I/O devices 1330 as well as provide interfacing circuitry for setting switch configurations of the switches 1430.
- the management circuitry 1461 can also provide start-up, boot, and recovery functionality for the switchboard 1320.
- the management circuitry 1461 can monitor the boot readiness of the switches 1430 and I/O devices 1330 and signal each of the compute nodes 1310 when their corresponding switch(es) and I/O device(s) are in a “boot ready” status.
- Each of the compute nodes 1310 can monitor the readiness statuses to ensure that the I/O devices 1330 and switches 1430 are ready before attempting to boot its host system.
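- A minimal sketch of the boot-readiness gating described above is shown below; the status strings and polling interface are assumptions for illustration.
```python
# Hypothetical boot-readiness gate: the management circuitry publishes a
# readiness flag per switch and per I/O device, and a compute node polls
# those flags before booting its host system. Names and API are assumed.
import time

def all_ready(statuses: dict[str, str]) -> bool:
    return all(s == "boot ready" for s in statuses.values())

def wait_for_boot_ready(read_statuses, timeout_s: float = 60.0) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if all_ready(read_statuses()):
            return True          # safe to boot the host system
        time.sleep(0.5)
    return False                 # readiness not reached; hold off booting

# example: the assigned switch and I/O devices already report ready
print(wait_for_boot_ready(lambda: {"switch-1": "boot ready", "io-3": "boot ready"}))
```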
- the management circuitry 1461 can support two communication protocols for communications between the BMC 1460 and the compute nodes 1310 and I/O devices 1330.
- the management circuitry 1461 can support I2C communications as well as UART communications between the BMC 1460 and the compute nodes 1310 and I/O devices 1330.
- the switchboard 1320 can further include a service processor 1470 that is communicatively coupled to the switchboard BMC 1460 and to memory 1482.
- the service processor 1470 can provide platform firmware resilience (PFR).
- Memory 1482 can include flash memory devices for storing switchboard BMC firmware and BIOS code for the switchboard 1320.
- the service processor 1470 can provide secure image verification for the switchboard BMC 1460 during runtime and retrieve firmware and BIOS code for recovery operations following certain runtime errors.
- An example device that can be used for the service processor 1470 is the AST1060 service processor available from ASPEED of Hsinchu City, Taiwan.
- Table 1 lists different firmware that is involved in operation of the management circuitry 1461 for the switchboard 1320. The table also lists where the firmware can be stored (second column), how firmware can be updated (third and fourth columns) and whether an image of the firmware is recoverable and if so, how it can be recovered (last column).
- Table 1. Switchboard Firmware
- The implementation of the switchboard illustrated in FIG. 1.8A depicts a configuration for PCIe standards, but the invention is not limited to only PCIe standards. As mentioned above, the hardware can be selected to comply with other standards such as CXL and/or OAM standards for I/O devices that are compatible with CXL or OAM standards.
- management circuitry 1461 of the switchboard 1320 can support at least two communication protocols, such as I2C and UART communication protocols.
- FIG. 1.8B depicts a plurality of inter-device communications using fourteen I2C communication ports on the BMC 1460. Four of the I2C ports can be used to communicatively couple, through traces in the switchboard 1320, to four compute-node connectors 1465 (MCIO connectors in this example), which in turn can couple to each of the compute nodes 1310 in the server unit 1211.
- there are four RJ45 compute-node connectors 1462 that are also used to communicatively couple the switchboard BMC 1460 to board management controllers at each of the compute nodes 1310. Fewer or more compute-node connectors 1462, 1465 can be used, depending on the number of compute nodes 1310 in the server unit 1211. In some cases, the RJ45 connectors can allow for Ethernet cable or reduced gigabit media-independent interface (RGMII) connections between the switchboard BMC 1460 and each of the four compute nodes 1310 of the server unit 1211. Different sideband management signals can occur simultaneously between the switchboard and any one of the compute nodes 1310 using the two different signaling channels (one via the RJ45 compute-node connectors 1462 and one via the MCIO compute-node connectors 1465).
- the switchboard BMC 1460 can communicatively couple to the service processor 1470 with three I2C ports of the switchboard BMC 1460.
- the switchboard BMC 1460 can further communicatively couple to the four switches 1430 using at least one I2C port (two shown in the implementation of FIG. 1.8B).
- the I2C communicative couplings between the BMC 1460 and switches 1430 can be used to program the switches 1430 (e.g., configure switch settings in each of the switches 1430 to assign I/O devices 1330 to compute nodes 1310 as desired) and/or to monitor switch status.
- the switchboard BMC 1460 and I2C communicative couplings to the switches 1430 can provide out-of-band (OOB) configurability of the switches 1430, for example.
- Using two or more I2C ports of the switchboard BMC 1460 to communicatively couple to the four switches 1430 can provide enough bandwidth for switch monitoring and telemetry services for all I/O devices 1330 while also allowing for concurrent firmware updates to the switches 1430 (e.g., via the switchboard BMC 1460).
- Dedicated I2C ports of the switchboard’s BMC 1460 can each be used to communicate with a corresponding compute node of the compute nodes 1310, also providing OOB communication between the compute nodes 1310 and the switchboard BMC 1460.
- the switchboard BMC 1460 can further communicatively couple to the second plurality of connectors 1421 (and to connected I/O devices 1330) to monitor the state of each connected I/O device, for example.
- two I2C ports of the BMC 1460 and two 8-way multiplexors 1464 are used to communicatively couple the BMC 1460 to the second plurality of connectors 1421 (and to connected I/O devices 1330).
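- As a non-limiting illustration (not part of the original disclosure), the short Python sketch below shows one way a 1-based connector index could be mapped onto the two BMC I2C ports and the two 8-way multiplexors 1464 described above; the 1-based indexing and the port/channel ordering are assumptions.

```python
def io_connector_route(connector_index: int) -> tuple[int, int]:
    """
    Map one of 16 I/O-device connectors to an (I2C port, mux channel) pair,
    assuming two BMC I2C ports, each feeding one 8-way multiplexor.
    Indices are 1-based to mirror connectors 1423-1 through 1423-16.
    """
    if not 1 <= connector_index <= 16:
        raise ValueError("expected a connector index from 1 to 16")
    zero_based = connector_index - 1
    i2c_port = zero_based // 8      # 0 or 1: which of the two BMC I2C ports
    mux_channel = zero_based % 8    # 0..7: which channel on that 8-way mux
    return i2c_port, mux_channel

# Example: connector 1423-10 would be reached via the second port, channel 1.
assert io_connector_route(10) == (1, 1)
```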
- the I2C communications can be used by the BMC 1460 for a variety of tasks. Such tasks include, but are not limited to, management of the switches 1430, telemetry monitoring of the I/O devices 1330, and monitoring of power devices in the server unit 1211 such as voltage regulators (VRs) and hot-swap controllers (HSCs).
- the I2C communications can also be used for out-of-band (OOB) communications between the compute nodes 1310 and BMC 1460 and devices communicatively coupled to I2C ports of the BMC 1460.
- the switchboard BMC 1460 can be used to monitor and/or control power delivery from the power-distribution boards 1410, 1436 in the server unit 1211 to the I/O devices 1330.
- the switchboard BMC 1460 can be communicatively coupled to power-distribution circuitry 1490 through one or more I2C ports.
- the power-distribution circuitry 1490 can include hot-swap protection devices 1492 that allow I/O devices 1330 to be removed and inserted while the system is powered, among other features. Protection devices can also include over-voltage protection, over-temperature protection, short-circuit protection, and over-current protection.
- An example hot-swap protection device 1492 that can be used in the power-distribution circuitry 1490 is the MP5990 protection device available from Monolithic Power Systems, Inc. of Kirkland, Washington, though other protection devices can be used to control power delivery in the server unit 1211.
- the switchboard BMC 1460 can further sense operating conditions of one or more components in the server unit 1211.
- the switchboard BMC 1460 can be communicatively coupled to one or more temperature sensors 1488 via an I2C port, for example.
- the temperature sensor(s) 1488 (which may comprise a thermistor or IC temperature sensor) can sense the temperature of such system components as a heat-dissipative element.
- the heat-dissipative element can be thermally coupled to one or more CPUs at one of the compute nodes 1310, coupled to one or more switches 1430 on the switchboard 1320, or coupled to one or more microprocessors of one of the I/O devices 1330.
- In response to detecting an over-temperature condition of a monitored component, the switchboard BMC 1460 can initiate and/or execute a shutdown of the component. In some implementations, the switchboard BMC 1460 can divert computational workloads away from a monitored component and migrate data as needed in response to detecting a high temperature or over-temperature condition of the component.
- At least one of the temperature sensors 1488 can sense the temperature of power supply circuitry (such as voltage regulators, transformers, and DC-DC voltage converters), and the switchboard BMC 1460 can take corrective action in response to detecting an over-temperature condition in the supply circuitry (e.g., migrating data and initiating a shutdown of the server unit 1211).
- a temperature sensor 1488 can sense the temperature of the liquid coolant 1164 in the immersion-cooling system.
- If the temperature of the liquid coolant 1164 exceeds a first threshold value, the switchboard BMC 1460 may restrict or limit workload distribution to the I/O devices 1330 until the temperature of the liquid coolant 1164 falls below a second threshold value, which may be the same as the first threshold value or a different, lower value.
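- As a non-limiting illustration (not part of the original disclosure), the following Python sketch shows the two-threshold hysteresis just described; the numeric threshold values are placeholders, not values taken from this disclosure.

```python
def update_workload_policy(coolant_temp_c: float,
                           restricting: bool,
                           first_threshold_c: float = 45.0,
                           second_threshold_c: float = 42.0) -> bool:
    """
    Return True while workload distribution to the I/O devices should be
    restricted. Restriction starts when the coolant exceeds the first
    threshold and is lifted only once the coolant falls below the second
    (equal or lower) threshold, giving simple hysteresis.
    """
    if not restricting and coolant_temp_c > first_threshold_c:
        return True       # coolant too warm: begin limiting workload distribution
    if restricting and coolant_temp_c < second_threshold_c:
        return False      # coolant has recovered: resume normal scheduling
    return restricting    # otherwise keep the current policy

assert update_workload_policy(46.0, restricting=False) is True
assert update_workload_policy(43.0, restricting=True) is True   # still above 42 degC
assert update_workload_policy(41.0, restricting=True) is False
```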
- FIG. 1.8C illustrates an example of how UART communications can be supported by the management circuitry 1461 of the switchboard 1320.
- Multiplexors can be implemented in circuitry of the CPLD 1475 to provide UART communication channels between the BMC 1460 and devices in the server unit 1211 (such as the compute nodes 1310, service processor 1470, and switches 1430).
- one UART port (labeled UART2) can couple to any compute node 1310-1, 1310-2, 1310-3, 1310-4 through a first multiplexor 1476 that couples to four compute-node connectors 1422.
- One UART port (labeled UART1) can be used to communicatively couple the BMC 1460 to the service processor 1470.
- Two UART ports and two multiplexors can be used to communicatively couple the BMC 1460 to two ports on each of the four switches 1430.
- a first of the two ports on the switches 1430 can be for communicating with an advanced RISC machine (ARM) on each switch and controlling the switches (e.g., setting switch configurations).
- a second of the two ports (labeled SDB) can be used to debug switch operation (e.g., receive information about switch errors).
- UART communications can be used for debugging the server unit 1211 while I2C communications can be used for sideband board management functionalities in the server unit.
- In a first basic configuration BC1, each one of the four compute nodes 1310 is coupled to one-quarter of the I/O devices 1330 (four in this example).
- all four of the compute nodes 1310 and all sixteen of the I/O devices 1330 are utilized during operation of the server unit 1211. If one of the compute nodes fails, then four of the I/O devices assigned to the failed compute node will become stranded. However, it is possible to reassign (and unstrand) the stranded I/O devices to functioning compute nodes using the switches 1430 and the management circuitry 1461 so that use of all I/O devices 1330 can be regained.
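- As a non-limiting illustration (not part of the original disclosure), the Python sketch below shows one simple round-robin policy for reassigning the stranded I/O devices of a failed compute node to the remaining nodes; the data structures and node/device names are assumptions, and the actual reassignment is performed through the switches 1430 and the management circuitry 1461.

```python
def reassign_stranded_devices(assignments: dict[str, list[str]],
                              failed_node: str) -> dict[str, list[str]]:
    """
    Redistribute the I/O devices of a failed compute node across the
    remaining healthy nodes in round-robin order. 'assignments' maps a
    compute-node name to the list of I/O devices currently switched to it.
    """
    stranded = assignments.pop(failed_node, [])
    healthy = list(assignments.keys())
    if not healthy:
        raise RuntimeError("no healthy compute nodes left to absorb devices")
    for i, device in enumerate(stranded):
        assignments[healthy[i % len(healthy)]].append(device)
    return assignments

# Example with four nodes and sixteen devices, four per node:
nodes = {f"node{n}": [f"io{4 * n + k}" for k in range(4)] for n in range(4)}
rebalanced = reassign_stranded_devices(nodes, "node2")
assert "node2" not in rebalanced
assert sum(len(devs) for devs in rebalanced.values()) == 16
```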
- In a second basic configuration BC2, all I/O devices 1330 can be assigned to only one of the compute nodes 1310.
- the remaining compute nodes which have no assigned I/O device may or may not be utilized.
- the remaining compute nodes can be used for less complex tasks (e.g., information searches, data storage and retrieval, low-complexity calculations, etc.).
- In a third basic configuration BC3, two of the compute nodes 1310 can each be assigned one-half of the I/O devices 1330 (eight in this example).
- the remaining compute nodes which have no assigned I/O device may or may not be utilized. In some cases, the remaining compute nodes can be used for less complex tasks (e.g., information searches, data storage and retrieval, low-complexity calculations, etc.).
- In a fourth basic configuration BC4, two of the compute nodes 1310 can each be assigned one-quarter of the I/O devices 1330 (four in this example) and a third compute node can be assigned one-half of the I/O devices 1330 (eight in this example).
- the remaining compute node which has no assigned I/O device may or may not be utilized. In some cases, the remaining compute node can be used for less complex tasks (e.g., information searches, data storage and retrieval, low-complexity calculations, etc.).
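- As a non-limiting illustration (not part of the original disclosure), the Python snippet below summarizes the device-count splits of the four basic configurations described above, assuming four compute nodes and sixteen I/O devices as in the example; the BC1-BC3 labels are inferred from the BC4 label used in the text.

```python
# Illustrative only: per-node I/O device counts for the four basic configurations.
BASIC_CONFIGURATIONS = {
    "BC1": [4, 4, 4, 4],    # every node gets one quarter of the devices
    "BC2": [16, 0, 0, 0],   # all devices assigned to a single node
    "BC3": [8, 8, 0, 0],    # two nodes each get one half of the devices
    "BC4": [4, 4, 8, 0],    # two nodes get a quarter each, one node gets a half
}

for name, split in BASIC_CONFIGURATIONS.items():
    assert sum(split) == 16, f"{name} must account for all 16 I/O devices"
```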
- each compute node can access any I/O device of the 16 I/O devices 1330 in the server unit 1211.
- the cabling can be reconfigured such that one or more of the compute nodes 1310 can access only a portion of the I/O devices 1330 and be isolated (in terms of signal communications) from remaining compute nodes and I/O devices in the server unit 1211.
- cable connections from the first compute node 1310-1 and third compute node 1310-3 can be made to only the first switch 1430-1 and the second switch 1430-2, instead of also connecting to the third switch 1430-3 and fourth switch 1430-4.
- cable connections from the second compute node 1310-2 and fourth compute node 1310-4 can be made to only the third switch 1430-3 and fourth switch 1430-4, instead of also connecting to the first switch 1430- 1 and the second switch 1430-2.
- the first compute node 1310-1, third compute node 1310-3, and eight coupled I/O devices are isolated from the second compute node 1310-2, fourth compute node 1310-4, and the remaining eight coupled I/O devices (coupled to I/O device connectors 1423-9 through 1423-16).
- the single server unit 1211 can be used for two completely independent workloads where data for one of the workloads is secure from access by a portion of the server used to handle the second workload. In such an implementation, it can be possible for two clients to use the same server unit 1211.
- bandwidth between a compute node and I/O device(s) can be increased.
- one of the first cables 1315 that would otherwise be used to connect the first compute node 1310-1 to the third switch 1430-3 can be connected to the first switch 1430-1 instead, thereby increasing the bandwidth of communications between the first compute node 1310-1 and any of the eight coupled I/O devices (coupled to I/O device connectors 1423-1 through 1423-8).
- bandwidth to coupled I/O devices can be increased for the other compute nodes.
- Other changes to bandwidth are possible.
- one of the compute nodes 1310-1 could have all four of its first cables 1315 connected to one of the switches 1430-2 to further increase bandwidth to I/O devices coupled to that switch.
- a server unit 1211 can comprise a plurality of compute nodes 1310 (e.g., from 2 to 8 compute nodes).
- each compute node can be identified (for purposes of system management and operation) by a two-bit compute-node identifier (e.g., [00], [01], [10], [11]).
- a three-bit compute-node identifier can be used when the server unit 1211 comprises more than four compute nodes 1310.
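- As a non-limiting illustration (not part of the original disclosure), the following Python sketch computes the identifier width implied above: two bits suffice for up to four compute nodes, while more than four nodes calls for a three-bit identifier.

```python
from math import ceil, log2

def node_id_bits(num_compute_nodes: int) -> int:
    """Minimum compute-node identifier width, in bits, for a given node count."""
    if num_compute_nodes < 2:
        return 1
    return ceil(log2(num_compute_nodes))

assert node_id_bits(4) == 2   # two-bit identifiers: [00], [01], [10], [11]
assert node_id_bits(5) == 3   # more than four nodes needs a three-bit identifier
assert node_id_bits(8) == 3
```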
- An example of a compute node 1310-1 is depicted in FIG. 1.9A.
- Each of the compute nodes 1310 comprises two CPU packages 1105-1, 1105-2 communicatively coupled to each other and to compute-node management circuitry 1920 included in the compute node 1310-1.
- Each CPU package can comprise at least one CPU, such as the 4th generation Intel® Xeon® Scalable Processor, Max Series, or the 5th generation Xeon® processor, both available from Intel Corporation® of Santa Clara, California.
- the compute node 1310-1 can further comprise memory modules 1930, 1935 communicatively coupled to each of the two CPU packages 1105-1, 1105-2 and their processors.
- One of the CPU packages 1105-1 can be communicatively coupled to a NIC 1940 (such as the Nvidia Bluefield 3 network interface card mentioned above), so that each of the compute nodes 1310 can communicate with each other and/or other processing devices over a network.
- the NIC 1940 can provide an interface with the BMC’s local management network.
- the NIC 1940 can be installed in an expansion card slot that is powered by 12-volt standby power (such that the card is powered irrespective of the power state of the compute node 1310-1).
- the data center management network can access the compute-node management circuitry 1920 and its local management network over a network controller sideband interface (NCSI) port regardless of the power state of the compute node 1310-1.
- a first memory module 1930 can comprise dual in-line memory modules (DIMMs) that support DDR data transfer between the memory module and the coupled CPU processor.
- An example memory module 1930 is the RDIMM DDR5-5600 available from Micron Technology, Inc. of Boise, Idaho.
- the DIMM storage capability can be 32GB, 48GB, 64GB, 96GB, or 128GB, for example.
- Each DIMM memory module 1930 can couple to the CPU package 1105-1, 1105-2 on its own data transfer channel. There can be from one to twelve memory modules 1930 coupled to the processor of the CPU package on separate data transfer channels.
- a second memory module 1935 can comprise solid-state drives (SSDs) coupled to the CPU packages 1105 on one or more data transfer channels.
- three SSD memory modules 1935 couple to the processor of the first CPU package 1105-1 on a single data transfer channel.
- the second CPU package 1105-2 couples to one SSD memory module 1935 over a first data transfer channel and couples to two SSD memory modules 1935 over a second data transfer channel.
- An example SSD memory module 1935 can have an E1.S form factor, such as the DC P4511 SSD module available from Solidigm® of Rancho Cordova, California. Communication between the processor of the CPU package and its SSD memory modules 1935 can be via I2C communications.
- the SSD memory modules 1935 can be electrically isolated from each other using I2C switches 1937 for connecting each SSD memory module 1935 to the I2C signal path.
- If one of the SSD memory modules 1935 malfunctions, the module can be de-configured from the I2C path and I2C functionality recovered for the remaining memory module(s) 1935 on the I2C path.
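- As a non-limiting illustration (not part of the original disclosure), the Python sketch below models the per-module I2C switches 1937 as a simple enable map so that a malfunctioning SSD can be de-configured while the remaining modules stay on the I2C path; the module names are placeholders.

```python
class I2CSwitchBank:
    """
    Minimal model of the per-SSD I2C switches 1937 on one I2C signal path:
    a misbehaving module can be de-configured (its switch opened) so the
    remaining modules keep working. Module names are placeholders.
    """
    def __init__(self, modules: list[str]):
        self.enabled = {name: True for name in modules}

    def deconfigure(self, name: str) -> None:
        """Open the I2C switch for a misbehaving SSD module."""
        self.enabled[name] = False

    def active_modules(self) -> list[str]:
        """Modules still connected to the I2C path."""
        return [m for m, on in self.enabled.items() if on]

bank = I2CSwitchBank(["ssd0", "ssd1", "ssd2"])
bank.deconfigure("ssd1")                     # e.g., ssd1 is hanging the bus
assert bank.active_modules() == ["ssd0", "ssd2"]
```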
- the compute-node management circuitry 1920 can include several components that are identical to, or different from, components in the switchboard’s management circuitry 1461.
- the BMC 1960 and service processor 1970 can be the same devices used for the switchboard’s BMC 1460 and service processor 1470, respectively.
- a different BMC 1960 can be used for the compute node and/or a different service processor 1970 can be used for the compute node than used for the switchboard 1320.
- the compute-node management circuitry 1920 can further comprise a platform controller hub 1950 that includes a chipset for controlling data paths and supporting functions (such as providing a system clock) that are used in conjunction with the CPU(s) of the compute node 1310-1.
- FIG. 1.9B illustrates further details of the compute-node management circuitry 1920 and its communicative couplings to the switchboard 1320 (at one of the switchboard’s four compute-node connectors 1462) and to the NIC 1940.
- For the server unit of FIG. 1.7, there will be four such communicative couplings of the four compute nodes 1310 to the switchboard's four compute-node connectors 1462.
- One of the communicative couplings between the compute-node management circuitry 1920 and NIC 1940 provides access to the NCSI port 1962 of the compute-node BMC 1960.
- the compute-node management circuitry 1920 is configured to control and monitor the state of the compute node 1310-1. Also included in the circuitry is a CPLD 1975 for power monitoring and management and for reset operations. Resetting of the various components of the compute node 1310-1 can be buffered, sequenced, and fanned out by the CPLD 1975 whenever a reset of the compute node 1310-1 is needed. Security of the compute node 1310-1 is assured in part by the compute-node BMC 1960, which is a root of trust (RoT) processor, such as the AST2600 processor described above. This processor can also secure the CPLD 1975.
- the service processor 1970 is also a RoT processor, such as the AST1060 service processor described above.
- the service processor 1970 can securely access BMC and BIOS firmware (stored in flash memory 1982) for boot-up and run-time.
- the service processor 1970 secures the BMC and BIOS flash images and provides run-time protection against attack or accidental corruption using hardware filtering of the SPI buses.
- the service processor 1970 can also provide firmware recovery in the event of main image corruption.
- the service processor 1970 supports multiple flash devices 1982 for BMC and BIOS enabling full backup of images and on-line firmware update in which the new firmware image will be activated on the next boot cycle.
- the flash devices 1982 can be partitioned into three regions with access controlled by the service processor 1970.
- the first region is a staging region.
- This staging region comprises a read/write section of data storage that is used to store a staged BMC or BIOS image. Once an image is written to the staging section, it can be evaluated by the service processor 1970 before being copied to an active BMC or BIOS region of the corresponding flash devices 1982.
- the second region is a recovery region, which comprises a read-only section of data storage that is used for BMC or BIOS firmware recovery.
- the third region is the active region of the flash devices 1982 and is a read/write section of data storage. The active BMC or BIOS firmware image is stored here.
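- As a non-limiting illustration (not part of the original disclosure), the Python sketch below captures the three flash regions described above and the rule that a staged image is copied to the active region only after it has been verified; region sizes and offsets are omitted because they are not given in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlashRegion:
    name: str
    writable: bool

# Hypothetical layout mirroring the three regions described above.
FLASH_LAYOUT = (
    FlashRegion("staging", writable=True),    # staged BMC/BIOS image, verified first
    FlashRegion("recovery", writable=False),  # read-only image used for recovery
    FlashRegion("active", writable=True),     # image the BMC/BIOS actually runs
)

def promote_staged_image(verification_passed: bool) -> str:
    """Copy a staged image to the active region only if verification passed."""
    return "copy staging -> active" if verification_passed else "reject staged image"

assert promote_staged_image(True) == "copy staging -> active"
```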
- Each of the compute nodes 1310 supports an external trigger for initiating bare-metal recovery of the BMC firmware and other critical firmware images in the event of a non-recoverable corruption of stored firmware.
- Bare-metal recovery can be triggered through use of a “magic packet” command sent on the local area network (sometimes referred to as a “Wake on LAN” magic packet command).
- This command can be received through the NIC 1940 by a physical layer (PHY) transceiver 1927 and routed to the service processor 1970.
- This command triggers a WAKE# interrupt to the service processor 1970 on the compute node 1310-1 which initiates a recovery of the BMC firmware image and NIC firmware (if needed).
- An example of such a bare-metal recovery process 11000 is depicted in FIG. 1.10.
- the process 11000 is designed to recover all firmware images necessary for compute-node BMC 1960 and management network functionality, such that the server unit 1211 can return to production.
- the process 11000 contains safeguards to prevent denial-of-service attacks.
- the bare-metal method of recovery is intended for servers that are down (out of production).
- Abbreviations used in the steps of the process 11000 are as follows: FW - firmware; Mgmt NW - management network; CMD - command.
- If the compute node's BMC 1960 cannot be placed in an operational status after recovering firmware (step 11035) for the compute node's BMC 1960, then service will be requested (step 11060) for the server unit 1211.
- a UID LED service indicator light (described below) can illuminate blue light identifying the server unit 1211 as needing service.
- If the BMC is or can be made operational but the management network for the compute-node management circuitry 1920 cannot be made operational after recovering firmware (step 11050) for the compute node's NIC 1940, then service will be requested (step 11060) for the server unit 1211.
- the process 11000 denies and logs (step 11030) unauthentic Wake# commands.
- If a Wake# command is asserted (step 11005) to the service processor 1970 (which could occur in a denial-of-service attack) and the BMC 1960 and management network are operational, the authenticity of the command is checked (step 11025). If the Wake# command is not authentic, then it will be denied and logged (step 11030). If the compute node's BMC 1960 and management network are operational or can be made operational after recovering BMC and/or NIC firmware, then the server unit 1211 can be returned to production (step 11070) to service clients of the data center.
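- As a non-limiting illustration (not part of the original disclosure), the Python sketch below condenses the recovery decisions of process 11000 into a single function; the boolean arguments stand in for the hardware checks performed at runtime, and only the step numbers come from the text.

```python
def bare_metal_recovery(wake_authentic: bool,
                        bmc_operational_after_recovery: bool,
                        mgmt_network_operational_after_recovery: bool) -> str:
    """
    Condensed decision flow for process 11000. The booleans stand in for the
    checks made after the authenticity test (step 11025), BMC firmware
    recovery (step 11035), and NIC firmware recovery (step 11050).
    """
    if not wake_authentic:
        return "deny and log the Wake# command (step 11030)"
    if not bmc_operational_after_recovery:
        return "request service for the server unit (step 11060)"
    if not mgmt_network_operational_after_recovery:
        return "request service for the server unit (step 11060)"
    return "return the server unit to production (step 11070)"

assert bare_metal_recovery(True, True, True).endswith("(step 11070)")
assert bare_metal_recovery(False, True, True).endswith("(step 11030)")
```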
- Table 2 lists different firmware that is involved in operation of the compute-node management circuitry 1920 for each of the compute nodes 1310. The table also lists where the firmware can be stored (second column), how firmware can be updated (third and fourth columns), and whether an image of the firmware is recoverable and, if so, how it can be recovered (last column).
- the NIC 1940 can have its own on-board BMC, an advanced RISC machine, and memory devices, so that it can store and manage recovery of its own firmware. Recovery of firmware on the NIC 1940 can be triggered by the compute node that is coupled to the NIC 1940 (e.g., by the compute node's BMC 1960 issuing a firmware recovery command over a UART and/or NCSI connection to the NIC 1940; see FIG. 1.9B).
- There are multiple ways to perform OOB updates to firmware for the NIC 1940, which can include using I2C, UART, 1 Gb Ethernet, and/or NCSI communication channels between the compute-node BMC 1960 and NIC 1940.
- FIG. 1.11 depicts how reset operations are performed for the compute-node management circuitry 1920.
- Each of the compute nodes 1310 may or may not include a manual reset button 11105.
- the compute node’s service processor 1970 is responsible for resetting the BMC 1960 and associated flash memory devices 1982.
- the service processor 1970 is also responsible for monitoring a watch-dog timer (WDTRST2#) implemented on the BMC 1960 and resetting and/or recovering firmware for the BMC 1960 in the event that the watch-dog timer is asserted.
- the service processor 1970 is further responsible for resetting the platform controller hub 1950 (at input RSMRST#) and its associated flash memory devices.
- FIG. 1.12 illustrates I2C supported communication channels between components of the compute-node management circuitry 1920 and other node components.
- the BMC 1960 can communicatively couple to the NIC 1940, platform control hub 1950, CPLD 1975, and service processor 1970 with I2C communication channels as depicted in FIG. 1.12.
- Improved I2C (I3C) communication channels are also supported between the BMC 1960 and compute node’s CPUs CPU0, CPU1.
- Two I3C switches are used to switch communications to and from the CPUs to either the BMC 1960 or DIMM memory devices 1930.
- I2C communication channels can run to each of the switches 1430 on the switchboard 1320 through compute-node connectors 1422 and to compute-node memory modules 1935.
- the BMC 1960 can also access a memory device 11210 that stores field-replaceable unit identification (FRUID) information about the compute node.
- I2C communication channels can also be used to monitor system sensors (e.g., temperature sensors 11230, voltage sensors, current sensors, etc.). I2C channels can also be used by the BMC 1960 to communicate with the server unit’s power distribution boards.
- Each of the compute nodes 1310 supports UART and USB communications, as shown in FIG. 1.13. These communication channels can be used for system development and debugging of the server unit 1211.
- Each of the compute nodes 1310 can include an internal universal serial bus (USB) 3.0 and an internal USB connector 11310.
- the USB connector 11310 can be used for standard USB connections to the compute node and can be used for Intel USB- based image transport protocol (ITP), for example.
- Each of the compute nodes 1310 can further or alternatively include an internal micro-USB connector 11320 that can be used to monitor debug consoles and operating system UARTs from the compute-node BMC 1960.
- the micro-USB connector 11320 can access a debug console of the compute-node BMC 1960 via a first UART channel (labeled UART1) and second UART channel (labeled UART5).
- a USB to UART bridge driver 11340 can be used between the micro-USB connector 11320 and UART channels on the BMC 1960.
- An example bridge driver 11340 is the CP2105 driver available from Silicon Labs of Austin, Texas.
- a UART port (labeled UART2) can be used to support NCSI communications with the NIC 1940. This communication channel can be used to command recovery of firmware images stored on the NIC 1940, for example.
- the compute node’s CPUs CPU0, CPU1 support non-maskable interrupts (NMIs) generated from either the platform controller hub 1950 or the compute node’s BMC 1960.
- FIG. 1.14 illustrates how NMIs are routed to the compute node’s CPLD 1975, which logically OR’s the signals to the CPUs CPU0, CPU1. In the event of an NMI, both CPUs are interrupted. The interrupts can also be cross-routed between the platform controller hub 1950 and the BMC 1960 so that each entity is informed should the other generate an interrupt.
- the CPUs CPU0, CPU1 can provide error status signals during runtime. There can be three types of errors: (1) a hardware correctable error (no operating system or firmware action is necessary to recover from the error), (2) a non-fatal error (OS or FW action is required to contain the error and recover), and (3) a fatal error (system reset required to recover). Statuses of the three types of errors can be reported to the platform controller hub 1950 or the compute node's CPLD 1975.
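- As a non-limiting illustration (not part of the original disclosure), the following Python sketch enumerates the three error classes and maps each one to the kind of response described above; the action strings are paraphrases, not firmware behavior taken from this disclosure.

```python
from enum import Enum

class CpuErrorType(Enum):
    CORRECTABLE = 1   # hardware-correctable; no OS or firmware action needed
    NON_FATAL = 2     # OS or firmware action required to contain and recover
    FATAL = 3         # system reset required to recover

def handle_cpu_error(error: CpuErrorType) -> str:
    """Map each reported error class to the response described in the text."""
    if error is CpuErrorType.CORRECTABLE:
        return "log only"
    if error is CpuErrorType.NON_FATAL:
        return "notify OS/firmware to contain and recover"
    return "report to platform controller hub / CPLD and reset the system"

assert handle_cpu_error(CpuErrorType.FATAL).endswith("reset the system")
```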
- FIG. 1.15 depicts how a fatal error can be handled by each of the compute nodes 1310.
- When a fatal error is asserted, the status signal is level-shifted by a level shifter 11510 and reported to the CPLD 1975, which in turn reports (after a programmed delay) the fatal error to the platform controller hub 1950 and compute node's BMC 1960 so that a system reset can be executed.
- Active low logic is used in this circuitry, so that the AND gate provides OR functionality.
- Each of the compute nodes 1310 can monitor the thermal status of the memory voltage regulators (VRs) 11610, as depicted in FIG. 1.16.
- the thermal status can be provided on a port (labeled VRHOT# in the drawing) of the memory VR 11610. If the temperature of a memory VR 11610 exceeds a threshold temperature, the condition will be detected by the compute node’s CPLD 1975.
- the CPLD 1975 can logically OR the thermal status signal from the VR 11610 with a throttle status signal from the platform controller hub 1950 to initiate throttling of at least one of the CPUs CPU0, CPU1. Active low logic is also used in this circuitry, so that the two AND gates provide OR functionality.
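- As a non-limiting illustration (not part of the original disclosure), the Python sketch below shows why an AND of active-low signals behaves as a logical OR of the underlying events, which is the convention used by the CPLD circuitry described above.

```python
def active_low_or(*signals_n: int) -> int:
    """
    Combine active-low signals (0 = asserted, 1 = idle). ANDing the raw
    levels yields 0 if any input is asserted, which is why an AND gate
    provides OR functionality for active-low logic.
    """
    level = 1
    for s in signals_n:
        level &= s
    return level

# VRHOT# asserted (0) while the hub's throttle signal is idle (1):
assert active_low_or(0, 1) == 0   # combined throttle request is asserted
assert active_low_or(1, 1) == 1   # nothing asserted, no throttling
```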
- the throttling of the CPU(s) will reduce its (their) performance.
- the signal to throttle the CPU(s) can also alert the compute node’s BMC 1960.
- Throttling of both CPUs CPU0, CPU1 can also be initiated by a throttling status signal from the platform controller hub 1950 (e.g., as part of a power-capping process described below).
- Each of the compute nodes 1310 supports throttling of the CPUs CPU0, CPU1 using a so-called “fast processor hot” (fast PROCHOT) process that is based on monitoring of the CPU's input voltage and power.
- FIG. 1.17 is a block diagram that depicts how the monitoring and throttling can be implemented. Again, active low logic is used so that the AND gates provide OR functionality and the OR gate provides AND functionality.
- a plurality of processor-hot triggers 11710 are monitored by the compute-node management circuitry 1920.
- the BMC 1960 and platform controller hub 1950 can monitor trigger status signals provided by system components that are monitored. Either the BMC 1960 or platform controller hub 1950 can output a processor-hot event in response to detecting an alert status signal from the monitored processor-hot triggers 11710.
- processor-hot triggers 11710 that can be monitored include the following:
- undervoltage alert - a comparator can monitor the 12 V voltage output of an HSC and assert an alert status signal if this voltage falls below 11.5V.
- overcurrent alert - the HSC can monitor the input current and assert an alert status signal if the input current exceeds a threshold current level.
- 48 V HSC alert - the 48 V HSC can assert an alert status signal (e.g., if it detects an overvoltage condition).
- 12 V HSC alert - the 12 V HSC can assert an alert status signal (e.g., if it detects an overvoltage condition).
- power-capping alert - the BMC 1960 can assert an alert status signal based on an FM_THROTTLE# signal received as part of a power-capping process.
- the compute-node CPLD 1975 can implement logic to process alert status signals from the processor-hot triggers 11710.
- processor-hot trigger signals can be filtered by the CPLD 1975 by performing a logical AND with a minimum-duration signal from the CPLD 1975 that is asserted only when the received processor-hot trigger signal has a duration of at least 100 ms or other minimum duration, for example.
- the CPLD 1975 can also de-assert a processor-hot signal (PROCHOT#) at either or both CPUs CPU0, CPU1 when the trigger event is cleared.
- the circuitry of FIG. 1.17 also enables power capping of each of the compute nodes 1310. Power capping can be triggered by the compute-node BMC 1960 through a power-capping signal (labeled FM_THROTTLE#) that can be asserted to the platform controller hub 1950, as indicated in the drawing. Once asserted, the platform controller hub 1950 can initiate power capping that throttles performance of the CPUs. In some implementations, compute-node power for the power-capping feedback loop is read by the platform controller hub 1950 from the 12 V HSC.
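- As a non-limiting illustration (not part of the original disclosure), the Python sketch below models the minimum-duration filtering of processor-hot triggers and a simple power-capping decision; the 100 ms figure comes from the text, while the power numbers are placeholders.

```python
def filtered_prochot(trigger_asserted: bool, asserted_for_ms: float,
                     minimum_ms: float = 100.0) -> bool:
    """
    Pass a processor-hot trigger through only when it has been asserted for
    at least the minimum duration, mirroring the AND-with-minimum-duration
    filtering described above (100 ms is the example duration from the text).
    """
    return trigger_asserted and asserted_for_ms >= minimum_ms

def power_cap_request(node_power_w: float, cap_w: float) -> bool:
    """Assert an FM_THROTTLE#-style request when measured power exceeds the cap."""
    return node_power_w > cap_w

assert filtered_prochot(True, 120.0) is True
assert filtered_prochot(True, 20.0) is False     # short glitches are ignored
assert power_cap_request(node_power_w=1900.0, cap_w=1800.0) is True  # placeholder watts
```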
- There can be LED indicators 1322 mounted in visible locations on the server unit 1211, as depicted in FIG. 1.4B.
- the LED indicators 1322 can provide easily viewable indications of server status, switchboard status, and compute node status, for example.
- the LED indicators 1322 could be mounted on the switchboard 1320 such that they are visible from the top of the server unit 1211 (e.g., along a top edge of the switchboard 1320 such that the LED indicators 1322 can be viewed through or from the access door opening 1220).
- the LED indicator shared by the compute nodes 1310 can be a unique identifier (UID) LED indicator that emits blue light when on.
- This UID indicator can signal (when on) a request for service and easily identify a server unit 1211 in need of service. Any one of the compute nodes 1310 can activate the UID LED indicator.
- Each one of the compute nodes 1310 can control its own bicolor LED indicator.
- the bicolor LED indicator can emit amber or green light. Signaling with a compute node’s bicolor LED indicator can be as follows, though other and/or additional signaling schemes are possible.
- the UID LED indicator and compute-node bicolor LED indicators can be controlled by the compute nodes using I2C communications.
- an I2C line from the compute node can couple to an I2C expander 1463 on the switchboard 1320 (depicted in FIG. 1.8B), which can provide a controlling signal to the UID LED indicator or compute-node bicolor LED indicator.
- the switchboard 1320 can also include a bicolor LED indicator light controlled by its BMC 1460 and/or CPLD 1475, for example.
- data center controlling software and the server unit's compute-node management circuitry 1920 coordinate actions for the service request. For example, if any one of the compute nodes 1310 or the data center's control software indicates that service is needed for the server unit 1211, then the compute-node management circuitry 1920 for each of the compute nodes 1310 migrates all work tasks and related data off of the server unit 1211 to another server unit before indicating a service request with the UID LED indicator.
- 8. Power Distribution
- Different voltages can be distributed in the server unit 1211.
- 48-volt power and 12-volt power are distributed by three power-distribution boards 1410, 1436 located within the server unit 1211 (also see FIG. 1.4A and FIG. 1.4B).
- a block diagram of the power distribution circuitry is illustrated in FIG. 1.18.
- the first power distribution board 1410 can receive 48-volt power from one or more busbars 1230 at the base of the immersion-cooling tank 1207 and can provide fused 48 V power to the compute nodes 1310, switchboard 1320, and a plurality of 48V:12V, 1600-watt power-converter bricks 11810 that are distributed on the first power distribution board 1410 and two second power distribution boards 1436.
- the power-converter bricks 11810 provide 12 V power to the plurality of I/O devices 1330.
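- As a non-limiting illustration (not part of the original disclosure), the Python sketch below estimates how many 1600-watt 48V:12V bricks a given 12 V load would require; the per-device power and derating factor are placeholders, not values from Table 3.

```python
from math import ceil

def bricks_needed(total_io_power_w: float,
                  brick_rating_w: float = 1600.0,
                  derating: float = 0.8) -> int:
    """
    Rough count of 48V:12V converter bricks needed to feed a 12 V load,
    derating each 1600 W brick to leave headroom. The load and derating
    values are placeholders, not figures from Table 3.
    """
    usable_per_brick = brick_rating_w * derating
    return ceil(total_io_power_w / usable_per_brick)

# e.g., sixteen I/O devices drawing a hypothetical 400 W each:
assert bricks_needed(16 * 400.0) == 5
```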
- the HSC supplies can further provide electrical isolation, power control, and power monitoring functionality.
- the server unit 1211 can draw considerable power when in operation.
- Table 3 lists example amounts of power drawn by various components of the server unit 1211.
- inventive concepts may be embodied as one or more methods, of which an example has been provided.
- the acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
- the terms “approximately” and “about” are used to mean within ±20% of a target (e.g., dimension or orientation) in some embodiments, within ±10% of a target in some embodiments, within ±5% of a target in some embodiments, and yet within ±2% of a target in some embodiments.
- the terms “approximately” and “about” can include the target.
- the term “essentially” is used to mean within ±3% of a target.
- a reference to “A and/or B,” when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
- the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements.
- This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified.
- “at least one of A and B” can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
- Section 2 Multi-Purpose Sleeve for Immersion-Cooled Servers
- a typical immersion tank for immersion cooling of servers and other electronic components is a bulky, welded mechanical assembly with large dimensional tolerances.
- the slots in the immersion tank for servers may be from 8 mm larger to 5 mm smaller than intended due to manufacturing imperfections.
- the immersion tank's walls may expand and/or contract, causing the slots for the servers to expand and/or contract as well, e.g., by up to about ±8 mm (±3 mm).
- the immersion tank should be sized larger than the servers. For instance, if a server's maximum width is 500 mm, then the slot in the immersion tank for that server should have a nominal width of at least 513 mm: 5 mm to account for the manufacturing tolerance (the slot may be built up to 5 mm smaller than intended), plus an extra 8 mm to account for immersion tank wall movement under a negative pressure condition. This results in larger immersion tanks and larger slots for servers in the immersion tanks.
- the maximum immersion tank width could be as high as 521 mm (unpressurized) or 529 mm (pressurized).
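- As a non-limiting illustration (not part of the original disclosure), the arithmetic behind these widths is reproduced in the Python sketch below; all values in millimeters come from the example above.

```python
# Worked example of the slot sizing described above (all values in mm).
server_max_width = 500.0
undersize_tolerance = 5.0   # the slot may come out up to 5 mm smaller than intended
wall_movement = 8.0         # tank walls may move roughly 8 mm with pressure changes

# Nominal slot width needed so the server still fits in the worst case:
nominal_slot_width = server_max_width + undersize_tolerance + wall_movement
assert nominal_slot_width == 513.0

oversize_tolerance = 8.0    # the slot may also come out up to 8 mm larger than intended
max_slot_unpressurized = nominal_slot_width + oversize_tolerance    # 521 mm
max_slot_pressurized = max_slot_unpressurized + wall_movement       # 529 mm
assert (max_slot_unpressurized, max_slot_pressurized) == (521.0, 529.0)
```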
- the larger sizes and size variations among openings for servers in immersion tanks create several problems. For instance, when the slot or opening in the immersion tank is larger than the server, then aligning the server to the immersion tank can be more difficult as the server can be biased on one side of the tank and may not connect or couple properly to the bus bar inside the immersion tank.
- the server chassis can rub against the guide rails inside the immersion tank as the server is being inserted into or removed from the immersion tank.
- the immersion tank and server chassis are typically made of steel or another metal, so this rubbing can produce metal debris. As the immersion fluid boils, it moves from the bottom of the tank to the top, bringing the metal debris into contact with the servers and/or other electrical components, which may cause electrical shorts.
- Rubbing between metal surfaces can also produce metal debris even for larger slots and openings.
- the larger slots for servers also lead to larger immersion tank volumes, which in turn leads to larger amounts of immersion fluid needed to fill the immersion tanks.
- Immersion fluid is very expensive, so the extra immersion fluid increases cost without necessarily improving cooling performance.
- FIG. 2.1 shows a gasket or sleeve 2100 that addresses problems associated with large manufacturing tolerances for slots or openings in immersion tanks.
- the sleeve 2100 allows for “less-than-perfect” mating of the server chassis to guide rails and/or alignment features within the immersion tanks.
- the sleeve 2100 can be attached to the side of an immersion-cooled server chassis before the server chassis is inserted into a slot or opening in the immersion tank.
- the server chassis may have a height of 700 mm or 1925 mm, a length of 797 mm, and a thickness of 86.9 mm.
- the sleeve 2100 may have a length of about 700 mm or 1925 mm (to match the height of the server chassis), a width of about 85 mm (to roughly match the thickness of the server chassis), and a thickness of about 8 mm (to fill the expected maximum gap between the server rails and the immersion tank).
- the sleeve 2100 prevents metal-on-metal contact between the server chassis and the guide rails within the immersion tank, reducing the probability of creating metal debris when inserting or removing the server chassis.
- the sleeve 2100 occupies space inside the immersion tank that would otherwise be occupied by the immersion fluid, reducing the quantity of immersion fluid needed to fill the immersion tank.
- the sleeve 2100 is made of a strip of electrically insulating, elastically deformable material, such as rubber, silicone, plastic, or polytetrafluoroethylene (PTFE).
- the sleeve 2100 can be stamped out of a single sheet or material, molded, or extruded.
- This material should also be inert; that is, it should not react with the immersion fluid, anything else within the immersion tank, or the immersion tank itself.
- the material should not degrade or decompose in the immersion tank or when it is inserted into or removed from the immersion tank because degradation could affect the immersion fluid’s cooling capacity. Degradation could also deleteriously affect the operation of the server and/or other components within the immersion tank.
- the sleeve 2100 includes two or more snap-fit features 2110 or other fasteners for attaching the sleeve 2100 to a server chassis.
- the snap-fit features 2110 are at the ends of the sleeve 2100 and secure the sleeve 2100 to a server chassis in a detachable fashion.
- the snap-fit features 2110 in FIG. 2.1 have mushroom heads that fit through holes in the server chassis and grab the inside wall of the server chassis (e.g., as shown in FIG. 2.2A, described below).
- Other sleeves may have different types of fasteners, such as snaps, buttons, latches, and so on, that allow compression and decompression of the sleeve.
- the sleeve 2100 may also have more fasteners, for example, disposed along the length of the sleeve 2100.
- the sleeve 2100 can be shaped to conform to or stretch or wrap around the server chassis such that the sleeve 2100 stays secured to the server chassis.
- springs or spring features 2120 disposed along the length of the sleeve 2100. These spring features 2120 protrude from the sleeve and can expand or contract to adjust to a wide variety of immersion tank slot sizes resulting from the large manufacturing tolerances for immersion tanks.
- the spring features 2120 on the sleeves 2100 push outwards, holding the sleeves 2100 and the server chassis in place with respect to the immersion tank.
- the combined width of the server chassis and the sleeve(s) 2100 should equal the maximum width of the opening or slot in the immersion tank.
- the sleeve 2100 at its thickest point should be wide enough to fill the largest possible gap between the server chassis and the rails that define the opening or slot in the immersion tank for the server chassis. Because the rails take up some space, the maximum spring thickness may be about 7 or 8 mm, for example, and may compress by about 7 mm.
- the sleeve 2100 also has chamfered or round ends 2130 to aid in alignment and to reduce the possibility of contact between the metal server chassis and the metal guide rails of the immersion tank. Preventing or reducing metal-to-metal contact prevents or reduces the possibility of creating metal debris that could cause electrical shorting of servers and other components in the immersion tank.
- FIG. 2.1 shows that both ends of the sleeve 2100 are chamfered (cut away to form a sloping edge) or rounded.
- the sleeve may have a single chamfered end — the end that fits around the corner of the server chassis that is inserted first into the immersion tank.
- FIG. 2.2A-2.2C show different views of immersion sleeves 2100 installed on both sides of a server chassis 2200 that holds a server 2202.
- FIG. 2.2A shows a front view of the server chassis 2200, illustrating the server 2202 and one of the guide pins 2210 that extends from the server chassis 2200 for aligning the server chassis 2200 to the immersion tank.
- FIG. 2.2A also shows the mushroom head of one fastener 2110 sticking through a hole in the server chassis 2200 and one of the sleeve’s chamfered ends 2130.
- the sleeve’s spring features 2120 extend towards the server chassis 2200 as shown in FIGS. 2.2A and 2.2C.
- Alternative sleeves may have spring features that extend away from the server chassis 2200 when installed on the server chassis 2200.
- FIGS. 2.2B and 2.2C show that the sleeves 2100 are as wide as the server chassis 2200.
- the sleeves 2100 can be shorter than the server chassis 2200, as long as the server chassis 2200, or even longer than the server chassis 2200, with the extra length extending towards the top of the immersion tank and possibly forming a handle for removing the server chassis 2200 from the immersion tank.
- FIGS. 2.3A-2.3D illustrate the server chassis 2200 with sleeves 2100 along both sides installed in an immersion tank 2300.
- FIG. 2.3A shows a vertical cross-section of the immersion tank 2300
- FIGS. 2.3B-2.3D show top-down views of the immersion tank 2300 at different levels of detail.
- the immersion tank 2300 includes guide rails 2302 and an alignment plate 2310 that align the server chassis 2200 to the immersion tank 2300 and hold the server chassis 2200 in position.
- the alignment plate 2310 and a bus bar 2320 sit at the bottom of the immersion tank 2300 between condenser coils 2330 that cool vaporized immersion fluid.
- the server chassis 2200 can come in different sizes, e.g., with heights of 700 mm or 925 mm. When fully inserted into the immersion tank 2300, the shorter server chassis 2200 may sit below the condenser coils 2330 and may be fully immersed in the immersion fluid (not shown). The taller server chassis 2200 may not fit completely below the condenser coils 2330 and may be partially immersed in the immersion fluid.
- the guide pins 2210 extending from the server chassis 2200 fit into guide pin receptacles 2312 in the alignment plate 2310. Electrical pins 2204 electrically coupled to the server 2202 and extending from the server chassis 2200 contact bus bar connections 2314 that are electrically coupled to the bus bar 2320, which provides electrical power to the server 2202.
- the chamfer 2130 on the sleeve 2100 helps to align the server chassis 2200 to the guide rails 2302 in the immersion tank 2300.
- the guide rails 2302 push the sleeve 2100 toward the server chassis 2200, compressing the sleeves’ spring features 2120.
- the spring features 2120 expand back to their original state. This expansion and contraction of the sleeve's spring features 2120 accounts for the large tank manufacturing tolerances (e.g., +8/-5 mm (static) and ±8 mm (pressurized)).
- FIG. 2.4 shows top, side, and cross-sectional views of an alternative gasket or sleeve 2400 for an immersion-cooled server chassis.
- This sleeve 2400 is a tube-like structure that is made of compliant, elastically deformable, electrically insulating material, such as rubber, plastic, or PTFE. It has two or more fasteners 2410 for attaching the sleeve 2400 to the server chassis and rounded or chamfered ends 2430 for sliding the server chassis into the immersion tank.
- the tube-like structure is hollow and can collapse, e.g., along predefined fold lines or pleats, to accommodate narrower openings as shown in the cross-sectional views.
- the tube-like structure can have a square or rectangular cross section as shown in FIG. 2.4 or a curved or pleated cross section.
- a vent or hole 2420 in the sleeve 2400 allows fluid (e.g., air or immersion fluid) to enter and exit the sleeve 2400 as the sleeve 2400 expands or is compressed. Placing the vent 2420 at the end of the sleeve 2400 that is not immersed in immersion fluid, as shown in FIG. 2.4, allows air or vapor to enter the sleeve 2400 and reduces the amount of immersion fluid needed to fill the immersion tank.
- the sleeve 2400 may be made long enough so that one end sticks out of the immersion fluid when the server chassis and sleeve 2400 are installed in the immersion tank.
- the sleeve 2400 can have one or more vents or holes along its length.
- the sleeve 2400 can be made of a sponge-like material that can be compressed and springs back into shape when relaxed.
- inventive embodiments are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, inventive embodiments may be practiced otherwise than as specifically described and claimed.
- inventive embodiments of the present disclosure are directed to each individual feature, system, article, material, kit, and/or method described herein.
- inventive concepts may be embodied as one or more methods, of which an example has been provided.
- the acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
- a reference to “A and/or B,” when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
- the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements.
- This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified.
- “at least one of A and B” can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
Abstract
A server unit having multiple compute nodes with reconfigurable connections to a plurality of I/O devices is described. The number of I/O devices assigned to each compute node can be reconfigured by a user with firmware and/or hardware. The server unit is adapted to accept oversized I/O devices for additional processing capability. The compute-node PCBs and I/O device PCBs can each be arrayed in two and three dimensions for cooling in an immersion-cooling system.
Description
MULTI-NODE SERVER UNIT
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of and priority to U.S. Provisional Application No. 63/516,996, filed August 1, 2023, U.S. Provisional Application No. 63/578,255, filed August 23, 2023, and U.S. Provisional Application No. 63/598,825, filed November 14, 2023, which applications are hereby incorporated by reference in their entireties.
BACKGROUND
Description of the Related Art
[0002] Section 1: As feature sizes and transistor sizes have decreased for integrated circuits (ICs), the amount of heat generated by a single chip, such as a microprocessor, has increased. Chips that once were air cooled have evolved to chips needing more heat dissipation than can be provided by air alone. In some cases, immersion cooling of chips in a tank containing a coolant liquid is employed to maintain IC chips at appropriate operating temperatures. The chips can be part of a larger assembled unit, such as a server used in a data center.
[0003] One type of immersion cooling is two-phase immersion cooling, in which heat from a semiconductor die is high enough to boil the coolant liquid in the tank. The boiling creates a coolant-liquid vapor in the tank, which is condensed by cooled coils or pipes back to liquid form. Heat from the semiconductor dies can then be sunk into the liquid-to-gas and gas- to-liquid phase transitions of the coolant liquid.
[0004] Section 2: One challenge with operating a data center is keeping the data center’ s servers and other computing components cool enough to operate quickly and efficiently. Dissipating the heat generated by the servers becomes more challenging as the number of servers grows and the operating temperatures of the servers increase. Unfortunately, air cooling is usually not sufficient to cool the large number of servers in a modern data center.
[0005] Fortunately, it is feasible to cool servers by immersing or submerging them in a thermally conductive dielectric liquid coolant. This type of cooling, called immersion cooling, offers several advantages over air cooling. First, it works without cooling fins or fans on the servers or other computing components being cooled, reducing the servers’ size and weight.
Second, it works without a refrigerated air conditioning system, reducing data center power consumption. Third, the liquid coolant has a higher heat capacity than air, so a given volume of liquid coolant can dissipate more heat than the same volume of air.
[0006] There are two types of immersion cooling: single-phase immersion cooling and two-phase immersion cooling. In single-phase immersion cooling, the servers are immersed or submerged in liquid coolant that circulates between a heat exchanger and an immersion tank that contains the servers and immersion fluid. As it circulates, the liquid coolant moves heat generated by the servers to a cold-water circuit or other heat sink thermally coupled to the heat exchanger. The coolant remains in the liquid phase as it circulates, hence the name “singlephase” immersion cooling.
[0007] In two-phase immersion cooling, the servers are submerged in liquid coolant with a relatively low boiling point (e.g., at or about 50 °C). The liquid coolant absorbs heat generated by the servers; this causes the liquid coolant to evaporate. The coolant vapor rises from the surface of the liquid coolant, moving heat away from the servers immersed in the liquid coolant. The coolant vapor is cooled by a heat exchanger, such as a condenser coil, and returns to the liquid phase in the immersion tank holding the servers and the liquid coolant.
BRIEF SUMMARY
[0008] Section 1: The present disclosure relates to a high-performance server unit and associated methods. The server unit can be cooled in an immersion-cooling system (single phase or two phase). The server unit can include multiple reconfigurable host computers, each host computer comprising a compute node of one or more central processing units (CPUs) communicatively coupled to one or more input/output devices (I/O devices). The I/O devices can include, but are not limited to, peripheral cards (such as graphical processing units (GPUs), memory cards, accelerator modules, and custom modules that can include application specific integrated circuits (ASICs), digital signal processors (DSPs), and/or field-programmable gate arrays (FPGAs)). The coupling of I/O devices to compute nodes is reconfigurable by firmware and/or hardware and can be configured at any time by a user. The server unit can be used for artificial intelligence (Al) workloads, such as inference computations, data mining, deep learning, natural language processing, and other types of Al workloads as well as non- Al workloads. There can be a plurality of such server units installed in a tank of an immersioncooling system.
[0009] Some implementations relate to server units which can be installed in immersion-cooling systems. Example server units can comprise a chassis, a plurality of compute nodes mechanically coupled to the chassis, wherein each compute node of the plurality of compute nodes is configured to process a computational workload independently of other compute nodes of the plurality of compute nodes, a plurality of slots and/or sockets mechanically coupled to the chassis to receive I/O devices, and a switchboard mechanically coupled to the chassis. The switchboard can comprise a first plurality of cabling connectors to receive cables that communicatively couple the plurality of compute nodes to the switchboard, a second plurality of cabling connectors to receive cables that couple the plurality of slots and/or sockets to the switchboard, and a plurality of switches to configure and reconfigure communicative couplings between the first plurality of cabling connectors and the second plurality of cabling connectors to assign and reassign portions of the I/O devices to each compute node of the plurality of compute nodes.
[0010] All combinations of the foregoing concepts and additional concepts discussed in greater detail below (provided such concepts are not mutually inconsistent) are part of the inventive subject matter disclosed herein. In particular, all combinations of subject matter appearing in this disclosure are part of the inventive subject matter disclosed herein. The terminology used herein that also may appear in any disclosure incorporated by reference should be accorded a meaning most consistent with the particular concepts disclosed herein.
[0011] Section 2: Embodiments of the present technology include a gasket for a server chassis that holds a server and is installed in an immersion tank filled with liquid coolant. The gasket is formed of an elastically deformable material and has at least one fastener for securing the gasket to the server chassis. The gasket prevents the server chassis from rubbing against a metal surface of the immersion tank as the server chassis and the gasket are inserted into the immersion tank. The gasket can have a chamfered corner. The gasket can be elastically compressed and may be made of plastic, rubber, silicone, plastic, and/or polytetrafluoroethylene. It may include at least one compressible spring feature configured to be compressed between the server chassis and an immersion tank. The gasket can be made of a sponge-like material. The gasket can define a hollow, compressible lumen having a vent or hole to allow fluid flow into and out of the hollow, compressible lumen. The vent or hole can be disposed at an end of the gasket.
[0012] All combinations of the foregoing concepts and additional concepts discussed in greater detail below (provided such concepts are not mutually inconsistent) are part of the
inventive subject matter disclosed herein. In particular, all combinations of claimed subject matter appearing at the end of this disclosure are part of the inventive subject matter disclosed herein. The terminology used herein that also may appear in any disclosure incorporated by reference should be accorded a meaning most consistent with the particular concepts disclosed herein.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
[0013] The skilled artisan will understand that the drawings primarily are for illustrative purposes and are not intended to limit the scope of the inventive subject matter described herein. The drawings are not necessarily to scale; in some instances, various aspects of the inventive subject matter disclosed herein may be shown exaggerated or enlarged in the drawings to facilitate an understanding of different features. In the drawings, like reference characters generally refer to like features (e.g., functionally and/or structurally similar elements).
[0014] FIG. 1.1 depicts an example of a two-phase immersion-cooling system that can be used to cool semiconductor dies.
[0015] FIG. 1.2 depicts a tank for cooling semiconductor dies in assembled units and also illustrates an assembled unit installed in the tank.
[0016] FIG. 1.3 is a block diagram of a server unit that can be installed as an assembled unit in the tank of FIG. 1.2.
[0017] FIG. 1.4A and FIG. 1.4B depict further details of the server unit of FIG. 1.3.
[0018] FIG. 1.5A depicts a first arrangement of printed circuit boards (PCBs).
[0019] FIG. 1.5B depicts a second arrangement of the PCBs.
[0020] FIG. 1.5C depicts a third arrangement of the PCBs.
[0021] FIG. 1.6 depicts an oversized full-height full-length PCB that can be used in the server unit of FIG. 1.4B.
[0022] FIG. 1.7 depicts a switchboard and compute nodes that can be used in the server unit of FIG. 1.3.
[0023] FIG. 1.8A depicts further details of the switchboard of FIG. 1.7.
[0024] FIG. 1.8B depicts I2C communication channels in management circuitry of the switchboard of FIG. 1.8A.
[0025] FIG. 1.8C depicts UART communication channels in the management circuitry of the switchboard of FIG. 1.8A.
[0026] FIG. 1.9A depicts an example of a compute node that can be used in the server unit of FIG. 1.4A.
[0027] FIG. 1.9B depicts details of the compute-node management circuitry for the compute node of FIG. 1.9A.
[0028] FIG. 1.10 depicts a process to recover firmware images for components of the compute-node management circuitry of FIG. 1.9B.
[0029] FIG. 1.11 depicts reset functionality for the compute-node management circuitry of FIG. 1.9B.
[0030] FIG. 1.12 depicts I2C and I3C communication channels provided by the compute-node management circuitry of FIG. 1.9B.
[0031] FIG. 1.13 depicts UART and USB communication channels provided by the compute-node management circuitry of FIG. 1.9B.
[0032] FIG. 1.14 depicts non-maskable interrupt functionality for the compute-node management circuitry of FIG. 1.9B.
[0033] FIG. 1.15 depicts error-handling functionality for the compute-node management circuitry of FIG. 1.9B.
[0034] FIG. 1.16 depicts thermal monitoring of voltage regulators by the compute-node management circuitry of FIG. 1.9B.
[0035] FIG. 1.17 depicts processor-hot (PROCHOT) monitoring functionality of the compute-node management circuitry of FIG. 1.9B.
[0036] FIG. 1.18 depicts power distribution circuitry for the server unit of FIG. 1.4A and FIG. 1.4B.
[0037] FIG. 2.1 shows an inventive sleeve or gasket for protecting a server when it is installed in and removed from an immersion tank.
[0038] FIG. 2.2A shows a front view of the sleeve of FIG. 2.1 attached to one side of a server chassis.
[0039] FIG. 2.2B shows a top view of a server chassis with sleeves attached to both sides.
[0040] FIG. 2.2C shows a detailed cross-section of a sleeve attached to one side of a server chassis.
[0041] FIG. 2.3A shows a cross-sectional view of a server chassis with a sleeve attached to one edge installed in an immersion tank.
[0042] FIG. 2.3B shows a close-up view of the server chassis and the sleeve between guide rails of the immersion tank.
[0043] FIG. 2.3C shows a server chassis with sleeves on both edges installed in an alignment plate at the bottom of the immersion tank.
[0044] FIG. 2.3D is a close-up view of the server and sleeves installed in the alignment plate at the bottom of the immersion tank.
[0045] FIG. 2.4 shows different views of an alternative sleeve for an immersion-cooled server chassis.
DETAILED DESCRIPTION
[0046] Section 1: Multi-Node Server Unit
The inventors have recognized and appreciated that conventional data centers typically provide users access to computing resources in relatively large increments, which can result in excess computing resources being allocated to users. For example, conventional data centers often include servers that provide a single compute node with a fixed number of Peripheral Component Interconnect Express (PCIe) devices (e.g., GPUs, accelerators). When a user pays to use the computing resources of the data center, they typically do so by purchasing access to one (or more) compute nodes, where each compute node has a fixed number of PCIe devices. If a user does not require all of the PCIe devices available for a single compute node, they may pay for more computing power than is needed for their particular workload, and any unused PCIe devices assigned to their compute node become stranded, i.e., unusable for other workloads by the same user or a different user. The inefficient use of the PCIe devices may, in turn, lead to a higher overall total cost of ownership for the user. One option to address stranding of resources is to virtualize the hardware. However, virtualization of hardware carries an undesirable software overhead.
[0047] In some conventional data centers, a networked architecture may be implemented where the computing resources provided by one server are shared with another server, thus allowing for more efficient utilization of computing resources. However, networked architectures are often complex and expensive to implement. For example, such networked architectures are often prone to issues related to, for example, the compatibility of network protocols for data transmission and security of data. Such networked architectures can also experience unfavorable bandwidth limitations.
[0048] For each server unit in a data center, there preferably should be little server overhead in terms of hardware utilization. In this regard, I/O devices that are not used or needed by one compute node should be available to the other compute node(s) that can be dedicated to other AI or non-AI workloads. In this manner, the server units and data center can operate closer to 100% utilization of computing hardware resources, reducing server overhead and data center operating costs.
[0049] The inventors have recognized and appreciated that including multiple compute nodes in a server unit and providing reconfigurability of compute node-to-I/O device pairings can significantly improve server adaptability and performance. For example, a compute node in the server can be assigned an appropriate number of I/O devices (such as GPUs, accelerator modules, etc.) such that the configured host computer (compute node and communicatively coupled I/O device(s)) can handle an intended workload (such as a particular AI inference workload) without having stranded I/O devices. The number of I/O devices and compute nodes communicatively coupled together can be changed by a user as needed to readapt the server unit to handle a different type of AI workload or non-AI workload and thereby more efficiently utilize the available computing resources. In this way, a single server unit can handle different types of workloads ranging from complex AI workloads to significantly simpler non-AI workloads. Furthermore, there is no need for multiple teams of developers, engineers, and support staff to create, produce, and support different types of server designs for different types of workloads. Instead, a single team of developers, engineers, and support personnel can be employed to develop a single server that can be adapted by customers to meet a wide range of customer needs.
1. Overview of Immersion- Cooling System
[0050] FIG. 1.1 depicts several aspects of an immersion-cooling system 1160 for dissipating heat from one or more semiconductor die packages 1105 via immersion cooling. Each die package 1105 can include one or more semiconductor dies that produce heat when the semiconductor dies are in operation. The die packages 1105 can be part of a larger assembled unit, such as a server. FIG. 1.1 depicts a two-phase immersion-cooling system at a time when the semiconductor dies are operating and generating enough heat to boil coolant liquid 1164 in an immersion-cooling tank 1107. Although the immersion-cooling system 1160 in the illustrated example of FIG. 1.1 is a two-phase immersion-cooling system, the invention can also be implemented for single-phase immersion-cooling systems.
[0051] The immersion-cooling system 1160 includes a tank 1107 filled, at least in part, with coolant liquid 1164. The immersion-cooling system 1160 can further include at least one
chiller 1180 that flows a heat-transfer fluid through at least one condenser coil 1170 or pipes located in the head space 1109 of the tank 1107. The condenser coil(s) 1170 or pipes can condense coolant liquid vapor 1166 into droplets 1168 that return to the coolant liquid 1164. The packages 1105 can be mounted on one or more printed circuit boards (PCBs) 1157 that are immersed, at least in part, in the coolant liquid 1164. In some implementations, a foam or froth 1167 can form in the tank 1107 above the coolant liquid 1164 when the system is in operation. The froth 1167 can comprise mostly bubbles 1165 that collect across the surface of the coolant liquid 1164 as the coolant liquid boils.
[0052] The example of FIG. 1.1 is not drawn to scale. The immersion-cooling system 1160 may house and provide coolant liquid 1164 to tens, hundreds, or even thousands of packages 1105. According to some implementations, the packages 1105 can be included in assembled units, such as server units. The total amount of electrical power drawn by the packages 1105 in one tank 1107 of an immersion-cooling system can be from 100,000 watts to 600,000 watts. In some cases, the total amount of electrical power can be over 600,000 watts.
[0053] According to some implementations, the tank 1107 of an immersion-cooling system 1160 can be small (e.g., the size of a floor-unit air conditioner, approximately 1 meter high, 0.5 meters wide, and 0.5 meters deep or long). In some implementations, the tank 1107 of an immersion-cooling system can be large (e.g., the size of an automotive van, approximately 2.5 meters high, 2.5 meters wide, and 4 meters deep or long). In some cases, the tank 1107 of an immersion-cooling system 1160 can be larger than an automotive van.
[0054] The immersion-cooling system 1160 can also include a controller 1102 (e.g., a microcontroller, programmable logic controller, microprocessor, field-programmable gate array, logic circuitry, memory, or some combination thereof) to manage operation of at least the immersion-cooling system 1160. The controller 1102 can perform various system functions such as monitoring temperatures of system components, the coolant liquid level, technician access to the interior of the tank, chiller operation, etc. The controller 1102 can further issue commands to control system operation such as executing a start-up sequence, executing a shut-down sequence, assigning workloads among the packages 1105 and/or assembled units, changing the coolant liquid level, changing the temperature of the heat-transfer fluid circulated by the chiller 1180, etc. In some implementations, the controller 1102 can include (or be included in) a baseboard management controller (BMC) 1104. That is, the BMC 1104 may monitor and control all aspects of system operation for the immersion-cooling system 1160 in addition to monitoring
and controlling workloads of the semiconductor dies 1150 in the packages 1105 and/or workloads among the assembled units cooled by the immersion-cooling system 1160. The immersion-cooling system 1160 can also include a network interface controller (NIC) 1103 to allow the system to communicate over a network, such as a local area network or wide area network.
2. Example Tank for a Two-Phase Immersion-Cooling System
[0055] FIG. 1.2 depicts an example tank 1207 for an immersion-cooling system 1260 that can cool a plurality of assembled units 1210. The tank 1207 can be one of many installed in a data center, for example. The assembled units 1210 can each be servers, dedicated processing units, data-mining units, high bandwidth memory units, or other types of processing and/or memory units.
[0056] In some implementations, the tank 1207 can be formed to include compartments 1205 that are not filled with coolant liquid 1164. The compartments 1205 can be outside a main, enclosed volume of the tank that is at least partially filled with coolant liquid 1164, for example. The compartments 1205 can be used to house server components that need not be cooled by liquid (e.g., signal routing electronics and connectors). Condensing pipes 1270 can be located above and to the side of the assembled units 1210 so that the assembled units can be lowered into and removed from the tank 1207 through an opening 1220 in the tank 1207 without contacting the condensing pipes 1270. There can be many more condensing pipes than the number shown in FIG. 1.2. The opening 1220 can be covered and sealed by an access door 1212 when the immersion-cooling system is in operation. In some implementations, the access door 1212 can comprise glass so that server units 1211 can be viewed inside the tank during operation of the system (e.g., to view indicator LEDs on the server units 1211).
[0057] A single tank 1207 can be sized to contain 10 or more assembled units 1210 (e.g., at least from 10 to 70 assembled units 1210). In some cases, a tank 1207 can be sized to contain more than 70 units. Power to the assembled units 1210 can be provided by one or more busbars 1230 that run(s) along a base of the tank 1207 in the ± x direction. The assembled units 1210 can be placed in one or more rows (e.g., extending in the ±x direction in the illustration) within the tank 1207. In a data center, there can be a plurality of tanks 1207 each containing a plurality of assembled units 1210. The immersion-cooling tanks 1207 can be operated continuously for days, weeks, months, or even longer in some cases, before substantial servicing of any of the tanks is needed.
3. Overview of Server Architecture
[0058] FIG. 1.3 is a simplified diagram of the computing architecture for an example assembled server unit 1211 of the disclosed implementations. The server unit 1211 can comprise a plurality of computing hardware components mounted to a chassis. In some implementations, the size of the chassis can comply with a two-rack-unit (2U or 2RU) form factor in one dimension. The hardware components include multiple compute nodes 1310-1, 1310-2, ... 1310-N, at least one switchboard 1320, and multiple I/O devices 1330-1, 1330-2, ... 1330-M (which may also be referred to as peripheral cards or peripheral devices). The compute nodes 1310 can be communicatively coupled to the switchboard 1320 with first cables 1315 and the I/O devices 1330 can be communicatively coupled to the switchboard 1320 with second cables 1325. In some implementations, the switchboard 1320, I/O devices 1330, first cables 1315, and second cables 1325 are all Peripheral Component Interconnect Express® (PCIe) compliant, though compliance with other communication standards such as Compute Express Link (CXL) and Open Compute Project Accelerator Module (OAM) is possible.
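By way of illustration only, the following Python sketch (not part of the disclosure; all class and field names are hypothetical) models the basic relationship among compute nodes 1310, I/O devices 1330, and a switchboard 1320 that records which I/O devices are currently assigned to which compute node.

```python
"""Illustrative data model (not from the disclosure) for a server unit with
N compute nodes, M I/O devices, and a switchboard that records which I/O
devices are currently assigned to which compute node."""
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ComputeNode:
    node_id: int            # e.g., 1310-1 .. 1310-N
    num_cpus: int = 2       # dual-CPU nodes in the example implementation


@dataclass
class IODevice:
    device_id: int          # e.g., 1330-1 .. 1330-M
    kind: str = "GPU"       # GPU, accelerator, RAID, SSD, custom, ...


@dataclass
class Switchboard:
    # device_id -> node_id (None means the device is currently unassigned)
    assignments: Dict[int, Optional[int]] = field(default_factory=dict)

    def assign(self, device: IODevice, node: ComputeNode) -> None:
        """Record that `device` is routed to `node` (switch settings not shown)."""
        self.assignments[device.device_id] = node.node_id

    def devices_for(self, node: ComputeNode) -> List[int]:
        return [d for d, n in self.assignments.items() if n == node.node_id]


if __name__ == "__main__":
    nodes = [ComputeNode(node_id=i) for i in range(1, 5)]          # N = 4
    devices = [IODevice(device_id=i) for i in range(1, 17)]        # M = 16
    board = Switchboard()
    # Example: give the first compute node 8 devices and split the rest 4/4/0.
    for dev in devices[:8]:
        board.assign(dev, nodes[0])
    for dev in devices[8:12]:
        board.assign(dev, nodes[1])
    for dev in devices[12:16]:
        board.assign(dev, nodes[2])
    print({n.node_id: board.devices_for(n) for n in nodes})
```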
[0059] Each compute node 1310-1 can comprise one or more CPUs communicatively coupled to each other. Each CPU of the compute nodes 1310 can dissipate up to 350 watts of power, so that the compute nodes 1310 of a single server unit 1211 can dissipate up to 2800 watts of power or more. In operation, the CPU(s) of a compute node 1310-1 work together (in collaboration with communicatively-coupled I/O devices 1330) on a particular computational task or workload whereas other CPUs and other I/O devices in the server do not work on the same task or workload. An example CPU that can be used in a compute node 1310-1 is the 4th generation Intel® Xeon® Scalable Processor, Max Series, available from Intel Corporation® of Santa Clara, California. Another example CPU that can be used is Intel’s 5th generation Xeon® processor (currently codenamed Emerald Rapids). Other types of CPUs can be used in the compute nodes 1310 and the invention is not limited to only Intel CPUs. According to an example implementation of the server unit 1211, described further below, each compute node 1310-1, 1310-2, ... 1310-N comprises two CPUs communicatively coupled together and there are four (N=4) compute nodes in the server unit 1211 for a total of eight CPUs. However, each compute node could have only one CPU in some implementations. Some implementations of a server unit 1211 can have fewer than four compute nodes 1310 and other implementations can have more than four compute nodes 1310. There may be from 2 to 8 compute nodes 1310 in a server unit 1211.
[0060] In some implementations, the I/O devices 1330 can include GPUs, though other types of I/O devices (such as accelerators and custom cards) can be included in a server unit 1211. In some implementations, a server unit 1211 can include one or more redundant array of independent disks (RAID) cards and/or one or more solid-state drive (SSD) cards among the I/O devices 1330. In some cases, all the I/O devices 1330 can be the same (e.g., all GPUs). An example GPU card that can be used in the server unit 1211 is the Nvidia Titan V graphics card or the NVIDIA A100 Tensor Core GPU card, available from Nvidia Corporation of Santa Clara, California. In other implementations, there can be different types of I/O devices 1330 (e.g., a mix of GPUs, accelerators, RAID cards, custom cards, and SSD cards or other types of I/O devices). For the implementation of the server unit 1211 described further below, there can be up to 16 I/O devices 1330 in the server unit. In some implementations, a server unit 1211 can have fewer than 16 I/O devices or more than 16 I/O devices. There may be from 0 to 32 I/O devices 1330 in a server unit 1211.
[0061] As mentioned above, the I/O devices 1330 can be communicatively coupled to the compute nodes 1310 through the second cables 1325, switchboard 1320, and first cables 1315. Each I/O device can plug into a socket or slot in the server unit 1211. The socket or slot that receives the I/O device can connect to a cabling connector 1423 (male or female) to which one of the second cables 1325 can be plugged. The cables 1315, 1325 and cabling connectors 1420, 1421, 1422, 1423 can be used to manually reconfigure communicative couplings between the compute nodes 1310 and I/O devices 1330.
[0062] The first and second cables 1315, 1325 can be connected between cabling connectors 1422, 1420 and 1421, 1423, respectively, by a user to obtain a server configuration desired by the user. Further, software and firmware resident, at least in part, in the switchboard 1320 or a controller in the server unit that communicates with the switchboard 1320 can be executed to communicatively couple any desired portions of the I/O devices 1330-1, 1330-2, ... 1330-M to each of the compute nodes 1310-1, 1310-2, ... 1310-N. For example, one or all of the I/O devices can be communicatively coupled to any one of the compute nodes and any remaining I/O devices can be divided into groups and communicatively coupled to the remaining compute nodes as desired by the user.
[0063] At least a portion of the couplings between compute nodes 1310 and I/O devices 1330 can be reconfigured at any time (e.g., when the affected compute nodes and I/O devices are idle) with user instructions provided to the switchboard 1320 through a programming interface (such as orchestrator software) without physical changes to the cables 1315, 1325. The orchestrator software can be installed and executed on a remote controller that communicates with firmware and/or software executing in the server unit 1211 to alter communicative couplings between one or more of the compute nodes 1310 and at least a portion of the I/O devices 1330. For example, the executing firmware and/or software can be used to communicate with and set signal routing switches in the switchboard 1320. The orchestrator software can reside on another server in the data center and be used to manage operation of a plurality of server units 1211 in the data center, for example. More extensive changes to the couplings between compute nodes 1310 and I/O devices 1330 on a server unit 1211 can be made by changing physical connections of the first cables 1315 between first cabling connectors 1422, 1420 and/or changing physical connections of the second cables 1325 between cabling connectors 1421, 1423.
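By way of illustration only, the following Python sketch shows one hypothetical shape an orchestrator-side reconfiguration call could take. The disclosure does not define this interface; the function and parameter names are assumptions.

```python
"""Hypothetical orchestrator-side sketch (the disclosure does not define this
API): push a new I/O-device-to-compute-node assignment to a server unit's
management firmware, which then programs the switchboard switches."""
from typing import Dict, List


def is_idle(node_id: int, device_ids: List[int]) -> bool:
    # Placeholder for a real idleness query to the compute-node BMC.
    return True


def apply_assignment(server_unit: str, assignment: Dict[int, List[int]]) -> None:
    """assignment maps compute-node id -> list of I/O device ids.

    Reconfiguration is only attempted while the affected compute nodes and
    I/O devices are idle, mirroring the behavior described above.
    """
    for node_id, device_ids in assignment.items():
        if not is_idle(node_id, device_ids):
            raise RuntimeError(f"node {node_id} busy; defer reconfiguration")
    # In a real system this would be a call into the switchboard firmware
    # (e.g., over the data-center management network); here we just log it.
    print(f"[{server_unit}] programming switches for assignment: {assignment}")


if __name__ == "__main__":
    # Re-partition 16 I/O devices across 4 compute nodes without recabling.
    apply_assignment("server-unit-1211", {
        1: list(range(1, 9)),    # node 1 gets devices 1-8
        2: list(range(9, 13)),   # node 2 gets devices 9-12
        3: list(range(13, 17)),  # node 3 gets devices 13-16
        4: [],                   # node 4 idles with no devices
    })
```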
[0064] The inventors have recognized that peer-to-peer communications between the I/O devices 1330 can enhance processing power of the server unit 1211. Accordingly, the I/O devices 1330 can communicate signals with each other through the switchboard 1320 via the second cables 1325 without communication signals traveling to and from any of the compute nodes 1310. Such communication can reduce system latency (compared to communicating through a compute node), accelerate computational speed, and reduce computational burden on the compute nodes. In some implementations, at least a portion of the I/O devices 1330 can be adapted for direct card-to-card communication via third cables 1335 (such as an I/O device data bus). Each such I/O device can include transceivers 1332 that can transmit data to other I/O devices and receive data from other I/O devices. The transceivers 1332 may support universal asynchronous receive/transmit (UART) communications and/or inter-integrated circuit (I2C) communications, for example. In some implementations, faster serializer/deserializer (SERDES) hardware can be used to support, for example, PCIe Gen 5 communications directly between I/O devices 1330. Direct card-to-card communication can further reduce system latency (compared to communicating through the switchboard) and speed up server computations.
4. Server Layout
[0065] FIG. 1.4A and FIG. 1.4B depict an example layout for an assembled server unit 1211 of FIG. 1.2 and FIG. 1.3. The front side of the server unit 1211 is depicted in FIG. 1.4A and the back side of the server unit is depicted in FIG. 1.4B. The body 1407 of the server unit 1211 can have a depth (in the x dimension) of approximately or exactly 89 mm (2RU) for the illustrated implementation, though the server unit is not limited to only a 2 RU depth. The body
1407 of the server unit 1211 is a portion of the server unit behind a face or faceplate 1406 of the server unit 1211 that installs into a rack or other assembly that supports the server unit 1211. The body 1407 of the server unit 1211 comprises a portion of the chassis in which the compute nodes 1310, switchboard 1320, and I/O devices 1330 are mounted.
[0066] The server unit 1211 can be adapted for installation in a tank 1207 of an immersion-cooling system. The server unit 1211 comprises a chassis 1405 that can include lift rings 1408 (such as handles, hooks, eye bolts, etc.) for lowering and lifting the server unit 1211 into and out of an immersion-cooling tank 1207. Hardware components (compute nodes 1310, switchboard 1320, and I/O devices 1330) can be assembled on PCBs which are mounted to the chassis 1405. The chassis 1405 can be formed from a metal such as steel that is treated to be corrosion resistant or an aluminum alloy. The chassis 1405 and/or normally exposed metal conductors and solder and at least some components on the PCBs can be treated to prevent corrosion when in contact with coolant liquid 1164 in the tank 1207. According to some implementations, the compute nodes 1310 (four in this example) can all be mounted on a first side of the chassis 1405, as illustrated in FIG. 1.4A. The I/O devices 1330 (eight visible in the drawings though 16 are present in this example) can all be mounted on a second, opposite side of the chassis 1405, as illustrated in FIG. 1.4B.
[0067] The switchboard 1320 can be located at the top of the chassis 1405 (e.g., in a top one-third of the server unit 1211) and may be immersed, at least in part, or may not be immersed in coolant liquid 1164 when the server unit is installed for operation in the immersion-cooling tank 1207. For example, the dimension D2 of the server unit 1211 may be 900 mm, and the top third of the server unit would then be the top 300 mm of the server unit 1211. The top of the chassis 1405 and top of the server unit 1211 would be farthest from a base of the immersion-cooling tank 1207 when installed in the tank 1207 compared to a bottom or base of the chassis 1405 and server unit 1211. The compute nodes 1310 and I/O devices 1330 can be mechanically coupled to the chassis 1405 and located in a lower two-thirds of the server unit 1211 and can all be immersed in coolant liquid 1164 when the server unit is installed and operating in the immersion-cooling tank 1207.
[0068] There can be electrical contacts 1409 (pins or sockets) that extend from or are located on a first power-distribution board 1410 that is located at the base 1408 of the server unit 1211. The electrical contacts 1409 are arranged to make power connections to mating electrical contacts on a busbar 1230 at the base of the immersion-cooling tank 1207 when the server unit is lowered into position and installed in the immersion-cooling tank 1207 for operation. For
example, the server unit 1211 can essentially be plugged into the busbar 1230 at the base of the immersion-cooling tank 1207. Guide rails in the immersion-cooling tank 1207 can guide the server unit 1211 into place such that it can be easily installed and plugged into the busbar 1230 at the base of the tank. The guide rails can comprise right-angle bars or T bars that extend along the z direction and guide at least two edges of the server unit 1211 that also extend along the z direction, for example.
[0069] The first power-distribution board 1410 can receive power from the busbar 1230 (which may be at 48 volts or at a voltage in a range from 40 volts to 56 volts) and provide and/or convert the power to one or more voltage values compatible with the compute nodes 1310 and/or I/O devices 1330 (e.g., 48 volts, ± 12 volts, ± 5 volts, and/or ± 3.3 volts). The first power-distribution board 1410 can include fan-out connections, fuses, buffers, power converters, power amplifiers, or some combination thereof to distribute power to a plurality of components within the server unit 1211. The server unit 1211 can further include one or more second power-distribution boards 1436. The second power-distribution boards 1436 can also be mounted at a periphery of the chassis 1405 (e.g., along one or two sides of the chassis) and can convert power at a first voltage (received from the busbar 1230 and/or first power-distribution board 1410) to power at one or more second voltages. The second power-distribution board(s) 1436 can also include fan-out connections, fuses, buffers, power amplifiers, power converters, or some combination thereof to distribute power to a plurality of components within the server unit 1211. For the implementation of FIG. 1.4B, there are two second power-distribution boards 1436 mounted on vertical sidewalls of the chassis 1405. The two second power-distribution boards 1436 can be identical to and interchangeable with each other to simplify design, reduce component count, and reduce manufacturing complexity. The first power-distribution board 1410 and second power-distribution boards 1436 can be located in a lower two-thirds of the server unit such that they are immersed, fully or mostly, in coolant liquid 1164 when installed in the tank 1207.
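As a rough worked example using figures quoted above (a 48-volt busbar and up to 350 watts per CPU across eight CPUs), and an assumed converter efficiency that is not specified in the disclosure, the current drawn through the busbar contacts can be estimated as follows.

```python
"""Back-of-the-envelope power-distribution arithmetic using figures quoted
above (48 V busbar, up to 350 W per CPU, 8 CPUs per server unit). The
converter efficiency is an assumed value for illustration only."""

BUSBAR_VOLTAGE_V = 48.0
CPU_POWER_W = 350.0
CPUS_PER_UNIT = 8
CONVERTER_EFFICIENCY = 0.95   # assumed, not specified in the disclosure

cpu_load_w = CPU_POWER_W * CPUS_PER_UNIT                  # 2800 W of CPU load
input_power_w = cpu_load_w / CONVERTER_EFFICIENCY         # power drawn from busbar
busbar_current_a = input_power_w / BUSBAR_VOLTAGE_V       # current through contacts 1409

print(f"CPU load:        {cpu_load_w:.0f} W")
print(f"Busbar draw:     {input_power_w:.0f} W at {BUSBAR_VOLTAGE_V:.0f} V")
print(f"Busbar current:  {busbar_current_a:.1f} A (CPUs only, excluding I/O devices)")
```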
[0070] The chassis can be sized to comply, in part, with standard rack unit (RU) sizes. For example, the chassis can be sized in the x direction to comply with the Electronic Industries Alliance (EIA)-310 standard of 1 RU or 2 RU size (which commonly references the height of the unit), depending on the number of compute nodes 1310 and I/O devices 1330 included in the server unit 1211. A 2 RU dimension would be approximately or exactly 89 mm in the x direction in the drawing (normally referred to as height of the unit). The other dimensions of the server unit 1211 can be D1 that is approximately or exactly 800 mm (y direction in the drawing) and D2 that is approximately or exactly 900 mm (z direction in the drawing) and may or may not comply with an industry standard. In some cases, D1 can be any value in a range from 700 mm to 900 mm and D2 can be any value in a range from 800 mm to 1000 mm.
[0071] The switchboard 1320 can be mechanically coupled to the chassis 1405 and located near the top of the assembled server unit 1211. For example, the center of the switchboard 1320 can be located in the upper half, upper third, or upper quarter of the server unit 1211. There can be an opening 1402 in the chassis 1405 such that cables can pass through the chassis and connect to cabling connectors 1420, 1421 located on opposing sides of the switchboard 1320. There can be a first plurality of cabling connectors 1420 on a first side of the switchboard 1320 (shown in FIG. 1.4A) for configuring cable connections between the switchboard 1320 and the compute nodes 1310. There can be a second plurality of cabling connectors 1421 on the second side of the switchboard 1320 (shown in FIG. 1.4B) for configuring cable connections between the switchboard 1320 and I/O devices 1330. The cabling connectors 1420, 1421 can all be the same or there can be different cabling connectors used on the switchboard 1320. An example cabling connector that can be used is the Mini Cool Edge I/O (MCIO) connector.
[0072] By locating the switchboard near the top of the assembled server unit 1211, at least some of the cables between the switchboard 1320 and the compute nodes 1310 and VO devices 1330 may be above the level of coolant liquid 1164 in the immersion-cooling tank 1207 when the server unit is in operation. Keeping cables out of the coolant liquid 1164 can reduce contamination of the coolant liquid by the cables and extend the service life of the coolant liquid 1164. Further, some or all of the cabling connectors 1420, 1421 on the switchboard can be above the level of coolant liquid 1164 in the immersion-cooling tank 1207 when the server unit 1211 is in operation, which can also help reduce contamination of the coolant liquid.
[0073] The compute nodes 1310 can be arranged within the assembled server unit 1211 in a two-dimensional array (e.g., extending in the y and z dimensions as shown in FIG. 1.4A). Each compute node can include one or more compute-node connectors 1422 for making cable connections to the first plurality of cabling connectors 1420 on the switchboard 1320. For the illustrated implementation, there are four compute-node connectors 1422 for each compute node 1310-1, 1310-2, ... 1310-4. In some implementations, two or more arrays of compute nodes can be stacked in the third dimension (x direction in the drawing) to provide up to eight or more compute nodes in the server unit 1211. For such an implementation, the server unit 1211 may have a larger x dimension (e.g., up to 3 RU or more).
[0074] The two-dimensional array of compute nodes 1310 shown in FIG. 1.4A can make it easy for a user to install and visually check the cabling connections between the compute nodes 1310 and switchboard 1320 as well as change the cables to reconfigure the assignment of compute nodes 1310 to I/O devices 1330. As can be deduced from FIG. 1.4A, the cables between the 16 cabling connectors 1420 on the switchboard 1320 and the 16 compute-node cabling connectors 1422 can be spread over the y and z dimensions with no obstructed viewing of cables and connectors. For example, the cabling connectors 1422 of different compute nodes 1310 are not stacked side-by-side in the x dimension, which could make it difficult to trace cable connections. The two-dimensional array of compute nodes 1310 can also improve immersion cooling of the compute nodes.
[0075] In typical conventional computers, PCBs 1505 are stacked like pages in a book, in a one-dimensional linear array such that the smallest dimension of the PCB (the thickness of the PCB) and the array extend in the same direction, as depicted in FIG. 1.5A. Stated alternatively, the planar surface 1507 of each PCB 1505 in the linear array is oriented perpendicular to a line 1510 running approximately or exactly through the center of each PCB 1505 in the array and extending in the direction of the linear array. The planar surface 1507 of the PCB 1505 is the surface on which electronic components are mounted. The electronic components are not shown in FIG. 1.5A through FIG. 1.5C.
[0076] FIG. 1.5B depicts another way to array PCBs 1505 in one dimension. In this arrangement, the planar surface 1507 of each PCB 1505 in the linear array is oriented parallel to the line 1510 running approximately or exactly through the center of each PCB 1505 in the array and extending in the direction of the linear array. For the arrangement of FIG. 1.5A, the size of the array of PCBs 1505 viewed from any direction spans a significantly smaller area than the largest area spanned by the array of FIG. 1.5B.
[0077] For the assembled server unit 1211 described herein, the compute nodes 1310 and I/O devices 1330 are arranged side-by-side and end-to-end in two-dimensional planar arrays, like that shown in FIG. 1.5C. In this arrangement, the planar surface of each PCB 1505 in the array is oriented approximately or exactly parallel (e.g., within ± 20 degrees of parallel) to a plane 1520 extending in the two directions (y and z in the drawing) spanned by the two-dimensional planar array. The planar array arrangement of FIG. 1.5C can be beneficial for immersion cooling because it can provide unimpeded fluid access to semiconductor dies and their heat dissipative elements on each of the PCBs 1505. The arrangement of I/O devices 1330 for the server unit 1211 of FIG. 1.4B comprises two, two-dimensional planar arrays arranged beside
each other in the x direction of the drawing. Each two-dimensional planar array comprises eight I/O devices 1330 for a total of 16 I/O devices 1330 in the implementation of FIG. 1.4B. However, there can be fewer I/O devices 1330 in the server unit 1211 arranged in a two-dimensional array. For example, there can be at least four I/O devices 1330 installed in the server unit 1211 that are arranged in a two-dimensional array like that shown in FIG. 1.5C.
[0078] The compute nodes 1310 can be oriented within the server unit 1211 such that the planar surfaces of their PCBs extend approximately or exactly (e.g., within ± 20 degrees) in directions across the broad area of the server unit 1211 (in the y and z dimensions in the illustrated example). In this orientation, the broad faces of heat-dissipating elements 1413 that are thermally coupled to CPUs of the compute nodes 1310 also extend approximately or exactly in the same (y and z) dimensions, allowing bubbles generated by the heat-dissipating elements 1413 (when immersed in the coolant liquid) to rise freely without being trapped under the compute nodes 1310 or being impeded by other closely-spaced compute nodes when the server unit 1211 is installed in the immersion-cooling tank 1207 and operating. Such free-rising bubbles 1165 are depicted in FIG. 1.1, where the broad faces of the packages 1105 extend in the x and z dimensions for that illustration. Free removal of generated bubbles can prevent dry-out of the heat-dissipating elements 1413 and improve cooling of the compute nodes 1310. Stated alternatively, the planes of the PCBs of the compute nodes 1310 and the broad faces of heat-dissipating elements 1413 on the compute nodes 1310 extend approximately or exactly (e.g., within ± 20 degrees) in vertical and horizontal dimensions when the server unit is installed in the immersion-cooling tank 1207.
[0079] Because of the opening 1402 in the chassis 1405 and cabling connectors 1420, 1421 mounted on opposite sides of the switchboard 1320, cabling connections to the compute nodes 1310 on one side of the chassis 1405 and to the VO devices 1330 on the opposite side of the chassis can be made entirely within the chassis 1405, without cables passing outside the perimeter of the chassis 1405. The perimeter of the chassis 1405 is defined as six planar faces that bound the chassis’ x, y, and z dimensions.
[0080] The I/O devices 1330 can be arrayed in two dimensions and/or three dimensions on an opposite side of the server unit 1211 from the compute nodes 1310, as depicted in FIG. 1.4B. Each of the I/O devices 1330 can include or couple to an I/O device connector 1423, and the I/O device connector 1423 can couple (via second cables 1325) to one of the second plurality of cabling connectors 1421 of the switchboard 1320. For the illustrated example of FIG. 1.4B, each I/O device 1330-5, for example, can plug into a slot 1437 or socket that is mechanically coupled to the chassis 1405. In some implementations, the slot 1437 or socket can be located on a corresponding device connector board 1432-5 on which is mounted the cabling connector 1423 for the I/O device 1330-5. Each device connector board 1432-5 can be mechanically coupled to the chassis 1405 and service one, two, or more I/O devices 1330 that can plug into a corresponding one, two, or more slots 1437 or sockets mounted on the device connector board 1432-5. The device connector board 1432-5 can include one, two, or more cabling connectors 1423 for cabling connections to the second plurality of connectors 1421 of the switchboard 1320.
[0081] The I/O devices 1330 can have a size up to full-height full-length (FHFL) I/O devices or larger and can be stacked in the x dimension. For the illustrated server unit 1211, there are a total of 16 FHFL I/O devices (of increased size) mechanically coupled to the chassis, two stacked in the x dimension at each device location. The two stacked I/O devices at each device location can be serviced by the same device connector board 1432-5 or two different device connector boards. As such, the assembled server unit 1211 comprises a highly dense computing package comprising four dual-processor compute nodes 1310 and 16 FHFL I/O devices 1330 (of increased size) along with signal routing hardware to support reconfigurable communicative couplings between compute nodes and I/O devices, all within a 2 RU package having a total volume of less than 0.07 m³.
[0082] Of course, smaller I/O device sizes and a mix of device sizes can be used in the server unit 1211. For example, any one or all of the I/O devices 1330 can be full-height half-length (FHHL) or half-height full-length (HHFL), for example. Other sizes are also possible. For the example implementation, each of the FHFL I/O devices 1330 is slightly oversized.
[0083] The PCB 1610 shape and size for the oversized I/O devices 1330 is depicted in FIG. 1.6. The PCB height is increased by more than 7% fully along the top of the PCB compared to a standard FHFL PCB for conventional I/O devices. The increased height provides a first added board area 1612 to accommodate more components, board traces, and/or portions of components. Further, there is a second added board area 1614 at the base of the PCB 1610 that extends beside the board's pin area 1620. The second added board area 1614 extends along the majority of the length of the PCB 1610 at the base of the PCB to also accommodate more components, board traces, and/or portions of components. A lower edge 1617 of the second added board area 1614 steps up (by 0.15 mm in this example) compared to a lower edge 1615 of the PCB in the pin area 1620. The step-up of the lower edge 1617 of the second added board area 1614 can prevent the PCB 1610 from contacting other hardware or components when an I/O device comprising the PCB 1610 is inserted into a bus slot, for example. The full height of the increased-size PCB for the I/O devices 1330 can be from 112 mm to 120 mm.
[0084] The PCB 1610 can further include one or more “keep-out” areas 1660 where components above a specified height are not permitted on the assembled I/O device. The preclusion of components above a specified height from the keep-out areas 1660 can prevent contacting of components on a first I/O device with components on a second I/O device when the first and second I/O devices are mounted immediately adjacent to each other in a server, for example.
[0085] FIG. 1.7 is a block diagram that depicts one example implementation of cabling connections between the compute nodes 1310, switchboard 1320, and I/O device connectors 1423, which couple to I/O devices 1330 (not shown in FIG. 1.7). The switchboard 1320 comprises four switches 1430-1, 1430-2, 1430-3, 1430-4 in the illustrated example of FIG. 1.7, though more or fewer switches can be used in other implementations. In this illustrated implementation, each compute node has five compute-node connectors 1422. The first cables 1315 can connect four of the compute-node connectors 1422 on each compute node 1310-1, 1310-2, 1310-3, 1310-4 to the first plurality of connectors 1420 on the switchboard 1320 (16 first cables 1315 and 16 switchboard cabling connectors 1420 in this example). The first cables 1315 are arranged such that each one of the compute nodes 1310 connects (through the first plurality of cabling connectors 1420 and traces in the switchboard's PCB 1326) to one communication port on each of the four switches 1430. Four additional communication ports from each of the four switches 1430 connect to four of the second plurality of connectors 1421 on the switchboard 1320 (e.g., through traces in the switchboard's PCB 1326). A ninth communication port 1329 on each of the switches 1430 can be used to connect one switch to another one of the switches, which allows for peer-to-peer communication and signal routing directly between the connected switches. In the example implementation of FIG. 1.7, switch 1430-1 directly connects to switch 1430-2 through their ninth communication ports and switch 1430-3 directly connects to switch 1430-4 through their ninth communication ports. Second cables 1325 communicatively couple the second plurality of connectors 1421 on the switchboard 1320 to the I/O device connectors 1423. In an example peer-to-peer communication, a signal from a first I/O device arriving at a first connector 1423-1 can pass through a first switch 1430-1 to a second switch 1430-2 and to a second connector 1421-8 without going to any of the compute nodes 1310.
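By way of illustration only, the following Python sketch models the FIG. 1.7 topology described above (each compute node linked to one port on each of the four switches, each switch fanning out to four I/O device connectors, and switches paired through their ninth ports); the identifiers are hypothetical.

```python
"""Illustrative model (not from the disclosure) of the FIG. 1.7 cabling
topology: four compute nodes, four switches, and sixteen I/O device
connectors. Each compute node has one link to every switch, each switch
fans out to four I/O connectors, and switches are paired via ninth ports."""

NODES = [f"node{i}" for i in range(1, 5)]
SWITCHES = [f"switch{i}" for i in range(1, 5)]

# Each switch serves four I/O device connectors (16 total).
IO_PER_SWITCH = {
    sw: [f"io{(s * 4) + i + 1}" for i in range(4)]
    for s, sw in enumerate(SWITCHES)
}

# Every compute node is cabled to one port on every switch.
NODE_LINKS = {node: list(SWITCHES) for node in NODES}

# Ninth-port pairing enables switch-to-switch (peer-to-peer) routing.
PEER_LINKS = {"switch1": "switch2", "switch2": "switch1",
              "switch3": "switch4", "switch4": "switch3"}


def reachable_ios(node: str) -> list:
    """I/O connectors a node can reach through exactly one switch."""
    return [io for sw in NODE_LINKS[node] for io in IO_PER_SWITCH[sw]]


if __name__ == "__main__":
    for node in NODES:
        # Every node reaches all 16 I/O connectors via a single switch hop.
        assert len(reachable_ios(node)) == 16
    # Peer-to-peer path from an I/O device on switch1 to one on switch2.
    print("switch1 pairs with:", PEER_LINKS["switch1"])
```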
[0086] The fifth compute-node connector 1422 can be used to couple a CPU of a compute node 1310-1 to a network interface card (NIC), for example, so that the CPU(s) of each
compute node can communicate with other CPUs in the server unit 1211 and/or remote devices via a local area network or wide area network. An example of a NIC that can be used in a server unit 1211 is the Nvidia Bluefield 3 NIC available from Nvidia Corporation of Santa Clara, California.
[0087] By reconfiguring the first cables 1315 and/or the second cables 1325, as well as additionally or alternatively changing switch settings in the switches 1430, the communicative couplings between compute nodes 1310 and I/O devices 1330 can be reconfigured. For the implementation of FIG. 1.7, each one of the compute nodes 1310 can access any one, and up to all, of the 16 I/O device connectors 1423 (and coupled I/O devices 1330) through only one switch to reduce latency in the system. Accordingly, any sized portion of the I/O devices 1330 can be assigned to any one of the compute nodes 1310 using cabling and/or switch settings of the switchboard 1320. The remaining I/O devices 1330 can be assigned in any sized portions to the remaining compute nodes.
[0088] As an example and for the implementation of FIG. 1.7, which has L = 16 I/O devices 1330, an integer number M of the I/O devices 1330 can be assigned to a first compute node 1310-1, an integer number N of the I/O devices 1330 can be assigned to a second compute node 1310-2, an integer number P of the I/O devices 1330 can be assigned to a third compute node 1310-3, and an integer number Q of the I/O devices 1330 can be assigned to a fourth compute node 1310-4, where M, N, P, and Q can each have an integer value in a range from 0 to L (L = 16 for the example implementation), subject to the constraint that M + N + P + Q ≤ L.
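A minimal sketch of this constraint check, using the example value L = 16, is shown below; the function name is illustrative only.

```python
"""Sketch of the assignment constraint described above: with L I/O devices
and per-node counts M, N, P, Q, a partition is feasible provided each count
is between 0 and L and M + N + P + Q does not exceed L."""


def valid_assignment(counts, total_devices=16):
    """Return True if the per-compute-node device counts are feasible."""
    return (all(0 <= c <= total_devices for c in counts)
            and sum(counts) <= total_devices)


if __name__ == "__main__":
    print(valid_assignment([16, 0, 0, 0]))   # True: one node takes all devices
    print(valid_assignment([4, 4, 4, 4]))    # True: even split, none stranded
    print(valid_assignment([8, 8, 4, 0]))    # False: 20 > 16 devices
```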
[0089] In some implementations, the assignment of I/O devices 1330 to each compute node can be done entirely by the switches 1430, using only the settings of the switches 1430. For example, the assignment of I/O devices 1330 can be done via software instructions (e.g., via orchestrator software as discussed above) that configure the settings of the switches 1430. In this manner, a user of the system can assign the I/O devices 1330 to the compute nodes 1310 as desired by the user. The assignment can be done from a remote location in the data center or outside of the data center (e.g., via the internet).
5. Switchboard
[0090] FIG. 1.8A and FIG. 1.8B depict further details of the switchboard 1320 of FIG. 1.4A, FIG. 1.4B, and FIG. 1.7. The switchboard comprises the first plurality of cabling connectors 1420, the second plurality of cabling connectors 1421, and four switches 1430 to reconfigure communicative couplings between the first plurality of cabling connectors 1420 and
the second plurality of cabling connectors 1421. The switchboard 1320 can further comprise management circuitry 1461 to facilitate operation of the switchboard.
[0091] In the illustrated example of FIG. 1.8A, the first plurality of cabling connectors 1420 and the second plurality of cabling connectors 1421 are all MCIO connectors, though other types of connectors (e.g., RJ45 connectors) can be used for some implementations. The switches 1430 can be programmable. For example, the switchboard 1320 can further include at least one configuration input 1450 (e.g., at least one terminal or connector to receive one or more digital signals) by which settings within each switch 1430-1, 1430-2, 1430-3, 1430-4 can be configured to establish desired communicative couplings between the first plurality of cabling connectors 1420 and the second plurality of cabling connectors 1421 (and thereby establish assignments of VO devices 1330 to compute nodes 1310).
[0092] For the example implementation shown in FIG. 1.4B, the switches 1430 are located under, and in thermal contact with, two heat sinks 1440 that dissipate heat from the switches 1430. An example switch that can be used for the switchboard 1320 is the Broadcom Atlas2 144-port PCIe switch (model number PEX89144), available from Broadcom Inc. of San Jose, California. For this switch, 16 ports of the chip can be grouped together as a communication port of the switch to provide 16 signal lines to one of the cabling connectors 1420, 1421, such that each switch can service eight cabling connectors (with 16 signal lines to each of the 8 cabling connectors) and connect to another switch with 16 signal lines for a switch-to-switch communication port 1329. When configured, the switches 1430 and traces in the switchboard's PCB can be used to route signals between the first plurality of cabling connectors
1420 on the first side of the switchboard 1320 and the second plurality of cabling connectors
1421 on the second side of the switchboard and thereby configure I/O device to compute node assignments.
[0093] The management circuitry 1461 can comprise a switchboard board management controller (BMC) 1460 and service processor 1470, among other components. The switchboard BMC 1460 may comprise, for example, a microcontroller, a programmable logic device, a complex programmable logic device, a microprocessor, a field-programmable gate array, logic circuitry, memory, or some combination of these devices in which there can be none, one, or more of each device present in the combination. An example device that can be used for the switchboard BMC 1460 is the AST2600 server management processor available from ASPEED of Hsinchu City, Taiwan. The switchboard BMC 1460 can be communicatively coupled to one or more memory devices 1480 (such as one or more double data rate (DDR)
synchronous dynamic random access memory devices and/or one or more embedded multimedia card (eMMC) memory devices). These memory devices can store code and data for operation of the switchboard 1320. The switchboard BMC 1460 can also be communicatively coupled to a service processor 1470 and at least one complex programmable logic device (CPLD) 1475, which is shown in the further detailed drawing of FIG. 1.8B.
[0094] The switchboard BMC 1460 can manage local resources (such as controlling and monitoring the state of each of the connected I/O devices 1330). In some implementations, the switchboard BMC 1460 may be responsible for assigning computing tasks from each compute node to I/O devices communicatively coupled to the compute node by the switches 1430. The switchboard BMC 1460 may coordinate assignment of computing tasks among the I/O devices 1330 with a compute-node BMC associated with the compute node that is communicatively coupled to the I/O devices 1330, and perform such management for each of the compute nodes 1310-1, 1310-2, 1310-3, 1310-4. For example, the switchboard BMC 1460 may indicate to a compute-node's BMC (described further below) when one or more I/O devices 1330 coupled to the compute node has (or have) completed tasks and/or is (or are) ready to accept a new computing task, thus allowing the compute node to route a new compute task to an available I/O device.
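By way of illustration only, the following Python sketch shows one hypothetical way such coordination could be expressed; the disclosure does not specify this protocol, and the class and method names are assumptions.

```python
"""Hypothetical coordination sketch (the disclosure does not specify this
protocol): the switchboard BMC tracks which I/O devices assigned to a
compute node are idle and notifies that node's BMC so the node can route
its next compute task to an available device."""
from collections import defaultdict


class SwitchboardBMC:
    def __init__(self, assignments):
        # assignments: node_id -> list of device_ids coupled by the switches
        self.assignments = assignments
        self.busy = defaultdict(bool)   # device_id -> currently running a task?

    def device_finished(self, device_id):
        """Called when an I/O device reports task completion."""
        self.busy[device_id] = False

    def ready_devices(self, node_id):
        """Devices coupled to `node_id` that can accept a new task."""
        return [d for d in self.assignments[node_id] if not self.busy[d]]

    def dispatch(self, node_id, task):
        """Route a task from a compute node to one of its idle devices."""
        ready = self.ready_devices(node_id)
        if not ready:
            return None                  # node defers until a device frees up
        device = ready[0]
        self.busy[device] = True
        print(f"node {node_id}: task '{task}' -> I/O device {device}")
        return device


if __name__ == "__main__":
    bmc = SwitchboardBMC({1: [1, 2], 2: [3, 4, 5, 6]})
    bmc.dispatch(1, "inference batch A")
    bmc.dispatch(1, "inference batch B")
    bmc.device_finished(1)
    bmc.dispatch(1, "inference batch C")   # reuses device 1 after completion
```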
[0095] The management circuitry 1461 can further monitor the states of compute nodes 1310, switches 1430, and I/O devices 1330 as well as provide interfacing circuitry for setting switch configurations of the switches 1430. The management circuitry 1461 can also provide start-up, boot, and recovery functionality for the switchboard 1320. For example, the management circuitry 1461 can monitor the boot readiness of the switches 1430 and I/O devices 1330 and signal each of the compute nodes 1310 when their corresponding switch(es) and I/O device(s) are in a “boot ready” status. Each of the compute nodes 1310 can monitor the readiness statuses to ensure that the I/O devices 1330 and switches 1430 are ready before attempting to boot its host system. Further, the management circuitry 1461 can support two communication protocols for communications between the BMC 1460 and the compute nodes 1310 and I/O devices 1330. For example, the management circuitry 1461 can support I2C communications as well as UART communications between the BMC 1460 and the compute nodes 1310 and I/O devices 1330.
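By way of illustration only, the following Python sketch shows a boot-readiness gate of the kind described above, assuming the management circuitry exposes per-switch and per-device readiness flags; the names and polling intervals are illustrative.

```python
"""Sketch of the boot-readiness gating described above, under the assumption
that the management circuitry exposes per-switch and per-I/O-device status
flags; a compute node holds off booting its host until every switch and
I/O device it depends on reports 'boot ready'."""
import time


def boot_ready(statuses, switch_ids, device_ids):
    """True when all required switches and I/O devices report readiness."""
    return (all(statuses["switches"].get(s, False) for s in switch_ids)
            and all(statuses["devices"].get(d, False) for d in device_ids))


def wait_for_ready(read_statuses, switch_ids, device_ids,
                   poll_s=0.5, timeout_s=30.0):
    """Poll the management circuitry until ready or until timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if boot_ready(read_statuses(), switch_ids, device_ids):
            return True
        time.sleep(poll_s)
    return False


if __name__ == "__main__":
    # Stand-in for a real status read over I2C/UART sideband channels.
    fake = {"switches": {1: True}, "devices": {1: True, 2: True}}
    print(wait_for_ready(lambda: fake, switch_ids=[1], device_ids=[1, 2],
                         poll_s=0.1, timeout_s=1.0))
```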
[0096] The switchboard 1320 can further include a service processor 1470 that is communicatively coupled to the switchboard BMC 1460 and to memory 1482. The service processor 1470 can provide platform firmware resilience (PFR). Memory 1482 can include flash
memory devices for storing switchboard BMC firmware and BIOS code for the switchboard 1320. The service processor 1470 can provide secure image verification for the switchboard BMC 1460 during runtime and retrieve firmware and BIOS code for recovery operations following certain runtime errors. An example device that can be used for the service processor 1470 is the AST1060 service processor available from ASPEED of Hsinchu City, Taiwan.
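By way of illustration only, the following greatly simplified Python sketch outlines a verify-then-recover flow of the general kind a PFR device provides. Actual PFR implementations rely on cryptographically signed images and a hardware root of trust; this example only compares a digest against a known-good value and falls back to a recovery copy on mismatch.

```python
"""Greatly simplified sketch of a verify-then-recover flow for platform
firmware resilience (PFR). Real PFR uses signed images and a hardware root
of trust; here a SHA-384 digest is compared against a known-good value and
a recovery copy is returned on mismatch."""
import hashlib


def digest(image: bytes) -> str:
    return hashlib.sha384(image).hexdigest()


def verify_or_recover(active_image: bytes, golden_digest: str,
                      recovery_image: bytes) -> bytes:
    """Return the firmware image that should be allowed to run."""
    if digest(active_image) == golden_digest:
        return active_image                      # active firmware is intact
    print("active image failed verification; restoring recovery copy")
    return recovery_image


if __name__ == "__main__":
    good = b"\x01" * 64          # stand-in for switchboard BMC firmware
    corrupted = b"\x02" * 64     # stand-in for a corrupted runtime image
    golden = digest(good)
    assert verify_or_recover(good, golden, good) == good
    assert verify_or_recover(corrupted, golden, good) == good
```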
[0097] Table 1 lists different firmware that is involved in operation of the management circuitry 1461 for the switchboard 1320. The table also lists where the firmware can be stored (second column), how firmware can be updated (third and fourth columns) and whether an image of the firmware is recoverable and if so, how it can be recovered (last column).
TABLE I: Switchboard Firmware
[0098] The implementation of the switchboard illustrated in FIG. 1.8A depicts a configuration for PCIe standards, but the invention is not limited to only PCIe standards. As mentioned above, the hardware can be selected to comply with other standards such as CXL and/or OAM standards for I/O devices that are compatible with CXL or OAM standards.
5.1 Switchboard I2C Communications
[0099] As mentioned above, management circuitry 1461 of the switchboard 1320 can support at least two communication protocols, such as I2C and UART communication protocols. FIG. 1.8B depicts a plurality of inter-device communications using fourteen I2C communication ports on the BMC 1460. Four of the I2C ports can be used to communicatively couple, through traces in the switchboard 1320, to four compute-node connectors 1465 (MCIO connectors in this example), which in turn can couple to each of the compute nodes 1310 in the server unit 1211. In the example implementation of FIG. 1.8A, there are four RJ45 compute-node connectors 1462 that are also used to communicatively couple the switchboard BMC 1460 to board management controllers at each of the compute nodes 1310. Fewer or more compute-node connectors 1462, 1465 can be used, depending on the number of compute nodes 1310 in the server unit 1211. In some cases, the RJ45 connectors can allow for Ethernet cable or reduced gigabit media-independent interface (RGMII) connections between the switchboard BMC 1460 and each of the four compute nodes 1310 of the server unit 1211. Different sideband management signals can occur simultaneously between the switchboard and any one of the compute nodes 1310 using the two different signaling channels (one via the RJ45 compute-node connectors 1462 and one via the MCIO compute-node connectors 1465).
[0100] The switchboard BMC 1460 can communicatively couple to the service processor 1470 with three I2C ports of the switchboard BMC 1460. The switchboard BMC 1460 can further communicatively couple to the four switches 1430 using at least one I2C port (two shown in the implementation of FIG. 1.8B). The I2C communicative couplings between the BMC 1460 and switches 1430 can be used to program the switches 1430 (e.g., configure switch settings in each of the switches 1430 to assign I/O devices 1330 to compute nodes 1310 as desired) and/or to monitor switch status. The switchboard BMC 1460 and I2C communicative couplings to the switches 1430 can provide out-of-band (OOB) configurability of the switches 1430, for example. Using two or more I2C ports of the switchboard BMC 1460 to communicatively couple to the four switches 1430 can provide enough bandwidth for switch
monitoring and telemetry services for all I/O devices 1330 while also allowing for concurrent firmware updates to the switches 1430 (e.g., via the switchboard BMC 1460).
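By way of illustration only, the following Python sketch uses the smbus2 library to show the general shape of sideband I2C reads and writes to a switch. The bus number, switch addresses, and register offsets are hypothetical placeholders; an actual PCIe switch is configured through its vendor's documented register map or management firmware.

```python
"""Illustrative sideband access sketch using the smbus2 library. The I2C
bus number, switch addresses, and register offsets below are hypothetical
placeholders, not taken from the disclosure or any vendor datasheet."""
from smbus2 import SMBus

I2C_BUS = 3                      # assumed bus number for the switch sideband
SWITCH_ADDRESSES = [0x58, 0x59]  # assumed 7-bit I2C addresses of two switches
STATUS_REGISTER = 0x00           # hypothetical status register offset
CONFIG_REGISTER = 0x10           # hypothetical configuration register offset


def read_switch_status(bus: SMBus, address: int) -> int:
    """Read one status byte from a switch over I2C."""
    return bus.read_byte_data(address, STATUS_REGISTER)


def write_switch_config(bus: SMBus, address: int, value: int) -> None:
    """Write one configuration byte to a switch over I2C."""
    bus.write_byte_data(address, CONFIG_REGISTER, value)


if __name__ == "__main__":
    with SMBus(I2C_BUS) as bus:
        for addr in SWITCH_ADDRESSES:
            status = read_switch_status(bus, addr)
            print(f"switch 0x{addr:02x} status: 0x{status:02x}")
            # Example OOB configuration write (value is illustrative only).
            write_switch_config(bus, addr, 0x01)
```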
[0101] Dedicated I2C ports of the switchboard’s BMC 1460 (four in the example of FIG. 1.8B) can each be used to communicate with a corresponding compute node of the compute nodes 1310, also providing OOB communication between the compute nodes 1310 and the switchboard BMC 1460. The switchboard BMC 1460 can further communicatively couple to the second plurality of connectors 1421 (and to connected I/O devices 1330) to monitor the state of each connected I/O device, for example. For the implementation of FIG. 1.8B, two I2C ports of the BMC 1460 and two 8-way multiplexors 1464 (along with traces in the switchboard’s PCB) are used to communicatively couple the BMC 1460 to the second plurality of connectors 1421 (and to connected I/O devices 1330).
[0102] The I2C communications can be used by the BMC 1460 for a variety of tasks. Such tasks include, but are not limited to, management of the switches 1430, telemetry monitoring of the I/O devices 1330, and monitoring of power devices in the server unit 1211 such as voltage regulators (VRs) and hot-swap controllers (HSCs). The I2C communications can also be used for out-of-band (OOB) communications between the compute nodes 1310 and BMC 1460 and devices communicatively coupled to I2C ports of the BMC 1460.
[0103] In some implementations, the switchboard BMC 1460 can be used to monitor and/or control power delivery from the power-distribution boards 1410, 1436 in the server unit 1211 to the I/O devices 1330. For example, the switchboard BMC 1460 can be communicatively coupled to power-distribution circuitry 1490 through one or more I2C ports. The power-distribution circuitry 1490 can include hot-swap protection devices 1492 that allow I/O devices 1330 to be removed and inserted while the system is powered, among other features. Protection devices can also include over-voltage protection, over-temperature protection, short-circuit protection, and over-current protection. An example hot-swap protection device 1492 that can be used in the power-distribution circuitry 1490 is the MP5990 protection device available from Monolithic Power Systems, Inc. of Kirkland, Washington, though other protection devices can be used to control power delivery in the server unit 1211.
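By way of illustration only, the following Python sketch polls a PMBus-style hot-swap controller over I2C using the smbus2 library. The bus number and device address are placeholders, and the LINEAR11 decoding shown should be checked against the specific device's datasheet before use.

```python
"""Illustrative telemetry poll of a PMBus-style hot-swap controller over
I2C using smbus2. The bus number and device address are assumptions, and
the LINEAR11 decoding should be confirmed against the device datasheet."""
from smbus2 import SMBus

I2C_BUS = 3            # assumed sideband bus
HSC_ADDRESS = 0x40     # assumed 7-bit address of the hot-swap controller
READ_VIN = 0x88        # standard PMBus command codes
READ_IOUT = 0x8C
READ_TEMPERATURE_1 = 0x8D


def linear11_to_float(raw: int) -> float:
    """Decode a PMBus LINEAR11 word (5-bit exponent, 11-bit mantissa)."""
    exponent = raw >> 11
    mantissa = raw & 0x7FF
    if exponent > 0x0F:
        exponent -= 0x20          # sign-extend the exponent
    if mantissa > 0x3FF:
        mantissa -= 0x800         # sign-extend the mantissa
    return mantissa * (2.0 ** exponent)


def poll_hsc(bus: SMBus, address: int) -> dict:
    """Return input voltage, output current, and temperature readings."""
    return {
        "vin_volts": linear11_to_float(bus.read_word_data(address, READ_VIN)),
        "iout_amps": linear11_to_float(bus.read_word_data(address, READ_IOUT)),
        "temp_celsius": linear11_to_float(
            bus.read_word_data(address, READ_TEMPERATURE_1)),
    }


if __name__ == "__main__":
    with SMBus(I2C_BUS) as bus:
        print(poll_hsc(bus, HSC_ADDRESS))
```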
[0104] According to some implementations, the switchboard BMC 1460 can further sense operating conditions of one or more components in the server unit 1211. For example, the switchboard BMC 1460 can be communicatively coupled to one or more temperature sensors 1488 via an I2C port. The temperature sensor(s) 1488 (which may comprise a thermistor or IC temperature sensor) can sense the temperature of such system components as a
heat-dissipative element. The heat-dissipative element can be thermally coupled to one or more CPUs at one of the compute nodes 1310, coupled to one or more switches 1430 on the switchboard 1320, or coupled to one or more microprocessors of one of the I/O devices 1330. If an overtemperature is detected at a monitored component, the switchboard BMC 1460 can initiate and/or execute a shutdown of the component. In some implementations, the switchboard BMC 1460 can divert computational workloads away from a monitored component and migrate data as needed in response to detecting a high temperature or overtemperature condition of the component.
[0105] In some implementations, at least one of the temperature sensors 1488 can sense the temperature of power supply circuitry (such as voltage regulators, transformers, and DC-DC voltage converters), and the switchboard BMC 1460 can take corrective action in response to detecting an over-temperature condition in the supply circuitry (e.g., migrating data and initiating a shutdown of the server unit 1211). In some cases, a temperature sensor 1488 can sense the temperature of the liquid coolant 1164 in the immersion-cooling system. If the temperature of the liquid coolant 1164 is detected by the switchboard BMC 1460 to rise above a threshold level, the switchboard BMC 1460 may restrict or limit workload distribution to the I/O devices 1330 until the temperature of the liquid coolant 1164 falls below a second threshold value, which may be the same as the first threshold value or a different, lower value.
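The thresholding behavior just described can be sketched as a simple hysteresis governor. The threshold values below and the mechanism for actually limiting workload distribution are placeholders; they are not specified by this disclosure.

```python
# Sketch of the coolant-temperature hysteresis described above.
# Threshold values and the workload-limiting hook are illustrative placeholders.
class CoolantWorkloadGovernor:
    def __init__(self, trip_c: float = 55.0, resume_c: float = 50.0):
        assert resume_c <= trip_c      # resume threshold may equal or sit below the trip point
        self.trip_c = trip_c
        self.resume_c = resume_c
        self.limited = False

    def update(self, coolant_temp_c: float) -> bool:
        """Return True while workload distribution to the I/O devices should be limited."""
        if not self.limited and coolant_temp_c > self.trip_c:
            self.limited = True        # coolant too hot: restrict new work to the I/O devices
        elif self.limited and coolant_temp_c < self.resume_c:
            self.limited = False       # coolant recovered: resume normal distribution
        return self.limited
```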
5.2 Switchboard UART Communications
[0106] FIG. 1.8C illustrates an example of how UART communications can be supported by the management circuitry 1461 of the switchboard 1320. Multiplexors can be implemented in circuitry of the CPLD 1475 to provide UART communication channels between the BMC 1460 and devices in the server unit 1211 (such as the compute nodes 1310, service processor 1470, and switches 1430). In the example implementation of FIG. 1.8C, one UART port (labeled UART2) can couple to any compute node 1310-1, 1310-2, 1310-3, 1310-4 through a first multiplexor 1476 that couples to four compute-node connectors 1422. One UART port (labeled UART1) can be used to communicatively couple the BMC 1460 to the service processor 1470.
[0107] Two UART ports and two multiplexors can be used to communicatively couple the BMC 1460 to two ports on each of the four switches 1430. A first of the two ports on the switches 1430 can be for communicating with an advanced RISC machine (ARM) on each switch and controlling the switches (e.g., setting switch configurations). A second of the two
ports (labeled SDB) can be used to debug switch operation (e.g., receive information about switch errors). According to some implementations, UART communications can be used for debugging the server unit 1211 while I2C communications can be used for sideband board management functionalities in the server unit.
5.3 Server Configurations
[0108] With four compute nodes 1310, sixteen I/O devices 1330, four configurable switches 1430, and 32 configurable cable connections to communicatively couple the compute nodes 1310, switches 1430, and I/O devices 1330, there are many configurations in which the server unit 1211 can be operated. There are at least four basic configurations in which none of the I/O devices will be stranded (not accessed by a compute node).
[0109] In a first basic configuration BC1, each one of the four compute nodes 1310 is coupled to one-quarter of the I/O devices 1330 (four in this example). In this configuration, all four of the compute nodes 1310 and all sixteen of the I/O devices 1330 are utilized during operation of the server unit 1211. If one of the compute nodes fails, then four of the I/O devices assigned to the failed compute node will become stranded. However, it is possible to reassign (and unstrand) the stranded I/O devices to functioning compute nodes using the switches 1430 and the management circuitry 1461 so that use of all I/O devices 1330 can be regained.
[0110] In a second basic configuration BC2, all I/O devices 1330 can be assigned to only one of the compute nodes 1310. In such a configuration, the remaining compute nodes which have no assigned I/O device may or may not be utilized. In some cases, the remaining compute nodes can be used for less complex tasks (e.g., information searches, data storage and retrieval, low-complexity calculations, etc.).
[0111] In a third basic configuration BC3, two of the compute nodes 1310 can each be assigned one-half of the I/O devices 1330 (eight in this example). The remaining compute nodes which have no assigned I/O device may or may not be utilized. In some cases, the remaining compute nodes can be used for less complex tasks (e.g., information searches, data storage and retrieval, low-complexity calculations, etc.).
[0112] In a fourth basic configuration BC4, two of the compute nodes 1310 can each be assigned one-quarter of the I/O devices 1330 (four in this example) and a third compute node can be assigned one-half of the I/O devices 1330 (eight in this example). The remaining compute node which has no assigned I/O device may or may not be utilized. In some cases, the remaining
compute node can be used for less complex tasks (e.g., information searches, data storage and retrieval, low-complexity calculations, etc.).
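For illustration only, the four basic configurations can be expressed as assignment maps from compute nodes to I/O devices, and the reassignment of stranded devices after a node failure (as described for BC1) can be expressed as redistributing that node's devices across the surviving nodes. The node and device indices below are illustrative.

```python
# Sketch of the four basic configurations as node -> I/O-device assignments, and of
# reassigning stranded devices when a compute node fails (as described for BC1).
NODES = [0, 1, 2, 3]
IO_DEVICES = list(range(16))

BASIC_CONFIGS = {
    "BC1": {n: IO_DEVICES[4 * n: 4 * n + 4] for n in NODES},                    # 4 devices per node
    "BC2": {0: IO_DEVICES, 1: [], 2: [], 3: []},                                # all 16 on one node
    "BC3": {0: IO_DEVICES[:8], 1: IO_DEVICES[8:], 2: [], 3: []},                # 8 + 8
    "BC4": {0: IO_DEVICES[:4], 1: IO_DEVICES[4:8], 2: IO_DEVICES[8:], 3: []},   # 4 + 4 + 8
}

def reassign_on_failure(assignment: dict, failed_node: int) -> dict:
    """Spread a failed node's otherwise-stranded devices over the surviving nodes."""
    stranded = assignment.pop(failed_node, [])
    survivors = sorted(assignment)
    for i, dev in enumerate(stranded):
        assignment[survivors[i % len(survivors)]].append(dev)
    return assignment

if __name__ == "__main__":
    bc1 = {n: devs[:] for n, devs in BASIC_CONFIGS["BC1"].items()}
    print(reassign_on_failure(bc1, failed_node=2))   # node 2's four devices are redistributed
```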
[0113] The implementation of FIG. 1.8A can allow for isolation of compute nodes 1310 and isolation of I/O devices 1330. According to the cable connections of FIG. 1.7, each compute node can access any I/O device of the 16 I/O devices 1330 in the server unit 1211. However, the cabling can be reconfigured such that one or more of the compute nodes 1310 can access only a portion of the I/O devices 1330 and be isolated (in terms of signal communications) from remaining compute nodes and I/O devices in the server unit 1211. For example, cable connections from the first compute node 1310-1 and third compute node 1310-3 can be made to only the first switch 1430-1 and the second switch 1430-2, instead of also connecting to the third switch 1430-3 and fourth switch 1430-4. Similarly, cable connections from the second compute node 1310-2 and fourth compute node 1310-4 can be made to only the third switch 1430-3 and fourth switch 1430-4, instead of also connecting to the first switch 1430-1 and the second switch 1430-2. In this way, the first compute node 1310-1, third compute node 1310-3, and eight coupled I/O devices (coupled to I/O device connectors 1423-1 through 1423-8) are isolated from the second compute node 1310-2, fourth compute node 1310-4, and the remaining eight coupled I/O devices (coupled to I/O device connectors 1423-9 through 1423-16). As such, the single server unit 1211 can be used for two completely independent workloads where data for one of the workloads is secure from access by a portion of the server used to handle the second workload. In such an implementation, it can be possible for two clients to use the same server unit 1211.
[0114] Further, in such an implementation, bandwidth between a compute node and I/O device(s) can be increased. For example, one of the first cables 1315 that would otherwise be used to connect the first compute node 1310-1 to the third switch 1430-3 can be connected to the first switch 1430-1 instead, thereby increasing the bandwidth of communications between the first compute node 1310-1 and any of the eight coupled I/O devices (coupled to I/O device connectors 1423-1 through 1423-8). Similarly, bandwidth to coupled I/O devices can be increased for the other compute nodes. Other changes to bandwidth are possible. For example, one of the compute nodes 1310-1 could have all four of its first cables 1315 connected to one of the switches 1430-2 to further increase bandwidth to I/O devices coupled to that switch.
[0115] It is also possible to establish peer-to-peer communicative coupling between all of the switches 1430. This could be done by modifying the cabling connections in FIG. 1.7. For example, one of the first cables 1315 could be removed and a second of the first cables 1315
used to couple a communication port of the second switch 1430-2 to a communication port of the third switch 1430-3. Such a modification would allow all connected I/O devices to communicate with each other through one or more of the switches 1430 without going through a compute node, reducing latency of peer-to-peer communications. However, the modification would sacrifice bandwidth of communications between at least one of the compute nodes and the connected I/O devices 1330.
6. Compute Nodes
[0116] As described above, a server unit 1211 can comprise a plurality of compute nodes 1310 (e.g., from 2 to 8 compute nodes). When the server unit 1211 comprises four compute nodes, as in the illustration of FIG. 1.7, each compute node can be identified (for purposes of system management and operation) by a two-bit compute-node identifier (e.g., [00], [01], [10], [11]). A three-bit compute-node identifier can be used when the server unit 1211 comprises more than four compute nodes 1310. An example of a compute node 1310-1 is depicted in FIG. 1.9A. Each of the compute nodes 1310 comprises two CPU packages 1105-1, 1105-2 communicatively coupled to each other and to compute-node management circuitry 1920 included in the compute node 1310-1. Each CPU package can comprise at least one CPU, such as the 4th generation Intel® Xeon® Scalable Processor, Max Series, or the 5th generation Xeon® processor, both available from Intel Corporation® of Santa Clara, California. The compute node 1310-1 can further comprise memory modules 1930, 1935 communicatively coupled to each of the two CPU packages 1105-1, 1105-2 and their processors. One of the CPU packages 1105-1 can be communicatively coupled to a NIC 1940 (such as the Nvidia Bluefield 3 network interface card mentioned above), so that each of the compute nodes 1310 can communicate with each other and/or other processing devices over a network. The NIC 1940 can provide an interface with the BMC’s local management network. The NIC 1940 can be installed in an expansion card slot that is powered by 12-volt standby power (such that the card is powered irrespective of the power state of the compute node 1310-1). As such, the data center management network can access the compute-node management circuitry 1920 and its local management network over a network controller sideband interface (NCSI) port regardless of the power state of the compute node 1310-1.
[0117] A first memory module 1930 can comprise dual in-line memory modules (DIMMs) that support DDR data transfer between the memory module and the coupled CPU processor. An example memory module 1930 is the RDIMM DDR5-5600 available from
Micron Technology, Inc. of Boise, Idaho. The DIMM storage capacity can be 32GB, 48GB, 64GB, 96GB, or 128GB, for example. Each DIMM memory module 1930 can couple to the CPU package 1105-1, 1105-2 on its own data transfer channel. There can be from one to twelve memory modules 1930 coupled to the processor of the CPU package on separate data transfer channels.
[0118] A second memory module 1935 can comprise solid-state drives (SSDs) coupled to the CPU packages 1105 on one or more data transfer channels. For the implementation of FIG. 1.9A, three SSD memory modules 1935 couple to the processor of the first CPU package 1105-1 on a single data transfer channel. The second CPU package 1105-2 couples to one SSD memory module 1935 over a first data transfer channel and couples to two SSD memory modules 1935 over a second data transfer channel. An example SSD memory module 1935 can have an E1.S form factor, such as the DC P4511 SSD module available from Solidigm® of Rancho Cordova, California. Communication between the processor of the CPU package and its SSD memory modules 1935 can be via I2C communications. The SSD memory modules 1935 can be electrically isolated from each other using I2C switches 1937 for connecting each SSD memory module 1935 to the I2C signal path. In the event of an I2C failure of an individual memory module 1935, the module can be de-configured from the I2C path and I2C functionality recovered for the remaining memory module(s) 1935 on the I2C path.
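A minimal sketch of the isolation behavior just described is shown below. How the per-module I2C switches 1937 are actually controlled (addresses, enable registers) is a placeholder assumption; the sketch only captures the de-configuration of a failed module so that the remaining modules keep working.

```python
# Sketch of isolating a failed SSD module from a shared I2C path using per-module
# I2C switches. The switch-control mechanism is a placeholder assumption.
class I2CPathManager:
    def __init__(self, module_ids):
        self.enabled = {m: True for m in module_ids}   # all modules on the path initially

    def set_switch(self, module_id, enable: bool) -> None:
        # Placeholder for the hardware write that opens/closes the I2C switch
        # on this module's branch of the signal path.
        self.enabled[module_id] = enable

    def handle_i2c_failure(self, module_id) -> None:
        """De-configure a misbehaving module so the rest of the bus recovers."""
        self.set_switch(module_id, False)

    def active_modules(self):
        return [m for m, on in self.enabled.items() if on]

mgr = I2CPathManager(["ssd0", "ssd1", "ssd2"])
mgr.handle_i2c_failure("ssd1")        # ssd1 held off the bus; ssd0 and ssd2 keep working
print(mgr.active_modules())           # ['ssd0', 'ssd2']
```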
[0119] The compute-node management circuitry 1920 can include several components that are identical to, or different from, components in the switchboard’s management circuitry 1461. For example, the BMC 1960 and service processor 1970 can be the same devices used for the switchboard’s BMC 1460 and service processor 1470, respectively. In other implementations, a different BMC 1960 can be used for the compute node and/or a different service processor 1970 can be used for the compute node than used for the switchboard 1320. The compute-node management circuitry 1920 can further comprise a platform controller hub 1950 that includes a chipset for controlling data paths and supporting functions (such as providing a system clock) that are used in conjunction with the CPU(s) of the compute node 1310-1.
6.1.1 Compute-Node Management Circuitry and Firmware
[0120] FIG. 1.9B illustrates further details of the compute-node management circuitry 1920 and its communicative couplings to the switchboard 1320 (at one of the switchboard’s four compute-node connectors 1462) and to the NIC 1940. For the server unit of FIG. 1.7, there will be four such communicative couplings of the four compute nodes 1310 to the switchboard’s four compute-node connectors 1462. One of the communicative couplings between the compute-node management circuitry 1920 and the NIC 1940 provides access to the NCSI port 1962 of the compute-node BMC 1960.
[0121] The compute-node management circuitry 1920 is configured to control and monitor the state of the compute node 1310-1. Also included in the circuitry is a CPLD 1975 for power monitoring and management and for reset operations. Resetting of the various components of the compute node 1310-1 can be buffered, sequenced, and fanned out by the CPLD 1975 whenever a reset of the compute node 1310-1 is needed. Security of the compute node 1310-1 is assured in part by the compute-node BMC 1960, which is a root of trust (RoT) processor, such as the AST2600 processor described above. This processor can also secure the CPLD 1975.
[0122] Security of the compute node 1310-1 is further assured in part by the service processor 1970, which is also a RoT processor, such as the AST1060 service processor described above. The service processor 1970 can securely access BMC and BIOS firmware (stored in flash memory 1982) for boot-up and run-time. The service processor 1970 secures the BMC and BIOS flash images and provides run-time protection against attack or accidental corruption using hardware filtering of the SPI buses. The service processor 1970 can also provide firmware recovery in the event of main image corruption. The service processor 1970 supports multiple flash devices 1982 for BMC and BIOS, enabling full backup of images and on-line firmware updates in which the new firmware image is activated on the next boot cycle.
[0123] The flash devices 1982 can be partitioned into three regions with access controlled by the service processor 1970. The first region is a staging region. This staging region comprises a read/write section of data storage that is used to store a staged BMC or BIOS image. Once an image is written to the staging section, it can be evaluated by the service processor 1970 before being copied to an active BMC or BIOS region of the corresponding flash devices 1982. The second region is a recovery region, which comprises a read-only section of data storage that is used for BMC or BIOS firmware recovery. The third region is the active region of the flash devices 1982 and is a read/write section of data storage. The active BMC or BIOS firmware image is stored here.
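The staged-update flow across the three flash regions can be sketched as follows. The data structures and the validation call are placeholders for the service processor's actual authenticity and integrity checks; region sizes and offsets are not modeled here.

```python
# Sketch of the three-region flash layout (staging / recovery / active) and the
# staged-update flow. The signature check is a placeholder for the service
# processor's real validation.
from dataclasses import dataclass

@dataclass
class FlashDevice:
    staging: bytes = b""      # read/write: staged BMC or BIOS image
    recovery: bytes = b""     # read-only: known-good recovery image
    active: bytes = b""       # read/write: image used at boot

def image_is_valid(image: bytes) -> bool:
    # Placeholder for the service processor's authenticity/integrity evaluation.
    return len(image) > 0

def stage_and_activate(flash: FlashDevice, new_image: bytes) -> bool:
    """Write to staging first; only copy to the active region after validation."""
    flash.staging = new_image
    if not image_is_valid(flash.staging):
        return False                          # reject; the active image is untouched
    flash.active = flash.staging              # activated on the next boot cycle
    return True

def recover(flash: FlashDevice) -> None:
    """Fall back to the read-only recovery image."""
    flash.active = flash.recovery
```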
[0124] Each of the compute nodes 1310 supports an external trigger for initiating bare- metal recovery of the BMC firmware and other critical firmware images in the event of a non- recoverable corruption of stored firmware. Bare-metal recovery can be triggered through use of
a “magic packet” command sent on the local area network (sometimes referred to as a “Wake on LAN” magic packet command). This command can be received through the NIC 1940 by a physical layer (PHY) transceiver 1927 and routed to the service processor 1970. This command triggers a WAKE# interrupt to the service processor 1970 on the compute node 1310-1, which initiates a recovery of the BMC firmware image and NIC firmware (if needed).
[0125] An example of such a bare-metal recovery process 11000 is depicted in FIG.
1.10. The process 11000 is designed to recover all firmware images necessary for compute-node BMC 1960 and management network functionality, such that the server unit 1211 can return to production. The process 11000 contains safeguards to prevent denial-of-service attacks. The bare-metal method of recovery is intended for servers that are down (out of production). Abbreviations used in the steps of the process 11000 are as follows: FW - firmware; Mgmt NW - management network; CMD - command.
[0126] If the compute node’s BMC 1960 cannot be placed in an operational status after recovering firmware (step 11035) for the compute node’s BMC 1960, then service will be requested (step 11060) for the server unit 1211. A UID LED service indicator light (described below) can illuminate blue light, identifying the server unit 1211 as needing service. Similarly, if the BMC is or can be made operational but the management network for the compute-node management circuitry 1920 cannot be made operational after recovering firmware (step 11050) for the compute node’s NIC 1940, then service will be requested (step 11060) for the server unit 1211. The process 11000 denies and logs (step 11030) unauthentic Wake# commands. If a Wake# command is asserted (step 11005) to the service processor 1970 (which could occur in a denial-of-service attack) and the BMC 1960 and management network are operational, the authenticity of the command is checked (step 11025). If the Wake# command is not authentic, then it will be denied and logged (step 11030). If the compute node’s BMC 1960 and management network are operational or can be made operational after recovering BMC and/or NIC firmware, then the server unit 1211 can be returned to production (step 11070) to service clients of the data center.
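For illustration only, the decision flow of process 11000 described above can be sketched as follows. The step numbers in the comments refer to FIG. 1.10; the node object's predicate and action methods (bmc_operational, recover_bmc_firmware, and so on) are hypothetical placeholders for the service processor's actual checks and actions.

```python
# Sketch of the bare-metal recovery decision flow of process 11000.
# All predicates and actions on `node` are placeholder hooks.
def handle_wake_command(node) -> str:
    # Step 11005: WAKE# asserted to the service processor.
    if node.bmc_operational() and node.mgmt_network_operational():
        # Step 11025: check authenticity to resist denial-of-service attacks.
        if not node.wake_command_is_authentic():
            node.log_and_deny()               # step 11030
            return "denied"

    if not node.bmc_operational():
        node.recover_bmc_firmware()           # step 11035
        if not node.bmc_operational():
            node.request_service()            # step 11060: UID LED lit blue
            return "service-requested"

    if not node.mgmt_network_operational():
        node.recover_nic_firmware()           # step 11050
        if not node.mgmt_network_operational():
            node.request_service()            # step 11060
            return "service-requested"

    node.return_to_production()               # step 11070
    return "in-production"
```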
[0127] Table 2 lists different firmware that is involved in operation of the compute-node management circuitry 1920 for each of the compute nodes 1310. The table also lists where the firmware can be stored (second column), how firmware can be updated (third and fourth columns), and whether an image of the firmware is recoverable and, if so, how it can be recovered (last column). The NIC 1940 can have its own on-board BMC, an advanced RISC machine, and memory devices, so that it can store and manage recovery of its own firmware. Recovery of firmware on the NIC 1940 can be triggered by the compute node that is coupled to the NIC 1940 (e.g., by the compute node’s BMC 1960 issuing a firmware recovery command over a UART and/or NCSI connection to the NIC 1940; see FIG. 1.9B). There are multiple ways to perform OOB updates to firmware for the NIC 1940, which can include using I2C, UART, 1 Gb Ethernet, and/or NCSI communication channels between the compute node BMC 1960 and NIC 1940.
TABLE 2: Compute-Node Firmware
6.1.2 Compute-Node Management Circuitry Reset
[0128] FIG. 1.11 depicts how reset operations are performed for the compute-node management circuitry 1920. Each of the compute nodes 1310 may or may not include a manual reset button 11105. The compute node’s service processor 1970 is responsible for resetting the
BMC 1960 and associated flash memory devices 1982. The service processor 1970 is also responsible for monitoring a watch-dog timer (WDTRST2#) implemented on the BMC 1960 and resetting and/or recovering firmware for the BMC 1960 in the event that the watch-dog timer is asserted. The service processor 1970 is further responsible for resetting the platform controller hub 1950 (at input RSMRST#) and its associated flash memory devices.
6.1.3 Compute-Node Management Circuitry I2C Communications
[0129] FIG. 1.12 illustrates I2C-supported communication channels between components of the compute-node management circuitry 1920 and other node components. The BMC 1960 can communicatively couple to the NIC 1940, platform controller hub 1950, CPLD 1975, and service processor 1970 with I2C communication channels as depicted in FIG. 1.12. Improved I2C (I3C) communication channels are also supported between the BMC 1960 and compute node’s CPUs CPU0, CPU1. Two I3C switches are used to switch communications to and from the CPUs to either the BMC 1960 or DIMM memory devices 1930. I2C communication channels can run to each of the switches 1430 on the switchboard 1320 through compute-node connectors 1422 and to compute-node memory modules 1935. The BMC 1960 can also access a memory device 11210 that stores field-replaceable unit identification (FRUID) information about the compute node. I2C communication channels can also be used to monitor system sensors (e.g., temperature sensors 11230, voltage sensors, current sensors, etc.). I2C channels can also be used by the BMC 1960 to communicate with the server unit’s power distribution boards.
6.2 Compute Node UART and USB Communications
[0130] Each of the compute nodes 1310 supports UART and USB communications, as shown in FIG. 1.13. These communication channels can be used for system development and debugging of the server unit 1211. Each of the compute nodes 1310 can include an internal universal serial bus (USB) 3.0 and an internal USB connector 11310. The USB connector 11310 can be used for standard USB connections to the compute node and can be used for Intel USB-based image transport protocol (ITP), for example. Each of the compute nodes 1310 can further or alternatively include an internal micro-USB connector 11320 that can be used to monitor debug consoles and operating system UARTs from the compute-node BMC 1960. For example, the micro-USB connector 11320 can access a debug console of the compute-node BMC 1960 via a first UART channel (labeled UART1) and second UART channel (labeled UART5). A USB to UART bridge driver 11340 can be used between the micro-USB connector 11320 and UART
channels on the BMC 1960. An example bridge driver 11340 is the CP2105 driver available from Silicon Labs of Austin, Texas. A UART port (labeled UART2) can be used to support NCSI communications with the NIC 1940. This communication channel can be used to command recovery of firmware images stored on the NIC 1940, for example.
6.3 Non-Maskable Interrupts
[0131] The compute node’s CPUs CPU0, CPU1 support non-maskable interrupts (NMIs) generated from either the platform controller hub 1950 or the compute node’s BMC 1960. FIG. 1.14 illustrates how NMIs are routed to the compute node’s CPLD 1975, which logically OR’s the signals to the CPUs CPU0, CPU1. In the event of an NMI, both CPUs are interrupted. The interrupts can also be cross-routed between the platform controller hub 1950 and the BMC 1960 so that each entity is informed should the other generate an interrupt.
6.4 CPU Errors
[0132] The CPUs CPU0, CPU1 can provide error status signals during runtime. There can be three types of errors: (1) a hardware correctable error (no operating system or firmware action is necessary to recover from the error), (2) a non-fatal error (OS or FW action is required to contain the error and recover), and (3) a fatal error (system reset required to recover). Statuses of the three types of errors can be reported to the platform controller hub 1950 or the compute node’s CPLD 1975. FIG. 1.15 depicts how a fatal error can be handled by each of the compute nodes 1310. When a CPU asserts a fatal or catastrophic error at a fatal error port (labeled CATERR_N), the status signal is level shifted by a level shifter 11510 and reported to the CPLD 1975, which in turn reports (after a programmed delay) the fatal error to the platform controller hub 1950 and compute node’s BMC 1960 so that a system reset can be executed. Active low logic is used in this circuitry, so that the AND gate provides OR functionality.
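A minimal sketch of how the three error classes map to the responses described above is shown below. The handler hooks on the node object are placeholders for the platform controller hub / BMC actions; the delay value is illustrative only.

```python
# Sketch of the three CPU error classes and the responses described above.
from enum import Enum

class CpuError(Enum):
    CORRECTABLE = 1    # hardware corrects it; no OS/firmware action needed
    NON_FATAL = 2      # OS or firmware must contain the error and recover
    FATAL = 3          # system reset required to recover

def handle_cpu_error(err: CpuError, node) -> None:
    if err is CpuError.CORRECTABLE:
        node.log("correctable error observed")          # record only
    elif err is CpuError.NON_FATAL:
        node.notify_os_firmware()                       # OS/FW containment path
    else:
        # CATERR_N path: the CPLD reports to the platform controller hub and BMC,
        # which execute a system reset after a programmed delay.
        node.schedule_system_reset(delay_ms=100)        # delay value is illustrative
```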
6.5 Memory Temperature Monitoring and CPU Throttling
[0133] Each of the compute nodes 1310 can monitor the thermal status of the memory voltage regulators (VRs) 11610, as depicted in FIG. 1.16. The thermal status can be provided on a port (labeled VRHOT# in the drawing) of the memory VR 11610. If the temperature of a memory VR 11610 exceeds a threshold temperature, the condition will be detected by the compute node’s CPLD 1975. The CPLD 1975 can logically OR the thermal status signal from the VR 11610 with a throttle status signal from the platform controller hub 1950 to initiate
throttling of at least one of the CPUs CPU0, CPU1. Active low logic is also used in this circuitry, so that the two AND gates provide OR functionality. The throttling of the CPU(s) will reduce its (their) performance. The signal to throttle the CPU(s) can also alert the compute node’s BMC 1960. Throttling of both CPUs CPU0, CPU1 can also be initiated by a throttling status signal from the platform controller hub 1950 (e.g., as part of a power-capping process described below).
6.6 CPU Temperature Monitoring and Throttling
[0134] Each of the compute nodes 1310 supports throttling of the CPUs CPU0, CPU1 using a so-called “fast processor hot” (fast PROCHOT) process that is based on monitoring of the CPU’s input voltage and power. FIG. 1.17 is a block diagram that depicts how the monitoring and throttling can be implemented. Again, active low logic is used so that the AND gates provide OR functionality and the OR gate provides AND functionality. A plurality of processor-hot triggers 11710 are monitored by the compute-node management circuitry 1920. For example, the BMC 1960 and platform controller hub 1950 can monitor trigger status signals provided by monitored system components. Either the BMC 1960 or platform controller hub 1950 can output a processor-hot event in response to detecting an alert status signal from the monitored processor-hot triggers 11710.
[0135] The types of processor-hot triggers 11710 that can be monitored include the following:
• undervoltage alert - a comparator can monitor the 12 V voltage output of an HSC and assert an alert status signal if this voltage falls below 11.5V.
• overcurrent alert - the HSC can monitor the input current and assert an alert status signal if the input current exceeds a threshold current level.
• 48 V HSC alert - the 48 V HSC can assert an alert status signal (e.g., if it detects an overvoltage condition).
• 12 V HSC alert - the 12 V HSC can assert an alert status signal (e.g., if it detects an overvoltage condition).
• BMC, power-capping alert - the BMC 1960 can assert an alert status signal based on an FM_THROTTLE# signal received as part of a power-capping process.
• CPU temperature alert - either of the CPUs CPU0, CPU1 can assert an alert status signal when the CPU’s internal temperature sensor detects that the CPU exceeds a threshold temperature (e.g., its maximum safe operating temperature).
[0136] The compute-node CPLD 1975 can implement logic to process alert status signals from the processor-hot triggers 11710. For example, processor-hot trigger signals can be filtered by the CPLD 1975 by performing a logical AND with a minimum-duration signal from the CPLD 1975 that is asserted only when the received processor-hot trigger signal has a duration of at least 100 ms or other minimum duration. The CPLD 1975 can also de-assert a processor-hot signal (PROCHOT#) at either or both CPUs CPU0, CPU1 when the trigger event is cleared.
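The minimum-duration filtering and de-assert behavior can be modeled in software as follows. In hardware this is synchronous CPLD logic; the 100 ms value below is the illustrative minimum duration mentioned above.

```python
# Software model of the CPLD's minimum-duration filter for processor-hot triggers:
# a trigger must persist for at least 100 ms (illustrative value) before PROCHOT#
# is asserted, and PROCHOT# is de-asserted when the trigger clears.
import time

class ProchotFilter:
    def __init__(self, min_duration_s: float = 0.100):
        self.min_duration_s = min_duration_s
        self.trigger_start = None
        self.prochot_asserted = False

    def update(self, trigger_active: bool, now=None) -> bool:
        """Return the current PROCHOT# assertion state given the trigger input."""
        now = time.monotonic() if now is None else now
        if trigger_active:
            if self.trigger_start is None:
                self.trigger_start = now
            elif now - self.trigger_start >= self.min_duration_s:
                self.prochot_asserted = True           # throttle the CPU(s)
        else:
            self.trigger_start = None
            self.prochot_asserted = False              # trigger cleared: de-assert
        return self.prochot_asserted
```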
[0137] The circuitry of FIG. 1.17 also enables power capping of each of the compute nodes 1310. Power capping can be triggered by the compute-node BMC 1960 through a power-capping signal (labeled FM_THROTTLE#) that can be asserted to the platform controller hub 1950, as indicated in the drawing. Once asserted, the platform controller hub 1950 can initiate power capping that throttles performance of the CPUs. In some implementations, compute-node power for the power capping feedback loop is read by the platform controller hub 1950 from the 12 V HSC.
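One iteration of such a power-capping feedback loop can be sketched as below. The power reading, the FM_THROTTLE# assertion hooks, and the hysteresis margin are placeholders; the actual loop is implemented by the BMC and platform controller hub as described above.

```python
# Sketch of one step of the power-capping feedback loop: node power is read
# (placeholder for the 12 V HSC telemetry) and FM_THROTTLE# is asserted or
# de-asserted around the cap with a small hysteresis margin.
def power_cap_loop_step(node, cap_watts: float, margin_watts: float = 25.0) -> bool:
    """Return True if CPU throttling is currently requested."""
    power = node.read_hsc_power_watts()          # placeholder: 12 V HSC power reading
    if power > cap_watts:
        node.assert_fm_throttle()                # platform controller hub throttles the CPUs
        return True
    if power < cap_watts - margin_watts:
        node.deassert_fm_throttle()              # comfortably under the cap again
    return node.fm_throttle_asserted()
```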
7. Server LED Indicators
[0138] There can be a plurality of LED indicators 1322 mounted in visible locations on the server unit 1211, as depicted in FIG. 1.4B. The LED indicators 1322 can provide easily viewable indications of server status, switchboard status, and compute node status, for example. In some implementations, there can be one shared LED indicator for the compute nodes, a dedicated LED indicator for each compute node, and an LED indicator for the switchboard 1320 (yielding six LED indicators for the example server unit of FIG. 1.7). According to some implementations, the LED indicators 1322 could be mounted on the switchboard 1320 such that they are visible from the top of the server unit 1211 (e.g., along a top edge of the switchboard 1320 such that the LED indicators 1322 can be viewed through or from the access door opening 1220).
[0139] The LED indicator shared by the compute nodes 1310 can be a unique identifier (UID) LED indicator that emits blue light when on. This UID indicator can signal (when on) a request for service and easily identify a server unit 1211 in need of service. Any one of the compute nodes 1310 can activate the UID LED indicator.
[0140] Each one of the compute nodes 1310 can control its own bicolor LED indicator. For example, the bicolor LED indicator can emit amber or green light. Signaling with a compute
node’s bicolor LED indicator can be as follows, though other and/or additional signaling schemes are possible.
• off = standby power off
• steady amber = standby power on
• blinking amber (5 Hz) = compute node BMC boot complete; power on self test (POST) not completed
• blinking amber (1 Hz) = compute node BMC connected to management network; POST not completed
• blinking green (1 Hz) = compute node system power operating properly
• steady green = compute node POST complete
• blinking green (5 Hz) = compute node BMC firmware update in progress
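The compute-node bicolor LED scheme listed above can be summarized as a simple lookup table; the state names below are illustrative labels only.

```python
# The compute-node bicolor LED signaling scheme above as a lookup table:
# (color, blink rate in Hz); 0 means steady on, None/None means off.
COMPUTE_NODE_LED_STATES = {
    "standby_power_off":            (None, None),
    "standby_power_on":             ("amber", 0),
    "bmc_boot_complete_no_post":    ("amber", 5),
    "bmc_on_mgmt_network_no_post":  ("amber", 1),
    "system_power_ok":              ("green", 1),
    "post_complete":                ("green", 0),
    "bmc_firmware_update":          ("green", 5),
}
```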
[0141] The UID LED indicator and compute-node bicolor LED indicators can be controlled by the compute nodes using I2C communications. For example, an I2C line from the compute node can couple to an I2C expander 1463 on the switchboard 1320 (depicted in FIG. 1.8B), which can provide a controlling signal to the UID LED indicator or compute-node bicolor LED indicator.
[0142] The switchboard 1320 can also include a bicolor LED indicator light controlled by its BMC 1460 and/or CPLD 1475, for example. Signaling with the switchboard’s bicolor LED indicator can be as follows, though other and/or additional signaling schemes are possible.
• off = standby power off
• steady amber = standby power on
• steady green = switchboard BMC booted
• blinking green (5 Hz) = switchboard BMC firmware update in progress
[0143] When service is needed for the server unit 1211, data center controlling software and the server unit’s compute-node management circuitry 1920 coordinate actions for the service request. For example, if any one of the compute nodes 1310 or the data center’s control software indicates that service is needed for the server unit 1211, then the compute-node management circuitry 1920 for each of the compute nodes 1310 migrates all work tasks and related data off of the server unit 1211 to another server unit before indicating a service request with the UID LED indicator.
8. Power Distribution
[0144] Different voltages can be distributed in the server unit 1211. For the example implementation described herein, 48-volt power and 12-volt power are distributed by three power-distribution boards 1410, 1436 located within the server unit 1211 (also see FIG. 1.4A and FIG. 1.4B). A block diagram of the power distribution circuitry is illustrated in FIG. 1.18. The first power distribution board 1410 can receive 48-volt power from one or more busbars 1230 at the base of the immersion-cooling tank 1207 and can provide fused 48 V power to the compute nodes 1310, switchboard 1320, and a plurality of 48V:12V, 1600-watt power-converter bricks 11810 that are distributed on the first power distribution board 1410 and two second power distribution boards 1436. The power-converter bricks 11810 provide 12 V power to the plurality of I/O devices 1330. According to some implementations, hot-swap controllers (HSCs) are used between the 48 V busbar and power-converter bricks 11810 and between the 48 V busbar and voltage regulators used on the compute nodes 1310 and switchboard 1320 so that the compute nodes and switchboard can be hot swapped. HSCs can also be used between the power-converter bricks 11810 and I/O devices 1330 so that the I/O devices can be hot swapped. The HSC supplies can further provide electrical isolation, power control, and power monitoring functionality.
[0145] The server unit 1211 can draw considerable power when in operation. Table 3 lists example amounts of power drawn by various components of the server unit 1211.
TABLE 3
9. Conclusion
[0146] While various inventive embodiments have been described and illustrated herein, those of ordinary skill in the art will readily envision a variety of other means and/or structures for performing the function and/or obtaining the results and/or one or more of the advantages described herein, and each of such variations and/or modifications is deemed to be within the scope of the inventive embodiments described herein. More generally, those skilled in the art will readily appreciate that all parameters, dimensions, materials, and configurations described herein are meant to be exemplary and that the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the inventive teachings is/are used. Those skilled in the art will recognize or be able to ascertain, using no more than routine experimentation, many equivalents to the specific inventive embodiments described herein. It is, therefore, to be understood that the foregoing embodiments are presented by way of example only and that inventive embodiments may be practiced otherwise than as specifically described. Inventive embodiments of the present disclosure are directed to each individual feature, system, article, material, kit, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the inventive scope of the present disclosure.
[0147] Also, various inventive concepts may be embodied as one or more methods, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
[0148] All definitions, as defined and used herein, should be understood to control over dictionary definitions, definitions in documents incorporated by reference, and/or ordinary meanings of the defined terms.
[0149] Unless stated otherwise, the terms “approximately” and “about” are used to mean within ± 20% of a target (e.g., dimension or orientation) in some embodiments, within ± 10% of
a target in some embodiments, within ± 5% of a target in some embodiments, and yet within ± 2% of a target in some embodiments. The terms “approximately” and “about” can include the target. The term “essentially” is used to mean within ± 3% of a target.
[0150] The indefinite articles “a” and “an,” as used herein, unless clearly indicated to the contrary, should be understood to mean “at least one.”
[0151] The phrase “and/or,” as used herein, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B,” when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
[0152] As used herein, “or” should be understood to have the same meaning as “and/or” as defined above. For example, when separating items in a list, “or” or “and/or” shall be interpreted as being inclusive, i.e., the inclusion of at least one, but also including more than one, of a number or list of elements, and, optionally, additional unlisted items. Only terms clearly indicated to the contrary, such as “only one of” or “exactly one of” or “consisting of,” will refer to the inclusion of exactly one element of a number or list of elements. In general, the term “or” as used herein shall only be interpreted as indicating exclusive alternatives (i.e., “one or the other but not both”) when preceded by terms of exclusivity, such as “either,” “one of,” “only one of,” or “exactly one of.” “Consisting essentially of,” shall have its ordinary meaning as used in the field of patent law.
[0153] As used herein, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or
B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
[0154] In the specification above, all transitional phrases such as “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” “holding,” “composed of,” and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases “consisting of” and “consisting essentially of” shall be closed or semi-closed transitional phrases, respectively, as set forth in the United States Patent Office Manual of Patent Examining Procedures, Section 2111.03.
Section 2: Multi-Purpose Sleeve for Immersion-Cooled Servers
[0155] A typical immersion tank for immersion cooling of servers and other electronic components is a bulky, welded mechanical assembly with large dimensional tolerances. For instance, the slots in the immersion tank for servers may be from 8 mm larger to 5 mm smaller than intended due to manufacturing imperfections. When the immersion tank is filled with immersion fluid and pressurized, the immersion tank’s walls may expand and/or contract, causing the slots for the servers to expand and/or contract as well, e.g., by up to about ± 8 mm (± 3 mm).
[0156] To account for the large manufacturing tolerances and to avoid interference with the servers, the immersion tank should be sized larger than the servers. For instance, if a server’s maximum width is 500 mm, then the slot in the immersion tank for that server should have a nominal width of at least 513 mm to account for the potential variation of the manufacturing tolerance of -5 mm, plus an extra -8 mm to account for immersion tank wall movement due to negative pressure condition, resulting in larger immersion tanks and larger slots for servers in the immersion tanks. When the immersion tank is on the larger side of the tolerance, the maximum immersion tank width could be as high as 521 mm (unpressurized) or 529 mm (pressurized). These manufacturing tolerances also result in big variations in server openings from tank to tank.
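The sizing arithmetic above can be worked through as follows for the 500 mm example; the tolerance values are those stated in the preceding paragraphs.

```python
# Worked example of the slot-sizing arithmetic above for a 500 mm wide server.
server_width_mm = 500
undersize_tolerance_mm = 5        # slot may be built up to 5 mm smaller than intended
oversize_tolerance_mm = 8         # slot may be built up to 8 mm larger than intended
pressure_movement_mm = 8          # wall movement when the tank is (de)pressurized

nominal_slot_mm = server_width_mm + undersize_tolerance_mm + pressure_movement_mm
max_slot_unpressurized_mm = nominal_slot_mm + oversize_tolerance_mm
max_slot_pressurized_mm = max_slot_unpressurized_mm + pressure_movement_mm

print(nominal_slot_mm, max_slot_unpressurized_mm, max_slot_pressurized_mm)  # 513 521 529
```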
[0157] The larger sizes and size variations among openings for servers in immersion tanks create several problems. For instance, when the slot or opening in the immersion tank is larger than the server, then aligning the server to the immersion tank can be more difficult as the server can be biased on one side of the tank and may not connect or couple properly to the bus bar inside the immersion tank. When the slot or opening is smaller, on the other hand, the server
chassis can rub against the guide rails inside the immersion tank as the server is being inserted into or removed from the immersion tank. The immersion tank and server chassis are typically made of steel or another metal, so this rubbing can produce metal debris. As the immersion fluid boils, it moves from the bottom of the tank to the top, bringing the metal debris into contact with the servers and/or other electrical components, which may cause electrical shorts. Rubbing between metal surfaces can also produce metal debris even for larger slots and openings. The larger slots for servers also lead to larger immersion tank volumes, which in turn leads to larger amounts of immersion fluid needed to fill the immersion tanks. Immersion fluid is very expensive, so the extra immersion fluid increases cost without necessarily improving cooling performance.
[0158] FIG. 2.1 shows a gasket or sleeve 2100 that addresses problems associated with large manufacturing tolerances for slots or openings in immersion tanks. The sleeve 2100 allows for “less-than-perfect” mating of the server chassis to guide rails and/or alignment features within the immersion tanks. The sleeve 2100 can be attached to the side of an immersion-cooled server chassis before the server chassis is inserted into a slot or opening in the immersion tank. The server chassis may have a height of 700 mm or 925 mm, a length of 797 mm, and a thickness of 86.9 mm, and the sleeve 2100 may have a length of about 700 mm or 925 mm (to match the height of the server chassis), a width of about 85 mm (to roughly match the thickness of the server chassis), and a thickness of about 8 mm (to fill the expected maximum gap between the server rails and the immersion tank). The sleeve 2100 prevents metal-on-metal contact between the server chassis and the guide rails within the immersion tank, reducing the probability of creating metal debris when inserting or removing the server chassis. And the sleeve 2100 occupies space inside the immersion tank that would otherwise be occupied by the immersion fluid, reducing the quantity of immersion fluid needed to fill the immersion tank.
[0159] The sleeve 2100 is made of a strip of electrically insulating, elastically deformable material, such as rubber, silicone, plastic, or polytetrafluoroethylene (PTFE). For instance, the sleeve 2100 can be stamped out of a single sheet of material, molded, or extruded. This material should also be inert; that is, it should not react with the immersion fluid, anything else within the immersion tank, or the immersion tank itself. In addition, the material should not degrade or decompose in the immersion tank or when it is inserted into or removed from the immersion tank because degradation could affect the immersion fluid’s cooling capacity. Degradation could also deleteriously affect the operation of the server and/or other components within the immersion tank.
[0160] The sleeve 2100 includes two or more snap-fit features 2110 or other fasteners for attaching the sleeve 2100 to a server chassis. In FIG. 2.1, for example, the snap-fit features 2110 are at the ends of the sleeve 2100 and secure the sleeve 2100 to a server chassis in a detachable fashion. The snap-fit features 2110 in FIG. 2.1 have mushroom heads that fit through holes in the server chassis and grab the inside wall of the server chassis (e.g., as shown in FIG. 2.2A, described below). Other sleeves may have different types of fasteners, such as snaps, buttons, latches, and so on, that allow compression and decompression of the sleeve. They may also have more fasteners, for example, disposed along the length of the sleeve 2100. Alternatively, or in addition, the sleeve 2100 can be shaped to conform to or stretch or wrap around the server chassis such that the sleeve 2100 stays secured to the server chassis.
[0161] There are also one or more springs or spring features 2120 disposed along the length of the sleeve 2100. These spring features 2120 protrude from the sleeve and can expand or contract to adjust to a wide variety of immersion tank slot sizes resulting from the large manufacturing tolerances for immersion tanks. When the sleeve 2100 is attached to the server chassis and the server chassis is in the immersion tank, the spring features 2120 on the sleeves 2100 push outwards, holding the sleeves 2100 and the server chassis in place with respect to the immersion tank. When the spring features 2120 are fully expanded against the immersion tank, the combined width of the server chassis and the sleeve(s) 2100 should equal the maximum width of the opening or slot in the immersion tank. This means that when the spring features 2120 are in a free or fully expanded state, the sleeve 2100 at its thickest point should be wide enough to fill the largest possible gap between the server chassis and the rails that define the opening or slot in the immersion tank for the server chassis. Because the rails take up some space, the maximum spring thickness may be about 7 or 8 mm, for example, and may compress by about 7 mm.
[0162] The sleeve 2100 also has chamfered or round ends 2130 to aid in alignment and to reduce the possibility of contact between the metal server chassis and the metal guide rails of the immersion tank. Preventing or reducing metal-to-metal contact prevents or reduces the possibility of creating metal debris that could cause electrical shorting of servers and other components in the immersion tank. FIG. 2.1 shows that both ends of the sleeve 2100 are chamfered (cut away to form a sloping edge) or rounded. Alternatively, the sleeve may have a single chamfered end — the end that fits around the corner of the server chassis that is inserted first into the immersion tank.
[0163] FIGS. 2.2A-2.2C show different views of immersion sleeves 2100 installed on both sides of a server chassis 2200 that holds a server 2202. FIG. 2.2A shows a front view of the server chassis 2200, illustrating the server 2202 and one of the guide pins 2210 that extends from the server chassis 2200 for aligning the server chassis 2200 to the immersion tank. FIG. 2.2A also shows the mushroom head of one fastener 2110 sticking through a hole in the server chassis 2200 and one of the sleeve’s chamfered ends 2130. The sleeve’s spring features 2120 extend towards the server chassis 2200 as shown in FIGS. 2.2A and 2.2C. Alternative sleeves may have spring features that extend away from the server chassis 2200 when installed on the server chassis 2200. FIGS. 2.2B and 2.2C show that the sleeves 2100 are as wide as the server chassis 2200. The sleeves 2100 can be shorter than the server chassis 2200 is long, as long as the server chassis 2200, or even longer than the server chassis 2200, with the extra length extending towards the top of immersion tank and possibly forming a handle for removing the server chassis 2200 from the immersion tank.
[0164] FIGS. 2.3A-2.3D illustrate the server chassis 2200 with sleeves 2100 along both sides installed in an immersion tank 2300. FIG. 2.3A shows a vertical cross-section of the immersion tank 2300, and FIGS. 2.3B-2.3D show top-down views of the immersion tank 2300 at different levels of detail. The immersion tank 2300 includes guide rails 2302 and an alignment plate 2310 that align the server chassis 2200 to the immersion tank 2300 and hold the server chassis 2200 in position. The alignment plate 2310 and a bus bar 2320 sit at the bottom of the immersion tank 2300 between condenser coils 2330 that cool vaporized immersion fluid.
[0165] The server chassis 2200 can come in different sizes, e.g., with heights of 700 mm or 925 mm. When fully inserted into the immersion tank 2300, the shorter server chassis 2200 may sit below the condenser coils 2330 and may be fully immersed in the immersion fluid (not shown). The taller server chassis 2200 may not fit completely below the condenser coils 2330 and may be partially immersed in the immersion fluid. The guide pins 2210 extending from the server chassis 2200 fit into guide pin receptacles 2312 in the alignment plate 2310. Electrical pins 2204 electrically coupled to the server 2202 and extending from the server chassis 2200 contact bus bar connections 2314 that are electrically coupled to the bus bar 2320, which provides electrical power to the server 2202.
[0166] When the server chassis 2200 is installed in the immersion tank 2300, the chamfer 2130 on the sleeve 2100 helps to align the server chassis 2200 to the guide rails 2302 in the immersion tank 2300. In cases where the tank opening is smaller than the total width of the server chassis 2200 and sleeves 2100, as the server chassis 2200 slides down into the immersion
tank 2300, the guide rails 2302 push the sleeve 2100 toward the server chassis 2200, compressing the sleeves’ spring features 2120. When the server chassis 2200 is pulled out of the immersion tank 2300, the spring features 2120 expand back to their original state. This expansion and contraction of the sleeve’s spring features 2120 account for the large tank manufacturing tolerances (e.g., +8/-5 mm (static) and ±8 mm (pressurized)).
[0167] FIG. 2.4 shows top, side, and cross-sectional views of an alternative gasket or sleeve 2400 for an immersion-cooled server chassis. This sleeve 2400 is a tube-like structure that is made of compliant, elastically deformable, electrically insulating material, such as rubber, plastic, or PTFE. It has two or more fasteners 2410 for attaching the sleeve 2400 to the server chassis and rounded or chamfered ends 2430 for sliding the server chassis into the immersion tank. The tube-like structure is hollow and can collapse, e.g., along predefined fold lines or pleats, to accommodate narrower openings as shown in the cross-sectional views. The tube-like structure can have a square or rectangular cross section as shown in FIG. 2.4 or a curved or pleated cross section. A vent or hole 2420 in the sleeve 2400 allows fluid (e.g., air or immersion fluid) to enter and exit the sleeve 2400 as the sleeve 2400 expands or is compressed. Placing the vent 2420 at the end of the sleeve 2400 that is not immersed in immersion fluid, as shown in FIG. 2.4, allows air or vapor to enter the sleeve 2400 and reduces the amount of immersion fluid needed to fill the immersion tank. (If the server chassis is short enough to be completely immersed in the immersion fluid, the sleeve 2400 may be made long enough so that one end sticks out of the immersion fluid when the server chassis and sleeve 2400 are installed in the immersion tank.) Alternatively, the sleeve 2400 can have one or more vents or holes along its length. Or instead of being hollow and having vents or holes, the sleeve 2400 can be made of a sponge-like material that can be compressed and springs back into shape when relaxed.
Conclusion
[0168] While various inventive embodiments have been described and illustrated herein, those of ordinary skill in the art will readily envision a variety of other means and/or structures for performing the function and/or obtaining the results and/or one or more of the advantages described herein, and each of such variations and/or modifications is deemed to be within the scope of the inventive embodiments described herein. More generally, those skilled in the art will readily appreciate that all parameters, dimensions, materials, and configurations described herein are meant to be exemplary and that the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the inventive
teachings is/are used. Those skilled in the art will recognize or be able to ascertain, using no more than routine experimentation, many equivalents to the specific inventive embodiments described herein. It is, therefore, to be understood that the foregoing embodiments are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, inventive embodiments may be practiced otherwise than as specifically described and claimed. Inventive embodiments of the present disclosure are directed to each individual feature, system, article, material, kit, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the inventive scope of the present disclosure.
[0169] Also, various inventive concepts may be embodied as one or more methods, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
[0170] All definitions, as defined and used herein, should be understood to control over dictionary definitions, definitions in documents incorporated by reference, and/or ordinary meanings of the defined terms.
[0171] The indefinite articles “a” and “an,” as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean “at least one.”
[0172] The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B,” when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
[0173] As used herein in the specification and in the claims, “or” should be understood to have the same meaning as “and/or” as defined above. For example, when separating items in a
list, “or” or “and/or” shall be interpreted as being inclusive, i.e., the inclusion of at least one, but also including more than one, of a number or list of elements, and, optionally, additional unlisted items. Only terms clearly indicated to the contrary, such as “only one of” or “exactly one of,” or, when used in the claims, “consisting of,” will refer to the inclusion of exactly one element of a number or list of elements. In general, the term “or” as used herein shall only be interpreted as indicating exclusive alternatives (i.e., “one or the other but not both”) when preceded by terms of exclusivity, such as “either,” “one of,” “only one of,” or “exactly one of.” “Consisting essentially of,” when used in the claims, shall have its ordinary meaning as used in the field of patent law.
[0174] As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
[0175] In the claims, as well as in the specification above, all transitional phrases such as “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” “holding,” “composed of,” and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases “consisting of” and “consisting essentially of” shall be closed or semi-closed transitional phrases, respectively, as set forth in the United States Patent Office Manual of Patent Examining Procedures, Section 2111.03.
Claims
1. A server unit comprising: a chassis; a plurality of compute nodes mechanically coupled to the chassis, wherein each compute node of the plurality of compute nodes is configured to process a computational workload independently of other compute nodes of the plurality of compute nodes; a plurality of slots and/or sockets mechanically coupled to the chassis to receive I/O devices; and a switchboard mechanically coupled to the chassis, the switchboard comprising: a first plurality of cabling connectors to receive cables that communicatively couple the plurality of compute nodes to the switchboard; a second plurality of cabling connectors to receive cables that couple the plurality of slots and/or sockets to the switchboard; and a plurality of switches to configure and reconfigure communicative couplings between the first plurality of cabling connectors and the second plurality of cabling connectors to assign and reassign portions of the I/O devices to each compute node of the plurality of compute nodes.
2. The server unit of claim 1, adapted for installation in a tank of an immersion-cooling system.
3. The server unit of claim 2, wherein the switchboard is located in a top third of the server unit, wherein the top third of the server unit is farthest from a base of the tank when the server unit is installed in the tank for operation.
4. The server unit of claim 2, wherein at least a portion of the first plurality of cabling connectors are arranged on the switchboard to not be immersed in coolant liquid in the tank when the server unit is installed in the tank for operation.
5. The server unit of claim 2, wherein the first plurality of cabling connectors are located on a first side of the switchboard and the second plurality of cabling connectors are located on an opposing second side of the switchboard, and wherein the chassis comprises an
opening for cabling connections to be made between the first plurality of cabling connectors and the plurality of compute nodes without cables passing outside a perimeter of the chassis.
6. The server unit of claim 2, wherein the plurality of switches are arranged on the switchboard to be immersed in coolant liquid in the tank when the server unit is installed in the tank for operation.
7. The server unit of claim 1, wherein two switches of the plurality of switches are communicatively coupled to each other such that a signal from a first I/O device arriving at a first connector of the second plurality of cabling connectors can pass through a first switch of the two switches to a second switch of the two switches and to a second connector of the second plurality of cabling connectors without going to a compute node of the plurality of compute nodes.
8. The server unit of claim 1, wherein the switchboard further comprises management circuitry comprising: a switchboard board management controller (BMC) to configure settings of the plurality of switches, wherein the BMC is communicatively coupled to the plurality of compute nodes and to the plurality of switches; and a service processor communicatively coupled to the switchboard BMC.
9. The server unit of claim 8, wherein the management circuitry further comprises: a complex programmable logic device (CPLD) communicatively coupled between the BMC and the plurality of compute nodes, wherein the CPLD is configured to support universal asynchronous receiver/transmitter (UART) communications between the BMC and the plurality of compute nodes.
10. The server unit of claim 9, wherein the CPLD is further configured to support UART communications between the BMC and the plurality of switches.
11. The server unit of claim 8, wherein the BMC is configured to support inter-integrated circuit (I2C) communications with the plurality of switches.
12. The server unit of claim 8, wherein the BMC is configured to monitor at least one temperature sensor and at least one voltage sensor in the server unit.
13. The server unit of claim 1, wherein: the server unit is adapted for installation in a tank of an immersion-cooling system; and the plurality of compute nodes are arranged in a lower two-thirds of the chassis, such that the plurality of compute nodes are immersed in coolant liquid when the server unit is installed in the tank and the tank is operating.
14. The server unit of claim 1, wherein each compute node of the plurality of compute nodes comprises: a printed circuit board (PCB); and at least one central processing unit (CPU) mounted to the PCB, wherein the plurality of compute nodes are arranged in a two-dimensional planar array within the chassis such that a planar surface of each PCB of each compute node of the plurality of compute nodes is oriented approximately or exactly parallel to a plane extending in two directions spanned by the two-dimensional planar array.
15. The server unit of claim 1, wherein each compute node of the plurality of compute nodes comprises: a PCB; and at least two CPUs mounted to the PCB and communicatively coupled to each other.
16. The server unit of claim 1, wherein each compute node of the plurality of compute nodes comprises: a PCB; at least one CPU mounted to the PCB; and management circuitry comprising: a platform controller hub communicatively coupled to the at least one CPU; a board management controller (BMC) communicatively coupled to the platform controller hub; and a service processor communicatively coupled to the platform controller hub and to the BMC.
17. The server unit of claim 16, wherein the management circuitry further comprises a complex programmable logic device communicatively coupled to the platform controller hub and to the BMC.
18. The server unit of claim 17, wherein the BMC supports inter-integrated circuit (I2C) and universal asynchronous receiver/transmitter (UART) communications with the platform controller hub and the complex programmable logic device.
19. The server unit of claim 16, wherein the BMC monitors at least one temperature sensor and at least one voltage sensor in the server unit.
20. The server unit of claim 16, wherein each compute node of the plurality of compute nodes further comprises a network interface card (NIC) communicatively coupled to one or more CPUs of the at least one CPU.
21. The server unit of claim 16, wherein each compute node of the plurality of compute nodes further comprises a plurality of dual in-line memory modules communicatively coupled to each CPU of the at least one CPU.
22. The server unit of claim 1, further comprising a plurality of the I/O devices mounted in the plurality of slots and/or sockets.
23. The server unit of claim 22, wherein the plurality of compute nodes are arrayed across a first side of the server unit and the plurality of I/O devices are arrayed across an opposing second side of the server unit.
24. The server unit of claim 22, wherein a majority of the I/O devices of the plurality of I/O devices are graphics processing units (GPUs).
25. The server unit of claim 22, wherein at least some of the I/O devices of the plurality of I/O devices are accelerator modules.
26. The server unit of claim 22, wherein: the server unit is adapted for installation in a tank of an immersion-cooling system; and the plurality of I/O devices are arranged in a lower two-thirds of the chassis, such that the plurality of I/O devices are immersed in coolant liquid when the server unit is installed in the tank and the tank is operating.
27. The server unit of claim 26, wherein the immersion-cooling system is a two-phase immersion cooling system.
28. The server unit of claim 22, wherein: each I/O device of the plurality of I/O devices comprises a PCB; and at least four I/O devices of the plurality of I/O devices are arranged in a two-dimensional planar array within the chassis such that a planar surface of each PCB of each I/O device of the at least four I/O devices is oriented approximately or exactly parallel to a plane extending in two directions spanned by the two-dimensional planar array.
29. The server unit of claim 22, wherein each I/O device of the plurality of I/O devices comprises a PCB having a first planar surface area that exceeds a second planar surface area of a conventional full-height full-length (FHFL) PCB, the PCB of each I/O device comprising: a pin area comprising pins to plug into a slot or socket of the plurality of slots and/or sockets; and an added board area that extends beside the pin area.
30. The server unit of claim 29, wherein a height of the PCB of each I/O device is between 112 mm and 120 mm.
31. The server unit of claim 1, further comprising power distribution circuitry mechanically coupled to the chassis, the power distribution circuitry configured to: receive power at a first voltage from a busbar; provide a first portion of the power at the first voltage to the switchboard; provide a second portion of the power at the first voltage to the plurality of compute nodes; and
provide a third portion of the power at a second voltage that is lower than the first voltage to power the I/O devices.
32. The server unit of claim 31, wherein the power distribution circuitry comprises: a power distribution PCB located at a base of the server unit and having at least one electrical contact to receive the power at the first voltage from the busbar, wherein the at least one electrical contact is arranged to connect with at least one mating contact on the busbar when the server unit is lowered into and installed in an immersion-cooling tank.
33. The server unit of claim 31, wherein the power distribution circuitry comprises: at least one fuse connected between the busbar and the plurality of compute nodes; and at least one hot-swap controller connected between the busbar and the plurality of compute nodes.
34. The server unit of claim 31, wherein the power distribution circuitry comprises at least one fuse connected between the busbar and the switchboard.
35. The server unit of claim 31, wherein the power distribution circuitry comprises: at least one power converter to convert the power at the first voltage to the third portion of the power at the second voltage; at least one fuse connected between the busbar and the at least one power converter; and at least one hot-swap controller connected between the busbar and the at least one power converter.
36. The server unit of claim 31, wherein the first voltage is 40 volts to 56 volts and the second voltage is approximately or exactly 12 volts.
37. The server unit of claim 1, further comprising: sixteen of the I/O devices installed in the server unit and connected to sixteen of the slots and/or sockets, wherein the plurality of compute nodes comprises eight CPUs and wherein a body of the server unit has a dimension no larger than two rack units (2RU) equal to 89 mm.
38. The server unit of claim 1, further comprising: sixteen of the I/O devices installed in the server unit and connected to sixteen of the slots and/or sockets, wherein the plurality of compute nodes comprises eight CPUs and wherein a volume of the server unit is no larger than 0.07 cubic meter.
39. A gasket for a server chassis, the gasket being formed of an elastically deformable material and having at least one fastener for securing the gasket to a server chassis.
40. The gasket of claim 39, wherein the gasket is configured to prevent the server chassis from rubbing against a metal surface of an immersion tank as the server chassis and the gasket are inserted into the immersion tank.
41. The gasket of claim 39, wherein the gasket is configured to be elastically compressed.
42. The gasket of claim 39, wherein the gasket is made of at least one of plastic, rubber, silicone, or polytetrafluoroethylene.
43. The gasket of claim 39, further comprising at least one compressible spring feature configured to be compressed between the server chassis and an immersion tank.
44. The gasket of claim 39, wherein the gasket comprises a sponge-like material.
45. The gasket of claim 39, wherein the gasket defines a hollow, compressible lumen having a vent or hole to allow fluid flow into and out of the hollow, compressible lumen.
46. The gasket of claim 45, wherein the vent or hole is disposed at an end of the gasket.
47. The gasket of claim 39, wherein the gasket has a chamfered corner.
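The following non-limiting Python sketch illustrates the assign/reassign behavior recited in claim 1 and the device-to-device routing concept of claim 7. The class, connector counts, and method names are hypothetical and are offered only to make the switch-configuration idea concrete; they do not describe an actual switchboard firmware interface.

```python
# Illustrative sketch only; connector counts, names, and methods are hypothetical.
from typing import Dict, Optional


class Switchboard:
    """Models the assign/reassign behavior of the claimed switchboard."""

    def __init__(self, num_node_connectors: int, num_io_connectors: int) -> None:
        # First plurality of cabling connectors (toward the compute nodes).
        self.node_connectors = list(range(num_node_connectors))
        # Second plurality of cabling connectors (toward the I/O slots/sockets).
        self.io_connectors = list(range(num_io_connectors))
        # Current switch configuration: I/O connector -> compute-node connector.
        self.assignment: Dict[int, Optional[int]] = {io: None for io in self.io_connectors}

    def assign(self, io_connector: int, node_connector: int) -> None:
        """Configure the switches so an I/O device is routed to a compute node."""
        if io_connector not in self.assignment or node_connector not in self.node_connectors:
            raise ValueError("unknown connector")
        self.assignment[io_connector] = node_connector

    def reassign(self, io_connector: int, new_node_connector: int) -> None:
        """Reconfigure the same I/O device to a different compute node."""
        self.assign(io_connector, new_node_connector)

    def peer_to_peer_possible(self, io_a: int, io_b: int) -> bool:
        """Claim 7 concept: two I/O connectors can exchange traffic switch-to-switch,
        without the path traversing any compute node."""
        return io_a in self.io_connectors and io_b in self.io_connectors


# Example with assumed counts: sixteen I/O connectors shared by eight node connectors.
sb = Switchboard(num_node_connectors=8, num_io_connectors=16)
sb.assign(io_connector=0, node_connector=0)        # GPU in slot 0 routed to compute node 0
sb.reassign(io_connector=0, new_node_connector=3)  # later handed to compute node 3
print(sb.assignment[0])                 # 3
print(sb.peer_to_peer_possible(0, 1))   # True
```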
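As a non-limiting illustration of the monitoring recited in claims 11 and 12 (a BMC supporting I2C communications and monitoring at least one temperature sensor and at least one voltage sensor), the sketch below polls hypothetical sensors and flags out-of-range readings. The bus numbers, device addresses, register offsets, scale factors, and limits are assumed placeholder values, and the I2C read is a stub rather than a real driver call.

```python
# Illustrative sketch; all sensor parameters below are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class Sensor:
    name: str
    bus: int        # I2C bus the sensor sits on
    address: int    # 7-bit I2C device address
    register: int   # register holding the raw reading
    scale: float    # raw-count to engineering-unit conversion factor
    limit: float    # alert threshold (degrees C or volts)


def i2c_read_word(bus: int, address: int, register: int) -> int:
    """Stub standing in for a real SMBus word read; returns a fixed fake raw count."""
    return 0x0190  # 400 raw counts, chosen arbitrarily for the example


SENSORS = [
    Sensor("switchboard_temp", bus=1, address=0x48, register=0x00, scale=0.125, limit=85.0),
    Sensor("node0_vin",        bus=1, address=0x40, register=0x02, scale=0.025, limit=56.0),
]


def poll_once(sensors):
    """One monitoring pass: read each sensor over I2C and flag out-of-range values."""
    alerts = []
    for s in sensors:
        raw = i2c_read_word(s.bus, s.address, s.register)
        value = raw * s.scale
        if value > s.limit:
            alerts.append((s.name, value))
    return alerts


print(poll_once(SENSORS))  # [] with the stubbed reading above
```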
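The arithmetic below is a non-limiting sketch of the power split recited in claims 31 and 36: a busbar feed in the 40 volt to 56 volt range supplies the switchboard and compute nodes directly, while a converter provides a lower rail of approximately 12 volts for the I/O devices. All wattages and the converter efficiency are assumed example figures, not values from the disclosure.

```python
# Illustrative arithmetic only; load wattages, efficiency, and rail naming are assumptions.
BUSBAR_VOLTAGE = 48.0        # volts, within the 40-56 V range of claim 36
IO_RAIL_VOLTAGE = 12.0       # volts, the second (lower) voltage of claim 36
CONVERTER_EFFICIENCY = 0.95  # assumed efficiency of the 48 V to 12 V conversion

loads_w = {
    "switchboard": 150.0,     # first portion, delivered at the busbar voltage
    "compute_nodes": 2400.0,  # second portion, delivered at the busbar voltage
    "io_devices": 4800.0,     # third portion, delivered on the 12 V rail
}


def busbar_current(loads):
    """Total current drawn from the busbar, including conversion loss on the I/O rail."""
    direct_w = loads["switchboard"] + loads["compute_nodes"]
    converted_input_w = loads["io_devices"] / CONVERTER_EFFICIENCY
    return (direct_w + converted_input_w) / BUSBAR_VOLTAGE


def io_rail_current(loads):
    """Current on the lower-voltage side feeding the I/O devices."""
    return loads["io_devices"] / IO_RAIL_VOLTAGE


print(f"busbar draw: {busbar_current(loads_w):.1f} A at {BUSBAR_VOLTAGE:.0f} V")
print(f"I/O rail:    {io_rail_current(loads_w):.1f} A at {IO_RAIL_VOLTAGE:.0f} V")
```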
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202363516996P | 2023-08-01 | 2023-08-01 | |
US63/516,996 | 2023-08-01 | ||
US202363578255P | 2023-08-23 | 2023-08-23 | |
US63/578,255 | 2023-08-23 | ||
US202363598825P | 2023-11-14 | 2023-11-14 | |
US63/598,825 | 2023-11-14 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2025029948A2 (en) | 2025-02-06 |
WO2025029948A3 (en) | 2025-03-20 |
Family ID: 94396016
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2024/040429 (WO2025029948A2, pending) | 2023-08-01 | 2024-07-31 | Multi-node server unit |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2025029948A2 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119917442A (en) * | 2025-03-31 | 2025-05-02 | 苏州元脑智能科技有限公司 | Server cabinet and server system |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4653965B2 (en) * | 2004-04-08 | 2011-03-16 | 株式会社日立製作所 | How to manage I/O interface modules |
JP5272442B2 (en) * | 2008-02-20 | 2013-08-28 | 日本電気株式会社 | Blade server and switch blade |
WO2015183314A1 (en) * | 2014-05-30 | 2015-12-03 | Hewlett-Packard Development Company, L.P. | Supporting input/output (i/o) connectivity for a printed circuit assembly (pca) in a hot aisle cabling or a cold aisle cabling arrangement |
US10010008B2 (en) * | 2016-06-28 | 2018-06-26 | Dell Products, L.P. | Sled mounted processing nodes for an information handling system |
US20230025369A1 (en) * | 2022-09-29 | 2023-01-26 | Prabhakar Subrahmanyam | Methods and apparatus for an autonomous stage-switching multi-stage cooling device |
- 2024-07-31: WO application PCT/US2024/040429 (WO2025029948A2), status: Pending
Also Published As
Publication number | Publication date |
---|---|
WO2025029948A3 (en) | 2025-03-20 |
Similar Documents
Publication | Title |
---|---|
US9811127B2 | Twin server blades for high-density clustered computer system |
US9585281B2 | System and method for flexible storage and networking provisioning in large scalable processor installations |
US20110292595A1 | Liquid Cooling System for Stackable Modules in Energy-Efficient Computing Systems |
US8358503B2 | Stackable module for energy-efficient computing systems |
US8441788B2 | Server |
KR20140064788A | Method and system for building a low power computer system |
US10956353B1 | Connector configurations for printed circuit boards |
WO2025029948A2 | Multi-node server unit |
CN113220085A | Server |
CN101126949A | Chassis partition architecture for multi-processor system |
US20240013815A1 | High-density data storage systems and methods |
TW202115528A | Computer power supply assembly and manufacturing method thereof |
CN206209481U | A kind of server |
CN102541806A | Computer |
CN100541387C | A Server System Based on Opteron Processor |
CN103135701B | Blade module |
FIU20244070U1 | Chassis structure |
CN217034655U | Double-circuit server |
US20250021139A1 | Multi-level baseboard management control structure |
TWI873967B | A power distribution board kit |
CN213750943U | Dual-system I7 calculation module 2U ruggedized computer |
US20230074432A1 | Card level granularity operation based module design |
WO2025081071A1 | System architecture for ai training server for immersed environment |
US20230054055A1 | Mounting adaptor assemblies to support memory devices in server systems |
US20230380099A1 | Systems with at least one multi-finger planar circuit board for interconnecting multiple chassis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24850065; Country of ref document: EP; Kind code of ref document: A2 |