CN120567807A - Communication control method, computer program product, and basic input/output system - Google Patents
- Publication number
- CN120567807A (application CN202511048145.3A)
- Authority
- CN
- China
- Prior art keywords
- target
- maximum payload
- link
- maximum
- target device
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The application discloses a communication control method, a computer program product, and a basic input/output system, and relates to the technical field of servers. It addresses the resource waste and excessive redundancy that arise in the related art because target devices connected to different virtual switches must be deployed identically. The method comprises: when the motherboard of a server is reset, acquiring a first maximum payload of each target link; grouping the target links to obtain at least one link group, wherein the virtual switches of the target links in the same link group are connected to different ports of the same physical switch; for each link group, determining the first maximum payload with the smallest value from the first maximum payloads of the link group; and updating the second maximum payload of each target device according to the smallest first maximum payload corresponding to each link group, thereby enabling correct communication among target devices across different target links.
Description
Technical Field
The present application relates to the field of server technologies, and in particular, to a communication control method, a computer program product, and a basic input/output system.
Background
In a server, the maximum payload (Maximum Payload Size, MPS) of each target link based on the PCIe (Peripheral Component Interconnect Express) protocol is generally determined by the target devices on that link. The maximum payload specifies the maximum size of the payload in a data packet, and the maximum payloads of different target links are independent of one another. During east-west data transmission, if the maximum payloads of two target links differ, their packet encapsulation formats differ, which ultimately causes communication abnormalities.
In the related art, to ensure that all target links under the same physical switch have the same maximum payload, the target devices connected to different virtual switches are required to be deployed identically (for example, the network cards and hard disks on virtual machine A and virtual machine B must be exactly the same). This wastes resources and introduces excessive redundancy, which raises cost.
Disclosure of Invention
The application provides a communication control method, a computer program product and a basic input/output system, which at least solve the problems of resource waste and excessive redundancy caused by the fact that all target devices connected with different virtual switches are required to be identical in related technology.
The application provides a communication control method, which is applied to a basic input/output system, and comprises the following steps:
When the motherboard of a server is reset, acquiring a first maximum payload of each target link, wherein a plurality of target devices in each target link communicate via a target protocol, and the target devices include a virtual switch;
Grouping the target links to obtain at least one link group, wherein the virtual switches of the target links in the same link group are connected to different ports of the same physical switch;
For each link group, determining the first maximum payload with the smallest value from the first maximum payloads of the target links contained in the link group;
And updating the second maximum payload of each target device in the link group according to the smallest first maximum payload corresponding to that link group, so that the target links in the link group communicate with one another based on the smallest first maximum payload.
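The four steps above can be sketched end to end as follows. This is a minimal illustration rather than the BIOS implementation: the function name `unify_group_mps`, the dictionary layout, the device names, and the payload values are all hypothetical and chosen only to make the flow concrete.

```python
from collections import defaultdict

def unify_group_mps(link_first_mps, link_to_physical_switch, link_devices):
    """Return the updated second maximum payload (bytes) for every device.

    link_first_mps: link id -> first maximum payload acquired after reset
    link_to_physical_switch: link id -> physical switch its virtual switch is on
    link_devices: link id -> names of the target devices on that link
    """
    # Step 2: links whose virtual switches share a physical switch form a group.
    groups = defaultdict(list)
    for link, switch in link_to_physical_switch.items():
        groups[switch].append(link)

    updated = {}
    for links in groups.values():
        # Step 3: the smallest first maximum payload in the group.
        group_min = min(link_first_mps[link] for link in links)
        # Step 4: every device in the group is set to that value.
        for link in links:
            for device in link_devices[link]:
                updated[device] = group_min
    return updated

# Hypothetical inputs (step 1 is assumed to have produced link_first_mps).
link_first_mps = {"A1": 256, "A2": 512, "B1": 512, "B2": 512}
link_to_physical_switch = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
link_devices = {
    "A1": ["vswitch_A1", "nic_1", "gpu_1"],
    "A2": ["vswitch_A2", "gpu_2"],
    "B1": ["vswitch_B1", "nic_3", "gpu_3"],
    "B2": ["vswitch_B2", "nic_4", "gpu_4"],
}
updated = unify_group_mps(link_first_mps, link_to_physical_switch, link_devices)
```

In this sketch the devices behind physical switch A all end up at 256 bytes, while those behind physical switch B stay at 512 bytes, since groups are handled independently.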
The present application also provides a computer program product for use in a basic input output system, comprising:
An acquisition module, configured to acquire a first maximum payload of each target link when the motherboard of a server is reset, wherein a plurality of target devices in each target link communicate via a target protocol, and the target devices include a virtual switch;
A first processing module, configured to group the target links to obtain at least one link group, wherein the virtual switches of the target links in the same link group are connected to different ports of the same physical switch;
A second processing module, configured to determine, for each link group, the first maximum payload with the smallest value from the first maximum payloads of the target links contained in the link group;
And a third processing module, configured to update the second maximum payload of each target device in the link group according to the smallest first maximum payload corresponding to that link group, so that the target links in the link group communicate based on the smallest first maximum payload.
The application also provides a basic input output system comprising the computer program product provided above.
The application also provides an electronic device comprising a memory and a processor, wherein the memory is used for storing a computer program, and the processor, when executing the computer program, implements the steps of any one of the communication control methods described above.
The present application also provides a non-volatile computer-readable storage medium in which a computer program is stored, wherein the computer program when executed by a processor implements the steps of any one of the communication control methods described above.
The application also provides a further computer program product comprising a computer program which when executed by a processor implements the steps of any of the communication control methods described above.
According to the present application, when the motherboard of a server is reset, a first maximum payload of each target link is acquired, wherein a plurality of target devices in each target link communicate via a target protocol and the target devices include a virtual switch. The target links are grouped to obtain at least one link group, wherein the virtual switches of the target links in the same link group are connected to different ports of the same physical switch. For each link group, the first maximum payload with the smallest value is determined from the first maximum payloads of the target links contained in the link group, and the second maximum payload of each target device in the link group is updated according to that smallest value, so that the target links in the link group communicate based on the smallest first maximum payload. This solves the problem in the related art that target devices connected to different virtual switches must be deployed identically, and thereby reduces resource waste, redundancy, and cost.
In addition, the whole communication control process is automatically carried out, manual modification of technicians is not needed, and the operation is flexible and convenient.
In addition, the smallest first maximum payload is the minimum among the first maximum payloads of the target links in the link group, yet it is also the largest payload that satisfies the communication requirements of every target device in the link group. This reduces the number of transaction layer packets in subsequent communication and improves utilization of the target link bandwidth.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings that are required for the embodiments will be briefly described below, and it will be apparent that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a hardware connection relationship in a server system according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of a communication control method according to an embodiment of the present application;
Fig. 3 is a schematic link relationship diagram of each target device of a server according to an embodiment of the present application;
Fig. 4 is a second schematic flow chart of a communication control method according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a computer program product according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. Based on the embodiments of the present application, all other embodiments obtained by a person of ordinary skill in the art without making any inventive effort are within the scope of the present application.
It should be noted that in the description of the present application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "first," "second," and the like in this specification are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
The present application will be described in further detail below with reference to the drawings and detailed description for the purpose of enabling those skilled in the art to better understand the aspects of the present application.
The specific application environment architecture or specific hardware architecture upon which the execution of the communication control method depends is described herein.
Referring to Fig. 1, an embodiment of the present application provides a schematic diagram of the hardware connection relationship in a server system. The server may specifically be an artificial intelligence (AI) server and includes two central processing units (CPUs), namely central processing unit 0 and central processing unit 1. Each central processing unit has a slot (SLOT) for connecting a network card: the slot of central processing unit 0 is slot 0, and the slot of central processing unit 1 is slot 9.
The central processing unit 0 is connected to the virtual switch A1, the virtual switch A2, the virtual switch B1 and the virtual switch B2 through interfaces P1-1, P1-2, P1-3 and P1-4, respectively.
The central processing unit 1 is connected to the virtual switch C1, the virtual switch C2, the virtual switch D1 and the virtual switch D2 through interfaces P2-1, P2-2, P2-3 and P2-4, respectively.
Virtual switch A1 and virtual switch A2 are connected to different interfaces of the same physical switch A (not shown in the figure). Virtual switch B1 and virtual switch B2 are connected to different interfaces of the same physical switch B (not shown in the figure). Virtual switch C1 and virtual switch C2 are connected to different interfaces of the same physical switch C (not shown in the figure). Virtual switch D1 and virtual switch D2 are connected to different interfaces of the same physical switch D (not shown in the figure).
Each virtual switch has a corresponding slot, a Non-Volatile Memory Express (NVMe) high-speed interface, and a graphics processing unit (GPU). The NVMe high-speed interface is hereinafter referred to as the high-speed interface. The high-speed interface is used to connect a solid state disk (SSD); the slot is used to deploy a standard Ethernet network card or an InfiniBand (IB) card, which serves to scale out graphics processing unit clusters across servers.
The interfaces to be used by virtual switch A1 include slot 1 and high-speed interface 1, and it is connected to graphics processing unit 1 through interface 1;
The interfaces to be used by virtual switch A2 include slot 2 and high-speed interface 2, and it is connected to graphics processing unit 2 through interface 2;
The interfaces to be used by virtual switch B1 include slot 3 and high-speed interface 3, and it is connected to graphics processing unit 3 through interface 3;
The interfaces to be used by virtual switch B2 include slot 4 and high-speed interface 4, and it is connected to graphics processing unit 4 through interface 4;
The interfaces to be used by virtual switch C1 include slot 5 and high-speed interface 5, and it is connected to graphics processing unit 5 through interface 5;
The interfaces to be used by virtual switch C2 include slot 6 and high-speed interface 6, and it is connected to graphics processing unit 6 through interface 6;
The interfaces to be used by virtual switch D1 include slot 7 and high-speed interface 7, and it is connected to graphics processing unit 7 through interface 7;
The interfaces to be used by virtual switch D2 include slot 8 and high-speed interface 8, and it is connected to graphics processing unit 8 through interface 8.
In server infrastructure, north-south data and east-west data are terms from the data communication field that describe different data flow directions and communication relationships:
North-south data generally refers to the interactive traffic between users and the system of the AI server (hereinafter, the AI system), that is, data transferred between the AI system and external users, external systems, or other components outside the AI system. For example, when a user sends an inference request to an application of the AI system, or the AI system returns the processing result to the user, the data flow is referred to as north-south traffic. In some architectures, the term also covers communication between the AI system and an external network or platform, which exposes the AI system's services for external access and lets the AI system access services in the external environment.
East-west data mainly refers to communication traffic between components inside the AI system, such as data transmission and interaction inside a data center, between graphics processing units and Ethernet or InfiniBand networks, between servers, or between different microservices. When a model such as a multimodal large language model (LLM) processes text, audio, images, and other data, the data synchronization and transmission between its processing modules is east-west communication. AI training and inference involve a large amount of data-parallel communication and collective communication, such as broadcast, all-to-all, and all-reduce operations, which are also part of east-west communication.
In Fig. 1, traffic along central processing unit 0 -> virtual switch A1 -> graphics processing unit 1 is north-south data, as is traffic in the reverse direction (not shown in the figure). Traffic along graphics processing unit 3 -> virtual switch B1 -> virtual switch B2 -> graphics processing unit 4 is east-west data, as is traffic in the reverse direction (not shown in the figure).
In server infrastructure, data in both the north-south and east-west directions is essentially transmitted over the PCIe bus, and the transaction layer packet (Transaction Layer Packet, TLP) is the basic unit of data transmission in the PCIe protocol stack.
In the PCIe protocol, the maximum payload of a target link is jointly negotiated by the devices on that link, and the maximum payload in the target register of a target device specifies the largest payload that a TLP sent by that device may carry.
The maximum payload negotiation on a target link proceeds as follows:
(1) Initialization and capability advertisement: after the server motherboard is powered on and reset, the target devices are initialized. Each target device sets a target register in its configuration space to advertise the maximum payload it supports. For example, the Device Capabilities register indicates the MPS sizes the device can support, such as 128 bytes, 256 bytes, or 512 bytes.
(2) Link training phase: the target devices in the target link perform link initialization and configuration by transmitting specific training sequences. In this process, the target devices exchange their capability information, including maximum payloads, and, based on the maximum payloads advertised by each other, select a value supported by both sides. That value is the maximum payload of the target link to which they belong.
(3) Determination of the maximum payload of the target link: the target devices in the target link settle on the commonly supported maximum payload, which is used for subsequent data transmission. During data transmission, the payload size of a transaction layer packet will not exceed the negotiated maximum payload of the target link.
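The outcome of steps (1) to (3) can be sketched as taking the smallest advertised capability on the link. The sketch below assumes the 3-bit Max_Payload_Size Supported encoding defined for the Device Capabilities register in the PCI Express Base Specification (code 0 means 128 bytes, and each increment doubles the size up to 4096 bytes); the function names and the example device codes are hypothetical.

```python
def decode_mps(code: int) -> int:
    """Convert a 3-bit Max_Payload_Size encoding to bytes (0 -> 128 ... 5 -> 4096)."""
    if not 0 <= code <= 5:
        raise ValueError("reserved Max_Payload_Size encoding")
    return 128 << code

def negotiate_link_mps(advertised_codes):
    """Return the largest payload (bytes) that every device on the link supports."""
    return min(decode_mps(code) for code in advertised_codes)

# Hypothetical link: three devices advertising 512, 256, and 4096 bytes.
link_mps = negotiate_link_mps([2, 1, 5])
```

Here the link settles on 256 bytes, the value supported by all three devices.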
The maximum payload negotiation mechanism is backward compatible. If a new target device supports a larger maximum payload and the older device connected to it supports a smaller one, the new device adjusts itself and communicates using the smaller maximum payload supported by the older device. For example, if a new graphics card supports a maximum payload of 4096 bytes while the PCIe slot on the motherboard supports only 2048 bytes, the graphics card adjusts its own maximum payload to 2048 bytes so that it can communicate with the motherboard.
The maximum payload mainly serves the following purposes:
Improving transmission efficiency: with a larger maximum payload, data is divided into fewer transaction layer packets during transmission, which improves utilization of the target link bandwidth. For example, in a large data transfer, a small maximum payload splits the data into many transaction layer packets and increases transmission time and overhead, whereas a larger maximum payload lets the same data travel in fewer transaction layer packets and improves efficiency.
Optimizing system performance: in a system such as a server, multiple target devices may share physical links. Setting the maximum payload appropriately prevents a single device from occupying excessive link resources by using oversized transaction layer packets and degrading the performance of other devices; it balances the communication bandwidth of the devices in the system and improves overall performance.
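The efficiency point can be made concrete with a small calculation of how many transaction layer packets a transfer needs at different maximum payloads. The 1 MiB transfer size and the function name `tlp_count` are illustrative assumptions; per-packet header overhead is deliberately left out.

```python
def tlp_count(transfer_bytes: int, mps_bytes: int) -> int:
    """Number of TLPs needed when each carries at most mps_bytes of payload."""
    return -(-transfer_bytes // mps_bytes)  # ceiling division

transfer = 1024 * 1024  # hypothetical 1 MiB transfer
counts = {mps: tlp_count(transfer, mps) for mps in (128, 256, 512)}
# counts == {128: 8192, 256: 4096, 512: 2048}
```

Each doubling of the maximum payload halves the packet count, and with it the number of packet headers that accompany the same data.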
Negotiation of the maximum payload is a sub-process of PCIe training. Continuing with Fig. 1, the two central processing units are connected to a total of eight virtual switches, so PCIe training is split into eight target links (also called PCIe branches). Because the types and numbers of devices mounted downstream of each virtual switch differ, the maximum payloads finally negotiated on the eight target links are independent and are likely to differ.
The maximum payload of each target link is independent, and each target link runs in the north-south direction, so inconsistent maximum payloads do not arise in north-south data transmission. In east-west data transmission, however, that is, data transmission between target links, if the maximum payloads differ, the transaction layer packets of the two links are encapsulated differently. This causes communication abnormalities and generates a PCIe Malformed TLP error. A Malformed TLP error is an uncorrectable error type among PCIe errors and leads to serious faults such as server downtime or restart.
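The failure mode described above can be illustrated with a short sketch. It relies on the PCIe receiver rule that a TLP whose data payload exceeds the receiver's configured maximum payload is treated as a Malformed TLP; the two link values (512 and 256 bytes) and the function name are hypothetical.

```python
def is_malformed_at_receiver(payload_bytes: int, receiver_mps: int) -> bool:
    """Return True if a TLP payload is larger than the receiver's maximum payload."""
    return payload_bytes > receiver_mps

# Hypothetical east-west transfer: the sending link negotiated 512 bytes,
# the receiving link negotiated 256 bytes.
sender_link_mps = 512
receiver_link_mps = 256

# A full-size TLP from the sender violates the receiver's limit.
full_size_rejected = is_malformed_at_receiver(sender_link_mps, receiver_link_mps)

# If the sender is limited to the smaller of the two values, the TLP is accepted.
common_mps = min(sender_link_mps, receiver_link_mps)
limited_rejected = is_malformed_at_receiver(common_mps, receiver_link_mps)
```

The second check shows why bringing both links to the smaller value removes the error condition.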
In the related art, to ensure that all target links have the same maximum payload, the target devices connected to different virtual switches are required to be deployed uniformly and identically. This wastes resources and introduces excessive redundancy, raising cost and imposing considerable limitations.
In practical applications, however, the number and types of target devices required under different virtual switches may differ because of service requirements. Some other related techniques therefore rely on manual modification: following the backward-compatibility principle, a technician manually checks the maximum payload negotiated on each target link and then manually lowers the maximum payload of the central processing unit root port corresponding to the target link with the larger maximum payload. Manual modification is inflexible and requires a professional technician every time, so it cannot meet the needs of mass production.
In order to solve the above-mentioned problems, embodiments of the present application provide a communication control method, and the method is described in detail with reference to an execution flow of the communication control method.
In the communication control method provided by the embodiment of the present application, the execution body may be a target firmware in the server, and the target firmware may specifically be a Basic Input/Output System (BIOS).
BIOS is firmware in a computer system. It resides in a chip on the computer's motherboard and is responsible for hardware initialization, the power-on self-test (POST), and booting the operating system when the computer starts. It is the first step of computer startup and the bridge between the operating system and the hardware.
In addition, the target firmware may be another program capable of executing the communication control method, such as Unified Extensible Firmware Interface (UEFI) firmware, which is not limited here.
Referring to Fig. 2, an embodiment of the present application provides a first flow diagram of a communication control method, including steps 210 to 240:
Step 210: when the motherboard of the server is reset, acquire a first maximum payload of each target link, wherein a plurality of target devices in each target link communicate via a target protocol, and the target devices include a virtual switch.
The server of the embodiment of the application is a computer system for supporting and running AI model training and reasoning tasks.
The AI model may be a large language model, such as ChatGPT or DeepSeek.
Unlike conventional servers, AI servers are usually equipped with high-performance computing resources, such as powerful central processing units, graphics processing units, and other acceleration hardware, to handle large amounts of data and complex computing tasks.
Motherboard reset (also referred to as hardware reset or hardware restart) of a server refers to reinitializing and restoring the server motherboard to its original state by hardware means, and is generally used to solve problems occurring in a system or to allow the motherboard to reload a configuration.
After the server is powered on and reset, the BIOS loads the firmware; once the target firmware has been loaded into memory, the BIOS code runs in memory.
The target links refer to PCIe links, each target link includes a central processing unit, a virtual switch, and target devices connected by the virtual switch, where the central processing unit and the virtual switch also belong to devices supporting a target protocol.
The maximum payload of the target link of the embodiments of the present application is referred to as the first maximum payload.
The target device refers to various hardware components connected to the server motherboard through a PCIe bus, such as a central processor, a graphics processing unit, a network interface card, a solid state disk, an accelerator card, and a memory controller card.
After the motherboard of the server is reset, PCIe initialization begins and the target devices advertise their maximum payloads. The maximum payload of a target device is called the second maximum payload. Each target device sets the relevant target register in its configuration space, and during initialization the target devices exchange specific training sequences (ordered-set sequences) to advertise the second maximum payload held in their target registers.
In practical applications, the second maximum payload that a device can support, such as 128 bytes, 256 bytes, or 512 bytes, is indicated, for example, by the Device Capabilities register serving as the target register.
The target link refers to the connection path between a target device and the motherboard (for example, a central processing unit or another device), established over the PCIe bus through one or more PCIe slots. A target link carries data from one device to another over physical and electrical interfaces and is typically used for high-bandwidth, high-speed data transfer.
Virtual switches (also known as PCIe virtual switches) are one of the core components in a virtualized environment, providing network connectivity and data forwarding between virtual machines and a physical network, following the PCIe protocol.
In virtualized platforms (e.g., VMware, hyper-V, KVM), virtual switches are typically connected to physical switches through physical ports. These physical ports become the "bridge" between the virtual switch and the physical network.
In the embodiment of the application, each target device connected to the same virtual switch belongs to the same target link.
The central processing unit includes a plurality of target ports. Each target port is connected to one virtual switch and corresponds to one target link. During initialization, each target device on a target link sends the second maximum payload in its target register to the other target devices on that link. When a second maximum payload received from another target device is smaller than the value in a device's own target register, the device updates its target register to the received value.
Once a target device has received the second maximum payloads sent by all other target devices on its target link, negotiation is considered complete, and the second maximum payload then stored in its target register is the first maximum payload of that target link.
The central processing unit creates a mapping between the port identifier of each port and the first maximum payload of the target link corresponding to that port, and the BIOS can obtain the first maximum payload of each target link by looking up this mapping.
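The exchange and the port lookup described above can be simulated in a few lines. This is only a sketch under stated assumptions: the device names, the advertised values, the port identifier used as a dictionary key, and the helper names `negotiate_link` and `first_mps_for_port` are hypothetical, and real hardware performs the exchange during PCIe initialization rather than in software.

```python
def negotiate_link(advertised):
    """Simulate the exchange on one target link.

    advertised: dict mapping a device name to the second maximum payload
    (bytes) initially held in its target register.
    Returns the register values after every device has heard from all others.
    """
    registers = dict(advertised)
    for sender, value in advertised.items():
        for receiver in registers:
            # A receiver lowers its register when it hears a smaller value.
            if receiver != sender and value < registers[receiver]:
                registers[receiver] = value
    return registers

# Hypothetical devices on the link behind port P1-1.
link_p1_1 = {"root_port": 512, "virtual_switch": 512, "network_card": 256, "gpu": 512}
registers = negotiate_link(link_p1_1)

# After negotiation all registers agree; that value is the link's first maximum payload.
port_to_first_mps = {"P1-1": registers["root_port"]}

def first_mps_for_port(port_id):
    """Look up the first maximum payload of the link behind a port."""
    return port_to_first_mps[port_id]
```

Every register converges to 256 bytes here because the network card advertised the smallest value, and the lookup for port P1-1 returns that same figure.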
Step 220: group the target links to obtain at least one link group, wherein the virtual switches of the target links in the same link group are connected to different ports of the same physical switch.
It will be appreciated that communication between target devices typically follows the "proximity principle", since communication across physical switches incurs higher cost. To reduce data transfer cost and improve data transfer efficiency, when the virtual switch to which a target device is connected has no other target device that can supply the resources it needs, the target device may obtain those resources from target devices connected to a "nearby" virtual switch. The nearby virtual switch and the target device's own virtual switch are typically connected to different ports of the same physical switch, which keeps communication cost low.
If the first maximum payloads of the two target links containing two virtual switches connected to the same physical switch differ, then during east-west data interaction the packets of the two target links are encapsulated differently, which ultimately causes PCIe communication abnormalities and generates a PCIe Malformed TLP error. A Malformed TLP error is an uncorrectable error type among PCIe errors and can cause serious faults such as server downtime or restart.
To avoid the limitations of the related art and save resources, the embodiment of the present application groups the target links: the target links containing virtual switches connected to the same physical switch belong to one link group, and a common first maximum payload is determined for each link group so that every target device in the link group communicates based on that common first maximum payload.
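The grouping rule can be sketched with the topology of Fig. 1, where each target link is identified by its virtual switch. The mapping from virtual switch to physical switch follows the description of Fig. 1; how the firmware discovers that mapping is not shown, and the names `VSWITCH_TO_PHYSICAL` and `group_links` are hypothetical.

```python
from collections import defaultdict

# Virtual switch (one per target link) -> physical switch, as described for Fig. 1.
VSWITCH_TO_PHYSICAL = {
    "A1": "A", "A2": "A",
    "B1": "B", "B2": "B",
    "C1": "C", "C2": "C",
    "D1": "D", "D2": "D",
}

def group_links(vswitch_to_physical):
    """Group target links whose virtual switches sit on the same physical switch."""
    groups = defaultdict(list)
    for vswitch, physical in vswitch_to_physical.items():
        groups[physical].append(vswitch)
    return dict(groups)

link_groups = group_links(VSWITCH_TO_PHYSICAL)
```

For this topology the eight target links fall into four link groups of two links each.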
Referring to fig. 3, an embodiment of the present application provides a link relationship diagram of each target device of a server, which includes a central processing unit 0, the central processing unit 0 is connected to a network card 0 through a slot 0 (not shown in the figure), and the central processing unit 0 is connected to a virtual switch A1, a virtual switch A2, a virtual switch B1, and a virtual switch B2 through interfaces P1-1, P1-2, P1-3, and P1-4, respectively.
The virtual switch A1 and the virtual switch A2 are connected to the same physical switch a (not shown in the figure). The number and types of target devices to which each virtual switch is connected may be the same or different.
The virtual switch A1 is connected to the network card 1 through a slot 1 (not shown), and is connected to the graphics processing unit 1 through an interface 1 (not shown), and has a high-speed interface 1.
The virtual switch A2 is connected to the graphics processing unit 2 via an interface 2 (not shown).
The virtual switch B1 and the virtual switch B2 belong to the same physical switch B (not shown in the figure).
The virtual switch B1 is connected to the network card 3 via a slot 3 (not shown), and is connected to the graphics processing unit 3 via an interface 3 (not shown), and has a high-speed interface 3.
The virtual switch B2 is connected to the network card 4 via a slot 4 (not shown) and to the graphics processing unit 4 via an interface.
Each virtual switch belongs to a unique target link, for example:
The target link where the virtual switch A1 is located is: central processing unit 0 -> virtual switch A1 -> network card 1, graphics processing unit 1, and high-speed interface 1;
The target link where the virtual switch A2 is located is: central processing unit 0 -> virtual switch A2 -> graphics processing unit 2;
The target link where the virtual switch B1 is located is: central processing unit 0 -> virtual switch B1 -> network card 3, graphics processing unit 3, and high-speed interface 3;
The target link where the virtual switch B2 is located is: central processing unit 0 -> virtual switch B2 -> network card 4 and graphics processing unit 4.
The first maximum payload of each target link is determined through joint negotiation among the target devices on that target link, and is not greater than the second maximum payload of any of those target devices.
Table 1 below lists the maximum payload size (MPS) capability of each target device on the target links of the virtual switch A1 and the virtual switch A2, together with the negotiated first maximum payload. Each target link is identified by its virtual switch.
TABLE 1
According to the negotiation mechanism for the maximum payload, the negotiated value must be downward compatible. Accordingly, the negotiated first maximum payload of the target link of the virtual switch A1 is 256B, and that of the target link of the virtual switch A2 is 512B.
Step 230, for each link group, determining the first maximum payload with the smallest value from among the first maximum payloads of the target links included in the link group.
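The downward-compatible negotiation described above amounts to taking the minimum of the second maximum payloads advertised on one target link. The following is a minimal Python sketch of that rule. Because the contents of Table 1 are not reproduced here, the per-device values are hypothetical and were chosen only so that the results match the 256B and 512B stated in the text; the function name `negotiate_link_mps` is likewise an illustrative assumption.

```python
def negotiate_link_mps(device_mps):
    """Return the first maximum payload of one target link.

    device_mps maps a device name to its advertised second maximum
    payload in bytes. The negotiated value is the smallest one, so no
    device is asked to handle more than it supports.
    """
    if not device_mps:
        raise ValueError("a target link needs at least one device")
    return min(device_mps.values())


# Hypothetical capabilities (not taken from Table 1).
link_a1 = {"cpu0": 512, "vswitch_a1": 512, "nic1": 256, "gpu1": 512}
link_a2 = {"cpu0": 512, "vswitch_a2": 512, "gpu2": 512}

print(negotiate_link_mps(link_a1))  # 256
print(negotiate_link_mps(link_a2))  # 512
```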
Continuing with the embodiment of fig. 3, peer-to-peer (P2P) communication, also called east-west communication, may be performed between the target devices connected to the virtual switch A1 and the target devices connected to the virtual switch A2.
Such an application scenario is very common. As the core computing component, the graphics processing units in a server are typically fixed in number, whereas the number and model of the network cards and hard disks are configured according to the size and parameter count of the model the user trains. For example, the communication between the graphics processing unit 2 and the network card 1 marked in fig. 3 needs to span the virtual switch A1 and the virtual switch A2. Typically, the ratio of network cards to graphics processing units is not 1:1. There may therefore be a case where graphics processing units : network cards : hard disks = 2:1:1, that is, two graphics processing units share one network card for communication, in which case data communication across virtual switches is required.
In the embodiment of fig. 3, the negotiated first maximum payload of the target link where the virtual switch A1 is located is 256B, and that of the target link where the virtual switch A2 is located is 512B. Clearly, when the graphics processing unit 2 and the network card 1 communicate directly across the virtual switches, the differing first maximum payloads cause a PCIe malformed TLP error, which in turn causes serious faults such as server downtime or restart.
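The failure mode can be illustrated with a small check. Under PCIe, a receiver treats a TLP whose data payload exceeds its own configured maximum payload size as malformed. The Python sketch below models only that one rule, using the 256B and 512B values from the fig. 3 discussion; the function and constant names are hypothetical.

```python
def is_malformed_for_receiver(payload_bytes, receiver_mps):
    """True if a TLP payload exceeds the receiver's configured maximum payload."""
    return payload_bytes > receiver_mps


# Values from the fig. 3 discussion.
MPS_LINK_A1 = 256  # network card 1 sits on this target link
MPS_LINK_A2 = 512  # graphics processing unit 2 sits on this target link

# Graphics processing unit 2 fills a packet up to its own limit and
# sends it to network card 1, which accepts at most 256 bytes.
print(is_malformed_for_receiver(MPS_LINK_A2, MPS_LINK_A1))  # True

# Once both target links are aligned to 256 bytes, the transfer is accepted.
print(is_malformed_for_receiver(256, MPS_LINK_A1))  # False
```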
In order to make the first maximum payloads of the target links in the same link group identical, for each link group, the maximum payload with the smallest value in the first maximum payloads of the target links in the link group may be used as the maximum payload corresponding to the link group, so that the target links of different virtual switches connected to the same physical switch may communicate based on the same first maximum payload, that is, based on the first maximum payload with the smallest value, regardless of whether the target devices connected to the virtual switches in the target links are identical.
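Grouping the target links and selecting the smallest first maximum payload per link group can be sketched as follows. This is a minimal Python illustration, not the implementation of the embodiment: the values for A1 and A2 come from the fig. 3 discussion, the values for B1 and B2 are hypothetical, and all function names are assumptions.

```python
from collections import defaultdict


def group_links_by_physical_switch(link_to_switch):
    """Map each physical switch to the target links whose virtual switch it hosts."""
    groups = defaultdict(list)
    for link, switch in link_to_switch.items():
        groups[switch].append(link)
    return dict(groups)


def group_minimum_mps(groups, link_mps):
    """For each link group, pick the smallest first maximum payload of its links."""
    return {
        switch: min(link_mps[link] for link in links)
        for switch, links in groups.items()
    }


link_to_switch = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
# A1 and A2 follow the text; B1 and B2 are hypothetical values.
link_mps = {"A1": 256, "A2": 512, "B1": 1024, "B2": 512}

groups = group_links_by_physical_switch(link_to_switch)
print(groups)                               # {'A': ['A1', 'A2'], 'B': ['B1', 'B2']}
print(group_minimum_mps(groups, link_mps))  # {'A': 256, 'B': 512}
```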
Step 240, updating the second maximum payload of each target device in the link group according to the first maximum payload with the smallest value corresponding to each link group, so that the target links in the link group communicate with one another based on the first maximum payload with the smallest value.
After determining the first maximum payload with the smallest value for each link group, the embodiment of the present application updates the second maximum payload of each target device in the link group accordingly. That is, when the second maximum payload in the target register of a target device differs from the first maximum payload with the smallest value, the second maximum payload in the target register is updated to the first maximum payload with the smallest value, so that the target links in the link group communicate with one another based on that value.
Continuing with the embodiment of fig. 3, the negotiated first maximum payload of the target link where the virtual switch A1 is located is 256B, and that of the target link where the virtual switch A2 is located is 512B. Because the first maximum payloads of the two target links differ, the first maximum payload 256B of the target link where the virtual switch A1 is located is assigned to the target link where the virtual switch A2 is located, thereby aligning the maximum payloads downward.
In the communication control method provided by the embodiment of the present application, when the motherboard of the server is reset, the first maximum payload of each target link is acquired, where the plurality of target devices on each target link communicate through a target protocol and include a virtual switch. The target links are grouped to obtain at least one link group, where the virtual switches of the target links in the same link group are connected to different ports of the same physical switch. For each link group, the first maximum payload with the smallest value is determined from among the first maximum payloads of the target links included in the link group, and the second maximum payload of each target device in the link group is updated according to that value, so that the target links in the link group communicate with one another based on it. This solves the problem in the related art that the target devices connected to different virtual switches must be deployed identically, which wastes resources and creates redundancy. Regardless of whether the target devices connected to the virtual switches of the target links are the same, the target links of different virtual switches connected to the same physical switch can communicate based on the same first maximum payload, which improves deployment flexibility and reduces device redundancy and cost in the server.
In addition, the whole communication control process is automatically carried out, manual modification of technicians is not needed, and the operation is flexible and convenient.
In addition, the first maximum payload with the minimum value is the minimum value in the maximum payloads of all target links, but is also the maximum payload meeting the communication requirements of all target devices in the link group, so that the number of transaction layer data packets in subsequent communication can be reduced, and the utilization rate of the target link bandwidth can be improved.
In some embodiments, updating the second maximum payload of each target device in the link group according to the first maximum payload with the smallest value corresponding to each link group comprises:
For any one target device, acquiring the second maximum payload of the target device from a target register of the target device;
Updating the second maximum payload in the target register to the first maximum payload with the smallest value when it is determined that the two are not the same;
And skipping the update of the target register of the target device when it is determined that the second maximum payload in the target register and the first maximum payload with the smallest value are the same.
The foregoing embodiments have described that each target device records its own second maximum payload in its target register and that, to enable normal communication within a target link, the second maximum payload of each target device on the target link is not greater than the second maximum payloads advertised by the other target devices from their target registers.
To enable normal communication between the target links as well, the second maximum payload of each target device in the link group is updated according to the first maximum payload with the smallest value of the link group. That is, when the second maximum payload in the target register differs from the first maximum payload with the smallest value, the second maximum payload in the target register is updated to that value; when the two are the same, the update of the target register of the target device is skipped.
Therefore, whether all target devices connected with the virtual switch in all target links are the same or not, the target links where different virtual switches connected with the same physical switch are located can be communicated based on the same first maximum payload with the minimum value, the flexibility of deployment is improved, and the redundancy of devices in a server is reduced.
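The update-or-skip rule can be sketched in Python as follows. The text does not name the target register, so this sketch assumes it behaves like the PCIe Device Control register, in which the Max_Payload_Size field occupies bits 7:5 and encodes 128 bytes as 000b, 256 bytes as 001b, and so on up to 4096 bytes as 101b. The dictionary of register values stands in for real configuration-space access, and all function names are hypothetical.

```python
MPS_SHIFT = 5                  # assumed field position: bits 7:5
MPS_MASK = 0b111 << MPS_SHIFT


def decode_mps(reg_value):
    """Return the payload size in bytes encoded in the MPS field."""
    return 128 << ((reg_value & MPS_MASK) >> MPS_SHIFT)


def encode_mps(reg_value, mps_bytes):
    """Return reg_value with its MPS field replaced; other bits are preserved."""
    code = (mps_bytes // 128).bit_length() - 1
    if code < 0 or code > 5 or mps_bytes != 128 << code:
        raise ValueError(f"unsupported payload size: {mps_bytes}")
    return (reg_value & ~MPS_MASK) | (code << MPS_SHIFT)


def align_group(registers, group_min):
    """Rewrite every register whose MPS differs from group_min.

    registers maps device name -> register value and is modified in place.
    Returns the names of the devices that were updated.
    """
    updated = []
    for name, value in registers.items():
        if decode_mps(value) != group_min:
            registers[name] = encode_mps(value, group_min)
            updated.append(name)
        # Otherwise the register already matches and the update is skipped.
    return updated


registers = {
    "nic1": encode_mps(0x000F, 256),
    "gpu1": encode_mps(0x000F, 256),
    "gpu2": encode_mps(0x000F, 512),
}
print(align_group(registers, 256))      # ['gpu2']
print(decode_mps(registers["gpu2"]))    # 256
```

Returning the list of updated devices makes it easy for a caller to decide whether a reset is needed, which is the decision the following embodiments describe.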
In some embodiments, after updating the second maximum payload in the target register to the first maximum payload with the smallest value, the method further comprises:
Triggering a soft restart instruction through a command line tool of the server so as to reset the motherboard of the server.
In the embodiment of the present application, the updated second maximum payload in the target register does not take effect immediately after the update. The BIOS needs to issue a soft restart (warm reset) instruction to reset the motherboard of the server, so that the steps of the communication control method, namely steps 210 to 240, are executed again.
After the second maximum payload in the target register of the target device is updated to the first maximum payload with the smallest value, the motherboard of the server is reset. This ensures that all configurations and updates of the server take effect, clears potential errors, releases resources, and restores the stability and performance of the system.
In some embodiments, after determining that the second maximum payload in the target register and the first maximum payload with the smallest value are the same, the method further comprises:
Loading a preset boot program so that the server enters the operating system when it is determined that the second maximum payload of each target device is the same as the first maximum payload with the smallest value of its link group.
In the embodiment of the present application, when the second maximum payload of each target device is determined to be the same as the first maximum payload with the smallest value of its link group, the second maximum payloads of the target devices are unified. The self-check can then be regarded as complete, and the preset boot program is loaded so that the server enters the operating system.
In summary, the communication control method of the embodiment of the present application is performed before the operating system of the server starts. The server is controlled to enter the operating system when the second maximum payload of each target device is the same as the first maximum payload with the smallest value of its link group, which avoids system start failure or performance degradation caused by hardware incompatibility or configuration errors and improves the stability, reliability, and performance of the server system.
In some embodiments, after updating the second maximum payload of each target device in the link group according to the first maximum payload with the smallest value corresponding to each link group, the method further comprises:
under the condition that the access of the new first target equipment is detected, determining a first target link where the first target equipment is located;
acquiring a new first maximum payload which is redetermined by each target device in a first target link after the first target device is accessed;
and updating the second maximum payload of each target device in the link group to which the first target link belongs according to the new first maximum payload when the new first maximum payload is smaller than the first maximum payload before the first target device is accessed.
In practical applications, there are some hot plug events that are unavoidable in the server, such as accessing a new target device to the server, or disconnecting (or unplugging) an existing target device from the server.
It will be appreciated that when an existing target device is disconnected from the server, the disconnected target device affects neither the maximum payload of the link group to which it belonged nor the stability of communication, and therefore does not change the first maximum payload with the smallest value of the link group.
However, when a new first target device is inserted into the server, the second maximum payload of the new first target device may be smaller than the first maximum payload of the first target link to which it belongs. In that case, a communication failure may occur when the new target device subsequently communicates with the target devices of other virtual switches in the link group.
Since the original first maximum payload of the first target link is the minimum of the second maximum payloads of the target devices on that link, the first maximum payload renegotiated by the target devices on the first target link after the first target device is accessed is either equal to or smaller than the original first maximum payload.
When the second maximum payload of the first target device is equal to or greater than the first maximum payload of the first target link, the new first maximum payload of the first target link is the same as the original one. Only the second maximum payload of the first target device is updated to the original first maximum payload, and the second maximum payloads of the other target devices remain unchanged.
When the second maximum payload of the first target device is smaller than the first maximum payload of the first target link, the first maximum payload of the link group to which the first target link belongs and the second maximum payloads of the other target devices in that link group are affected. The second maximum payloads of the other target devices must be updated to the second maximum payload of the first target device. In this way, even when the target devices of the server are expanded, PCIe communication abnormalities caused by inconsistent first maximum payloads of the target links are avoided, as are the malformed TLP errors that device expansion would otherwise cause, which further improves the reliability of communication.
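The two hot-add cases described above can be sketched as follows. This minimal Python illustration assumes the link group has already been aligned, so every device in it holds the same second maximum payload; the nested dictionary layout and the name `handle_hot_add` are hypothetical, and the device values are illustrative only.

```python
def handle_hot_add(group, link, new_name, new_mps):
    """Insert a new device into one target link of an aligned link group.

    group maps target link -> {device name: second maximum payload in bytes}
    and is modified in place. Returns the first maximum payload of the
    target link after the insertion.
    """
    old_mps = min(group[link].values())
    new_link_mps = min(old_mps, new_mps)
    if new_link_mps < old_mps:
        # The new device supports less: lower every device in the link group.
        for devices in group.values():
            for name in devices:
                devices[name] = new_link_mps
    # In either case the new device ends up at the link's value.
    group[link][new_name] = new_link_mps
    return new_link_mps


group = {"A1": {"nic1": 256, "gpu1": 256}, "A2": {"gpu2": 256}}

# Case 1: the new device supports at least the current value.
print(handle_hot_add(group, "A2", "nic2", 512))  # 256

# Case 2: the new device supports less, so the whole group is lowered.
print(handle_hot_add(group, "A1", "ssd1", 128))  # 128
print(group["A2"]["gpu2"])                       # 128
```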
In some embodiments, each target device is configured to perform segmentation processing and encapsulation processing on east-west data to be sent based on a second maximum payload in its own target register to obtain at least one transaction layer data packet;
the payload of each transaction layer data packet is not greater than the second maximum payload in the target register.
The foregoing embodiments have explained that the payload size of a transaction layer packet (TLP) does not exceed the first maximum payload with the smallest value determined by negotiation. Each target device can therefore segment the east-west data to be sent based on the second maximum payload in its own target register, encapsulate each segment through the PCIe protocol to obtain the transaction layer data packets, and send the at least one transaction layer data packet to other target devices through the PCIe connection. In this way, the number of TLPs in subsequent communication can be reduced and the utilization rate of the target link bandwidth can be improved.
In some embodiments, each target device comprises an accelerator and a network card, each virtual switch is connected with at least one accelerator and one network card, and the accelerators connected with different virtual switches perform remote direct memory access through the network cards connected with the corresponding virtual switches.
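The segmentation step can be sketched in Python as follows. The sketch only splits a byte string into payload-sized pieces; TLP headers and the PCIe encapsulation itself are not modeled, and the function name `segment_payload` is an assumption.

```python
def segment_payload(data, mps):
    """Split data into chunks no larger than mps bytes, one per TLP payload."""
    if mps <= 0:
        raise ValueError("maximum payload must be positive")
    return [data[i:i + mps] for i in range(0, len(data), mps)]


east_west_data = bytes(1000)

chunks_256 = segment_payload(east_west_data, 256)
print([len(c) for c in chunks_256])  # [256, 256, 256, 232]

chunks_512 = segment_payload(east_west_data, 512)
print([len(c) for c in chunks_512])  # [512, 488]
```

The example shows why a larger common value is preferable when every device supports it: the same 1000 bytes need four packets at 256 bytes but only two at 512 bytes.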
The target device in the embodiment of the present application includes an accelerator, which includes but is not limited to at least one of a graphics processing unit, a TPU, an FPGA, and an NPU, and the description will be given by taking the graphics processing unit as an example.
The server of the embodiment of the present application supports remote direct memory access (RDMA) technology, which enables a network card to directly access the memory of the peer system.
Based on the remote direct memory access technology, the network card can communicate directly with a remote graphics processing unit. The central processing unit participates in the control path, preparing the transmission queue and handling control before and after transmission. East-west data can be sent directly from the video memory of a graphics processing unit to the RDMA network card, and the RDMA-capable network card at the peer end, after receiving the east-west data, transfers it directly to the video memory of the peer graphics processing unit.
According to the embodiment of the application, based on the RDMA technology, one accelerator can directly read and write data from the video memory of the other accelerator, so that the participation of a central processing unit is reduced, the actions of carrying/copying the data are reduced, the inter-machine communication delay is greatly reduced, and the data throughput is improved.
In some embodiments, each target device sends the second maximum payload in its own target register to other target devices on the target link to which it belongs at the time of initialization, and updates the second maximum payload in its own target register to the second maximum payload of other target devices if it is determined that the acquired second maximum payload of other target devices is smaller than its own second maximum payload.
The foregoing embodiments have been described, and will not be described in detail herein.
It is noted that updating the second maximum payload in the target register of a target device to the second maximum payload of another target device during initialization is the process by which the target devices themselves progressively determine the first maximum payload of the target link. This ensures that the target devices on the target link communicate based on the same second maximum payload, meets the load requirements of all target devices, reduces the number of transaction layer data packets in subsequent communication, and improves the utilization rate of the target link bandwidth.
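The announce-and-lower behaviour during initialization can be simulated with a short Python sketch. The round-based loop, the dictionary standing in for the target registers, and the name `announce_until_stable` are all assumptions made for illustration, and the device values are hypothetical.

```python
def announce_until_stable(registers):
    """Let every device announce its value until no register changes.

    registers maps device name -> second maximum payload in bytes and is
    modified in place. A receiver lowers its own value whenever it hears
    a smaller one. Returns the value shared by all devices at the end.
    """
    changed = True
    while changed:
        changed = False
        # Snapshot the announcements made at the start of this round.
        for sender, sender_mps in list(registers.items()):
            for receiver in registers:
                if receiver != sender and sender_mps < registers[receiver]:
                    registers[receiver] = sender_mps
                    changed = True
    return next(iter(registers.values()))


link = {"cpu0": 512, "vswitch_a1": 512, "nic1": 256, "gpu1": 512}
print(announce_until_stable(link))  # 256
print(link)  # {'cpu0': 256, 'vswitch_a1': 256, 'nic1': 256, 'gpu1': 256}
```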
In some embodiments, the target link comprises a central processing unit, a virtual switch, and a plurality of target devices connected by the virtual switch, wherein the plurality of target devices comprise at least part of a graphics processing unit, a network card and a hard disk.
The foregoing embodiments have been described, and will not be described in detail herein.
Referring to fig. 4, an embodiment of the present application provides a second flow chart of a communication control method, which includes the following steps:
Step 401, powering on and resetting the server;
Step 402, initializing the target devices and announcing the second maximum payloads;
Step 403, the target devices on a target link determine the first maximum payload of the target link according to their own second maximum payloads and those of the other target devices;
Step 404, the basic input/output system sequentially acquires the first maximum payload of each target link;
Step 405, the basic input/output system groups the target links to obtain at least one link group, wherein the virtual switches of the target links in the same link group are connected to different ports of the same physical switch;
Step 406, for each link group, the basic input/output system determines the first maximum payload with the smallest value from among the first maximum payloads of the target links included in the link group;
Step 407, when the second maximum payload in the target register of any target device differs from the first maximum payload with the smallest value of the link group to which the target device belongs, the basic input/output system updates the second maximum payload in the target register to the first maximum payload with the smallest value;
Step 408, when the second maximum payload in the target register of every target device is the same as the first maximum payload with the smallest value of the link group to which the target device belongs, the basic input/output system performs a self-check;
Step 409, when the self-check reports no error, the basic input/output system loads a preset boot program so that the server enters the operating system;
Step 410, when it is determined that the second maximum payload of each target device has been updated, the basic input/output system triggers a soft restart instruction through a command line tool of the server to reset the motherboard of the server, and the flow returns to step 402.
The detailed execution of steps 401 to 410 is described in the foregoing embodiments, and will not be described herein.
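The control flow of steps 402 to 410 can be summarized with a small simulation. This Python sketch is an illustration under stated assumptions, not the BIOS implementation: registers are modeled as dictionary entries that keep their values across a warm reset, the self-check is assumed to pass, and the function name `bios_boot_flow` and the `max_resets` guard are hypothetical.

```python
def bios_boot_flow(groups, max_resets=3):
    """Simulate the flow of fig. 4 and return the sequence of outcomes.

    groups maps link group -> {target link: {device: payload in bytes}}
    and is modified in place.
    """
    log = []
    for _ in range(max_resets + 1):
        # Steps 402-403: each target link settles on its smallest value.
        for links in groups.values():
            for devices in links.values():
                link_mps = min(devices.values())
                for name in devices:
                    devices[name] = link_mps
        # Steps 404-407: align every device to its link group's minimum.
        updated = False
        for links in groups.values():
            group_min = min(min(d.values()) for d in links.values())
            for devices in links.values():
                for name, mps in devices.items():
                    if mps != group_min:
                        devices[name] = group_min
                        updated = True
        if not updated:
            log.append("self-check and boot")  # steps 408-409
            return log
        log.append("warm reset")               # step 410, back to step 402
    raise RuntimeError("payload sizes did not converge")


groups = {"A": {"A1": {"nic1": 256, "gpu1": 512}, "A2": {"gpu2": 512}}}
print(bios_boot_flow(groups))  # ['warm reset', 'self-check and boot']
```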
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment.
Referring to fig. 5, an embodiment of the present application further provides a schematic structural diagram of a computer program product, where the computer program product is applied to a basic input/output system, and includes an acquisition module 510, a first processing module 520, a second processing module 530, and a third processing module 540;
The acquisition module 510 is configured to acquire the first maximum payload of each target link when the motherboard of the server is reset;
A first processing module 520, configured to group the target links to obtain at least one link group, wherein the virtual switches of the target links in the same link group are connected to different ports of the same physical switch;
A second processing module 530, configured to determine, for each link group, the first maximum payload with the smallest value from among the first maximum payloads of the target links included in the link group;
And a third processing module 540, configured to update the second maximum payload of each target device in the link group according to the first maximum payload with the smallest value corresponding to each link group, so that the target links in the link group communicate with one another based on the first maximum payload with the smallest value.
With the computer program product provided by the embodiment of the present application, when the motherboard of the server is reset, the first maximum payload of each target link is acquired, where the plurality of target devices on each target link communicate through a target protocol and include a virtual switch. The target links are grouped to obtain at least one link group, where the virtual switches of the target links in the same link group are connected to different ports of the same physical switch. For each link group, the first maximum payload with the smallest value is determined from among the first maximum payloads of the target links included in the link group, and the second maximum payload of each target device in the link group is updated according to that value, so that the target links in the link group communicate with one another based on it. This solves the problem in the related art that the target devices connected to different virtual switches must be deployed identically, which wastes resources and creates redundancy. Regardless of whether the target devices connected to the virtual switches of the target links are the same, the target links of different virtual switches connected to the same physical switch can communicate based on the same first maximum payload, which improves deployment flexibility and reduces device redundancy and cost in the server.
In addition, the whole communication control process is automatically carried out, manual modification of technicians is not needed, and the operation is flexible and convenient.
In addition, the first maximum payload with the minimum value is the minimum value in the maximum payloads of all target links, but is also the maximum payload meeting the communication requirements of all target devices in the link group, so that the number of transaction layer data packets in subsequent communication can be reduced, and the utilization rate of the target link bandwidth can be improved.
In some embodiments, the computer program product further comprises:
A fourth processing module for:
under the condition that the access of the new first target equipment is detected, determining a first target link where the first target equipment is located;
acquiring a new first maximum payload which is redetermined by each target device in a first target link after the first target device is accessed;
and updating the second maximum payload of each target device in the link group to which the first target link belongs according to the new first maximum payload when the new first maximum payload is smaller than the first maximum payload before the first target device is accessed.
In some embodiments, the third processing module 540 is specifically configured to:
For any one target device, acquiring a second maximum payload of the target device from a target register of the target device;
Updating the second maximum payload in the target register to the first maximum payload with the minimum value under the condition that the second maximum payload in the target register and the first maximum payload with the minimum value are not the same;
And skipping the update of the target register of the target device when it is determined that the second maximum payload in the target register and the first maximum payload with the smallest value are the same.
In some embodiments, the computer program product further comprises:
And the restarting module is used for triggering a soft restarting instruction through a command line tool of the server so as to reset the main board of the server.
In some embodiments, the computer program product further comprises:
And the loading module is used for loading a preset boot program so that the server enters the operating system when the second maximum payload of each target device is the same as the first maximum payload with the smallest value of its link group.
Each target device sends a second maximum payload in a target register of the target device to other target devices on a target link to which the target device belongs during initialization, and updates the second maximum payload in the target register of the target device to be the second maximum payload of the other target devices under the condition that the acquired second maximum payload of the other target devices is smaller than the second maximum payload of the target device.
In some embodiments, each target device is configured to perform segmentation processing and encapsulation processing on east-west data to be sent based on a second maximum payload in its own target register to obtain at least one transaction layer data packet;
the payload of each transaction layer data packet is not greater than the second maximum payload in the target register.
In some embodiments, the target link comprises a central processing unit, a virtual switch, and a plurality of target devices connected by the virtual switch, wherein the plurality of target devices comprise at least part of a graphics processing unit, a network card and a hard disk.
For the features of the embodiments of the computer program product, reference may be made to the related description of the embodiments of the communication control method, which is not repeated here.
An embodiment of the present application also provides an electronic device including a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the communication control method embodiments described above.
Embodiments of the present application also provide a basic input output system comprising the computer program product provided above.
Embodiments of the present application also provide a non-volatile computer readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the steps of any of the communication control method embodiments described above when run.
In an exemplary embodiment, the computer readable storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium in which a computer program can be stored.
Embodiments of the present application also provide a further computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the communication control method embodiments described above.
Embodiments of the present application also provide another computer program product comprising a non-volatile computer readable storage medium storing a computer program which when executed by a processor implements the steps of any of the communication control method embodiments described above.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The communication control method, computer program product, and basic input/output system provided by the present application have been described in detail above. The principles and embodiments of the present application are explained herein with reference to specific examples, which are intended only to facilitate understanding of the method of the present application and its core ideas. Those skilled in the art may make various modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are intended to fall within the scope of the present application as defined by the following claims.
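By way of non-limiting illustration only, the core idea summarized in the abstract (for each link group, take the smallest first maximum payload and update the second maximum payload of each target device accordingly) might be sketched as follows. All names here (`TargetDevice`, `TargetLink`, `apply_group_minimum`, the `PSW0`/`PSW1` identifiers) are hypothetical and do not appear in the application. The sketch also assumes that links are grouped by a physical switch identifier and that "updating" means setting each device's value to the group minimum. Programming the value into actual hardware registers is not shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TargetDevice:
    name: str
    max_payload: int  # "second maximum payload" of the device, in bytes


@dataclass
class TargetLink:
    physical_switch: str  # assumed key: physical switch behind the link's virtual switch
    max_payload: int      # "first maximum payload" of the link, in bytes
    devices: List[TargetDevice] = field(default_factory=list)


def group_links(links: List[TargetLink]) -> Dict[str, List[TargetLink]]:
    """Place links whose virtual switches share a physical switch in one group."""
    groups: Dict[str, List[TargetLink]] = {}
    for link in links:
        groups.setdefault(link.physical_switch, []).append(link)
    return groups


def apply_group_minimum(links: List[TargetLink]) -> Dict[str, int]:
    """Set every device's payload to the smallest link payload in its group."""
    minimums: Dict[str, int] = {}
    for switch, group in group_links(links).items():
        smallest = min(link.max_payload for link in group)
        for link in group:
            for device in link.devices:
                device.max_payload = smallest
        minimums[switch] = smallest
    return minimums


# Hypothetical topology: two links share PSW0, one link sits on PSW1.
links = [
    TargetLink("PSW0", 256, [TargetDevice("gpu0", 256)]),
    TargetLink("PSW0", 512, [TargetDevice("nic0", 512)]),
    TargetLink("PSW1", 512, [TargetDevice("gpu1", 512)]),
]
minimums = apply_group_minimum(links)
print(minimums)
```

In this hypothetical topology, `nic0` is lowered to the minimum of its group so that it can exchange packets with `gpu0` across the two links, while `gpu1`, which belongs to a different group, keeps its larger value.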
Claims (15)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202511048145.3A CN120567807A (en) | 2025-07-29 | 2025-07-29 | Communication control method, computer program product, and basic input/output system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN120567807A (en) | 2025-08-29 |
Family
ID=96833284
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202511048145.3A (Pending) CN120567807A (en) | 2025-07-29 | 2025-07-29 | Communication control method, computer program product, and basic input/output system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN120567807A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103051737A (en) * | 2011-11-22 | 2013-04-17 | 微软公司 | Providing network capability over a converged interconnect fabric |
CN119271594A (en) * | 2024-09-27 | 2025-01-07 | 苏州元脑智能科技有限公司 | A data packet size matching method, device, apparatus and storage medium |
Non-Patent Citations (2)
Title |
---|
AI17316391579: "High-Performance GPU Server AI Network Architecture (Part 1)" (高性能GPU服务器AI网络架构(上篇)), pages 1 - 6, Retrieved from the Internet <URL:https://blog.csdn.net/Ai17316391579/article/details/137458314?ops_request_misc=%257B%2522request%255Fid%2522%253A%252277919cdc81bfcc85a8425d907f936f84%2522%252C%2522scm%2522%253A%252220140713.130102334..%2522%257D&request_id=77919cdc81bfcc85a8425d907f936f84&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduend~default-2-137458314-null-null.142^v102^pc_search_result_base8&utm_term=%E9%AB%98%E6%80%A7%E8%83%BDGPU%E6%9C%8D%E5%8A%A1%E5%99%A8AI%E7%BD%91%E7%BB%9C%E6%9E%B6%E6%9E%84&spm=1018.2226.3001.4187> * |
古猫先生: "A Brief Analysis of the Impact of PCIe MPS on System Performance and Stability" (浅析PCIe MPS对系统性能和稳定性的影响), pages 1 - 12, Retrieved from the Internet <URL:https://blog.csdn.net/zhuzongpeng/article/details/127061736?ops_request_misc=%257B%2522request%255Fid%2522%253A%25224a5f5d0c085fd06b122fdc57039fb455%2522%252C%2522scm%2522%253A%252220140713.130102334..%2522%257D&request_id=4a5f5d0c085fd06b122fdc57039fb455&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~all~sobaiduend~default-1-127061736-null-null.142^v102^pc_search_result_base8&utm_term=%E6%B5%85%E6%9E%90PCIe%20MPS%E5%AF%B9%E7%B3%BB%E7%BB%9F%E6%80%A7%E8%83%BD%E5%92%8C%E7%A8%B3%E5%AE%9A%E6%80%A7%E7%9A%84%E5%BD%B1%E5%93%8D&spm=1018.2226.3001.4187> * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination |