
CN120406839A - A dedicated high-performance storage device designed for intelligent computing training and reasoning - Google Patents

A dedicated high-performance storage device designed for intelligent computing training and reasoning

Info

Publication number
CN120406839A
CN120406839A (application CN202510484624.3A)
Authority
CN
China
Prior art keywords
storage
processor
bmc
reasoning
hardware
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510484624.3A
Other languages
Chinese (zh)
Inventor
周健 (Zhou Jian)
徐栋梁 (Xu Dongliang)
张丹立 (Zhang Danli)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuhe Smart Technology Shanghai Co ltd
Original Assignee
Fuhe Smart Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuhe Smart Technology Shanghai Co ltd filed Critical Fuhe Smart Technology Shanghai Co ltd
Priority to CN202510484624.3A
Publication of CN120406839A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0604 Improving or facilitating administration, e.g. storage management
    • G06F 3/0607 Improving or facilitating administration, e.g. storage management by facilitating the process of upgrading existing storage systems, e.g. for improving compatibility between host and storage device
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0629 Configuration or reconfiguration of storage systems
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/0671 In-line storage system
    • G06F 3/0683 Plurality of storage devices
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C 29/04 Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C 29/08 Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C 29/12 Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C 29/38 Response verification devices
    • G11C 29/42 Response verification devices using error correcting codes [ECC] or parity check
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C 29/04 Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C 29/08 Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C 29/12 Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C 29/44 Indication or identification of errors, e.g. for repair

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a dedicated high-performance storage device designed for intelligent computing training and reasoning. The device comprises a chassis and at least two storage nodes arranged within it, where each storage node comprises, on a motherboard: a storage unit with a plurality of high-speed-interface hard disk drive slots, located at the front of the chassis and configured to accommodate a plurality of drives, each drive exclusively occupying one high-speed data bus lane and connecting directly to the processor to support non-blocking concurrent data access; a processor connected to the storage unit and the memory unit and supporting at least one expandable processor; a memory unit connected to the processor and supporting error-correcting code (ECC) memory; and high-speed expansion slots connected to the processor and the storage unit. The device improves data processing speed through its high-speed data bus and concurrent-access design, offers good scalability and reliability, is suited to intelligent computing training and reasoning applications, and can meet the demands of large-scale data storage and efficient processing.

Description

Dedicated high-performance storage device designed for intelligent computing training and reasoning
Technical Field
The invention relates to computer storage technology, and in particular to a dedicated high-performance storage device designed for intelligent computing training and reasoning.
Background
In modern computing, the rapid development of artificial intelligence, cloud computing, and big data, and in particular the growing demand for intelligent computing training and reasoning, poses great challenges to conventional storage technologies. To achieve efficient and fast data processing, storage system performance has become one of the key bottlenecks, especially when training on and reasoning over massive data. An efficient storage system must offer fast data read/write capability and must also handle a large number of concurrent requests in order to meet the needs of large-scale intelligent computing.
Currently, most conventional storage devices adopt architectures based on hard disk drives (HDDs), solid state drives (SSDs), or a combination of the two. These devices have performance limitations when processing large amounts of data, especially in application scenarios requiring high throughput and low latency. Although SSDs are faster than HDDs, existing storage systems still struggle to provide adequate performance guarantees in the face of very-large-scale concurrent data access and high computational demand. In addition, traditional storage systems have shortcomings in scalability, concurrent data access, and fault tolerance.
Most prior-art storage devices are optimized for traditional application scenarios and do not fully account for the high-performance data-transfer requirements of intelligent computing. Although some high-performance storage solutions have been proposed, they often fail to address high-concurrency data access, storage system scalability, and fault tolerance at the same time. The general lack of designs optimized for the specific requirements of intelligent computing training and reasoning makes it difficult for existing storage systems to deliver high-speed, reliable, and efficient data storage in practical applications. Accordingly, there is a need for a new type of storage device to cope with the growing demand for storage performance in the intelligent computing field.
Disclosure of Invention
It is therefore an aim of embodiments of the present invention to provide a dedicated high-performance storage device designed for intelligent computing training and reasoning, which solves at least one of the above technical problems.
To achieve the above object, there is provided a dedicated high-performance storage device designed for intelligent computing training and reasoning, comprising:
a chassis, and at least two storage nodes disposed within the chassis;
each storage node comprising the following components arranged on a motherboard:
a storage unit comprising a plurality of high-speed-interface hard disk drive slots, the slots being located at the front of the chassis and configured to accommodate a plurality of drives, each drive exclusively occupying one high-speed data bus lane and connecting directly to the processor to support non-blocking concurrent data access;
a processor connected to the storage unit and the memory unit and supporting at least one expandable processor;
a memory unit connected to the processor and supporting error-correcting code (ECC) memory; and
a high-speed expansion slot connected to the processor and the storage unit, for installing an external reasoning accelerator card.
In some possible embodiments, each storage node has a built-in remote diagnosis function and supports real-time feedback of fault codes;
the remote diagnosis function is implemented through a baseboard management controller (BMC), where the BMC is an integrated component of the storage node's motherboard, operates independently of the operating system, and is accessed remotely through an IPMI LAN port;
the BMC collects hardware state data of the processor, the memory unit, the storage unit, and the power module in real time, derives hardware abnormal events from the hardware state data, converts each abnormal event into a standardized fault code, and records the fault code to the system event log (SEL);
the storage device supports real-time viewing of fault codes in the SEL through an IPMI tool or a network interface.
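The monitoring flow described above (collect hardware state data, detect an abnormal event, convert it to a standardized fault code, append it to the SEL) can be sketched in Python. This is a minimal illustration only: the fault-code values, field names, and the 95 °C threshold wiring are assumptions, since the patent does not publish a fault-code table or a BMC API.

```python
from dataclasses import dataclass, field
from time import time

# Hypothetical standardized fault codes; illustrative only.
FAULT_CODES = {
    "cpu_overtemp": "0x01",
    "ecc_uncorrectable": "0x02",
    "psu_failure": "0x03",
}

@dataclass
class BMC:
    sel: list = field(default_factory=list)  # system event log (SEL)

    def poll(self, hw_state: dict) -> list:
        """Scan one sample of hardware state data and log any abnormal events."""
        events = []
        if hw_state.get("cpu_temp_c", 0) >= 95:
            events.append("cpu_overtemp")
        if hw_state.get("ecc_uncorrectable", False):
            events.append("ecc_uncorrectable")
        if not hw_state.get("psu_ok", True):
            events.append("psu_failure")
        # convert each abnormal event to its standardized code and record it
        for name in events:
            self.sel.append({"ts": time(), "code": FAULT_CODES[name], "event": name})
        return events

bmc = BMC()
bmc.poll({"cpu_temp_c": 97, "psu_ok": True})
print([e["code"] for e in bmc.sel])  # -> ['0x01']
```

An IPMI tool or web console would then read entries out of `sel` on demand, which is what "real-time viewing of fault codes" amounts to in this sketch.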
In some possible implementations, the hardware state data include the processor temperature, fan speeds, and the state of the power module, and the storage device supports fault alerting via LED indicators, Simple Network Management Protocol (SNMP) traps, or email notification.
In some possible embodiments, the remote diagnosis function supports power module control and hardware state diagnosis through out-of-band management when the operating system is down or fails to boot;
the BMC secures remote management sessions through encryption protocols, including HTTPS or IPMI-over-LAN encryption, and provides localization of hardware faults through BIOS POST codes.
In some possible embodiments, each storage node is provided with 24 PCIe 5.0 NVMe drives, each drive exclusively occupying one PCIe link, supporting multi-task concurrent data access;
a redundant array of independent disks (RAID) controller is arranged on the motherboard of the storage node; the RAID controller executes a dynamic load balancing algorithm that dynamically allocates storage bandwidth resources according to the priority of computing tasks.
In some possible embodiments, the motherboard of the storage device provides 4 PCIe 5.0 expansion slots, where the first and second slots are PCIe 5.0 x16 slots and the third and fourth slots are PCIe 5.0 x8 slots; the expansion slots are located in the rear area of the motherboard and are used to install external reasoning accelerator cards;
the dynamic load balancing algorithm dynamically configures PCIe bandwidth resources as follows:
the BMC monitors in real time the task load of the GPU or FPGA accelerator card connected to an expansion slot, where the task load is quantified by the number of I/O requests issued by the reasoning task per unit time;
when any index of the task load exceeds a preset threshold, the BMC sends a PCIe resource configuration instruction to the BIOS;
the BIOS, according to the resource configuration instruction, switches the PCIe 5.0 x16 link of the first or second slot from the default x16 single-link mode into two independent x8 links, dedicates one x8 link to the high-priority reasoning accelerator card, and reserves the other x8 link for expansion cards other than the reasoning accelerator card; this bandwidth reconfiguration is realized through the PCIe link bifurcation function of the BIOS;
when the task load falls back below the preset threshold, the BMC notifies the BIOS to restore the default x16 single-link mode.
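The threshold-driven reconfiguration above can be sketched as a small control loop. The threshold value is an invented placeholder, and the `Slot` object stands in for the BMC-to-BIOS signalling path; real firmware would issue vendor-specific commands rather than call Python methods.

```python
# Assumed threshold: inference-task I/O requests per second (illustrative).
IO_REQS_THRESHOLD = 100_000

class Slot:
    """Models one PCIe 5.0 x16 expansion slot that the BIOS can bifurcate."""
    def __init__(self):
        self.mode = "x16"  # default single-link mode

    def split(self):
        # Bifurcation: one x16 link becomes two independent x8 links,
        # one dedicated to the high-priority reasoning accelerator card.
        self.mode = "x8+x8"

    def restore(self):
        self.mode = "x16"

def rebalance(slot: Slot, io_reqs_per_s: int) -> str:
    """One iteration of the BMC's monitoring loop for a single slot."""
    if io_reqs_per_s > IO_REQS_THRESHOLD and slot.mode == "x16":
        slot.split()       # load over threshold: request bifurcation
    elif io_reqs_per_s <= IO_REQS_THRESHOLD and slot.mode == "x8+x8":
        slot.restore()     # load dropped back: restore default mode
    return slot.mode

slot = Slot()
print(rebalance(slot, 150_000))  # -> x8+x8
print(rebalance(slot, 50_000))   # -> x16
```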
In some possible embodiments, the BMC continuously collects utilization data of the processor, the memory unit, the storage unit, and the network, and exposes an API externally through the IPMI interface, so that an external management system can query real-time utilization data through the API, trigger the dynamic load balancing algorithm to adjust the PCIe bandwidth resource configuration, and receive real-time alarms of hardware abnormal events.
In some possible implementations, the storage device is equipped with dual 2000 W titanium-grade power supplies and a plurality of fans;
the BMC also automatically triggers load migration when a power module abnormality or cooling abnormality is detected.
In some possible embodiments, the storage unit, the processor, and the memory unit achieve coordinated scheduling through an optimized BIOS.
The above technical solution has the following beneficial effects:
it provides a dedicated high-performance storage device designed for intelligent computing training and reasoning. A plurality of high-speed-interface hard disk drive slots are arranged at the front of the chassis with hot-pluggable interfaces, making the drives easier to install and replace. The connection design of the processor, storage unit, and memory unit supports expandable processors and ECC memory, strengthening the device's computing performance and data integrity. The high-speed expansion slots further improve the scalability of the storage system. Because each of the plurality of drives exclusively occupies one high-speed data bus lane and supports concurrent data access, data processing speed and overall system performance improve significantly, meeting high-performance storage requirements and making the device particularly suitable for large-scale intelligent computing training and reasoning tasks.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. The drawings show only some embodiments of the invention; other drawings can be derived from them by a person skilled in the art without inventive effort.
FIG. 1 is a front view of a front panel of a storage device of an embodiment of the present invention;
FIG. 2 is a rear view of a memory device of an embodiment of the present invention;
FIG. 3 is a schematic view of an inner rail mounted inside a chassis according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a process of mounting a storage device to a rack in accordance with an embodiment of the present invention;
FIG. 5 is a schematic illustration of removal of a top cover plate of a storage device for maintenance in accordance with an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Intelligent computing training and reasoning refers to the processing, analysis, and learning of large amounts of data in the intelligent computing field, especially in artificial intelligence and machine learning applications. Training uses large amounts of labeled or unlabeled data to fit an algorithmic model so that the model can extract rules, features, or knowledge from the data, which can then be used to predict or classify future data. Reasoning uses the trained model to process newly input data and make predictions in real time, providing support for decisions or operations. Both require highly efficient data storage and processing capacity, so dedicated high-performance storage devices are needed to meet the high-speed read/write demands of massive data and to ensure the efficiency and accuracy of the training and reasoning process.
Example 1
Fig. 1 shows the front panel of the storage device, with reference numerals 0 through 23 corresponding to 24 2.5-inch PCIe 5.0 NVMe hard disk drive slots. The control panel provides: a power button, the main power switch used to apply or remove server power (pressing it turns off system power, but standby power is retained); a UID button for switching the UID LED indicator on or off; a power LED indicating that the system power supply unit is supplying power, lit steadily while the system is running; a drive LED that flashes when the storage drives are active; a NIC2 LED that flashes when LAN port 2 has network activity; a NIC1 LED that flashes when LAN port 1 has network activity; an overheat LED indicating an overheat condition; and an information LED that warns operators of various states.
Fig. 2 shows the rear view of the storage device, which is configured with PCIe 5.0 expansion slots, 10GbE LAN ports, and so on. In Fig. 2, the power supplies are two 2000-watt high-efficiency units (PWS0 on the left and PWS1 on the right); the IPMI LAN port is one RJ45 1GbE dedicated IPMI LAN port; the COM port is one serial port; the LAN ports are two RJ45 10GbE LAN ports; the USB ports are two USB 3.0 ports; and the VGA port is one video graphics port. Reference numeral 1 in Fig. 2 is a PCIe 5.0 x16 low-profile slot, reference numeral 3 is a PCIe 5.0 x8 low-profile slot, and reference numeral 4 is a PCIe 5.0 x8 low-profile slot.
Fig. 3 shows the mounting of the inner rails inside the chassis, which supports smooth installation of hardware components. To mount an inner rail to the chassis: confirm that the left and right inner rails have been correctly identified; place the inner rail flush against the side of the chassis, aligning the pins on the rail with the slotted holes on the chassis side; slide the inner rail toward the rear of the chassis until the pins reach the ends of the slots, which secures the rail to the chassis; an optional screw may be added for extra security.
Fig. 4 illustrates the process of mounting the storage device to a rack: pull the two middle rails out of the front of the outer rails until each catch stops; align the inner rails on the chassis with the front ends of the middle rails; slide the inner rails into the middle rails, keeping the pressure even on both sides; when the locking lever stops further movement, press the locking lever inside the inner rail and push the chassis into the rear of the rack.
Fig. 5 illustrates the process of removing the top cover plate of the storage device for maintenance. The top cover consists of a front cover and a rear cover connected by a hinge. The maintenance process includes turning off system power, unplugging the power cord from the rear of the power supply, removing the three screws on the front of the hinged portion of the top cover, and lifting the hinged cover to access the fan area.
The embodiment of the invention provides a dedicated high-performance storage device designed for intelligent computing training and reasoning, comprising:
a chassis, and at least two storage nodes disposed within the chassis;
each storage node comprising the following components arranged on a motherboard:
a storage unit comprising a plurality of high-speed-interface hard disk drive slots, the slots being located at the front of the chassis and configured to accommodate a plurality of drives, each drive exclusively occupying one high-speed data bus lane and connecting directly to the processor to support non-blocking concurrent data access;
a processor connected to the storage unit and the memory unit and supporting at least one expandable processor;
a memory unit connected to the processor and supporting error-correcting code (ECC) memory; and
a high-speed expansion slot connected to the processor and the storage unit, for installing an external reasoning accelerator card.
Further, each storage node has a built-in remote diagnosis function and supports real-time feedback of fault codes;
the remote diagnosis function is implemented through a baseboard management controller (BMC); the BMC is an integrated component of the storage node's motherboard, operates independently of the operating system, and is accessed remotely through an IPMI LAN port, where IPMI refers to the Intelligent Platform Management Interface;
the BMC collects hardware state data of the processor, the memory unit, the storage unit, and the power module in real time, derives abnormal events from the hardware state data, converts each abnormal event into a standardized fault code, and records the fault code to the system event log (SEL);
the storage device supports real-time viewing of fault codes in the SEL through an IPMI tool or Web interface.
Further, the fault codes include critical error codes and non-critical error codes. Critical error codes are fatal: the storage node cannot pass the hardware self-test. They cover processor failure, uncorrectable ECC errors in memory, and power module failure. Non-critical error codes are triggered by abnormalities that allow the storage node to run in a degraded mode, and cover fan speeds exceeding a threshold, NVMe drive temperature alarms, and PCIe link speed downgrades. The storage device supports real-time viewing of fault codes in the SEL through an IPMI tool or Web interface, and triggers LED alarms, SNMP notifications, or email alerts based on the severity level of the code.
Further, the BMC is specifically configured to collect in real time hardware state data including the processor temperature and operating state, the ECC state of the memory unit, the temperature and health indices of the NVMe drives in the storage unit, the power module and its load state, system fan speeds, fault signals, and so on. When the data exceed a preset threshold or a device abnormality is detected, the BMC judges it as an abnormal event and generates a critical or non-critical error code according to the event's severity. Critical error codes indicate a fatal failure of the storage node, such as the processor temperature exceeding the safety threshold (95 °C or above) and remaining over the limit, uncorrectable ECC errors in the memory unit, or the power module's output failing or its voltage deviating beyond tolerance (10%). Non-critical error codes indicate fault-tolerant abnormalities, such as an NVMe drive temperature exceeding the early-warning threshold (70 °C or above) but not reaching the critical value, a system fan speed deviating from its set point (by more than 15%) without stalling, or a PCIe link rate downgrading to a low-bandwidth mode. Fault codes correspond one-to-one with abnormal events, are recorded to the system event log (SEL), support real-time query through an IPMI tool or Web interface, and trigger LED alarms, SNMP notifications, or email alerts.
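The severity classification above can be sketched directly from the stated thresholds (95 °C processor, 70 °C NVMe early warning, 10% voltage tolerance, 15% fan deviation). The fault-code names and the alert-channel mapping are illustrative assumptions, not values from the patent.

```python
CRITICAL, NON_CRITICAL = "critical", "non-critical"

def classify(state: dict) -> list:
    """Return (severity, code) pairs for one sample of hardware state data."""
    faults = []
    if state.get("cpu_temp_c", 0) >= 95:                 # safety threshold
        faults.append((CRITICAL, "CPU_OVERTEMP"))
    if state.get("ecc_uncorrectable", False):            # uncorrectable ECC
        faults.append((CRITICAL, "MEM_ECC_UNCORR"))
    if abs(state.get("psu_voltage_dev_pct", 0)) > 10:    # voltage out of tolerance
        faults.append((CRITICAL, "PSU_VOLT_OOT"))
    if state.get("nvme_temp_c", 0) >= 70:                # early warning; critical cutoff unspecified
        faults.append((NON_CRITICAL, "NVME_TEMP_WARN"))
    if abs(state.get("fan_dev_pct", 0)) > 15:            # fan off set point, not stalled
        faults.append((NON_CRITICAL, "FAN_SPEED_DEV"))
    if state.get("pcie_downgraded", False):              # link in low-bandwidth mode
        faults.append((NON_CRITICAL, "PCIE_LINK_DOWNGRADE"))
    return faults

def alert_channels(severity: str) -> list:
    # Assumed policy: critical faults escalate to every channel.
    return ["LED", "SNMP", "email"] if severity == CRITICAL else ["LED"]
```

For example, `classify({"cpu_temp_c": 96})` yields one critical fault, while a 20% fan deviation alone yields only a non-critical one.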
Further, the drives are physically deployed in hot-swap carriers at the front of the chassis and connect directly to the processor/controller through exclusive PCIe 5.0 lanes. Each drive sits in an independent hot-swap tray fixed to the chassis front panel by slide rails, supporting online replacement. Each drive connects directly to the processor or storage controller through a separate PCIe 5.0 x4 link, with no intermediate switch chip in the path. Each drive exclusively occupies one PCIe 5.0 link (a non-shared bus), ensuring full-bandwidth access and avoiding performance degradation caused by multi-device contention. The drives connect to PCIe hot-plug connectors on the motherboard through SFF-8639 (U.2) interfaces, supporting hot swapping and automatic identification.
Further, the storage unit includes a plurality of high-speed-interface hard disk drive slots, wherein at least one slot supports a non-volatile memory express (NVMe) or Serial ATA (SATA) interface.
Further, the high-speed expansion slots, coupled to the processor and the memory unit, include a plurality of slots supporting a high-speed data transmission standard and at least one low-profile slot.
Further, the storage device further comprises a heat dissipation module connected to the processor and the memory unit and comprising a plurality of cooling fans.
Further, the storage device further comprises a power module coupled to each of the components, including at least two power supply units that form a redundant configuration.
Further, the storage device further comprises a network interface unit coupled to the processor and the power module and including a plurality of high-speed LAN ports and at least one baseboard management controller (BMC) LAN port.
Further, the storage device further comprises a remote management unit connected to the network interface unit, supporting BMC remote monitoring through the Intelligent Platform Management Interface (IPMI) protocol.
Further, the storage device further comprises a redundant array of independent disks (RAID) module connected to the storage unit, the power module, and the processor, for performing disk array configuration through a RAID configuration function.
Further, the hardware state data include the processor temperature, fan speeds, and the state of the power module, and the storage device supports fault alerting through LED indicators, Simple Network Management Protocol (SNMP) traps, or email notification.
Further, the remote diagnosis function of the storage device supports power control and hardware diagnosis through out-of-band management when the host operating system is down or fails to boot;
the BMC of the storage device secures remote management sessions through encryption protocols, including HTTPS or IPMI-over-LAN encryption, preventing security threats during remote management, and provides localization of hardware faults through BIOS POST codes.
Further, each storage node is provided with 24 PCIe 5.0 NVMe drives, each drive exclusively occupying one PCIe link to support multi-task concurrent data access;
a redundant array of independent disks (RAID) controller is arranged on the motherboard of the storage node; the RAID controller executes a dynamic load balancing algorithm that dynamically allocates storage bandwidth according to the priority of computing tasks and preferentially allocates more link bandwidth resources to high-priority computing tasks.
Further, the RAID controller is specifically configured to allocate storage bandwidth by:
monitoring in real time the PCIe link utilization of each NVMe drive and the task priority labels of computing tasks;
routing, according to the task priority labels, the data requests of high-priority tasks to the NVMe drives with the highest idle link share;
dynamically allocating extra PCIe link bandwidth resources to high-priority tasks, which includes splitting the x4 link of a low-priority task within a RAID group into x2 mode to free bandwidth for the high-priority task, or aggregating several x4 links into an x8/x16 link through a PCIe switch dedicated to the high-priority task. Here, link bandwidth resources refer to the bandwidth allocation rights of the PCIe 5.0 x4 link corresponding to a single NVMe drive, and the allocation process does not interrupt the concurrent access of other computing tasks.
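The routing rule above (send high-priority requests to the drive whose link currently has the largest idle share) can be sketched as a small selection function. Drive names, utilization figures, and the 0.9 saturation cutoff for low-priority traffic are illustrative assumptions.

```python
def route(drives: dict, priority: str) -> str:
    """Pick a target NVMe drive.

    drives maps drive id -> current PCIe link utilization in [0, 1].
    """
    if priority == "high":
        # highest idle share == lowest utilization
        return min(drives, key=drives.get)
    # low-priority tasks take any non-saturated drive (first fit here)
    for drive, util in drives.items():
        if util < 0.9:
            return drive
    return min(drives, key=drives.get)

drives = {"nvme0": 0.8, "nvme1": 0.2, "nvme2": 0.5}
print(route(drives, "high"))  # -> nvme1
```

A real controller would combine this with the link-split/aggregate step, but the selection policy itself reduces to this min-utilization choice.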
Further, the main board of the storage device provides 4 PCIe 5.0 expansion slots, wherein the first and second slots are 2 PCIe 5.0 x16 slots and the third and fourth slots are 2 PCIe 5.0 x8 slots; the expansion slots are located in the rear area of the main board and are used for installing external reasoning acceleration cards;
The dynamic load balancing algorithm dynamically configures PCIe bandwidth resources by:
The BMC monitors, in real time, the task load of the GPU acceleration card or FPGA acceleration card connected to the expansion slots, wherein the task load is quantified by metrics including the number of I/O requests issued by the reasoning task per unit time;
when any index in the task load exceeds a preset threshold, the BMC sends a PCIe resource configuration instruction to the BIOS;
Switching, by the BIOS according to the resource configuration instruction, the PCIe 5.0 x16 link of the first slot or the second slot from the default x16 single-channel mode into two independent x8 channels, wherein one x8 channel is dedicated to the high-priority reasoning acceleration card to guarantee an exclusive bandwidth of no less than 32 GB/s, and the other x8 channel is reserved for expansion cards other than the reasoning acceleration card;
And when the task load falls below the preset threshold, the BMC informs the BIOS to restore the default x16 single channel mode.
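The threshold-triggered handshake described in these steps (BMC monitors load, BIOS bifurcates or restores the link) might be modeled as below. The threshold value, the I/O metric and the link-mode names are illustrative assumptions.

```python
# Sketch of the BMC -> BIOS PCIe bifurcation decision.
# The 10_000 IOPS threshold and mode strings are assumed for illustration.
X16 = "x16"          # default single-channel mode
DUAL_X8 = "x8+x8"    # bifurcated: one x8 (>= 32 GB/s on PCIe 5.0) for inference
IO_THRESHOLD = 10_000  # inference I/O requests per second (assumed)

def desired_link_mode(inference_iops: int) -> str:
    """Return the link mode the BMC should request from the BIOS."""
    if inference_iops > IO_THRESHOLD:
        return DUAL_X8  # dedicate an x8 channel to the inference card
    return X16          # load fell below threshold: restore default x16

assert desired_link_mode(25_000) == DUAL_X8
assert desired_link_mode(2_000) == X16
```

A PCIe 5.0 x8 link provides roughly 32 GB/s of raw bandwidth, which is why the bifurcated mode can still satisfy the exclusive-bandwidth guarantee stated above.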
Further, the storage device adopts intelligent RAID and data partition management, the intelligent RAID supports hardware level RAID 0/1/5/10, and the data partition strategy is dynamically adjusted by combining a firmware algorithm to balance I/O load or automatically optimize data distribution according to the type of computing task.
Further, the BMC is used for continuously collecting utilization rate data of the processor, the memory unit, the storage unit and the network, and exposing an application programming interface (API) externally through the intelligent platform management interface (IPMI), so that an external management system can query the real-time utilization rate data through the API, trigger the dynamic load balancing algorithm to adjust PCIe bandwidth resource allocation, and receive real-time alarms of hardware abnormal events.
Further, the dynamic load balancing algorithm dynamically adjusts task allocation based on this data through a third-party scheduling system; for example, when a storage bandwidth bottleneck is detected, some tasks are migrated to low-load nodes.
Further, the storage device supports NVMe over Fabrics (NVMe-oF) and storage pooling; combined with hyper-converged infrastructure software through an M.2 expansion slot, storage resources are dynamically allocated through a virtualization layer, and cache space is allocated for GPU computing nodes.
Further, the storage device supports the NVMe over Fabrics (NVMe-oF) protocol and a storage pool function, and the dynamic resource allocation of the virtualization layer is realized through the following steps:
a storage pool construction step: connecting a plurality of NVMe drives through the M.2 expansion slot to form a unified storage pool, dividing the storage pool into logical volumes (LUNs) using hyper-converged infrastructure (HCI) software, and associating each logical volume with a specific GPU computing node;
a virtualization layer monitoring step: collecting, in real time, the I/O load, cache hit rate and task priority data of the GPU computing nodes, and analyzing storage resource requirements through a virtualization manager (such as VMware vSAN or Ceph);
a dynamic allocation mechanism step: allocating an exclusive NVMe-oF channel and cache space (a DRAM cache of 64 GB or more) to high-priority GPU tasks, migrating hot-spot data to low-latency NVMe drives in the storage pool as needed, and optimizing the data access path;
a resource recycling and rebalancing step: after a task is completed, releasing the exclusive channel and cache space back to the public resource pool, and automatically adjusting the QoS policy of the logical volume (such as the IOPS upper limit and latency threshold) based on load changes.
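The allocation and recycling steps above can be sketched as a toy pool manager. The class, field names and pool sizes are assumptions for illustration; only the "≥ 64 GB cache for high-priority GPU tasks" rule and the release-back-to-pool behavior come from the text.

```python
# Toy model of the dynamic allocation / resource recycling steps.
# Sizes and names are illustrative assumptions.
class StoragePool:
    def __init__(self, cache_gb: int):
        self.free_cache_gb = cache_gb
        self.grants: dict[str, int] = {}   # task -> granted cache (GB)

    def allocate(self, task: str, priority: int, want_gb: int) -> int:
        # High-priority GPU tasks get a dedicated cache slice of >= 64 GB.
        grant = max(want_gb, 64) if priority == 0 else want_gb
        grant = min(grant, self.free_cache_gb)
        self.free_cache_gb -= grant
        self.grants[task] = grant
        return grant

    def release(self, task: str) -> None:
        # Resource recycling: return the slice to the public pool.
        self.free_cache_gb += self.grants.pop(task, 0)

pool = StoragePool(cache_gb=256)
granted = pool.allocate("gpu-train", priority=0, want_gb=32)  # bumped to 64
pool.release("gpu-train")                                     # pool restored
```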
Further, the storage device is provided with dual 2000 W titanium-grade power supplies and 6 high-speed fans to ensure the stability of resource allocation; when the BMC detects a power supply or heat dissipation abnormality, it automatically triggers load migration and preferentially guarantees the resource supply of core computing tasks.
Further, the BMC is also configured to perform the following operations:
The BMC monitors the input voltage (110-240 V AC) and output load rate of the power supply modules, collects temperature data of the processor, the NVMe drives and the memory unit, and detects the rotation speed and fault state of the cooling fans;
When an abnormal state is detected, the BMC performs a response operation, specifically including: when any power supply module fails or its output load rate exceeds 95%, triggering dynamic transfer of the failed power supply's load to a normal power supply module, limiting the power consumption quota of non-critical tasks, and preferentially guaranteeing resources such as the Turbo Boost power supply of the processor, the PCIe channel bandwidth of high-priority computing tasks (AI training/reasoning), and the overclocked-mode power supply of the memory unit; and when the processor temperature reaches 90 °C or more, an NVMe drive temperature reaches 75 °C or more, or a fan speed falls below 3000 RPM, triggering operations such as disabling hyper-threading or the GPU acceleration function of non-critical tasks, and migrating high-priority tasks to NVMe drives or processor cores with lower temperatures;
The BMC predefines task priority labels through the BIOS (for example, AI training is P0 and data backup is P3); when resources contend, power and heat dissipation resources are allocated from high priority to low priority, ensuring that high-priority tasks are not throttled or interrupted.
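The BMC's threshold checks can be condensed into one small function. The thresholds (95% PSU load, 90 °C CPU, 75 °C NVMe, 3000 RPM fan) are taken from the text; the action names are illustrative assumptions.

```python
# Sketch of the BMC's abnormal-state response logic.
# Thresholds follow the text; action names are assumed labels.
def bmc_actions(psu_load: float, cpu_c: float, nvme_c: float, fan_rpm: int) -> list[str]:
    actions = []
    if psu_load > 0.95:
        # PSU overload/failure: shift load, cap non-critical power draw.
        actions += ["migrate_psu_load", "cap_noncritical_power"]
    if cpu_c >= 90 or nvme_c >= 75 or fan_rpm < 3000:
        # Thermal/fan fault: throttle non-critical work, move hot tasks.
        actions += ["throttle_noncritical", "migrate_hot_tasks"]
    return actions

assert bmc_actions(0.97, 60, 50, 5000) == ["migrate_psu_load", "cap_noncritical_power"]
assert bmc_actions(0.50, 92, 50, 5000) == ["throttle_noncritical", "migrate_hot_tasks"]
assert bmc_actions(0.50, 60, 50, 5000) == []
```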
Further, the storage unit, the processor and the memory realize cooperative scheduling through the optimized BIOS.
Further, each 2.5-inch PCIe 5.0 NVMe hard disk drive slot of the storage unit supports a storage capacity of up to 32 TB; the processor supports multi-node configuration suitable for distributed computing tasks; the memory unit realizes data error correction through error-correcting code (ECC) technology; the redundant array of independent disks (RAID) module supports RAID 0, RAID 1, RAID 5 and RAID 10 configuration modes; the network interface unit supports parallel data transmission and load balancing; and the front end of the chassis is provided with LED indicator lights for displaying hard disk drive status and I/O activity.
The following describes the above technical scheme in detail:
The storage device in this embodiment supports 24 2.5-inch PCIe 5.0 NVMe hard disk drive slots, each supporting a storage capacity of up to 32 TB. These slots use PCIe 5.0 interfaces that provide high bandwidth and fast storage access, particularly for large-scale data set storage and high-performance computing tasks. With this design, the system can process massive data and support rapid reading and writing, meeting the requirements of high-performance computing and AI training tasks on storage equipment. In addition, the system also supports an M.2 PCIe 3.0 x4 NVMe/SATA slot, further adapting to different types of storage requirements and ensuring flexible storage configuration and strong expandability.
The system is equipped with a plurality of PCIe 5.0 expansion slots, including 2 x16 slots and 2 x8 low-profile slots, supporting expansion cards and efficient data transfer. These expansion slots are very important for high-speed computation and data access in AI reasoning and training tasks, and can provide a stable and fast data transmission channel. In computationally intensive applications, the expansion card may implement additional acceleration computing capabilities, further enhancing the performance of the device, supporting different types of AI hardware acceleration cards, such as GPUs and FPGAs, to meet the ever-increasing AI computing demands.
The storage device of this embodiment supports fourth- and fifth-generation Intel Xeon Scalable processors and is equipped with up to 2 TB of ECC DDR5 memory, with memory speeds of up to 4800 MT/s and 5600 MT/s respectively. These configurations provide powerful computational support for AI training and reasoning workloads. The memory adopts ECC technology, so that errors in memory data can be corrected in time, improving calculation accuracy and avoiding data corruption. With this high-performance processor and memory configuration, the system can operate efficiently when processing massively parallel computing tasks, meeting the requirements of AI applications for high-bandwidth and high-reliability memory.
To ensure stable operation under high load, the storage device of this embodiment is equipped with six high-efficiency 6 cm fans. AI training and reasoning tasks require substantial computational resources and generate considerable heat inside the system, so a powerful heat dissipation system is critical. With this fan design, the system can expel heat effectively, keeping component temperatures within a reasonable range, avoiding performance degradation or hardware faults caused by overheating, and ensuring long-term stability and reliability.
The device is equipped with two 2000 watt redundant power modules to ensure that the system is able to supply power stably under high load conditions. The high-performance computing and AI training system consumes a large amount of power, and the redundant power supply design can ensure that even if one power supply module fails, the other module can still continue to supply power, so that system shutdown caused by power supply problems is avoided, and the stability and reliability of the system are enhanced.
The storage device in this embodiment employs an efficient power management system equipped with 2000 watt redundant power modules. Under high-performance computing demands, the system can maintain a stable power supply while effectively controlling energy consumption. With the redundant power supply design, when one power supply fails, the other power module automatically takes over, avoiding system shutdown caused by power failure. In addition, the redundant fan design further ensures the heat dissipation capability of the equipment under high load, prolonging the service life of the equipment and improving system stability.
The system adopts a modularized design, supports a plurality of modularized components, comprises a storage driver, an expansion card, a memory and the like, and can be flexibly configured and expanded according to the requirements of AI training and reasoning tasks. The nodes have hot plug functionality allowing replacement or installation without interrupting system operation. The system can be operated by a user under the condition of not affecting the stability of the system no matter the hardware maintenance, the upgrading or the expansion of the storage and the computing capacity, so that the maintainability and the flexibility of the system are improved, and the system is ensured to be always in an optimal working state in a long-time and high-load task.
The storage device is equipped with a plurality of high-speed network interfaces, including two RJ45 10GbE LAN ports and one dedicated BMC LAN port. Through these high-speed network interfaces, the system can achieve rapid data transmission in AI training and reasoning tasks, meeting the requirements of concurrent access to large data volumes. Efficient data transfer capability is critical to increasing data read speed, reducing latency, and accelerating computation.
The system is provided with BMC support, remote management and monitoring are carried out through IPMI, and an administrator can check the health state of the system in real time through an IPMI tool and a Web interface, wherein the health state comprises key parameters such as temperature, fan rotating speed, power supply state and the like. The function has important significance in the data center environment, can effectively prevent hardware faults and conduct fault diagnosis, and ensures stable performance of AI training and reasoning tasks.
The system supports hot-swappable NVMe hard disk drives, allowing disk replacement or upgrading without service interruption, suiting the dynamic requirements of AI tasks for storage capacity and speed. Intelligent RAID and data slicing management supports hardware-level RAID 0/1/5/10, and the data slicing strategy is dynamically adjusted through a firmware algorithm to optimize I/O performance. With this design, during AI training and reasoning, the storage load can be balanced according to the requirements of computing tasks, improving the utilization efficiency of storage resources.
The system is provided with a plurality of PCIe 5.0 slots and M.2 slots and supports the design of a plurality of expansion cards and storage drivers, has high expandability and flexible configuration capability, can expand resources according to AI task requirements, and meets the increasing demands on storage capacity, computing capability and high-speed data access. The supported expansion card comprises a GPU, an FPGA and other computing acceleration cards, so that the computing capacity and the data processing efficiency of the system can be greatly improved.
The system provides a complete troubleshooting mechanism, including a detailed troubleshooting guide that helps users resolve common problems such as power supply issues and memory errors. Through the remote diagnosis function and the IPMI interface, the system can monitor hardware health in real time and generate corresponding fault logs, ensuring stability during intelligent computing training and reasoning tasks, providing rapid maintenance support, and reducing downtime.
Example two
The storage device has a built-in remote diagnosis function and supports real-time feedback of fault codes.
IPMI remote management function
The storage device supports remote management based on the IPMI 2.0 standard, implemented by a separate baseboard management controller (BMC). An administrator may remotely access system health status through an IPMI LAN port (RJ45 1GbE), including monitoring hardware parameters (e.g., temperature, voltage, fan speed) and fault logs in real time.
In terms of fault code feedback, the BMC records a hardware event log (SEL) and generates a specific fault code when an anomaly (e.g., overheating, power failure) is detected, supporting real-time viewing through an IPMI tool or Web interface.
Specifically, the system provides remote management functions by supporting the IPMI 2.0 standard through a dedicated baseboard management controller (BMC). An administrator may access the BMC interface through a dedicated IPMI LAN port (RJ45 1GbE) without relying on the host operating system. This design ensures that the remote management function is unaffected by the state of the host operating system, improving the management efficiency and flexibility of the system.
Specifically, the IPMI 2.0 provides a hardware state remote monitoring function, and obtains hardware health data such as CPU temperature, voltage, fan rotation speed, power state and the like in real time, so as to ensure the running stability of the system. In addition, an administrator can access a System Event Log (SEL) to record important events such as hardware faults, temperature alarms and the like, so that the problems can be conveniently tracked and checked. The alarm notification function supports triggering key error notifications, such as overheat, power failure and the like, through SNMP traps, emails or LED indicator lamps, and timely reminding an administrator of processing.
Specifically, out-of-Band (Out-of-Band) functionality allows an administrator to remotely control server power (e.g., power on, power off, reboot) through IPMI even if the host operating system is down or not started. In addition, the IPMI 2.0 also supports remote access to a server console through KVM over IP technology for BIOS configuration or fault diagnosis, thereby greatly improving maintainability and fault recovery capability of the system.
Specifically, IPMI 2.0 provides a powerful security feature that supports hierarchical management of user rights, including administrators, operators, and general users, ensuring that only authorized users can perform management operations. In order to ensure the security of the remote management session, the IPMI 2.0 may encrypt session data through an encryption protocol (e.g. HTTPS/IPMI over LAN), so as to effectively prevent possible security threats in the remote management process.
BMC interface
The BMC provides out-of-band management capability and can access diagnostic information through the network even if the host operating system is down. Real-time alarms (e.g., LED indicator light changes, SNMP traps, mail notifications) are supported, ensuring that fault codes can be fed back to the management platform in real time.
Specifically, the BMC (baseboard management controller) provides remote management functions independent of the host operating system, so that an administrator can access the BMC through a dedicated IPMI LAN port (RJ45 1GbE) for operations such as power control (power on, power off, reboot) and hardware diagnostics even in the event of a system shutdown or operating system crash. Out-of-band management ensures that an administrator can still effectively control and manage the system when the operating system cannot be started, improving the reliability and maintainability of the server.
Specifically, the BMC continuously monitors key hardware parameters of the system, including temperatures of the CPU, the memory and the system environment, input/output voltages of the power module, rotational speeds and fault alarms of the fan, and health conditions of the redundant power supply. By monitoring the hardware data in real time, the BMC can timely find out potential hardware problems, and abnormal data can trigger alarms and be recorded in a System Event Log (SEL), so that subsequent fault investigation and system maintenance are facilitated.
Specifically, all hardware events, such as overheating, fan failure, or power anomalies, are logged by the BMC into the System Event Log (SEL). An administrator can view these logs through an IPMI tool (such as ipmitool) or the BMC Web interface, with support for filtering by time, event type, and other criteria, helping the administrator quickly locate fault causes, trace historical events, and ensure continuous and stable operation of the system.
Specifically, the BMC supports various alarm notification modes including LED indicator lights, SNMP traps, email notifications, log triggers, and the like. The information LED of the control panel displays red or blue to indicate serious errors or UID activation and the like, so that an administrator can quickly identify the problem. In addition, the BMC can also send alarm information to the network management platform through SNMP traps or receive alarm details through preconfigured emails, so that timely response to hardware faults is ensured.
Specifically, the BMC provides KVM over IP function, allows an administrator to access the server console remotely through the network, views the BIOS interface or the operating system in real time, supports keyboard, video and mouse operations, and realizes comprehensive remote management. In addition, the BMC also supports virtual media mounting, allows for remote loading of ISO images or virtual drives, facilitates system installation or repair, and does not require physical contact with a server.
Specifically, the BMC has strict security and authority management functions, supports multi-role user classification (such as an administrator, an operator and a common user), can limit sensitive operations such as firmware update, and ensures system security. In addition, the BMC also ensures the safety of data transmission through encryption protocols (such as HTTPS and IPMI over LAN encryption modes) and prevents sensitive data from being leaked in the transmission process. Through the IP filtering function, an administrator may configure the IP address range that allows access to the BMC, further preventing unauthorized access.
Specifically, the BMC supports remote firmware update through a Web interface or a command line tool (such as the Redfish API), ensuring that the system always runs the latest firmware version and improving system performance and security. The BMC also provides a hardware reset function, supporting remote forced restart or restoration of BIOS default settings, simplifying troubleshooting and system recovery. In addition, in combination with BIOS POST codes, the BMC can provide accurate localization of hardware faults, helping an administrator diagnose and resolve hardware problems more quickly.
BIOS POST code
This provides hardware self-test feedback: at system start-up, the BIOS indicates detected hardware problems via POST codes. These codes may be displayed by LEDs on the motherboard or by a connected diagnostic card, assisting in the rapid localization of the fault source as part of local/remote diagnosis.
Specifically, BIOS POST (Power-On Self-Test) code is a core feedback mechanism for hardware Self-Test during system startup. During the boot phase, the BIOS indicates the initialization state of the hardware through a specific hexadecimal or decimal code. Each code represents a different detection step, such as CPU initialization, memory detection, peripheral identification, etc. If a certain detection step is not passed, the code will stay in the wrong place, helping the administrator to locate the fault source quickly. This function is a vital feedback mechanism in the system self-test process, and can provide real-time information of hardware health status.
Specifically, the POST code is presented in two formats: a two-digit hexadecimal number (for example, 0x55 represents a memory detection error) or a four-digit decimal number (for example, 0800 represents completion of PCI device initialization). There are two options for displaying the code, depending on the device configuration. Some motherboards integrate LEDs that display the POST code directly, though a motherboard manual is required for decoding. For devices without integrated LEDs, the POST code can be captured and displayed in real time through an external PCIe or USB diagnostic card, helping an administrator monitor the system state and diagnose problems.
Specifically, POST codes are classified into critical errors and non-critical errors. Critical errors are problems that must be resolved before system startup can proceed; the code stays on a non-zero value, e.g., 0x55 for a memory failure and 0xCC for an unrecognized CPU. The system stops booting and triggers an alarm (e.g., a beep or a solid red LED). Non-critical errors are problems that appear briefly at system start-up, after which code execution continues, e.g., 0xA0 indicates a SATA device detection delay; the system boots normally but records the relevant logs. For non-critical errors, the system will not typically stop booting, but the administrator should review the logs and investigate promptly.
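A POST-code lookup along these lines can be sketched with the example codes given in the text (0x55 memory error, 0xCC unrecognized CPU, 0xA0 SATA detection delay). Actual code tables are vendor-specific; the table below is an illustrative assumption, not a real motherboard's mapping.

```python
# Minimal POST-code decoder using the example codes from the text.
# The severity split mirrors the critical / non-critical classification.
POST_CODES = {
    0x55: ("memory detection error", "critical"),
    0xCC: ("CPU not recognized", "critical"),
    0xA0: ("SATA device detection delay", "non-critical"),
}

def decode_post(code: int) -> str:
    desc, severity = POST_CODES.get(code, ("unknown code", "unknown"))
    return f"0x{code:02X}: {desc} ({severity})"

print(decode_post(0x55))  # 0x55: memory detection error (critical)
```

In practice the administrator would consult the motherboard manual's code table instead of a hard-coded dictionary; the structure of the lookup is the same.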
Specifically, the POST code is tightly integrated with the BMC (baseboard management controller) to provide remote diagnostic support. The POST code may be recorded by the BMC into a System Event Log (SEL) and support remote viewing by IPMI tools (e.g., ipmitool SEL list). An administrator can observe POST code changes during the startup phase through KVM over IP functionality in an out-of-band management mode without physically touching the device.
Specifically, when a start-up failure is encountered, the administrator should record the last POST code displayed and analyze the specific error in conjunction with the motherboard manual or a vendor-provided code table. Basic troubleshooting methods include: if the code indicates a memory error (e.g., 0x55), reseating the memory module or trying a different socket; if the code indicates a CPU failure (e.g., 0xCC), checking the CPU installation or heat sink mounting pressure.
Example III
The storage device supports a dynamic load balancing algorithm, and allocates resources according to the computing tasks.
Specifically, the system is equipped with 24 PCIe 5.0 NVMe drives connected to the CPU through a multi-path I/O (MPIO) architecture. Each NVMe drive has exclusive PCIe channels, supporting concurrent data access. This design allows the algorithm to dynamically allocate storage bandwidth according to task priorities, for example allocating more channel resources to high-priority AI training tasks to achieve load balancing.
Specifically, the motherboard provides 4 PCIe 5.0 expansion slots (2 x16, 2 x8), supporting accelerator cards such as GPUs/FPGAs. The bandwidth allocation of different slots can be dynamically adjusted through the PCIe resource partitioning functions of the BMC and the BIOS. For example, when reasoning tasks surge, x16 slot bandwidth is preferentially allocated to the reasoning acceleration card, while training tasks use the remaining resources.
Specifically, intelligent RAID and data partition management is employed, which supports hardware-level RAID 0/1/5/10 through an Intel RAID Key (JR). In conjunction with the firmware algorithm, the RAID group may dynamically adjust the data slicing strategy, for example distributing hot-spot data across multiple NVMe drives to balance I/O load, or automatically optimizing data distribution according to the computing task type (random read/sequential write).
Specifically, the BMC continuously collects CPU/memory/storage/network utilization data and provides APIs to the outside through the IPMI interface. The third party scheduling system may implement dynamic load balancing based on such data, such as migrating portions of the tasks to low load nodes when a storage bandwidth bottleneck is detected.
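The third-party scheduling behavior described here (migrate tasks off a node whose storage bandwidth is bottlenecked, using BMC-reported utilization) can be sketched as follows. The node fields and the 0.9 bottleneck threshold are illustrative assumptions.

```python
# Sketch of a third-party scheduler reacting to BMC utilization data.
# Node fields and the 0.9 bottleneck threshold are assumed for illustration.
def pick_migrations(nodes: dict[str, float], tasks: dict[str, str]) -> dict[str, str]:
    """nodes: node -> storage-bandwidth utilization; tasks: task -> node.
    Returns task -> target-node moves for tasks on bottlenecked nodes."""
    moves = {}
    for task, node in tasks.items():
        if nodes[node] > 0.9:                    # storage bandwidth bottleneck
            target = min(nodes, key=nodes.get)   # lowest-load node
            if target != node:
                moves[task] = target
    return moves

nodes = {"node-a": 0.95, "node-b": 0.30}
print(pick_migrations(nodes, {"backup": "node-a"}))  # {'backup': 'node-b'}
```

A real scheduler would poll these utilization figures through the BMC's IPMI-exposed API rather than receive them as a dictionary; the migration decision itself is the same shape.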
Specifically, the hardware supports the NVMe over Fabrics (NVMe-oF) protocol and storage pooling (via the M.2 expansion slot). In combination with hyper-converged infrastructure (HCI) software, storage resources may be dynamically allocated by a virtualization layer, such as allocating cache space as needed for GPU compute nodes.
Specifically, a dual 2000W titanium power supply and 6 high speed fans ensure stability of resource allocation. When detecting a power or heat dissipation anomaly, the BMC may automatically trigger load migration (e.g., reduce non-critical task resource quota), preferentially guaranteeing resource provisioning for core computing tasks.
Example IV
In the special high-performance storage device designed for intelligent computation training and reasoning, the storage unit, the processor and the memory realize cooperative scheduling through the optimized BIOS.
Specifically, the BIOS of the motherboard (X13 SEB-TF) supports PCIe resource partitioning and priority adjustment. The PCIe 5.0 channel bandwidth can be manually or automatically allocated through the UEFI interface, for example, a fixed channel is reserved for 24 NVMe drivers, and meanwhile, the residual bandwidth is dynamically adjusted to the GPU/FPGA accelerator card directly connected with the CPU, so that the resource balance of storage and calculation tasks is ensured.
Specifically, the BIOS provides NUMA (non-uniform memory access) configuration options that allow binding of memory to CPU core associations according to processor topology. For example, in an AI training task, a high-frequency memory is allocated to a specific CPU core, so as to reduce data access delay, and simultaneously improve the memory bandwidth utilization rate through a memory interleaving mode.
Specifically, the queue depth and prefetch policy of the NVMe controller may be configured in the BIOS. For example, for a random read intensive reasoning task, a high queue depth (e.g., 1024) and aggressive prefetching are enabled, and for a sequential write training task, the write cache refresh frequency is optimized, reducing latency.
Specifically, dynamic power consumption adjustment is achieved through CPU C/P state settings of BIOS and storage device power saving modes (e.g., NVMe APST). For example, at low load, the CPU frequency is reduced and active links of part of NVMe drivers are closed, at high load, turbo mode is enabled and full bandwidth is allocated, balancing performance and energy efficiency.
Specifically, the BIOS is linked with the BMC to collect data such as CPU temperature, memory error rate, NVMe delay and the like in real time. Policy adjustments are triggered based on thresholds, such as limiting storage bandwidth priority when the CPU overheats, or migrating data to healthy channels when memory ECC error rates increase.
Specifically, the BIOS enables Volume Management Device (VMD) technology, allowing the CPU to manage the NVMe drives directly, bypassing the traditional storage controller bottleneck. At the same time, Data Direct I/O (DDIO) is supported, writing stored data directly into the CPU cache, reducing memory transfer latency and improving cooperation efficiency.
The technical scheme provides special high-performance storage equipment designed for intelligent computing training and reasoning, and a plurality of high-speed interface hard disk drive slots and independent data bus channels are configured in each storage node, so that each driver can be directly connected with a processor, non-blocking concurrent data access is realized, and data processing efficiency is improved. The processor supports expandability, can be flexibly configured according to different requirements, is provided with an ECC memory, and enhances the reliability and data integrity of the system. The high-speed expansion slot provides an installation interface for the external reasoning acceleration card, and further optimizes the calculation performance of the intelligent reasoning task. In summary, the storage device has the advantages of high concurrency, high performance, scalability and high reliability, and can meet the data storage and fast processing requirements in large-scale intelligent computing tasks.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.
The foregoing is merely specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the disclosure are intended to be covered by the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (10)

1. A dedicated high-performance storage device designed for intelligent computing training and reasoning, comprising:
a chassis, and at least two storage nodes disposed within the chassis;
each storage node comprises the following components arranged on a main board:
the storage unit, comprising a plurality of high-speed-interface hard disk drive slots located at the front of the chassis and configured to accommodate a plurality of drives, each drive exclusively occupying one high-speed data bus channel and connecting directly to the processor to support non-blocking concurrent data access;
the processor, connected with the storage unit and the memory unit and expandable to at least one additional processor;
The memory unit is connected with the processor and supports an Error Correction Code (ECC) memory;
And the high-speed expansion slot is connected with the processor and the storage unit and is used for installing an external reasoning acceleration card.
2. The dedicated high-performance storage device designed for intelligent computing training and reasoning of claim 1, wherein each storage node has a built-in remote diagnosis function supporting real-time feedback of fault codes;
The remote diagnosis function is realized through a baseboard management controller BMC, wherein the BMC is an integrated component of a main board of a storage node, operates independently of an operating system and is accessed remotely through an IPMI local area network port;
the BMC is used for collecting hardware state data of the processor, the memory unit, the storage unit and the power module in real time, deriving hardware abnormal events from the hardware state data, converting each hardware abnormal event into a standardized fault code, and recording the standardized fault code to the system event log (SEL);
the storage device supports real-time viewing of fault codes in the system event log SEL through an IPMI tool or a network interface.
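The fault-code pipeline of claim 2 (raw hardware events mapped to standardized codes and appended to a system event log) can be sketched as follows. The code table, event names, and log record layout are assumptions for illustration only.

```python
# Illustrative mapping from hardware events to standardized fault codes.
FAULT_CODES = {
    "cpu_overtemp": "0x01A0",
    "ecc_uncorrectable": "0x02B1",
    "drive_offline": "0x03C2",
    "psu_failure": "0x04D3",
}

class BmcEventLogger:
    def __init__(self):
        self.sel = []  # system event log (SEL), newest entry last

    def report(self, event: str, component: str) -> str:
        """Convert a hardware abnormal event into a fault code and log it."""
        code = FAULT_CODES.get(event, "0xFFFF")  # catch-all for unknown events
        self.sel.append({"code": code, "component": component, "event": event})
        return code

bmc = BmcEventLogger()
code = bmc.report("cpu_overtemp", "CPU0")
```

An IPMI tool querying the SEL would then see the standardized code rather than a vendor-specific event string.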
3. The storage device of claim 2, wherein the hardware status data includes a temperature of a processor, a fan speed, and a status of a power module, the storage device supporting fault alerting via an indicator light, a simple network management protocol SNMP trap, or an email notification.
4. The storage device of claim 3, wherein the remote diagnostic function supports power-module control and hardware status diagnostics through out-of-band management when the operating system is down or fails to start;
the BMC secures remote management sessions through encryption protocols, the encryption protocols comprising HTTPS or the encrypted IPMI-over-LAN protocol, and the BMC further provides hardware fault location through BIOS POST codes.
5. The storage device of claim 1, wherein each storage node is equipped with 24 PCIe 5.0 NVMe drives, each drive exclusively occupying one PCIe channel to support multitasking concurrent data access;
and a redundant array of independent disks (RAID) controller is arranged on the motherboard of the storage node, the RAID controller executing a dynamic load balancing algorithm that dynamically allocates storage bandwidth resources according to the priority of the computing task.
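The priority-driven allocation that claim 5 attributes to the RAID controller can be sketched as a proportional split: each task receives bandwidth in proportion to its priority weight. The task names, weights, and the total bandwidth figure below are assumptions for the example.

```python
def allocate_bandwidth(tasks: dict, total_gbps: float) -> dict:
    """Split total storage bandwidth in proportion to each task's priority."""
    weight_sum = sum(tasks.values())
    return {name: total_gbps * prio / weight_sum for name, prio in tasks.items()}

# Hypothetical workload mix: training outranks inference, which outranks backup.
shares = allocate_bandwidth({"training": 3, "inference": 2, "backup": 1}, 60.0)
# training gets 30.0 Gbps, inference 20.0 Gbps, backup 10.0 Gbps
```

A real controller would recompute these shares as tasks arrive and complete; the proportional rule is just the simplest policy consistent with "allocate by priority".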
6. The storage device of claim 5, wherein the motherboard of the storage device provides 4 PCIe 5.0 expansion slots, wherein the first slot and the second slot are 2 PCIe 5.0 x16 slots, the third slot and the fourth slot are 2 PCIe 5.0 x8 slots, and the expansion slots are located in a rear area of the motherboard for mounting external reasoning acceleration cards.
7. The storage device of claim 6, wherein the dynamic load balancing algorithm dynamically configures PCIe bandwidth resources by:
the BMC monitors in real time the task load of the GPU acceleration card or FPGA acceleration card connected to the expansion slot, wherein the task load is quantified by the following index: the number of I/O requests of the reasoning task per unit time;
when any index in the task load exceeds a preset threshold, the BMC sends a PCIe resource configuration instruction to the BIOS;
the BIOS performs the following operations according to the resource configuration instruction: switching the PCIe 5.0 x16 link of the first slot or the second slot from the default x16 single-channel mode to two independent x8 channels, dedicating one x8 channel to the high-priority reasoning acceleration card and reserving the other x8 channel for expansion cards other than the reasoning acceleration card, the bandwidth resource configuration process being realized through the PCIe link splitting function of the BIOS;
and when the task load falls below the preset threshold, the BMC notifies the BIOS to restore the default x16 single-channel mode.
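The claim-7 flow is essentially a two-state machine: exceed the threshold and the x16 link splits into two x8 links; fall back below it and the default mode is restored. The threshold value and load units below are illustrative assumptions.

```python
class PcieSlot:
    """Minimal sketch of the x16 <-> x8+x8 link-split decision in claim 7."""

    def __init__(self, threshold: float):
        self.threshold = threshold  # I/O requests per unit time
        self.mode = "x16"           # default single-channel mode

    def update(self, io_requests_per_sec: float) -> str:
        if io_requests_per_sec > self.threshold and self.mode == "x16":
            # One x8 for the high-priority inference card, one x8 reserved.
            self.mode = "x8+x8"
        elif io_requests_per_sec < self.threshold and self.mode == "x8+x8":
            self.mode = "x16"       # restore the default when load subsides
        return self.mode

slot = PcieSlot(threshold=10000)
mode_high = slot.update(15000)  # load exceeds threshold -> split
mode_low = slot.update(2000)    # load falls below -> restore
```

In the device itself this decision is taken by the BMC and carried out by the BIOS over a reboot-time or hot-reconfiguration path; the sketch captures only the policy, not the mechanism.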
8. The storage device of claim 7, wherein the BMC is configured to continuously collect utilization data of the processor, the memory unit, the storage unit and the network, and to expose an API through the IPMI interface, so that an external management system can query the real-time utilization data via the API, trigger the dynamic load balancing algorithm to adjust the PCIe bandwidth resource allocation, and receive real-time alerts of hardware exceptions.
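The telemetry interface of claim 8 can be sketched as a BMC that keeps the latest utilization samples and answers queries with the data plus an alert flag. The metric names, the 90% alert limit, and the response shape are assumptions for illustration.

```python
class TelemetryBmc:
    """Sketch of a BMC-side utilization endpoint queried by a manager."""

    ALERT_LIMIT = 90.0  # percent; assumed threshold, not from the patent

    def __init__(self):
        self.samples = {"cpu": 0.0, "memory": 0.0, "storage": 0.0, "network": 0.0}

    def collect(self, **metrics: float) -> None:
        """Record the latest utilization sample for each named resource."""
        self.samples.update(metrics)

    def query(self) -> dict:
        """API call: return current utilization plus an exception-alert flag."""
        alert = any(v > self.ALERT_LIMIT for v in self.samples.values())
        return {"utilization": dict(self.samples), "alert": alert}

bmc = TelemetryBmc()
bmc.collect(cpu=95.0, memory=40.0)
resp = bmc.query()
```

A real deployment would serve this over the IPMI LAN interface or a Redfish-style HTTP endpoint; the sketch shows only the query/alert contract.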
9. The storage device of claim 1, wherein the storage device is equipped with dual 2000 W titanium-grade power supplies and a plurality of fans;
the BMC is further configured to automatically trigger load migration upon detecting a power-module anomaly or a cooling anomaly.
10. The dedicated high-performance storage device designed for intelligent computing training and reasoning of claim 1, wherein the storage unit, the processor and the memory unit are co-scheduled by an optimized BIOS.
CN202510484624.3A 2025-04-17 2025-04-17 A dedicated high-performance storage device designed for intelligent computing training and reasoning Pending CN120406839A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510484624.3A CN120406839A (en) 2025-04-17 2025-04-17 A dedicated high-performance storage device designed for intelligent computing training and reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510484624.3A CN120406839A (en) 2025-04-17 2025-04-17 A dedicated high-performance storage device designed for intelligent computing training and reasoning

Publications (1)

Publication Number Publication Date
CN120406839A true CN120406839A (en) 2025-08-01

Family

ID=96503748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510484624.3A Pending CN120406839A (en) 2025-04-17 2025-04-17 A dedicated high-performance storage device designed for intelligent computing training and reasoning

Country Status (1)

Country Link
CN (1) CN120406839A (en)

Similar Documents

Publication Publication Date Title
CN107122321B (en) Hardware repair method, hardware repair system, and computer-readable storage device
US11640377B2 (en) Event-based generation of context-aware telemetry reports
US8392756B2 (en) Storage apparatus and method of detecting power failure in storage apparatus
US10846159B2 (en) System and method for managing, resetting and diagnosing failures of a device management bus
US11659695B2 (en) Telemetry system supporting identification of data center zones
US11100228B2 (en) System and method to recover FPGA firmware over a sideband interface
US12282391B2 (en) Systems and methods for error recovery in rebootless firmware updates
US10853211B2 (en) System and method for chassis-based virtual storage drive configuration
US20130080796A1 (en) Storage system and its control method
US11782810B2 (en) Systems and methods for automated field replacement component configuration
US20210226876A1 (en) Systems and methods for extended support of deprecated products
US10853204B2 (en) System and method to detect and recover from inoperable device management bus
US20230315437A1 (en) Systems and methods for performing power suppy unit (psu) firmware updates without interrupting a user's datapath
US11422744B2 (en) Network-wide identification of trusted disk group clusters
US20230075055A1 (en) Method and system for providing life cycle alert for flash memory device
US11307871B2 (en) Systems and methods for monitoring and validating server configurations
US12314146B2 (en) Systems and methods for configuration of witness sleds
US11977877B2 (en) Systems and methods for personality based firmware updates
US12236087B2 (en) Systems and methods for supporting NVMe SSD rebootless firmware updates
CN120406839A (en) A dedicated high-performance storage device designed for intelligent computing training and reasoning
US20240103836A1 (en) Systems and methods for topology aware firmware updates in high-availability systems
US20240103844A1 (en) Systems and methods for selective rebootless firmware updates
US20240103828A1 (en) Systems and methods for thermal monitoring during firmware updates
US20240103848A1 (en) Systems and methods for firmware updates in cluster environments
US20230078518A1 (en) Systems and methods for collapsing resources used in cloud deployments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination