CN118316778B - UAR page allocation method and device for IB network card - Google Patents
- Publication number
- CN118316778B CN118316778B CN202410417373.2A CN202410417373A CN118316778B CN 118316778 B CN118316778 B CN 118316778B CN 202410417373 A CN202410417373 A CN 202410417373A CN 118316778 B CN118316778 B CN 118316778B
- Authority
- CN
- China
- Prior art keywords
- uar
- page
- linked list
- space
- pages
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 37
- 238000013507 mapping Methods 0.000 claims abstract description 5
- 238000010586 diagram Methods 0.000 description 6
- 239000002699 waste material Substances 0.000 description 4
- 238000004891 communication Methods 0.000 description 3
- 238000003780 insertion Methods 0.000 description 3
- 230000037431 insertion Effects 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 238000013461 design Methods 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 238000012423 maintenance Methods 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000011112 process operation Methods 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 239000013589 supplement Substances 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/02—Standardisation; Integration
- H04L41/0246—Exchanging or transporting network management information using the Internet; Embedding network management web servers in network elements; Web-services-based protocols
- H04L41/0253—Exchanging or transporting network management information using the Internet; Embedding network management web servers in network elements; Web-services-based protocols using browsers or web-pages for accessing management information
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a UAR page allocation method and device for an IB network card. The method comprises: dividing a PCIE base address register space into UAR pages; mapping the address of each UAR page into a corresponding host side starting address by utilizing its offset from the PCIE base address; initializing a global linked list and adding the starting address corresponding to each UAR page into the global linked list; and, when a queue pair QP is created, taking a UAR page out of the global linked list and allocating a free WQE space of the UAR page to the queue pair, so that each queue pair obtains an independent PCIE base address register space. The technical scheme of the invention avoids the UAR page resource contention caused by the concurrency of multiple QPs, and improves the performance of the network card.
Description
Technical Field
The invention belongs to the field of network communication, and particularly relates to a UAR page allocation method and device for an IB network card.
Background
RDMA (Remote Direct Memory Access) is an efficient network communication technology that allows memory data to be exchanged directly between computers without going through the operating system. A key advantage of RDMA is that it reduces latency and CPU load during data transfer. BlueFlame is an important supplement to conventional RDMA technology: by optimizing memory write operations it improves the efficiency and performance of overall network communication, allows the network card hardware to directly construct and process network packets, and provides a more efficient way to handle RDMA memory write operations, reducing dependence on the CPU and thereby lowering delay and improving performance. UAR (User Access Region) pages are regions of memory used to store the control information necessary for RDMA operations. UAR pages allow HCAs (Host Channel Adapters, high-performance network cards) to access this control information efficiently in order to perform network operations. Each UAR page contains a plurality of UARs, each UAR corresponding to a particular network operation or queue. BlueFlame and UAR pages work together to improve the efficiency of data packet processing and ensure quick access to the control information required by an operation, jointly optimizing the RDMA process, reducing delay and improving throughput.
However, BlueFlame WQEs (Work Queue Elements, work queue entries used to specify and control RDMA operations) are aligned to 64 bytes, so if they are submitted directly through the UAR, the host side needs to make an extra copy. In a multi-threaded or multi-process environment, if multiple threads or processes share the same QP and UAR page, they may attempt to write to the UAR page at the same time, resulting in race conditions and performance problems.
The current Mellanox IB network card takes the requirements of resource management and performance optimization into account in its design and allocates 16 UAR pages in each context, as shown in fig. 1, where a context contains the data structures and information associated with a particular IB device or resource. This prevents race condition problems as the number of contexts increases. However, when a large number of QPs (Queue Pairs) are created in the same context, the concurrency of these QPs still causes race conditions to occur, thereby affecting performance.
Disclosure of Invention
The invention aims to provide a UAR page allocation method and device for an IB network card, aiming at improving the network card performance under the concurrent condition of multiple QPs.
According to a first aspect of the present invention, there is provided a UAR page allocation method for an IB network card, including:
dividing a PCIE base address register space into UAR pages, mapping the address of each UAR page into a corresponding host side starting address by utilizing its offset from the PCIE base address, initializing a global linked list, and adding the starting address corresponding to each UAR page into the global linked list;
when a queue pair QP is created, taking a UAR page out of the global linked list, and allocating a free WQE space of the UAR page to the queue pair, so that each queue pair obtains an independent PCIE base address register space.
Preferably, the adding each UAR page to the global linked list further includes:
and sequentially inserting the starting address corresponding to each UAR page into the tail part of the global linked list.
Preferably, the number of UAR pages is independent of the process or thread context.
Preferably, the taking a UAR page out of the global linked list further includes:
taking the current UAR page out of the head of the global linked list, deleting it from the head of the linked list, and then adding it to the tail of the global linked list.
Preferably, the allocating the free WQE space of the UAR page to the queue pair further includes:
if the current UAR page has a free WQE space, allocating the free WQE space to the current queue pair;
if the current UAR page has no free WQE space, continuing to take subsequent UAR pages out of the global linked list until a UAR page with a free WQE space is found, and allocating that free WQE space to the current queue pair.
According to a second aspect of the present invention, there is provided a UAR page allocating apparatus for an IB network card, comprising:
The linked list generating unit is used for dividing a PCIE base address register space into UAR pages, mapping the address of each UAR page into a corresponding host side starting address by utilizing its offset from the PCIE base address, initializing a global linked list, and adding the starting address corresponding to each UAR page into the global linked list;
And the page allocation unit is used for taking a UAR page out of the global linked list when a queue pair QP is created, and allocating a free WQE space of the UAR page to the queue pair, so that each queue pair obtains an independent PCIE base address register space.
Compared with the related art, the technical scheme of the invention has the following advantages:
UAR pages are managed by means of a global linked list, which removes the limit on the number of UAR pages owned by one process and utilizes the whole BAR0 space. It can be ensured that no race condition is generated when all QPs in one context run concurrently, so that software performance is guaranteed. Access conflicts of multiple QPs to the BAR are avoided: each QP can access an independent BAR, all QPs in one process or thread can monopolize the resources of the BAR space, and waste of BAR space resources is reduced. When all QPs in one process or thread operate concurrently, lock-free operation can be realized at the software level, reducing the CPU resources wasted on lock contention. Since every QP monopolizes a doorbell region in a UAR page, the performance bottleneck caused by UAR page resource contention under the concurrency of multiple QPs is avoided.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure and process particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the related art, the drawings required by the embodiments or the description of the related art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that other drawings may be obtained from them by a person skilled in the art without any inventive effort.
Fig. 1 is a schematic diagram of UAR page allocation according to the related art.
Fig. 2 is a flow chart of a UAR page allocation method according to the invention.
Fig. 3 is a schematic diagram of splitting a PCIE BAR into UAR pages according to the present invention.
FIG. 4 is a diagram of a UAR page global linked list in accordance with the present invention.
FIG. 5 is a schematic diagram of a UAR page global linked list fetch and join process in accordance with the present invention.
FIG. 6 is a diagram of allocation states of WQE space and QP according to the present invention.
Fig. 7 is a schematic diagram of UAR page allocation situation according to the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which are derived by a person skilled in the art from the embodiments according to the invention without creative efforts, fall within the protection scope of the invention.
Based on the above analysis, the present invention proposes a UAR page allocation method and apparatus for an IB network card, which removes the driver's restriction of the PCIE BAR (Base Address Register) space on QPs and does not classify UAR pages in the BAR space, so that each QP in every context can exclusively use one doorbell area in a UAR page, i.e. one WQE space.
Referring to the flowchart of fig. 2, the UAR page allocation method for the IB network card provided by the invention includes:
Step 101, dividing a PCIE base address register space into UAR pages, mapping addresses of the UAR pages into corresponding host side starting addresses by utilizing offsets of PCIE base addresses, initializing a global linked list, and adding the starting addresses corresponding to each UAR page into the global linked list.
As shown in FIG. 3, the 32M space of BAR0 is first split entirely into UAR pages of 4K size. A global linked list UAR_Page_list of UAR pages is then created at the host side, and the UAR pages are managed with a dynamic double-ended queue strategy. The strategy employs a dynamically resized double-ended queue (UAR_Page_list) that supports efficient insert and delete operations at both ends. This property allows a double-ended queue to be used both as a stack and as a queue, providing great flexibility. In the invention, UAR pages are added with a tail insertion strategy: a new UAR page is always added to the tail of the linked list. Tail insertion keeps the UAR pages in order, so that the most recently added UAR page is always in the last position of the linked list, and maintenance and expansion can be achieved through the flexibility of the double-ended queue. If all UARs of the current BAR space are in use, a tail-delete and head-reuse strategy may be applied to reuse a UAR page.
The UAR pages are converted by address mapping into addresses that the host can access directly. The addresses of all UAR pages allocated in the BAR0 space are stored in the global linked list UAR_Page_list: from UAR page 0 to UAR page 8191, the address of each UAR page is inserted in turn at the tail of the linked list, as shown in fig. 4.
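The splitting and tail-insertion steps above can be sketched in C as follows. The structure names (`uar_page`, `uar_page_list`), their field layout, and the function names are hypothetical illustrations of the scheme, not the actual driver's definitions:

```c
#include <stdlib.h>

#define BAR0_SIZE      (32u * 1024 * 1024)          /* 32M BAR0 space  */
#define UAR_PAGE_SIZE  (4u * 1024)                  /* 4K per UAR page */
#define UAR_PAGE_COUNT (BAR0_SIZE / UAR_PAGE_SIZE)  /* 8192 UAR pages  */

/* Hypothetical node: one UAR page and its mapped host side start address. */
struct uar_page {
    unsigned long host_addr;       /* BAR0 base + page offset */
    struct uar_page *prev, *next;
};

/* Double-ended global list (UAR_Page_list): O(1) insert/delete at both ends. */
struct uar_page_list {
    struct uar_page *head, *tail;
    unsigned long count;
};

/* Tail insertion: a newly added UAR page always becomes the last element,
 * so the list preserves page order (page 0 ... page 8191). */
static void uar_list_push_tail(struct uar_page_list *l, struct uar_page *p)
{
    p->next = NULL;
    p->prev = l->tail;
    if (l->tail)
        l->tail->next = p;
    else
        l->head = p;
    l->tail = p;
    l->count++;
}

/* Step 101: split BAR0 into 4K UAR pages, map each page to a host side
 * start address via its offset from the BAR0 base, and append it in order. */
static void uar_list_init(struct uar_page_list *l, unsigned long bar0_base)
{
    l->head = l->tail = NULL;
    l->count = 0;
    for (unsigned long i = 0; i < UAR_PAGE_COUNT; i++) {
        struct uar_page *p = malloc(sizeof(*p));
        p->host_addr = bar0_base + i * UAR_PAGE_SIZE;
        uar_list_push_tail(l, p);
    }
}
```

Because insertion happens only at the tail, building the list of all 8192 pages is linear in the page count, and the head/tail pointers keep both ends reachable in constant time.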
Since a single global linked list is used to maintain all UAR pages, the list is independent of the context of any thread or process.
And 102, when a queue pair QP is created, taking UAR pages out of the global linked list, and distributing the free WQE space of the UAR pages to the queue pairs so that each queue pair obtains independent PCIE base address register space.
When a QP is first created, one UAR page is fetched from the head of the linked list and assigned to the QP; the fetched UAR page is deleted from the head of the linked list and then added to the tail of the linked list, as shown in fig. 5. A head extraction policy, corresponding to the tail insertion policy, is employed when fetching UAR pages, which allows an element to be removed quickly from the head of the linked list. The time complexity of this operation is O(1), i.e. it completes in constant time, guaranteeing efficient access.
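This head-fetch-then-tail-reinsert rotation can be illustrated with the minimal self-contained sketch below (type and function names hypothetical; a real driver would additionally need locking or lock-free list primitives around the shared list):

```c
#include <stddef.h>

/* Hypothetical node and header mirroring the global UAR page list. */
struct uar_page {
    int id;
    struct uar_page *prev, *next;
};

struct uar_page_list {
    struct uar_page *head, *tail;
};

/* Head extraction + tail re-insertion, both O(1): take the UAR page at the
 * head for the new QP, then append it at the tail, so allocation cycles
 * round-robin over every UAR page carved out of the BAR0 space. */
static struct uar_page *uar_list_rotate(struct uar_page_list *l)
{
    struct uar_page *p = l->head;
    if (p == NULL)
        return NULL;

    /* delete from head */
    l->head = p->next;
    if (l->head)
        l->head->prev = NULL;
    else
        l->tail = NULL;

    /* add to tail */
    p->next = NULL;
    p->prev = l->tail;
    if (l->tail)
        l->tail->next = p;
    else
        l->head = p;
    l->tail = p;
    return p;
}
```

After every fetch, the just-used page sits at the tail, so the page picked next is always the one least recently handed to a QP.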
The QP created by each context obtains UAR pages from the linked list, and there is no limit to the number of UAR pages in one context. One UAR page contains 4 WQE spaces, each available for one QP, as shown in fig. 6.
In an alternative embodiment, if the current UAR page has no free WQE space, indicating that all WQE spaces in the current UAR page have been allocated to previous QPs, subsequent UAR pages continue to be fetched from the global linked list until a free WQE space is found and allocated to the current QP.
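The free-WQE-space search of step 102 can be sketched with a small per-page bitmap of the 4 WQE slots. The `wqe_used` field and function names are hypothetical, and a contiguous array stands in for walking the linked list:

```c
#include <stddef.h>

#define WQE_PER_UAR_PAGE 4  /* one UAR page holds 4 WQE spaces, one per QP */

/* Hypothetical per-page state: a bitmap of allocated WQE slots. */
struct uar_page {
    unsigned wqe_used;  /* bits 0..3: 1 = WQE space already given to a QP */
};

/* Try to allocate a free WQE slot in a page; returns slot index or -1. */
static int uar_page_alloc_wqe(struct uar_page *p)
{
    for (int slot = 0; slot < WQE_PER_UAR_PAGE; slot++) {
        if (!(p->wqe_used & (1u << slot))) {
            p->wqe_used |= 1u << slot;
            return slot;
        }
    }
    return -1;  /* page full: caller moves on to the next UAR page */
}

/* Walk pages in list order until a free WQE space is found and bind it
 * to the current QP; *page_idx reports which page was used. */
static int qp_bind_wqe(struct uar_page *pages, size_t npages, size_t *page_idx)
{
    for (size_t i = 0; i < npages; i++) {
        int slot = uar_page_alloc_wqe(&pages[i]);
        if (slot >= 0) {
            *page_idx = i;
            return slot;
        }
    }
    return -1;  /* no free WQE space in any fetched page */
}
```

With 8192 UAR pages of 4 WQE spaces each, this per-slot allocation is what yields the 32K independent doorbell regions per process described below.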
Therefore, referring to fig. 7, up to 32K QPs created in one process can each independently use the doorbell region in its UAR page, effectively avoiding the performance impact caused by multiple QPs in one process sharing a single UAR page.
As described above, in the related art a process contains at most 16 UAR pages, so when more than 64 QPs run concurrently in one process, race conditions occur and performance suffers. Compared with the related art, the UAR page allocation method for the IB network card manages UAR pages by means of a global linked list, which removes the limit on the number of UAR pages owned by one process and makes full use of the whole BAR0 space. No race condition is generated when all QPs in one context run concurrently, so that software performance is guaranteed; access conflicts of multiple QPs to the BAR are avoided, each QP can access an independent BAR, all QPs in one process or thread can monopolize the resources of the BAR space, and waste of BAR space resources is reduced. When all QPs in one process or thread operate concurrently, lock-free operation can be realized at the software level, reducing the CPU resources wasted on lock contention. Since every QP monopolizes a doorbell region in a UAR page, the performance bottleneck caused by UAR page resource contention under the concurrency of multiple QPs is avoided.
Accordingly, the present invention provides in a second aspect a UAR page allocation apparatus for an IB network card, comprising:
The linked list generating unit is used for dividing a PCIE base address register space into UAR pages, mapping the address of each UAR page into a corresponding host side starting address by utilizing its offset from the PCIE base address, initializing a global linked list, and adding the starting address corresponding to each UAR page into the global linked list;
And the page allocation unit is used for taking a UAR page out of the global linked list when a queue pair QP is created, and allocating a free WQE space of the UAR page to the queue pair, so that each queue pair obtains an independent PCIE base address register space.
The above device may be implemented by the UAR page allocation method for an IB network card provided in the embodiment of the first aspect; for the specific implementation, reference may be made to the description in the embodiment of the first aspect, which is not repeated here.
It is understood that the storage structures, names and parameters described in the above embodiments are only examples. Those skilled in the art may also make and adjust the structural features of the above embodiments as desired without limiting the inventive concept to the specific details of the examples described above.
Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that modifications may be made to the technical solutions described in the foregoing embodiments or equivalents may be substituted for some of the technical features thereof, and these modifications or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention in essence.
Claims (8)
1. A UAR page allocation method for an IB network card, comprising:
dividing a PCIE base address register space into UAR pages, mapping the address of each UAR page into a corresponding host side starting address by utilizing its offset from the PCIE base address, initializing a global linked list, and adding the starting address corresponding to each UAR page into the global linked list;
when a queue pair QP is created, taking a UAR page out of the global linked list, and allocating a free WQE space of the UAR page to the queue pair, so that each queue pair obtains an independent PCIE base address register space;
wherein the allocating the free WQE space of the UAR page to the queue pair further includes:
if the current UAR page has a free WQE space, allocating the free WQE space to the current queue pair;
if the current UAR page has no free WQE space, continuing to take subsequent UAR pages out of the global linked list until a UAR page with a free WQE space is found, and allocating that free WQE space to the current queue pair.
2. The UAR page allocation method for an IB network card according to claim 1, wherein adding a start address corresponding to each UAR page to the global linked list further comprises:
and sequentially inserting the starting address corresponding to each UAR page into the tail part of the global linked list.
3. The UAR page allocation method for an IB network card of claim 1, wherein the number of UAR pages is independent of process or thread context.
4. The UAR page allocation method for an IB network card of claim 1, wherein said retrieving a UAR page from said global linked list further comprises:
And taking out the current UAR page from the head of the global linked list, deleting the current UAR page from the head of the linked list, and then adding the current UAR page to the tail of the global linked list.
5. A UAR page allocation apparatus for an IB network card, comprising:
The linked list generating unit is used for dividing a PCIE base address register space into UAR pages, mapping the address of each UAR page into a corresponding host side starting address by utilizing its offset from the PCIE base address, initializing a global linked list, and adding the starting address corresponding to each UAR page into the global linked list;
The page allocation unit is used for taking a UAR page out of the global linked list when a queue pair QP is created, and allocating a free WQE space of the UAR page to the queue pair, so that each queue pair obtains an independent PCIE base address register space;
The page allocation unit is further configured to:
if the current UAR page has a free WQE space, allocate the free WQE space to the current queue pair;
if the current UAR page has no free WQE space, continue to take subsequent UAR pages out of the global linked list until a UAR page with a free WQE space is found, and allocate that free WQE space to the current queue pair.
6. The UAR page allocation apparatus for an IB network card as defined in claim 5, wherein said linked list generating unit is further configured to:
and sequentially inserting the starting address corresponding to each UAR page into the tail part of the global linked list.
7. The UAR page allocation apparatus for an IB network card of claim 5, wherein the number of UAR pages is independent of process or thread context.
8. The UAR page allocation apparatus for an IB network card according to claim 7, wherein said page allocation unit is further configured to:
And taking out the current UAR page from the head of the global linked list, deleting the current UAR page from the head of the linked list, and then adding the current UAR page to the tail of the global linked list.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410417373.2A CN118316778B (en) | 2024-04-08 | 2024-04-08 | UAR page allocation method and device for IB network card |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410417373.2A CN118316778B (en) | 2024-04-08 | 2024-04-08 | UAR page allocation method and device for IB network card |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118316778A CN118316778A (en) | 2024-07-09 |
CN118316778B true CN118316778B (en) | 2024-12-13 |
Family
ID=91730680
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410417373.2A Active CN118316778B (en) | 2024-04-08 | 2024-04-08 | UAR page allocation method and device for IB network card |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118316778B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114153785A (en) * | 2021-11-29 | 2022-03-08 | 北京志凌海纳科技有限公司 | Memory management method and device based on remote direct memory access |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7543290B2 (en) * | 2001-04-11 | 2009-06-02 | Mellanox Technologies Ltd. | Multiple queue pair access with single doorbell |
US8051212B2 (en) * | 2001-04-11 | 2011-11-01 | Mellanox Technologies Ltd. | Network interface adapter with shared data send resources |
CN105867945B (en) * | 2016-04-20 | 2018-11-16 | 华为技术有限公司 | A kind of BIOS starting method and device |
CN112804372B (en) * | 2020-12-31 | 2023-03-24 | 武汉思普崚技术有限公司 | User terminal grouping method, system, device and storage medium |
CN114816528B (en) * | 2022-04-30 | 2025-01-24 | 苏州浪潮智能科技有限公司 | A method, device, electronic device and medium for accessing a register |
CN117724874B (en) * | 2024-02-06 | 2024-04-26 | 珠海星云智联科技有限公司 | Method, computer device and medium for managing shared receive queues |
-
2024
- 2024-04-08 CN CN202410417373.2A patent/CN118316778B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114153785A (en) * | 2021-11-29 | 2022-03-08 | 北京志凌海纳科技有限公司 | Memory management method and device based on remote direct memory access |
Also Published As
Publication number | Publication date |
---|---|
CN118316778A (en) | 2024-07-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113535633B (en) | On-chip caching device and read-write method | |
US8549521B2 (en) | Virtual devices using a plurality of processors | |
US9448857B2 (en) | Memory access method for parallel computing | |
CN109690498B (en) | Memory management method and equipment | |
US10209900B2 (en) | Buffer allocation and memory management using mapping table | |
US9086920B2 (en) | Device for managing data buffers in a memory space divided into a plurality of memory elements | |
US20080162834A1 (en) | Task Queue Management of Virtual Devices Using a Plurality of Processors | |
US11989588B2 (en) | Shared memory management method and device | |
EP3077914B1 (en) | System and method for managing and supporting virtual host bus adaptor (vhba) over infiniband (ib) and for supporting efficient buffer usage with a single external memory interface | |
JP6584529B2 (en) | Method and apparatus for accessing a file and storage system | |
CN110162395B (en) | Memory allocation method and device | |
US7552232B2 (en) | Speculative method and system for rapid data communications | |
WO2024036985A1 (en) | Storage system, computational storage processor and solid-state drive thereof, and data reading method and data writing method therefor | |
WO2025139598A1 (en) | Physical memory expansion architecture and method for server, and server, device and medium | |
CN116644002A (en) | Memory management method and device based on RDMA | |
US7913059B2 (en) | Information processing device, data transfer method, and information storage medium | |
CN115129621B (en) | Memory management method, device, medium and memory management module | |
CN110445580B (en) | Data transmission method and device, storage medium, and electronic device | |
CN118316778B (en) | UAR page allocation method and device for IB network card | |
CN116483536B (en) | Data scheduling method, computing chip and electronic equipment | |
CN115586943B (en) | Hardware marking implementation method for dirty pages of virtual machine of intelligent network card | |
CN118819808A (en) | Memory allocation method, device, equipment, chip, storage medium and product | |
CN120216151A (en) | Queue management method and queue manager implemented in UFS device | |
US20160055112A1 (en) | Return available ppi credits command | |
JP3849578B2 (en) | Communication control device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |