CN119561911A - Active congestion control method and device for protecting elephant flow - Google Patents
- Publication number: CN119561911A
- Application number: CN202411468040.9A
- Authority: CN (China)
- Prior art keywords: current, congestion window, transmission link, time delay, data
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- H04L47/2483 — Traffic characterised by specific attributes, e.g. priority or QoS, involving identification of individual flows (H—Electricity; H04L—Transmission of digital information; H04L47/00—Traffic control in data switching networks; H04L47/10—Flow control; Congestion control)
- H04L47/225 — Determination of shaping rate, e.g. using a moving window (H04L47/22—Traffic shaping)
- H04L47/56 — Queue scheduling implementing delay-aware scheduling (H04L47/50—Queue scheduling)
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The present disclosure relates to an active congestion control method and apparatus for protecting elephant flows. The method comprises: when a target data flow among a plurality of data flows sent to a receiver-side host through a data center network is determined to be an elephant flow, obtaining the current delay of the current transmission link and a preset delay interval; determining a target congestion window adjustment strategy for the current transmission link based on the magnitude relation between the current delay and the preset delay interval; and adjusting the current congestion window value of the current transmission link at the current moment using the target congestion window adjustment strategy, so as to control the sending rate of the target data flow. The method and apparatus can actively adjust the current congestion window value of the transmission link corresponding to an elephant flow at the moment the elephant flow is fanned into the data center network, cautiously control the sending rate of the elephant flow, and guarantee the transmission and delivery of the elephant flow, thereby avoiding aggravating congestion in the data center network and ensuring the availability of the global network.
Description
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to an active congestion control method and apparatus for protecting an elephant flow.
Background
Cloud computing frameworks are widely deployed in large data centers and create very high traffic loads, causing traffic within data centers to grow exponentially, while the buffers of the core switches in use are relatively small, which results in frequent bursty congestion in the data center network. When congestion occurs at a core switch, if the sender-side host fails to sense the congestion on the core link in a short time, traffic continues to fan into the core network, seriously aggravating the congestion of the global network.
In the related art, whole-network traffic can be monitored in real time by a central control server, which detects the congestion degree and quality of service of the links and selects a transmission path for the elephant flow that backs off from congestion. Alternatively, the sender-side host can apply a more aggressive sending-rate adjustment scheme to the elephant flow, setting the initial congestion window to the size of the bandwidth-delay product (Bandwidth-Delay Product, BDP) and fanning data into the network by quickly filling idle links; this aggravates the global unavailability of the network, seriously affects the delivery of the elephant flow, and ultimately disables the flow scheduling mechanism as a whole.
Disclosure of Invention
In view of the above, the embodiments of the present disclosure provide an active congestion control method and apparatus for protecting an elephant flow, so as to solve the problems in the related art.
In a first aspect of an embodiment of the present disclosure, an active congestion control method for protecting an elephant flow is provided, which is applied to a sender host, and includes:
when a target data flow among a plurality of data flows sent to a receiver-side host through a data center network is determined to be an elephant flow, acquiring the current delay of the current transmission link and a preset delay interval;
determining a target congestion window adjustment strategy for the current transmission link based on the magnitude relation between the current delay and the preset delay interval; and
adjusting the current congestion window value of the current transmission link at the current moment using the target congestion window adjustment strategy, so as to control the sending rate of the target data flow.
In a second aspect of the embodiments of the present disclosure, an active congestion control apparatus for protecting an elephant flow is provided, which is applied to a sender-side host, and includes:
The acquisition module is used for acquiring the current delay of the current transmission link and a preset delay interval when a target data flow among a plurality of data flows sent to the receiver-side host through the data center network is determined to be an elephant flow;
the processing module is used for determining a target congestion window adjustment strategy for the current transmission link based on the magnitude relation between the current delay and the preset delay interval;
the processing module is further configured to adjust the current congestion window value of the current transmission link at the current moment using the target congestion window adjustment strategy, so as to control the sending rate of the target data flow.
In a third aspect of the disclosed embodiments, there is provided an electronic device, including:
At least one processor;
A memory for storing at least one processor-executable instruction;
wherein the at least one processor is configured to execute instructions to implement the steps of the above-described method.
In a fourth aspect of the disclosed embodiments, a computer-readable storage medium is provided, storing instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the steps of the above-described method.
The technical solutions adopted by the embodiments of the present disclosure can achieve at least the following beneficial effects: when a target data flow among a plurality of data flows sent to a receiver-side host through a data center network is determined to be an elephant flow, the current delay of the current transmission link and a preset delay interval are obtained; a target congestion window adjustment strategy for the current transmission link is determined based on the magnitude relation between the current delay and the preset delay interval; and the current congestion window value of the current transmission link at the current moment is adjusted using the target congestion window adjustment strategy, so as to control the sending rate of the target data flow. In this way, when the elephant flow fans into the data center network, the current congestion window value of the corresponding transmission link at the current moment can be actively adjusted, the sending rate of the elephant flow cautiously controlled, and the transmission and delivery of the elephant flow guaranteed, thereby avoiding aggravating congestion in the data center network and ensuring the availability of the global network.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required for the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 illustrates a flow diagram of an active congestion control method for protecting an elephant flow provided by an exemplary embodiment of the present disclosure;
fig. 2 is a schematic diagram illustrating a transmission procedure of a data stream according to an exemplary embodiment of the present disclosure;
Fig. 3 illustrates a leaf-spine switching network testbed topology based on the Clos architecture provided by an exemplary embodiment of the present disclosure;
Fig. 4 is a schematic diagram illustrating the structure of an active congestion control apparatus for protecting an elephant flow according to an exemplary embodiment of the present disclosure;
Fig. 5 shows a schematic structural diagram of an electronic device provided by an exemplary embodiment of the present disclosure;
fig. 6 shows a schematic diagram of a computer system according to an exemplary embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment," another embodiment "means" at least one additional embodiment, "and" some embodiments "means" at least some embodiments. Related definitions of other terms will be given in the description below. It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" and "a plurality" in this disclosure are illustrative rather than limiting; those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
In recent years, with the rapid development of network applications such as search, online retail, and cloud computing, stringent requirements have been placed on the underlying infrastructure in terms of computing, storage, and networking. Driven by this, high-performance data centers have been built rapidly worldwide by companies such as Microsoft, Google, Amazon, Facebook, and Alibaba. A data center network (Data Center Network, DCN) has characteristics different from those of the Internet, which presents both opportunities and challenges for the design of transport protocols over data center networks.
Data centers, the core infrastructure of cloud computing, keep growing in scale: a large data center can now typically accommodate at least tens of thousands of servers, connected together by a multi-tier Clos network. As internal traffic demands rise, most existing data center networks provision server access bandwidth at 10 Gbps, 40 Gbps, or even 100 Gbps, and access-link delays are typically at the microsecond level.
Cloud computing frameworks are widely deployed in large data centers and generate very high traffic loads, causing traffic inside data centers to grow exponentially, while the buffers of the switches in use are relatively small. The result is frequent bursty congestion in the data center network: switch queues grow rapidly and overflow within a very short time, lowering network throughput and transmission efficiency and sharply raising the flow completion time (Flow Completion Time, FCT). For example, typical tasks such as MapReduce computations use the partition/aggregate design pattern, which usually involves many-to-one traffic: multiple computing servers send data to a single aggregator node simultaneously, producing a highly synchronized concurrent communication pattern. In particular, when congestion occurs at a core switch, if the sender-side host fails to sense the congestion on the core link in a short time, traffic keeps fanning into the core network, seriously aggravating the congestion of the global network.
Data flow sizes in data centers typically exhibit a long-tail distribution. Taking the base workload of the Facebook data center network as an example, in a data center supporting web search, 50% of the data flows are mouse flows (mice flows) smaller than 100 KB, while 80% of the data belongs to the top 10% of flows, namely elephant flows larger than 10 MB. In a high-bandwidth, low-delay network environment, a mouse flow completes its transmission within a few round-trip times (Round Trip Time, RTT), whereas an elephant flow occupies the switch buffer queue for a long time, and common flow scheduling algorithms tend to defer its scheduled transmission. In a concrete traffic scenario, when a virtual machine (Virtual Private Cloud, VPC) in a data center network crashes and must be restored and dumped immediately, the backup snapshot required for the VPC restoration must be transmitted preferentially and rapidly, a large amount of link capacity must be vacated for that transmission, and the sooner the sender-side host is to be restored to its normal working state, the more urgent the scheduled transmission of the elephant flow becomes. To guarantee normal operation of the core switching network, the traffic fan-in rate of the sender-side host must be controlled precisely and cautiously, which places higher demands on an active congestion control algorithm.
In one technical scheme in the related art, a software-defined network (Software Defined Network, SDN) architecture may be used in the data center network: an SDN central control server monitors whole-network traffic in real time, detects and perceives topology changes, constructs a real-time network topology, detects the congestion degree and quality of service of links, and selects a transmission path for the elephant flow that backs off from congestion. The approach mainly consists of detecting link congestion from the network topology graph constructed in real time, dynamically adjusting the sampling period according to the congestion degree, identifying elephant flows with a two-stage double-threshold method over that sampling period, designing a reward function from minimum packet loss rate, maximum throughput, and highest path-selection probability, computing an optimal path in real time for each identified elephant flow, and rerouting it onto that path.
The main technical application point of this SDN data-center network load-balancing method for elephant flows is the SDN central control server, and both the control plane and the data plane must be reconfigured. The method takes global network flow transmission information as the basis for transmission control: it must collect network traffic and environment data globally, identify elephant flows, perform route computation on the control plane for every data flow arriving at every end host, and finally push the computed routes down to the switches and end hosts on the data plane before flow transmission can start. This requires additional computation and signaling overhead, and such a transmission architecture is fatal in a data center network with microsecond-level delays. Moreover, the transmission control point sits at the central control server; it is a global method closer in nature to a routing algorithm than to a congestion control algorithm.
Another technical scheme provides Homa, a transport-layer protocol for data center networks. It is a receiver-driven transport protocol that can dynamically adjust network queue priorities and integrates a receiver-driven active flow control mechanism, guaranteeing ultra-low delay for short messages under high workloads. Homa also allows controlled overcommit of the receiver's downlink to ensure adequate bandwidth utilization under high workloads. In this scheme, the receiver schedules the queues provided by the switch relatively aggressively and dynamically adjusts the priority queues on the receiver side. Homa is a short-message-based architecture that can avoid the delay caused by head-of-line blocking in the Transmission Control Protocol (TCP). It is a stateless protocol that requires no acknowledgements (ACKs), thereby reducing the number of short messages.
This scheme targets the long-tail distribution of data loads in existing data center networks. Its main technical application point is the receiver-side host: the protocol stack is optimized for remote procedure calls (Remote Procedure Call, RPC), and shortest remaining processing time (Shortest Remaining Processing Time, SRPT) is used as the basis of the priority order. This flow scheduling method essentially replaces congestion control with flow scheduling; after priority assignment is completed, the sender-side host transmits the flow data in order. For the elephant flow, the sender-side host uses a more aggressive sending-rate adjustment scheme, setting the initial congestion window to the size of the bandwidth-delay product (Bandwidth-Delay Product, BDP) and fanning data into the network by quickly filling idle links. This flow scheduling method is friendly to the transmission of mouse flows but gives elephant flows poor transmission priority.
Meanwhile, just as congestion is about to occur in the core switching network, the sender-side host reaches the critical moment of elephant flow scheduling. With priority assignment already fixed, the set of active flows static, the protocol stateless, and no ACK confirmation required, the sender-side host cannot perceive the onset of congestion in time or change the assigned priorities; it keeps fanning in the elephant flow with the aggressive congestion window, aggravating the global unavailability of the network, seriously affecting the delivery of the elephant flow, and ultimately disabling the flow scheduling mechanism as a whole.
Therefore, to solve the above problems, the embodiments of the present disclosure provide an active congestion control method for protecting elephant flows. It addresses the following technical problem: in existing data center networks, where distributed synchronous traffic arrival patterns and long-tail-distributed traffic loads are common, at the critical moment when a sender-side host is about to switch from scheduling mouse flows to scheduling an elephant flow, fanning the elephant flow into an already congested core switching network at an aggressive sending rate aggravates global congestion, causes unavailability, and leaves the transmission of the elephant flow unprotected. The method uses a comparatively cautious sending-rate regulation mechanism to realize a fast-perceiving, fast-converging active congestion control algorithm for data center networks. It is in essence a distributed method: it takes local network flow transmission information as the basis for transmission control, takes the sender-side host as the control point, and, when an elephant flow is transmitted, fans the elephant flow into the core switching network with a cautious sending-rate adjustment scheme, preventing congestion of the core switching network from being aggravated.
The active congestion control method for protecting the elephant flow provided by the embodiment of the disclosure can be executed by a sender end host or a chip applied to the sender end host.
The sender-side host may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (Content Delivery Network, CDN) services, big data, and artificial intelligence platforms; the exemplary embodiments of the present disclosure are not limited in this respect.
Fig. 1 shows a flow diagram of an active congestion control method for protecting an elephant flow provided by an exemplary embodiment of the present disclosure. As shown in fig. 1, the active congestion control method for protecting an elephant flow includes:
S101: when a target data flow among a plurality of data flows sent to a receiver-side host through a data center network is determined to be an elephant flow, acquire the current delay of the current transmission link and a preset delay interval;
S102: determine a target congestion window adjustment strategy for the current transmission link based on the magnitude relation between the current delay and the preset delay interval;
S103: adjust the current congestion window value of the current transmission link at the current moment using the target congestion window adjustment strategy, so as to control the sending rate of the target data flow.
Specifically, Fig. 2 shows a schematic diagram of the transmission process of data flows provided by an exemplary embodiment of the present disclosure. As shown in Fig. 2, when the sender-side host sends a data flow to the receiver-side host, the data flow passes through the edge switch between the sender-side host and the data center network (the core switching network in Fig. 2, which may also be referred to as the core switch in the exemplary embodiments of the present disclosure), the data center network itself, and the edge switch between the data center network and the receiver-side host. When too many data flows fan into the data center network, congestion occurs on the transmission links of the data center network; if, at that point, the sender-side host still fans the elephant flow into the data center network at an aggressive sending rate, the global congestion of the data center network may worsen until the network finally becomes unavailable.
To guarantee the transmission and delivery of the elephant flow, the current delay of the current transmission link and a preset delay interval can be acquired when a target data flow among the plurality of data flows sent to the receiver-side host through the data center network is determined to be an elephant flow. A target congestion window adjustment strategy for the current transmission link is then determined based on the magnitude relation between the current delay and the preset delay interval.
Here, the preset delay interval may be determined according to the actual application scenario, and is not particularly limited by the exemplary embodiments of the present disclosure.
On this basis, the current congestion window value of the current transmission link at the current moment is adjusted using the target congestion window adjustment strategy, so that the sending rate of the elephant flow is cautiously controlled. This avoids the situation in which, as the data center network begins to congest, the sender-side host keeps fanning the elephant flow in at an aggressive sending rate, the buffer queues of the data center network keep growing, the transmission and delivery of the elephant flow are affected, and the global network ultimately becomes congested to the point of unavailability.
According to the technical solution of the exemplary embodiments of the present disclosure, when a target data flow among a plurality of data flows sent to a receiver-side host through a data center network is determined to be an elephant flow, the current delay of the current transmission link and a preset delay interval are obtained; a target congestion window adjustment strategy for the current transmission link is determined based on the magnitude relation between the current delay and the preset delay interval; and the current congestion window value of the current transmission link at the current moment is adjusted using that strategy to control the sending rate of the target data flow. By actively adjusting, at the moment the elephant flow fans into the data center network, the current congestion window value of the transmission link corresponding to the elephant flow, the sending rate of the elephant flow can be cautiously controlled and its transmission and delivery guaranteed, thereby avoiding aggravated congestion of the data center network and ensuring the availability of the global network.
In some embodiments, the method may further comprise:
when a plurality of data flows arrive at the sender-side host from a buffer, sending the flow size information of each data flow to the receiver-side host, and obtaining the remaining transmission bytes of each data flow upon receiving response information from the receiver-side host for the flow size information, wherein the response information characterizes that the receiver-side host confirms receipt of the corresponding data flow;
ranking the plurality of data flows by priority in ascending order of remaining transmission bytes, obtaining the priority rank corresponding to each data flow; and
fanning the plurality of data flows into the data center network in turn according to the priority ranking.
Specifically, as shown in Fig. 2, when a data flow arrives at the sender-side host from the buffer, the sender-side host may notify the receiver-side host of the flow size information in the form of control messages (①~③), and the receiver-side host returns response information for the flow size information to the sender-side host (④~⑥). Here, the response information may characterize that the receiver-side host confirms receipt of the corresponding data flow; its specific content is determined by the actual application scenario and is not particularly limited by the exemplary embodiments of the present disclosure. In the method of the exemplary embodiments of the present disclosure, the response information may be a standard permission token packet.
After receiving the permission token packets, the sender-side host assigns priorities to the data flows in the buffer in ascending order of remaining transmission bytes, obtaining the priority rank of each flow, and then fans the data flows into the data center network in turn according to the priority ranking, thereby realizing receiver-driven active congestion control in a "mouse flows first, elephant flows second" manner.
On this basis, when the target data flow among the plurality of data flows sent to the receiver-side host through the data center network is determined to be an elephant flow, that is, at the critical moment of switching from transmitting mouse flows to transmitting the elephant flow and thereafter while the elephant flow is transmitted, the above-described steps S101 to S103 are performed.
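As an illustration of the fan-in ordering above, a minimal Python sketch follows. The Flow type, its field names, and the use of a heap are assumptions made for illustration; only the rule of admitting flows in ascending order of remaining transmission bytes comes from the description above.

```python
# Hypothetical sketch: schedule flows with the fewest remaining transmission
# bytes first, so mouse flows are fanned into the data center network before
# the elephant flow. Flow and its fields are illustrative, not from the patent.
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Flow:
    remaining_bytes: int
    flow_id: str = field(compare=False)

def fan_in_order(flows):
    """Yield flows in transmission order: fewest remaining bytes first."""
    heap = list(flows)
    heapq.heapify(heap)          # min-heap keyed on remaining_bytes
    while heap:
        yield heapq.heappop(heap)

flows = [Flow(16_000, "f1"), Flow(10_000_000, "f2"), Flow(8_000, "f3")]
print([f.flow_id for f in fan_in_order(flows)])  # ['f3', 'f1', 'f2']; f2 is the elephant flow
```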
In some embodiments, obtaining the current delay and the preset delay interval of the current transmission link may include:
sending a marked data packet of the target data flow to the receiver-side host, and receiving an acknowledgement reply packet from the receiver-side host for the marked data packet;
acquiring the sending time of the marked data packet and the receiving time of the acknowledgement reply packet, respectively, and calculating the current delay of the current transmission link based on the sending time of the marked data packet and the receiving time of the acknowledgement reply packet; and
acquiring the preset delay interval of the current transmission link, wherein the preset delay interval is determined based on a preset lower delay limit and a preset upper delay limit.
Specifically, the plurality of data packets included in the target data flow are first obtained; these packets are then randomly sampled and marked to obtain marked data packets, the marked data packets are sent to the receiver-side host, and acknowledgement reply packets from the receiver-side host for the marked data packets are received. A marked packet and its acknowledgement reply packet may be referred to collectively here as a joint packet.
Illustratively, the sender-side host may randomly sample from the plurality of data packets included in the target data flow, and mark the randomly sampled packets among them.
For example, 3 flag bits of the Type of Service (TOS) / Differentiated Services Code Point (DSCP) field in the IP packet header of a randomly sampled data packet may be set to 011, indicating that the packet is a high-priority packet, thereby obtaining a marked data packet.
The random sampling and marking method may be a sampled-telemetry packet marking method, which can be described as follows: a random number is drawn for each packet of the elephant flow, with an assumed value range of 0 to N (N an integer); a random-number threshold n (n ∈ [0, N]) is set, generally so that about 0.1% of the total number of data packets in the elephant flow are marked; and a dequeued packet has its TOS bits marked when its random number is smaller than the threshold, yielding a marked data packet. Here, the sampling ratio of marked packets is n/(N+1), and the sending time of each marked packet is recorded in the buffer of the sender-side host.
After receiving a marked data packet, the receiver-side host must, in addition to returning the conventional permission token packet to the sender-side host, reply to and acknowledge the marked data packet by returning an acknowledgement reply packet to the sender-side host.
Here, the reply acknowledgement may be realized by marking the acknowledgement packet, which confirms the marked packet while carrying no data payload through the data center network. The acknowledgement reply packet is marked by the same method as the marked data packet: the TOS bits of its IP packet header are marked in the same way.
At this point, the exemplary embodiments of the present disclosure may acquire the sending time of the marked data packet and the receiving time of the acknowledgement reply packet, respectively, and calculate the current delay of the current transmission link from the two.
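The sampling, marking, and delay measurement just described can be sketched as follows. The constants N and n, the timestamp store, and the function names are assumptions for illustration; the TOS value simply places the bits 011 in the three high-order precedence bits, as described above.

```python
# Hypothetical sketch of sampled-telemetry marking and per-marked-packet
# delay measurement at the sender-side host.
import random
import time

N = 999                          # random numbers are drawn from [0, N]
n = 1                            # threshold: marks roughly n/(N+1) = 0.1% of packets
TOS_HIGH_PRIORITY = 0b011 << 5   # precedence bits 011 in the IP TOS byte

send_times = {}                  # sender-side buffer: marked packet id -> send time

def maybe_mark(packet_id):
    """Randomly mark a dequeued packet; returns the TOS value to apply."""
    if random.randint(0, N) < n:
        send_times[packet_id] = time.monotonic()
        return TOS_HIGH_PRIORITY
    return 0

def on_acknowledgement_reply(packet_id):
    """Compute the current link delay when the acknowledgement reply
    packet for a marked data packet arrives; None for unmarked packets."""
    sent = send_times.pop(packet_id, None)
    if sent is None:
        return None
    return time.monotonic() - sent
```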
The exemplary embodiments of the present disclosure may also obtain the preset delay interval of the current transmission link. Here, the preset delay interval is determined based on a preset lower delay limit and a preset upper delay limit.
Illustratively, the preset lower delay limit may be calculated by the following formula:

d_low = ssthresh / bw_exp

where d_low represents the preset lower delay limit, ssthresh represents the slow-start threshold, and bw_exp represents the expected bandwidth.
The preset upper delay limit d_high is determined from two quantities: a dynamic threshold interval boundary calculated from the remaining buffer capacity and the actual delay, and a delay dynamic threshold boundary that is updated immediately after each packet loss, where bw represents the current bandwidth and Delay represents the current delay.
Based on this, the preset delay interval of the current transmission link can be expressed as [d_low, d_high].
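A small sketch of the interval computation, under stated assumptions: the lower limit follows the formula above, while combining the two upper-limit boundaries with min() is an assumption of this sketch, since the disclosure states only that both the buffer-derived boundary and the loss-updated boundary determine d_high.

```python
# Hypothetical sketch of the preset delay interval [d_low, d_high].
def delay_interval(ssthresh_bytes, bw_exp_bytes_per_s, d_buf, d_loss):
    d_low = ssthresh_bytes / bw_exp_bytes_per_s  # slow-start threshold / expected bandwidth
    d_high = min(d_buf, d_loss)                  # assumed combination of the two boundaries
    return d_low, d_high
```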
in some embodiments, determining the target congestion window adjustment policy for the current transmission link based on the magnitude relationship between the current delay and the preset delay interval may include:
determining the current state of the current transmission link based on the magnitude relation between the current delay and the preset delay interval;
Acquiring congestion window adjustment strategies corresponding to a plurality of preset transmission link states, and determining the congestion window adjustment strategy corresponding to the current state from the congestion window adjustment strategies corresponding to the plurality of preset transmission link states;
and determining a congestion window adjustment strategy corresponding to the current state as a target congestion window adjustment strategy of the current transmission link.
Specifically, the preset transmission link states may include a start-transmission state, an about-to-be-fully-loaded state, and a packet-loss state. The current state of the current transmission link is one of these preset transmission link states.
If the current delay is smaller than the preset lower delay limit of the preset delay interval, i.e., Delay < d_low, the current state of the current transmission link is the start-transmission state. This indicates that queuing on the current transmission link is light and the available bandwidth is not yet fully utilized; to grow the congestion window value rapidly, a multiplicative-increase method may be adopted to raise the sending rate.
The congestion window adjustment strategy corresponding to the start-transmission state is that the congestion window adjustment value for the next moment is twice the current congestion window value. It may be expressed as W = W_prev × 2, where W represents the congestion window adjustment value for the next moment and W_prev represents the current congestion window value.
If the current delay lies within the preset delay interval, i.e., d_low ≤ Delay ≤ d_high, the current state of the current transmission link is the about-to-be-fully-loaded state. This indicates that the bandwidth of the current transmission link is gradually approaching full load and queuing has begun; packet loss may occur at any moment, so the window should be increased cautiously.
The congestion window adjustment strategy corresponding to the about-to-be-fully-loaded state is that the congestion window adjustment value for the next moment is determined from the sum of the current congestion window value and the congestion window predicted value of the current transmission link. It may be expressed as W = W_prev + W_e(d, l), where W_e(d, l) represents the congestion window predicted value of the current transmission link.
If the current delay is greater than the preset upper delay limit of the preset delay interval, i.e., Delay > d_high, the current state of the current transmission link is the packet-loss state. This indicates that packet-loss congestion has already occurred on the current transmission link; the window should be reduced rapidly to ensure that the sender-side host decreases the traffic fanned into the data center network.
The congestion window adjustment strategy corresponding to the packet-loss state is that the congestion window adjustment value for the next moment is one half of the current congestion window value. It can be expressed as W = W_prev / 2.
Based on the above, the exemplary embodiments of the present disclosure may determine the current state of the current transmission link from the magnitude relation between the current delay and the preset delay interval, determine the congestion window adjustment strategy corresponding to the current state from among the strategies corresponding to the preset transmission link states, and then take the strategy corresponding to the current state as the target congestion window adjustment strategy of the current transmission link.
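The three strategies can be collected into one short sketch. The function and parameter names are illustrative; w_e stands for the joint congestion window predicted value W_e(d, l), whose computation is described later.

```python
# Sketch of the three-state congestion window adjustment policy above.
def adjust_window(w_prev, delay, d_low, d_high, w_e):
    if delay < d_low:
        # Start-transmission state: link lightly loaded, multiplicative increase.
        return w_prev * 2
    elif delay <= d_high:
        # About-to-be-fully-loaded state: queuing has begun, increase cautiously.
        return w_prev + w_e
    else:
        # Packet-loss state: back off quickly to reduce the fan-in traffic.
        return w_prev / 2
```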
In some embodiments, adjusting the current congestion window value of the current transmission link at the current time using the target congestion window adjustment policy may include:
Acquiring a current congestion window value of a current transmission link at a current moment;
calculating a congestion window adjustment value at the next moment based on the current congestion window value and the target congestion window adjustment strategy;
and adjusting the current congestion window value by using the congestion window adjustment value at the next moment.
Specifically, after the target congestion window adjustment strategy of the current transmission link has been determined, the current congestion window value of the current transmission link at the current moment may be obtained, and the congestion window adjustment value for the next moment calculated from the current congestion window value and the target congestion window adjustment strategy. The calculation of the congestion window adjustment value for the next moment is as described above and is not repeated here.
In the method of the exemplary embodiments of the present disclosure, the sender-side host sets the experimental initial congestion window value to 10 at the start of transmission. Each time an acknowledgement reply packet from the receiver-side host is received, the current delay of the current transmission link is calculated; the current state of the current transmission link is determined from the magnitude relation between the current delay and the preset delay interval; the congestion window adjustment strategy corresponding to the current state is selected from the strategies corresponding to the preset transmission link states; and the congestion window adjustment value of the current transmission link for the next moment is calculated. This repeats until the elephant flow finishes transmitting, realizing window-increase and window-decrease operations on the sending rate of the elephant flow, so that the elephant flow fans into the data center network at a comparatively cautious sending rate.
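A toy run of this control loop, reusing adjust_window() from the sketch above; the delay samples and threshold values are fabricated for illustration, and only the initial window of 10 comes from the description.

```python
# Hypothetical per-acknowledgement control loop: one update per
# acknowledgement reply packet until the elephant flow completes.
d_low, d_high, w_e = 0.0002, 0.0008, 4    # illustrative values (seconds, segments)
cwnd = 10                                 # experimental initial congestion window
for delay in [0.0001, 0.0003, 0.0005, 0.0012, 0.0004]:
    cwnd = adjust_window(cwnd, delay, d_low, d_high, w_e)
    print(f"delay={delay:.4f}s -> cwnd={cwnd}")
# The window doubles while the link is light, creeps up as the link
# approaches full load, and halves after the loss-state sample.
```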
In some embodiments, the method may further comprise:
acquiring the current packet loss rate of a target data stream and a pre-constructed joint congestion window prediction function;
and calculating a congestion window predicted value of the current transmission link based on the current delay and the current packet loss rate, using the joint congestion window prediction function.
Specifically, the current packet loss rate may be calculated from the sending and receiving information of the marked data packets and the acknowledgement reply packets described above. The calculation method is a conventional technique in the art and is not described here.
When constructing the joint congestion window prediction function, a probabilistic prediction function for delay and a probabilistic prediction function for packet loss rate must first be constructed.
The sender-side host may calculate the delay and packet loss rate of the marked packets over the current round-trip-time interval of the current transmission link from the acknowledgement reply packets obtained from the receiver-side host, use them as the basis for predicting the packet delay and packet loss rate of the next round-trip-time interval, and construct a probabilistic prediction function for delay and one for packet loss rate from the delay and packet loss rate of the current transmission link using a Bayesian network model.
The delay includes delay caused by congestion in the data center network, and the packet loss rate includes losses caused by that delay, so the two are conditionally related, and the conditional probability model of a Bayesian network can be used for joint conditional probability calculation.
The probabilistic prediction function for delay can be described as a joint conditional probability, expressed by the following formula:

P(d_e) = P(d_old) · P(d | d_old) + (1 − P(d_old)) · P(d | ¬d_old)

where P(d_e) represents the predicted probability of delay, P(d_old) is the probability of the mean delay over the previous round-trip-time interval, P(d | d_old) is the conditional probability of delay in the next round-trip-time interval given the condition of the previous interval, and d represents delay; in the joint conditional probability, d_old denotes that the previous-interval condition holds (true), and ¬d_old that it does not (false).
The probabilistic prediction function for the packet loss rate can likewise be described as a joint conditional probability, expressed by the following formula:

P(l_e) = P(l_old) · P(l | l_old) + (1 − P(l_old)) · P(l | ¬l_old) + P(d_old) · P(l | d_old) + (1 − P(d_old)) · P(l | ¬d_old)

where P(l_e) represents the predicted probability of the packet loss rate, P(l_old) is the probability of the packet loss rate over the previous round-trip-time interval, P(l | l_old) is the conditional probability of loss in the next round-trip-time interval given the condition of the previous interval, and l represents the packet loss rate; in the joint conditional probability, l_old denotes that the previous-interval condition holds (true), and ¬l_old that it does not (false).
Delay arises from unduly increasing the current congestion window value, which also affects the traffic of other data flows, while packet loss arises from delay and other factors. Using the joint conditional probabilities of the Bayesian network above, a delay-based congestion window prediction function and a loss-rate-based congestion window prediction function can be constructed from the probabilistic prediction functions for delay and for packet loss rate, respectively.
The delay-based congestion window prediction function can be expressed by the following formula:

W_e(d) = (1 − P(d_e)) · W(d_t)

where W_e(d) represents the delay-based congestion window predicted value and W(d_t) is the congestion window predicted value under the expected delay probability.
Here, the sender-side host can fully utilize the available bandwidth on the forward path without being affected by queuing delay on the reverse path, so the congestion window predicted value W(d_t) under the expected delay probability is defined in terms of the queuing delay d_queue of the sender-side host.
The loss-rate-based congestion window prediction function can be expressed by the following formula:

W_e(l) = (1 − P(l_e)) · W(l_t)

where W_e(l) represents the loss-rate-based congestion window predicted value and W(l_t) represents the congestion window predicted value under the expected packet-loss-rate probability.
On this basis, the two indicators, delay and packet loss, can be superimposed, and a joint congestion window prediction function constructed from the delay-based and loss-rate-based congestion window prediction functions, where W_e(d, l) represents the congestion window predicted value of the current transmission link.
Based on the above, after the current delay and current packet loss rate of the current transmission link are obtained, the congestion window predicted value of the current transmission link can be calculated from them using the joint congestion window prediction function. Thus, when the current state of the current transmission link is the about-to-be-fully-loaded state, the congestion window adjustment value for the next moment is determined from the sum of the current congestion window value and the congestion window predicted value, and the current congestion window value of the current transmission link at the current moment is adjusted accordingly. This achieves the purpose of actively controlling the sending rate of the elephant flow and guaranteeing its transmission and delivery, thereby avoiding aggravated congestion of the data center network and ensuring the availability of the global network.
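The prediction step can be sketched as follows. The probability inputs are the per-round-trip estimates described above; the formulas for P(d_e), P(l_e), W_e(d), and W_e(l) follow the disclosure, while combining W_e(d) and W_e(l) with min() is an assumption of this sketch, since the disclosure states only that the two indicators are superimposed.

```python
# Hypothetical sketch of the joint congestion window prediction.
def predict_delay_prob(p_d_old, p_d_given_old, p_d_given_not_old):
    # P(d_e) = P(d_old)*P(d|d_old) + (1 - P(d_old))*P(d|not d_old)
    return p_d_old * p_d_given_old + (1 - p_d_old) * p_d_given_not_old

def predict_loss_prob(p_l_old, p_l_given_l_old, p_l_given_not_l_old,
                      p_d_old, p_l_given_d_old, p_l_given_not_d_old):
    # P(l_e) adds the delay-conditioned terms, since loss depends on delay.
    return (p_l_old * p_l_given_l_old
            + (1 - p_l_old) * p_l_given_not_l_old
            + p_d_old * p_l_given_d_old
            + (1 - p_d_old) * p_l_given_not_d_old)

def joint_window_prediction(p_de, p_le, w_dt, w_lt):
    w_e_d = (1 - p_de) * w_dt   # W_e(d) = (1 - P(d_e))*W(d_t)
    w_e_l = (1 - p_le) * w_lt   # W_e(l) = (1 - P(l_e))*W(l_t)
    return min(w_e_d, w_e_l)    # assumed superposition of the two predictors
```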
Based on this, the exemplary embodiments of the present disclosure carried out data transmission experiments on a testbed topology using a Clos-based switching network architecture. Fig. 3 shows the Clos-based leaf-spine switching network testbed topology provided by an exemplary embodiment of the present disclosure. As shown in Fig. 3, in the leaf-spine switching network based on the Clos topology, there are 4 spine (core) switches, 16 leaf (edge) switches, and 64 servers; servers connect to edge switches over 10 Gbps links, core switches are interconnected over 50 Gbps links, the unloaded link delay is 10 μs, and the unloaded RTT over the longest 4-hop transmission path is 80 μs. The workload uses data flow arrivals obeying a Poisson distribution, with the source and destination servers of each flow chosen uniformly at random. Mouse flow sizes were selected randomly from 8 KB to 32 KB, and elephant flow sizes were set to 10 to 100 MB. The test duration was 60 minutes.
Under the above testbed topology, workload, and experimental conditions, the active congestion control method for protecting elephant flows provided by the exemplary embodiments of the present disclosure was compared with the prior art in terms of elephant flow completion time, average core switch buffer queue length, and network throughput. With the method of the exemplary embodiments, the average elephant flow completion time is 3.82 seconds, a 75.10% reduction from the 15.34 seconds of the prior art; the average core switch buffer queue length is 55.32 KB, an 81.45% reduction from the 298.21 KB of the prior art; and the average throughput is 3.72 Gbps, a 45.43% improvement over the 2.03 Gbps of the prior art, with no sudden or sustained congestion occurring in the core switching network.
In summary, the active congestion control method for protecting elephant flows provided by the exemplary embodiments of the present disclosure improves upon existing methods in average elephant flow completion time, average core switch buffer queue length, and network throughput, thereby ensuring the overall availability of the core switching network; its protective effect on the transmission and delivery of elephant flows is pronounced.
The foregoing has been mainly presented in terms of the teachings of the presently disclosed embodiments. It will be appreciated that, in order to achieve the above-described functions, the electronic device includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The embodiment of the disclosure may divide the functional units of the electronic device according to the above method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present disclosure, the division of the modules is merely a logic function division, and other division manners may be implemented in actual practice.
In the case of dividing each functional module by corresponding each function, exemplary embodiments of the present disclosure provide an active congestion control apparatus for protecting an elephant flow, which may be a sender end host or a chip applied to the sender end host. Fig. 4 is a schematic diagram illustrating the structure of an active congestion control apparatus for protecting an elephant flow according to an exemplary embodiment of the present disclosure. As shown in fig. 4, the apparatus 400 includes:
An obtaining module 401, configured to obtain a current delay and a preset delay interval of a current transmission link when it is determined that a target data stream in a plurality of data streams sent to a receiver-side host through a data center network is an elephant stream;
a processing module 402, configured to determine a target congestion window adjustment policy of the current transmission link based on a magnitude relation between the current delay and the preset delay interval;
the processing module 402 is further configured to adjust a current congestion window value of the current transmission link at a current time by using the target congestion window adjustment policy, so as to control a sending rate of the target data flow.
In some embodiments, the obtaining module 401 is further configured to send a tag packet of the target data stream to the receiver-side host, receive an acknowledgement reply packet of the receiver-side host for the tag packet, obtain a sending time of the tag packet and a receiving time of the acknowledgement reply packet, respectively, calculate a current delay of a current transmission link based on the sending time of the tag packet and the receiving time of the acknowledgement reply packet, and obtain a preset delay interval of the current transmission link, where the preset delay interval is determined based on a preset lower delay limit and a preset upper delay limit.
In some embodiments, the processing module 402 is further configured to determine the current state of the current transmission link based on the magnitude relation between the current delay and the preset delay interval, obtain the congestion window adjustment strategies corresponding to the preset transmission link states, determine the congestion window adjustment strategy corresponding to the current state from among them, and determine that strategy as the target congestion window adjustment strategy of the current transmission link.
In some embodiments, the processing module 402 is further configured to obtain a current congestion window value of the current transmission link at a current time, calculate a congestion window adjustment value at a next time based on the current congestion window value and the target congestion window adjustment policy, and adjust the current congestion window value using the congestion window adjustment value at the next time.
In some embodiments, the plurality of preset transmission link states includes a start-transmission state, an about-to-be-fully-loaded state, and a packet-loss state;
the congestion window adjustment strategy corresponding to the start-transmission state is that the congestion window adjustment value for the next moment is twice the current congestion window value;
the congestion window adjustment strategy corresponding to the about-to-be-fully-loaded state is that the congestion window adjustment value for the next moment is determined from the sum of the current congestion window value and the congestion window predicted value of the current transmission link; and
the congestion window adjustment strategy corresponding to the packet-loss state is that the congestion window adjustment value for the next moment is one half of the current congestion window value.
In some embodiments, the processing module 402 is further configured to obtain a current packet loss rate of the target data flow and a pre-constructed joint congestion window prediction function, and calculate a congestion window prediction value of the current transmission link based on the current delay and the current packet loss rate by using the joint congestion window prediction function.
In some embodiments, the obtaining module 401 is further configured to send flow size information of each data flow to a receiver end host when the plurality of data flows reach the sender end host from a buffer, and obtain remaining transmission bytes of each data flow when response information of the receiver end host for the flow size information is received, where the response information is used to characterize that the receiver end host confirms that the receiver end host receives the corresponding data flow;
The processing module 402 is further configured to rank the plurality of data flows by priority in ascending order of remaining transmission bytes, obtain the priority rank corresponding to each data flow, and fan the plurality of data flows into the data center network in turn according to the priority ranking.
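This ranking is the classic shortest-remaining-size discipline applied to flows; a minimal sketch, assuming each flow record exposes a hypothetical `remaining_bytes` field:

```python
def fan_in_order(flows: list[dict]) -> list[dict]:
    # Rank flows so that those with the fewest remaining transmission bytes
    # are fanned into the data center network first.
    return sorted(flows, key=lambda f: f["remaining_bytes"])

# Example: a short flow jumps ahead of an elephant flow.
# fan_in_order([{"id": "elephant", "remaining_bytes": 10_000_000},
#               {"id": "mouse", "remaining_bytes": 4_096}])
# -> the "mouse" flow is scheduled before the "elephant" flow.
```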
The embodiments of the present disclosure also provide an electronic device, which includes at least one processor and a memory for storing instructions executable by the at least one processor, wherein the at least one processor is configured to execute the instructions to implement the steps of the methods disclosed in the embodiments of the present disclosure.
Fig. 5 shows a schematic structural diagram of an electronic device provided in an exemplary embodiment of the present disclosure. As shown in fig. 5, the electronic device 500 includes at least one processor 501 and a memory 502 coupled to the processor 501, the processor 501 may perform the respective steps of the above-described methods disclosed in the embodiments of the present disclosure.
The processor 501 may also be referred to as a central processing unit (CPU) and may be an integrated circuit chip with signal processing capabilities. The steps of the above-described methods disclosed in the embodiments of the present disclosure may be accomplished by hardware integrated logic circuits or software instructions in the processor 501. The processor 501 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present disclosure may be embodied as being executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in the memory 502, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other well-known storage media. The processor 501 reads the information in the memory 502 and, in combination with its hardware, performs the steps of the method described above.
In addition, when implemented by software and/or firmware, the various operations/processes according to the present disclosure may be installed from a storage medium or a network onto a computer system having a dedicated hardware structure, for example, the computer system 600 shown in Fig. 6, which is capable of performing various functions, including those described above, when the corresponding programs are installed. Fig. 6 shows a schematic diagram of a computer system according to an exemplary embodiment of the present disclosure.
Computer system 600 is intended to represent various forms of digital electronic computing devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. It may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the computer system 600 includes a computing unit 601, and the computing unit 601 can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the computer system 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in computer system 600 are connected to the I/O interface 605, including an input unit 606, an output unit 607, the storage unit 608, and a communication unit 609. The input unit 606 may be any type of device capable of inputting information to the computer system 600; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device. The output unit 607 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 608 may include, but is not limited to, magnetic disks and optical disks. The communication unit 609 allows the computer system 600 to exchange information/data with other devices over a network, such as the Internet, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, e.g., Bluetooth™ devices, Wi-Fi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above. For example, in some embodiments, the above-described methods disclosed by embodiments of the present disclosure may be implemented as a computer software program tangibly embodied on a machine-readable medium, e.g., storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device via the ROM 602 and/or the communication unit 609. In some embodiments, the computing unit 601 may be configured to perform the above-described methods disclosed by embodiments of the present disclosure in any other suitable manner (e.g., by means of firmware).
The disclosed embodiments also provide a computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the above-described method disclosed by the disclosed embodiments.
A computer-readable storage medium in the embodiments of the present disclosure may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The computer-readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specifically, the computer-readable storage medium may include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable medium may be included in the electronic device or may exist alone without being incorporated into the electronic device.
The disclosed embodiments also provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the above-described methods of the disclosed embodiments.
In the embodiments of the present disclosure, computer program code for performing the operations of the present disclosure may be written in one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules, components, or units described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module, component, or unit does not, in some cases, constitute a limitation of the module, component, or unit itself.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary hardware logic components that may be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-chip (SOCs), complex programmable logic devices (CPLDs), and the like.
The above description is merely illustrative of some embodiments of the present disclosure and of the principles of the technology applied. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to the specific combinations of features described above, and also covers other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, embodiments formed by substituting the above features with technical features having similar functions disclosed in the present disclosure (but not limited thereto).
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the disclosure. The scope of the present disclosure is defined by the appended claims.
Claims (10)
1. An active congestion control method for protecting an elephant flow is characterized by being applied to a sender end host, and comprising the following steps:
when it is determined that a target data stream among a plurality of data streams sent to a receiver-side host through a data center network is an elephant flow, acquiring a current delay and a preset delay interval of a current transmission link;
determining a target congestion window adjustment strategy for the current transmission link based on the magnitude relation between the current delay and the preset delay interval; and
adjusting the current congestion window value of the current transmission link at the current moment by using the target congestion window adjustment strategy, so as to control the sending rate of the target data stream.
2. The method of claim 1, wherein the acquiring the current delay and the preset delay interval of the current transmission link comprises:
sending a tag packet of the target data stream to the receiver-side host, and receiving an acknowledgement reply packet from the receiver-side host for the tag packet;
acquiring the sending time of the tag packet and the receiving time of the acknowledgement reply packet, and calculating the current delay of the current transmission link based on the sending time of the tag packet and the receiving time of the acknowledgement reply packet; and
acquiring the preset delay interval of the current transmission link, wherein the preset delay interval is determined based on a preset lower delay limit and a preset upper delay limit.
3. The method of claim 1, wherein the determining the target congestion window adjustment strategy for the current transmission link based on the magnitude relation between the current delay and the preset delay interval comprises:
determining a current state of the current transmission link based on the magnitude relation between the current delay and the preset delay interval;
acquiring congestion window adjustment strategies corresponding to a plurality of preset transmission link states, and determining the congestion window adjustment strategy corresponding to the current state from among them; and
determining the congestion window adjustment strategy corresponding to the current state as the target congestion window adjustment strategy of the current transmission link.
4. The method of claim 3, wherein the adjusting the current congestion window value of the current transmission link at the current moment by using the target congestion window adjustment strategy comprises:
acquiring a current congestion window value of the current transmission link at the current moment;
Calculating a congestion window adjustment value at the next moment based on the current congestion window value and the target congestion window adjustment strategy;
And adjusting the current congestion window value by using the congestion window adjustment value at the next moment.
5. The method of claim 3, wherein the plurality of preset transmission link states includes a start transmission state, an about-to-be-full state, and a packet loss state;
the congestion window adjustment strategy corresponding to the start transmission state is that the congestion window adjustment value at the next moment is twice the current congestion window value;
the congestion window adjustment strategy corresponding to the about-to-be-full state is that the congestion window adjustment value at the next moment is determined based on the sum of the current congestion window value and the congestion window prediction value of the current transmission link;
the congestion window adjustment strategy corresponding to the packet loss state is that the congestion window adjustment value at the next moment is one half of the current congestion window value.
6. The method of claim 5, wherein the method further comprises:
acquiring the current packet loss rate of the target data stream and a pre-constructed joint congestion window prediction function;
and calculating a congestion window prediction value of the current transmission link based on the current delay and the current packet loss rate by using the joint congestion window prediction function.
7. The method according to any one of claims 1-6, further comprising:
when the plurality of data streams arrive at the sender-side host from a buffer, sending flow size information of each data stream to the receiver-side host, and acquiring the remaining transmission bytes of each data stream when response information from the receiver-side host for the flow size information is received, wherein the response information indicates that the receiver-side host confirms receipt of the corresponding data stream;
ranking the priorities of the data streams in ascending order of remaining transmission bytes to obtain the priority rank corresponding to each data stream; and
fanning the data streams into the data center network in turn according to the priority ranking.
8. An active congestion control apparatus for protecting an elephant flow, applied to a sender-side host, comprising:
an acquisition module configured to acquire a current delay and a preset delay interval of a current transmission link when it is determined that a target data stream among a plurality of data streams sent to a receiver-side host through a data center network is an elephant flow; and
a processing module configured to determine a target congestion window adjustment strategy for the current transmission link based on the magnitude relation between the current delay and the preset delay interval;
the processing module being further configured to adjust the current congestion window value of the current transmission link at the current moment by using the target congestion window adjustment strategy, so as to control the sending rate of the target data stream.
9. An electronic device, comprising:
At least one processor;
a memory for storing instructions executable by the at least one processor;
wherein the at least one processor is configured to execute the instructions to implement the steps of the method according to any one of claims 1 to 7.
10. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the steps of the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411468040.9A CN119561911A (en) | 2024-10-21 | 2024-10-21 | Active congestion control method and device for protecting elephant flow |
Publications (1)
Publication Number | Publication Date |
---|---|
CN119561911A true CN119561911A (en) | 2025-03-04 |
Family
ID=94739238
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202411468040.9A Pending CN119561911A (en) | 2024-10-21 | 2024-10-21 | Active congestion control method and device for protecting elephant flow |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN119561911A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |