
CN121239603A - Link exception handling method of server and electronic equipment - Google Patents

Link exception handling method of server and electronic equipment

Info

Publication number
CN121239603A
Authority
CN
China
Prior art keywords
link
target
links
processing operation
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202511785016.2A
Other languages
Chinese (zh)
Inventor
谢路生
张秀波
曲勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd filed Critical Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202511785016.2A priority Critical patent/CN121239603A/en
Publication of CN121239603A publication Critical patent/CN121239603A/en
Pending legal-status Critical Current

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The application discloses a link anomaly processing method of a server and electronic equipment. The server comprises a plurality of link sets, the link sets are respectively composed of links of different link types, and each link set is provided with a link monitoring module. When an interrupt signal sent by a target link monitoring module is received, abnormal measurement data of the links in the target link set collected by the target link monitoring module is acquired, the interrupt signal being triggered when the target link monitoring module detects that the real-time measurement data of the links in the target link set does not meet a preset measurement requirement. The abnormal measurement data is detected according to a preset anomaly detection rule to obtain a target detection result, a target processing strategy corresponding to the target detection result is determined, and the target processing strategy is executed on at least one link in the target link set so that the real-time measurement data of the links in the target link set meets the preset measurement requirement.

Description

Link exception handling method of server and electronic equipment
Technical Field
The present application relates to the field of link detection technologies, and in particular, to a method for processing link abnormality of a server and an electronic device.
Background
In a server management platform, links such as I2C, PCIe, and USB carry key tasks ranging from low-speed management to high-speed data transmission. At present, the baseboard management controller (Baseboard Management Controller, BMC) can monitor only some of the link signals in PCIe links. Moreover, the monitoring process reveals only the number of PCIe link anomalies; the specific time at which each anomaly occurred and the abnormal signal cannot be confirmed, so faults cannot be located accurately and a corresponding processing strategy cannot be applied to a link anomaly in a timely and accurate manner. How to accurately locate link faults in a server and adopt an effective fault handling strategy is therefore a current challenge.
Disclosure of Invention
The application provides a link abnormality processing method of a server and electronic equipment, which at least solve the problem of how to accurately locate link faults in the server and adopt an effective fault processing strategy.
The application provides a link exception handling method of a server, wherein the server comprises a plurality of link sets, the link sets respectively consist of links with different link types, each link set is provided with a link monitoring module, and the link exception handling method comprises the following steps:
when an interrupt signal sent by a target link monitoring module is received, acquiring abnormal measurement data of links in a target link set collected by the target link monitoring module, wherein the interrupt signal is triggered when the real-time measurement data of the links in the target link set monitored by the target link monitoring module does not meet a preset measurement requirement;
detecting the abnormal measurement data according to a preset abnormal detection rule to obtain a target detection result, wherein the target detection result at least comprises abnormal levels of links in the target link set;
determining a target processing strategy corresponding to the target detection result according to the target detection result;
and executing the target processing strategy on at least one link in the target link set so that real-time measurement data of the links in the target link set meet the preset measurement requirement.
The application also provides a link abnormity processing device of a server, wherein the server comprises a plurality of link sets, the link sets respectively consist of links with different link types, each link set is provided with a link monitoring module, and the device comprises:
The acquisition module is used for acquiring abnormal measurement data of links in the target link set acquired by the target link monitoring module when receiving an interrupt signal sent by the target link monitoring module, wherein the interrupt signal is triggered when the target link monitoring module monitors that the real-time measurement data of the links in the target link set does not meet the preset measurement requirement;
the processing module is used for detecting the abnormal measurement data according to a preset abnormal detection rule to obtain a target detection result, wherein the target detection result at least comprises abnormal levels of links in the target link set;
The processing module is further used for determining a target processing strategy corresponding to the target detection result according to the target detection result;
The processing module is further configured to execute the target processing policy on at least one link in the target link set, so that real-time metric data of links in the target link set meets the preset metric requirement.
The application also provides electronic equipment, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for realizing the steps of the link exception processing method of any server when executing the computer program.
The present application also provides a computer readable storage medium having a computer program stored therein, wherein the computer program when executed by a processor implements the steps of any one of the above-described link exception handling methods of a server.
The application also provides a computer program product, which comprises a computer program, and the computer program realizes the steps of any one of the link exception handling methods of the server when being executed by a processor.
In the application, the server comprises a plurality of link sets, the link sets are respectively composed of links of different link types, and each link set is provided with a link monitoring module. When an interrupt signal sent by the target link monitoring module is received, abnormal measurement data of the links in the target link set collected by the target link monitoring module is acquired; the interrupt signal is triggered when the real-time measurement data of the links in the target link set monitored by the target link monitoring module does not meet the preset measurement requirement. The abnormal measurement data is detected according to the preset abnormal detection rule to obtain a target detection result, which at least comprises abnormal levels of the links in the target link set. A target processing strategy corresponding to the target detection result is determined according to the target detection result and executed on at least one link in the target link set, so that the real-time measurement data of the links in the target link set meets the preset measurement requirement. In this scheme, a dedicated link monitoring module is deployed for the links of each link type to collect standardized measurement parameters of those links and to report them to the BMC when the initial check finds an abnormality. The BMC then performs a finer detection on the measurement parameters, which allows the fault cause and fault location of each link to be determined accurately. By executing the processing strategy corresponding to the fault condition, the abnormal condition of the link can be effectively eliminated, and the self-healing capacity and service continuity of the system are improved.
Drawings
For a clearer description of embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described, it being apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.
Fig. 1 is a flowchart of a method for processing link abnormality of a server according to an embodiment of the present application;
Fig. 2 is a second flowchart of a method for processing link abnormality of a server according to an embodiment of the present application;
Fig. 3 is a block diagram of a link exception handling apparatus of a server according to an embodiment of the present application;
Fig. 4 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. Based on the embodiments of the present application, all other embodiments obtained by a person of ordinary skill in the art without making any inventive effort are within the scope of the present application.
It should be noted that in the description of the present application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "first," "second," and the like in this specification are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
It should be noted that, in the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g." in an embodiment should not be taken as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In existing server management platforms, links such as the inter-integrated circuit bus (Inter-Integrated Circuit, I2C), the high-speed serial computer expansion bus standard (Peripheral Component Interconnect Express, PCIe), and the universal serial bus (Universal Serial Bus, USB) carry key tasks ranging from low-speed management to high-speed data transmission. However, existing schemes mostly rely on the simple error counts of each component or on operating system logs to judge whether a link is abnormal. Only some of the link signals in PCIe links can be monitored and counted, and the count is kept only while the system is powered on, so it cannot be recorded over the long term. When a link anomaly occurs, the specific time of the anomaly and the abnormal signal parameters cannot be confirmed. In other words, an operating log collected over a period of time yields only the number of anomalies in that period; it cannot state precisely when each anomaly occurred, what caused the fault, or which specific link failed. When a link suffers intermittent anomalies or degradation, there is no integrated monitoring architecture that is cross-link, cross-device, measurable, and controllable, so the fault source is difficult to locate accurately, link faults and anomalies cannot be investigated effectively, and shutdown with manual intervention is often required.
In the prior art, the BMC's monitoring of the various hardware links (such as I2C, PCIe, and USB) is mainly limited to functional detection (whether the link exists and whether the device responds) and lacks real-time acquisition and trend analysis of link signal quality. Error information for these hardware links is scattered across different controller logs or operating system layers, and there is no unified acquisition standard or reporting mechanism, so the monitoring data formats differ and are difficult to compare across links.
In summary, in order to solve all or part of the above technical problems, the application provides a link anomaly processing method of a server and electronic equipment. The server comprises a plurality of link sets, the link sets are respectively composed of links of different link types, and each link set is provided with a link monitoring module. When an interrupt signal sent by the target link monitoring module is received, abnormal measurement data of links in the target link set collected by the target link monitoring module is obtained; the interrupt signal is triggered when the real-time measurement data of the links in the target link set monitored by the target link monitoring module does not meet a preset measurement requirement. The abnormal measurement data is detected according to a preset anomaly detection rule to obtain a target detection result, which at least comprises an anomaly level of the links in the target link set. A target processing strategy corresponding to the target detection result is determined according to the target detection result and executed on at least one link in the target link set, so that the real-time measurement data of the links in the target link set meets the preset measurement requirement. In this scheme, a dedicated link monitoring module is deployed for the links of each link type to collect standardized measurement parameters of those links and to report them to the BMC when the initial check finds an abnormality. The BMC then performs a finer detection on the measurement parameters, which allows the fault cause and fault location of each link to be determined accurately. By executing the processing strategy corresponding to the fault condition, the abnormal condition of the link can be effectively eliminated, and the self-healing capacity and service continuity of the system are improved.
The present application will be further described in detail below with reference to the drawings and detailed description for the purpose of enabling those skilled in the art to better understand the aspects of the present application.
As shown in fig. 1, fig. 1 is a flowchart of a method for processing link abnormality of a server according to an embodiment of the present application, where the method may include the following steps:
101. When an interrupt signal sent by the target link monitoring module is received, abnormal measurement data of links in the target link set acquired by the target link monitoring module is acquired.
In the embodiment of the present application, the server includes a plurality of link sets, where the plurality of link sets are respectively composed of links with different link types, that is, all links in the server are divided according to specific link types, and links with the same link type are divided into one link set, so that the link types of all links in one link set are the same, and the link types of links in different link sets are different. The link types may include I2C links, PCIe links, and USB links. Of course, other links may be included, without specific limitation.
It should be noted that, in order to monitor the data of each link in each link set, a link monitoring module may be configured for each link set and used to collect real-time metric data of each link in the corresponding link set. The link monitoring module may be a microcontroller unit (Microcontroller Unit, MCU) acting as a distributed edge collection and execution unit. Link monitoring modules correspond to different types of hardware communication links, that is, different link monitoring modules may be configured for link sets of different link types.
In some embodiments, for an I2C link, the link monitoring module may be an MCU-I2C monitoring module, which may be deployed in the physical path of the I2C link or accessed through a parallel listening interface, for non-invasive acquisition of serial clock line (Serial Clock Line, SCL) and serial data line (Serial Data Line, SDA) signals.
In some embodiments, for PCIe links, the link monitoring module may be an MCU-PCIe monitoring module, which may be deployed near a channel of the PCIe link and connected to a sideband interface (Side-band Interface) on the Root Port side or the endpoint side. The Root Port is the core interface connecting a CPU or chipset with downstream PCIe devices and is the starting point of the PCIe bus hierarchy. The module is used to read information such as the link training and status state machine (Link Training and Status State Machine, LTSSM) state and the error count registers.
In some embodiments, for a USB link, the link monitoring module may be an MCU-USB monitoring module, which may be deployed in the hub or port path of the USB link, for monitoring VBUS voltage, port reset (port_reset), and connection/disconnection (Connect/Disconnect) events.
In some embodiments, the link monitoring modules configured for different link sets may collect different metric data. For I2C links, the module may collect bus level stability, ACK response rate, bus occupancy rate, and the like. For PCIe links, it may collect error rate, link training status, retransmission count, and the like. For USB links, it may collect CRC error packets, packet loss rate, reset event count, error rate, protocol error count, signal jitter, bandwidth utilization, and the like.
In some embodiments, the link monitor module may also initialize internal registers and interrupt controllers prior to collecting metric data, thereby initiating the collection of metric data for the I2C link, PCIe link, and USB link.
In some embodiments, after the link monitoring module corresponding to each link set collects the metric data of the links in the link set, the metric data may be compared with the corresponding preset metric requirement, if the metric data meets the corresponding preset metric requirement, it may indicate that the links in the link set are normal, and if the metric data does not meet the corresponding preset metric requirement, it may indicate that the links in the link set are abnormal, so an interrupt signal may be triggered automatically, that is, the interrupt signal is triggered when the real-time metric data of the links in the target link set monitored by the target link monitoring module does not meet the preset metric requirement. The target link monitoring module may be any one of the link monitoring modules respectively corresponding to the respective link sets.
It should be noted that the preset measurement requirement may be evaluated by the link monitoring module. The check performed by the link monitoring module on the real-time measurement data can be understood as a primary detection process: failing the preset measurement requirement does not prove that the real-time measurement data is abnormal, only that an abnormality may exist and that further, more specific detection is required. The preset measurement requirement may be a threshold, a range, a state, or the like for the measurement data, and a separate measurement requirement may be set for each kind of measurement data.
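The primary check described above can be sketched as follows. This is a minimal illustration only: the `MetricRequirement` class, the metric names, and the numeric limits are assumptions made for the example and do not come from the application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricRequirement:
    """Hypothetical preset requirement: an allowed range or an expected state."""
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    expected_state: Optional[str] = None

    def is_met(self, value) -> bool:
        # A state requirement is an equality check; otherwise apply the bounds.
        if self.expected_state is not None:
            return value == self.expected_state
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


# Illustrative requirements for an I2C link set (limits are assumed values).
I2C_REQUIREMENTS = {
    "ack_response_rate": MetricRequirement(minimum=0.99),
    "bus_occupancy": MetricRequirement(maximum=0.80),
}


def failing_metrics(sample: dict, requirements: dict) -> list:
    """Return the names of metrics in the sample that miss their requirement."""
    return [
        name
        for name, requirement in requirements.items()
        if name in sample and not requirement.is_met(sample[name])
    ]


def should_raise_interrupt(sample: dict, requirements: dict) -> bool:
    """The primary check: any failing metric is enough to trigger the interrupt."""
    return bool(failing_metrics(sample, requirements))
```

A failing check here only means that further detection is warranted, which matches the two-stage design in which the finer analysis is left to the BMC.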
In the embodiment of the application, when the interrupt signal sent by the target link monitoring module is received, the condition that the link in the link set corresponding to the target link monitoring module is abnormal can be considered, so that the link in the link set needs to be subjected to fault investigation, and therefore, the abnormal measurement data of the link in the target link set acquired by the target link monitoring module can be acquired.
It should be noted that, after the target link monitoring module collects the real-time measurement data of the target link set, the data may be stored in a local buffer area. An interrupt signal from the target link monitoring module therefore indicates that the locally stored real-time measurement data may be abnormal, so the real-time measurement data may be obtained from the local buffer area and treated as the abnormal measurement data.
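One way to picture the local buffer area is a bounded buffer of timestamped samples. The sketch below is an assumption made for illustration: the application does not state the buffer's structure or capacity, and the `MetricBuffer` name and its methods are hypothetical.

```python
from collections import deque


class MetricBuffer:
    """Hypothetical bounded buffer of (timestamp, link_id, metrics) samples."""

    def __init__(self, capacity: int = 64):
        # A deque with maxlen silently drops the oldest sample when full.
        self._samples = deque(maxlen=capacity)

    def record(self, timestamp: float, link_id: str, metrics: dict) -> None:
        """Store one real-time sample, copying the metrics to avoid aliasing."""
        self._samples.append((timestamp, link_id, dict(metrics)))

    def snapshot(self) -> list:
        """Return the buffered samples, oldest first, for reporting."""
        return list(self._samples)
```

Keeping a timestamp with each sample is what would let the occurrence time of each anomaly be recovered later, which the background section identifies as missing from count-only monitoring.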
102. And detecting the abnormal measurement data according to a preset abnormal detection rule to obtain a target detection result.
In the embodiment of the application, the abnormal measurement data is data that the target link monitoring module has flagged as possibly abnormal after its initial check. To obtain a more accurate detection result, the abnormal measurement data may be detected again after it is obtained, using a preset anomaly detection rule, to obtain a target detection result. The target detection result at least includes an abnormal level of a link in the target link set. The abnormal level grades the severity of the anomaly: a higher level indicates a more serious fault, and a lower level indicates a milder one.
It should be noted that the preset anomaly detection rule may be a user-defined detection rule or a detection rule provided by a rule engine, such as static threshold alarming, dynamic adaptive baseline modeling, sliding window statistics, or a multi-index weighted health score (Health Score) algorithm.
103. And determining a target processing strategy corresponding to the target detection result according to the target detection result.
In the embodiment of the application, after the target detection result is obtained, the links in the target link set need to be processed to eliminate the abnormal condition, and different detection results correspond to different processing strategies. Because the target detection result at least includes the abnormal level of the links in the target link set, different abnormal levels can be understood to correspond to different processing strategies. If the fault is serious, more drastic measures need to be taken immediately, such as power cycling the link, a hot reset, or rate degradation. If the fault is only an ordinary one, or a parameter value is merely slightly high, an alarm may be issued first or a milder processing strategy may be adopted.
It should be noted that a correspondence exists between detection results and processing strategies. The correspondence may be pre-stored, may be derived from historical fault detection and historical fault handling, may be specified by a user, or may be determined in other manners; the embodiment of the present application is not specifically limited in this respect.
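A pre-stored correspondence of this kind can be sketched as a lookup table. The level names, the action identifiers, and the fallback behaviour below are all illustrative assumptions; the application names a general alarm level and mentions power cycling, resetting, and rate degradation for serious faults, but does not define a concrete table.

```python
from enum import IntEnum


class AnomalyLevel(IntEnum):
    """Hypothetical anomaly levels; a higher value means a more serious fault."""
    GENERAL = 1
    SERIOUS = 2


# Hypothetical pre-stored correspondence between levels and processing strategies.
STRATEGY_TABLE = {
    AnomalyLevel.GENERAL: ("output_alarm",),
    AnomalyLevel.SERIOUS: ("output_alarm", "rate_downgrade", "link_reset", "power_cycle"),
}


def select_strategy(level) -> tuple:
    """Look up the strategy for a level, falling back to alarm-only if unknown."""
    return STRATEGY_TABLE.get(level, ("output_alarm",))
```

Falling back to the least disruptive action for an unrecognised level is a design choice of this sketch, not something the application specifies.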
104. And executing a target processing strategy on at least one link in the target link set so that the real-time measurement data of the links in the target link set meet the preset measurement requirement.
In the embodiment of the present application, after a specific target processing policy is determined, the target processing policy may be executed. Because a target link set may include multiple links, not all of which are abnormal, the target processing policy may be executed on at least one link in the target link set. The at least one link may be determined from the anomaly metric data: the policy is executed on whichever links the anomaly metric data indicates are abnormal, so that the real-time metric data of the links in the processed target link set meets the preset metric requirement and the abnormal condition of the target link set is eliminated.
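Restricting the policy to the links that the anomaly metric data points at can be sketched as below. The data layout (a mapping from link identifier to that link's abnormal metrics) and both function names are assumptions made for illustration.

```python
def links_with_anomalies(anomaly_metric_data: dict) -> list:
    """Return the identifiers of links that have at least one abnormal metric.

    anomaly_metric_data is assumed to map link_id -> {metric_name: value}.
    """
    return sorted(link for link, metrics in anomaly_metric_data.items() if metrics)


def execute_on_abnormal_links(action, anomaly_metric_data: dict) -> list:
    """Apply the chosen action to each abnormal link and return the links handled."""
    handled = []
    for link in links_with_anomalies(anomaly_metric_data):
        action(link)  # e.g. a callable that resets or downgrades the link
        handled.append(link)
    return handled
```

Links with no abnormal metrics are left untouched, so healthy links in the same set are not disturbed by the recovery action.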
In the embodiment of the application, the special link monitoring module is deployed aiming at each link type of link to acquire the standardized measurement parameters of the link, and the standardized measurement parameters are reported to the BMC under the condition of initial detection abnormality, so that the BMC carries out finer detection on the measurement parameters, thereby determining the accurate fault cause and fault location of each link, and the abnormal condition of the link can be effectively eliminated by executing the processing strategy corresponding to the fault condition, and the self-healing capacity and service continuity of the system are improved.
As shown in fig. 2, fig. 2 is another flowchart of a method for processing link exception of a server according to an embodiment of the present application, where the method may include the following steps:
201. And when receiving the interrupt signal sent by the target link monitoring module, sending a data reporting instruction to the target link monitoring module.
In the embodiment of the application, the interrupt signal only indicates that a link in the target link set may be abnormal; it does not identify the specific faulty link, the fault cause, and so on. To determine the fault condition more specifically and accurately, the real-time measurement data of the links can be taken into account. Because that data is collected by the target link monitoring module, the module can be asked to report it: a data reporting instruction may be sent to the target link monitoring module, instructing it to send the abnormal real-time measurement data corresponding to the links in the target link set.
202. And responding to the data reporting instruction, receiving real-time measurement data of links in the target link set sent by the target link monitoring module, and obtaining abnormal measurement data.
In the embodiment of the application, after the target link monitoring module receives the data reporting instruction, it may summarize the abnormal real-time measurement data it has collected for the links in the target link set and send the summary to the BMC. The BMC thus receives, in response to the data reporting instruction, the real-time measurement data of the links in the target link set sent by the target link monitoring module, and obtains the abnormal measurement data.
In some embodiments, the data interaction between the BMC and the target link monitoring module may be implemented through a management bus, which may include an I2C bus or an SPI bus.
In some embodiments, not all of the real-time measurement data collected by the target link monitoring module for the links in the target link set is abnormal. Among the links in the target link set, the measurement data of some links may be entirely normal while that of others is abnormal, and even a link with abnormal measurement data may also have some normal measurement data. Therefore, when reporting to the BMC, the target link monitoring module may report only the measurement data that is abnormal, or it may report all measurement data of every link that has any abnormal measurement data; the embodiment of the present application is not specifically limited in this respect.
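The two reporting options can be sketched with a single flag. The function name, the `include_all` parameter, and the data layout are assumptions made for the example.

```python
def build_report(samples: dict, failing: dict, include_all: bool = False) -> dict:
    """Assemble the data reported to the BMC.

    samples: link_id -> {metric_name: value} for every link in the set.
    failing: link_id -> set of metric names that missed their requirement.
    include_all: False reports only the abnormal metrics; True reports every
                 metric of each link that has at least one abnormal metric.
    """
    report = {}
    for link_id, bad_names in failing.items():
        if not bad_names:
            continue  # links with no abnormal metric are never reported
        metrics = samples[link_id]
        if include_all:
            report[link_id] = dict(metrics)
        else:
            report[link_id] = {name: metrics[name] for name in bad_names}
    return report
```

Reporting only the abnormal metrics keeps traffic on the management bus small, while reporting all metrics of an affected link gives the BMC more context for its finer detection.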
203. And detecting the abnormal measurement data according to a preset abnormal detection rule to obtain a target detection result.
In the embodiment of the present application, for the description of step 203, please refer to the detailed description of step 102 in the above embodiment, which is not repeated here.
In some embodiments, detecting the abnormal measurement data according to a preset abnormal detection rule to obtain a target detection result may specifically include detecting the abnormal measurement data according to a preset multi-level threshold to obtain a target detection result, where the target detection result is used to indicate an abnormal level where the abnormal measurement data is located.
It should be noted that, in order to improve the accuracy of the target detection result, fixed thresholds may be used to detect the abnormal measurement data. Multiple thresholds may be set, dividing the value range of a measurement parameter into multiple ranges, with different ranges corresponding to different abnormal levels. The abnormal level of an abnormal measurement parameter is then determined by the range in which its value falls, that is, by how its value compares with each threshold.
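Multi-level threshold classification can be sketched with a sorted threshold list and a binary search. The threshold values and the level names other than the general alarm are assumptions made for the example; the metric is imagined to be an error rate where larger is worse.

```python
from bisect import bisect_right

# Assumed thresholds for an error-rate metric, in ascending order.
THRESHOLDS = [1e-12, 1e-9, 1e-6]
# One more level than thresholds: N thresholds divide the axis into N + 1 ranges.
LEVELS = ["normal", "general", "serious", "critical"]


def classify(value: float, thresholds=THRESHOLDS, levels=LEVELS) -> str:
    """Return the level of the range in which the value falls.

    bisect_right counts the thresholds that are <= value, which is exactly
    the index of the range the value belongs to.
    """
    return levels[bisect_right(thresholds, value)]
```

With `bisect_right`, a value exactly equal to a threshold is assigned to the higher level; using `bisect_left` instead would assign it to the lower one.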
In some embodiments, detecting the abnormal measurement data according to a preset abnormal detection rule to obtain a target detection result, which specifically includes calculating a weighted score corresponding to the abnormal measurement data according to a preset weight corresponding to each measurement data to obtain an abnormal score of a link in a target link set, and determining the target detection result according to a score range in which the abnormal score is located.
It should be noted that a multi-index weighted health scoring mode may be adopted when detecting the abnormal metric data, because more than one kind of metric data may be collected for a link. A weight may be set for each kind of metric data, and the values of the metric data are then weighted and summed to calculate the abnormal score of each link in the target link set. To interpret the abnormal score, multiple scoring ranges may be preset, each corresponding to a different abnormal level; the abnormal level of a link in the target link set is determined by the scoring range in which its calculated abnormal score falls.
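The weighted scoring can be sketched as follows. The weights, the score ranges, the level names, and the assumption that each metric has already been normalized to the interval [0, 1] (with 1 the worst) are all illustrative and not taken from the application.

```python
# Assumed weights for normalized PCIe metrics; they sum to 1.
WEIGHTS = {"error_rate": 0.5, "retransmissions": 0.3, "training_failures": 0.2}

# Assumed score ranges: (exclusive upper bound, level), in ascending order.
SCORE_RANGES = [(0.3, "normal"), (0.7, "general"), (float("inf"), "serious")]


def anomaly_score(metrics: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of normalized metric values; missing metrics count as 0."""
    return sum(weight * metrics.get(name, 0.0) for name, weight in weights.items())


def level_for_score(score: float, ranges=SCORE_RANGES) -> str:
    """Return the level of the first range whose upper bound exceeds the score."""
    for upper_bound, level in ranges:
        if score < upper_bound:
            return level
    return ranges[-1][1]
```

For example, a link with a normalized error rate of 1.0, retransmissions of 0.5, and no training failures scores 0.5 + 0.15 = 0.65 under these assumed weights, which falls in the "general" range.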
In some embodiments, the abnormal metric data may also be detected by dynamic adaptive baseline modeling. The metric data may include multiple items that are correlated with one another, and a single link failure may be reflected in the values of several metric parameters at the same time. A dynamic baseline may therefore be set based on multiple metric parameters, so that the correlation between the metric parameters and the combined influence of all abnormal metric parameters on the link are both taken into account.
In some embodiments, sliding window statistics may further be used when detecting the abnormal metric data. The target link monitoring module may collect and record the real-time metric data of the links in the target link set in time order, which can be understood as a metric data sequence. When the abnormal metric data is detected, a preset window may therefore be slid over the metric data sequence, and the average value of the metric data within the window may be checked to obtain the target detection result.
In the embodiment of the application, the BMC may detect the abnormal measurement data using any of several preset detection rules: hierarchical detection against preset multi-level thresholds, a weighted calculation based on parameter weights, or other approaches, thereby improving the accuracy of link fault detection.
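The sliding window statistic can be sketched in Python as follows. The function names, the window length, and the limit are assumptions for illustration only.

```python
from collections import deque


def sliding_means(samples, window):
    """Yield the mean of every full window over time-ordered samples."""
    if window <= 0:
        raise ValueError("window must be positive")
    buf = deque(maxlen=window)
    total = 0.0
    for value in samples:
        if len(buf) == window:
            total -= buf[0]  # the oldest sample is about to be evicted
        buf.append(value)
        total += value
        if len(buf) == window:
            yield total / window


def window_is_abnormal(samples, window, limit):
    """True if any window average exceeds the assumed limit."""
    return any(mean > limit for mean in sliding_means(samples, window))
```

Averaging over a window means that a single spike does not necessarily raise an alarm: with the samples 1, 1, 9, 1 and a limit of 4, a window of 2 is flagged while a window of 4 is not.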
204. When the target detection result indicates that the abnormal level of the link in the target link set is a general alarm, determining that the target processing strategy is to output alarm information.
In the embodiment of the application, after the target detection result is obtained, the corresponding target processing strategy can be determined according to different abnormal levels indicated by the target detection result.
It should be noted that, if the target detection result indicates that the abnormal level of the link in the target link set is a general alarm, the abnormal condition of the current link is not serious, so no more drastic or larger-scale operation is required. The target processing policy can be determined to be outputting alarm information, where the alarm information is used to instruct a worker to inspect the link in the target link set.
It should be noted that general alarms can generally be divided into two types: general early warnings (trend abnormality) and secondary alarms (threshold out-of-range). A general early warning means that the current abnormal measurement data does not strictly fall within an abnormal range; it merely fluctuates relative to earlier measurement data, or it is about to enter an abnormal range. General early warning information can therefore be output to prompt operation and maintenance personnel to pay attention. A secondary alarm means that the current abnormal measurement data slightly exceeds the threshold range but the condition is not serious and link operation is still within a controllable range. Secondary alarm information can therefore be output to prompt operation and maintenance personnel to perform simple handling in time.
205. When the target detection result indicates that the abnormal level of the link in the target link set is a general alarm, outputting the alarm information.
206. When the target detection result indicates that the abnormal level of the link in the target link set is a serious alarm, determining a processing operation sequence corresponding to the abnormal metric data of the link in the target link set.
In the embodiment of the application, after the target detection result is obtained, the corresponding target processing strategy can be determined according to different abnormal levels indicated by the target detection result.
It should be noted that, if the target detection result indicates that the abnormal level of the link in the target link set is a serious alarm, the abnormal condition of the current link is serious and a corresponding processing policy needs to be adopted immediately to eliminate the abnormality. A corresponding processing operation sequence can therefore be determined according to the seriously abnormal measurement data.
It should be noted that the processing operation sequence includes at least one processing operation, and the processing operations may be arranged in a certain order and applied to the abnormal link one after another.
The processing operations may include, among other things, link power up/down, hot reset, rate degradation, isolation, etc.
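The selection of a processing strategy by abnormal level can be sketched in Python as below. The metric names, the mapping from an abnormal metric to an operation sequence, and the ordering of the operations are hypothetical; the application only states that a sequence is determined from the abnormal metric data.

```python
# Hypothetical mapping from the dominant abnormal metric to an ordered
# list of processing operations (names and ordering are assumptions).
SEQUENCES = {
    "bit_error_rate": ["rate degradation", "hot reset", "power cycle", "isolation"],
    "link_down_count": ["hot reset", "power cycle", "isolation"],
}
DEFAULT_SEQUENCE = ["hot reset", "isolation"]


def select_policy(level, abnormal_metric=None):
    """Return (strategy, operations) for the given abnormal level."""
    if level == "general alarm":
        # Not serious: only notify operation and maintenance personnel.
        return ("output alarm", [])
    if level == "severe alarm":
        # Serious: pick the operation sequence matching the abnormal metric.
        ops = SEQUENCES.get(abnormal_metric, DEFAULT_SEQUENCE)
        return ("run sequence", list(ops))
    return ("no action", [])
```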
207. When the target detection result indicates that the abnormal level of the link in the target link set is a serious alarm, each processing operation included in the processing operation sequence is sequentially executed on the link in the target link set according to the processing operation sequence, so that real-time measurement data of the link in the target link set meets the preset measurement requirement.
In the embodiment of the application, after the processing operation sequence is determined, each processing operation included in the sequence can be executed in order on the links in the target link set. For example, if the processing operation sequence is speed limiting, soft reset, cold reset/power-down, bypass/isolation, then the speed limiting operation is executed on the abnormal link first, followed by the soft reset operation, then the cold reset/power-down operation, and finally the bypass/isolation operation.
In some embodiments, sequentially executing each processing operation included in the processing operation sequence on the links in the target link set according to the processing operation sequence may specifically include: executing a first processing operation included in the processing operation sequence on the links in the target link set, and detecting real-time metric data of the links in the target link set; and, if the real-time metric data is detected to still be abnormal metric data, executing a second processing operation included in the processing operation sequence on the links in the target link set.
It should be noted that, while the processing operations are executed in the order given by the processing operation sequence, abnormality detection may be performed on the abnormal link each time a processing operation completes. A relatively minor link failure may be cleared by a single processing operation, so the remaining operations in the sequence need not be executed, whereas a more serious link failure may need several operations before the link returns to normal. Accordingly, the first processing operation, that is, the operation ranked first in the processing operation sequence, may be executed on the links in the target link set, and the real-time measurement data of those links may be detected again once it completes. If the real-time measurement data is normal, the abnormal link has recovered and no subsequent processing operation needs to be executed. If the real-time measurement data is still abnormal, the second processing operation may be executed, that is, the operation ranked immediately after the first processing operation and before all other processing operations in the sequence. This continues until the real-time metric data of the links in the target link set is detected as normal data, or until every processing operation in the sequence of processing operations has been performed.
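The execute-then-recheck loop described above can be sketched in Python. The callables apply_op and is_healthy, and the FakeLink class used to exercise them, are hypothetical stand-ins for the real link control and detection interfaces.

```python
def run_sequence(operations, apply_op, is_healthy):
    """Apply operations in order, re-checking the link after each one.

    Returns (recovered, executed): whether the link became healthy and
    which operations were actually executed.
    """
    executed = []
    for op in operations:
        apply_op(op)
        executed.append(op)
        if is_healthy():
            # Recovered: the remaining operations are skipped.
            return True, executed
    return False, executed


class FakeLink:
    """Simulated link that recovers once a given operation is applied."""

    def __init__(self, fixed_by):
        self.fixed_by = fixed_by
        self.healthy = False

    def apply(self, op):
        if op == self.fixed_by:
            self.healthy = True


SEQUENCE = ["speed limit", "soft reset", "cold reset", "isolation"]
link = FakeLink(fixed_by="soft reset")
recovered, executed = run_sequence(SEQUENCE, link.apply, lambda: link.healthy)
```

In this simulated run the link recovers after the soft reset, so the cold reset and isolation are never executed.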
In some embodiments, sequentially executing the processing operations included in the processing operation sequence on the links in the target link set according to the processing operation sequence may specifically include: determining a historical processing operation for the links in the target link set; and, starting from a target processing operation in the processing operation sequence, sequentially executing the processing operations included in the sequence on the links in the target link set, where the target processing operation is the operation immediately preceding the historical processing operation in the processing operation sequence.
It should be noted that a link in the target link set may become abnormal more than once. If the link also became abnormal during its historical operation and was restored to normal by the processing operations in the processing operation sequence, that earlier handling can be used as a reference when the abnormality occurs again. That is, the historical processing operation for the link in the target link set can be obtained, which is the processing operation after which the link with abnormal metric data returned to normal. In other words, when the abnormal link was previously processed operation by operation according to the sequence, the historical processing operation was the one that had an obvious recovery effect. When the link becomes abnormal again, execution may therefore start from the processing operation immediately preceding the historical processing operation in the processing operation sequence.
For example, assume that when link A became abnormal last week it returned to normal only after speed limiting, a soft reset, and a cold reset had been performed in turn. The historical processing operation can then be taken to be the cold reset. If link A becomes abnormal again, processing may start directly from the soft reset, and speed limiting need not be performed.
In the embodiment of the application, specific link faults can be divided into different abnormal levels, so that a corresponding processing strategy is determined and executed and a corresponding processing operation sequence is matched automatically. Automatic degradation, recovery, and isolation of link faults are thereby realized, fault response time is significantly shortened, dependence on manual intervention is reduced, a transition from "passive response" to "active prevention" is achieved, and self-healing capability and service continuity are improved.
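The choice of starting point based on the historical processing operation can be sketched in Python. The function name and the fallback to the full sequence when no history exists are assumptions.

```python
def remaining_operations(sequence, historical_op=None):
    """Return the operations to run, starting one step before the
    operation that restored the link last time."""
    if historical_op not in sequence:
        # No usable history: run the whole sequence from the start.
        return list(sequence)
    start = max(sequence.index(historical_op) - 1, 0)
    return list(sequence[start:])


SEQUENCE = ["speed limit", "soft reset", "cold reset", "isolation"]
plan_for_link_a = remaining_operations(SEQUENCE, historical_op="cold reset")
```

With the cold reset as the historical processing operation, the plan starts at the soft reset and skips speed limiting, matching the link A example.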
208. Sending a data reporting instruction to the link monitoring module corresponding to each link set according to a preset period.
209. Receiving the real-time measurement data corresponding to the links in each link set, sent respectively by each link monitoring module.
210. Detecting the real-time measurement data to obtain the periodic detection result of the links in each link set.
In the embodiment of the application, the BMC can determine that an abnormal link exists when it receives an interrupt signal, and it can also actively request the link monitoring module corresponding to each link set to report real-time measurement data at a preset period, thereby implementing periodic polling. A data reporting instruction can be sent to the link monitoring module corresponding to each link set according to the preset period. After receiving the data reporting instruction, each link monitoring module reports the real-time measurement data corresponding to the links in its link set, and the BMC detects the real-time measurement data according to the preset abnormal detection rule to obtain a periodic detection result for the links in each link set.
In some embodiments, the preset period may be a fixed time interval, or the link monitoring module may be required to report data at a specified time. For example, real-time measurement data may be reported each day when the links are idle, such as between 2 a.m. and 4 a.m.
In the embodiment of the application, the BMC can not only learn of a link abnormality by receiving an interrupt signal from a link monitoring module, but also actively poll the link monitoring module corresponding to each link set. This avoids link abnormalities going unnoticed when a link monitoring module fails and cannot collect or report data, and effectively improves the predictability and long-term stability of the system.
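One polling round can be sketched in Python as follows. The monitors are modeled as hypothetical callables that return metric data or raise OSError when a module does not respond; the detect callable stands in for the preset abnormal detection rule. A caller would invoke poll_once once per preset period.

```python
def poll_once(monitors, detect):
    """Request data from every link monitoring module and detect it.

    monitors maps a link set name to a callable returning its metrics.
    Returns a dict of link set name -> periodic detection result.
    """
    results = {}
    for link_set, read_metrics in monitors.items():
        try:
            metrics = read_metrics()
        except OSError:
            # A silent or failed monitor is itself reported, so a broken
            # module cannot hide a link abnormality.
            results[link_set] = "monitor unreachable"
            continue
        results[link_set] = detect(metrics)
    return results


def _dead_monitor():
    raise TimeoutError("no reply")  # TimeoutError is a subclass of OSError


monitors = {
    "pcie": lambda: {"errors": 3},
    "nvlink": lambda: {"errors": 120},
    "cxl": _dead_monitor,
}
report = poll_once(monitors, lambda m: "abnormal" if m["errors"] > 100 else "normal")
```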
In some embodiments, the BMC may also automatically trigger a link health self-test, specifically by means of Loopback test or throughput test.
The loopback test returns the signal to the transmitting end along its original path, forming a closed loop, by shorting the transmitting and receiving ends of the communication device or line. Its core purpose is to verify the integrity of data transmission and the correctness of device functions without depending on the external network environment. For example, if a data packet sent to the loopback interface is received back immediately and passes a data consistency check, the internal communication link of the device is normal.
The throughput test measures the amount of data the system successfully processes per unit time and reflects the limit processing capacity of the system. Specific metrics may include: network throughput, which reflects the maximum transmission rate of a device under test (e.g., router, switch) or link (e.g., Wi-Fi, fiber); storage throughput, which evaluates the read and write performance of a storage device (e.g., hard disk, SSD); and application layer throughput, which tests the traffic handling capacity of a software system (e.g., database, web service).
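The send-and-compare idea behind a loopback test can be sketched in Python. This is only a software analogy: a local socket pair stands in for the shorted transmit and receive ends, and the corrupt flag is a hypothetical way to simulate a faulty path.

```python
import socket


def recv_exact(sock, n):
    """Read exactly n bytes from sock."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed before all data arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def loopback_ok(payload, corrupt=False):
    """Send payload, loop it back, and verify it returns unchanged."""
    tx, far_end = socket.socketpair()
    try:
        tx.settimeout(1.0)
        far_end.settimeout(1.0)
        tx.sendall(payload)
        echoed = recv_exact(far_end, len(payload))
        if corrupt and echoed:
            # Flip the first byte to simulate a damaged link.
            echoed = bytes([echoed[0] ^ 0xFF]) + echoed[1:]
        far_end.sendall(echoed)  # the "shorted" return path
        return recv_exact(tx, len(payload)) == payload
    finally:
        tx.close()
        far_end.close()
```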
In some embodiments, the BMC and the link monitoring module corresponding to each link set store the metric data after the metric data is collected.
In some embodiments, the BMC may be equipped with a local time series database for persisting historical monitoring data, and may support exporting the data in JSON or CSV format through a Web interface or a standard interface (e.g., a RESTful API or Redfish extensions) for visual presentation and remote access.
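Exporting stored monitoring records to JSON and CSV can be sketched with the Python standard library. The record fields (timestamp, link, metric, value) are an assumed schema; the application does not define one.

```python
import csv
import io
import json

FIELDS = ["timestamp", "link", "metric", "value"]  # assumed record schema


def to_json(records):
    """Serialize the records as a JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize the records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


records = [
    {"timestamp": 1764288000, "link": "pcie0", "metric": "retrain_count", "value": 3},
    {"timestamp": 1764288060, "link": "pcie0", "metric": "retrain_count", "value": 5},
]
csv_text = to_csv(records)
json_text = to_json(records)
```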
In some embodiments, the BMC integrates a security module, supports digital signature verification of control instructions, certificate authentication of MCU nodes, firmware integrity verification (e.g. based on SHA-256 hash), and anti-replay protection of communication processes, and ensures the integrity and credibility of system operation.
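A SHA-256 based firmware integrity check can be sketched with the Python standard library. How the expected digest is obtained and protected (for example, through a signed manifest) is outside this sketch, and the function names are assumptions.

```python
import hashlib
import hmac


def firmware_digest(image: bytes) -> str:
    """Return the SHA-256 digest of a firmware image as hex text."""
    return hashlib.sha256(image).hexdigest()


def firmware_is_intact(image: bytes, expected_hex: str) -> bool:
    """Compare the image digest with the expected digest.

    hmac.compare_digest performs a constant-time comparison.
    """
    return hmac.compare_digest(firmware_digest(image), expected_hex.lower())


# SHA-256 of the empty byte string, used here as a known reference value.
EMPTY_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
```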
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment.
As shown in fig. 3, an embodiment of the present application further provides a link anomaly handling device of a server, where the server includes a plurality of link sets, each of the link sets is composed of links of different link types, and each of the link sets is configured with a link monitoring module, and the link anomaly handling device of the server may include:
the acquiring module 301 is configured to acquire abnormal metric data of links in the target link set acquired by the target link monitoring module when an interrupt signal sent by the target link monitoring module is received, where the interrupt signal is triggered when the real-time metric data of the links in the target link set monitored by the target link monitoring module does not meet a preset metric requirement;
the processing module 302 is configured to detect the abnormal metric data according to a preset abnormal detection rule, so as to obtain a target detection result, where the target detection result at least includes an abnormal level of a link in the target link set;
the processing module 302 is further configured to determine a target processing policy corresponding to the target detection result according to the target detection result;
The processing module 302 is further configured to execute a target processing policy on at least one link in the target link set, so that real-time metric data of the links in the target link set meets a preset metric requirement.
In some embodiments, the processing module 302 is specifically configured to, when receiving an interrupt signal sent by the target link monitoring module, send a data reporting instruction to the target link monitoring module, where the data reporting instruction is configured to instruct the target link monitoring module to send real-time metric data corresponding to a link in the target link set, where the real-time metric data is abnormal;
The processing module 302 is specifically configured to receive, in response to the data reporting instruction, real-time metric data of links in the target link set sent by the target link monitoring module, and obtain abnormal metric data.
In some embodiments, the processing module 302 is specifically configured to detect the abnormal metric data according to a preset multi-level threshold, so as to obtain a target detection result, where the target detection result is used to indicate an abnormal level of the abnormal metric data.
In some embodiments, the processing module 302 is specifically configured to calculate, according to weights corresponding to preset respective metric data, a weighted score corresponding to the abnormal metric data, so as to obtain an abnormal score of a link in the target link set;
The processing module 302 is specifically configured to determine a target detection result according to a scoring range in which the abnormal score is located.
In some embodiments, the processing module 302 is specifically configured to determine that the target processing policy is to output alarm information when the target detection result indicates that the abnormal level of the link in the target link set is a general alarm, where the alarm information is used to instruct a worker to detect the link in the target link set;
the processing module 302 is specifically configured to determine a sequence of processing operations corresponding to the abnormal metric data of the links in the target link set when the target detection result indicates that the abnormal level of the links in the target link set is a severe alarm.
In some embodiments, the processing module 302 is specifically configured to, when the target detection result indicates that the abnormal level of the link in the target link set is a severe alarm, sequentially execute, according to the processing operation sequence, each processing operation included in the processing operation sequence on the link in the target link set, so that real-time metric data of the link in the target link set meets a preset metric requirement.
In some embodiments, the processing module 302 is specifically configured to perform, according to a sequence of processing operations, a first processing operation included in the sequence of processing operations on links in the target link set, and detect real-time metric data of links in the target link set, where the first processing operation is ranked earlier in the sequence of processing operations than other processing operations are ranked in the sequence of processing operations;
The processing module 302 is specifically configured to execute, on the links in the target link set, a second processing operation included in the processing operation sequence if the real-time metric data is detected to be still abnormal metric data, where the second processing operation is ranked later in the processing operation sequence than the first processing operation, and the second processing operation is ranked earlier in the processing operation sequence than other processing operations except the first processing operation and the second processing operation.
In some embodiments, the processing module 302 is specifically configured to determine a history processing operation for a link in the target link set, where the history processing operation is a processing operation corresponding to when the link in the target link set with the abnormal metric data returns to normal;
The processing module 302 is specifically configured to sequentially execute, starting from a target processing operation in the processing operation sequence, each processing operation included in the processing operation sequence on the links in the target link set, where the target processing operation immediately precedes the historical processing operation in the processing operation sequence.
In some embodiments, the processing module 302 is further configured to send a data reporting instruction to the link monitoring module corresponding to each link set according to a preset period;
the processing module 302 is further configured to receive real-time metric data corresponding to links in each link set sent by each link monitoring module respectively;
The processing module 302 is further configured to detect the real-time metric data, and obtain a cycle detection result of the links in each link set.
In the embodiment of the present application, the description of the features in the embodiment corresponding to the link exception handling apparatus of the server may refer to the related description of the embodiment corresponding to the link exception handling method of the server, which is not described herein in detail.
As shown in fig. 4, an embodiment of the present application further provides an electronic device, including a memory 401 and a processor 402, where the memory 401 stores a computer program, and the processor 402 is configured to execute the computer program to perform steps in any of the embodiments of the method for processing link abnormality of a server described above.
The embodiment of the present application also provides a computer readable storage medium having a computer program stored therein, wherein the computer program is configured to execute the steps in the link abnormality processing method embodiment of any one of the servers described above when running.
In an exemplary embodiment, the computer readable storage medium may include, but is not limited to, a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium in which a computer program may be stored.
The embodiment of the application also provides a computer program product, which comprises a computer program, and the computer program realizes the steps in the embodiment of the link exception handling method of any one of the servers when being executed by a processor.
Embodiments of the present application also provide another computer program product, including a non-volatile computer readable storage medium, where the non-volatile computer readable storage medium stores a computer program, where the computer program when executed by a processor implements the steps in any of the embodiments of the method for processing link anomalies for a server described above.
Those of skill in the art would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The link exception handling method of a server and the electronic equipment provided by the present application have been described in detail above. The principles and embodiments of the present application have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present application and its core ideas. It should be noted that those skilled in the art can make various modifications and adaptations of the application without departing from its principles, and these modifications and adaptations are intended to be within the scope of the application as defined in the following claims.

Claims (10)

1. A method for processing link abnormality of a server, wherein the server includes a plurality of link sets, the plurality of link sets are respectively composed of links of different link types, each link set is configured with a link monitoring module, the method comprises:
When an interrupt signal sent by a target link monitoring module is received, abnormal measurement data of links in a target link set acquired by the target link monitoring module is acquired, wherein the interrupt signal is triggered when the real-time measurement data of the links in the target link set monitored by the target link monitoring module does not meet the preset measurement requirement;
detecting the abnormal measurement data according to a preset abnormal detection rule to obtain a target detection result, wherein the target detection result at least comprises abnormal levels of links in the target link set;
Determining a target processing strategy corresponding to the target detection result according to the target detection result;
And executing the target processing strategy on at least one link in the target link set so that real-time measurement data of the links in the target link set meet the preset measurement requirement.
2. The method according to claim 1, wherein the obtaining abnormal metric data of links in the target link set collected by the target link monitoring module when receiving the interrupt signal sent by the target link monitoring module includes:
When receiving an interrupt signal sent by the target link monitoring module, sending a data reporting instruction to the target link monitoring module, wherein the data reporting instruction is used for indicating the target link monitoring module to send real-time measurement data with abnormality corresponding to a link in the target link set;
And responding to the data reporting instruction, and receiving real-time measurement data of links in the target link set sent by the target link monitoring module to obtain the abnormal measurement data.
3. The method of claim 1, wherein detecting the anomaly metric data according to a preset anomaly detection rule to obtain a target detection result comprises:
and detecting the abnormal measurement data according to a preset multi-level threshold value to obtain the target detection result, wherein the target detection result is used for indicating the abnormal level of the abnormal measurement data.
4. The method of claim 1, wherein detecting the anomaly metric data according to a preset anomaly detection rule to obtain a target detection result comprises:
Calculating a weighted score corresponding to the abnormal measurement data according to the preset weight corresponding to each measurement data, and obtaining an abnormal score of a link in the target link set;
and determining the target detection result according to the scoring range of the abnormal score.
5. The method of claim 1, wherein the determining a target processing policy corresponding to the target detection result according to the target detection result comprises:
When the target detection result indicates that the abnormal level of the link in the target link set is a general alarm, determining that the target processing strategy is to output alarm information, wherein the alarm information is used for instructing a worker to inspect the link in the target link set;
And when the target detection result indicates that the abnormal level of the link in the target link set is a serious alarm, determining a processing operation sequence corresponding to the abnormal measurement data of the link in the target link set.
6. The method of claim 5, wherein said executing the target processing policy on at least one link in the set of target links such that real-time metric data for links in the set of target links meets the preset metric requirement comprises:
When the target detection result indicates that the abnormal level of the link in the target link set is a serious alarm, according to the processing operation sequence, each processing operation included in the processing operation sequence is sequentially executed on the link in the target link set, so that real-time measurement data of the link in the target link set meets the preset measurement requirement.
7. The method of claim 6, wherein sequentially performing the processing operations included in the sequence of processing operations on links in the target link set according to the sequence of processing operations comprises:
Executing a first processing operation included in the processing operation sequence on links in the target link set according to the processing operation sequence, and detecting real-time measurement data of the links in the target link set, wherein the first processing operation is ordered earlier in the processing operation sequence than other processing operations in the processing operation sequence;
and if the real-time metric data is detected to be still abnormal metric data, executing a second processing operation included in the processing operation sequence on the links in the target link set, wherein the ordering of the second processing operation in the processing operation sequence is later than the ordering of the first processing operation in the processing operation sequence, and the ordering of the second processing operation in the processing operation sequence is earlier than the ordering of other processing operations except the first processing operation and the second processing operation in the processing operation sequence.
8. The method of claim 6, wherein sequentially performing the processing operations included in the sequence of processing operations on links in the target link set according to the sequence of processing operations comprises:
Determining historical processing operation aiming at links in the target link set, wherein the historical processing operation is corresponding processing operation when links in the target link set with abnormal measurement data are recovered to be normal;
and according to a target processing operation in the processing operation sequence, sequentially executing the processing operations included in the processing operation sequence on the links in the target link set, wherein the order of the target processing operation in the processing operation sequence is positioned at the last position of the order of the historical processing operation in the processing operation sequence.
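A minimal Python sketch of the history-based starting point in claim 8 follows. The claim places the target processing operation "at the last position" relative to the historical processing operation; the sketch reads this as the position immediately before it and exposes that reading as the `offset` parameter, which is an assumption. The function name `ops_from_history` and the operation names are likewise hypothetical.

```python
from typing import List, Optional, Sequence


def ops_from_history(sequence: Sequence[str],
                     historical: Optional[str],
                     offset: int = -1) -> List[str]:
    """Return the operations to run, beginning at the target operation.

    The target operation sits `offset` positions away from the historical
    operation (assumed default: -1, the operation just before it). If no
    usable history exists, the whole sequence is returned.
    """
    if historical is None or historical not in sequence:
        return list(sequence)
    start = sequence.index(historical) + offset
    start = max(0, min(start, len(sequence) - 1))  # clamp into the sequence
    return list(sequence[start:])


SEQUENCE = ["retrain", "reset", "power_cycle"]  # hypothetical operations
```

For example, if `power_cycle` restored the link last time, `ops_from_history(SEQUENCE, "power_cycle")` skips `retrain` and starts at `reset`.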
9. The method of claim 1, wherein the method further comprises:
sending, according to a preset period, a data reporting instruction to the link monitoring module corresponding to each link set;
receiving the real-time measurement data of the links in each link set, sent by the respective link monitoring modules;
and detecting the real-time measurement data to obtain a periodic detection result for the links in each link set.
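The periodic inspection in claim 9 can be sketched in Python as below. This is a simplified illustration under assumptions: each link monitoring module is modelled as a callable that returns its measurement data when asked, the detection rule is a plain threshold on a hypothetical error-rate value, and the link set names `pcie` and `ethernet` are invented for the example.

```python
import time
from typing import Callable, Dict, List, Mapping

Metrics = Dict[str, float]  # link name -> measured value (hypothetical shape)


def detect_by_threshold(metrics: Metrics, limit: float = 1e-6) -> str:
    """Assumed detection rule: any value above `limit` counts as abnormal."""
    return "abnormal" if any(v > limit for v in metrics.values()) else "normal"


def poll_once(monitors: Mapping[str, Callable[[], Metrics]],
              detect: Callable[[Metrics], str]) -> Dict[str, str]:
    """Ask every monitoring module for data and run detection on the reply."""
    results: Dict[str, str] = {}
    for link_set, report in monitors.items():
        metrics = report()  # stands in for the instruction and its reply
        results[link_set] = detect(metrics)
    return results


def poll_periodically(monitors: Mapping[str, Callable[[], Metrics]],
                      detect: Callable[[Metrics], str],
                      period_s: float,
                      rounds: int,
                      sleep: Callable[[float], None] = time.sleep
                      ) -> List[Dict[str, str]]:
    """Repeat poll_once `rounds` times, waiting `period_s` between rounds."""
    history: List[Dict[str, str]] = []
    for _ in range(rounds):
        history.append(poll_once(monitors, detect))
        sleep(period_s)
    return history


MONITORS = {
    "pcie": lambda: {"lane0": 0.0},
    "ethernet": lambda: {"port0": 1e-3},
}
```

Passing `sleep` as a parameter keeps the loop testable without real delays; a production loop would more likely run on a timer or scheduler.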
10. An electronic device, comprising:
a memory for storing a computer program; and
a processor configured to implement, when executing the computer program, the steps of the link abnormality processing method of the server according to any one of claims 1 to 9.
CN202511785016.2A 2025-11-28 2025-11-28 Link exception handling method of server and electronic equipment Pending CN121239603A (en)

Priority Applications (1)

Application Number Publication Number Priority Date Filing Date Title
CN202511785016.2A CN121239603A (en) 2025-11-28 2025-11-28 Link exception handling method of server and electronic equipment

Publications (1)

Publication Number Publication Date
CN121239603A (en) 2025-12-30

Family

ID=98153052

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN202511785016.2A Pending CN121239603A (en) 2025-11-28 2025-11-28 Link exception handling method of server and electronic equipment

Country Status (1)

Country Link
CN (1) CN121239603A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination