CN109165138A

CN109165138A - Method and apparatus for monitoring equipment faults

Info

Publication number: CN109165138A
Application number: CN201810866734.6A
Authority: CN
Inventors: 陈涛
Original assignee: Wangsu Science and Technology Co Ltd
Current assignee: Wangsu Science and Technology Co Ltd
Priority date: 2018-08-01
Filing date: 2018-08-01
Publication date: 2019-01-08
Anticipated expiration: 2038-08-01
Also published as: CN109165138B

Abstract

The invention discloses a method and an apparatus for monitoring equipment faults, belonging to the technical field of computers. The method includes: after a tool set script is run, monitoring a target key indicator, through the basic tools included in the tool set script, once every monitoring sleep duration of the target key indicator; if the target key indicator is abnormal, detecting through the tool set script whether a fault currently exists, and otherwise adjusting the monitoring sleep duration of the target key indicator based on a first preset duration; if a fault currently exists, adjusting the monitoring sleep duration of the target key indicator based on a second preset duration, and determining and reporting the fault information of the current fault through the tool set script; if no fault currently exists, adjusting the monitoring sleep duration of the target key indicator based on a third preset duration. The invention avoids both frequent monitoring of key indicators and frequent repeated reporting of the same fault, while still discovering equipment faults in a reasonably timely manner.

Description

Method and apparatus for monitoring equipment faults

Technical field

The present invention relates to the field of computer technology, and in particular to a method and an apparatus for monitoring equipment faults.

Background art

While running, a device often develops operational faults because of hardware or software problems, which can reduce its processing capacity, cause logic errors in execution, or even lead to downtime and component damage. To discover and resolve such faults as early as possible, users often view the device's performance indicators through performance monitoring programs (which may be called monitoring tools) to understand its operating status.

At present there are tool set scripts that integrate multiple monitoring tools, through which the operating status of a device can be monitored automatically and in a unified way. Specifically, a user can install and run such a tool set script on the device, so that the device periodically monitors multiple key indicators through the multiple basic tools included in the script. When a key indicator becomes abnormal, the device further uses some of the data collection tools in the tool set script to collect device operating parameters, and judges from the collected parameters whether the device has a fault and, if so, the fault type. The device can then report the fault to remind technicians to repair it.

In the implementation of the present invention, the inventor finds that the existing technology has at least the following problems:

Once a fault occurs on a device, it generally lasts for some time. If the period for monitoring key indicators is short, the device keeps detecting and reporting the same fault while it persists, which consumes a large amount of device processing resources on performance monitoring; if the period is long, faults may not be discovered in time.

Summary of the invention

To solve the problems in the prior art, embodiments of the present invention provide a method and an apparatus for monitoring equipment faults. The technical solution is as follows:

In a first aspect, a method for monitoring equipment faults is provided, the method comprising:

monitoring a target key indicator, through a basic tool included in a tool set script, once every monitoring sleep duration of the target key indicator;

if the target key indicator is abnormal, detecting through the tool set script whether a fault currently exists, and otherwise adjusting the monitoring sleep duration of the target key indicator based on a first preset duration;

if a fault currently exists, adjusting the monitoring sleep duration of the target key indicator based on a second preset duration, and determining and reporting the fault information of the current fault through the tool set script;

if no fault currently exists, adjusting the monitoring sleep duration of the target key indicator based on a third preset duration, wherein the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration.

Optionally, the otherwise adjusting the monitoring sleep duration of the target key indicator based on the first preset duration comprises:

otherwise counting the consecutive normal count of the target key indicator, and adjusting the monitoring sleep duration of the target key indicator to the product of the consecutive normal count and the first preset duration.

Optionally, the adjusting the monitoring sleep duration of the target key indicator based on the third preset duration if no fault currently exists comprises:

if no fault currently exists, counting the consecutive no-fault count, that is, the number of consecutive monitoring rounds in which the target key indicator was abnormal but no fault was found, and adjusting the monitoring sleep duration of the target key indicator to the product of the consecutive no-fault count and the third preset duration.

Optionally, the determining and reporting the fault information of the current fault through the tool set script comprises:

determining the fault information of the current fault through the tool set script, and incrementing the short-term repeat count of the current fault by one;

when the short-term repeat count equals the fault reporting threshold corresponding to the current fault, reporting the fault information of the current fault, and increasing the fault reporting threshold of the current fault according to a preset rule.

Optionally, the determining the fault information of the current fault through the tool set script and incrementing the short-term repeat count of the current fault by one comprises:

selecting, through the tool set script, the fault cause of the current fault from the preset fault causes corresponding to the target key indicator, and determining the fault features of the fault cause;

if the fault cause is already recorded locally, and the similarity between the fault features of the locally recorded fault cause and the fault features determined this time is greater than a preset threshold, incrementing the short-term repeat count of the locally recorded fault cause by one; otherwise recording the fault cause and fault features determined this time, and setting the short-term repeat count of the fault cause to one.

Optionally, the fault causes are recorded in the form of a linked list, wherein the linked list includes multiple nodes, each node corresponds to one key indicator, each key indicator corresponds to one or more sub-lists, each sub-list includes multiple list heads for recording fault causes, and each list head corresponds to multiple child nodes, which respectively store the fault features, the short-term repeat count, and the fault reporting threshold of the fault cause.

Optionally, the key indicators include at least one or more of CPU usage, memory usage, load value, I/O wait duration, and the CPU usage of each process.

In a second aspect, an apparatus for monitoring equipment faults is provided, the apparatus comprising:

a monitoring module, configured to monitor a target key indicator, through a basic tool included in a tool set script, once every monitoring sleep duration of the target key indicator;

an adjustment module, configured to: if the target key indicator is abnormal, detect through the tool set script whether a fault currently exists, and otherwise adjust the monitoring sleep duration of the target key indicator based on a first preset duration; if a fault currently exists, adjust the monitoring sleep duration of the target key indicator based on a second preset duration, and determine and report the fault information of the current fault through the tool set script; and if no fault currently exists, adjust the monitoring sleep duration of the target key indicator based on a third preset duration;

wherein the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration.

Optionally, the adjustment module is specifically configured to:

In a third aspect, a device is provided. The device includes a processor and a memory; the memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the method for monitoring equipment faults described in the first aspect.

In a fourth aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the method for monitoring equipment faults described in the first aspect.

The technical solution provided by the embodiments of the present invention has the following beneficial effects:

In the embodiments of the present invention, the target key indicator is monitored, through a basic tool included in the tool set script, once every monitoring sleep duration of the target key indicator; if the target key indicator is abnormal, the tool set script detects whether a fault currently exists, and otherwise the monitoring sleep duration of the target key indicator is adjusted based on the first preset duration; if a fault currently exists, the monitoring sleep duration is adjusted based on the second preset duration, and the fault information of the current fault is determined and reported through the tool set script; if no fault currently exists, the monitoring sleep duration is adjusted based on the third preset duration, where the second preset duration is greater than the first preset duration and the first preset duration is greater than the third preset duration. In this way, when the tool set script is used, each key indicator has its own monitoring sleep duration, the monitoring of different key indicators does not interfere with one another, and monitoring sleep durations of different lengths are set and adjusted according to the monitoring result. This avoids both frequent monitoring of key indicators and frequent repeated reporting of the same fault, while still discovering equipment faults in a reasonably timely manner.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for monitoring equipment faults provided by an embodiment of the present invention;

Fig. 2 is a logic diagram of monitoring a key indicator provided by an embodiment of the present invention;

Fig. 3 is a structural diagram of a linked list provided by an embodiment of the present invention;

Fig. 4 is a structural diagram of an apparatus for monitoring equipment faults provided by an embodiment of the present invention;

Fig. 5 is a structural diagram of a device provided by an embodiment of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.

An embodiment of the present invention provides a method for monitoring equipment faults. The method may be executed by any device capable of running programs, such as a server or a terminal. The device can load and run the tool set script mentioned in the background; through the tool set script, different monitoring tools monitor the device's operating status from different angles, so that hardware or software faults arising during operation are discovered in time. The device may include a processor, a memory, and a transceiver. The processor can carry out the fault-monitoring processing in the flow below; the memory can store the data required and generated during processing, for example the tool set script and recorded device operating parameters; the transceiver can send and receive related data during processing, for example receiving instructions entered by the user and reporting the fault information of device faults. The device can run multiple processes at the same time; while running, each process occupies CPU processing resources to a varying degree, uses some memory, and generates disk I/O.

The processing flow shown in Fig. 1 is described in detail below with reference to specific embodiments. The content can be as follows:

Step 101: monitor the target key indicator, through a basic tool included in the tool set script, once every monitoring sleep duration of the target key indicator.

In implementation, after a technician installs the tool set script on the device, the device can load and run it, and then monitor multiple key indicators through the basic tools included in the script. The key indicators can be preset: a set of key indicators makes it possible to discover simply and promptly whether a fault has occurred on the device, and each key indicator can be monitored in real time by a small number of basic tools that reveal whether the indicator is abnormal. Because only a few basic tools are executed to monitor the key indicators, few device processing resources are consumed and the impact on device performance is small. A monitoring sleep duration can be set separately for each key indicator; that is, once every monitoring sleep duration, the device monitors the corresponding key indicator through the basic tools included in the tool set script. The monitoring sleep durations of different key indicators can differ, and accordingly so can their monitoring moments. Taking the target key indicator as an example, after running the tool set script the device monitors the target key indicator, through the basic tools included in the script, once every monitoring sleep duration of the target key indicator.
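The independence of the per-indicator schedules can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `run_schedule`, the priority-queue approach, and the fixed durations are assumptions made for illustration, and the sketch computes a timeline instead of actually sleeping.

```python
import heapq

def run_schedule(sleep_durations, horizon):
    """Return (time, indicator) pairs in the order they would be monitored.

    sleep_durations: dict mapping indicator name -> seconds between checks
    (hypothetical fixed values; the method adjusts them after every check).
    horizon: stop once the next due time exceeds this many seconds.
    """
    # Every indicator is first due after one sleep duration of its own.
    queue = [(duration, name) for name, duration in sleep_durations.items()]
    heapq.heapify(queue)
    timeline = []
    while queue and queue[0][0] <= horizon:
        due, name = heapq.heappop(queue)
        timeline.append((due, name))
        # Re-arm only this indicator; the others are unaffected.
        heapq.heappush(queue, (due + sleep_durations[name], name))
    return timeline

timeline = run_schedule({"cpu": 60, "io_wait": 25}, horizon=120)
```

With these assumed durations, the I/O wait indicator is checked at 25, 50, 75, and 100 seconds while CPU usage is checked at 60 and 120 seconds, so the monitoring moments of the two indicators differ.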

Optionally, the key indicators include at least one or more of CPU usage, memory usage, load value, I/O wait duration, and the CPU usage of each process. In other embodiments, the key indicators are not limited to those listed here.

In implementation, the five indicators CPU usage, memory usage, load value, I/O wait duration, and per-process CPU usage can be chosen as key indicators. Specifically, CPU usage can be checked with the "mpstat" tool; memory usage can be checked through the "used" and "free" fields of "free -m"; the load value can be checked through the 1-minute load field of the "/proc/loadavg" file; I/O wait duration can be checked with the "mpstat" tool; and the CPU usage of each process can be checked with the "top" tool.
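As a rough illustration of two of these checks, the following Python sketch parses the contents of "/proc/loadavg" and the "Mem:" row of "free -m" output. The function names are hypothetical, and computing memory usage as used / (used + free) is an assumption; the text only says that the "used" and "free" fields are inspected.

```python
def parse_one_minute_load(loadavg_text):
    """Return the 1-minute load average from the contents of /proc/loadavg."""
    # The file holds e.g. "0.20 0.18 0.12 1/80 11206"; the first field
    # is the 1-minute value.
    return float(loadavg_text.split()[0])

def parse_memory_usage(free_output):
    """Return used / (used + free) from the "Mem:" row of `free -m` output."""
    lines = free_output.strip().splitlines()
    header = lines[0].split()
    mem_row = next(line.split() for line in lines if line.startswith("Mem:"))
    # The row starts with the "Mem:" label, so data columns are shifted by one.
    used = int(mem_row[header.index("used") + 1])
    free = int(mem_row[header.index("free") + 1])
    return used / (used + free)

def read_one_minute_load(path="/proc/loadavg"):
    """Read the live value on a Linux system."""
    with open(path) as handle:
        return parse_one_minute_load(handle.read())
```

Keeping the parsing separate from the file or command access makes the checks easy to exercise with captured sample output.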

Step 102: if the target key indicator is abnormal, detect through the tool set script whether a fault currently exists; otherwise adjust the monitoring sleep duration of the target key indicator based on the first preset duration.

In implementation, when the device monitors the target key indicator through a basic tool in the tool set script, it can judge whether the monitored target key indicator is abnormal by threshold determination, based on empirical data used in routine analysis, and thereby decide whether subsequent processing needs to be triggered; see Fig. 2 for the specific processing. If the target key indicator is found to be abnormal, the device can further detect, through the tool set script, whether a fault currently exists. Specifically, the tool set script presets, for each key indicator, the data collection tools to be used when that indicator is abnormal; the device first uses these data collection tools to collect the device operating parameters related to the abnormal target key indicator, and then confirms from those parameters whether a fault currently exists. If the target key indicator is not abnormal, the monitoring sleep duration of the target key indicator can be adjusted based on the first preset duration.

Optionally, if a key indicator is found normal in consecutive monitoring rounds, its monitoring sleep duration can be extended appropriately. Accordingly, part of the processing of step 102 can be as follows: otherwise count the consecutive normal count of the target key indicator, and adjust the monitoring sleep duration of the target key indicator to the product of the consecutive normal count and the first preset duration.

In implementation, if monitoring finds that the target key indicator is not abnormal, the device can count the consecutive normal count of the target key indicator and then adjust the monitoring sleep duration of the target key indicator to the product of the consecutive normal count and the first preset duration. For example, suppose the first preset duration is 1 min. If the target key indicator was abnormal in the previous monitoring round and is normal in this round, the consecutive normal count is 1 and the monitoring sleep duration is adjusted to 1*1 min. If the target key indicator was normal in the previous N rounds and is also normal in this round, the consecutive normal count is N+1 and the monitoring sleep duration is adjusted to (N+1)*1 min. In addition, a maximum value can be set for the monitoring sleep duration of the target key indicator, so that the monitoring sleep duration never exceeds this maximum whatever the consecutive normal count is. This prevents the monitoring sleep duration from growing too large when the target key indicator stays normal for a long time, which would delay detection once the indicator becomes abnormal.
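The capped product described above fits in a few lines of Python. This is only a sketch: the function name and the 10-minute maximum are assumptions, while the 1-minute first preset duration comes from the example in the text.

```python
def normal_sleep_duration(consecutive_normal, first_preset, max_sleep):
    """Sleep duration after a normal reading: count * preset, capped at max_sleep."""
    return min(consecutive_normal * first_preset, max_sleep)

FIRST_PRESET = 60   # seconds, the 1 min used in the example
MAX_SLEEP = 600     # seconds, a hypothetical maximum

one_round = normal_sleep_duration(1, FIRST_PRESET, MAX_SLEEP)    # 60
four_rounds = normal_sleep_duration(4, FIRST_PRESET, MAX_SLEEP)  # 240
long_run = normal_sleep_duration(50, FIRST_PRESET, MAX_SLEEP)    # capped at 600
```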

Step 103: if a fault currently exists, adjust the monitoring sleep duration of the target key indicator based on the second preset duration, and determine and report the fault information of the current fault through the tool set script.

In implementation, if it is confirmed from the device operating parameters in step 102 that a fault currently exists, the device can adjust the monitoring sleep duration of the target key indicator based on the second preset duration. At the same time, the device can determine and report the fault information of its current fault through the tool set script. Here, technicians can predict the faults the device may encounter and record the parameter features of the device operating parameters when each fault occurs; the parameter features and the corresponding fault information can then be written into the source code of the tool set script. After collecting device operating parameters, the device can use this content to determine the fault information corresponding to the collected parameters.

Optionally, if the same fault is detected many times within a short period, the corresponding fault information can be reported only intermittently. Accordingly, part of the processing of step 103 can be as follows: determine the fault information of the current fault through the tool set script, and increment the short-term repeat count of the current fault by one; when the short-term repeat count equals the fault reporting threshold corresponding to the current fault, report the fault information of the current fault, and increase the fault reporting threshold of the current fault according to a preset rule.

In implementation, the device can determine the fault information of the current fault through the tool set script and increment the short-term repeat count of the current fault by one. The short-term repeat count reflects how many times the device has repeatedly detected the fault within a short period. The device then compares the incremented short-term repeat count with the fault reporting threshold corresponding to the current fault. If the two are equal, the device reports the fault information of the current fault and increases the fault reporting threshold of the current fault according to the preset rule; otherwise it does not report the fault information. For example, suppose the preset rule for increasing the fault reporting threshold is exponential growth with base 3, so that the threshold takes the values 1, 3, 9, 27, ... in turn. Combined with the short-term repeat count, this means the fault information is reported the first time the fault is determined, not reported the second time, reported the third time, not reported from the fourth to the eighth time, reported again the ninth time, and so on.
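The base-3 example can be replayed with a short Python sketch. The function name is hypothetical, and the starting threshold of 1 is taken from the sequence 1, 3, 9, 27 given in the example; other preset rules would change only the line that grows the threshold.

```python
def reported_occurrences(total_detections, base=3):
    """Return the detection numbers at which the fault would be reported."""
    repeat_count = 0
    threshold = 1  # the first detection is always reported
    reported = []
    for _ in range(total_detections):
        repeat_count += 1
        if repeat_count == threshold:
            reported.append(repeat_count)
            threshold *= base  # preset rule: exponential growth
    return reported

reports = reported_occurrences(30)  # [1, 3, 9, 27]
```

Over 30 consecutive detections of the same fault only 4 reports are sent, which is the reduction in repeated reporting that the step aims for.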

Optionally, the processing of determining the fault information and updating the short-term repeat count can be as follows: select, through the tool set script, the fault cause of the current fault from the preset fault causes corresponding to the target key indicator, and determine the fault features of the fault cause; if the fault cause is already recorded locally, and the similarity between the fault features of the locally recorded fault cause and the fault features determined this time is greater than a preset threshold, increment the short-term repeat count of the locally recorded fault cause by one; otherwise record the fault cause and fault features determined this time, and set the short-term repeat count of the fault cause to one.

In implementation, while determining the fault information of the current fault, the device can select, through the tool set script, the fault cause of the current fault from the preset fault causes corresponding to the target key indicator, and determine the fault features of the fault cause. Taking CPU usage, load value, per-process CPU usage, and I/O wait duration as the key indicators, the specific preset fault causes and the way their fault features are determined can be found in Table 1 below. The device then checks whether the same fault cause has already been recorded locally. If it has, the device determines the similarity between the fault features of the locally recorded fault cause and the fault features determined this time; for example, if there are 4 fault features and only one of them matches, the similarity is 1/4. If the similarity is greater than the preset threshold, the device increments the short-term repeat count of the locally recorded fault cause by one. If the fault cause is not recorded locally, or the similarity is less than the preset threshold, the device records the fault cause and fault features determined this time and sets the short-term repeat count of the fault cause to one. Note that a locally recorded fault cause has a limited storage duration; once the storage duration has elapsed, the device automatically deletes the corresponding fault cause and fault features.
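The record-or-increment decision can be sketched in Python as follows. The names `FaultRecord`, `feature_similarity`, and `record_detection`, the 0.5 threshold, and the dictionary used in place of the linked list are all assumptions made for illustration; the similarity measure follows the 1/4 example (matching features divided by total features), and expiry of old records is left out.

```python
from dataclasses import dataclass

@dataclass
class FaultRecord:
    features: tuple
    repeat_count: int = 1      # short-term repeat count
    report_threshold: int = 1  # fault reporting threshold

def feature_similarity(recorded, current):
    """Fraction of feature positions whose values agree."""
    if not recorded or len(recorded) != len(current):
        return 0.0
    matches = sum(1 for a, b in zip(recorded, current) if a == b)
    return matches / len(recorded)

def record_detection(records, cause, features, min_similarity=0.5):
    """Update the local records for one detection and return the record."""
    existing = records.get(cause)
    if existing is not None and feature_similarity(existing.features, features) > min_similarity:
        # Same cause with similar features: count it as a repeat.
        existing.repeat_count += 1
        return existing
    # New cause, or features too different: start a fresh record.
    records[cause] = FaultRecord(features=tuple(features))
    return records[cause]
```

For instance, two detections of a cause with identical features leave a single record with a repeat count of 2, whereas a third detection sharing only one of four features replaces it with a fresh record whose count is 1.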

Table 1

Optionally, the fault causes are recorded in the form of a linked list, wherein the linked list includes multiple nodes, each node corresponds to one key indicator, each key indicator corresponds to one or more sub-lists, each sub-list includes multiple list heads for recording fault causes, and each list head corresponds to multiple child nodes, which respectively store the fault features, the short-term repeat count, and the fault reporting threshold of the fault cause.

In implementation, among the data structures used in programming, a linked list is convenient to traverse, is easy to extend (sub-lists can be nested in it without limit), and has strong data type compatibility, since it can store data of any type. The fault causes can therefore be recorded in the form of a linked list. Again taking CPU usage, load value, per-process CPU usage, and I/O wait duration as the key indicators, the linked list is as shown in Fig. 3. The trunk of the linked list consists of four nodes, namely CPU, LOAD, PROCESS, and IO, and each key indicator corresponds to one or more sub-lists. The sub-list of the CPU branch includes as many list heads as there are preset fault causes for abnormal CPU usage. The LOAD branch can be divided into three sub-lists: disk (SDA, SDB, ...), process (PROCESS_A, PROCESS_B, ...), and CPU (CPU0, CPU1, ...); the LOAD-disk sub-list includes as many list heads as the device has disks, the LOAD-process sub-list includes N list heads, and the LOAD-CPU sub-list includes as many list heads as the device has logical CPUs. The sub-list of the PROCESS branch includes N list heads. The sub-list of the IO branch includes as many list heads as the device has disks. Each of these list heads can correspond to multiple child nodes that respectively store the fault features, the short-term repeat count, and the fault reporting threshold of the fault cause.
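A minimal Python sketch of such a nested linked list is given here, under stated assumptions: the class and function names are hypothetical, the three child nodes of each list head are collapsed into plain fields, and the CPU and PROCESS sub-lists start empty instead of being pre-populated with list heads.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ListHead:
    """Head for one fault cause; its fields stand in for the child nodes."""
    cause: str
    features: list = field(default_factory=list)
    repeat_count: int = 0
    report_threshold: int = 1
    next: Optional["ListHead"] = None

@dataclass
class SubList:
    name: str
    first_head: Optional[ListHead] = None
    next: Optional["SubList"] = None

@dataclass
class IndicatorNode:
    name: str
    first_sub_list: Optional[SubList] = None
    next: Optional["IndicatorNode"] = None

def chain(items):
    """Link items through their `next` field and return the first one."""
    for current, following in zip(items, items[1:]):
        current.next = following
    return items[0] if items else None

def walk(first):
    """Yield every element of a singly linked chain."""
    node = first
    while node is not None:
        yield node
        node = node.next

def build_trunk(disks, logical_cpus):
    """Build the CPU -> LOAD -> PROCESS -> IO trunk described for Fig. 3."""
    cpu = IndicatorNode("CPU", SubList("cause"))
    load = IndicatorNode("LOAD", chain([
        SubList("disk", chain([ListHead(d) for d in disks])),
        SubList("process"),
        SubList("cpu", chain([ListHead(c) for c in logical_cpus])),
    ]))
    process = IndicatorNode("PROCESS", SubList("process"))
    io = IndicatorNode("IO", SubList("disk", chain([ListHead(d) for d in disks])))
    return chain([cpu, load, process, io])
```

Traversal is then uniform at every level: `walk(trunk)` visits the four indicator nodes, `walk(node.first_sub_list)` visits the sub-lists of one indicator, and `walk(sub_list.first_head)` visits its list heads.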

Step 104: if no fault currently exists, adjust the monitoring sleep duration of the target key indicator based on the third preset duration.

Here, the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration.

In implementation, if it is confirmed from the device operating parameters in step 102 that no fault currently exists, the device can adjust the monitoring sleep duration of the target key indicator based on the third preset duration. Note that the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration. The reasoning is as follows. First, a device fault generally persists for some time before it is repaired, and the corresponding key indicator remains abnormal throughout. So after the target key indicator is found abnormal and the fault information of the current fault has been determined, the device can wait a relatively long time before monitoring the target key indicator again, which avoids repeatedly detecting the same fault; hence the longer second preset duration is used to adjust the monitoring sleep duration. Second, a running device has a relatively low probability of failing and is in a normal state most of the time, so key indicators need not be monitored frequently; at the same time, the monitoring interval should not be too long, so that a fault is detected promptly once it appears. Hence, when the target key indicator is found normal, the first preset duration of moderate length is used. Third, monitoring key indicators mainly serves as early fault warning. When the target key indicator is abnormal, the device has very likely failed; if further detection finds no fault, the fault may well be at an early stage in which some device operating parameters are not yet affected, or the cause may be something else, such as a temporary fluctuation of the target key indicator. Because the device state cannot be determined in this case, the target key indicator needs to be monitored again shortly, so the shorter third preset duration is used.
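The three-way choice and the required ordering of the presets can be summarized in a small Python sketch. The function name and the 600-second second preset are assumptions; the 60-second and 10-second values match the 1 min and 10 s used in the examples of this description.

```python
def choose_preset(indicator_abnormal, fault_found, first, second, third):
    """Pick the preset duration that drives the next sleep adjustment."""
    if not second > first > third:
        raise ValueError("presets must satisfy second > first > third")
    if not indicator_abnormal:
        return first   # normal reading: moderate interval
    if fault_found:
        return second  # confirmed fault: long interval, avoid re-reporting
    return third       # abnormal but unconfirmed: re-check soon

PRESETS = {"first": 60, "second": 600, "third": 10}  # seconds

after_normal = choose_preset(False, False, **PRESETS)
after_fault = choose_preset(True, True, **PRESETS)
after_unconfirmed = choose_preset(True, False, **PRESETS)
```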

Optionally, if a key indicator is found abnormal in consecutive monitoring rounds but no fault is found each time, its monitoring sleep duration can be extended appropriately. Accordingly, the processing of step 104 can be as follows: if no fault currently exists, count the consecutive no-fault count, that is, the number of consecutive monitoring rounds in which the target key indicator was abnormal but no fault was found, and adjust the monitoring sleep duration of the target key indicator to the product of the consecutive no-fault count and the third preset duration.

In implementation, if the target key indicator is found abnormal but further detection finds no fault, the device can count the consecutive no-fault count and then adjust the monitoring sleep duration of the target key indicator to the product of the consecutive no-fault count and the third preset duration. For example, suppose the third preset duration is 10 s. If in the previous monitoring round the target key indicator was normal, or was abnormal with a device fault confirmed by further detection, and in this round the target key indicator is abnormal but no fault is found, then the consecutive no-fault count is 1 and the monitoring sleep duration is adjusted to 1*10 s. If in each of the previous N rounds the target key indicator was abnormal and further detection found no fault, and in this round the target key indicator is again abnormal and further detection again finds no fault, then the consecutive no-fault count is N+1 and the monitoring sleep duration is adjusted to (N+1)*10 s. In addition, a maximum value can be set for the monitoring sleep duration of the target key indicator, so that the monitoring sleep duration never exceeds this maximum whatever the consecutive no-fault count is.
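How the two counters interact across a sequence of monitoring rounds can be sketched as a small state object. This is an illustrative reading of the reset rules in the examples, not the patented code: the class name, the choice to use the second preset duration directly after a confirmed fault, and the 300-second maximum are assumptions.

```python
from dataclasses import dataclass

@dataclass
class IndicatorState:
    first_preset: float
    second_preset: float
    third_preset: float
    max_sleep: float
    consecutive_normal: int = 0
    consecutive_no_fault: int = 0

    def next_sleep(self, indicator_abnormal, fault_found=False):
        """Update the counters for one monitoring round and return the sleep."""
        if not indicator_abnormal:
            self.consecutive_normal += 1
            self.consecutive_no_fault = 0
            return min(self.consecutive_normal * self.first_preset, self.max_sleep)
        self.consecutive_normal = 0
        if fault_found:
            # A confirmed fault breaks the no-fault streak.
            self.consecutive_no_fault = 0
            return self.second_preset
        self.consecutive_no_fault += 1
        return min(self.consecutive_no_fault * self.third_preset, self.max_sleep)

state = IndicatorState(first_preset=60, second_preset=600, third_preset=10, max_sleep=300)
rounds = [(False, False), (False, False), (True, False), (True, False),
          (True, True), (True, False), (False, False)]
sleeps = [state.next_sleep(abnormal, fault) for abnormal, fault in rounds]
```

For this sequence the sleeps are 60, 120, 10, 20, 600, 10, and 60 seconds: each streak lengthens its own interval, and any change of outcome restarts the relevant counter at 1.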

Based on the same technical concept, an embodiment of the present invention further provides a device for monitoring equipment failure. As shown in Fig. 4, the device includes:

A monitoring module 401, configured to monitor the target key indicator, at every monitoring sleep duration of the target key indicator, through the basic tools included in the tool set script;

An adjustment module 402, configured to: if the target key indicator is abnormal, detect through the tool set script whether a fault currently exists, and otherwise adjust the monitoring sleep duration of the target key indicator based on the first preset duration; if a fault currently exists, adjust the monitoring sleep duration of the target key indicator based on the second preset duration, and determine and report the fault information of the current fault through the tool set script; and if no fault currently exists, adjust the monitoring sleep duration of the target key indicator based on the third preset duration.

Optionally, the adjustment module 402 is specifically configured to:

It should be noted that, when the device for monitoring equipment failure provided by the above embodiment monitors equipment failure, the division into the above functional modules is used only as an example. In practical applications, the above functions can be allocated to different functional modules as needed; that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the device for monitoring equipment failure provided by the above embodiment and the method embodiment for monitoring equipment failure belong to the same concept. For the specific implementation process, see the method embodiment, which is not repeated here.
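The division between the monitoring module 401 and the adjustment module 402 might be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the class names, the threshold-based abnormality test, the callbacks, and the 30 s and 300 s values are hypothetical. Only the 10 s third preset duration is taken from the example in the description, and the ordering (second greater than first, first greater than third) is taken from the claims.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Presets:
    first_s: float = 30.0    # indicator normal (hypothetical value)
    second_s: float = 300.0  # fault confirmed (hypothetical value)
    third_s: float = 10.0    # abnormal but no fault (10 s example)

    def __post_init__(self) -> None:
        # Ordering required by the claims: second > first > third.
        if not (self.second_s > self.first_s > self.third_s):
            raise ValueError("expected second > first > third")


class MonitoringModule:
    """Reads one key indicator; 'abnormal' is assumed to mean above a threshold."""

    def __init__(self, read_indicator: Callable[[], float], threshold: float):
        self.read_indicator = read_indicator
        self.threshold = threshold

    def is_abnormal(self) -> bool:
        return self.read_indicator() > self.threshold


class AdjustmentModule:
    """Chooses the next monitoring sleep duration and reports confirmed faults."""

    def __init__(self,
                 detect_fault: Callable[[], Optional[str]],
                 report: Callable[[str], None],
                 presets: Presets):
        self.detect_fault = detect_fault  # returns fault info, or None
        self.report = report
        self.presets = presets

    def next_sleep(self, abnormal: bool) -> float:
        if not abnormal:
            return self.presets.first_s
        fault = self.detect_fault()
        if fault is not None:
            self.report(fault)
            return self.presets.second_s
        return self.presets.third_s
```

A caller would loop over `is_abnormal()`, pass the result to `next_sleep()`, and sleep for the returned duration before the next round. The optional counting variants (multiplying a preset duration by a consecutive count) are left out to keep the sketch short.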

Fig. 5 is a schematic structural diagram of equipment provided by an embodiment of the present invention. The equipment 500 may vary considerably depending on configuration or performance, and may include one or more central processing units 522 (for example, one or more processors), a memory 532, and one or more storage media 530 (for example, one or more mass storage devices) storing application programs 552 or data 554. The memory 532 and the storage medium 530 may provide transient or persistent storage. The program stored in the storage medium 530 may include one or more modules (not marked in the figure), and each module may include a series of instruction operations on the equipment. Further, the central processing unit 522 may be configured to communicate with the storage medium 530 and to execute, on the equipment 500, the series of instruction operations in the storage medium 530.

The equipment 500 may also include one or more power supplies 525, one or more wired or wireless network interfaces 550, one or more input/output interfaces 558, one or more keyboards 555, and/or one or more operating systems 551, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.

The equipment 500 may include a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, and the one or more programs include instructions for performing the above monitoring of equipment failure.

A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments can be implemented by hardware, or by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.

The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for monitoring equipment failure, wherein the method comprises:

At every monitoring sleep duration of a target key indicator, monitor the target key indicator through the basic tools included in a tool set script;

If the target key indicator is abnormal, use the tool set script to detect whether a fault currently exists; otherwise, adjust the monitoring sleep duration of the target key indicator based on a first preset duration;

If there is a fault currently, adjust the monitoring sleep duration of the target key indicator based on the second preset duration, and determine and report the fault information of the current fault through the tool set script;

If there is currently no fault, the monitoring sleep duration of the target key indicator is adjusted based on a third preset duration, wherein the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration.

2. The method according to claim 1, wherein the otherwise adjusting the monitoring sleep duration of the target key indicator based on the first preset duration comprises:

Otherwise, count the number of consecutive times the target key indicator has been normal, and adjust the monitoring sleep duration of the target key indicator to the product of the consecutive normal count and the first preset duration.

3. The method according to claim 1, wherein if there is no fault currently, adjusting the monitoring sleep duration of the target key indicator based on a third preset duration, comprising:

If there is currently no fault, count the number of consecutive fault-free detections after the target key indicator has been continuously found abnormal, and adjust the monitoring sleep duration of the target key indicator to the product of the consecutive fault-free count and the third preset duration.

4. The method according to claim 1, wherein the determining and reporting the fault information of the current fault through the tool set script comprises:

Determine the fault information of the current fault through the tool set script, and add one to the number of short-term repetitions of the current fault;

When the number of short-term repetitions is equal to the fault reporting threshold corresponding to the current fault, the fault information of the current fault is reported, and the fault reporting threshold of the current fault is increased according to a preset rule.

5. The method according to claim 4, wherein the determining the fault information of the current fault through the tool set script, and adding one to the number of short-term repetitions of the current fault, comprises:

Select the fault cause of the current fault from the preset fault causes corresponding to the target key indicators through the tool set script, and determine the fault characteristics of the fault cause;

If the fault cause is already recorded locally, and the similarity between the fault feature of the locally recorded fault cause and the fault feature determined this time is greater than a preset threshold, add one to the number of short-term repetitions of the locally recorded fault cause; otherwise, record the fault cause and fault feature determined this time, and set the number of short-term repetitions of the fault cause to one.

6. The method according to claim 5, wherein the fault cause is recorded in the form of a linked list, wherein the linked list includes a plurality of nodes, each of the nodes corresponds to a key indicator, each of the key indicators corresponds to one or more sub-linked lists, each sub-linked list contains multiple linked list headers used to record fault causes, each of the linked list headers corresponds to multiple sub-nodes, and the multiple sub-nodes are respectively used to store the fault feature, the number of short-term repetitions, and the fault reporting threshold of the fault cause.

7. The method according to any one of claims 1-6, wherein the key indicators include at least one or more of CPU usage, memory usage, load value, I/O waiting time, and CPU usage of each process.

8. A device for monitoring equipment failure, wherein the device comprises:

a monitoring module, configured to monitor a target key indicator, at every monitoring sleep duration of the target key indicator, through the basic tools included in a tool set script;

an adjustment module, configured to: if the target key indicator is abnormal, detect through the tool set script whether a fault currently exists, and otherwise adjust the monitoring sleep duration of the target key indicator based on a first preset duration; if a fault currently exists, adjust the monitoring sleep duration of the target key indicator based on a second preset duration, and determine and report the fault information of the current fault through the tool set script; and if no fault currently exists, adjust the monitoring sleep duration of the target key indicator based on a third preset duration;

Wherein, the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration.

9. The device according to claim 8, wherein the adjustment module is specifically used for:

10. The device according to claim 8, wherein the adjustment module is specifically used for:

11. The device according to claim 8, wherein the adjustment module is specifically used for:

12. The device according to claim 11, wherein the adjustment module is specifically used for:

13. The apparatus according to claim 12, wherein the fault cause is recorded in the form of a linked list, wherein the linked list includes a plurality of nodes, each of the nodes corresponds to a key indicator, each of the key indicators corresponds to one or more sub-linked lists, each sub-linked list contains multiple linked list headers used to record fault causes, each of the linked list headers corresponds to multiple sub-nodes, and the multiple sub-nodes are respectively used to store the fault feature, the number of short-term repetitions, and the fault reporting threshold of the fault cause.

14. The apparatus according to any one of claims 8-13, wherein the key indicators include at least one or more of CPU usage, memory usage, load value, I/O waiting time, and CPU usage of each process.

15. A device, wherein the device comprises a processor and a memory, the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the method for monitoring equipment failure according to any one of claims 1 to 7.

16. A computer-readable storage medium, wherein the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the method for monitoring equipment failure according to any one of claims 1 to 7.
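As an illustration of the fault bookkeeping described in claims 4 to 6, the following Python sketch may help. It rests on several assumptions that the text does not state: nested dictionaries stand in for the linked list of claim 6, `difflib.SequenceMatcher` stands in for the unspecified similarity measure, the similarity threshold is 0.8, the initial fault reporting threshold is 1, and the "preset rule" for raising the threshold is doubling. All class and method names are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict


@dataclass
class FaultRecord:
    feature: str               # fault feature recorded for this cause
    repeats: int = 1           # number of short-term repetitions
    report_threshold: int = 1  # assumed initial fault reporting threshold


class FaultLog:
    def __init__(self, similarity_threshold: float = 0.8) -> None:
        self.similarity_threshold = similarity_threshold  # assumed value
        # key indicator -> fault cause -> record (dicts replace the linked list)
        self.records: Dict[str, Dict[str, FaultRecord]] = {}

    def observe(self, indicator: str, cause: str, feature: str) -> bool:
        """Record one confirmed fault; return True if it should be reported."""
        causes = self.records.setdefault(indicator, {})
        rec = causes.get(cause)
        similar = (
            rec is not None
            and SequenceMatcher(None, rec.feature, feature).ratio()
            > self.similarity_threshold
        )
        if similar:
            rec.repeats += 1
        else:
            # New cause, or same cause with a dissimilar feature: start over.
            rec = FaultRecord(feature=feature)
            causes[cause] = rec
        if rec.repeats == rec.report_threshold:
            rec.report_threshold *= 2  # assumed preset rule: doubling
            return True
        return False
```

Under these assumptions, a fault that keeps recurring with the same feature is reported on its 1st, 2nd, 4th, and 8th occurrences, so repeated reports of the same fault become progressively less frequent.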