CN109165138A - Method and apparatus for monitoring equipment faults - Google Patents

Info

Publication number
CN109165138A
Authority
CN
China
Prior art keywords
fault
failure
monitoring
target key
preset duration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810866734.6A
Other languages
Chinese (zh)
Other versions
CN109165138B (en
Inventor
陈涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wangsu Science and Technology Co Ltd
Original Assignee
Wangsu Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wangsu Science and Technology Co Ltd
Priority to CN201810866734.6A
Publication of CN109165138A
Application granted
Publication of CN109165138B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3419 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time
    • G06F11/3423 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time where the assessed time is active or idle time

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method and apparatus for monitoring equipment faults, belonging to the field of computer technology. The method comprises: after a tool collection script is run, monitoring a target key indicator, at intervals of the monitoring sleep duration of the target key indicator, through the basic tools included in the tool collection script; if the target key indicator is abnormal, detecting through the tool collection script whether a fault currently exists, and otherwise adjusting the monitoring sleep duration of the target key indicator based on a first preset duration; if a fault currently exists, adjusting the monitoring sleep duration of the target key indicator based on a second preset duration, and determining and reporting the fault information of the current fault through the tool collection script; if no fault currently exists, adjusting the monitoring sleep duration of the target key indicator based on a third preset duration. The invention avoids frequent monitoring of key indicators and frequent repeated reporting of the same fault, while still finding equipment faults in a relatively timely manner.

Description

Method and apparatus for monitoring equipment faults
Technical field
The present invention relates to the field of computer technology, and in particular to a method and apparatus for monitoring equipment faults.
Background
While running, equipment often develops operating faults because of hardware or software problems. These faults can reduce the equipment's processing capacity or cause logic errors, and may even lead to equipment downtime or component damage. To find and resolve such faults as early as possible, users often check the equipment's performance indicators through performance monitoring programs (also called monitoring tools) to understand its operating state.
Tool collection scripts that integrate multiple monitoring tools currently exist, and such a script can monitor the operating state of equipment automatically and in a unified way. Specifically, a user can install and run the tool collection script on the equipment, so that the equipment periodically monitors multiple key indicators through the basic tools included in the script. When a key indicator becomes abnormal, the equipment can use some of the data collection tools in the script to collect equipment operating parameters, and then judge from the collected parameters whether a fault has occurred and, if so, the corresponding fault type. The equipment can then report the fault to remind technical personnel to repair it.
In the course of implementing the present invention, the inventor found that the prior art has at least the following problems:
Once a fault occurs, it generally persists for some time. If the period for monitoring key indicators is short, the equipment will repeatedly detect and report the same fault while it persists, consuming a large amount of processing resources on performance monitoring. If the period is long, faults may not be found in time.
Summary of the invention
To solve the problems in the prior art, embodiments of the present invention provide a method and apparatus for monitoring equipment faults. The technical solution is as follows:
In a first aspect, a method for monitoring equipment faults is provided, the method comprising:
monitoring a target key indicator, at intervals of the monitoring sleep duration of the target key indicator, through the basic tools included in a tool collection script;
if the target key indicator is abnormal, detecting through the tool collection script whether a fault currently exists, and otherwise adjusting the monitoring sleep duration of the target key indicator based on a first preset duration;
if a fault currently exists, adjusting the monitoring sleep duration of the target key indicator based on a second preset duration, and determining and reporting the fault information of the current fault through the tool collection script;
if no fault currently exists, adjusting the monitoring sleep duration of the target key indicator based on a third preset duration, wherein the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration.
Optionally, the otherwise adjusting the monitoring sleep duration of the target key indicator based on the first preset duration comprises:
otherwise, counting the consecutive normal count of the target key indicator, and adjusting the monitoring sleep duration of the target key indicator to the product of the consecutive normal count and the first preset duration.
Optionally, the adjusting the monitoring sleep duration of the target key indicator based on the third preset duration if no fault currently exists comprises:
if no fault currently exists, counting the consecutive no-fault count, that is, the number of consecutive times the target key indicator has been found abnormal without a fault being present, and adjusting the monitoring sleep duration of the target key indicator to the product of the consecutive no-fault count and the third preset duration.
Optionally, the determining and reporting the fault information of the current fault through the tool collection script comprises:
determining the fault information of the current fault through the tool collection script, and incrementing the short-term repetition count of the current fault by one;
when the short-term repetition count equals the fault reporting threshold corresponding to the current fault, reporting the fault information of the current fault, and increasing the fault reporting threshold of the current fault according to a preset rule.
Optionally, the determining the fault information of the current fault through the tool collection script and incrementing the short-term repetition count of the current fault by one comprises:
selecting, through the tool collection script, the fault cause of the current fault from the preset fault causes corresponding to the target key indicator, and determining the fault features of the fault cause;
if the fault cause is recorded locally, and the similarity between the fault features of the locally recorded fault cause and the fault features determined this time is greater than a preset threshold, incrementing the short-term repetition count of the locally recorded fault cause by one; otherwise, recording the fault cause and fault features determined this time, and setting the short-term repetition count of the fault cause to one.
Optionally, the fault cause is recorded in the form of a linked list, wherein the linked list comprises multiple nodes, each node corresponds to one key indicator, each key indicator corresponds to one or more sub-linked-lists, each sub-linked-list comprises multiple list heads for recording fault causes, each list head corresponds to multiple child nodes, and the child nodes are respectively used to store the fault features, the short-term repetition count and the fault reporting threshold of the fault cause.
Optionally, the key indicators include at least one or more of CPU usage, memory usage, load value, I/O wait duration and the CPU usage of each process.
In a second aspect, an apparatus for monitoring equipment faults is provided, the apparatus comprising:
a monitoring module, configured to monitor a target key indicator, at intervals of the monitoring sleep duration of the target key indicator, through the basic tools included in a tool collection script;
an adjustment module, configured to: if the target key indicator is abnormal, detect through the tool collection script whether a fault currently exists, and otherwise adjust the monitoring sleep duration of the target key indicator based on a first preset duration; if a fault currently exists, adjust the monitoring sleep duration of the target key indicator based on a second preset duration, and determine and report the fault information of the current fault through the tool collection script; and if no fault currently exists, adjust the monitoring sleep duration of the target key indicator based on a third preset duration;
wherein the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration.
Optionally, the adjustment module is specifically configured to:
otherwise, count the consecutive normal count of the target key indicator, and adjust the monitoring sleep duration of the target key indicator to the product of the consecutive normal count and the first preset duration.
Optionally, the adjustment module is specifically configured to:
if no fault currently exists, count the consecutive no-fault count, that is, the number of consecutive times the target key indicator has been found abnormal without a fault being present, and adjust the monitoring sleep duration of the target key indicator to the product of the consecutive no-fault count and the third preset duration.
Optionally, the adjustment module is specifically configured to:
determine the fault information of the current fault through the tool collection script, and increment the short-term repetition count of the current fault by one;
when the short-term repetition count equals the fault reporting threshold corresponding to the current fault, report the fault information of the current fault, and increase the fault reporting threshold of the current fault according to a preset rule.
Optionally, the adjustment module is specifically configured to:
select, through the tool collection script, the fault cause of the current fault from the preset fault causes corresponding to the target key indicator, and determine the fault features of the fault cause;
if the fault cause is recorded locally, and the similarity between the fault features of the locally recorded fault cause and the fault features determined this time is greater than a preset threshold, increment the short-term repetition count of the locally recorded fault cause by one; otherwise, record the fault cause and fault features determined this time, and set the short-term repetition count of the fault cause to one.
Optionally, the fault cause is recorded in the form of a linked list, wherein the linked list comprises multiple nodes, each node corresponds to one key indicator, each key indicator corresponds to one or more sub-linked-lists, each sub-linked-list comprises multiple list heads for recording fault causes, each list head corresponds to multiple child nodes, and the child nodes are respectively used to store the fault features, the short-term repetition count and the fault reporting threshold of the fault cause.
Optionally, the key indicators include at least one or more of CPU usage, memory usage, load value, I/O wait duration and the CPU usage of each process.
In a third aspect, a device is provided. The device comprises a processor and a memory; the memory stores at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by the processor to implement the method for monitoring equipment faults according to the first aspect.
In a fourth aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by a processor to implement the method for monitoring equipment faults according to the first aspect.
The technical solution provided by the embodiments of the present invention has the following beneficial effects:
In the embodiments of the present invention, the target key indicator is monitored, at intervals of its monitoring sleep duration, through the basic tools included in the tool collection script. If the target key indicator is abnormal, the tool collection script detects whether a fault currently exists; otherwise, the monitoring sleep duration of the target key indicator is adjusted based on the first preset duration. If a fault currently exists, the monitoring sleep duration is adjusted based on the second preset duration, and the fault information of the current fault is determined and reported through the tool collection script. If no fault currently exists, the monitoring sleep duration is adjusted based on the third preset duration. The second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration. In this way, when the tool collection script is used, each key indicator has its own monitoring sleep duration, the monitoring of different key indicators does not interfere with one another, and sleep durations of different lengths are set and adjusted according to the monitoring result. This avoids frequent monitoring of key indicators and frequent repeated reporting of the same fault, while still finding equipment faults in a relatively timely manner.
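The three adjustment branches summarized above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the names `IndicatorState` and `adjust_sleep` are hypothetical, the fault branch simply sets the sleep duration to the second preset duration (the text only says the adjustment is "based on" it), and the streak counters are assumed to reset whenever the outcome changes.

```python
from dataclasses import dataclass


@dataclass
class IndicatorState:
    sleep_s: float             # current monitoring sleep duration, in seconds
    normal_streak: int = 0     # consecutive checks where the indicator was normal
    no_fault_streak: int = 0   # consecutive abnormal checks where no fault was found


def adjust_sleep(state, abnormal, fault, t1, t2, t3):
    """Update state.sleep_s after one check; t1, t2, t3 are the preset durations."""
    if not t2 > t1 > t3 > 0:
        raise ValueError("expected t2 > t1 > t3 > 0")
    if not abnormal:
        # Indicator normal: sleep = consecutive normal count * first preset duration.
        state.normal_streak += 1
        state.no_fault_streak = 0
        state.sleep_s = state.normal_streak * t1
    elif fault:
        # Fault present: back off with the (longest) second preset duration.
        # Assumption: the sleep duration is set to t2 directly.
        state.normal_streak = 0
        state.no_fault_streak = 0
        state.sleep_s = t2
    else:
        # Abnormal but no fault: sleep = consecutive no-fault count * third preset duration.
        state.normal_streak = 0
        state.no_fault_streak += 1
        state.sleep_s = state.no_fault_streak * t3
    return state.sleep_s
```

With, say, `t1 = 60`, `t2 = 300` and `t3 = 10` (illustrative values only), an abnormal indicator without a confirmed fault is rechecked soonest, a healthy indicator is checked progressively less often, and a confirmed fault is rechecked least often so that it is not reported over and over.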
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for monitoring equipment faults provided by an embodiment of the present invention;
Fig. 2 is a logic diagram of monitoring a key indicator provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of a linked list provided by an embodiment of the present invention;
Fig. 4 is a structural diagram of an apparatus for monitoring equipment faults provided by an embodiment of the present invention;
Fig. 5 is a structural diagram of a device provided by an embodiment of the present invention.
Detailed description
To make the objects, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the drawings.
An embodiment of the present invention provides a method for monitoring equipment faults. The method can be executed by any equipment capable of running programs, such as a server or a terminal. The equipment can load and run the tool collection script mentioned in the background section; through the script, different monitoring tools monitor the operating state of the equipment from different angles, so that hardware or software faults arising during operation can be found in time. The equipment may include a processor, a memory and a transceiver. The processor performs the processing for monitoring equipment faults in the flow below. The memory stores the data needed and generated during processing, for example the tool collection script and recorded equipment operating parameters. The transceiver sends and receives the related data, for example receiving instructions entered by the user and reporting the fault information of equipment faults. The equipment can run multiple processes simultaneously; while running, each process occupies CPU processing resources to a varying degree, uses a certain amount of memory, and generates disk I/O.
The processing flow shown in Fig. 1 is described in detail below with reference to specific embodiments, and may be as follows:
Step 101: at intervals of the monitoring sleep duration of the target key indicator, monitor the target key indicator through the basic tools included in the tool collection script.
In implementation, after a technician installs the tool collection script on the equipment, the equipment can load and run it, and then monitor multiple key indicators through the basic tools included in the script. The key indicators can be preset; together they make it relatively simple to find out in time whether the equipment has developed a fault. Each key indicator can be monitored in real time by a small number of basic tools capable of showing whether that indicator is abnormal. Because only a few basic tools are run, little processing resource is consumed and the impact on equipment performance is small. A monitoring sleep duration can be set individually for each key indicator: each time that duration elapses, the equipment monitors the corresponding key indicator once through the basic tools included in the script. The monitoring sleep durations of different key indicators can differ, and so, accordingly, can their monitoring times. Taking the target key indicator as an example, after running the tool collection script the equipment can monitor the target key indicator, at intervals of its monitoring sleep duration, through the basic tools included in the script.
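The independence of the per-indicator sleep durations can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function `run_schedule` and the indicator names are hypothetical, time is simulated rather than slept, and the sleep durations are held constant, whereas the method described here adjusts them after every check.

```python
import heapq


def run_schedule(sleep_by_indicator, horizon):
    """Return the (time, indicator) check events that fall within `horizon` seconds."""
    # Each indicator gets its own next-due time, so indicators never block each other.
    queue = [(sleep, name) for name, sleep in sleep_by_indicator.items()]
    heapq.heapify(queue)
    events = []
    while queue and queue[0][0] <= horizon:
        due, name = heapq.heappop(queue)
        events.append((due, name))  # a real script would run the basic tool here
        heapq.heappush(queue, (due + sleep_by_indicator[name], name))
    return events


# Two indicators with different sleep durations are checked at different times.
print(run_schedule({"cpu": 2, "mem": 3}, 6))
```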
Optionally, the key indicators include at least one or more of CPU usage, memory usage, load value, I/O wait duration and the CPU usage of each process. It should be understood that in other embodiments the key indicators are not limited to those listed above.
In implementation, five indicators can be chosen as key indicators: CPU usage, memory usage, load value, I/O wait duration and the CPU usage of each process. CPU usage can be checked with the "mpstat" tool; memory usage can be checked through the "used" and "free" fields of "free -m"; the load value can be checked through the 1-minute load field of the "/proc/loadavg" file; the I/O wait duration can be checked with the "mpstat" tool; and the CPU usage of each process can be checked with the "top" tool.
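As an illustration of the load-value check, the following Python sketch reads the 1-minute load average, which is the first whitespace-separated field of "/proc/loadavg" on Linux. The function names and the threshold value are hypothetical and chosen only for the example; the text does not specify them.

```python
def one_minute_load(loadavg_text):
    """Return the 1-minute load average from the contents of /proc/loadavg."""
    # The file looks like "0.20 0.18 0.12 1/80 11206"; the first field is the 1-minute load.
    return float(loadavg_text.split()[0])


def load_is_abnormal(loadavg_text, threshold):
    """Threshold check; `threshold` is an assumed, empirically chosen value."""
    return one_minute_load(loadavg_text) > threshold
```

On a Linux machine the text would come from `open("/proc/loadavg").read()`; passing the text in as an argument keeps the parsing testable on other systems.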
Step 102: if the target key indicator is abnormal, detect through the tool collection script whether a fault currently exists; otherwise, adjust the monitoring sleep duration of the target key indicator based on the first preset duration.
In implementation, when the equipment monitors the target key indicator through the basic tools in the tool collection script, it can judge whether the monitored indicator is abnormal by threshold comparison, using empirical data from routine analysis, and thereby decide whether the subsequent processing needs to be triggered; the specific processing is shown in Fig. 2. If the target key indicator is found to be abnormal, the equipment can further detect through the tool collection script whether a fault currently exists. Specifically, the script is preset with data collection tools for each key indicator, to be used when that indicator is abnormal. The equipment first uses these tools to collect the equipment operating parameters related to the abnormality of the target key indicator, and then uses those parameters to confirm whether a fault currently exists. If the target key indicator is not abnormal, its monitoring sleep duration can be adjusted based on the first preset duration.
Optionally, if a key indicator is found normal several times in a row, its monitoring sleep duration can be extended appropriately. Accordingly, part of the processing of step 102 can be as follows: otherwise, count the consecutive normal count of the target key indicator, and adjust the monitoring sleep duration of the target key indicator to the product of the consecutive normal count and the first preset duration.
In implementation, if the target key indicator is found not to be abnormal when it is monitored, the equipment can count its consecutive normal count and then adjust its monitoring sleep duration to the product of that count and the first preset duration. For example, assume the first preset duration is 1 min. If the target key indicator was abnormal at the previous check and is normal at this check, the consecutive normal count is 1 and the monitoring sleep duration is adjusted to 1 × 1 min. If the target key indicator was normal at each of the previous N checks and is also normal at this check, the consecutive normal count is N + 1 and the monitoring sleep duration is adjusted to (N + 1) × 1 min. In addition, a maximum monitoring sleep duration can be set for the target key indicator, so that the monitoring sleep duration never exceeds this maximum whatever the consecutive normal count is. This prevents the monitoring sleep duration from growing so large, after the target key indicator has been normal for a long time, that an abnormality cannot be detected in time.
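The capped product rule can be written as a one-line Python function. This is a minimal sketch; the function name `normal_sleep` and the 10-minute cap used in the usage example are assumptions, since the text does not give a value for the maximum.

```python
def normal_sleep(normal_streak, t1, max_sleep):
    """Sleep duration after `normal_streak` consecutive normal checks, capped at `max_sleep`."""
    return min(normal_streak * t1, max_sleep)


# With a first preset duration of 60 s and an assumed cap of 600 s:
for streak in (1, 5, 20):
    print(streak, normal_sleep(streak, 60, 600))
```

The first normal check yields a sleep of 60 s and the fifth yields 300 s; from the tenth consecutive normal check onward the sleep stays at the 600 s cap.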
Step 103: if a fault currently exists, adjust the monitoring sleep duration of the target key indicator based on the second preset duration, and determine and report the fault information of the current fault through the tool collection script.
In implementation, if the equipment operating parameters confirm in step 102 that a fault currently exists, the equipment can adjust the monitoring sleep duration of the target key indicator based on the second preset duration. At the same time, the equipment can determine and report the fault information of the current fault through the tool collection script. Here, technicians can anticipate the faults the equipment may develop and record the characteristics of the equipment operating parameters when each fault occurs; the parameter characteristics and the corresponding fault information can then be written into the source code of the tool collection script. After collecting the equipment operating parameters, the equipment can thus determine the fault information corresponding to them.
Optionally, if the same fault is detected repeatedly within a short time, the corresponding fault information can be reported only intermittently. Part of the processing of step 103 can therefore be as follows: determine the fault information of the current fault through the tool collection script, and increment the short-term repetition count of the current fault by one; when the short-term repetition count equals the fault reporting threshold corresponding to the current fault, report the fault information of the current fault, and increase the fault reporting threshold of the current fault according to a preset rule.
In implementation, the equipment can determine the fault information of the current fault through the tool collection script and increment the short-term repetition count of the current fault by one. The short-term repetition count reflects how many times the equipment has detected the fault within a short time. The equipment then compares the incremented count with the fault reporting threshold corresponding to the current fault. If they are equal, the equipment reports the fault information of the current fault and increases the fault reporting threshold according to the preset rule; otherwise, it does not report. For example, assume the preset rule grows the fault reporting threshold exponentially with base 3, so that the threshold takes the values 1, 3, 9, 27, ... in turn. Combined with the short-term repetition count, this means: the fault is reported the first time its fault information is determined, not reported the second time, reported the third time, not reported from the fourth time until it is reported again the ninth time, and so on.
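The reporting rule in the example above (thresholds 1, 3, 9, 27, ...) can be sketched as a small Python class. The class name `FaultCounter` and its attribute names are hypothetical; multiplying the threshold by 3 is the example rule from the text, and other preset rules are possible.

```python
class FaultCounter:
    """Tracks one fault's short-term repetition count and reporting threshold."""

    def __init__(self, factor=3):
        self.repeats = 0       # short-term repetition count
        self.threshold = 1     # fault reporting threshold
        self.factor = factor   # example preset rule: multiply the threshold by 3

    def observe(self):
        """Record one detection; return True if this detection should be reported."""
        self.repeats += 1
        if self.repeats == self.threshold:
            self.threshold *= self.factor
            return True
        return False


counter = FaultCounter()
reported = [n for n in range(1, 30) if counter.observe()]
print(reported)
```

Over 29 consecutive detections only the 1st, 3rd, 9th and 27th are reported, so a persistent fault produces progressively fewer reports.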
Optionally, the processing of determining the fault information and updating the short-term repetition count can be as follows: select, through the tool collection script, the fault cause of the current fault from the preset fault causes corresponding to the target key indicator, and determine the fault features of the fault cause; if the fault cause is recorded locally, and the similarity between the fault features of the locally recorded fault cause and the fault features determined this time is greater than a preset threshold, increment the short-term repetition count of the locally recorded fault cause by one; otherwise, record the fault cause and fault features determined this time, and set the short-term repetition count of the fault cause to one.
In implementation, while determining the fault information of the current fault, the equipment can select, through the tool collection script, the fault cause of the current fault from the preset fault causes corresponding to the target key indicator, and determine the fault features of that cause. Taking CPU usage, load value, the CPU usage of each process and I/O wait duration as the key indicators, the specific preset fault causes and the way their fault features are determined can be found in Table 1 below. The equipment can then check whether the same fault cause is already recorded locally. If it is, the equipment can determine the similarity between the fault features of the locally recorded fault cause and the fault features determined this time; for example, if there are four fault features and only one of them is consistent, the similarity is 1/4. If the similarity is greater than the preset threshold, the equipment increments the short-term repetition count of the locally recorded fault cause by one. If the fault cause is not recorded locally, or the similarity does not exceed the preset threshold, the equipment records the fault cause and fault features determined this time and sets the short-term repetition count of the fault cause to one. Note that a locally recorded fault cause has a limited storage duration; once that duration has elapsed, the equipment automatically deletes the fault cause and its fault features.
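The similarity check and record update described above can be sketched in Python as follows. The sketch assumes that fault features are compared position by position and that local records are kept in a dictionary keyed by fault cause; the function names, the record layout and the sample values are hypothetical, and the expiry of records after their storage duration is omitted.

```python
def similarity(recorded, current):
    """Fraction of feature positions whose values match (e.g. 1 of 4 gives 0.25)."""
    if not recorded or len(recorded) != len(current):
        return 0.0
    return sum(a == b for a, b in zip(recorded, current)) / len(recorded)


def update_record(records, cause, features, min_similarity):
    """Update the local records for one detection; return the new repetition count."""
    entry = records.get(cause)
    if entry is not None and similarity(entry["features"], features) > min_similarity:
        entry["repeats"] += 1  # same fault seen again within the storage duration
    else:
        # New cause, or features too different: (re)record and start counting at one.
        records[cause] = {"features": features, "repeats": 1}
    return records[cause]["repeats"]
```

For example, with `min_similarity = 0.5`, two detections of the same cause with identical features give a count of 2, while a later detection sharing only one of four features resets the record and the count to 1.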
Table 1
Optionally, the fault causes are recorded in the form of a linked list, where the linked list contains multiple nodes, each node corresponds to one key indicator, and each key indicator corresponds to one or more child lists. Each child list contains multiple list heads used to record fault causes, and each list head corresponds to multiple child nodes, which respectively store the fault features, short-term repetition count, and fault reporting threshold of the fault cause.
In implementation, among the data structures used in programming, a linked list is convenient to traverse, is easy to extend (child lists can be nested without limit), and has good data type compatibility in that it can store data of any type, so the fault causes may be recorded in the form of a linked list. Again taking CPU usage, load value, per-process CPU usage, and I/O wait time as example key indicators, the linked list is shown in Fig. 3. The trunk of the linked list consists of four nodes, CPU, LOAD, PROCESS, and IO, and each key indicator corresponds to one or more child lists. The child list of the CPU branch contains as many list heads as there are preset fault causes for abnormal CPU usage. The LOAD branch can be divided into three child lists, for disks (SDA, SDB, ...), processes (PROCESS_A, PROCESS_B, ...), and CPUs (CPU0, CPU1, ...): the LOAD-disk child list contains as many list heads as the device has disks, the LOAD-process child list contains N list heads, and the LOAD-CPU child list contains as many list heads as the device has logical CPUs. The child list of the PROCESS branch contains N list heads. The child list of the IO branch contains as many list heads as the device has disks. Each list head may correspond to multiple child nodes that respectively store the fault features, short-term repetition count, and fault reporting threshold of the fault cause.
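A rough Python sketch of such a nested linked list follows. The class and function names are hypothetical, the per-cause child nodes are simplified into plain fields on each list head, and the CPU and PROCESS branches are left without child lists for brevity; Fig. 3 of the patent remains the authoritative layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CauseHead:
    """List head for one fault cause; child nodes are modelled as fields."""
    name: str                      # e.g. "SDA", "CPU0", or a process name
    features: tuple = ()
    repeat_count: int = 0
    report_threshold: int = 1
    next: Optional["CauseHead"] = None


@dataclass
class SubList:
    """One child list under a key indicator, e.g. LOAD-disk."""
    label: str
    first: Optional[CauseHead] = None
    next: Optional["SubList"] = None


@dataclass
class MetricNode:
    """Trunk node, one per key indicator."""
    metric: str
    sublists: Optional[SubList] = None
    next: Optional["MetricNode"] = None


def chain(items):
    """Link items through their .next field; return the first item or None."""
    head = None
    for item in reversed(items):
        item.next = head
        head = item
    return head


def walk(node):
    """Traverse any of the three list levels via .next."""
    while node is not None:
        yield node
        node = node.next


def build_example(disks, cpus):
    """Build the CPU/LOAD/PROCESS/IO trunk with LOAD and IO child lists."""
    load = MetricNode("LOAD", chain([
        SubList("disk", chain([CauseHead(d) for d in disks])),
        SubList("process"),
        SubList("cpu", chain([CauseHead(c) for c in cpus])),
    ]))
    io = MetricNode("IO", chain([
        SubList("disk", chain([CauseHead(d) for d in disks])),
    ]))
    return chain([MetricNode("CPU"), load, MetricNode("PROCESS"), io])
```

Because all three levels share the same `next` convention, a single `walk` helper is enough to traverse the trunk, the child lists of one indicator, or the list heads of one child list.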
Step 104: if no fault currently exists, adjust the monitoring sleep duration of the target key indicator based on a third preset duration.
The second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration.
In implementation, if the device confirms from the device operating parameters in step 102 that no fault currently exists, the device may adjust the monitoring sleep duration of the target key indicator based on the third preset duration. Note that the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration. The reasoning is as follows. First, because a device fault generally persists for some time before it is repaired, the corresponding key indicator also stays abnormal throughout that time. Therefore, once the target key indicator has been found abnormal and the fault information of the current fault has been determined, the device can wait a relatively long time before monitoring the target key indicator again, so as to avoid repeatedly detecting the same fault; this is why the longer second preset duration is used. Second, the probability that a running device fails is relatively small, and the device is in a normal state most of the time, so the key indicator does not need to be monitored frequently. At the same time, so that a fault can be detected promptly once it occurs, the monitoring interval should not be too long either. If the target key indicator is found to be normal, the first preset duration of moderate length is therefore used. Third, monitoring the key indicators mainly serves as an early warning. When the target key indicator is found abnormal, the device has very probably failed. If further detection does not find a fault, the fault may well be at an early stage in which some device operating parameters have not yet been affected, or there may be another reason, such as a temporary fluctuation of the target key indicator. In this case, where the device state cannot be determined, the target key indicator needs to be monitored again within a short time, so the shorter third preset duration is used.
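The three-way choice above can be written as a small lookup. This is only a sketch: the `Outcome` enum, the function name `base_sleep`, and the default values for the first and second preset durations are assumptions (10 s for the third preset duration is taken from the example given later in the text).

```python
from enum import Enum


class Outcome(Enum):
    NORMAL = "normal"                        # indicator normal
    FAULT = "fault"                          # indicator abnormal, fault confirmed
    ABNORMAL_NO_FAULT = "abnormal_no_fault"  # indicator abnormal, no fault found


def base_sleep(outcome, first=60.0, second=300.0, third=10.0):
    """Return the preset duration (seconds) that applies to a check outcome.

    The defaults for first and second are hypothetical; only the ordering
    second > first > third is required by the method.
    """
    if not (second > first > third > 0):
        raise ValueError("durations must satisfy second > first > third > 0")
    return {
        Outcome.NORMAL: first,
        Outcome.FAULT: second,
        Outcome.ABNORMAL_NO_FAULT: third,
    }[outcome]
```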
Optionally, if a key indicator is found abnormal several times in a row and no fault is found on any of those occasions, the monitoring sleep duration of that key indicator can be extended appropriately. Correspondingly, the processing of step 104 may be as follows: if no fault currently exists, count the number of consecutive fault-free results following consecutive detections of an abnormal target key indicator, and adjust the monitoring sleep duration of the target key indicator to the product of the consecutive fault-free count and the third preset duration.
In implementation, if the target key indicator is found abnormal but no fault is found in further detection, the device may count the number of consecutive fault-free results following consecutive detections of an abnormal target key indicator, and then adjust the monitoring sleep duration of the target key indicator to the product of that consecutive fault-free count and the third preset duration. As an example, assume the third preset duration is 10 s. If in the previous check the target key indicator was normal, or it was abnormal and a device fault was confirmed in further detection, and in the current check the target key indicator is abnormal but no fault is found, then the consecutive fault-free count is 1 and the monitoring sleep duration of the target key indicator is adjusted to 1 × 10 s. If the target key indicator was abnormal in each of the previous N checks with no fault found in further detection, and in the current check it is again abnormal with no fault found, then the consecutive fault-free count is N+1 and the monitoring sleep duration is adjusted to (N+1) × 10 s. In addition, a maximum value may be set for the monitoring sleep duration of the target key indicator, so that whatever the consecutive fault-free count is, the monitoring sleep duration does not exceed that maximum.
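The worked example above corresponds to the following sketch. The class name `NoFaultBackoff` and the 60 s cap are hypothetical; the 10 s base is the value used in the text.

```python
class NoFaultBackoff:
    """Tracks the consecutive fault-free count for one key indicator."""

    def __init__(self, third=10.0, max_sleep=60.0):
        self.third = third          # third preset duration, seconds
        self.max_sleep = max_sleep  # assumed cap on the sleep duration
        self.count = 0              # consecutive fault-free count

    def on_abnormal_no_fault(self):
        """Indicator abnormal, no fault found: grow the sleep linearly."""
        self.count += 1
        return min(self.count * self.third, self.max_sleep)

    def reset(self):
        """Indicator normal, or a fault was confirmed: start counting again."""
        self.count = 0
```

The caller is expected to invoke `reset()` whenever a check ends with a normal indicator or a confirmed fault, so that the next abnormal-but-fault-free result starts again at 1 × 10 s.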
In the embodiment of the present invention, at every monitoring sleep duration of the target key indicator, the target key indicator is monitored through a basic tool included in the tool set script. If the target key indicator is abnormal, the tool set script detects whether a fault currently exists; otherwise, the monitoring sleep duration of the target key indicator is adjusted based on the first preset duration. If a fault currently exists, the monitoring sleep duration of the target key indicator is adjusted based on the second preset duration, and the fault information of the current fault is determined and reported through the tool set script. If no fault currently exists, the monitoring sleep duration of the target key indicator is adjusted based on the third preset duration. The second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration. In this way, when the tool set script is used, different key indicators are given different monitoring sleep durations, the monitoring of the individual key indicators does not interfere with one another, and monitoring sleep durations of different lengths are set and adjusted according to the different monitoring results. This avoids both overly frequent monitoring of the key indicators and frequent repeated reporting of the same fault, while still allowing device faults to be discovered in a timely manner.
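To illustrate how each key indicator can keep its own monitoring sleep duration without affecting the others, here is a small simulated scheduler. It is a sketch only: the function `run_schedule`, the use of a priority queue, and the simulated clock are assumptions, and each check callable stands in for the monitoring and detection described above.

```python
import heapq


def run_schedule(checks, until):
    """Run per-indicator checks on a simulated clock.

    checks maps an indicator name to a callable taking the current time and
    returning that indicator's next monitoring sleep duration (must be > 0).
    Returns the list of (time, indicator) pairs in the order they were run.
    """
    queue = [(0.0, name) for name in sorted(checks)]
    heapq.heapify(queue)
    log = []
    while queue and queue[0][0] <= until:
        due, name = heapq.heappop(queue)
        sleep = checks[name](due)      # check decides its own next interval
        log.append((due, name))
        heapq.heappush(queue, (due + sleep, name))
    return log


if __name__ == "__main__":
    demo = {"cpu": lambda now: 10.0, "io": lambda now: 25.0}
    for when, name in run_schedule(demo, 30.0):
        print(f"t={when:>5.1f}s check {name}")
```

A real deployment would sleep on a wall clock rather than simulate time; the point here is only that each indicator is rescheduled from its own result.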
Based on the same technical concept, an embodiment of the present invention further provides an apparatus for monitoring equipment faults. As shown in Fig. 4, the apparatus includes:
A monitoring module 401, configured to monitor the target key indicator, at every monitoring sleep duration of the target key indicator, through a basic tool included in the tool set script;
An adjustment module 402, configured to: if the target key indicator is abnormal, detect through the tool set script whether a fault currently exists, and otherwise adjust the monitoring sleep duration of the target key indicator based on the first preset duration; if a fault currently exists, adjust the monitoring sleep duration of the target key indicator based on the second preset duration, and determine and report the fault information of the current fault through the tool set script; and if no fault currently exists, adjust the monitoring sleep duration of the target key indicator based on the third preset duration;
The second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration.
Optionally, the adjustment module 402 is specifically configured to:
Otherwise, count the number of consecutive normal results of the target key indicator, and adjust the monitoring sleep duration of the target key indicator to the product of the consecutive normal count and the first preset duration.
Optionally, the adjustment module 402 is specifically configured to:
If no fault currently exists, count the number of consecutive fault-free results following consecutive detections of an abnormal target key indicator, and adjust the monitoring sleep duration of the target key indicator to the product of the consecutive fault-free count and the third preset duration.
Optionally, the adjustment module 402 is specifically configured to:
Determine the fault information of the current fault through the tool set script, and increment the short-term repetition count of the current fault by one;
When the short-term repetition count equals the fault reporting threshold corresponding to the current fault, report the fault information of the current fault, and increase the fault reporting threshold of the current fault according to a preset rule.
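The reporting rule just described can be sketched as follows. The text only says the threshold is increased "according to a preset rule"; doubling it, as done here, is purely an assumption, as are the function name `should_report` and the dictionary layout.

```python
def should_report(state, growth=2):
    """Count one more occurrence of a fault and decide whether to report it.

    state holds 'repeat_count' (short-term repetition count) and
    'report_threshold' (fault reporting threshold). growth is a hypothetical
    preset rule: the threshold is multiplied by it after every report.
    """
    state["repeat_count"] += 1
    if state["repeat_count"] == state["report_threshold"]:
        state["report_threshold"] *= growth
        return True
    return False
```

Starting from a count of 0 and a threshold of 1, this rule reports on the 1st, 2nd, 4th, and 8th occurrences, so a persistent fault is reported less and less often.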
Optionally, the adjustment module 402 is specifically configured to:
Select, through the tool set script, the fault cause of the current fault from the preset fault causes corresponding to the target key indicator, and determine the fault features of the fault cause;
If the fault cause is recorded locally, and the similarity between the fault features of the locally recorded fault cause and the fault features determined this time is greater than a preset threshold, increment the short-term repetition count of the locally recorded fault cause by one; otherwise, record the fault cause and fault features determined this time, and set the short-term repetition count of the fault cause to one.
Optionally, the fault causes are recorded in the form of a linked list, where the linked list contains multiple nodes, each node corresponds to one key indicator, and each key indicator corresponds to one or more child lists. Each child list contains multiple list heads used to record fault causes, and each list head corresponds to multiple child nodes, which respectively store the fault features, short-term repetition count, and fault reporting threshold of the fault cause.
Optionally, the key indicators include at least one or more of CPU usage, memory usage, load value, I/O wait time, and per-process CPU usage.
In the embodiment of the present invention, at every monitoring sleep duration of the target key indicator, the target key indicator is monitored through a basic tool included in the tool set script. If the target key indicator is abnormal, the tool set script detects whether a fault currently exists; otherwise, the monitoring sleep duration of the target key indicator is adjusted based on the first preset duration. If a fault currently exists, the monitoring sleep duration of the target key indicator is adjusted based on the second preset duration, and the fault information of the current fault is determined and reported through the tool set script. If no fault currently exists, the monitoring sleep duration of the target key indicator is adjusted based on the third preset duration. The second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration. In this way, when the tool set script is used, different key indicators are given different monitoring sleep durations, the monitoring of the individual key indicators does not interfere with one another, and monitoring sleep durations of different lengths are set and adjusted according to the different monitoring results. This avoids both overly frequent monitoring of the key indicators and frequent repeated reporting of the same fault, while still allowing device faults to be discovered in a timely manner.
It should be noted that, when the apparatus for monitoring equipment faults provided by the above embodiment monitors equipment faults, the division into the functional modules described above is only an example. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. In addition, the apparatus for monitoring equipment faults provided by the above embodiment and the method embodiment for monitoring equipment faults belong to the same concept; for the specific implementation process, see the method embodiment, which is not repeated here.
Fig. 5 is a schematic structural diagram of a device provided by an embodiment of the present invention. The device 500 may vary considerably depending on configuration or performance, and may include one or more central processing units 522 (for example, one or more processors), memory 532, and one or more storage media 530 (for example, one or more mass storage devices) storing application programs 552 or data 554. The memory 532 and the storage medium 530 may provide transient or persistent storage. A program stored in the storage medium 530 may include one or more modules (not marked in the figure), and each module may include a series of instruction operations on the device. Further, the central processing unit 522 may be configured to communicate with the storage medium 530 and to execute, on the device 500, the series of instruction operations in the storage medium 530.
The device 500 may also include one or more power supplies 525, one or more wired or wireless network interfaces 550, one or more input/output interfaces 558, one or more keyboards 555, and/or one or more operating systems 551, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
The device 500 may include a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for performing the above monitoring of equipment faults.
A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (16)

1. A method for monitoring equipment faults, characterized in that the method comprises: at every monitoring sleep duration of a target key indicator, monitoring the target key indicator through a basic tool included in a tool set script; if the target key indicator is abnormal, detecting through the tool set script whether a fault currently exists, otherwise adjusting the monitoring sleep duration of the target key indicator based on a first preset duration; if a fault currently exists, adjusting the monitoring sleep duration of the target key indicator based on a second preset duration, and determining and reporting fault information of the current fault through the tool set script; if no fault currently exists, adjusting the monitoring sleep duration of the target key indicator based on a third preset duration, wherein the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration.
2. The method according to claim 1, characterized in that the otherwise adjusting the monitoring sleep duration of the target key indicator based on a first preset duration comprises: otherwise counting the number of consecutive normal results of the target key indicator, and adjusting the monitoring sleep duration of the target key indicator to the product of the consecutive normal count and the first preset duration.
3. The method according to claim 1, characterized in that the adjusting, if no fault currently exists, the monitoring sleep duration of the target key indicator based on a third preset duration comprises: if no fault currently exists, counting the number of consecutive fault-free results following consecutive detections of an abnormal target key indicator, and adjusting the monitoring sleep duration of the target key indicator to the product of the consecutive fault-free count and the third preset duration.
4. The method according to claim 1, characterized in that the determining and reporting fault information of the current fault through the tool set script comprises: determining the fault information of the current fault through the tool set script, and incrementing a short-term repetition count of the current fault by one; when the short-term repetition count equals a fault reporting threshold corresponding to the current fault, reporting the fault information of the current fault, and increasing the fault reporting threshold of the current fault according to a preset rule.
5. The method according to claim 4, characterized in that the determining the fault information of the current fault through the tool set script and incrementing the short-term repetition count of the current fault by one comprises: selecting, through the tool set script, a fault cause of the current fault from preset fault causes corresponding to the target key indicator, and determining fault features of the fault cause; if the fault cause is recorded locally, and the similarity between the fault features of the locally recorded fault cause and the fault features determined this time is greater than a preset threshold, incrementing the short-term repetition count of the locally recorded fault cause by one, otherwise recording the fault cause and fault features determined this time, and setting the short-term repetition count of the fault cause to one.
6. The method according to claim 5, characterized in that the fault cause is recorded in the form of a linked list, wherein the linked list contains multiple nodes, each node corresponds to one key indicator, each key indicator corresponds to one or more child lists, each child list contains multiple list heads used to record fault causes, each list head corresponds to multiple child nodes, and the multiple child nodes respectively store the fault features, short-term repetition count, and fault reporting threshold of the fault cause.
7. The method according to any one of claims 1-6, characterized in that the key indicators include at least one or more of CPU usage, memory usage, load value, I/O wait time, and per-process CPU usage.
8. An apparatus for monitoring equipment faults, characterized in that the apparatus comprises: a monitoring module, configured to monitor a target key indicator, at every monitoring sleep duration of the target key indicator, through a basic tool included in a tool set script; an adjustment module, configured to: if the target key indicator is abnormal, detect through the tool set script whether a fault currently exists, otherwise adjust the monitoring sleep duration of the target key indicator based on a first preset duration; if a fault currently exists, adjust the monitoring sleep duration of the target key indicator based on a second preset duration, and determine and report fault information of the current fault through the tool set script; if no fault currently exists, adjust the monitoring sleep duration of the target key indicator based on a third preset duration; wherein the second preset duration is greater than the first preset duration, and the first preset duration is greater than the third preset duration.
9. The apparatus according to claim 8, characterized in that the adjustment module is specifically configured to: otherwise count the number of consecutive normal results of the target key indicator, and adjust the monitoring sleep duration of the target key indicator to the product of the consecutive normal count and the first preset duration.
10. The apparatus according to claim 8, characterized in that the adjustment module is specifically configured to: if no fault currently exists, count the number of consecutive fault-free results following consecutive detections of an abnormal target key indicator, and adjust the monitoring sleep duration of the target key indicator to the product of the consecutive fault-free count and the third preset duration.
11. The apparatus according to claim 8, characterized in that the adjustment module is specifically configured to: determine the fault information of the current fault through the tool set script, and increment a short-term repetition count of the current fault by one; when the short-term repetition count equals a fault reporting threshold corresponding to the current fault, report the fault information of the current fault, and increase the fault reporting threshold of the current fault according to a preset rule.
12. The apparatus according to claim 11, characterized in that the adjustment module is specifically configured to: select, through the tool set script, a fault cause of the current fault from preset fault causes corresponding to the target key indicator, and determine fault features of the fault cause; if the fault cause is recorded locally, and the similarity between the fault features of the locally recorded fault cause and the fault features determined this time is greater than a preset threshold, increment the short-term repetition count of the locally recorded fault cause by one, otherwise record the fault cause and fault features determined this time, and set the short-term repetition count of the fault cause to one.
13. The apparatus according to claim 12, characterized in that the fault cause is recorded in the form of a linked list, wherein the linked list contains multiple nodes, each node corresponds to one key indicator, each key indicator corresponds to one or more child lists, each child list contains multiple list heads used to record fault causes, each list head corresponds to multiple child nodes, and the multiple child nodes respectively store the fault features, short-term repetition count, and fault reporting threshold of the fault cause.
14. The apparatus according to any one of claims 8-13, characterized in that the key indicators include at least one or more of CPU usage, memory usage, load value, I/O wait time, and per-process CPU usage.
15. A device, characterized in that the device comprises a processor and a memory, the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the method for monitoring equipment faults according to any one of claims 1 to 7.
16. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the method for monitoring equipment faults according to any one of claims 1 to 7.
CN201810866734.6A 2018-08-01 2018-08-01 Method and device for monitoring equipment fault Active CN109165138B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810866734.6A CN109165138B (en) 2018-08-01 2018-08-01 Method and device for monitoring equipment fault

Publications (2)

Publication Number Publication Date
CN109165138A true CN109165138A (en) 2019-01-08
CN109165138B CN109165138B (en) 2022-06-17

Family

ID=64898638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810866734.6A Active CN109165138B (en) 2018-08-01 2018-08-01 Method and device for monitoring equipment fault

Country Status (1)

Country Link
CN (1) CN109165138B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110048932A (en) * 2019-04-03 2019-07-23 北京奇安信科技有限公司 Validation checking method, apparatus, equipment and the storage medium of mail Monitoring function
CN110995519A (en) * 2020-02-28 2020-04-10 北京信安世纪科技股份有限公司 Load balancing method and device
CN111143134A (en) * 2019-12-30 2020-05-12 深圳Tcl新技术有限公司 Fault processing method, equipment and computer storage medium
CN111464372A (en) * 2019-01-18 2020-07-28 广东天创同工大数据应用有限公司 Method for improving communication refreshing speed
CN112446978A (en) * 2019-08-29 2021-03-05 长鑫存储技术有限公司 Monitoring method and device of semiconductor equipment, storage medium and computer equipment
CN113840122A (en) * 2021-11-25 2021-12-24 南方电网数字电网研究院有限公司 Monitoring control method, device, electronic device and storage medium
CN114414911A (en) * 2021-12-20 2022-04-29 国网江苏省电力有限公司 Performance index monitoring system and method of power utilization information acquisition field operation and maintenance tool
CN115904917A (en) * 2023-02-22 2023-04-04 湖北泰跃卫星技术发展股份有限公司 Internet of things exception handling method and device, computer equipment and storage medium
CN116795196A (en) * 2023-08-25 2023-09-22 深圳市德航智能技术有限公司 Implementation method for reinforcing ultra-long standby of handheld tablet computer
CN116938624A (en) * 2023-08-15 2023-10-24 四川虹美智能科技有限公司 Equipment state reporting method and system, internet of things module and Internet of things platform

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120290882A1 (en) * 2011-05-10 2012-11-15 Corkum David L Signal processing during fault conditions
CN105549508A (en) * 2015-12-25 2016-05-04 北京奇虎科技有限公司 Alarm method based on information combination and apparatus thereof
CN106357469A (en) * 2016-11-16 2017-01-25 郑州云海信息技术有限公司 Dynamic adjustment method and device of resource monitoring mode
CN106502868A (en) * 2016-11-18 2017-03-15 国云科技股份有限公司 A method for dynamically adjusting monitoring frequency suitable for cloud computing
CN106878111A (en) * 2017-03-15 2017-06-20 郑州云海信息技术有限公司 A highly available cloud monitoring system and monitoring method
CN107612756A (en) * 2017-10-31 2018-01-19 广西宜州市联森网络科技有限公司 A kind of operation management system with intelligent trouble analyzing and processing function

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111464372B (en) * 2019-01-18 2021-09-24 广东天创同工大数据应用有限公司 A method for improving communication refresh rate
CN111464372A (en) * 2019-01-18 2020-07-28 广东天创同工大数据应用有限公司 Method for improving communication refreshing speed
CN110048932A (en) * 2019-04-03 2019-07-23 北京奇安信科技有限公司 Validity detection method, apparatus, device and storage medium for a mail monitoring function
CN112446978A (en) * 2019-08-29 2021-03-05 长鑫存储技术有限公司 Monitoring method and device of semiconductor equipment, storage medium and computer equipment
CN111143134B (en) * 2019-12-30 2024-06-04 深圳Tcl新技术有限公司 Fault processing method, device and computer storage medium
CN111143134A (en) * 2019-12-30 2020-05-12 深圳Tcl新技术有限公司 Fault processing method, equipment and computer storage medium
CN110995519B (en) * 2020-02-28 2020-06-26 北京信安世纪科技股份有限公司 Load balancing method and device
CN110995519A (en) * 2020-02-28 2020-04-10 北京信安世纪科技股份有限公司 Load balancing method and device
CN113840122A (en) * 2021-11-25 2021-12-24 南方电网数字电网研究院有限公司 Monitoring control method, device, electronic device and storage medium
CN113840122B (en) * 2021-11-25 2022-03-08 南方电网数字电网研究院有限公司 Monitoring control method, device, electronic device and storage medium
CN114414911A (en) * 2021-12-20 2022-04-29 国网江苏省电力有限公司 Performance index monitoring system and method of power utilization information acquisition field operation and maintenance tool
CN115904917A (en) * 2023-02-22 2023-04-04 湖北泰跃卫星技术发展股份有限公司 Internet of things exception handling method and device, computer equipment and storage medium
CN116938624A (en) * 2023-08-15 2023-10-24 四川虹美智能科技有限公司 Equipment state reporting method and system, internet of things module and Internet of things platform
CN116795196A (en) * 2023-08-25 2023-09-22 深圳市德航智能技术有限公司 Implementation method for reinforcing ultra-long standby of handheld tablet computer
CN116795196B (en) * 2023-08-25 2023-11-17 深圳市德航智能技术有限公司 Implementation method for reinforcing ultra-long standby of handheld tablet computer

Also Published As

Publication number Publication date
CN109165138B (en) 2022-06-17

Similar Documents

Publication Publication Date Title
CN109165138A (en) A kind of method and apparatus of monitoring equipment fault
CN114500250B (en) System linkage comprehensive operation and maintenance system and method in cloud mode
Zheng et al. Co-analysis of RAS log and job log on Blue Gene/P
US11061756B2 (en) Enabling symptom verification
Tan et al. Adaptive system anomaly prediction for large-scale hosting infrastructures
CN111881011A (en) Log management method, platform, server and storage medium
JPH04230538A (en) Method and apparatus for detecting faulty software component
CN105224888B (en) Disk array data protection system based on security early-warning technology
CN112699007B (en) Method, system, network device and storage medium for monitoring machine performance
CN114356499A (en) Kubernetes cluster alarm root cause analysis method and device
CN107807872A (en) Method for monitoring the operating state of a power transmission and transformation system
CN111327685A (en) Distributed storage system data processing method, apparatus, device and storage medium
US8601318B2 (en) Method, apparatus and computer program product for rule-based directed problem resolution for servers with scalable proactive monitoring
CN115658420A (en) Database monitoring method and system
CN111901172A (en) Application service monitoring method and system based on cloud computing environment
CN112527594A (en) Hard disk inspection method, device and system
CN116340045A (en) Database exception processing method, device, equipment and computer-readable storage medium
CN112882903A (en) Distributed monitoring method
CN118964100A (en) Cache anomaly detection method, device, storage medium and electronic device
CN118916200A (en) Abnormality positioning method, device, equipment and medium
CN113900898B (en) Data processing system, equipment and medium
CN115686890A (en) Processor fault early warning method, system, electronic equipment and medium
CN117640341A (en) Node detection method and device
CN115913895B (en) A method, device, equipment and medium for server fault diagnosis and alarm
TWI881766B (en) Alarm system and alarm method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant