
CN111835566A - System fault management method, device and system - Google Patents


Publication number
CN111835566A
Authority
CN
China
Prior art keywords
fault
troubleshooting
recovery
pushing
personnel
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010651666.9A
Other languages
Chinese (zh)
Inventor
何俊敏
杨微
易玉凤
马兴
孟波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yanxi Software Information Technology Co ltd
Original Assignee
Shanghai Yanxi Software Information Technology Co ltd
Application filed by Shanghai Yanxi Software Information Technology Co ltd
Priority to CN202010651666.9A
Publication of CN111835566A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0631: Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/0677: Localisation of faults

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a system fault management method, device and system. The method identifies a system fault according to received fault prompt information and triggers a fault work order of the corresponding dimension; generates parallel troubleshooting tasks in that dimension and pushes them to the corresponding fault handling personnel; locates the fault point according to the received troubleshooting results; searches for recovery plans matching the fault point and pushes them to the fault handling personnel; and executes the recovery plan selected by the fault handling personnel to repair the system fault.

Description

System fault management method, device and system
Technical Field
The invention relates to the technical field of information system operation and maintenance, in particular to a system fault management method, device and system.
Background
Online fault management is particularly important in the daily operation and maintenance of an IT system; it tests not only technical skill but also timeliness.
The online fault management process tests the responsiveness, judgment and organizational ability of technicians and technical teams. When a sudden production fault occurs, quickly locating the problem, finding a recovery plan and implementing it is no easy matter. In conventional online fault management, the whole chain from fault identification to fault recovery takes too long, and if the root cause cannot be found and repaired in one pass within a short time, the total fault duration risks multiplying. Service interruption caused by a system fault is often unacceptable to an enterprise: it may mean the loss of a large number of orders or customers and, in extreme cases, adverse social effects.
Therefore, a method for quickly and accurately finding and processing faults is needed.
Disclosure of Invention
To solve the above technical problems, the invention provides a system fault management method, device and system that adopt a multi-dimensional, parallel troubleshooting mode and greatly improve the timeliness of fault handling.
The technical scheme provided by the invention is as follows:
In a first aspect, a system fault management method is provided, the method comprising at least the following steps:
identifying a system fault according to received fault prompt information and triggering a fault work order of the corresponding dimension;
generating parallel troubleshooting tasks in the corresponding dimension according to the fault work order, pushing the troubleshooting tasks to the corresponding fault handling personnel respectively, and locating the fault point according to the received troubleshooting results corresponding to the troubleshooting tasks;
searching a preset recovery plan matching relation for recovery plans matching the fault point, sorting the recovery plans by priority, and pushing them to the fault handling personnel;
and receiving and executing the recovery plan selected by the fault handling personnel to repair the system fault.
In some preferred embodiments, identifying a system fault according to the received fault prompt information and triggering a fault work order of the corresponding dimension includes the following sub-steps:
receiving fault prompt information comprising at least one of multi-dimensional monitoring alarm information and manual alarm information;
pushing the fault prompt information to the fault acceptance personnel of the corresponding dimension;
and triggering the fault work order of the corresponding dimension according to an acceptance instruction received from the fault acceptance personnel.
In some preferred embodiments, when the fault prompt information is manual alarm information, before pushing the fault prompt information to the fault acceptance personnel of the corresponding dimension, the method further includes:
extracting fault type keywords from the manual alarm information;
judging, according to the fault type keywords, whether a matching fault dimension exists in a preset fault classification table;
if so, pushing the fault prompt information to the fault acceptance personnel of the corresponding dimension;
and if not, pushing the fault prompt information to the fault acceptance personnel of the general dimension.
In some preferred embodiments, generating parallel troubleshooting tasks in the corresponding dimension according to the fault work order, pushing them to the corresponding fault handling personnel, and locating the fault point according to the received troubleshooting results includes the following sub-steps:
matching the corresponding fault handling personnel in a preset system personnel table according to the fault information of the fault work order;
matching the corresponding troubleshooting tasks in a preset troubleshooting model according to the fault information;
establishing a troubleshooting task association between the troubleshooting tasks and the fault handling personnel according to the fault information;
generating parallel troubleshooting tasks based on the troubleshooting task association and pushing them to the corresponding fault handling personnel;
receiving the troubleshooting result obtained by each fault handling person executing the corresponding troubleshooting task;
and screening out the abnormal troubleshooting results and analyzing them to obtain the fault point.
In some preferred embodiments, pushing the troubleshooting tasks to the corresponding fault handling personnel specifically comprises:
simultaneously triggering a phone call notification, creating a fault management communication group, pushing the production change information for a preset time period, and sending an email notification, so as to push the troubleshooting tasks to the fault handling personnel.
In some preferred embodiments, searching the preset recovery plan matching relation for recovery plans matching the fault point, sorting them by priority and pushing them to the fault handling personnel includes the following sub-steps:
searching the preset recovery plan matching relation for a recovery plan matching the fault point and, if one exists, sorting the matching recovery plans by priority and pushing them to the fault handling personnel;
if none exists, calling a non-plan recovery operation and pushing it to the fault handling personnel as the recovery plan.
In some preferred embodiments, after receiving and executing the recovery plan selected by the fault handling personnel to repair the system fault, the method further includes verifying whether the system fault has been repaired, which specifically comprises the following sub-steps:
pushing the execution result of the recovery plan to the fault handling personnel;
receiving from the fault handling personnel a verification instruction indicating whether the fault has been recovered, and if the verification instruction indicates that it has not,
continuing to execute the recovery plan of the next priority;
and closing the fault work order once the received verification instruction indicates recovery.
In some preferred embodiments, after identifying a system fault according to the received fault prompt information and triggering a fault work order, the method further includes pushing a substitute plan corresponding to the fault work order to visiting users, which specifically comprises the following sub-steps:
searching a preset fault substitute plan relation for substitute plans matching the fault information of the fault work order;
and pushing the substitute plan information to the user terminal when a user accesses a link related to the fault work order.
In a second aspect, a system fault management apparatus is provided, the apparatus at least comprising:
the fault work order triggering module is used for identifying system faults according to the received fault prompt information and triggering fault work orders with corresponding dimensions;
the fault point positioning module is used for generating parallel fault troubleshooting tasks in corresponding dimensions according to the fault work order, respectively pushing the fault troubleshooting tasks to corresponding fault processing personnel, and positioning fault points according to received fault troubleshooting results corresponding to the fault troubleshooting tasks;
the emergency plan module is used for searching a recovery plan matched with the fault point in a preset recovery plan matching relation, sequencing the recovery plans according to priority and then pushing the recovery plans to fault handling personnel;
and the fault repairing module is used for receiving and executing the recovery plan selected by the fault processing personnel so as to repair the system fault.
In a third aspect, there is provided a computer system comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
identifying a system fault according to received fault prompt information and triggering a fault work order of the corresponding dimension;
generating parallel troubleshooting tasks in the corresponding dimension according to the fault work order, pushing the troubleshooting tasks to the corresponding fault handling personnel respectively, and locating the fault point according to the received troubleshooting results corresponding to the troubleshooting tasks;
searching a preset recovery plan matching relation for recovery plans matching the fault point, sorting the recovery plans by priority, and pushing them to the fault handling personnel;
and receiving and executing the recovery plan selected by the fault handling personnel to repair the system fault.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides a system fault management method, device and system. The method first identifies a system fault according to received fault prompt information and triggers a fault work order of the corresponding dimension, then generates parallel troubleshooting tasks in that dimension and pushes them to the corresponding fault handling personnel, then locates the fault point according to the received troubleshooting results, searches for recovery plans matching the fault point and pushes them to the fault handling personnel, and executes the recovery plan selected by the fault handling personnel to repair the system fault;
Furthermore, the fault prompt information comprises at least one of multi-dimensional monitoring alarm information acquired by the alarm platform and manual alarm information. Faults are divided into different dimensions according to fault type, and each dimension is monitored separately. Triggering fault work orders through multi-dimensional monitoring greatly improves fault identification sensitivity and shortens fault reporting time, which in turn shortens fault management time, avoids the risk of the fault duration multiplying, and improves user experience. Because the fault type is known when the fault work order is triggered, fault location, troubleshooting and repair need only be carried out in the corresponding dimension, which narrows the troubleshooting range and further improves the accuracy and timeliness of fault identification. In addition, identifying faults by combining multi-dimensional monitoring alarm information with manual alarm information covers system faults comprehensively and avoids omissions;
After the fault work order is triggered, a preset substitute plan corresponding to the fault is pushed to visiting users. While waiting for the system to recover, users can still accomplish the system function by following the substitute plan, which effectively avoids the unnecessary pressure on services and traffic caused by service interruption and repeated user operations, and improves user experience.
An implementation of the scheme of the application need only achieve any one of these technical effects.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a system fault management method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a troubleshooting task in a first embodiment of the invention;
FIG. 3 is a block diagram of a system fault management apparatus according to a second embodiment of the present invention;
FIG. 4 is a diagram of a computer system architecture in a third embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the current online fault management of the system, usually, a user reports a fault in the using process, and a technician checks the fault in a serial manner, so that the fault management time is long, the fault identification is insensitive, and a fault point is not easy to be quickly positioned.
The system fault management method, apparatus and system will be further described with reference to the following embodiments and accompanying drawings 1-4.
Embodiment One
Referring to FIG. 1, this embodiment provides a system fault management method, which comprises at least the following steps:
S1, identifying a system fault according to the received fault prompt information and triggering a fault work order of the corresponding dimension.
The fault prompt information comprises at least one of multi-dimensional monitoring alarm information triggered by an alarm platform and manual alarm information.
In this embodiment, faults are divided into different dimensions according to fault type. The alarm platform monitors the fault dimensions that have a high probability of being triggered and sends monitoring alarm information for the corresponding dimension. When a user encounters a fault outside the dimensions covered by the alarm platform, the system fault is reported by generating manual alarm information.
Triggering fault work orders through multi-dimensional monitoring greatly improves fault identification sensitivity, shortens fault reporting time, and thus shortens fault management time. When a fault triggers a monitoring alarm on the alarm platform, the corresponding fault type is known once the fault work order is triggered, so subsequent fault location, troubleshooting and repair need only be carried out in the corresponding dimension, which effectively narrows the troubleshooting range and improves the accuracy and timeliness of fault identification. Specifically, step S1 includes the following sub-steps:
S11, receiving fault prompt information comprising at least one of multi-dimensional monitoring alarm information and manual alarm information;
S12, pushing the fault prompt information to the fault acceptance personnel of the corresponding dimension;
S13, triggering the fault work order of the corresponding dimension according to the acceptance instruction received from the fault acceptance personnel.
Preferably, when the fault prompt information is manual alarm information, before the fault prompt information is pushed to the fault acceptance personnel of the corresponding dimension in step S12, the method further includes:
extracting fault type keywords from the manual alarm information, and judging, according to the fault type keywords, whether a matching fault dimension exists in a preset fault classification table;
if so, pushing the fault prompt information to the fault acceptance personnel of the corresponding dimension;
and if not, pushing the fault prompt information to the fault acceptance personnel of the general dimension.
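The keyword matching described above can be sketched as follows. This is a minimal illustration only: the table contents, the dimension names and the function `classify_manual_alarm` are hypothetical, and the patent does not specify how keywords are extracted (simple substring matching is assumed here).

```python
# Hypothetical fault classification table: keyword -> fault dimension.
# The real table and its keywords are not given in the source.
FAULT_CLASSIFICATION = {
    "order": "service",
    "timeout": "link",
    "database": "background",
}
GENERAL_DIMENSION = "general"


def classify_manual_alarm(text: str) -> str:
    """Return the fault dimension for a manual alarm.

    Falls back to the general dimension when no keyword matches.
    """
    lowered = text.lower()
    for keyword, dimension in FAULT_CLASSIFICATION.items():
        if keyword in lowered:
            return dimension
    return GENERAL_DIMENSION
```

A manual alarm that matches no keyword is routed to the general dimension, which mirrors the "if not" branch of the sub-steps.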
The embodiment adopts a mode of combining multidimensional monitoring alarm information and manual alarm information to identify faults, can comprehensively cover all system faults and avoid omission.
For example, when the system fault management method is applied to a logistics management system, the alarm platform monitors three dimensions: service-layer operating data, link-layer operating data and background operating data. When the operating data of any of these dimensions exceeds its alarm threshold, an alarm is triggered and service alarm information (STP), link monitoring alarm information (TRO) or background monitoring alarm information (AOPS) is generated; this is the multi-dimensional monitoring alarm information. When operating data in a dimension other than these three is abnormal, the fault is reported manually to generate manual alarm information.
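A threshold check of this kind might look like the sketch below. The alarm labels STP, TRO and AOPS come from the text; the metric values, the thresholds and the names `Alarm` and `check_metrics` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Alarm labels follow the text; thresholds are hypothetical placeholders.
ALARM_LABELS = {"service": "STP", "link": "TRO", "background": "AOPS"}
THRESHOLDS = {"service": 0.30, "link": 500.0, "background": 0.90}


@dataclass
class Alarm:
    dimension: str
    label: str
    value: float


def check_metrics(metrics: Dict[str, float]) -> List[Alarm]:
    """Emit one monitoring alarm per monitored dimension whose value exceeds its threshold."""
    alarms = []
    for dimension, value in metrics.items():
        threshold = THRESHOLDS.get(dimension)
        # Dimensions without a threshold are not monitored; per the text,
        # faults there are reported manually instead.
        if threshold is not None and value > threshold:
            alarms.append(Alarm(dimension, ALARM_LABELS[dimension], value))
    return alarms
```

For instance, `check_metrics({"service": 0.35, "link": 120.0, "background": 0.95})` yields an STP alarm and an AOPS alarm but no TRO alarm.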
The fault prompt information is then pushed to the fault acceptance personnel of the corresponding dimension, and the fault work order of that dimension is triggered according to the acceptance instruction received from them. Pushing the fault prompt information to the fault acceptance personnel allows further verification: they assess the information based on experience and the current production environment, so as to rule out incidental phenomena caused by the production environment. For example, if a platform's earlier large promotion caused a sudden surge in mail forwarding volume, the current volume may be markedly lower than during the promotion, by 30% for instance, which triggers an STP alarm. Taking the earlier promotion into account, the fault acceptance personnel can judge that this fault prompt information was not caused by a fault and decline to accept it. When the system fault is judged to be real, the fault acceptance personnel accept it and send an acceptance instruction, and the fault work order of the corresponding dimension is triggered according to that instruction.
The general dimension usually covers fault types that appear for the first time and are not covered by the alarm platform. Such faults are uniformly classified as general faults and handled by the fault handling personnel of the general dimension; after acceptance, they are handled using a general fault handling model.
S2, generating parallel troubleshooting tasks in the corresponding dimension according to the fault work order, pushing the troubleshooting tasks to the corresponding fault handling personnel respectively, and locating the fault point according to the received troubleshooting results corresponding to the troubleshooting tasks. Specifically, step S2 includes the following sub-steps:
S21, matching the corresponding fault handling personnel in a preset system personnel table according to the fault information of the fault work order.
Before the fault management method is executed, a relationship list between fault information and fault handling personnel is constructed in advance for each dimension.
Taking the logistics management system as an example, the fault information includes the dimension to which the fault belongs, the service type identifier, the monitoring index, the trigger content, the fault code and similar information.
S22, matching the corresponding troubleshooting tasks in a preset troubleshooting model according to the fault information.
Specifically, the corresponding troubleshooting tasks are matched in the preset troubleshooting model according to the service type identifier in the fault information.
Similarly, before the fault management method is executed, a troubleshooting model that associates the fault information in each dimension with troubleshooting tasks is constructed in advance.
S23, establishing a troubleshooting task association between the troubleshooting tasks and the fault handling personnel according to the fault information.
S24, generating parallel troubleshooting tasks based on the troubleshooting task association and pushing them to the corresponding fault handling personnel.
Each dimension comprises a plurality of sub-dimensions representing different fault locations. After a fault work order belonging to a given dimension is triggered, all sub-dimensions of that dimension need to be checked in order to locate the fault point thoroughly and as quickly as possible. This embodiment checks all sub-dimensions in parallel: the troubleshooting tasks are pushed to the corresponding fault handling personnel simultaneously, and these personnel carry out troubleshooting and produce troubleshooting results within roughly the same time period.
Taking fault management of a logistics management system as an example, the troubleshooting dimension comprises seven sub-dimensions: storage, application, DBA, development, machine room, data center network and campus network. The troubleshooting mainly determines whether the running states of the application containers, middleware, databases and so on are normal.
After an STP-related fault work order is triggered, the relevant fault handling personnel and troubleshooting tasks are looked up in the pre-constructed system personnel table and troubleshooting model respectively, and the troubleshooting task association is established. Seven different troubleshooting tasks are then sent to the corresponding fault handling personnel, who are required to complete troubleshooting within a set time period and produce troubleshooting results.
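The parallel dispatch of sub-dimension tasks can be sketched with a thread pool. This is an illustrative sketch under stated assumptions, not the patented implementation: the handler names, the stand-in `run_check` (which simulates a person returning a result and marks DBA as abnormal purely for demonstration) and `dispatch_parallel` are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

SUB_DIMENSIONS = [
    "storage", "application", "DBA", "development",
    "machine room", "data center network", "campus network",
]

# Hypothetical system personnel table: sub-dimension -> handler on duty.
PERSONNEL = {name: f"handler-{i}" for i, name in enumerate(SUB_DIMENSIONS, start=1)}


def run_check(work_order_id: str, sub_dimension: str, handler: str) -> Dict[str, str]:
    """Stand-in for a handler completing one troubleshooting task."""
    # Demonstration only: pretend the DBA check finds a problem.
    status = "abnormal" if sub_dimension == "DBA" else "normal"
    return {
        "work_order": work_order_id,
        "sub_dimension": sub_dimension,
        "handler": handler,
        "status": status,
    }


def dispatch_parallel(work_order_id: str) -> List[Dict[str, str]]:
    """Push all sub-dimension tasks at once and collect every result."""
    with ThreadPoolExecutor(max_workers=len(SUB_DIMENSIONS)) as pool:
        futures = [
            pool.submit(run_check, work_order_id, sd, PERSONNEL[sd])
            for sd in SUB_DIMENSIONS
        ]
        # Results are gathered in submission order once each task finishes.
        return [future.result() for future in futures]
```

The point of the sketch is the contrast with serial checking: all seven tasks are submitted before any result is awaited, so the total time is bounded by the slowest check rather than the sum of all checks.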
As shown in FIG. 2, the troubleshooting task interface for the development sub-dimension contains an input item for each operating index related to "whether the system link monitoring alarm is normal", indicating whether that index is normal. After performing the check, the fault handling personnel select and enter a value for each item, thereby giving the troubleshooting result and completing the troubleshooting task.
Pushing the troubleshooting tasks to the corresponding fault handling personnel specifically comprises: simultaneously triggering a phone call notification, creating a fault management communication group, pushing the production change information for a preset time period, and sending an email notification, so as to push the troubleshooting tasks to the fault handling personnel.
This embodiment uses a one-touch push to send the troubleshooting tasks to all fault handling personnel through several channels at once, so as to improve timeliness. An exemplary notification path is as follows: based on the system ID corresponding to the monitoring alarm information, determine the list of fault handling personnel for that system ID and read their telephone numbers; call the Ali small number dialing function, passing in these values, to dial the fault handling personnel and broadcast information about the faulty system according to a broadcast template; call the enterprise WeChat group creation function, passing in these values, to create a fault handling communication group; send a fault notification email to the fault handling personnel on duty through the system mailbox; push all production change information (including versions and change logs) of the abnormal system from the past 24 hours to the enterprise WeChat communication group; push the abnormal business monitoring view to the enterprise WeChat communication group at regular intervals; and, based on the system ID corresponding to the monitoring alarm index, determine the list of fault handling personnel for that ID and push the corresponding troubleshooting tasks to them.
S25, receiving the troubleshooting result obtained by each fault handling person executing the corresponding troubleshooting task.
After receiving a troubleshooting task, each fault handling person inspects the operating data of the corresponding sub-dimension and gives a troubleshooting result. For example, after checking "whether system changes within 24 hours are normal", the person selects "normal" or "abnormal".
S26, screening out the abnormal troubleshooting results and analyzing them to obtain the fault point.
Specifically, the abnormal troubleshooting results are screened out and processed using a preset abnormality association relation or a pre-established abnormality analysis model, and the fault point is then located.
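One way to picture step S26 is a lookup keyed on the set of abnormal sub-dimensions. The association table below and the function `locate_fault_point` are hypothetical; the source does not describe the contents of the abnormality association relation or the analysis model.

```python
from typing import Dict, List, Optional

# Hypothetical abnormality association: set of abnormal sub-dimensions -> fault point.
ABNORMAL_ASSOCIATION = {
    frozenset({"DBA"}): "database instance",
    frozenset({"application", "DBA"}): "application-to-database connection pool",
    frozenset({"data center network"}): "data center switch",
}


def locate_fault_point(results: List[Dict[str, str]]) -> Optional[str]:
    """Screen out abnormal results and map them to a fault point, if known."""
    abnormal = frozenset(
        r["sub_dimension"] for r in results if r["status"] == "abnormal"
    )
    # Returns None when the combination of abnormal sub-dimensions is unknown.
    return ABNORMAL_ASSOCIATION.get(abnormal)
```

A `None` result corresponds to a combination that has no preset association, in which case further manual analysis would be needed.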
The system fault management method in the embodiment adopts a multi-dimensional and parallel fault troubleshooting mode, shortens the fault troubleshooting time, improves the fault troubleshooting accuracy and improves the fault management efficiency.
Troubleshooting usually takes some time, during which users cannot use the affected functions. As a preferred embodiment, after the fault work order is triggered, the method further comprises step Sa: pushing a substitute plan corresponding to the fault work order to visiting users. Step Sa specifically comprises the following sub-steps:
Sa1, searching a preset fault substitute plan relation for substitute plans matching the fault information of the fault work order;
Sa2, pushing substitute plan information to the user terminal when a user accesses a link related to the fault work order, wherein the substitute plan information includes substitute plan suggestions and alternative paths.
Before step Sa, the method further comprises constructing the fault substitute plan relation in advance. The construction method is as follows: review historical fault scenarios and the system functions affected by those faults, preset a substitute plan for each affected system function, and form an association between faults and substitute plans. When a user encounters the fault, the substitute plan is automatically sent to that user.
For example, the fault scenario is that the WeChat scan-to-order function is abnormal, and the service affected in this scenario is the bulk order placement service. For this scenario, the substitute plan found in the pre-constructed fault substitute plan relation is to place the order using a payment instrument, and once the substitute plan is confirmed to be available, a substitute plan suggestion is sent to the user.
When a fault has never occurred before and no corresponding substitute plan exists in the fault substitute plan relation, the fault and its substitute plan are added to the relation after a substitute plan has been received and executed effectively.
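The lookup and the later addition of a new entry can be sketched as below. The scenario keys, the plan strings and both function names are hypothetical illustrations; only the behavior (look up by fault, add a plan once it has been executed effectively) follows the text.

```python
from typing import Dict, List

# Hypothetical fault substitute plan relation: fault scenario -> substitute plans.
SUBSTITUTE_PLANS: Dict[str, List[str]] = {
    "wechat-scan-order-abnormal": ["Place the order using a payment instrument"],
}


def find_substitute_plans(scenario: str) -> List[str]:
    """Return the substitute plans for a fault scenario (empty if none is recorded)."""
    return list(SUBSTITUTE_PLANS.get(scenario, []))


def record_substitute_plan(scenario: str, plan: str, executed_ok: bool) -> None:
    """Add a plan to the relation only after it has been executed effectively."""
    if not executed_ok:
        return
    plans = SUBSTITUTE_PLANS.setdefault(scenario, [])
    if plan not in plans:
        plans.append(plan)
```

Recording only plans that worked keeps the relation from accumulating suggestions that would mislead users during a later fault.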
Therefore, after the fault work order is triggered, a preset substitute plan corresponding to the fault is pushed to visiting users. While waiting for the system to recover, users can still accomplish the system function by following the substitute plan, which effectively avoids the unnecessary pressure on services and traffic caused by service interruption and repeated user operations, and improves user experience.
S3, searching for recovery plans matching the fault point in a preset recovery plan matching relationship, sorting the recovery plans by priority, and pushing them to fault handling personnel. Specifically, step S3 includes the following sub-steps:
searching whether a recovery plan matched with the fault point exists in a preset recovery plan matching relation, if so, sorting the recovery plans according to priority and then pushing the recovery plans to fault processing personnel;
if not, calling the non-plan recovery operation as a recovery plan to be pushed to the fault handling personnel.
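The two branches of step S3 can be sketched as a single lookup with a fallback. The representation of the matching relationship as a dictionary of (priority, plan name) pairs, the convention that a lower number means a higher priority, and the fallback label are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical label standing in for the non-plan recovery operation.
NON_PLAN_RECOVERY = "non-plan recovery operation"


def recovery_plans_for(fault_point: str,
                       matching: Dict[str, List[Tuple[int, str]]]) -> List[str]:
    """Return the recovery plans for a fault point, highest priority first.

    Assumes a lower priority number means a higher priority.
    Falls back to the non-plan recovery operation when nothing matches.
    """
    candidates = matching.get(fault_point)
    if not candidates:
        return [NON_PLAN_RECOVERY]
    return [name for _, name in sorted(candidates, key=lambda c: c[0])]


# Example relationship with made-up fault points.
matching = {"order-db": [(2, "rollback version"), (1, "system restart")]}
known = recovery_plans_for("order-db", matching)
unknown = recovery_plans_for("cache-cluster", matching)
```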
Prior to step S3, the method further comprises constructing a recovery plan matching relationship in advance. The construction process is as follows: historical fault points and the recovery plans adopted for them are collected; data on the adopted recovery plans, such as use frequency and success rate, are computed by methods such as statistical analysis; and the existing recovery plans are sorted by priority to form an association between fault points and recovery plans.
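One way to build such a relationship from historical records is sketched below. The text names use frequency and success rate as inputs but does not fix a ranking rule, so ranking by success rate first and use count second is purely an assumption, as are the record layout and function name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_matching_relation(history: Iterable[Tuple[str, str, bool]]) -> Dict[str, List[str]]:
    """Build fault point -> recovery plans (highest priority first) from history.

    Each history record is (fault_point, recovery_plan, succeeded).
    Assumed ranking: higher success rate first, then higher use count.
    """
    # fault_point -> plan -> [uses, successes]
    stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for fault_point, plan, succeeded in history:
        entry = stats[fault_point][plan]
        entry[0] += 1
        entry[1] += int(succeeded)

    relation: Dict[str, List[str]] = {}
    for fault_point, plans in stats.items():
        ranked = sorted(
            plans.items(),
            key=lambda kv: (kv[1][1] / kv[1][0], kv[1][0]),
            reverse=True,
        )
        relation[fault_point] = [plan for plan, _ in ranked]
    return relation


# Made-up history: the rollback succeeded twice, the restart once out of two uses.
history = [
    ("gateway", "system restart", True),
    ("gateway", "system restart", False),
    ("gateway", "rollback version", True),
    ("gateway", "rollback version", True),
]
relation_from_history = build_matching_relation(history)
```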
S4, receiving and executing the recovery plan selected by the fault handling personnel to repair the system fault.
Illustratively, the recovery plan includes, but is not limited to, a system restart or a version rollback. Since a fault has usually been present for only a short time, the rollback version is usually a 24H rollback version.
After S4, the method further includes S5, verifying whether the system fault is repaired. S5 specifically includes the following sub-steps:
S51, pushing the execution result of the recovery plan to a fault handler;
S52, receiving a verification instruction entered by the fault handler after verifying whether the fault has been recovered, and continuing to execute the recovery plan of the next priority if the verification instruction indicates that the fault has not been recovered;
and ending the fault work order once the received verification instruction indicates that the fault has been recovered.
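The execute-and-verify loop of steps S4 and S5 can be sketched as follows. The three callbacks are hypothetical stand-ins for executing a plan, pushing its result to the fault handler, and receiving the handler's verification instruction. For brevity the sketch walks the plans in priority order and does not model the handler's initial selection in S4.

```python
from typing import Callable, Iterable, Optional


def repair_until_recovered(plans: Iterable[str],
                           execute: Callable[[str], str],
                           push_result: Callable[[str, str], None],
                           is_recovered: Callable[[str], bool]) -> Optional[str]:
    """Execute recovery plans in priority order until the handler confirms recovery.

    Returns the plan that resolved the fault, or None if every plan was tried
    without a "recovered" verification instruction.
    """
    for plan in plans:
        result = execute(plan)        # S4: execute the recovery plan
        push_result(plan, result)     # S51: push the execution result
        if is_recovered(plan):        # S52: verification instruction from the handler
            return plan               # the fault work order can be ended
    return None


# Example with stub callbacks: only the rollback is reported as successful.
pushed_results = []
resolved_by = repair_until_recovered(
    ["system restart", "rollback version"],
    execute=lambda plan: plan + " done",
    push_result=lambda plan, result: pushed_results.append(result),
    is_recovered=lambda plan: plan == "rollback version",
)
```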
It should be noted that this embodiment does not distinguish between the fault handlers involved in different steps; in practical applications, the fault handlers may differ from step to step, and this embodiment places no limitation on this.
Example two
In order to execute the system fault management method of the first embodiment, this embodiment provides a corresponding system fault management apparatus. As shown in FIG. 3, the apparatus at least includes:
the fault work order triggering module 1 is used for identifying system faults according to the received fault prompt information and triggering fault work orders with corresponding dimensions;
the fault point positioning module 2 is used for generating parallel fault troubleshooting tasks in corresponding dimensions according to the fault work order, respectively pushing the fault troubleshooting tasks to corresponding fault processing personnel, and positioning fault points according to received fault troubleshooting results corresponding to the fault troubleshooting tasks;
the emergency plan module 3 is used for searching a recovery plan matched with the fault point in a preset recovery plan matching relation, sequencing the recovery plans according to priority and then pushing the recovery plans to fault handling personnel;
and the fault repairing module 4 is used for receiving and executing the recovery plan selected by the fault processing personnel so as to repair the system fault.
Further, the fault work order triggering module 1 includes:
the first receiving unit is used for identifying system faults according to the received fault prompt information and triggering fault work orders with corresponding dimensions;
the pushing unit is used for pushing the fault prompt information to fault handling personnel with corresponding dimensionality;
and the fault work order triggering unit is used for triggering the fault work orders with corresponding dimensions according to the received acceptance instruction of the fault acceptance personnel.
Further, the fault prompt information at least comprises one of multidimensional monitoring alarm information or manual alarm information acquired by the alarm platform; when the fault prompt message is the manual alarm message, the fault work order triggering module 1 further includes:
the extraction unit is used for extracting fault type keywords in the manual alarm information;
and the judging unit is used for judging whether matched fault dimension exists in a preset fault classification table or not according to the fault type keyword.
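The extraction and judging units can be sketched as a keyword match against a fault classification table. The table layout, the substring-based matching, and the name of the fallback dimension are assumptions for illustration; the fallback itself follows the method, which pushes unmatched manual alarms to fault handling personnel of a universal dimension.

```python
from typing import Dict, Optional

# Hypothetical name for the universal dimension used when no keyword matches.
UNIVERSAL_DIMENSION = "universal"


def match_dimension(alarm_text: str, classification: Dict[str, str]) -> Optional[str]:
    """Return the fault dimension of the first classification keyword found in the text."""
    text = alarm_text.lower()
    for keyword, dimension in classification.items():
        if keyword.lower() in text:
            return dimension
    return None


def route_manual_alarm(alarm_text: str, classification: Dict[str, str]) -> str:
    """Pick the dimension whose handlers should receive the manual alarm."""
    return match_dimension(alarm_text, classification) or UNIVERSAL_DIMENSION


# Made-up classification table: keyword -> fault dimension.
classification = {"payment": "payment", "login": "account"}
matched = route_manual_alarm("Users report the Payment page timing out", classification)
unmatched = route_manual_alarm("The screen is blank", classification)
```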
The fault point locating module 2 includes:
the first matching unit is used for matching corresponding fault processing personnel in a preset system staff table according to the fault information of the fault work order;
the second matching unit is used for matching corresponding troubleshooting tasks in a preset troubleshooting model according to the failure information;
the association unit is used for establishing a troubleshooting task association relation between the troubleshooting task and the troubleshooting personnel according to the fault information;
the generating unit is used for generating parallel troubleshooting tasks based on the troubleshooting task incidence relation;
the first pushing unit is used for pushing the troubleshooting tasks to corresponding troubleshooting personnel;
the second receiving unit is used for receiving troubleshooting results obtained by each troubleshooting worker executing corresponding troubleshooting tasks;
and the first processing unit is used for screening the fault troubleshooting result with an abnormal result and analyzing the fault troubleshooting result to obtain a fault point.
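The units of the fault point locating module can be sketched as a fan-out of troubleshooting tasks followed by screening for abnormal results. In the described system the tasks are pushed to people who report back; here a hypothetical `run_task` callback stands in for that round trip, and the final analysis step is reduced to returning the abnormal results as candidate fault points.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TroubleshootingResult:
    task: str
    handler: str
    abnormal: bool
    detail: str


def locate_fault_points(assignments: Dict[str, str],
                        run_task: Callable[[str, str], TroubleshootingResult]
                        ) -> List[TroubleshootingResult]:
    """Run all troubleshooting tasks in parallel and keep the abnormal results.

    `assignments` maps each troubleshooting task to the handler it is pushed to.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(assignments))) as pool:
        futures = [pool.submit(run_task, task, handler)
                   for task, handler in assignments.items()]
        results = [future.result() for future in futures]
    # Screening step: only abnormal results are candidates for the fault point.
    return [result for result in results if result.abnormal]


def fake_run(task: str, handler: str) -> TroubleshootingResult:
    """Stub for a handler carrying out a task; only the database check is abnormal."""
    return TroubleshootingResult(task, handler, task == "check database", task + " finished")


# Made-up task association between tasks and handlers.
assignments = {
    "check network": "alice",
    "check database": "bob",
    "check application logs": "carol",
}
abnormal_results = locate_fault_points(assignments, fake_run)
```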
The emergency plan module 3 includes:
the first search unit is used for searching whether a recovery plan matching the fault point exists in a preset recovery plan matching relationship;
and the second pushing unit is used for pushing the recovery plans to fault processing personnel after sorting them by priority when recovery plans matching the fault point exist, and is also used for calling the non-plan recovery operation as the recovery plan to be pushed to the fault handling personnel when no recovery plan matching the fault point exists.
The device also includes a verification module 5 for verifying whether the system fault is repaired, which includes:
the third pushing unit is used for pushing the execution result of the recovery plan to the fault handling personnel;
and the second processing unit is used for receiving a verification instruction input by a fault handler after verifying whether the fault is recovered, continuing to execute the recovery plan of the next priority when the verification instruction indicates that the fault is not recovered, and ending the fault work order once the received verification instruction indicates that the fault is recovered.
The device also includes an alternative plan module 6 for pushing an alternative plan corresponding to the fault work order to an accessing user, which includes:
the second searching unit is used for searching a plurality of alternative plans matched with the fault information of the fault work order in a preset fault alternative plan relation;
and the fourth pushing unit is used for pushing alternative plan information to the user side when the user accesses the link related to the fault work order, wherein the alternative plan information comprises an alternative plan suggestion and an alternative path.
It should be noted that the system fault management device provided in the foregoing embodiment is illustrated only by the above division of functional modules when triggering the system fault management service of the first embodiment. In practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the fault management apparatus provided in the above embodiment and the fault management method provided in the first embodiment belong to the same concept, that is, the apparatus is based on the method; its specific implementation process is described in the method embodiment and is not repeated here.
Example three
Corresponding to the above method and apparatus, a third embodiment of the present application provides a computer system, including:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
identifying system faults and triggering fault work orders according to received fault prompt information, wherein the fault prompt information at least comprises one of multidimensional monitoring alarm information and manual alarm information;
generating parallel troubleshooting tasks according to the fault work order, respectively pushing the troubleshooting tasks to corresponding troubleshooting personnel, and positioning fault points according to received troubleshooting results corresponding to the troubleshooting tasks;
searching a recovery plan matched with the fault point in a preset recovery plan matching relation, and pushing the recovery plans to fault processing personnel after the recovery plans are sorted according to priorities;
and receiving and executing a recovery plan selected by the fault handling personnel to repair the system fault.
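The four operations listed above can be chained as in the following sketch. Every callable passed in is a hypothetical placeholder for a step that the embodiments describe elsewhere (fault identification, fault point location, plan selection by the handler, plan execution), and the `WorkOrder` fields are chosen only for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class WorkOrder:
    dimension: str
    fault_info: str
    fault_point: Optional[str] = None
    executed_plans: List[str] = field(default_factory=list)


def handle_fault(prompt: str,
                 identify: Callable[[str], str],
                 locate: Callable[[WorkOrder], str],
                 matching: Dict[str, List[str]],
                 select: Callable[[List[str]], Optional[str]],
                 execute: Callable[[str], None]) -> WorkOrder:
    """Chain the four operations for a single fault prompt."""
    # 1. Identify the fault and trigger a work order.
    order = WorkOrder(dimension=identify(prompt), fault_info=prompt)
    # 2. Locate the fault point (stands in for the parallel troubleshooting tasks).
    order.fault_point = locate(order)
    # 3. Look up recovery plans, assumed to be stored already sorted by priority.
    plans = matching.get(order.fault_point, [])
    # 4. Execute the plan selected by the fault handling personnel.
    chosen = select(plans)
    if chosen is not None:
        execute(chosen)
        order.executed_plans.append(chosen)
    return order


# Example with stub steps and made-up names.
order = handle_fault(
    "payment gateway timeout",
    identify=lambda prompt: "payment",
    locate=lambda work_order: "payment-gateway",
    matching={"payment-gateway": ["system restart", "rollback version"]},
    select=lambda plans: plans[0] if plans else None,
    execute=lambda plan: None,
)
```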
FIG. 4 illustrates an architecture of a computer system that may include, in particular, a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520. The processor 1510, video display adapter 1511, disk drive 1512, input/output interface 1513, network interface 1514, and memory 1520 may be communicatively coupled via a communication bus 1530.
The processor 1510 may be implemented by using a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute a relevant program to implement the technical solution provided by the present application.
The memory 1520 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500 and a Basic Input Output System (BIOS) for controlling low-level operations of the computer system 1500. In addition, a web browser 1523, a data storage management system 1524, an icon font processing system 1525, and the like may also be stored. The icon font processing system 1525 may be an application program that implements the operations of the foregoing steps in this embodiment of the application. In summary, when the technical solution provided by the present application is implemented by software or firmware, the relevant program code is stored in the memory 1520 and called for execution by the processor 1510.
The input/output interface 1513 is used for connecting an input/output module to realize information input and output. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The network interface 1514 is used to connect a communication module (not shown) to enable the device to communicatively interact with other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
The bus 1530 includes a path to transfer information between the various components of the device, such as the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520.
In addition, the computer system 1500 may also obtain information of specific extraction conditions from the virtual resource object extraction condition information database 1541 for performing condition judgment, and the like.
It should be noted that although the above devices only show the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, the bus 1530, etc., in a specific implementation, the devices may also include other components necessary for proper operation. Furthermore, it will be understood by those skilled in the art that the apparatus described above may also include only the components necessary to implement the solution of the present application, and not necessarily all of the components shown in the figures.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a cloud server, or a network device) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus or system embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for related points, reference may be made to the corresponding descriptions of the method embodiments. The above-described apparatus and system embodiments are only illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement them without inventive effort.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A method for system fault management, the method comprising the steps of:
identifying system faults according to the received fault prompt information and triggering fault work orders with corresponding dimensions;
generating parallel troubleshooting tasks in corresponding dimensions according to the fault work order, respectively pushing the troubleshooting tasks to corresponding troubleshooting personnel, and positioning fault points according to received troubleshooting results corresponding to the troubleshooting tasks;
searching a recovery plan matched with the fault point in a preset recovery plan matching relation, and pushing the recovery plans to fault processing personnel after the recovery plans are sorted according to priorities;
and receiving and executing a recovery plan selected by the fault handling personnel to repair the system fault.
2. The method according to claim 1, wherein the identifying system faults and triggering fault work orders of corresponding dimensions according to the received fault prompt information comprises the following sub-steps:
receiving at least one fault prompt message of multidimensional monitoring alarm information or manual alarm information;
pushing the fault prompt information to fault handling personnel with corresponding dimensionality;
and triggering the fault work order of the corresponding dimension according to the received acceptance instruction of the fault acceptance personnel.
3. The method of claim 2,
when the fault prompt information is manual alarm information, before the fault prompt information is pushed to fault handling personnel with corresponding dimensionality, the method further includes the following steps:
extracting fault type keywords in the manual alarm information, and judging whether matched fault dimension exists in a preset fault classification table or not according to the fault type keywords;
if yes, pushing the fault prompt information to fault handling personnel with corresponding dimensionality;
and if not, pushing the fault prompt information to fault handling personnel with universal dimensionality.
4. The method according to claim 1, wherein the step of generating parallel troubleshooting tasks in corresponding dimensions according to the troubleshooting work order and respectively pushing the troubleshooting tasks to corresponding troubleshooting personnel and the step of locating a fault point according to the received troubleshooting results corresponding to the troubleshooting tasks comprises the following substeps:
matching corresponding fault processing personnel in a preset system personnel table according to the fault information of the fault work order;
matching corresponding troubleshooting tasks in a preset troubleshooting model according to the failure information;
establishing a troubleshooting task incidence relation between a troubleshooting task and a troubleshooting worker according to the fault information;
generating parallel troubleshooting tasks based on the troubleshooting task incidence relation and pushing the parallel troubleshooting tasks to corresponding fault handling personnel;
receiving a troubleshooting result obtained by each troubleshooting worker executing a corresponding troubleshooting task;
and screening out the troubleshooting results whose result is abnormal and analyzing them to obtain a fault point.
5. The method according to claim 4, wherein the troubleshooting task is pushed to the corresponding troubleshooting personnel by triggering dialing notification, creating a fault management communication group, pushing production change information within a preset time period, and sending a mail notification.
6. The method according to claim 4, wherein the step of searching for the recovery plan matching the fault point in the preset recovery plan matching relationship and pushing the recovery plans to the fault handler after sorting according to priority comprises the following sub-steps:
searching whether a recovery plan matched with the fault point exists in a preset recovery plan matching relation, if so, sorting the recovery plans according to priority and then pushing the recovery plans to fault processing personnel;
if not, calling the non-plan recovery operation as a recovery plan to be pushed to the fault handling personnel.
7. The method according to any one of claims 1 to 6, wherein after receiving and executing the recovery plan selected by the fault handling personnel to repair the system fault, the method further comprises: verifying whether the system fault is repaired or not, specifically comprising the following substeps:
pushing the execution result of the recovery plan to a fault handler;
receiving a verification instruction input by a fault handler after verifying whether the fault is recovered, and continuing to execute a next priority recovery plan when the verification instruction is not recovered;
and ending the fault work order until the received verification instruction is recovered.
8. The method of claim 7, after identifying a system fault according to the received fault prompt information and triggering a fault work order, further comprising: pushing a substitute plan corresponding to the fault work order to an access user, specifically comprising the following substeps:
searching a plurality of alternative plans matched with the fault information of the fault work order in a preset fault alternative plan relation;
and pushing the substitute plan information to the user side when the user accesses the related link of the fault work order.
9. A system fault management apparatus, characterized in that the apparatus comprises at least:
the fault work order triggering module is used for identifying system faults according to the received fault prompt information and triggering fault work orders with corresponding dimensions;
the fault point positioning module is used for generating parallel fault troubleshooting tasks in corresponding dimensions according to the fault work order, respectively pushing the fault troubleshooting tasks to corresponding fault processing personnel, and positioning fault points according to received fault troubleshooting results corresponding to the fault troubleshooting tasks;
the emergency plan module is used for searching a recovery plan matched with the fault point in a preset recovery plan matching relation, sequencing the recovery plans according to priority and then pushing the recovery plans to fault handling personnel;
and the fault repairing module is used for receiving and executing the recovery plan selected by the fault processing personnel so as to repair the system fault.
10. A computer system, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
identifying system faults according to the received fault prompt information and triggering fault work orders with corresponding dimensions;
generating parallel troubleshooting tasks in corresponding dimensions according to the fault work order, respectively pushing the troubleshooting tasks to corresponding troubleshooting personnel, and positioning fault points according to received troubleshooting results corresponding to the troubleshooting tasks;
searching a recovery plan matched with the fault point in a preset recovery plan matching relation, and pushing the recovery plans to fault processing personnel after the recovery plans are sorted according to priorities;
and receiving and executing a recovery plan selected by the fault handling personnel to repair the system fault.
CN202010651666.9A 2020-07-08 2020-07-08 System fault management method, device and system Pending CN111835566A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010651666.9A CN111835566A (en) 2020-07-08 2020-07-08 System fault management method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010651666.9A CN111835566A (en) 2020-07-08 2020-07-08 System fault management method, device and system

Publications (1)

Publication Number Publication Date
CN111835566A true CN111835566A (en) 2020-10-27

Family

ID=72900579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010651666.9A Pending CN111835566A (en) 2020-07-08 2020-07-08 System fault management method, device and system

Country Status (1)

Country Link
CN (1) CN111835566A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010331A (en) * 2021-03-12 2021-06-22 腾讯科技(深圳)有限公司 Abnormal data processing method and device and computer readable storage medium
CN114254099A (en) * 2021-11-03 2022-03-29 北京思特奇信息技术股份有限公司 Automatic processing recommendation method and system for fault work order and electronic equipment
CN114912883A (en) * 2022-03-31 2022-08-16 深圳依时货拉拉科技有限公司 Plan management system and plan execution method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107196804A (en) * 2017-06-01 2017-09-22 国网山东省电力公司信息通信公司 Power system terminal communication access network Centralized Alarm Monitoring system and method
CN107589735A (en) * 2017-08-31 2018-01-16 远景能源(江苏)有限公司 Photovoltaic O&M robot system
CN108989132A (en) * 2018-08-24 2018-12-11 深圳前海微众银行股份有限公司 Fault warning processing method, system and computer readable storage medium
CN109783257A (en) * 2019-01-29 2019-05-21 清华大学 Selection method of replacing and system towards batch Web service Passive fault-tolerant control
CN109993550A (en) * 2019-04-17 2019-07-09 连云港杰瑞电子有限公司 After-sale service system and method based on wechat small routine and smart allocation algorithm
CN110727531A (en) * 2019-09-18 2020-01-24 上海麦克风文化传媒有限公司 Fault prediction and processing method and system for online system

Similar Documents

Publication Publication Date Title
CN111736875B (en) Version update monitoring method, device, equipment and computer storage medium
CN110704231A (en) Fault processing method and device
CN110362473B (en) Method and device for optimizing test environment, storage medium, and terminal
CN111865673A (en) Automatic fault management method, device and system
CN112540887A (en) Fault drilling method and device, electronic equipment and storage medium
CN111913824B (en) Method for determining data link fault cause and related equipment
CN111835566A (en) System fault management method, device and system
CN112152823A (en) Website operation error monitoring method and device and computer storage medium
CN112087320A (en) Abnormity positioning method and device, electronic equipment and readable storage medium
CN115495587A (en) A method and device for alarm analysis based on knowledge graph
CN112966056B (en) Information processing method, device, equipment, system and readable storage medium
CN109408361A (en) Monkey tests restored method, device, electronic equipment and computer readable storage medium
CN114546759A (en) Database access error monitoring and analyzing method and device and electronic equipment
CN111506455B (en) Method and device for checking service release result
CN114020432A (en) Task exception handling method and device and task exception handling system
CN113656252A (en) Fault locating method, device, electronic device and storage medium
CN113591477A (en) Fault positioning method, device and equipment based on associated data and storage medium
JP2017167578A (en) Incident management system
CN110348984B (en) Automatic credit card data input method and related equipment under different transaction channels
CN118152227A (en) Transaction link tracking method and device and computer equipment
CN113806196B (en) Root cause analysis method and system
CN117076323A (en) Troubleshooting methods, devices, electronic equipment and storage media
CN116062009A (en) Failure analysis method, device, electronic equipment and storage medium
CN117271183A (en) Method and device for acquiring database abnormal job scheduling retry strategy
CN116702008A (en) System risk management method, device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201027