WO2013076798A1

WO2013076798A1 - Failure generation prevention device, failure generation prevention method, failure generation prevention program and medium

Info

Publication number: WO2013076798A1
Application number: PCT/JP2011/076835
Authority: WO
Inventors: 伸子小高
Original assignee: 富士通株式会社
Priority date: 2011-11-21
Filing date: 2011-11-21
Publication date: 2013-05-30

Abstract

The invention prevents failure generation by determining the operation status of a system, then diagnosing the likelihood of failure generation and assessing whether a process is capable of functioning. When the start of a process is requested during the operation of the system to be applied, a prohibited process list is referenced (100). If the requested process is registered, the dynamically changing system operation status is determined (104), the likelihood of failure generation for the process is diagnosed (106), and if there is a high likelihood of failure generation for the process, the execution of the process is stopped (110). This procedure enables the prevention of failure generation.

Description

Failure occurrence prevention apparatus, failure occurrence prevention method, failure occurrence prevention program, and medium

The disclosed technology relates to a failure occurrence prevention device, a failure occurrence prevention method, a failure occurrence prevention program, and a medium.

Software installed in a system, such as an OS (Operating System) and application programs, is required not to cause failures when it is executed. To this end, it is desirable to acquire patches provided by the software vendor or the like and apply them to the system to prevent software failures.

Patches are commonly applied during regularly scheduled maintenance periods in which system operation is stopped. When the system contains software with a known, already reported failure, a patch that prevents the failure from occurring is applied.

Because the amount of software is enormous, the number of patches is also enormous. For this reason, it is common to select and apply only the patches that are applicable to one's own system. Techniques are known for extracting and applying only such applicable patches from a vast number of patches. In one example, unapplied patches are extracted from the known patches according to a user policy.

When patches applicable to one's own system are extracted, the extraction is based on system information such as the software version, the hardware type, and the IDs (patch identifiers) of applied patches. As a result, all patches matching the system information are extracted as applicable and applied to the system, which takes time. In addition, applying a patch increases the risk of introducing software regressions such as function degradation. It also increases the risk of incompatibility between software that has regressed as a result of a patch and software to which no patch has been applied. Therefore, a technique is also known in which the vendor collects and provides only patches that correct important and urgent failures, such as security failures, and only those patches are applied. In one example, the patch application time is determined according to the degree of urgency and the load on the target computer.
Japanese Patent No. 3200661; JP 2007-25820 A; JP 2005-38223 A; JP-A-10-63527; JP 2010-250749 A; International Publication No. 2007/105274; JP 2005-327275 A; International Publication No. 2008/126221

However, whether a failure will occur in a given system is unknown until just before it actually occurs, so the possibility of the failure cannot be determined in advance. In other words, whether software causes a failure depends on the operating state of the system at the time the software is executed. Because the occurrence of a failure depends on the operating state of the system, the possibility of a failure cannot be determined from static system information alone, such as the hardware and CPU type or the version of the software in use.

In one aspect, the objective is to suppress system failures.

The disclosed technology grasps the operating state of the application target system when a process is started, and refers to a list, classified by process, that contains the operating-state conditions under which a failure occurs and information on patches. From this list, the operating-state condition under which the failure occurs is identified, and the possibility of the failure is diagnosed from that condition and the grasped operating state of the application target system. If the diagnosed possibility that executing the process causes a failure is equal to or greater than a predetermined value, the process is stopped. In addition, an unapplied patch corresponding to the process whose execution has been stopped is acquired.
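The decision flow just described can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the names `decide_on_start`, `diagnose`, and `threshold`, and the representation of the possibility of failure as a plain number, are assumptions made for the example. Patch acquisition is left out.

```python
from typing import Callable, Set, Tuple

# (process name, startup option); a hypothetical representation of a request.
ProcessKey = Tuple[str, str]


def decide_on_start(
    request: ProcessKey,
    prohibited: Set[ProcessKey],
    diagnose: Callable[[ProcessKey], float],
    threshold: float,
) -> str:
    """Return "execute" or "stop" for a process start request."""
    # A process that is not registered in the list is started without diagnosis.
    if request not in prohibited:
        return "execute"
    # diagnose() stands in for grasping the operating state and evaluating the
    # failure occurrence conditions; its numeric scale is an assumption.
    possibility = diagnose(request)
    return "stop" if possibility >= threshold else "execute"
```

Passing the diagnosis in as a function keeps the sketch independent of how the operating state is actually collected.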

One embodiment has the effect of suppressing the occurrence of failures.

FIG. 1 is a block diagram showing a schematic configuration of a computer system according to the first embodiment. The subsequent drawings are, in order: an image diagram showing an example of the prohibited process list (DB1); an image diagram showing an example of the failure content list (DB2); an image diagram showing an example of condition list 1 (DB3); an image diagram showing an example of condition list 2 (DB4); an image diagram showing an example of condition list 3 (DB5); an image diagram showing an example of a shell program; an explanatory diagram of the flow of the failure occurrence prevention process according to the first embodiment; a flowchart showing the flow of the process performed in response to a process start request; a flowchart showing the flow of the system maintenance process; a flowchart showing the detailed flow centering on the diagnostic process according to the first embodiment; a flowchart showing the flow of the process centering on the life extension process according to the first embodiment; an explanatory diagram of the flow of the failure occurrence prevention process according to the second embodiment; a flowchart showing the flow of the process centering on the diagnostic process according to the second embodiment; and a flowchart showing the flow of the process centering on the recovery process according to the second embodiment.

Hereinafter, an example of an embodiment of the disclosed technology will be described in detail with reference to the drawings.

(First embodiment)
FIG. 1 shows a schematic configuration of a computer system 10 applicable as a system for suppressing the occurrence of a failure according to the present embodiment. In the computer system 10, a plurality of client terminals 14 and a failure management server 16 are connected to a network 12 such as a LAN. The network 12 can include a communication network such as the Internet.

The failure management server 16 manages the patches that are provided to the application target system to prevent the occurrence of failures, together with information about those patches. The failure management server 16 includes a CPU 16A and a memory 16B such as a RAM. It also includes a nonvolatile storage unit 16C such as an HDD (Hard Disk Drive) or a flash memory, and a network interface (I/F) unit 16D through which it is connected to the network 12. A display 20, as an example of an output device, and a keyboard 22 and a mouse 24, as input units, are connected to the failure management server 16.

In addition, in the storage unit 16C of the failure management server 16, an OS (Operating System) program and various application programs operating on the OS are installed in advance.

Also, a database storage area 17 is provided in the storage unit 16C of the failure management server 16. The storage unit 16C of the failure management server 16 is provided with a patch storage area 18 and a shell storage area 19.

The database storage area 17 is for storing patch information as a database. Five databases are stored in the database storage area 17. The first database is a database 1 (hereinafter referred to as “DB1”) that stores, as information, a list of prohibited processes that may cause a failure. The second database is a database 2 (hereinafter referred to as “DB2”) that stores a list of failure contents in each of the prohibited processes as information.

The third database to the fifth database are databases that respectively store a list of conditions when a failure occurs as information. The third database is a database 3 (hereinafter referred to as “DB3”) that stores, as information, a list of conditions for determining the occurrence of a failure by command output determined by the system. The fourth database is a database 4 (hereinafter referred to as “DB4”) that stores, as information, a list of conditions for determining the occurrence of a failure based on the contents of a file defined by the system. The fifth database is a database 5 (hereinafter referred to as “DB5”) that stores, as information, a list of conditions for determining the occurrence of a failure based on the execution result of a program in the system.

Note that registration and update of information stored in DB1 to DB5 is performed by periodic input by the user or automatic processing by the system.

In the present embodiment, the case where patch information is classified and stored in DB1 to DB5 is described, but the present invention is not limited to this. For example, all of the information may be stored in a single database without being classified into DB1 to DB5. Alternatively, some of the information stored in DB1 to DB5 may be combined and stored in four or fewer databases.

In the present embodiment, examples of the file include a program file, a library file, or a data file. The file is not limited to one program file. For example, when a commercially available product includes a plurality of program files, the file can include a group of a plurality of program files. An example is a group of program files, library files, or data files that constitute the OS.

Each client terminal 14 connected to the network 12 is an example of a system (application target system) to which a patch for preventing the occurrence of a failure is applied. The client terminal 14 includes a CPU 14A, a memory 14B including a RAM, a nonvolatile storage unit 14C such as an HDD, and a network I/F unit 14D. The client terminal 14 is connected to the network 12 via the network I/F unit 14D. In addition, a display 20, a keyboard 22, and a mouse 24 are connected to the client terminal 14.

Further, an OS program and various application programs operating on the OS are installed in advance in the storage unit 14C of the client terminal 14. A database storage area 15 is provided in the storage unit 14C of the client terminal 14. The storage unit 14C is also provided with a pool area 26 and a shell storage area 28.

(Failure management server)
Next, failure management performed by the failure management server 16 will be described. The failure management server 16 manages patches and information applied to any application target system. The management of the patch and its information in the failure management server 16 includes a process of acquiring the patch to be applied and the patch information, and a process of managing the acquired patch and the patch information as a database. Further, the management of patches and patch information in the failure management server 16 includes a process of providing patches to be applied and patch information to the application target system.

A software failure can occur when some operation is performed under various conditions. Examples of such operations include “execute a command”, “start a service”, and “use a library and system call”. The operations “execute a command” and “start a service” execute a file executable in the system to start and execute a process. Libraries and system calls are called from within a process, so the operation “use a library and system call” also corresponds to starting and executing a process. The failure management server 16 of the present embodiment manages a process that may cause a failure as a prohibited process, stores a list of prohibited processes as information, and manages it as DB1. If a failure occurs when a specific option is specified at process startup, the option corresponding to the process is also defined in the list of prohibited processes.

In the failure management server 16, processing is executed to keep the list of prohibited processes managed as DB1 up to date. This processing may be manual or automatic. An example of the manual processing is one in which the latest prohibited processes are input by the user and registered in DB1. An example of the automatic processing is one that periodically refers to the prohibited processes published by a vendor and registers the latest prohibited processes in DB1. Another example of the automatic processing is one that automatically receives the latest prohibited processes provided by the vendor and registers them in DB1 or updates DB1.

To enable the manual processing, the failure management server 16 is configured so that the prohibited processes in DB1 can be registered or updated by user input. To enable the automatic processing, the failure management server 16 is configured to refer to the prohibited processes published by the vendor and register them in DB1 or update DB1. It is also configured to automatically receive the latest prohibited processes provided by the vendor and register them in DB1 or update DB1.

The failure management server 16 can acquire the latest patch and patch information corresponding to the prohibited process, along with the prohibited process provided by the vendor.

As shown in FIG. 2, an example of DB1, the list of prohibited processes, is a list classified by process that includes the following items: list item number (No.), name of the prohibited process, option, and failure number for identifying the failure.

In this embodiment, a process that may cause a failure is managed as a prohibited process by registering its data in the list of prohibited processes (DB1) with a failure number assigned. When a failure may occur if a process is executed with a particular startup option, that startup option is also stored in the list of prohibited processes. For example, list item No. 1 in the list illustrated in FIG. 2 indicates that a failure corresponding to failure number 0001 may occur if the process named “fjpmgadd” is executed with the option “-a”.
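As a rough illustration, a lookup against such a list might look like the following. The record layout mirrors the items named for DB1, but the class and function names are hypothetical, and treating an empty option as "matches any invocation" is an assumption not stated in the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class ProhibitedProcess:
    """One row of the prohibited process list (DB1); field names are illustrative."""
    item_no: int
    name: str
    option: Optional[str]  # None: assumed to apply regardless of options
    failure_no: str


# The single row described in the text (list item No. 1).
DB1 = [ProhibitedProcess(1, "fjpmgadd", "-a", "0001")]


def lookup_failure_numbers(
    db1: Sequence[ProhibitedProcess], name: str, args: Sequence[str]
) -> List[str]:
    """Return the failure numbers registered for this process name and its options."""
    return [
        row.failure_no
        for row in db1
        if row.name == name and (row.option is None or row.option in args)
    ]
```

A start request for `fjpmgadd -a` would return the failure number `0001`, which is then the key into the failure content list.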

The failure management server 16 manages the content of the failure corresponding to each failure number in the list of prohibited processes using a database that stores a failure content list. In other words, by managing this list, the failure management server 16 manages the failure number of each process that causes a failure together with information such as the failure content. As shown in FIG. 3, an example of DB2, the failure content list, is a list that includes the following items for each failure number of DB1: failure number, failure content, workaround, recovery method, patch number, and failure occurrence conditions.

Among the items in the failure content list, the workaround is a measure for a process with a high possibility of causing a failure: it lowers the possibility of the failure so that the process can be executed temporarily without stopping the application target system. As an example, a shell program that can be executed without stopping and restarting the application target system can be employed as a workaround. In general, it is preferable to apply the latest patch for a process with a high possibility of causing a failure. However, there is also a demand to postpone, as long as possible, maintenance that stops the running system, and a workaround, which is a temporary measure applied to the system as it is, is effective in meeting this demand.

It should be noted that the high possibility of occurrence of a failure means that the possibility of occurrence of a failure accompanying the execution of a process is not less than a predetermined value. As a value indicating the possibility of occurrence of a failure, an expected value described later can be adopted. In addition, a value derived from a predetermined function or mathematical expression for determining the possibility of failure occurrence may be used as a value indicating the possibility of failure occurrence.

An example of a workaround that does not require restarting the application target system is rewriting a definition file; a shell program that performs such rewriting is one example. In the present embodiment, a measure that requires a system restart after it is applied is not regarded as a workaround. The failure management server 16 acquires workaround information, such as its specific content, by manual or automatic processing, and registers it in DB2 or updates DB2.

An example of the manual processing is one in which workaround information is input by the user and registered in DB2, or DB2 is updated. An example of the automatic processing is one that periodically refers to the prohibited processes published by a vendor and, when a workaround exists for the latest prohibited process, acquires the workaround information and registers it in DB2. Another example of the automatic processing is one that automatically receives the latest prohibited process provided by the vendor and, when a workaround exists for it, registers the workaround information in DB2 or updates DB2. It is preferable to obtain the workaround information together with the latest patch and its patch information. For example, when a software developer provides a patch for a process, it is preferable to also obtain the shell program provided as a workaround. That is, the failure management server 16 preferably acquires the workaround when it acquires the latest patch or patch information through user input or automatic processing by the system, and registers the acquired workaround in DB2 or updates DB2.

To enable the manual processing for registering or updating workarounds in DB2, the failure management server 16 is configured so that workaround information can be registered in DB2, or DB2 can be updated, by user input. To enable the automatic processing, the failure management server 16 is configured to acquire workaround information and register it in DB2 or update DB2. As an example of the automatic processing, the failure management server 16 refers to the prohibited processes published by the vendor, checks whether a workaround exists for each prohibited process, and acquires the workaround information when one exists. The acquired workaround can then be registered in DB2, or DB2 can be updated. As another example of the automatic processing, the failure management server 16 may automatically receive a prohibited process provided by a vendor and, when a workaround exists for the received prohibited process, register the workaround information in DB2 or update DB2.

Among the items in the failure content list, the recovery method is a measure that, when a process in which a failure has become apparent is stopped as described later, allows the process to be resumed (executed temporarily) without stopping the application target system. That is, when a failure caused by a process with a high possibility of causing a failure becomes apparent, the running process is forcibly stopped. When the process is stopped, it is preferable to lower the possibility of the failure and restart the process without stopping the application target system. As an example of the recovery method, a shell program that can be executed without restarting the application target system can be employed. As with the workaround, the recovery method is acquired together with the latest patch and its patch information. For example, a shell program serving as the recovery method is acquired when a software developer provides a patch for a process. That is, the failure management server 16 acquires the recovery method when acquiring the latest patch and its patch information, and registers it in DB2 or updates DB2.

Among the items in the failure content list, the patch number identifies the patch that corrects the failure. The patch number defined in the failure content list may be determined by the failure management server 16 or may be determined in advance.

The failure management server 16 manages the failure content of each process that causes a failure by storing the data corresponding to each failure number in the failure content list (DB2). For example, in the list shown in FIG. 3, the failure content of failure number “0001” is “panic”, and the failure is defined to occur when failure occurrence conditions 1 to 3 are met. FIG. 3 shows an example in which neither a workaround nor a recovery method exists for failure number “0001”. Details of failure occurrence conditions 1 to 3 are described later. In the case of DB1 in FIG. 2 and DB2 in FIG. 3, “J1_0001” for condition 1, “J2_0001” for condition 2, and “J3_0001” for condition 3 are the conditions under which the “panic” failure occurs. For failure number “0001”, if the “fjpmgadd” process is started with the “-a” option, there is a high possibility that a “panic” will occur in the system.
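One way to picture a DB2 entry and the rule that the failure occurs when conditions 1 to 3 are met is the sketch below. The class, field, and function names are hypothetical. The patch number for failure “0001” is not stated in the text, so it is left empty here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class FailureEntry:
    """One row of the failure content list (DB2); field names are illustrative."""
    failure_no: str
    content: str
    workaround: Optional[str]       # None: no workaround exists
    recovery_method: Optional[str]  # None: no recovery method exists
    patch_no: Optional[str]         # not given in the text for "0001"
    condition_ids: Tuple[str, ...]  # keys into DB3, DB4, and DB5


DB2 = {
    "0001": FailureEntry(
        "0001", "panic", None, None, None, ("J1_0001", "J2_0001", "J3_0001")
    )
}


def failure_predicted(entry: FailureEntry, is_met: Callable[[str], bool]) -> bool:
    """True when every failure occurrence condition listed for the entry is met."""
    return all(is_met(condition_id) for condition_id in entry.condition_ids)
```

The `is_met` callback is where the per-condition checks against DB3 to DB5 would be plugged in.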

Next, a failure occurrence condition among items in the failure content list will be described.
Of the items in the failure content list, the failure occurrence condition is information indicating a failure occurrence condition, and the conditions under which the software failure occurs are classified and managed by DB3 to DB5.

A software failure is considered to occur under some particular condition, that is, when the running application target system is in a specific state. At development time, the software developer assumes a certain operating state of the application target system at the time the process is executed. If the actual state deviates from the assumed operating state, the possibility of a failure is considered high. Therefore, the possibility of a failure can be determined by grasping the operating state of the running application target system. Examples of ways to grasp the operating state of the application target system include “understanding from the output result of a command determined by the system”, “understanding from the contents of a file defined by the system”, and “understanding from the execution result of a predetermined program in the system”. The failure management server 16 manages, for an arbitrary process, lists of conditions that identify a specific state among the operating states of the application target system. In the present embodiment, a condition based on “understanding from the output result of a command determined by the system” is referred to as condition 1, and its list is managed as DB3. A condition based on “understanding from the contents of a file defined by the system” is referred to as condition 2, and its list is managed as DB4. A condition based on “understanding from the execution result of a predetermined program in the system” is referred to as condition 3, and its list is managed as DB5.

The failure management server 16 is configured to register, in DB3 to DB5, information on conditions that is input by the user or collected by automatic processing by the system, and to update DB3 to DB5 based on that information.

DB3 stores a list of conditions for determining the possibility of failure occurrence based on command output determined by the system as information. The condition based on DB3 is a condition based on “understanding from the output result of a command determined by the system”, and is defined as condition 1 when a failure occurs. That is, the condition 1 is a condition for determining a specific state where the possibility of a failure occurrence is high among the operating states of the system from the output result of an arbitrary command in the system.

As shown in FIG. 4, an example of DB3, the list of condition 1 entries, is a list that includes the following items: the condition management number corresponding to the content of condition 1 stored in DB2, the information collected for the determination (command output), and the expected value of that information. In condition 1 of the present embodiment, the information collected to determine the possibility of a failure is a command defined by the system together with its startup options. The expected value is the output that the developer, at development time, assumed the command and its startup options would produce. Condition 1 is thus the combination of the acquired operating state of the system and the expected value assumed by the developer, and the possibility of a failure can be determined by comparing the command output with the expected character string.

For example, in the list shown in FIG. 4, the failure occurrence condition with the condition management number “J1_0001” is met when the OS version is “5.10”, the platform name is “sun4u”, and the patch “123456-01” has been applied. That is, a failure is predicted to occur when the execution result of the “uname -r” command is “5.10”, the execution result of the “uname -i” command is “sun4u”, and the result of “patchadd -p | grep 123456-01” is “Patch: 123456-01 Obsoletes: Requires: Incompatibles: Packages: SUNWfctl”.
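The J1_0001 check can be sketched as a table of commands and expected outputs. The commands and expected strings are taken from the example above; the function names, the exact string comparison after stripping surrounding whitespace, and the injectable `run` parameter are assumptions made so that the logic can be exercised without a Solaris host.

```python
import subprocess
from typing import Callable, Dict

# Condition J1_0001 as described in the text: command -> expected output.
J1_0001: Dict[str, str] = {
    "uname -r": "5.10",
    "uname -i": "sun4u",
    "patchadd -p | grep 123456-01": (
        "Patch: 123456-01 Obsoletes: Requires: Incompatibles: Packages: SUNWfctl"
    ),
}


def run_shell(command: str) -> str:
    """Run a command line through the shell and return its trimmed standard output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout.strip()


def condition1_met(
    expected: Dict[str, str], run: Callable[[str], str] = run_shell
) -> bool:
    """True when every command's output equals its expected character string."""
    return all(run(command) == want for command, want in expected.items())
```

On a real target system `condition1_met(J1_0001)` would invoke the commands themselves; in a test, `run` can be replaced by a function that returns canned output.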

DB4 stores, as information, a list of conditions for determining the possibility of a failure based on the contents of a file defined by the system; these are defined as condition 2 for failure occurrence. In other words, condition 2 is a condition for identifying, among the operating states of the system, a specific state in which a failure is likely to occur, based on the contents of a file that may be rewritten depending on the operating state of the system, such as a text file placed in the system.

As shown in FIG. 5, an example of DB4, the list of condition 2 entries, is a list that includes the following items: the condition management number, the information collected for the failure occurrence determination (file), and the expected value of that information. In condition 2 of the present embodiment, the information collected to determine the occurrence of a failure is a file (for example, a text file) placed in the system. The expected value is the file content assumed by the developer at development time. Condition 2 is thus the combination of the acquired operating state of the system and the expected value assumed by the developer, and the possibility of a failure can be determined by comparing the file contents with the expected character string.

An example of a text file placed in the system is a definition file defined by the system or by software. When a definition file is specified as the file placed in the system, its contents are compared with the expected character string to determine the possibility of a failure. For the comparison, a command that outputs data indicating a match or a mismatch, such as the diff command, can be employed. From the output of the diff command, it can be determined whether the expected character string in DB4 differs from the contents of the file acquired in the operating state of the system.
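The same comparison can be done in-process. The following sketch substitutes Python's standard `difflib` module for the diff command; that substitution and the helper names are assumptions for illustration. An empty diff means the file content matches the expected character string.

```python
import difflib
from pathlib import Path
from typing import Callable, List


def content_diff(actual: str, expected: str, label: str = "actual") -> List[str]:
    """Return unified-diff lines between expected and actual text (empty if equal).

    Both texts are compared line by line, so a trailing newline is ignored.
    """
    return list(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            "expected",
            label,
            lineterm="",
        )
    )


def read_text(path: str) -> str:
    """Read a file from the running system."""
    return Path(path).read_text()


def file_matches(
    path: str, expected: str, read: Callable[[str], str] = read_text
) -> bool:
    """True when the file content equals the expected character string."""
    return not content_diff(read(path), expected, path)
```

Keeping the diff lines available, rather than only a boolean, makes it possible to log what differed when a condition is not met.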

For example, in the list shown in FIG. 5, the failure occurrence condition with the condition management number “J2_0001” is met when the contents of the three files match the expected values corresponding to the character strings “File1”, “File2”, and “File3”, respectively. That is, the content of the file “/var/opt/FJSVpmgw/reg/.user” is compared with the character string “File1”, the content of the file “/var/opt/FJSVpmgw/etc/.config” with the character string “File2”, and the content of the file “/var/opt/FJSVpmgw/etc/.role” with the character string “File3”. When any of these three character strings matches the corresponding expected value, it is predicted that a failure will occur.

DB5 stores a list of conditions when determining the occurrence of a failure based on the execution result of a program in the system as information. This DB5 condition is defined as condition 3 when a failure occurs. That is, the condition 3 is a condition for determining a specific state where there is a high possibility of failure in the system operating state from information including information resulting from the system operating state.

In some cases, a failure does not occur while the system is operating stably, but the possibility of a failure increases while the system is operating under high load. In other words, the possibility of a failure may change depending on the operating state of the system at the time the process is executed. Such a state cannot be determined by comparing collected information with an expected value, as is done for “understanding from the output result of a command determined by the system” in condition 1 and “understanding from the contents of a file defined by the system” in condition 2. Specifically, unlike conditions 1 and 2, an expected value cannot simply be defined, so whether the condition is met cannot be decided from whether the collected information matches an expected value. When an expected value cannot be defined, a predetermined program can be used as the logic for determining the condition. A shell script created by a developer can be adopted as this predetermined program. For this reason, it is preferable to prepare the predetermined program in advance in order to grasp the changing operating state of the system.

As shown in FIG. 6, an example of DB5, the list of condition 3 entries, is a list that includes the following items: the condition management number corresponding to the content of condition 3 stored in DB2, and the determination method as the information collected for the determination. In condition 3 of this embodiment, the information collected for the determination is a shell script created by the developer. The possibility of a failure can therefore be determined from the execution result of the shell script stored as the determination method.

An example of determining the possibility of failure using a shell script will be described. Assume that the work area used by the activated process is not the default location but has been changed by user customization. Assume further that the data used as an input value by the activated process is variable and specified by the user. Assume in addition that, when the work area used by the activated process is smaller than a predetermined value based on the size of the data used as an input value, the process fails with an unexpected error, resulting in a failure. When the failure occurrence condition is complicated in this way and the occurrence of a failure cannot be determined using an expected value, it can be determined from the execution result of a shell script. For example, the occurrence of a failure can be determined by creating a shell script that follows a convention such as “exit code 1 if the condition is met, and exit code 0 if it is not met” and referring to that end code.
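
The convention described above can be sketched as follows. The sketch is written in Python rather than as a shell script, and the names `work_area_check` and `current_exit_code`, the parameter `required_ratio`, and the rule "the free work area must be at least `required_ratio` times the input size" are assumptions made for illustration only; the description states only that the work area must not fall below a predetermined value based on the input data size.

```python
import shutil


def work_area_check(free_bytes: int, input_bytes: int, required_ratio: float = 2.0) -> int:
    """Return 1 if the failure condition is met, 0 if it is not met.

    Hypothetical rule: the condition is met when the free work area is
    smaller than required_ratio times the size of the input data.
    """
    required = input_bytes * required_ratio
    return 1 if free_bytes < required else 0


def current_exit_code(work_dir: str, input_bytes: int) -> int:
    """Evaluate the condition against the work area as it is right now."""
    free = shutil.disk_usage(work_dir).free  # free bytes on the work area's volume
    return work_area_check(free, input_bytes)
```

A wrapper script could pass the value returned by `current_exit_code` to `sys.exit`, so that the caller only has to look at the end code, as in the convention above.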

For example, in the list shown in FIG. 6, when the shell program a (an example is shown as the code 36 in FIG. 7) is executed and the end code is the character string “1”, it can be determined that the condition management number “J3_0001” is applicable.
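
As a minimal sketch of this lookup, the following Python fragment models DB5 as a mapping from a condition management number to the command line of a determination program, runs the program, and compares its end code with the character string “1”. The dictionary layout and the function name `condition_applies` are assumptions, and the stand-in program simply exits with code 1; it is not the shell program a of FIG. 7.

```python
import subprocess
import sys

# Hypothetical stand-in for DB5: condition management number -> command line.
# The command here is a placeholder that always ends with code 1.
DB5 = {
    "J3_0001": [sys.executable, "-c", "import sys; sys.exit(1)"],
}


def condition_applies(condition_no: str, db5: dict) -> bool:
    """Run the determination program and report whether its end code is "1"."""
    result = subprocess.run(db5[condition_no], check=False)
    return str(result.returncode) == "1"
```

With this stand-in, `condition_applies("J3_0001", DB5)` evaluates to `True`, while an entry whose program ends with code 0 evaluates to `False`.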

As described above, the databases DB1 to DB5 are constructed in the failure management server 16. Information in these databases is updated when a new failure occurs and the conditions for its occurrence become clear. The databases are updated each time the developer makes an update, or by acquiring known information via the network.

(Failure prevention processing)
Next, the failure occurrence prevention process will be described focusing on the process executed by the client terminal 14.

FIG. 8 is an explanatory diagram showing the flow of the failure occurrence prevention process according to the present embodiment. The management server which is the failure management server 16 has a list 40 classified for each process including conditions of the operating state of the system in which the failure occurs and information on a patch for preventing the occurrence of the failure. The management server also has a patch 42.

In the application target system 14 that is the client terminal 14, when the process is started, the grasping unit 46 grasps the operating state of the application target system. Then, the reference unit 44 refers to the list classified for each process. The specifying unit 48 specifies the condition of an unapplied patch for the process activated from the referenced list. The diagnosis unit 50 diagnoses the possibility of failure of the activated process from the condition specified by the specifying unit 48 and the operating state of the application target system grasped by the grasping unit 46. As a result of the diagnosis by the diagnosis unit 50, when there is a high possibility of failure of the activated process, the management unit 52 stops the process. Further, the acquisition unit 54 acquires an unapplied patch corresponding to the process stopped by the management unit 52.

Accordingly, when the application target system starts a process for realizing a function of the software, the reference unit 44 refers to the list at that timing, and the grasping unit 46 grasps the operating state of the application target system. The diagnosis unit 50 diagnoses whether the grasped operating state of the application target system corresponds to the failure occurrence condition. As a result of the diagnosis, when there is a high possibility that the activated process will cause a failure, the management unit 52 stops the process, and the acquisition unit 54 acquires the patch. Acquired patches can be applied to the system at any time. In this way, it is possible to determine, based on the operating state of the application target system, whether or not the activated process is likely to cause a failure, and to prevent the occurrence of a failure according to that operating state. In addition, when the activated process is highly likely to cause a failure, a patch that eliminates the failure is acquired and can be applied to the system at an arbitrary timing. Because the patches that are candidates for application are acquired at the time the activated process is found likely to fail, there is no need to select patches separately.

Next, processing executed on the client terminal 14 will be described.
FIG. 9 is a flowchart showing processing executed by the client terminal 14.

In the client terminal 14 that is the application target system, when the process activation is requested during the operation of the system, the processing routine of FIG. 9 is executed, and the reference processing is executed in step 100. The reference process is a process of referring to the prohibited process list classified for each process exemplified in DB1. The process of step 100 includes a process of searching whether or not the process requested to be activated is registered (entry) in the prohibited process list. In the reference process, the latest patch information may be obtained from the management server that is the failure management server 16 and stored, and the prohibited process list included in the stored information may be referred to.

In the reference process, it is possible to specify a condition related to a patch that has not been applied to the process requested to be started, that is, a failure occurrence condition, from the referenced list.

When the process requested to be activated is registered (entry) in the prohibited process list, the result in step 102 is affirmative and the process proceeds to step 104. On the other hand, if the process requested to be activated is not registered in the prohibited process list, the result in step 102 is negative. If the determination in step 102 is negative, the process requested to be started has a low possibility of occurrence of failure, and there is no trouble in executing the process. Therefore, in step 124, the process is advanced to execute the process, and this routine is terminated.

In step 104, a grasping process, which is an information collecting process for a diagnostic process described later, is executed. The grasping process is a process for grasping the operating state of the application target system when the process activation is requested. In the grasping process, for example, a process of “obtaining from an output result of a command determined by the system” or “obtaining from a file content defined by the system” is performed. It is also possible to perform the process of “obtaining from the execution result of the predetermined program by the system”.

Next, in step 106, diagnostic processing is executed. The diagnosis process diagnoses the possibility of the failure of the process requested to be started from the failure occurrence condition and the operating state of the application target system grasped by the grasping process. As a result of the diagnosis, when there is a high possibility that a failure has occurred due to the process requested to be activated, the determination in step 108 is affirmative and the routine proceeds to step 110. On the other hand, if the possibility of failure is low and the result is NO in step 108, the process proceeds to step 124.

In step 110, since there is a high possibility of failure of the process requested to be started, the process is stopped. In the next step 112, notification / addition processing is executed. The notification / addition processing notifies the user that there is a patch for preventing the occurrence of the failure, together with its patch number (patch ID or the like). The notification to the user may be a display process on the display, or may be a process of notifying the user by the notification unit. This notification can be omitted. Also, in order to make it easier to apply patches during system maintenance described later, the patch may be acquired (downloaded) from the failure management server 16 and stored in the pool area 26, and the patch number may be added to the application candidate patch list.

In the next step 114, the life extension process is started. The process of step 114 is a process of referring to the DB 2 and confirming whether or not there is a workaround for a process with a high possibility of failure. If there is a workaround, it is affirmed at step 116 and the workaround is performed at step 118. The process of step 118 is a process of executing a workaround stored in the failure content list. That is, an allowance is provided to temporarily execute the process by reducing the possibility of failure without stopping the application target system. In the next step 120, the process proceeds to restart the process stopped in step 110, the process is executed, and this routine is terminated.

On the other hand, when there is no workaround in DB2 (No in step 116), a notification / addition process is executed in step 122. In this notification / addition processing, the user is notified that a process having a high possibility of failure has been stopped. This notification may be a display process on the display, or may be a process of notifying by a notification unit to the user. This notification can be omitted.

Also, it is preferable to provide the user with information indicating that a process having a high possibility of failure has been stopped. In this embodiment, so that this information can be provided to the user at the time of system maintenance described later or at an arbitrary timing chosen by the user, information on a process stopped as having a high possibility of failure can be added to the process interruption list. The process interruption list is a database that stores, as information, a list such as the names of processes that were stopped because they were highly likely to cause a failure. When the user is only to be notified of the stopped process, the process interruption list need only include the name of the interrupted process as an item. The number of items in the process interruption list can be increased in accordance with the information on the interrupted process that is to be provided to the user. For example, similarly to the prohibited process list shown in FIG. 1, a list can be adopted in which the items of list item number (No.), name of the suspended process, option, and failure number are classified for each suspended process. The database storage area 15 of the client terminal 14 that is the application target system can include a database that stores this process interruption list as information.
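
The flow of FIG. 9 described above (steps 100 to 124) can be summarized in a short sketch. The Python below is illustrative only: the function name `handle_start_request`, the dictionary keys `patch_id`, `failure_likely`, and `workaround`, and the returned status strings are assumptions, and the contents of DB1 and DB2 are merged into a single dictionary entry for brevity.

```python
from typing import Callable, Dict, List


def handle_start_request(
    name: str,
    prohibited: Dict[str, dict],
    candidate_patches: List[str],
    interrupted: List[str],
) -> str:
    """Decide what happens to a process whose start has been requested."""
    entry = prohibited.get(name)                  # steps 100-102: reference process
    if entry is None:
        return "executed"                         # step 124: not on the list
    if not entry["failure_likely"]():             # steps 104-108: grasp and diagnose
        return "executed"                         # step 124: failure unlikely
    # Step 110: the process is stopped (in this sketch, simply not launched).
    candidate_patches.append(entry["patch_id"])   # step 112: add to candidate list
    workaround = entry.get("workaround")          # steps 114-116: look for a workaround
    if workaround is not None:
        workaround()                              # step 118: perform the workaround
        return "executed after workaround"        # step 120: restart the process
    interrupted.append(name)                      # step 122: record the stopped process
    return "stopped"
```

Here `failure_likely` stands for the combined grasping and diagnostic processes, and `workaround` for the workaround registered in the failure content list; both are passed in as callables so that the control flow itself stays visible.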

Next, system maintenance executed by the client terminal 14 that is an application target system will be described.
FIG. 10 is a flowchart showing system maintenance processing executed by the client terminal 14. The system maintenance process is executed at a predetermined timing or at an arbitrary timing specified by the user, and is a process for stopping system operation and executing system maintenance. Here, a case where the system operation is stopped and the minimum necessary patches are applied based on the application candidate patch list and the process interruption list will be described.

When the predetermined timing or the timing designated by the user comes, system maintenance processing is executed, and patch application processing is executed in step 130. In the example of FIG. 10, patches registered in the application candidate patch list are collectively applied in the patch application process. As a result, the patch is applied to the process that has been subjected to the life extension process and the process that has been stopped without any workaround, and the system can be stabilized.

As another example of step 130, there is a process of applying a selected patch by presenting a process interruption list to the user, causing the user to select only a process for which a patch is desired to be applied.
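
Both variants of step 130 can be sketched in one function. In the Python below, the function name `run_maintenance`, the `apply_patch` callback, and the removal of applied patches from the candidate list are assumptions; in addition, the selective variant is simplified so that the user's selection is expressed directly as patch numbers rather than as processes chosen from the process interruption list.

```python
from typing import Callable, Iterable, List, Optional


def run_maintenance(
    candidate_patches: List[str],
    apply_patch: Callable[[str], None],
    selected: Optional[Iterable[str]] = None,
) -> List[str]:
    """Apply candidate patches and return the patch numbers that were applied.

    With selected=None every candidate is applied collectively; otherwise
    only candidates that also appear in selected are applied.
    """
    chosen = set(candidate_patches) if selected is None else set(selected)
    applied = []
    for patch_id in list(candidate_patches):  # iterate over a copy while removing
        if patch_id in chosen:
            apply_patch(patch_id)
            applied.append(patch_id)
            candidate_patches.remove(patch_id)
    return applied
```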

In this way, the system is diagnosed every time a process is started during system operation, and the minimum necessary patches that are candidates for application to the target system during system maintenance are extracted. In the diagnosis performed when a process is started, it is determined, based on the “prohibited process list”, “failure content list”, and “condition list” obtained from the server, whether the operating state of the target system corresponds to a condition under which continuing the started process would cause a failure. If the activated process has a high possibility of failure, the activated process is stopped and a workaround is taken to restart the process. If there is no workaround, the process is kept stopped. Then, the application candidate patches are collectively applied during system maintenance. Patches can also be selectively applied by the user. As a result, the occurrence of a failure can be prevented in advance.

(Diagnosis processing)
Next, the diagnostic process executed by the client terminal 14 will be described in detail. FIG. 11 is a flowchart showing a detailed flow centered on a diagnostic process executed by the client terminal 14. The processing routine of FIG. 11 shows details of steps 104 to 110 and step 124 in the processing flow shown in FIG.

In the client terminal 14 (application target system), when process activation is requested, reference processing is executed in step 200. If the process requested to be activated is not registered (entry) in the prohibited process list (No in Step 202), the process proceeds to Step 204 (similar to Step 124 in FIG. 9) to execute the process, and this routine ends. Note that step 200 in FIG. 11 differs from step 100 in FIG. 9 in that the failure occurrence condition for the process requested to be started is specified in later processing rather than in step 200. In other words, in step 200, only confirmation processing is performed, based on the result of searching whether the process requested to be activated is registered (entry) in the prohibited process list.

When the process requested to be activated is registered (entry) in the prohibited process list (Yes in Step 202), the process proceeds to Step 206. In step 206, the failure content list (DB2) is referenced to identify the corresponding failure content. At the same time, conditions corresponding to the corresponding failure contents are extracted from DB3, DB4, and DB5. In step 206, a condition relating to a patch that has not been applied to the process requested to be activated (ie, a failure occurrence condition) is specified from the referenced list.

In the next step 208, the first condition process is executed. The first condition process is a process for collecting (command execution) information defined in the condition management number (for example, J1_0001) in the condition list 1 (DB3) and comparing it with an expected value. First, as the grasping process that is information collection for diagnosis, the operating state of the application target system when the process activation is requested is grasped. Information on the condition management number is obtained by grasping the operating state (step 210). The first condition process is a process of “ascertaining from an output result of a command determined by the system”, and corresponds to the grasping process in step 104 of FIG. In step 210, the command is executed to obtain the result value. Next, the expected value of the condition management number is acquired (step 212). The possibility of failure occurrence for the first condition is diagnosed based on whether or not the information obtained in step 210 matches the expected value obtained in step 212.

In the next step 214, it is determined whether or not there is a high possibility of a failure by determining whether or not the result value of the command execution matches the expected value. If the result value from the command execution does not match the expected value and the possibility of failure is not high, the result in Step 214 is negative and the process proceeds to Step 204. On the other hand, if the result value from the command execution matches the expected value and the possibility of failure is high, the result is affirmative in step 214 and the process proceeds to step 216.

In step 216, the second condition process is executed. The second condition process is a process for collecting information defined in the condition management number (for example, J2_0001) in the condition list 2 (DB4) and comparing it with the expected value. First, as the grasping process, the operation state of the application target system when the process activation is requested is grasped. As a result, information defined in the condition management number is obtained (step 218). The second condition process is a process of “ascertaining from the contents of the file defined by the system”, and reads the contents of the file to obtain the value. Next, an expected value corresponding to the condition management number is acquired (step 220). The possibility of failure occurrence for the second condition is diagnosed based on whether or not the information obtained in step 218 matches the expected value obtained in step 220.

In the next step 222, it is determined whether or not the possibility of failure is high by determining whether or not the value of the file content corresponding to the condition management number matches the expected value. If the value according to the file contents does not match the expected value and the possibility of failure is not high, the result in Step 222 is negative and the process proceeds to Step 204. On the other hand, if the value according to the file content matches the expected value and there is a high possibility of failure occurrence, the result in step 222 is affirmative and the routine proceeds to step 224.

In step 224, the third condition process is executed. The third condition process is a process that collects information defined in the condition management number (for example, J3_0001) in the condition list 3 (DB5) (executes a process such as a shell) and compares it with an expected value. First, as the grasping process, the operation state of the application target system when the process activation is requested is grasped. As a result, condition management number information is obtained (step 226). The third condition process is a process of “ascertaining from the execution result of the predetermined program by the system”, and here, a result value (end code) obtained by executing the shell script is obtained. Next, the expected value of the condition management number is acquired (step 228). The possibility of failure occurrence for the third condition is diagnosed based on whether or not the information obtained in step 226 matches the expected value obtained in step 228.

In the next step 230, it is determined whether or not the possibility of failure is high by determining whether or not the result value obtained by executing the processing of the shell or the like matches the expected value. If the result value obtained by executing the processing of the shell or the like does not match the expected value and the possibility of failure is not high, the result is negative in step 230 and the process proceeds to step 204. On the other hand, when the result value obtained by executing the processing of the shell or the like matches the expected value and there is a high possibility of the occurrence of a failure, the result is affirmative in step 230 and the process proceeds to step 232. In step 232, the process requested to start is stopped. If all of steps 214, 222, and 230 are affirmed, that is, if it is determined for each of the first condition to the third condition that a failure is likely to occur, the process requested to start is stopped.

In the present embodiment, a case has been described in which it is diagnosed that the process requested to be started is highly likely to cause a failure when the values obtained in all of the first condition process to the third condition process match the expected values; however, the embodiment is not limited to this. For example, it may be diagnosed that the process requested to be started is highly likely to cause a failure when the value obtained in at least one of the first condition process to the third condition process matches the expected value.
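
The two ways of combining the condition processes can be contrasted in a small sketch. In the Python below, the function name `diagnose` and the flag `require_all` are assumptions; each element of `checks` is a callable that returns `True` when its collected value matches the expected value.

```python
from typing import Callable, Iterable


def diagnose(checks: Iterable[Callable[[], bool]], require_all: bool = True) -> bool:
    """Return True when the process should be stopped.

    An empty list of checks is not handled specially in this sketch.
    """
    if require_all:
        # Corresponds to steps 214, 222, and 230 evaluated in order; all()
        # stops at the first mismatch, mirroring the early exit to step 204.
        return all(check() for check in checks)
    # Variation: a single matching condition is enough to stop the process.
    return any(check() for check in checks)
```

With `require_all=True`, later conditions are not evaluated once an earlier one fails to match, so a command or shell script for a later condition is not run unnecessarily.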

(Life extension processing)
Next, processing started after the process is stopped will be described.
FIG. 12 is a flowchart showing a detailed flow with a focus on processing after life extension processing executed by the client terminal 14. The processing routine of FIG. 12 shows details of steps 114 to 122 in the flow of processing shown in FIG. The process illustrated in FIG. 12 is a process for continuing the system operation as much as possible by executing a workaround.

When the process is stopped at the client terminal 14 (application target system) because it is diagnosed that there is a high possibility of failure occurrence, the processing routine of FIG. 12 is executed. When the process is stopped, the client terminal 14 refers to the failure content list (DB2), acquires (downloads) a patch for resolving the failure of the stopped process from the failure management server 16, and stores it in the pool area 26. The patch number is notified to the user and added to the application candidate patch list (step 112 in FIG. 9).

First, in step 240, confirmation processing is executed. The confirmation processing is processing for confirming, with reference to the failure content list (DB2), whether or not a workaround (shell program) corresponding to the stopped process is registered in the failure content list. When a workaround for the stopped process is registered in the failure content list (Yes in Step 242), the workaround is executed in Step 244 (similar to Step 118 in FIG. 9) to prolong the life of the process. In the next step 246 (similar to step 120 in FIG. 9), the stopped process is restarted, and this routine is terminated.

When the workaround for the stopped process is not registered in the failure content list (No in step 242), the process proceeds to step 248 and the startup diagnosis process is executed. The startup diagnostic process is a process for executing the diagnostic process shown in FIG. There may be a possibility that the stopped process can be executed due to a change in the operating status (for example, CPU load status) of the application target system. Therefore, in order to find out whether the process can be restarted according to the operating state of the system, the startup diagnosis process is attempted in step 248. The startup diagnosis process can be retried up to a number of times predetermined by the user. The retry count can be defined as information in the operating environment of the client terminal 14.

If the diagnosis in step 248 results in the process being stopped (No in step 250), it is determined in step 254 whether or not retries exceeding the defined number of times have been executed. If the number of retries does not exceed the defined number (No at step 254), the number of retries is incremented and step 248 is executed again. If the diagnosis in step 248 results in the process being executed (Yes in step 250), then in step 252 the corresponding process is deleted from the process interruption list if it is registered there, and this routine ends.

Further, when retries exceeding the defined number of times have been attempted (Yes in Step 254), the retry is terminated, the corresponding process is registered in the process interruption list (Step 256), the user is notified (Step 258), and this routine is terminated.
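
The retry loop of steps 248 to 258 can be sketched as follows. In the Python below, the function name `retry_startup_diagnosis` and its parameters are assumptions, and the counting rule (one initial diagnosis followed by at most `max_retries` further attempts) is one possible reading of "exceeding the defined number of times".

```python
from typing import Callable, List


def retry_startup_diagnosis(
    name: str,
    failure_likely: Callable[[], bool],
    max_retries: int,
    interrupted: List[str],
    notify: Callable[[str], None],
) -> bool:
    """Return True if the process could be started, False if it stays stopped."""
    retries = 0
    while True:
        if not failure_likely():            # steps 248-250: startup diagnosis
            if name in interrupted:         # step 252: clear an earlier entry
                interrupted.remove(name)
            return True
        if retries >= max_retries:          # step 254: retry budget used up
            if name not in interrupted:
                interrupted.append(name)    # step 256: register the process
            notify(name)                    # step 258: inform the user
            return False
        retries += 1                        # increment and diagnose again
```

A practical implementation would presumably wait between attempts so that the operating state (for example, the CPU load) has a chance to change; the description does not specify an interval, so none is shown here.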

As described above, in this embodiment, when activation of a process for realizing a function of software is requested, the list is referred to at that timing, and it is diagnosed whether the requested process satisfies the failure occurrence condition. As a result of the diagnosis, when the activation request is for a process having a high possibility of occurrence of a failure, the execution of the corresponding process is stopped. This stops processes that are likely to cause known failures. As a result, the possibility of occurrence of a failure according to the operating state of the system can be diagnosed, and the occurrence of the failure can be prevented in advance according to the operating state of the system.

In addition, since the minimum necessary patches can be applied in this embodiment, it is possible to reduce the man-hours for system maintenance.

(Second Embodiment)
Next, a second embodiment will be described. Since the present embodiment has substantially the same configuration as the above-described embodiment, the same portions are denoted by the same reference numerals and detailed description thereof is omitted.

Software generally contains defects that are unknown at the time of development and that may cause failures. When an unknown defect or the like becomes apparent, it becomes a bug (an error or a defect included in the program) that hinders execution of the software. When a bug or the like becomes apparent and a known failure occurs, the vendor provides a patch, which is data for preventing the occurrence of the software failure. Therefore, a failure due to a bug that has not yet been revealed cannot be prevented even if patches are applied. In the present embodiment, when a defect or the like becomes known, if it is found that a process that is highly likely to cause a failure due to code including the bug is already being executed, the corresponding process can be stopped, or the corresponding process can be recovered as far as possible.

Note that the recovery method is stored as an item of the failure content list. The recovery method refers to an allowance for resuming a corresponding process without stopping the application target system when a process identified as having a high probability of causing the failure has been stopped. That is, when a running process is highly likely to cause a failure, or when a failure occurs, the running process is forcibly stopped. When the process is stopped, it is preferable to reduce the possibility of occurrence of a failure and restart the process without stopping the application target system. As an example, a shell program that can be executed without restarting the application target system can be employed. Note that, as with the workaround, the recovery method is acquired together with the latest patch and patch information. In other words, the shell program serving as an example of the recovery method is acquired when the software developer provides a patch for the process.

(Failure prevention processing)
Next, the failure occurrence prevention process in this embodiment will be described.

FIG. 13 is an explanatory diagram showing the flow of the failure occurrence prevention process according to the present embodiment. The management server, which is the failure management server 16, uses the update unit 56 to update the list 40, which is classified for each process and includes the conditions of the operating state of the system under which a failure occurs and information on the patch that resolves the failure. That is, when a new failure becomes apparent, the management server registers the patch 42 and patch information for preventing the occurrence of the failure. When the patch 42 and the patch information are registered, the management server updates the list 40 and notifies the client terminal 14. The list 40 includes DB1 to DB5.

In the client terminal 14, the reception unit 58 receives, by notification from the management server that is the failure management server 16, information indicating that the list 40 has been updated. When the notification from the management server is received, the grasping unit 46 grasps the operating state of the application target system. Grasping the operating state includes a process of extracting the processes in operation. Then, the reference unit 44 refers to the list. The specifying unit 48 determines whether or not an operating process is included in the list 40 referred to by the reference unit 44 and, when the operating process is registered in the list 40, specifies the condition for that process. The diagnosis unit 50 diagnoses the possibility of failure of the running process from the condition specified by the specifying unit 48 and the operating state of the application target system grasped by the grasping unit 46. As a result of the diagnosis by the diagnosis unit 50, when it is determined that the running process is highly likely to cause a failure, the management unit 52 stops the running process. Further, in the application target system, the acquisition unit 54 acquires an unapplied patch corresponding to the process stopped by the management unit 52.

Therefore, when the client terminal 14 receives from the failure management server 16 a notification concerning a process for which a new failure has become apparent, the client terminal 14 refers to the list 40 and diagnoses whether the operating state of the application target system corresponds to the failure occurrence condition. When a running process is diagnosed as having a high possibility of failure, the running process is stopped. When a running process is stopped, a patch to be applied to the application target system for the stopped process is acquired. Acquired patches can be applied to the system at any time. As described above, when a new failure becomes apparent, it is possible to determine, based on the operating state of the application target system, whether or not a running process is highly likely to cause a failure. When a process having a high possibility of failure occurrence is running, the running process is stopped. For this reason, even if the process is in operation, the occurrence of the failure can be prevented by stopping the process when the possibility of the failure is high.

(Diagnosis processing)
Next, the diagnostic process executed by the client terminal 14 of this embodiment will be described in detail.
FIG. 14 is a flowchart showing a detailed flow centered on a diagnostic process executed in the client terminal 14. In the present embodiment, processing executed when the client terminal 14 receives a notification from the failure management server 16 is shown. The processing routine of FIG. 14 is executed instead of the processing shown in FIG.

When the client terminal 14 (application target system) receives, from the failure management server 16, notification of information indicating that the list 40 has been updated, a reference process is executed in step 300. The list 40 includes DB1 to DB5. It is assumed that, when any one of DB1 to DB5 is updated, the failure management server 16 notifies the client terminal 14 of information indicating that the list 40 has been updated. In the present embodiment, the processing in step 300 is processing for confirming whether the prohibited process list has been updated. If the prohibited process list has not been updated as a result of the reference process (No at step 302), this routine ends. On the other hand, when the prohibited process list has been updated and a process has been newly registered or updated in it (Yes in Step 302), the process proceeds to Step 304, where the processes being executed (in operation) in the application target system are checked.

In the next step 306, one arbitrary process is designated from the running processes. In the next step 308, the operation diagnosis process is executed. The operation diagnosis process at step 308 is the same as the start-up diagnosis process described at step 248 in FIG. Specifically, the diagnosis process shown in FIG. 11 is advanced with the process designated in step 306 as the process requested to be activated. The difference between the process shown in FIG. 11 and the process in step 308 of FIG. 14 in this embodiment is the process in step 204 and step 232 of FIG. In FIG. 11, the process requested to be started is the processing target. In the present embodiment, since an already running process is a processing target, the processing corresponding to step 204 in FIG. 11 is skipped in step 308 in FIG. 14, and the processing proceeds without doing anything. Further, in the present embodiment, in the process corresponding to step 232 of FIG. 11 in step 308 of FIG. 14, a process that is already in operation is forcibly stopped.

When the operation diagnosis process is completed, the process proceeds to step 310. In step 310, it is determined whether any of the running processes has not yet undergone the operation diagnosis process. If such a process remains, the process returns to step 306, and the operation diagnosis process is performed on the remaining running processes.
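The loop over steps 300 to 310 can be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the implementation of the embodiment: the class `ClientTerminal`, its fields, the process names, and the boolean form of the failure occurrence condition are all hypothetical stand-ins for the list 40 and the application target system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ClientTerminal:
    # Hypothetical stand-ins for the prohibited process list and the system state.
    prohibited: Dict[str, Callable[[dict], bool]]  # process name -> failure occurrence condition
    running: List[str]                             # names of processes in operation
    state: dict                                    # current operating state
    stopped: List[str] = field(default_factory=list)

    def on_list_updated(self, updated_names: List[str]) -> None:
        # Steps 300/302: end the routine if the prohibited process list did not change.
        if not updated_names:
            return
        # Steps 304 to 310: diagnose every running process, one at a time.
        for name in list(self.running):
            self.diagnose_running(name)

    def diagnose_running(self, name: str) -> None:
        # Step 308: same diagnosis as at start-up, but the target is already running,
        # so a positive diagnosis leads to a forced stop.
        condition = self.prohibited.get(name)
        if condition is not None and condition(self.state):
            self.running.remove(name)
            self.stopped.append(name)


# Assumed example data: "backupd" fails when free memory is low.
terminal = ClientTerminal(
    prohibited={"backupd": lambda s: s.get("free_mem_mb", 0) < 512},
    running=["backupd", "httpd"],
    state={"free_mem_mb": 256},
)
terminal.on_list_updated(["backupd"])
```

In this sketch the diagnosis is reduced to a yes/no condition; a threshold on a computed likelihood, as in the claims, would replace the boolean test.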

(Recovery processing)
Next, a recovery process executed after stopping an active process in the present embodiment will be described.
FIG. 15 is a flowchart showing a detailed flow centered on the recovery process executed in the client terminal 14. In the present embodiment, the recovery process is performed after the process running on the client terminal 14 is stopped. That is, in step 232 of FIG. 11, the recovery process is performed after forcibly stopping the already running process. The recovery process is a process for continuing the system operation. The processing routine of FIG. 15 is executed instead of the processing shown in FIG.

When a process is stopped at the client terminal 14 (application target system) because it is diagnosed as having a high possibility of failure occurrence, the processing routine of FIG. 15 is executed. To facilitate patch application during system maintenance, when the process is stopped, patches may be acquired (downloaded) from the failure management server 16 and accumulated in the pool area 26, and their patch numbers may be added to the application candidate patch list. Then, at the time of system maintenance, applying the patches accumulated in the pool area 26 applies only the minimum necessary corrections, namely those for failures with a high probability of occurrence in the application target system.
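As a rough illustration of accumulating patches when a process is stopped, the sketch below pools each patch once and records its number in a candidate list. The function name, the mapping from process name to patch number, the `download` callable, and the patch number "P-001" are assumptions made for the example; the specification does not define these interfaces.

```python
from typing import Callable, Dict, List


def acquire_patch_on_stop(
    process: str,
    patch_for_process: Dict[str, str],   # hypothetical map: process name -> patch number
    download: Callable[[str], bytes],    # hypothetical fetch from the failure management server
    pool_area: Dict[str, bytes],         # stands in for the pool area 26
    candidate_patches: List[str],        # stands in for the application candidate patch list
) -> None:
    patch_no = patch_for_process.get(process)
    if patch_no is None or patch_no in pool_area:
        return  # no patch known for this process, or it is already pooled
    pool_area[patch_no] = download(patch_no)
    candidate_patches.append(patch_no)


pool: Dict[str, bytes] = {}
candidates: List[str] = []
fake_download = lambda n: b"patch " + n.encode()

# Stopping the same process twice pools its patch only once.
acquire_patch_on_stop("backupd", {"backupd": "P-001"}, fake_download, pool, candidates)
acquire_patch_on_stop("backupd", {"backupd": "P-001"}, fake_download, pool, candidates)
```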

First, in step 320, a confirmation process is executed. The confirmation process refers to the failure content list (DB2) and confirms whether a recovery method (shell program) corresponding to the stopped process is registered. When a recovery method for the stopped process is registered (Yes in step 322), the process proceeds to step 324, and recovery processing by the recovery method is executed in order to extend the life of the process until the next system maintenance. In the next step 326, the stopped process is resumed, and this routine ends.

If the recovery method for the stopped process is not registered in the failure content list (No in step 322), the process proceeds to step 328, and the process is registered in the process interruption list. In the next step 330, information indicating that the stopped process is registered in the process interruption list is notified to the user, and this routine is terminated.
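The branch in steps 320 to 330 can be sketched as follows. This is a simplified illustration: the function `recover_stopped_process`, the dictionary standing in for the failure content list (DB2), and the callables for resuming a process and notifying the user are hypothetical names introduced for the example.

```python
from typing import Callable, Dict, List


def recover_stopped_process(
    process: str,
    recovery_methods: Dict[str, Callable[[], None]],  # stands in for the failure content list (DB2)
    resume: Callable[[str], None],
    interruption_list: List[str],
    notify_user: Callable[[str], None],
) -> bool:
    """Return True if the process was resumed, False if it stays interrupted."""
    method = recovery_methods.get(process)   # steps 320/322: is a recovery method registered?
    if method is not None:
        method()                             # step 324: run the recovery method
        resume(process)                      # step 326: resume the stopped process
        return True
    interruption_list.append(process)        # step 328: register in the process interruption list
    notify_user(f"{process} registered in the process interruption list")  # step 330
    return False


log: List[str] = []
interrupted: List[str] = []
methods = {"backupd": lambda: log.append("recovery shell run")}
resume_fn = lambda p: log.append(f"resumed {p}")

resumed_a = recover_stopped_process("backupd", methods, resume_fn, interrupted, log.append)
resumed_b = recover_stopped_process("reportd", methods, resume_fn, interrupted, log.append)
```

Here "backupd" has a registered recovery method and is resumed, while "reportd" has none and is left in the interruption list with a notification.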

As described above, in this embodiment, when a failure is newly registered for a process that realizes a function of software, the list is referred to while the process is in operation to diagnose whether the failure is likely to occur. When a process that is highly likely to cause the newly registered failure is in operation, that process is stopped. As a result, when a previously unknown failure such as a defect becomes known, a process that is highly likely to cause it is stopped promptly, so that even a newly registered failure can be dealt with quickly and its occurrence can be prevented in advance.

In this embodiment, an example of the process for confirming the update of the prohibited process list has been described. However, the present invention is not limited to confirming the update of the prohibited process list. For example, because the list 40 includes DB1 to DB5, the embodiment can also be applied when a list other than the prohibited process list is updated. Information included in DB2 to DB5 may be updated to the latest information.

Incidentally, a user may not want to apply a patch. For example, application of a patch may be avoided or postponed because of a request to keep the current system running from the viewpoint of system operation, or a request to shorten the system stop time. One such view is that, even if a function in use has a known failure, the latest patch need not be applied because the current system is operating without trouble. In addition, if applying a patch causes incompatibility or degradation (level down), the restoration work requires many man-hours. For this reason, some users consider it undesirable to apply patches proactively merely to prevent known failures. On the other hand, there is also the view that it is too late to act once a failure has occurred without the patch applied, so a user may hold both of these contradictory views.

The above embodiment also works effectively for a user who holds the contradictory views “I do not want to apply patches proactively” and “it is too late once a failure has occurred without the patch applied”. For example, when the start of a process that realizes a function of software is requested, the list is referred to at the process start timing, and it is diagnosed whether the operating state of the application target system corresponds to the failure occurrence condition for the requested process. If the diagnosis shows that the requested process has a high possibility of causing a failure in the current operating state of the application target system, the execution of the requested process is stopped. Thus, the possibility of a failure can be diagnosed according to the operating state of the system, and the failure can be prevented in advance in the application target system that is operating. In addition, even for a process that is highly likely to cause a failure, when there is a workaround that reduces the possibility of the failure without stopping the application target system, the workaround is applied. As a result, the occurrence of a failure can be suppressed without stopping the operating system. Furthermore, since patches are acquired at the timing when a process is stopped, only the minimum necessary patches are applied, and the number of work steps for system maintenance can be reduced.
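The start-time decision described above (start normally, start after a workaround, or stop) can be sketched as follows. The names `Entry` and `decide_start`, the string results, and the example conditions are hypothetical; the sketch also reduces the diagnosis to a boolean test rather than a likelihood compared with a predetermined value.

```python
from typing import Callable, Dict, NamedTuple, Optional


class Entry(NamedTuple):
    condition: Callable[[dict], bool]              # failure occurrence condition
    workaround: Optional[Callable[[dict], None]]   # None when no workaround is registered


def decide_start(process: str, state: dict, prohibited: Dict[str, Entry]) -> str:
    entry = prohibited.get(process)
    if entry is None:
        return "start"                   # not registered in the prohibited process list
    if not entry.condition(state):
        return "start"                   # operating state does not match the condition
    if entry.workaround is not None:
        entry.workaround(state)          # reduce the risk without stopping the system
        return "start-after-workaround"
    return "stop"                        # high possibility of failure and no workaround


# Assumed example: "dbsync" fails when swap is disabled, and enabling swap is its workaround.
state = {"swap_enabled": False}
prohibited = {
    "dbsync": Entry(lambda s: not s["swap_enabled"], lambda s: s.update(swap_enabled=True)),
    "report": Entry(lambda s: True, None),
}

result_dbsync = decide_start("dbsync", state, prohibited)
result_report = decide_start("report", state, prohibited)
result_other = decide_start("ls", state, prohibited)
```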

In addition, when a process with a newly revealed failure is registered in the prohibited process list, it is diagnosed, for the processes that are running, whether the operating state of the application target system satisfies the failure occurrence condition of that process. If a process running on the application target system has a high possibility of failure, its execution is stopped. As a result, when a previously unknown failure such as a defect becomes a known failure, a process that is highly likely to cause it is stopped, so that the newly registered failure can be dealt with quickly and its occurrence can be prevented. Even when a running process is stopped, if there is a recovery process that reduces the possibility of the failure without stopping the system, the recovery process is applied. As a result, the occurrence of a failure can be suppressed without stopping the operating system.

Therefore, the user is no longer caught between the contradictory views “I do not want to apply patches proactively” and “it is too late once a failure has occurred without the patch applied”.

Also, when all patches corresponding to the system information are extracted and applied, the patches actually required in the application target system are not sufficiently selected from the system information. For example, although a serious problem such as a security problem or a panic does not necessarily occur in the user's own system, more patches than necessary are applied for the sake of safety. Note that software patches may provide not only failure corrections but also added functions. For this reason, when all applicable patches are extracted and applied, even patches the user does not want are applied.

Even in this case, the above embodiment functions effectively. In the above embodiment, the possibility of failure occurrence is diagnosed according to the operating state of the system at the time of starting the process or at the timing when the latest patch for the process is registered. When the diagnosis result indicates a high possibility that a failure will occur, execution of the process is stopped. Thus, the possibility of a failure can be diagnosed according to the operating state of the system, and the failure can be prevented in advance in the operating system. Therefore, patches that the user does not want need not be applied.

In the above description, the server system taking the fault management server as an example and the application target system taking the client terminal as an example have been described. However, the present invention is not limited to these configurations, and various improvements and modifications may be made without departing from the gist described above.

In the above description, the program is stored (installed) in advance in the storage unit of the client terminal. However, the processing program can also be provided in a form recorded on a recording medium such as a CD-ROM or DVD-ROM.

All documents, patent applications, and technical standards mentioned in this specification are incorporated herein by reference to the same extent as if each individual document, patent application, and technical standard were specifically and individually stated to be incorporated by reference.

10 Computer System
14 Client Terminal
14B Memory
14C Storage Unit
16 Fault Management Server

Claims

A failure occurrence prevention device comprising:
a grasping unit that, with a computer system as an application target system to which a patch for preventing the occurrence of a failure is applied, grasps an operating state of the application target system when a process is started in the application target system;
a reference unit that refers to a list classified for each process, the list including conditions of operating states in which a failure may occur in the application target system;
a specifying unit that specifies, from the list referred to by the reference unit, a condition of an operating state in which a failure may occur for the started process;
a diagnosis unit that diagnoses the possibility of occurrence of a failure accompanying execution of the started process, based on the condition of the operating state specified by the specifying unit and the operating state of the application target system grasped by the grasping unit; and
a process management unit that stops execution of the process when the diagnosis unit diagnoses that the possibility of occurrence of a failure in the application target system accompanying execution of the started process is greater than or equal to a predetermined value.
The failure occurrence prevention device according to claim 1, wherein
the list further includes a workaround that reduces the possibility of a failure accompanying execution of the corresponding process without stopping the operation of the application target system,
the specifying unit specifies, from the list, a workaround corresponding to the started process, and
when the diagnosis unit diagnoses that the possibility of occurrence of a failure accompanying execution of the started process is greater than or equal to the predetermined value, the process management unit executes the started process after applying the corresponding workaround to the application target system if a workaround corresponding to the process is in the list, and stops execution of the started process if no workaround corresponding to the process is in the list.
The failure occurrence prevention apparatus according to claim 1, wherein the operating condition registered in the list includes a process name and a start option of the process.
The operating condition registered in the list includes a command for outputting the operating status of the application target system and an expected value of the execution result of the command. The failure occurrence prevention device described.
The operation condition registered in the list includes a name of a data file used in a process and an expected value as the content of the data file. Failure prevention device.
The failure occurrence prevention device according to any one of claims 1 to 5, wherein the operating condition registered in the list includes a program that operates in the application target system and an expected value of an execution result of the program.
The failure occurrence prevention device according to any one of the above, further comprising a reception unit that receives, from the outside, information indicating that the list has been updated, wherein
the reference unit refers to the list when the information indicating the update of the list is received by the reception unit,
the grasping unit grasps the operating state of the application target system when the information indicating the update of the list is received by the reception unit,
the specifying unit specifies a process that is included in the updated list and is being executed in the application target system, and specifies, from the updated list, a condition of an operating state in which a failure may occur in the application target system as the specified process is executed,
the diagnosis unit diagnoses, for the process specified by the specifying unit, the possibility of occurrence of a failure based on the condition of the operating state specified by the specifying unit and the operating state of the application target system grasped by the grasping unit, and
the process management unit stops execution of the process when the diagnosis unit diagnoses that the possibility of occurrence of a failure accompanying execution of the specified process is greater than or equal to a predetermined value.
The failure occurrence prevention device according to any one of claims 1 to 7, wherein
the list further includes a recovery measure for recovering from a failure caused by execution of the process without stopping the operation of the application target system,
the specifying unit specifies, from the list, a recovery measure corresponding to a specific process whose execution has been stopped, and
the process management unit resumes execution of the specific process after applying the corresponding recovery measure when a recovery measure corresponding to the specific process specified by the specifying unit is in the list, and does not resume execution of the specific process when no recovery measure corresponding to the specific process is in the list.
The failure occurrence prevention apparatus according to any one of claims 1 to 8, further comprising an acquisition unit that acquires a patch corresponding to a process whose execution has been stopped by the process management unit.
The failure occurrence prevention apparatus according to claim 7 or 8, further comprising a cancellation process patch acquisition unit that acquires a patch corresponding to a process whose execution has been canceled by the process management unit.
A failure occurrence prevention method including:
a grasping step of, with a computer system as an application target system to which a patch for preventing the occurrence of a failure is applied, grasping an operating state of the application target system when a process is started in the application target system;
a reference step of referring to a list classified for each process, the list including conditions of operating states in which a failure may occur in the application target system;
a specifying step of specifying, from the list referred to in the reference step, a condition of an operating state in which a failure may occur for the started process;
a diagnosis step of diagnosing the possibility of occurrence of a failure accompanying execution of the started process, based on the condition of the operating state specified in the specifying step and the operating state of the application target system grasped in the grasping step; and
a process management step of stopping execution of the process when it is diagnosed in the diagnosis step that the possibility of occurrence of a failure in the application target system accompanying execution of the started process is greater than or equal to a predetermined value.
The failure occurrence prevention method according to claim 11, wherein
the list further includes a workaround that reduces the possibility of a failure accompanying execution of the corresponding process without stopping the operation of the application target system,
the specifying step specifies, from the list, a workaround corresponding to the started process, and
when it is diagnosed in the diagnosis step that the possibility of occurrence of a failure accompanying execution of the started process is greater than or equal to the predetermined value, the process management step executes the started process after applying the corresponding workaround to the application target system if a workaround corresponding to the process is in the list, and stops execution of the started process if no workaround corresponding to the process is in the list.
The failure occurrence prevention method according to claim 11 or 12, wherein the operating condition registered in the list includes a process name and an activation option of the process.
The operating condition registered in the list includes a command for outputting the operating status of the application target system and an expected value of an execution result of the command. The failure prevention method described.
The operation condition registered in the list includes a name of a data file used in a process and an expected value as the content of the data file. How to prevent failure.
The failure occurrence prevention method according to any one of claims 11 to 15, wherein the operating condition registered in the list includes a program that operates in the application target system and an expected value of an execution result of the program.
The failure occurrence prevention method according to any one of the above, further including a reception step of receiving, from the outside, information indicating that the list has been updated, wherein
the reference step refers to the list when the information indicating the update of the list is received in the reception step,
the grasping step grasps the operating state of the application target system when the information indicating the update of the list is received in the reception step,
the specifying step specifies a process that is included in the updated list and is being executed in the application target system, and specifies, from the updated list, a condition of an operating state in which a failure may occur in the application target system as the specified process is executed,
the diagnosis step diagnoses, for the process specified in the specifying step, the possibility of occurrence of a failure based on the condition of the operating state specified in the specifying step and the operating state of the application target system grasped in the grasping step, and
the process management step stops execution of the process when it is diagnosed in the diagnosis step that the possibility of occurrence of a failure accompanying execution of the specified process is greater than or equal to a predetermined value.
The failure occurrence prevention method according to any one of claims 11 to 17, wherein
the list further includes a recovery measure for recovering from a failure caused by execution of the process without stopping the operation of the application target system,
the specifying step specifies, from the list, a recovery measure corresponding to a specific process whose execution has been stopped, and
the process management step resumes execution of the specific process after applying the corresponding recovery measure when a recovery measure corresponding to the specific process specified in the specifying step is in the list, and does not resume execution of the specific process when no recovery measure corresponding to the specific process is in the list.
The failure occurrence prevention method according to any one of claims 11 to 18, further comprising an acquisition step of acquiring a patch corresponding to a process whose execution has been stopped by the process management step.
The failure occurrence prevention method according to claim 17 or 18, further comprising a stop process patch acquisition step of acquiring a patch corresponding to a process whose execution has been stopped by the process management step.
A failure occurrence prevention program for causing a computer to execute processing including:
a grasping step of, with a computer system as an application target system to which a patch for preventing the occurrence of a failure is applied, grasping an operating state of the application target system when a process is started in the application target system;
a reference step of referring to a list classified for each process, the list including conditions of operating states in which a failure may occur in the application target system;
a specifying step of specifying, from the list referred to in the reference step, a condition of an operating state in which a failure may occur for the started process;
a diagnosis step of diagnosing the possibility of occurrence of a failure accompanying execution of the started process, based on the condition of the operating state specified in the specifying step and the operating state of the application target system grasped in the grasping step; and
a process management step of stopping execution of the process when it is diagnosed in the diagnosis step that the possibility of occurrence of a failure in the application target system accompanying execution of the started process is greater than or equal to a predetermined value.
21. A failure occurrence prevention program for causing a computer to execute processing according to the failure occurrence prevention method according to any one of claims 10 to 20.
A computer-readable recording medium on which is recorded a failure occurrence prevention program for causing a computer to execute processing including:
a grasping step of, with a computer system as an application target system to which a patch for preventing the occurrence of a failure is applied, grasping an operating state of the application target system when a process is started in the application target system;
a reference step of referring to a list classified for each process, the list including conditions of operating states in which a failure may occur in the application target system;
a specifying step of specifying, from the list referred to in the reference step, a condition of an operating state in which a failure may occur for the started process;
a diagnosis step of diagnosing the possibility of occurrence of a failure accompanying execution of the started process, based on the condition of the operating state specified in the specifying step and the operating state of the application target system grasped in the grasping step; and
a process management step of stopping execution of the process when it is diagnosed in the diagnosis step that the possibility of occurrence of a failure in the application target system accompanying execution of the started process is greater than or equal to a predetermined value.
21. A computer-readable recording medium having recorded thereon a failure occurrence prevention program for causing a computer to execute processing according to the failure occurrence prevention method according to any one of claims 11 to 20.