WO2013076798A1 - Failure generation prevention device, failure generation prevention method, failure generation prevention program and medium - Google Patents
- Publication number
- WO2013076798A1 (PCT/JP2011/076835)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- failure
- list
- target system
- execution
- application target
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/008—Reliability or availability analysis
Definitions
- The disclosed technology relates to a failure occurrence prevention device, a failure occurrence prevention method, a failure occurrence prevention program, and a medium.
- Software installed in a system, such as an OS (Operating System) and application programs, must not itself cause failures when it is executed.
- When extracting the patches that can be applied to a given system, the patches are selected based on system information such as the software version, the hardware type, and the IDs (patch identifiers) of the patches already applied. As a result, every patch that matches the system information is extracted as applicable and applied to the system, and applying all of them takes time. Applying a patch also increases the risk of degrading the software, for example through functional regressions, and the risk of incompatibilities between software changed by the patch and software that has not been patched. For this reason, a technique is also known in which only the patches that a vendor collects and provides to correct important and urgent failures, such as security failures, are applied.
- In another known technique, the time at which a patch is applied is determined according to its degree of urgency and the load on the computer to which it is applied.
- The objective of the disclosed technology is to suppress system failures.
- When a process is started, the disclosed technology grasps the operating state of the application target system and refers to a list, classified by process, that contains the operating-state conditions under which a failure occurs and information on the relevant patches. From this list, the operating-state conditions under which the failure occurs are identified, and the possibility of a failure is diagnosed from those conditions and the grasped operating state of the application target system. If the diagnosis finds that the possibility of a failure caused by executing the process is at or above a predetermined value, the process is stopped, and an unapplied patch corresponding to the stopped process is acquired.
- FIG. 1 is a block diagram showing a schematic configuration of a computer system according to a first embodiment. FIG. 2 is a conceptual diagram showing an example of a prohibited process list.
- FIG. 1 shows a schematic configuration of a computer system 10 applicable as a system for suppressing the occurrence of a failure according to the present embodiment.
- In the computer system 10, a plurality of client terminals 14 and a failure management server 16 are connected to a network 12 such as a LAN.
- The network 12 can include a communication network such as the Internet.
- The failure management server 16 manages the patches provided to the application target system to prevent failures, together with information about those patches.
- The failure management server 16 includes a CPU 16A and a memory 16B such as a RAM.
- The failure management server 16 also includes a nonvolatile storage unit 16C such as an HDD (Hard Disk Drive) or a flash memory.
- The failure management server 16 includes a network interface (I/F) unit 16D and is connected to the network 12 via the network I/F unit 16D.
- A display 20, as an example of an output device, and a keyboard 22 and a mouse 24, as input units, are connected to the failure management server 16.
- An OS (Operating System) and various application programs running on the OS are installed on the failure management server 16 in advance.
- A database storage area 17, a patch storage area 18, and a shell storage area 19 are provided in the storage unit 16C of the failure management server 16.
- The database storage area 17 stores patch information as databases.
- Five databases are stored in the database storage area 17.
- The first database, database 1 (hereinafter “DB1”), stores a list of prohibited processes, that is, processes that may cause a failure.
- The second database, database 2 (hereinafter “DB2”), stores a list of the failure contents for each prohibited process.
- The third to fifth databases each store a list of the conditions under which a failure occurs.
- The third database, database 3 (hereinafter “DB3”), stores a list of conditions for determining the occurrence of a failure from the output of commands defined by the system.
- The fourth database, database 4 (hereinafter “DB4”), stores a list of conditions for determining the occurrence of a failure from the contents of files defined by the system.
- The fifth database, database 5 (hereinafter “DB5”), stores a list of conditions for determining the occurrence of a failure from the execution results of programs in the system.
- The information stored in DB1 to DB5 is registered and updated by periodic user input or by automatic processing by the system.
- Alternatively, the patch information need not be divided into DB1 to DB5 and may be stored in a single database. Some of the information in DB1 to DB5 may also be combined and stored in four or fewer databases.
- Examples of the file include a program file, a library file, and a data file.
- The file is not limited to a single program file; it can be a group of files, for example the program files, library files, or data files that constitute the OS.
- Each client terminal 14 connected to the network 12 is an example of a system (application target system) to which a patch for preventing the occurrence of a failure is applied.
- The client terminal 14 includes a CPU 14A, a memory 14B such as a RAM, a nonvolatile storage unit 14C such as an HDD, and a network I/F unit 14D.
- The client terminal 14 is connected to the network 12 via the network I/F unit 14D.
- A display 20, a keyboard 22, and a mouse 24 are connected to the client terminal 14.
- An OS program and various application programs running on the OS are installed in the storage unit 14C of the client terminal 14 in advance.
- A database storage area 15, a pool area 26, and a shell storage area 28 are provided in the storage unit 14C of the client terminal 14.
- The failure management server 16 manages the patches to be applied to each application target system, together with their information.
- This management includes acquiring the patches to be applied and their patch information, managing the acquired patches and patch information as databases, and providing the patches and patch information to the application target system.
- The failure management server 16 of the present embodiment manages processes that may cause a failure as prohibited processes, and stores and manages a list of them as DB1. If a failure occurs only when a specific option is specified at process startup, that option is also defined for the process in the prohibited process list.
- A process that keeps the prohibited process list in DB1 up to date is executed, either manually or automatically.
- An example of the manual process is one in which the user inputs the latest prohibited processes and registers them in DB1.
- An example of the automatic process is periodically referring to the prohibited processes published by a vendor and registering the latest ones in DB1.
- Another example of the automatic process is automatically receiving the latest prohibited processes provided by the vendor and either registering them in DB1 or updating DB1 with them.
- Accordingly, the failure management server 16 is configured to register or update prohibited processes in DB1 in three ways: from user input (manual processing), by referring to the prohibited processes published by the vendor (automatic processing), and by automatically receiving the latest prohibited processes provided by the vendor (automatic processing).
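The automatic registration and update of DB1 can be pictured as merging a vendor-published list into the local list. The sketch below is a hypothetical illustration and not the patent's implementation. The source does not say how entries are keyed or how the vendor list is fetched, so the sketch makes two assumptions: the failure number serves as the unique key, and the list has already been downloaded. The names `ProhibitedEntry` and `sync_db1` are invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class ProhibitedEntry:
    name: str               # prohibited process name
    option: Optional[str]   # startup option that triggers the failure, if any
    failure_no: str         # failure number (ASSUMED here to be the unique key)


def sync_db1(db1: Dict[str, ProhibitedEntry],
             published: Iterable[ProhibitedEntry]) -> Tuple[int, int]:
    """Merge a vendor-published list into DB1 in place; return (added, updated)."""
    added = updated = 0
    for entry in published:
        current = db1.get(entry.failure_no)
        if current is None:
            db1[entry.failure_no] = entry   # register a new prohibited process
            added += 1
        elif current != entry:
            db1[entry.failure_no] = entry   # replace an entry the vendor changed
            updated += 1
    return added, updated
```

Because the merge only adds or replaces entries, running it again on an unchanged vendor list reports `(0, 0)`. That makes the operation safe to repeat on a periodic schedule.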
- Along with the prohibited processes provided by the vendor, the failure management server 16 can acquire the latest patches and patch information corresponding to those processes.
- As shown in FIG. 2, DB1 (the list of prohibited processes) can be a list, classified by process, with the items list item number (No.), prohibited process name, option, and a failure number that identifies the failure.
- Data on a process that may cause a failure is managed by registering the process as a prohibited process in the prohibited process list (DB1) with a failure number assigned.
- Startup options and similar data are also stored in the prohibited process list. For example, according to list item No. 1 in FIG. 2, executing the process named “fjpmgadd” with the option “-a” may cause the failure with failure number 0001.
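A lookup in the prohibited process list can be sketched as follows. Only the single row from FIG. 2 comes from the text. The class and function names are hypothetical. One rule is an assumption: an entry with `option=None` matches any invocation of the process, while an entry with an option matches only when that option appears among the startup arguments.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class ProhibitedEntry:
    """One row of the prohibited process list (DB1)."""
    no: int                 # list item number (No.)
    name: str               # name of the prohibited process
    option: Optional[str]   # triggering option; None ASSUMED to mean "any invocation"
    failure_no: str         # key into the failure content list (DB2)


# The row described in the text (FIG. 2, No. 1).
DB1 = [ProhibitedEntry(1, "fjpmgadd", "-a", "0001")]


def find_failures(db1: Sequence[ProhibitedEntry], name: str,
                  args: Sequence[str]) -> List[str]:
    """Return the failure numbers whose entry matches the process name and arguments."""
    return [
        e.failure_no
        for e in db1
        if e.name == name and (e.option is None or e.option in args)
    ]
```

For example, `find_failures(DB1, "fjpmgadd", ["-a"])` returns `["0001"]`, while starting “fjpmgadd” without “-a” returns an empty list.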
- The failure management server 16 manages the content of the failure that corresponds to each failure number in the prohibited process list, using a database that stores a failure content list. By managing this list, the failure management server 16 keeps the failure numbers of the processes that cause failures together with information such as the failure contents.
- As shown in FIG. 3, the failure content list (DB2) can be a list that contains, for each failure number in DB1, the items failure number, failure content, workaround, recovery method, patch number, and failure occurrence conditions.
- A workaround is a grace measure (allowance) for a process with a high possibility of failure. It lets the process be executed temporarily, with a reduced possibility of failure, without stopping the application target system.
- For example, a shell program that can be executed without stopping and restarting the application target system can be used as a workaround.
- A high possibility of failure means that the possibility of a failure accompanying the execution of a process is at or above a predetermined value.
- The expected value described later can be used as the value that indicates the possibility of a failure.
- Alternatively, a value derived from a predetermined function or formula for determining the possibility of a failure may be used.
- An example of a workaround that does not require restarting the application target system is rewriting a definition file.
- In this case, the shell program rewrites the definition file.
- The failure management server 16 acquires information on workarounds, such as their specific contents, by manual or automatic processing, and registers it in DB2 or updates DB2 with it.
- An example of the manual process is one in which the user inputs workaround information and registers it in DB2 or updates DB2 with it.
- An example of the automatic process is periodically referring to the prohibited processes published by a vendor and, if a workaround exists for the latest prohibited process, acquiring the workaround information and registering it in DB2.
- Another example of the automatic process is automatically receiving the latest prohibited processes provided by the vendor and, if a workaround exists for a received process, registering the workaround information in DB2 or updating DB2 with it. The workaround information is preferably obtained together with the latest patch and its patch information.
- Preferably, when the failure management server 16 acquires the latest patch or patch information, whether through user input or automatic processing by the system, it also acquires the workaround and registers it in DB2 or updates DB2 with it.
- That is, the failure management server 16 can register or update workaround information in DB2 both from user input and by automatic processing.
- In the automatic processing, the failure management server 16 refers to the prohibited processes published by the vendor, checks whether a workaround exists for each one, and acquires the workaround information if it does. The acquired workaround can then be registered in DB2 or used to update DB2.
- Alternatively, the failure management server 16 may automatically receive the prohibited processes provided by the vendor and, if a workaround exists for a received process, register the workaround information in DB2 or update DB2 with it.
- The recovery method item in the failure content list is a measure that applies after a process in which a failure has become apparent has been stopped. As described later, it lets that process be resumed (executed temporarily) without stopping the application target system.
- In other words, the recovery method is also a grace measure. When a failure actually occurs in a process with a high possibility of failure, the running process is forcibly stopped. It is then preferable to reduce the possibility of failure and restart the process without stopping the application target system.
- As the recovery method, a shell program that can be executed without restarting the application target system can be used.
- The recovery method is acquired together with the latest patch and its patch information.
- For example, a shell program serving as a recovery method is acquired when a software developer provides a patch for the process. That is, the failure management server 16 acquires such shell programs when it acquires the latest patch and its patch information, and registers them in DB2 or updates DB2 with them.
- The patch number item in the failure content list defines the number of the patch that corrects the failure.
- The patch number defined in the failure content list may be determined by the failure management server 16 or may be determined in advance.
- The failure management server 16 manages the failure contents of the processes that cause failures by storing the data corresponding to each failure number in the failure content list (DB2).
- In the example of FIG. 3, the failure content for failure number “0001” is “panic”, and the failure is defined to occur when failure occurrence conditions 1 to 3 are met.
- FIG. 3 shows an example in which neither a workaround nor a recovery method exists for failure number “0001”. Failure occurrence conditions 1 to 3 are described in detail later.
- Together, the conditions “J1_0001” (condition 1), “J2_0001” (condition 2), and “J3_0001” (condition 3) are the conditions under which the “panic” failure occurs. For failure number “0001”, if the “fjpmgadd” process is started with the “-a” option while these conditions hold, a “panic” is highly likely to occur in the system.
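The link between a DB2 row and its condition management numbers can be sketched as a small record type plus a check that every referenced condition holds. This is a hypothetical illustration. The field names are invented, and FIG. 3's patch number for “0001” is not given in the text, so it is left as `None`. The rule that all listed conditions must be met is taken from the “0001” example. Treating a missing condition result as “not met” is an assumption.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class FailureContent:
    """One row of the failure content list (DB2)."""
    failure_no: str
    content: str
    workaround: Optional[str]        # e.g. a shell program; None if no workaround exists
    recovery: Optional[str]          # recovery method; None if none exists
    patch_no: Optional[str]          # correcting patch; None when unknown here
    condition_ids: Tuple[str, ...]   # condition management numbers in DB3 to DB5


DB2 = {
    "0001": FailureContent("0001", "panic", None, None, None,
                           ("J1_0001", "J2_0001", "J3_0001")),
}


def failure_predicted(record: FailureContent, results: Mapping[str, bool]) -> bool:
    """True when every referenced condition is reported as met (missing = not met)."""
    return all(results.get(cid, False) for cid in record.condition_ids)
```

The `results` mapping would be filled in by evaluating the individual conditions in DB3 to DB5.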
- The failure occurrence conditions describe when a failure occurs. The conditions under which software failures occur are classified and managed in DB3 to DB5.
- A software failure is considered to occur under certain conditions, that is, when the running application target system is in a specific state.
- During development, the software developer assumes the operating state in which the application target system will execute the process. If the actual state deviates from this assumed state, a failure is considered likely. The possibility of a failure can therefore be determined by grasping the operating state of the running application target system. Ways to grasp the operating state include “grasping from the output of a command defined by the system”, “grasping from the contents of a file defined by the system”, and “grasping from the execution result of a predetermined program in the system”.
- For each process, the failure management server 16 manages a list of conditions that describe specific states of the running application target system.
- Conditions based on “grasping from the output of a command defined by the system” are called condition 1, and their list is managed as DB3.
- Conditions based on “grasping from the contents of a file defined by the system” are called condition 2, and their list is managed as DB4.
- Conditions based on “grasping from the execution result of a predetermined program in the system” are called condition 3, and their list is managed as DB5.
- The failure management server 16 registers condition information, whether input by the user or collected by automatic processing by the system, in DB3 to DB5, and updates DB3 to DB5 based on that information.
- DB3 stores a list of conditions for determining the possibility of a failure from the output of commands defined by the system.
- The conditions in DB3 are based on “grasping from the output of a command defined by the system” and are defined as failure occurrence condition 1. That is, condition 1 uses the output of a given command to identify a specific operating state of the system in which a failure is highly likely.
- The commands defined by the system and their startup options are the information collected to determine the possibility of a failure.
- The output value assumed by the developer during development is set as the expected value for the collected information.
- Condition 1 is the combination of the information collected on the system's operating state and the expected value assumed by the developer during development. The possibility of a failure can therefore be determined by comparing the command output with the expected character string.
- For example, the failure occurrence condition with condition management number “J1_0001” is met when the OS version is “5.10”, the platform name is “sun4u”, and patch “123456-01” has been applied. That is, it is met when the output of the “uname -r” command is “5.10”, when the output of the “uname -i” command is “sun4u”, and when the result of “patchadd -p …”
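A condition 1 check can be sketched as “run each command and compare its output with the expected string”. This is a minimal sketch under stated assumptions: output is compared after stripping surrounding whitespace, a command that cannot be started counts as producing empty output, and the patchadd check for “J1_0001” is omitted because the source text cuts it off. `CommandCheck` and `condition1_met` are hypothetical names. The `runner` parameter lets a test substitute canned outputs for real commands.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class CommandCheck:
    argv: Tuple[str, ...]   # command and startup options collected for the determination
    expected: str           # output value assumed by the developer during development


def run_command(argv: Sequence[str]) -> str:
    """Run a command and return its stdout without surrounding whitespace."""
    try:
        result = subprocess.run(list(argv), capture_output=True, text=True, check=False)
    except OSError:
        return ""           # command could not be started: treated as empty output
    return result.stdout.strip()


def condition1_met(checks: Sequence[CommandCheck],
                   runner: Callable[[Sequence[str]], str] = run_command) -> bool:
    """Condition 1 holds when every command's output equals its expected value."""
    return all(runner(c.argv) == c.expected for c in checks)


# The two fully stated checks of "J1_0001"; the truncated patchadd check is left out.
J1_0001 = (
    CommandCheck(("uname", "-r"), "5.10"),
    CommandCheck(("uname", "-i"), "sun4u"),
)
```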
- Condition 2 uses the contents of a file that may be rewritten depending on the system's operating state, such as a text file placed in the system, to identify a specific operating state in which a failure is highly likely.
- As shown in FIG. 5, DB4 (the list of failure occurrence conditions 2) can be a list with the items condition management number, the information (file) collected to determine whether a failure will occur, and the expected value of that information.
- The collected information is a file, for example a text file.
- The content assumed by the developer during development is taken as the expected value.
- Condition 2 is the combination of the file collected from the running system and the expected value assumed by the developer during development. The possibility of a failure can therefore be determined by comparing the file contents with the expected character string.
- An example of a text file placed in the system is a definition file defined by the system or the software. When such a definition file is specified, its contents are compared with the expected character string to determine the possibility of a failure. The comparison can use a command such as diff, which outputs data indicating a match or a mismatch. From the output of diff, it can be determined whether the expected character string in DB4 differs from the contents of the file obtained from the running system.
- The failure occurrence condition with condition management number “J2_0001” is met when the contents of three files match their expected values, the character strings “File1”, “File2”, and “File3”, respectively. That is, it is met when the content of the file “/var/opt/FJSVpmgw/reg/.user” is “File1”, the content of “/var/opt/FJSVpmgw/etc/.config” is “File2”, and the content of “/var/opt/FJSVpmgw/etc/.role” is “File3”. When these file contents match their expected values, a failure is predicted.
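A condition 2 check can be sketched in the same style. The text mentions the diff command. This sketch instead compares the file contents in-process, which gives the same match or mismatch answer for plain-text files. Three points are assumptions: trailing newlines are ignored, an unreadable file counts as not matching, and all three files must match for “J2_0001”. `FileCheck` and `condition2_met` are hypothetical names.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class FileCheck:
    path: str       # file collected for the determination
    expected: str   # content assumed by the developer during development


def read_file(path: str) -> Optional[str]:
    """Return the file content without trailing newlines, or None if it cannot be read."""
    try:
        return Path(path).read_text().rstrip("\n")
    except (OSError, UnicodeDecodeError):
        return None


def condition2_met(checks: Sequence[FileCheck],
                   reader: Callable[[str], Optional[str]] = read_file) -> bool:
    """Condition 2 holds when every file's content equals its expected string."""
    return all(reader(c.path) == c.expected for c in checks)


J2_0001 = (
    FileCheck("/var/opt/FJSVpmgw/reg/.user", "File1"),
    FileCheck("/var/opt/FJSVpmgw/etc/.config", "File2"),
    FileCheck("/var/opt/FJSVpmgw/etc/.role", "File3"),
)
```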
- DB5 stores a list of conditions for determining the occurrence of a failure from the execution result of a program in the system.
- The conditions in DB5 are defined as failure occurrence condition 3. That is, condition 3 uses information that includes information arising from the system's operating state to identify a specific operating state in which a failure is highly likely.
- For example, a failure may not occur while the system is operating stably, yet become more likely while the system is under high load.
- In other words, the possibility of a failure may change with the system's operating state at the moment the process is executed. Unlike condition 1 (“grasping from the output of a command defined by the system”) and condition 2 (“grasping from the contents of a file defined by the system”), this cannot be determined by comparing collected information with an expected value. The expected value cannot simply be defined, so whether the condition is met cannot be decided from whether the information matches an expected value.
- Instead, a predetermined program can serve as the logic for determining the condition.
- For example, a shell script created by the developer can be used as this predetermined program. The predetermined program is preferably prepared in advance so that the changing operating state of the system can be grasped.
- As shown in FIG. …, DB5 (the list of failure occurrence conditions 3) can be a list with two items: the condition management number, which corresponds to the condition 3 entry in DB2, and the determination method, which is the information collected for the determination.
- In this embodiment, the information collected for condition 3 is the shell script created by the developer. The possibility of a failure can therefore be determined from the execution result of the shell script stored as the determination method.
- For example, if shell program a (an example is shown as code 36 in FIG. 7) is executed and its exit code is “1”, it can be determined that the condition with condition management number “J3_0001” applies.
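A condition 3 check reduces to running the stored determination program and testing its exit code. This is a minimal sketch, assuming the program is given as an argument vector and that exit code 1 means “condition met”, as in the “J3_0001” example. The timeout value is an arbitrary illustrative choice. A program that cannot be started or that times out is treated as “not met”, which is another assumption; a stricter policy could treat such failures as “met” instead. `condition3_met` is a hypothetical name.

```python
import subprocess
from typing import Sequence


def condition3_met(argv: Sequence[str], met_exit_code: int = 1,
                   timeout_s: float = 30.0) -> bool:
    """Run the determination program; the condition holds if it exits with met_exit_code."""
    try:
        result = subprocess.run(list(argv), capture_output=True,
                                timeout=timeout_s, check=False)
    except (OSError, subprocess.TimeoutExpired):
        return False    # could not determine: ASSUMED to mean "not met"
    return result.returncode == met_exit_code
```

On a real system, `argv` might be something like `["/bin/sh", "<path to shell program a>"]`. The path is not given in the text.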
- Databases DB1 to DB5 are built on the failure management server 16. The information in these databases is updated when a new failure occurs and the conditions under which it occurs become clear.
- The databases are updated each time the developer provides an update, or when known information is acquired via the network.
- FIG. 8 is an explanatory diagram showing the flow of the failure occurrence prevention process according to the present embodiment.
- The management server (the failure management server 16) holds a list 40, classified by process, that contains the operating-state conditions under which a failure occurs and information on the patches that prevent the failure.
- The management server also holds patches 42.
- When a process is started, the grasping unit 46 grasps the operating state of the application target system, and the reference unit 44 refers to the list classified by process.
- From the referenced list, the specifying unit 48 identifies the conditions associated with unapplied patches for the started process.
- The diagnosis unit 50 diagnoses the possibility of a failure in the started process from the conditions identified by the specifying unit 48 and the operating state grasped by the grasping unit 46. If the diagnosis shows a high possibility of failure, the management unit 52 stops the process, and the acquisition unit 54 acquires the unapplied patch corresponding to the stopped process.
- In other words, at the timing when a process is started, the reference unit 44 refers to the list and the grasping unit 46 grasps the operating state of the application target system.
- The diagnosis unit 50 then diagnoses whether the grasped operating state of the application target system corresponds to the failure occurrence conditions.
- If it does, the management unit 52 stops the process and the acquisition unit 54 acquires the patch. The acquired patch can be applied to the system at any time.
- FIG. 9 is a flowchart showing processing executed by the client terminal 14.
- When activation of a process is requested, the processing routine of FIG. 9 is executed, starting with the reference process in step 100.
- The reference process refers to the prohibited process list, classified by process, of which DB1 is an example.
- Step 100 includes searching whether the process requested to be activated is registered (has an entry) in the prohibited process list.
- Alternatively, the latest patch information may be obtained from the management server (the failure management server 16) and stored locally, and the prohibited process list contained in the stored information may be referred to.
- If the process requested to be activated is registered in the prohibited process list, the determination in step 102 is affirmative and processing proceeds to step 104. Otherwise, the determination in step 102 is negative. In that case the process has a low possibility of failure and can be executed without problems, so processing proceeds to step 124, the process is executed, and the routine ends.
- In step 104, a grasping process is executed, which collects information for the diagnosis process described later.
- The grasping process grasps the operating state of the application target system at the time activation of the process is requested.
- Specifically, the operating state is grasped “from the output of a command defined by the system” or “from the contents of a file defined by the system”. It can also be grasped “from the execution result of a predetermined program in the system”.
- In step 106, the diagnosis process is executed.
- The diagnosis process diagnoses the possibility of a failure in the process requested to be started, based on the failure occurrence conditions and the operating state grasped by the grasping process.
- If the possibility of a failure is high, the determination in step 108 is affirmative and processing proceeds to step 110.
- Otherwise, processing proceeds to step 124.
- In step 110, the process requested to be started is stopped because its possibility of failure is high.
- Next, notification/addition processing is executed.
- In the notification/addition processing, the user is notified that a patch exists to prevent the failure, together with its patch number (patch ID or the like).
- The notification may be shown on the display or delivered to the user by a notification unit. This notification can be omitted.
- The patch may also be acquired (downloaded) from the failure management server 16 and stored in the pool area 26, and its patch number added to the application candidate patch list.
- In step 114, the life extension process is started.
- Step 114 refers to DB2 to check whether a workaround exists for the process with a high possibility of failure. If one exists, the determination in step 116 is affirmative and the workaround is performed in step 118.
- Step 118 executes the workaround stored in the failure content list. This provides a grace measure (allowance) that lets the process run temporarily, with a reduced possibility of failure, without stopping the application target system.
- Processing then restarts the process stopped in step 110, executes it, and ends the routine.
- If no workaround exists, notification/addition processing is executed in step 122.
- The user is notified that a process with a high possibility of failure has been stopped.
- This notification may be shown on the display or delivered to the user by a notification unit. It can be omitted.
- Information on the process can also be added to the process interruption list.
- The process interruption list is a database that stores a list of, for example, the names of processes that were stopped because of a high possibility of failure.
- If the user only needs to be told which processes were stopped, the process interruption list must contain at least the name of each interrupted process.
- The process interruption list can include more items, depending on how much information about interrupted processes is to be given to the user. For example, like the prohibited process list shown in FIG. 2, it can be a list classified by suspended process with the items list item number (No.), suspended process name, option, and failure number.
- The database storage area 15 of the client terminal 14 (the application target system) can include a database that stores the process interruption list, that is, the list of names of processes stopped because of a high possibility of failure.
- FIG. 10 is a flowchart showing system maintenance processing executed by the client terminal 14.
- The system maintenance process is executed at a predetermined time or at any time specified by the user. It stops system operation and performs system maintenance.
- The following describes a case where system operation is stopped and the minimum necessary patches are applied based on the application candidate patch list and the process interruption list.
- In step 130, patch application processing is executed.
- In the patch application processing, the patches registered in the application candidate patch list are applied collectively.
- Patches are thus applied both to processes kept alive by the life extension process and to processes that remain stopped because no workaround exists, which stabilizes the system.
- Alternatively, step 130 may present the process interruption list to the user, let the user select only the processes to be patched, and apply just the selected patches.
- In summary, the system is diagnosed every time a process is started during operation. The minimum necessary patches are extracted as candidates for application to the target system at the next system maintenance.
- The diagnosis at process startup uses the "prohibited process list", "failure content list", and "condition lists" obtained from the server. It determines whether the operating state of the target system meets the conditions under which continuing the started process would cause a failure. If the started process has a high possibility of failure, it is stopped. If a workaround exists, the workaround is applied and the process is restarted; if there is no workaround, the process remains stopped. The application candidate patches are then applied collectively during system maintenance, and the user can also apply patches selectively. As a result, failures can be prevented before they occur.
- FIG. 11 is a flowchart showing a detailed flow centered on a diagnostic process executed by the client terminal 14.
- The processing routine of FIG. 11 shows the details of steps 104 to 110 and step 124 in the processing flow shown in FIG. 9.
- When process activation is requested, reference processing is executed in step 200. If the requested process is not registered (has no entry) in the prohibited process list (No in step 202), the process is executed in step 204 (as in step 124 of FIG. 9) and this routine ends. Step 200 of FIG. 11 differs from step 100 of FIG. 9 in that it prepares for the later processing that specifies the failure occurrence conditions for the requested process. In step 200, confirmation is based on the result of searching the prohibited process list for an entry of the requested process.
- In step 206, the failure content list (DB2) is referenced to identify the corresponding failure content.
- The conditions corresponding to that failure content are extracted from DB3, DB4, and DB5.
- That is, a condition relating to a patch not yet applied to the requested process (a failure occurrence condition) is specified.
- The first condition process executes a command to collect the information defined under a condition management number (for example, J1_0001) in condition list 1 (DB3). It then compares that information with an expected value.
- This is the grasping process, that is, information collection for diagnosis. The information for the condition management number is obtained by grasping the operating state (step 210).
- The first condition process "ascertains the state from the output of a command defined by the system" and corresponds to the grasping process in step 104 of FIG. 9.
- The command is executed to obtain its result value.
- The expected value for the condition management number is then acquired (step 212). The possibility of failure under the first condition is diagnosed according to whether the information obtained in step 210 matches the expected value obtained in step 212.
- In step 214, the command result value is compared with the expected value to judge whether a failure is highly likely:
  - If the values do not match, the possibility of failure is not high. Step 214 is negative and processing proceeds to step 204.
  - If the values match, the possibility of failure is high. Step 214 is affirmative and processing proceeds to step 216.
- In step 216, the second condition process is executed.
- The second condition process collects the information defined under a condition management number (for example, J2_0001) in condition list 2 (DB4). It then compares that information with the expected value.
- In the grasping process, the operating state of the application target system at the time process activation was requested is grasped.
- The information defined under the condition management number is obtained (step 218).
- The second condition process "ascertains the state from the contents of a file defined by the system". It reads the file contents to obtain the value.
- The expected value for the condition management number is then acquired (step 220). The possibility of failure under the second condition is diagnosed according to whether the information obtained in step 218 matches the expected value obtained in step 220.
- In step 222, the value from the file contents is compared with the expected value to judge whether a failure is highly likely:
  - If the values do not match, the possibility of failure is not high. Step 222 is negative and processing proceeds to step 204.
  - If the values match, the possibility of failure is high. Step 222 is affirmative and processing proceeds to step 224.
- In step 224, the third condition process is executed.
- The third condition process executes a program such as a shell script to collect the information defined under a condition management number (for example, J3_0001) in condition list 3 (DB5). It then compares that information with an expected value.
- In the grasping process, the operating state of the application target system at the time process activation was requested is grasped.
- The information for the condition management number is obtained (step 226).
- The third condition process "ascertains the state from the execution result of a program defined by the system". Here, the result value (exit code) obtained by executing the shell script is used.
- The expected value for the condition management number is then acquired (step 228). The possibility of failure under the third condition is diagnosed according to whether the information obtained in step 226 matches the expected value obtained in step 228.
- In step 230, the result value of the shell script or similar program is compared with the expected value to judge whether a failure is highly likely:
  - If the values do not match, the possibility of failure is not high. Step 230 is negative and processing proceeds to step 204.
  - If the values match, the possibility of failure is high. Step 230 is affirmative and processing proceeds to step 232, where the requested process is stopped.
  
  In other words, the requested process is stopped only when steps 214, 222, and 230 are all affirmative, that is, when a failure is judged likely under each of the first to third conditions.
- FIG. 12 is a flowchart showing a detailed flow of the processing from the life extension process onward, executed by the client terminal 14.
- The processing routine of FIG. 12 shows the details of steps 114 to 122 in the processing flow shown in FIG. 9.
- The processing in FIG. 12 keeps the system operating as far as possible by executing a workaround.
- When the requested process has been stopped, the processing routine of FIG. 12 is executed.
- The client terminal 14 refers to the failure content list (DB2) and acquires (downloads) from the failure management server 16 a patch that resolves the failure of the stopped process. It stores the patch in the pool area 26.
- The patch number is reported to the user and added to the application candidate patch list (step 112 in FIG. 9).
- In step 240, confirmation processing is executed.
- The confirmation processing refers to the failure content list (DB2) to confirm whether a workaround (shell program) for the stopped process is registered there.
- If a workaround is registered in the DB2 failure content list, it is executed in step 244 (as in step 118 of FIG. 9) to prolong the life of the process.
- In step 246 (as in step 120 of FIG. 9), the stopped process is restarted and this routine ends.
- The startup diagnosis process executes the diagnosis process shown in FIG. 11.
- The operating state (for example, the CPU load) changes dynamically, so the diagnosis result may differ on a later attempt.
- In step 248, the startup diagnosis process is attempted.
- The startup diagnosis process can be retried up to a number of times predetermined by the user.
- The retry count can be defined as information in the operating environment of the client terminal 14.
- If the diagnosis in step 248 results in the process being stopped (No in step 250), step 254 determines whether retries exceeding the defined number have been executed. If not (No in step 254), the retry count is incremented and step 248 is executed again. If the diagnosis in step 248 results in the process being executed (Yes in step 250), step 252 deletes the process from the process interruption list if it is registered there, and this routine ends.
- If retries exceeding the defined number have been attempted (Yes in step 254), retrying ends. The process is registered in the process interruption list (step 256), the user is notified (step 258), and this routine ends.
- In this way, the list is referred to when process activation is requested, and it is diagnosed whether the requested process satisfies the failure occurrence conditions.
- If the diagnosis shows that the activation request is for a process with a high possibility of failure, execution of that process is stopped. Processes likely to trigger known failures are thus stopped. As a result, the possibility of a failure can be diagnosed according to the operating state of the system, and failures can be prevented in advance for the system in operation.
- Software generally contains latent defects, unknown at development time, that may cause failures.
- When such an unknown defect surfaces, it becomes a bug (an error or defect in the program) that causes a failure when the software is executed.
- The vendor provides a patch, that is, data for preventing a software failure, once the bug is known. A failure caused by a bug that has not yet surfaced therefore cannot be prevented by applying patches.
- When a defect becomes known, a process with a high possibility of failure due to the buggy code may be found to be already running. In that case the process is stopped and, where possible, restored.
- The recovery method is stored as an item of the failure content list.
- The recovery method is a measure for resuming a process without stopping the application target system, after that process was stopped because it was judged highly likely to fail. When a running process is highly likely to fail, or a failure has occurred, the running process is forcibly stopped. Once it is stopped, it is preferable to lower the possibility of failure and restart the process without stopping the application target system.
- As the recovery method, a shell program that can be executed without restarting the application target system can be employed.
- The recovery method is acquired together with the latest patch and patch information. For example, the shell program serving as the recovery method is obtained when the software developer provides the patch for the process.
- FIG. 13 is an explanatory diagram showing the flow of the failure occurrence prevention process according to the present embodiment.
- The management server, that is, the failure management server 16, uses the update unit 56 to update the list 40. The list 40 is categorized by process and includes the operating-state conditions under which a failure occurs and information on the patches that resolve it. When a new failure becomes apparent, the management server registers the patch 42 and the patch information for preventing that failure.
- When the patch 42 and the patch information for preventing the failure are registered, the management server updates the list 40 and notifies the client terminal 14.
- The list 40 includes DB1 to DB5.
- The reception unit 58 receives the notification from the management server (the failure management server 16) indicating that the list 40 has been updated.
- The grasping unit 46 grasps the operating state of the application target system.
- Grasping the operating state includes extracting the running processes.
- The reference unit 44 refers to the list.
- The specifying unit 48 determines whether a running process is included in the list 40 referred to by the reference unit 44. If the process is registered in the list 40, the specifying unit 48 specifies the conditions for that process.
- The diagnosing unit 50 diagnoses the possibility of failure of the running process from the conditions specified by the specifying unit 48 and the operating state of the application target system grasped by the grasping unit 46.
- If the possibility of failure is high, the management unit 52 stops the running process. The application target system then uses the acquisition unit 54 to acquire the unapplied patch for the process stopped by the management unit 52.
- In this way, the client terminal 14 may receive from the failure management server 16 a notification of a process in which a new failure has become apparent. It then refers to the list 40 and diagnoses whether the operating state of the application target system meets the failure occurrence conditions. A running process diagnosed as having a high possibility of failure is stopped, and a patch for that process is acquired for the application target system. Acquired patches can be applied to the system at any time. Thus, when a new failure becomes apparent, the operating state of the application target system determines whether a running process is highly likely to fail, and such a process is stopped. Even a process already in operation can therefore be stopped before a likely failure occurs.
- FIG. 14 is a flowchart showing a detailed flow centered on a diagnostic process executed in the client terminal 14.
- It shows the processing executed when the client terminal 14 receives a notification from the failure management server 16.
- The processing routine of FIG. 14 is executed in place of the processing shown in FIG.
- In step 300, a reference process is executed.
- The list 40 includes DB1 to DB5.
- The failure management server 16 notifies the client terminal 14 of information indicating that the list 40 has been updated.
- Step 300 confirms whether the prohibited process list has been updated. If the reference process finds that it has not been updated (No in step 302), this routine ends.
- If the prohibited process list has been updated, that is, a process has been newly registered or updated in it (Yes in step 302), processing proceeds to step 304. There, the processes being executed (running) in the application target system are checked.
- In step 306, one of the running processes is designated.
- In step 308, the operation-time diagnosis process is executed.
- The operation-time diagnosis process in step 308 is the same as the startup diagnosis process described for step 248 in FIG. 12. Specifically, the diagnosis process shown in FIG. 11 is carried out with the process designated in step 306 treated as the process requested to be activated.
- In this embodiment, the processing of FIG. 11 and the processing of step 308 in FIG. 14 differ in steps 204 and 232 of FIG. 11.
- In FIG. 11, the processing target is the process whose start was requested.
- In step 308 of FIG. 14, the processing corresponding to step 204 of FIG. 11 is skipped, and processing continues without further action.
- In the processing corresponding to step 232, a process that is already running is forcibly stopped.
- In step 310, it is determined whether any running process has not yet undergone the operation-time diagnosis process. If so, processing returns to step 306 and the operation-time diagnosis process is performed on the remaining running processes.
- FIG. 15 is a flowchart showing a detailed flow centered on the recovery process executed in the client terminal 14.
- The recovery process is performed after a process running on the client terminal 14 has been stopped, that is, after the already running process has been forcibly stopped in the processing corresponding to step 232 of FIG. 11.
- The recovery process serves to keep the system operating.
- The processing routine of FIG. 15 is executed in place of the processing shown in FIG.
- Patches may be acquired (downloaded) from the failure management server 16 and accumulated in the pool area 26, with the patch numbers added to the application candidate patch list. At system maintenance, applying the patches accumulated in the pool area 26 then applies the minimum necessary fixes for the failures most likely to occur in the application target system.
- Next, a confirmation process is executed.
- The confirmation process refers to the failure content list (DB2) to confirm whether a recovery method (shell program) for the stopped process is registered there.
- If the recovery method is registered in the DB2 failure content list (Yes in step 322), processing proceeds to step 324. There, recovery processing using the recovery method is executed to keep the process alive until the next system maintenance.
- The stopped process is then resumed, and this routine ends.
- If no recovery method for the stopped process is registered in the failure content list (No in step 322), processing proceeds to step 328 and the process is registered in the process interruption list. In step 330, the user is notified that the stopped process has been registered in the process interruption list, and this routine ends.
- In this way, the list is referred to while processes are running to diagnose whether a failure will occur. When a process that is highly likely to cause a newly registered failure is running, that process is stopped. As a result, when a previously unknown defect becomes known, the affected process is stopped promptly, so even a newly registered failure can be prevented before it occurs.
- The present invention is not limited to confirming updates of the prohibited process list.
- Because the list 40 includes DB1 to DB5, the processing can also be applied when a list other than the prohibited process list is updated.
- The information in DB2 to DB5 may also be updated to the latest information.
- Some users may not want to apply patches.
- Patch application may be avoided or postponed for operational reasons, for example because the user wants to keep the current system running as is or to shorten system downtime.
- One example is the view that the latest patch is unnecessary: even if a function in use has a known failure, the current system is running without trouble.
- If applying a patch causes incompatibility or a level-down, restoration requires many man-hours. For this reason, some take the view that patches should not be applied proactively just to prevent known failures.
- On the other hand, the same user may also feel that it is too late once a failure has occurred because a patch was not applied. This is a contradictory position.
- The above embodiments also suit users who hold both views: "I do not want to apply patches proactively" and "it is too late once a failure has occurred without a patch".
- When a process is requested in order to provide a software function, the list is referred to at process startup. It is then diagnosed whether the operating state of the target system meets the failure occurrence conditions for the requested process.
- If the diagnosis shows that the requested process is highly likely to fail in the current operating state of the application target system, execution of the requested process is stopped.
- The possibility of a failure can thus be diagnosed according to the operating state of the system, and failures can be prevented in advance for the application target system in operation.
- Failures can be suppressed without stopping the running system.
- Because patches are acquired when a process is stopped, only the minimum necessary patches are applied, which reduces the work required for system maintenance.
- For such users, the above embodiments work effectively.
- The possibility of failure is diagnosed according to the operating state of the system, either when the process starts or when the latest patch for the process is registered.
- If the diagnosis indicates a high possibility of failure, execution of the process is stopped.
- The possibility of a failure can thus be diagnosed according to the operating state of the system, and failures can be prevented in advance for the system in operation without applying patches the user does not want.
- The program may be stored (installed) in advance in the storage unit of the client terminal.
- The processing program can also be provided recorded on a recording medium such as a CD-ROM or DVD-ROM.
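The three-stage check of FIG. 11 (steps 210 to 232) can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the patented implementation. The `FailureConditions` record and its field names are hypothetical stand-ins for one entry drawn from condition lists 1 to 3 (DB3 to DB5). The sketch also assumes that "matches the expected value" means exact string equality for the command output and file contents, and integer equality for the exit code.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FailureConditions:
    """Hypothetical record holding the three conditions for one failure."""
    command: list[str]        # condition 1 (DB3): command whose output is checked
    command_expected: str
    file_path: Path           # condition 2 (DB4): file whose contents are checked
    file_expected: str
    script: list[str]         # condition 3 (DB5): program whose exit code is checked
    script_expected: int


def command_output(cmd: list[str]) -> str:
    """Condition 1 (step 210): run a command and capture its stdout."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()


def file_contents(path: Path) -> str:
    """Condition 2 (step 218): read the contents of a system-defined file."""
    return path.read_text().strip()


def exit_code(cmd: list[str]) -> int:
    """Condition 3 (step 226): run a program and take its end code."""
    return subprocess.run(cmd, stdout=subprocess.DEVNULL).returncode


def failure_likely(c: FailureConditions) -> bool:
    """Return True only if all three observed values equal their expected values.

    Mirrors steps 214, 222 and 230: the first mismatch short-circuits,
    meaning the requested process may be executed (step 204).
    """
    if command_output(c.command) != c.command_expected:   # step 214
        return False
    if file_contents(c.file_path) != c.file_expected:     # step 222
        return False
    return exit_code(c.script) == c.script_expected       # step 230 -> step 232 if True
```

Running the cheap command and file checks before the script mirrors the order of the flowchart. Because the checks short-circuit, a single mismatch lets the process start without running the remaining collectors.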
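The workaround-then-retry flow of FIG. 12 (steps 240 to 258) might look like the sketch below. It rests on two assumptions for illustration:

- The dictionary of workaround callables stands in for the workaround column of DB2.
- `diagnose_and_run` is a hypothetical hook that reruns the FIG. 11 diagnosis and reports whether the process ended up executing.

`max_retries` counts retries after the first attempt, which is one reading of "retries exceeding the defined number of times".

```python
from typing import Callable


def life_extension(
    process: str,
    workarounds: dict[str, Callable[[], None]],  # stand-in for DB2 workaround entries
    diagnose_and_run: Callable[[str], bool],     # True if the process ends up executing
    interruption_list: set[str],                 # the process interruption list
    notify: Callable[[str], None],
    max_retries: int = 3,
) -> str:
    """Return "restarted", "executed" or "interrupted" for a stopped process."""
    workaround = workarounds.get(process)         # steps 240-242: look up DB2
    if workaround is not None:
        workaround()                              # step 244: run the workaround
        return "restarted"                        # step 246: caller restarts the process
    retries = 0
    while True:
        if diagnose_and_run(process):             # steps 248-250: startup diagnosis
            interruption_list.discard(process)    # step 252: drop from the list if present
            return "executed"
        if retries >= max_retries:                # step 254: retry limit reached
            interruption_list.add(process)        # step 256
            notify(f"{process} was registered in the process interruption list")  # step 258
            return "interrupted"
        retries += 1                              # No in step 254: try again
```

Passing the diagnosis in as a callable keeps the retry policy separate from the condition checks. It also makes the loop easy to exercise with a stub that always fails or always succeeds.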
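This sketch covers the case where the client terminal 14 has learned that the list 40 was updated (Yes in step 302). It combines the loop over running processes in FIG. 14 with the recovery routine of FIG. 15. The following are hypothetical simplifications, not the patent's data structures:

- `ClientState` models the running processes, the pool area 26, the application candidate patch list and the process interruption list.
- The `prohibited` mapping from process name to patch number stands in for DB1 and DB2.
- The `recovery` callables stand in for the recovery shell programs.

The user notification of step 330 is left out.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ClientState:
    """Hypothetical client-side state; names are illustrative only."""
    running: set[str]
    pool: list[str] = field(default_factory=list)               # pool area 26
    candidate_patches: list[str] = field(default_factory=list)  # application candidate patch list
    interruption_list: list[str] = field(default_factory=list)  # process interruption list


def on_list_updated(
    state: ClientState,
    prohibited: dict[str, str],               # process name -> patch number
    failure_likely: Callable[[str], bool],    # FIG. 11 diagnosis without step 204
    recovery: dict[str, Callable[[], None]],  # process name -> recovery method
) -> None:
    # Steps 304-310: examine every running process once (sorted() copies the set).
    for proc in sorted(state.running):
        if proc not in prohibited or not failure_likely(proc):
            continue                          # step 204 is skipped: nothing to do
        state.running.discard(proc)           # step 232: forcibly stop the process
        patch = prohibited[proc]
        state.pool.append(patch)              # download the patch into the pool area
        state.candidate_patches.append(patch)
        fix = recovery.get(proc)
        if fix is not None:                   # Yes in step 322
            fix()                             # step 324: run the recovery method
            state.running.add(proc)           # resume the stopped process
        else:                                 # No in step 322
            state.interruption_list.append(proc)  # step 328
```

Because the loop iterates over a sorted copy, it can safely remove and re-add entries in `state.running` while walking the running processes.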
Landscapes
- Engineering & Computer Science (AREA)
- Quality & Reliability (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention prevents failure generation by determining the operation status of a system, then diagnosing the likelihood of failure generation and assessing whether a process is capable of functioning. When the start of a process is requested during the operation of the system to be applied, a prohibited process list is referenced (100). If the requested process is registered, the dynamically changing system operation status is determined (104), the likelihood of failure generation for the process is diagnosed (106), and if there is a high likelihood of failure generation for the process, the execution of the process is stopped (110). This procedure enables the prevention of failure generation.
Description
The disclosed technology relates to a failure occurrence prevention device, a failure occurrence prevention method, a failure occurrence prevention program, and a medium.
Software such as OS (Operating System) and application programs installed in the system is required to prevent failure by the software itself at the time of execution. For this purpose, it is desirable to acquire a patch provided by a vendor or the like that provides software and apply the acquired patch to the system to prevent software failure.
Patches are generally applied during periodic maintenance windows in which system operation is stopped. When the system includes software with a previously reported, known failure, a patch that prevents that failure is applied.
Because there is an enormous amount of software, there is also an enormous number of patches. It is therefore common to select and apply only the patches applicable to one's own system. One known technique extracts and applies only those applicable patches out of the vast number available. In one example, unapplied patches are extracted from the known patches according to a user policy.
Patches applicable to one's own system are selected based on system information such as the software version, the hardware type, and the IDs of patches already applied. As a result, every patch matching the system information is extracted as applicable and applied to the system, which takes time. Applying patches also raises two risks: a level-down of some software, such as functional degradation, and incompatibility between software degraded by a patch and software left unpatched. A known technique therefore applies only the patches that the vendor has collected to fix important and urgent failures, such as security failures. In one example, the patch application time is determined according to the urgency and the load on the target computer.
Japanese Patent No. 3200661; JP 2007-25820 A; JP 2005-38223 A; JP H10-63527 A; JP 2010-250749 A; WO 2007/105274; JP 2005-327275 A; WO 2008/126221
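The conventional, static extraction described in the background can be illustrated with a short Python sketch. The `Patch` fields, the `urgent` flag and `extract_applicable` are hypothetical names for illustration. The filter looks only at the software version, hardware type and already-applied patch IDs, so every matching patch is selected regardless of how the system is actually running. `urgent_only` loosely models the vendor-curated variant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    patch_id: str
    product: str
    version: str
    hardware: str
    urgent: bool = False   # e.g. a security fix flagged by the vendor


def extract_applicable(
    patches: list[Patch],
    product: str,
    version: str,
    hardware: str,
    applied_ids: set[str],
    urgent_only: bool = False,
) -> list[Patch]:
    """Select unapplied patches that match static system information only."""
    return [
        p for p in patches
        if p.product == product and p.version == version and p.hardware == hardware
        and p.patch_id not in applied_ids
        and (p.urgent or not urgent_only)
    ]
```

Nothing in this filter depends on the live operating state. That gap is the one the disclosed technique addresses.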
However, whether a failure will occur is not known until just before it actually occurs in one's own system, so its possibility cannot be judged in advance. Whether software causes a failure depends on the operating state of the system when that software runs. Because failures depend on the live state of the system, static system information is not enough to judge the possibility of a failure. Examples of such static information are the hardware and CPU type and the version of the software in use.
In one aspect, the objective is to suppress system failures.
When a process is started, the disclosed technology grasps the operating state of the application target system. It also refers to a list, classified by process, that includes patch information and the operating-state conditions under which a failure occurs. From this list, the operating-state conditions under which the failure occurs are specified. The possibility of a failure is then diagnosed from those conditions and the grasped operating state of the application target system. If the diagnosed possibility of a failure accompanying execution of the process is at or above a predetermined value, the process is stopped. An unapplied patch corresponding to the stopped process is also acquired.
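The overall decision at process start can be sketched as follows. Everything here is an assumption for illustration:

- The condition list is modeled as a mapping from process name to expected key/value pairs.
- `observe` reads one item of the live operating state.
- The "possibility" is scored as the fraction of matching conditions, so the predetermined value becomes a numeric `threshold`.

With the default threshold of 1.0, a process is stopped only when every condition matches, which is how the embodiment in this document behaves.

```python
from typing import Callable, Mapping, Optional


def possibility(conditions: Mapping[str, str], observe: Callable[[str], Optional[str]]) -> float:
    """Fraction of a process's failure conditions matched by the live state (assumed scoring)."""
    if not conditions:
        return 0.0
    hits = sum(observe(key) == expected for key, expected in conditions.items())
    return hits / len(conditions)


def on_process_start(
    process: str,
    condition_list: Mapping[str, Mapping[str, str]],  # process -> {state key: expected value}
    unapplied_patch: Mapping[str, str],               # process -> patch number
    observe: Callable[[str], Optional[str]],          # grasps one item of the operating state
    acquire: Callable[[str], None],                   # fetches a patch from the server
    threshold: float = 1.0,                           # the "predetermined value"
) -> bool:
    """Return True if the process may run; otherwise request its patch and return False."""
    conditions = condition_list.get(process)
    if conditions is None:            # process not in the list: run normally
        return True
    if possibility(conditions, observe) >= threshold:
        acquire(unapplied_patch[process])   # acquire the unapplied patch
        return False                        # caller stops the process
    return True
```

For example, with `observe = {"cpu": "high", "mode": "single"}.get` and conditions `{"cpu": "high", "mode": "cluster"}`, the score is 0.5. The process therefore runs at the default threshold but would be stopped at `threshold=0.5`.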
One embodiment has the effect of suppressing the occurrence of failures.
Hereinafter, an example of an embodiment of the disclosed technology will be described in detail with reference to the drawings.
(First Embodiment)
FIG. 1 shows a schematic configuration of a computer system 10 applicable as a system for suppressing the occurrence of a failure according to the present embodiment. In the computer system 10, a plurality of client terminals 14 and a failure management server 16 are connected to a network 12 such as a LAN. The network 12 can include a communication network such as the Internet.
The failure management server 16 manages the patches provided to the application target system to prevent failures, together with information about those patches. The failure management server 16 includes a CPU 16A, a memory 16B such as a RAM, and a nonvolatile storage unit 16C such as an HDD (Hard Disk Drive) or flash memory. It also includes a network interface (I/F) unit 16D, through which it is connected to the network 12. A display 20 (an example of an output device) and a keyboard 22 and a mouse 24 (input units) are connected to the failure management server 16.
An OS (Operating System) program and various application programs running on the OS are preinstalled in the storage unit 16C of the failure management server 16.
The storage unit 16C of the failure management server 16 also contains a database storage area 17, a patch storage area 18, and a shell storage area 19.
The database storage area 17 stores patch information as databases. It holds five databases. The first is database 1 (hereinafter "DB1"), which stores a list of prohibited processes that may cause a failure. The second is database 2 (hereinafter "DB2"), which stores a list of the failure details for each prohibited process.
The third to fifth databases each store a list of conditions under which a failure occurs. Database 3 (hereinafter "DB3") stores conditions that determine whether a failure will occur from the output of commands defined by the system. Database 4 (hereinafter "DB4") stores conditions that make this determination from the contents of files defined by the system. Database 5 (hereinafter "DB5") stores conditions that make this determination from the execution results of programs on the system.
Information in DB1 to DB5 is registered and updated periodically, either through user input or through automatic processing by the system.
The present embodiment describes the case where patch information is classified and stored in DB1 to DB5, but the invention is not limited to this. For example, all the information may be stored together in a single database without being classified into DB1 to DB5. Alternatively, some of the information stored in DB1 to DB5 may be combined and stored in four or fewer databases.
In the present embodiment, examples of a file include a program file, a library file, and a data file. A file is not limited to a single program file. For example, when a commercial product includes multiple program files, a file can be a group of such files. One example is the group of program files, library files, and data files that make up the OS.
Each client terminal 14 connected to the network 12 is an example of a system to which failure-prevention patches are applied (an application target system). The client terminal 14 includes a CPU 14A, a memory 14B including a RAM, a nonvolatile storage unit 14C such as an HDD, and a network I/F unit 14D. The client terminal 14 is connected to the network 12 via the network I/F unit 14D. A display 20, a keyboard 22, and a mouse 24 are also connected to the client terminal 14.
An OS program and various application programs running on the OS are preinstalled in the storage unit 14C of the client terminal 14. The storage unit 14C also contains a database storage area 15, a pool area 26, and a shell storage area 28.
(Failure management server)
Next, the failure management performed by the failure management server 16 is described. The failure management server 16 manages the patches to be applied to any application target system, together with information about them. This management includes acquiring the patches and patch information to be applied, managing the acquired patches and patch information as databases, and providing the patches and patch information to the application target system.
A software failure can occur when a particular operation is performed under certain conditions. Examples of such operations include "executing a command", "starting a service", and "using a library or system call". Executing a command or starting a service runs an executable file on the system, which starts and executes a process. Libraries and system calls are invoked from within a process, so using a library or system call also corresponds to starting and executing a process. The failure management server 16 of the present embodiment treats a process that may cause a failure as a prohibited process and manages the list of prohibited processes as DB1. If a failure occurs only when a specific option is specified at process start, the corresponding option is also defined for that process in the list of prohibited processes.
The failure management server 16 keeps the list of prohibited processes managed as DB1 up to date. This updating can be done manually or automatically. In an example of manual processing, the user enters the latest prohibited processes, which are then registered in DB1. One example of automatic processing periodically consults the prohibited processes published by a vendor and registers the latest ones in DB1. Another example automatically receives the latest prohibited processes provided by a vendor, then registers them in DB1 or updates DB1 with them.
To support manual updating of DB1, the failure management server 16 registers prohibited processes entered by the user in DB1 or updates DB1 with them. To support automatic updating, the failure management server 16 consults the prohibited processes published by the vendor and registers them in DB1 or updates DB1 accordingly. Also to support automatic updating, the failure management server 16 automatically receives the latest prohibited processes provided by the vendor and registers them in DB1 or updates DB1 with them.
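The register-or-update behaviour described above can be sketched as an upsert keyed on the failure number. This assumes that each failure number identifies one DB1 entry, which the text does not state outright. The `ProhibitedProcess` record and the `upsert_db1` function are hypothetical names, and the same routine would serve user-entered and vendor-supplied entries alike.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class ProhibitedProcess:
    failure_no: str        # assumed unique key (the failure-number column of FIG. 2)
    name: str              # prohibited process name
    option: Optional[str]  # startup option that triggers the failure, if any


def upsert_db1(db1: Dict[str, ProhibitedProcess],
               incoming: Iterable[ProhibitedProcess]) -> Tuple[int, int]:
    """Register new entries and update changed ones; return (added, updated)."""
    added = updated = 0
    for entry in incoming:
        old = db1.get(entry.failure_no)
        if old is None:
            added += 1      # new prohibited process: register it
        elif old != entry:
            updated += 1    # known failure number with changed data: update it
        db1[entry.failure_no] = entry
    return added, updated
```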
Along with the prohibited processes provided by the vendor, the failure management server 16 can also acquire the latest patches and patch information corresponding to those processes.
As shown in FIG. 2, DB1 (the list of prohibited processes) can be a list classified by process with the following items: list item number (No.), prohibited process name, option, and a failure number that identifies the failure.
In this embodiment, data on a process that may cause a failure is stored and managed as a prohibited process in the list of prohibited processes (DB1), with a failure number assigned to it. If a failure may occur when the process is executed with a particular startup option, that option is also stored in the list. For example, list item No. 1 in FIG. 2 defines that executing the process named "fjpmgadd" with the option "-a" may cause the failure corresponding to failure number 0001.
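A lookup against DB1 can be sketched as follows, using only the FIG. 2 row quoted above. The `Db1Row` type and `matching_failures` function are hypothetical. The rule that a row with no option matches any invocation of the process is an assumption, not something the text specifies.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Db1Row:
    no: int                # list item number
    name: str              # prohibited process name
    option: Optional[str]  # None (assumed): prohibited regardless of options
    failure_no: str        # key into the failure content list (DB2)


# The single row described for FIG. 2.
DB1 = [Db1Row(1, "fjpmgadd", "-a", "0001")]


def matching_failures(db1: Sequence[Db1Row], name: str, argv: Sequence[str]) -> List[str]:
    """Return the failure numbers whose process name and option match this launch."""
    return [row.failure_no for row in db1
            if row.name == name and (row.option is None or row.option in argv)]
```

For example, `matching_failures(DB1, "fjpmgadd", ["-a"])` yields `["0001"]`, while launching `fjpmgadd` without `-a` yields an empty list.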
The failure management server 16 manages the details of the failure corresponding to each failure number in the list of prohibited processes, using a database that stores a failure content list. That is, the failure management server 16 lists the failure numbers of failure-causing processes together with information such as the failure details, and manages this list as DB2. As shown in FIG. 3, DB2 (the failure content list) can be a list with the following items: failure number (corresponding to the failure number in DB1), failure content, workaround, recovery method, patch number, and failure occurrence conditions.
Among the items in the failure content list, the workaround is a measure for a process with a high likelihood of failure. It lowers that likelihood so that the process can be executed temporarily without stopping the application target system. As an example, a shell program that can be executed without stopping and restarting the application target system can serve as a workaround. In general, applying the latest patch is considered preferable for a process with a high likelihood of failure. However, operators often want to postpone maintenance that stops a running system for as long as possible. A workaround that provides a temporary measure on the current system is effective in meeting this need.
A high likelihood of failure means that the likelihood of a failure accompanying execution of the process is equal to or greater than a predetermined value. The expected value described later can be used as the value indicating this likelihood. Alternatively, a value derived from a predetermined function or formula for computing the likelihood of failure may be used.
One example of a workaround that does not require restarting the application target system is rewriting a definition file, and a shell program can perform such rewriting. In the present embodiment, a measure that requires a system restart after it is applied is not regarded as a workaround. The failure management server 16 acquires workaround information, such as its specific content, by manual or automatic processing. It then registers this information in DB2 as the workaround for the process, or updates DB2 with it.
In an example of manual processing, the user enters workaround information, which is then registered in DB2 or used to update DB2. One example of automatic processing periodically consults the prohibited processes published by a vendor. When a workaround exists for the latest prohibited process, it acquires the workaround information and registers it in DB2. Another example automatically receives the latest prohibited processes provided by a vendor. When a workaround exists for a received process, it registers the workaround information in DB2 or updates DB2 with it. The workaround information is preferably acquired together with the latest patch and its patch information. For example, the shell program serving as a workaround is preferably one that the software developer provides along with the patch for the process. That is, the failure management server 16 preferably acquires the workaround at the same time as it acquires the latest patch or patch information, whether by user input or by automatic system processing. It then registers the acquired workaround in DB2 or updates DB2 with it.
To support manual registration or updating of workarounds in DB2, the failure management server 16 can register workaround information entered by the user in DB2 or update DB2 with it. To support automatic registration or updating, the failure management server 16 can acquire workaround information and register it in DB2 or update DB2 with it. In one example of automatic processing, the failure management server 16 consults the prohibited processes published by the vendor and checks whether a workaround exists for each one. When a workaround exists, it acquires the workaround information and registers it in DB2 or updates DB2 with it. In another example, the failure management server 16 automatically receives the prohibited processes provided by the vendor. When a workaround exists for a received process, it registers the workaround information in DB2 or updates DB2 with it.
Among the items in the failure content list, the recovery method (described later) is a measure for a process that was stopped because its failure materialized. It allows that process to be resumed (executed temporarily) without stopping the application target system. When a failure caused by a process with a high likelihood of failure occurs, the running process is forcibly stopped. When this happens, it is preferable to lower the likelihood of failure and resume the process without stopping the application target system. As an example, a shell program that can be executed without restarting the application target system can serve as the recovery method. Like the workaround, the recovery method is acquired together with the latest patch and its patch information. For example, the shell program serving as the recovery method is one that the software developer provides along with the patch for the process. That is, the failure management server 16 acquires the recovery method at the same time as it acquires the latest patch and its patch information. It then registers the recovery method in DB2 or updates DB2 with it.
The patch number item in the failure content list defines the number of the patch that corrects the failure. This patch number may be assigned by the failure management server 16, or a predetermined number may be used.
The failure management server 16 manages the failure content of each failure-causing process by storing the data corresponding to each failure number in the failure content list (DB2). For example, in the list in FIG. 3, the failure content for failure number "0001" is "panic", and this failure is defined to occur when failure occurrence conditions 1 to 3 are met. FIG. 3 shows an example in which failure number "0001" has neither a workaround nor a recovery method. Failure occurrence conditions 1 to 3 are described in detail later. In the example of DB1 in FIG. 2 and DB2 in FIG. 3, the "panic" failure occurs when condition 1 "J1_0001", condition 2 "J2_0001", and condition 3 "J3_0001" are all satisfied. For failure number "0001", therefore, starting the "fjpmgadd" process with the "-a" option is highly likely to cause a "panic" in the system.
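The FIG. 3 example can be sketched as a DB2 record whose conditions are combined with a logical AND. The `Db2Row` type and `failure_predicted` function are hypothetical. The patch number is left as `None` because its FIG. 3 value is not reproduced in the text. `check` stands for whatever evaluates one condition management number against DB3 to DB5.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Db2Row:
    failure_no: str
    content: str                  # e.g. "panic"
    workaround: Optional[str]     # None: no workaround exists
    recovery: Optional[str]       # None: no recovery method exists
    patch_no: Optional[str]       # not given in the text for "0001"
    conditions: Tuple[str, ...]   # condition management numbers in DB3 to DB5


DB2: Dict[str, Db2Row] = {
    "0001": Db2Row("0001", "panic", None, None, None,
                   ("J1_0001", "J2_0001", "J3_0001")),
}


def failure_predicted(row: Db2Row, check: Callable[[str], bool]) -> bool:
    """A failure is predicted only if every listed condition is satisfied."""
    return all(check(cond_id) for cond_id in row.conditions)
```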
Next, the failure occurrence conditions among the items in the failure content list are described.
The failure occurrence conditions in the failure content list indicate the conditions under which a failure occurs. These software failure conditions are classified and managed in DB3 to DB5.
A software failure is considered to occur under certain conditions, that is, when the running application target system enters a specific state. At development time, the software developer assumes a particular operating state of the application target system during execution of the process. A departure from that assumed state is considered likely to cause a failure. The likelihood of a failure can therefore be determined by ascertaining the current operating state of the running application target system. This operating state can be ascertained in several ways: "from the output of commands defined by the system", "from the contents of files defined by the system", and "from the results of executing a predetermined program on the system". For each process, the failure management server 16 lists and manages the conditions under which the application target system is in such a specific state. In the present embodiment, conditions ascertained from the output of commands defined by the system are called condition 1 and are listed and managed as DB3. Conditions ascertained from the contents of files defined by the system are called condition 2 and are managed as DB4. Conditions ascertained from the results of executing a predetermined program on the system are called condition 3 and are managed as DB5.
The failure management server 16 registers condition information in DB3 to DB5, whether the information is entered by the user or collected by automatic system processing. It also updates DB3 to DB5 based on that information.
DB3 stores a list of conditions for determining the likelihood of a failure from the output of commands defined by the system. A DB3 condition is ascertained from the output of such commands and is referred to as failure condition 1. That is, condition 1 uses the output of a given command on the system to determine whether the system's operating state is a specific state with a high likelihood of failure.
As shown in FIG. 4, DB3 (the list of condition 1 entries) can be a list with three items: a condition management number corresponding to the condition 1 entry in DB2, the information collected for the determination (command output), and the expected value of that information. In condition 1 of the present embodiment, the information collected to determine the likelihood of failure is the output of commands, with their options, defined by the system. The expected value is the output that the developer assumed for those commands at development time. Condition 1 is the combination of the acquired operating state of the system and the expected value assumed by the developer, and this information is collected to determine the likelihood of failure. The likelihood of failure can therefore be determined by comparing the command output with the expected-value character string.
For example, in the list in FIG. 4, the failure occurrence condition with condition management number "J1_0001" requires three things: the OS version is "5.10", the platform name is "sun4u", and patch "123456-01" has been applied. Specifically, a failure is predicted to occur when the output of the "uname -r" command is "5.10", the output of the "uname -i" command is "sun4u", and the output of "patchadd -p | grep 123456-01" is "Patch: 123456-01 Obsoletes: Requires: Incompatibles: Packages: SUNWfctl".
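Condition J1_0001 can be sketched as a table of (command, expected output) pairs that must all match. The commands and expected strings are the ones quoted above. The names `run_shell` and `condition1_met`, and the whitespace normalisation before comparison, are assumptions added for illustration. The command runner is injectable so that the logic can be exercised without a Solaris host.

```python
import subprocess
from typing import Callable, Sequence, Tuple

CommandCheck = Tuple[str, str]  # (shell command, expected output)

J1_0001: Sequence[CommandCheck] = [
    ("uname -r", "5.10"),
    ("uname -i", "sun4u"),
    ("patchadd -p | grep 123456-01",
     "Patch: 123456-01 Obsoletes: Requires: Incompatibles: Packages: SUNWfctl"),
]


def run_shell(cmd: str) -> str:
    # shell=True is needed because the patchadd check uses a pipe.
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return proc.stdout


def _normalize(text: str) -> str:
    # Assumption: compare with runs of whitespace collapsed to single spaces.
    return " ".join(text.split())


def condition1_met(checks: Sequence[CommandCheck],
                   run: Callable[[str], str] = run_shell) -> bool:
    """True only if every command's output matches its expected value."""
    return all(_normalize(run(cmd)) == _normalize(expected) for cmd, expected in checks)
```

On a host where a command is missing, its output is empty, so the condition is simply reported as not met.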
DB4 stores a list of conditions for determining the likelihood of failure from the contents of files defined by the system. These are referred to as failure condition 2. That is, condition 2 uses the contents of files that may be rewritten as the system runs, such as text files placed on the system, to determine whether the system's operating state is a specific state with a high likelihood of failure.
As shown in FIG. 5, DB4 (the list of condition 2 entries) can be a list with three items: condition management number, the information collected for failure determination (files), and the expected value of that information. In condition 2 of the present embodiment, the information collected for failure determination is a file placed on the system (for example, a text file). The expected value is the file content that the developer assumed at development time. Condition 2 is the combination of the acquired operating state of the system and that expected value, and this information is collected to determine the likelihood of failure. The likelihood of failure can therefore be determined by comparing the file contents with the expected-value character string.
One example of a text file placed on the system is a definition file specified by the system or by software. When a definition file is specified as a file placed on the system, its contents are compared with the expected-value character string to determine the likelihood of failure. The comparison can use a command such as diff, which outputs data indicating whether the inputs match. The output of diff shows whether the contents of the file acquired from the running system differ from the expected-value character string in DB4.
For example, in the list in FIG. 5, the failure occurrence condition with condition management number "J2_0001" requires the contents of three files to match the expected values "File1", "File2", and "File3", respectively. Specifically, the content of the file "/var/opt/FJSVpmgw/reg/.user" must be the character string "File1". The content of the file "/var/opt/FJSVpmgw/etc/.config" must be "File2", and the content of the file "/var/opt/FJSVpmgw/etc/.role" must be "File3". A failure is predicted to occur only when all three contents match their expected values.
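The file comparison can be sketched with Python's `difflib.unified_diff`, which is used here as a stand-in for the diff command. As with diff, an empty result means that the file matches its expected value. The names `file_diff` and `condition2_met` are hypothetical. The comparison is exact (including trailing newlines), and a missing file is treated as not matching; both rules are assumptions.

```python
import difflib
from pathlib import Path
from typing import Mapping


def file_diff(path: Path, expected: str) -> str:
    """Return a unified diff of expected vs. actual content; '' when identical."""
    actual = path.read_text()
    lines = difflib.unified_diff(
        expected.splitlines(keepends=True),
        actual.splitlines(keepends=True),
        fromfile="expected",
        tofile=str(path),
    )
    return "".join(lines)


def condition2_met(expected_by_path: Mapping[Path, str]) -> bool:
    # J2_0001-style condition: every file must exist and match its expected value.
    return all(p.exists() and file_diff(p, exp) == ""
               for p, exp in expected_by_path.items())
```

For J2_0001, the mapping would be `Path("/var/opt/FJSVpmgw/reg/.user")` to `"File1"`, `Path("/var/opt/FJSVpmgw/etc/.config")` to `"File2"`, and `Path("/var/opt/FJSVpmgw/etc/.role")` to `"File3"`.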
DB5 stores, as information, a list of conditions used to determine the occurrence of a failure from the execution result of a program in the system. The condition held in DB5 is referred to as failure occurrence condition 3. In other words, condition 3 is a condition that identifies, from information that includes information arising from the operating state of the system, a specific state in which a failure is highly likely in that operating state.
In some cases, the operating state of the system itself is likely to cause a failure. No failure occurs while the system is operating stably, but a failure becomes likely when the system runs under high load. That is, the likelihood of a failure can change depending on the operating state of the system at the moment a process is about to be executed. Such cases cannot be determined by comparing collected information with an expected value, as is done in condition 1 ("determined from the output of a command defined by the system") and condition 2 ("determined from the contents of a file defined by the system"). Specifically, unlike conditions 1 and 2, an expected value cannot simply be defined, so whether the condition is met cannot be decided from whether a collected value matches an expected value. When an expected value cannot be defined, a predetermined program can be used as the logic for determining the condition, and a shell script written by a developer can serve as this program. It is therefore preferable to prepare such a program in advance to capture the changing operating state of the system.
As shown in FIG. 6, DB5, the list for failure occurrence condition 3, can be a list that includes two items: a condition management number corresponding to the condition 3 item in DB2, and a determination method that serves as the information collected for the determination. For condition 3 in this embodiment, the information collected for the determination is a shell script written by the developer. The possibility of a failure can therefore be determined from the execution result of the shell script stored as the determination method.
An example of determining the possibility of a failure with a shell script is as follows. Assume that the work area used by a started process is not at its default location but has been moved by user customization. Also assume that the data the process uses as input is variable and specified by the user. Further assume that if the work area used by the process is smaller than a value predetermined from the size of the input data, the process fails with an unexpected error, resulting in a failure. When the failure occurrence condition is this complex and cannot be judged against an expected value, the failure can be determined from the execution result of a shell script. For example, a shell script can be written to follow a convention such as "exit code 1 if the condition is met, exit code 0 if it is not," and a failure is then determined by checking the exit code.
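The text leaves the script itself to the developer. Purely as an illustration, the check below expresses the work-area example in Python rather than shell and keeps the exit-code convention (1 = condition met, 0 = not met). The threshold rule (`REQUIRED_FACTOR` times the input size) and all names are assumptions.

```python
from __future__ import annotations

import os
import shutil

# Hypothetical rule: the work area must have at least REQUIRED_FACTOR times
# the input size free. The real threshold is whatever the developer defines.
REQUIRED_FACTOR = 2


def work_area_too_small(work_dir: str, input_path: str, factor: int = REQUIRED_FACTOR) -> bool:
    """True when free space in work_dir is below factor * size of the input data."""
    free_bytes = shutil.disk_usage(work_dir).free
    input_bytes = os.path.getsize(input_path)
    return free_bytes < factor * input_bytes


def main(argv: list[str]) -> int:
    """Exit-code convention from the text: 1 = condition met, 0 = not met.

    argv[1] is the (user-customized) work area, argv[2] the input data file.
    """
    work_dir, input_path = argv[1], argv[2]
    return 1 if work_area_too_small(work_dir, input_path) else 0
```

To run this as a standalone program, the result would be returned to the caller with `sys.exit(main(sys.argv))`.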
For example, in the list shown in FIG. 6, when shell program a (an example is shown as code 36 in FIG. 7) is executed and its exit code is the character string "1", it can be determined that the condition with condition management number "J3_0001" applies.
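The client side could evaluate a FIG. 6 row by running the registered program and comparing its exit code, as a string, with the expected value. `ProgramCondition` and the way the command is stored are hypothetical. The program is launched with the standard `subprocess.run`.

```python
from __future__ import annotations

import subprocess
from dataclasses import dataclass


@dataclass
class ProgramCondition:
    """Hypothetical in-memory form of one FIG. 6 row (condition list 3 / DB5)."""
    condition_id: str        # e.g. "J3_0001"
    command: list[str]       # how to launch the developer's program ("shell program a")
    expected_exit_code: str  # the list stores the expected value as a string, e.g. "1"


def program_condition_met(cond: ProgramCondition, timeout: float = 30.0) -> bool:
    """Run the program and compare its exit code with the expected value."""
    result = subprocess.run(cond.command, capture_output=True, timeout=timeout)
    return str(result.returncode) == cond.expected_exit_code
```

For J3_0001 the command might be something like `["sh", "/path/to/shell_program_a.sh"]`. That path is an invented placeholder.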
As described above, the databases DB1 to DB5 are built on the failure management server 16. The information in these databases is updated when a new failure occurs and the conditions under which it occurs have been identified. The databases are updated either by developers each time or by acquiring known information over a network.
(Failure prevention processing)
Next, the failure occurrence prevention process will be described, focusing on the processing executed by the client terminal 14.
FIG. 8 is an explanatory diagram of the flow of the failure occurrence prevention process according to the present embodiment. The management server, i.e., the failure management server 16, holds a list 40 classified by process. The list contains the conditions of the system operating state under which a failure occurs and information about the patch that prevents the failure. The management server also holds patches 42.
In the application target system 14, i.e., the client terminal 14, when a process is started, the grasping unit 46 grasps the operating state of the application target system, and the reference unit 44 refers to the list classified by process. From the referenced list, the specifying unit 48 specifies the conditions of patches not yet applied for the started process. The diagnosis unit 50 diagnoses the possibility of a failure of the started process from the conditions specified by the specifying unit 48 and the operating state grasped by the grasping unit 46. If the diagnosis shows that a failure of the started process is likely, the management unit 52 stops the process. The acquisition unit 54 then acquires the unapplied patch corresponding to the process stopped by the management unit 52.
Accordingly, when the application target system starts a process that realizes a software function, the reference unit 44 refers to the list at that moment and the grasping unit 46 grasps the operating state of the application target system. The diagnosis unit 50 then diagnoses whether any failure occurrence condition applies to that operating state. If the diagnosis shows that the started process is likely to fail, the management unit 52 stops the process and the acquisition unit 54 acquires the patch. The acquired patch can be applied to the system at any time. In this way, whether a started process is likely to cause a failure can be determined from the operating state of the application target system, and failures can be prevented in a manner suited to that operating state. In addition, when a failure caused by the started process is likely, the patch that resolves it is acquired so that it can be applied to the system at any time. Candidate patches are therefore acquired only when a started process is likely to fail, and there is no need to sift through patches.
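The interplay of the units in FIG. 8 can be sketched as a single function. This is a simplified model under stated assumptions. Each entry of list 40 carries one patch ID and a failure condition, expressed as a predicate over a dictionary that describes the operating state. The grasping, stopping, and patch-fetching steps are injected callables. None of these names come from the source.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Entry:
    """One process-keyed entry of list 40 (simplified, hypothetical fields)."""
    process: str
    patch_id: str
    condition: Callable[[dict], bool]  # failure condition evaluated on the operating state


def on_process_start(
    process: str,
    list40: Iterable[Entry],
    applied_patches: set[str],
    grasp_state: Callable[[], dict],      # stands in for grasping unit 46
    stop_process: Callable[[str], None],  # stands in for management unit 52
    fetch_patch: Callable[[str], None],   # stands in for acquisition unit 54
) -> bool:
    """Return True if the process may run, False if it was stopped."""
    # Reference unit 44 + specifying unit 48: entries whose patch is not yet applied.
    pending = [e for e in list40 if e.process == process and e.patch_id not in applied_patches]
    if not pending:
        return True
    state = grasp_state()
    # Diagnosis unit 50: which pending failure conditions does the current state meet?
    hits = [e for e in pending if e.condition(state)]
    if not hits:
        return True
    stop_process(process)
    for e in hits:
        fetch_patch(e.patch_id)
    return False
```

In this sketch, the operating state is collected only after an unapplied entry has been found, which avoids the collection cost for processes that are not on the list.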
Next, the processing executed on the client terminal 14 will be described.
FIG. 9 is a flowchart showing the processing executed by the client terminal 14.
In the client terminal 14, which is the application target system, a process start may be requested while the system is running. The processing routine of FIG. 9 is then executed, and reference processing is performed in step 100. The reference processing refers to the prohibited process list, classified by process, illustrated as DB1. Step 100 includes searching whether the process requested to start is registered (has an entry) in the prohibited process list. Alternatively, the reference processing may obtain the latest patch information from the management server, i.e., the failure management server 16, store it, and refer to the prohibited process list included in the stored information.
In the reference processing, the conditions related to patches not yet applied for the process requested to start, that is, the failure occurrence conditions, can be specified from the referenced list.
When the process requested to start is registered (has an entry) in the prohibited process list, step 102 is affirmative and processing proceeds to step 104. When it is not registered, step 102 is negative. In that case the process requested to start has a low possibility of failure and nothing prevents it from running, so in step 124 processing proceeds to execute the process and this routine ends.
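Steps 100 and 102 amount to a lookup keyed by process. The sketch below models a DB1 row with the columns given for the prohibited process list (No., process name, option, failure number). Several details are assumptions: an empty option is treated as "any option", options are matched against the command-line arguments, and the sample process names and failure numbers in the test are invented.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProhibitedEntry:
    """One row of the prohibited process list (DB1): No., process, option, failure number."""
    no: int
    process: str
    option: str      # "" is assumed to mean "any option"
    failure_no: str


def find_prohibited_entries(
    prohibited: list[ProhibitedEntry], process: str, args: list[str]
) -> list[ProhibitedEntry]:
    """Return the entries registered for the process being started (step 100)."""
    return [
        e for e in prohibited
        if e.process == process and (not e.option or e.option in args)
    ]
```

An empty result corresponds to step 102 being negative, so the process proceeds to step 124. A non-empty result leads on to step 104, and its failure numbers identify the failure contents to look up later.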
In step 104, grasping processing is executed, which collects information for the diagnosis processing described later. The grasping processing grasps the operating state of the application target system at the time the process start is requested. For example, it performs "determination from the output of a command defined by the system" or "determination from the contents of a file defined by the system." It can also perform "determination from the execution result of a predetermined program in the system."
Next, in step 106, diagnosis processing is executed. For the process requested to start, the diagnosis processing diagnoses the possibility of a failure from the failure occurrence conditions and the operating state of the application target system grasped by the grasping processing. If the diagnosis shows that a failure caused by the process is likely, step 108 is affirmative and processing proceeds to step 110. If a failure is unlikely, step 108 is negative and processing proceeds to step 124.
In step 110, the process requested to start is stopped because it is likely to fail. In the next step 112, notification/addition processing is executed. This processing notifies the user, by patch number (such as a patch ID), that a patch exists that prevents the failure. The notification may be shown on a display or delivered by another means of notifying the user, and it can also be omitted. In addition, to make it easier to apply patches during the system maintenance described later, the patch may be acquired (downloaded) from the failure management server 16 and stored in the pool area 26, and its patch number may be added to the application candidate patch list.
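Step 112 could look roughly like the following. The pool area 26 is modelled as a local directory and the application candidate patch list as a Python list. `download` stands in for whatever transfer the failure management server 16 actually provides, and the `.patch` file naming is invented.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable


def notify_and_add(
    patch_id: str,
    pool_dir: Path,
    candidates: list[str],
    download: Callable[[str], bytes],       # hypothetical fetch from failure management server 16
    notify: Callable[[str], None] = print,  # display or other user notification (optional)
) -> Path:
    """Step 112 sketch: tell the user, store the patch in the pool area, record it as a candidate."""
    notify(f"A patch that prevents this failure is available: {patch_id}")
    pool_dir.mkdir(parents=True, exist_ok=True)
    target = pool_dir / f"{patch_id}.patch"
    if not target.exists():         # skip re-downloading a patch already in the pool
        target.write_bytes(download(patch_id))
    if patch_id not in candidates:  # keep the candidate list free of duplicates
        candidates.append(patch_id)
    return target
```

The two guards matter because the same process may be stopped repeatedly before the next maintenance window.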
In the next step 114, life extension processing starts. Step 114 refers to DB2 to check whether a workaround exists for the process that is likely to fail. If a workaround exists, step 116 is affirmative and the workaround stored in the failure content list is executed in step 118. That is, without stopping the application target system, measures are taken to lower the possibility of a failure so that the process can be executed temporarily. In the next step 120, processing proceeds to resume the process stopped in step 110. The process is executed and this routine ends.
When DB2 contains no workaround (step 116 negative), notification/addition processing is executed in step 122. This processing notifies the user that the process likely to fail has been stopped. The notification may be shown on a display or delivered by another means of notifying the user, and it can also be omitted.
It is also preferable to inform the user that a process likely to fail has been stopped. In this embodiment, information about a process stopped because it was likely to fail can be added to a process interruption list, so that the information can be provided during the system maintenance described later or at any time the user chooses. The process interruption list is a database that stores, as information, a list of items such as the names of processes stopped because they were likely to fail. If the user only needs to be told which processes were stopped, the list need contain no more than the name of each interrupted process. More items can be added depending on what information about the interrupted processes is to be provided. For example, like the prohibited process list shown in FIG. 1, the list can be classified by interrupted process and contain the items list item number (No.), name of the interrupted process, option, and failure number. The database storage area 15 of the client terminal 14, which is the application target system, can include a database that stores this process interruption list.
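Below is a minimal in-memory model of the process interruption list, using the same columns as the prohibited process list example. Sequential numbering of `No.` and the class names are assumptions. The text only says that the list may be held as a database in the database storage area 15.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class InterruptedProcess:
    """One row of the process interruption list, modelled on the prohibited process list columns."""
    no: int
    process: str
    option: str
    failure_no: str


@dataclass
class InterruptionList:
    rows: list[InterruptedProcess] = field(default_factory=list)

    def record(self, process: str, option: str, failure_no: str) -> InterruptedProcess:
        """Append a stopped process. No. is assigned sequentially (an assumption)."""
        row = InterruptedProcess(len(self.rows) + 1, process, option, failure_no)
        self.rows.append(row)
        return row

    def names(self) -> list[str]:
        """Minimal view: just the process names, enough to tell the user what was stopped."""
        return [r.process for r in self.rows]
```

`names()` corresponds to the minimal variant that only reports which processes were stopped. The full rows support the richer variant.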
Next, the system maintenance executed by the client terminal 14, which is the application target system, will be described.
FIG. 10 is a flowchart showing the system maintenance processing executed by the client terminal 14. The system maintenance processing is executed at a predetermined time or at any time specified by the user. It stops system operation and performs system maintenance. The case described here is one in which system operation is stopped and only the minimum necessary patches are applied, based on the application candidate patch list and the process interruption list.
When the predetermined time or the time specified by the user arrives, the system maintenance processing is executed and patch application processing is performed in step 130. In the example of FIG. 10, the patch application processing applies all patches registered in the application candidate patch list in one batch. Patches are thereby applied for both kinds of process: those that received life extension processing and those that were stopped for lack of a workaround. This stabilizes the system.
In another example of step 130, the process interruption list is presented to the user. The user selects only the processes whose patches should be applied, and the selected patches are applied.
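Both variants of step 130 fit one function: it either applies every candidate in a batch or applies only the user's selection. The sketch assumes that the user's choice of processes from the interruption list has already been translated into patch IDs. It also assumes that applied patches are removed from the candidate list. `apply` is a hypothetical hook.

```python
from __future__ import annotations

from typing import Callable, Iterable


def apply_patches(
    candidates: list[str],
    apply: Callable[[str], None],           # hypothetical hook that installs one patch from the pool
    selected: Iterable[str] | None = None,
) -> list[str]:
    """Step 130 sketch.

    With selected=None, every candidate is applied in one batch (the FIG. 10 example).
    Otherwise only the user's selection, restricted to known candidates, is applied.
    Returns the patch IDs that were applied.
    """
    if selected is None:
        wanted = list(candidates)
    else:
        chosen = set(selected)
        wanted = [p for p in candidates if p in chosen]
    for patch_id in wanted:
        apply(patch_id)
    # Applied patches leave the candidate list; unselected ones stay for a later window.
    candidates[:] = [p for p in candidates if p not in wanted]
    return wanted
```

Patches the user did not select remain on the candidate list and can be applied in a later maintenance window.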
In this way, the system is diagnosed every time a process is started during system operation, and the minimum necessary patches that are candidates for application to the target system are extracted for system maintenance. The diagnosis at process start uses the "prohibited process list," "failure content list," and "condition list" obtained from the server. From these, it determines whether continuing the started process would put the operating state of the application target system into a failure occurrence condition. If the started process is likely to fail, it is stopped, a workaround is applied, and the process is resumed. If there is no workaround, the process remains stopped. The candidate patches are then applied in one batch during system maintenance, or they can be applied selectively by the user. Failures can thus be prevented before they occur.
(Diagnosis processing)
Next, the diagnosis processing executed by the client terminal 14 will be described in detail. FIG. 11 is a flowchart showing the detailed flow centered on the diagnosis processing executed by the client terminal 14. The processing routine of FIG. 11 details steps 104 to 110 and step 124 of the processing flow shown in FIG. 9.
In the client terminal 14 (application target system), when a process start is requested, reference processing is executed in step 200. If the process requested to start is not registered (has no entry) in the prohibited process list (step 202 negative), then in step 204 (as in step 124 of FIG. 9) processing proceeds to execute the process and this routine ends. Step 100 of FIG. 9 differs from step 200 of FIG. 11 in that, in FIG. 11, the failure occurrence conditions for the process requested to start are specified in a later step. In other words, step 200 only confirms, from the search result, that the process requested to start is registered (has an entry) in the prohibited process list.
If the process requested to start is registered (has an entry) in the prohibited process list (step 202 affirmative), processing proceeds to step 206. In step 206, the failure content list (DB2) is referenced to identify the relevant failure content, and the conditions corresponding to that failure content are extracted from DB3, DB4, and DB5. That is, step 206 specifies, from the referenced lists, the conditions related to patches not yet applied for the process requested to start (the failure occurrence conditions).
In the next step 208, first condition processing is executed. The first condition processing collects, by executing a command, the information defined for a condition management number (for example, J1_0001) in condition list 1 (DB3), and compares it with the expected value. First, as the grasping processing that collects information for the diagnosis, the operating state of the application target system at the time the process start was requested is grasped. This yields the information for the condition management number (step 210). The first condition processing "determines from the output of a command defined by the system" and corresponds to the grasping processing in step 104 of FIG. 9. In step 210, the command is executed to obtain its result value. Next, the expected value for the condition management number is acquired (step 212). The possibility of a failure for the first condition is diagnosed according to whether the information obtained in step 210 matches the expected value acquired in step 212.
In the next step 214, whether a failure is likely is determined by checking whether the result value of the command matches the expected value. If it does not match, a failure is not likely, so step 214 is negative and processing proceeds to step 204. If it matches, a failure is likely, so step 214 is affirmative and processing proceeds to step 216.
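Steps 210 to 214 could be sketched as follows. The sketch assumes that the condition compares a command's standard output with the expected value after trimming surrounding whitespace. The function name is hypothetical. The command and expected value would come from condition list 1 (DB3).

```python
from __future__ import annotations

import subprocess


def command_output_matches(command: list[str], expected: str, timeout: float = 30.0) -> bool:
    """First condition processing, sketched.

    Runs the command (step 210), compares its standard output with the
    expected value from condition list 1 (step 212), and returns True when
    they match, i.e. when step 214 would be affirmative. Ignoring surrounding
    whitespace is an assumption.
    """
    result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return result.stdout.strip() == expected.strip()
```

A False result corresponds to the branch to step 204, where the process is allowed to run.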
In step 216, second condition processing is executed. The second condition processing collects the information defined for a condition management number (for example, J2_0001) in condition list 2 (DB4) and compares it with the expected value. First, as the grasping processing, the operating state of the application target system at the time the process start was requested is grasped. This yields the information defined for the condition management number (step 218). The second condition processing "determines from the contents of a file defined by the system," so it reads the file contents to obtain their value. Next, the expected value corresponding to the condition management number is acquired (step 220). The possibility of a failure for the second condition is diagnosed according to whether the information obtained in step 218 matches the expected value acquired in step 220.
In the next step 222, whether a failure is likely is determined by checking whether the value obtained from the file contents for the condition management number matches the expected value. If it does not match, a failure is not likely, so step 222 is negative and processing proceeds to step 204. If it matches, a failure is likely, so step 222 is affirmative and processing proceeds to step 224.
In step 224, third condition processing is executed. The third condition processing collects, by executing a shell script or similar program, the information defined for a condition management number (for example, J3_0001) in condition list 3 (DB5), and compares it with the expected value. First, as the grasping processing, the operating state of the application target system at the time the process start was requested is grasped. This yields the information for the condition management number (step 226). The third condition processing "determines from the execution result of a predetermined program in the system"; here, it obtains the result value (exit code) of executing the shell script. Next, the expected value for the condition management number is acquired (step 228). The possibility of a failure for the third condition is diagnosed according to whether the information obtained in step 226 matches the expected value acquired in step 228.
In the next step 230, whether a failure is likely is determined by checking whether the result value of executing the shell script or similar program matches the expected value. If it does not match, a failure is not likely, so step 230 is negative and processing proceeds to step 204. If it matches, a failure is likely, so step 230 is affirmative and processing proceeds to step 232, where the process requested to start is stopped. The process is therefore stopped only when steps 214, 222, and 230 are all affirmative, that is, when a failure is judged likely for each of the first to third conditions.
In the present embodiment, the process requested to start is diagnosed as likely to cause a failure when the first to third condition processes all match their expected values. The invention is not limited to this, however. The process may instead be diagnosed as likely to cause a failure when the value obtained in at least one of the first to third condition processes matches its expected value.
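The two combination rules differ only in the quantifier: FIG. 11 requires all conditions to match, while the variant just described requires at least one. In this sketch, each condition process is a zero-argument callable that returns True when its collected value matches the expected value. Python's `all` and `any` stop early, much like the early exits to step 204. The names are hypothetical.

```python
from __future__ import annotations

from typing import Callable, Sequence

Check = Callable[[], bool]  # one condition process: collect info, compare with expected value


def diagnose(checks: Sequence[Check], require_all: bool = True) -> bool:
    """True means a failure is likely, so the process should be stopped.

    require_all=True mirrors FIG. 11: steps 214, 222 and 230 must all be
    affirmative, and the first negative result lets the process run.
    require_all=False is the variant where one matching condition suffices.
    """
    if require_all:
        return all(check() for check in checks)  # stops at the first non-matching condition
    return any(check() for check in checks)      # stops at the first matching condition
```

`all` returns True for an empty sequence. A caller should therefore make sure that step 206 extracted at least one condition before treating the result as grounds to stop the process.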
(Life extension processing)
Next, the processing that starts after the process is stopped will be described.
FIG. 12 is a flowchart showing the detailed flow executed by the client terminal 14, focusing on the life extension processing and the steps that follow it. The processing routine of FIG. 12 details steps 114 to 122 of the processing flow shown in FIG. 9. The processing of FIG. 12 executes a workaround so that system operation continues for as long as possible.
When the process is stopped at the client terminal 14 (the application target system) because a failure has been diagnosed as likely, the processing routine of FIG. 12 is executed. When the process is stopped, the client terminal 14 refers to the failure content list (DB2), acquires (downloads) from the failure management server 16 a patch that resolves the failure of the stopped process, and stores it in the pool area 26. The user is also notified of the patch number, which is added to the application candidate patch list (step 112 in FIG. 9).
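The patch pooling performed when a process is stopped can be sketched as below. This is a minimal in-memory sketch under stated assumptions: the failure content list (DB2) is reduced to a mapping from process name to patch number, the download from the failure management server 16 is a caller-supplied function, and `PatchPool` and `on_process_stopped` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PatchPool:
    """Hypothetical client-side state: the pool area 26 and the candidate patch list."""
    pool: Dict[str, bytes] = field(default_factory=dict)  # patch number -> patch data
    candidates: List[str] = field(default_factory=list)   # application candidate patch list


def on_process_stopped(
    process: str,
    failure_content_list: Dict[str, str],  # DB2 modeled as process -> patch number (assumption)
    download: Callable[[str], bytes],      # stands in for fetching from server 16
    notify: Callable[[str], None],         # stands in for notifying the user
    state: PatchPool,
) -> None:
    patch_no = failure_content_list.get(process)
    if patch_no is None:
        return  # no patch registered for this process
    if patch_no not in state.pool:
        state.pool[patch_no] = download(patch_no)  # store the patch in the pool area
    notify(f"patch {patch_no} pooled for stopped process {process}")
    if patch_no not in state.candidates:
        state.candidates.append(patch_no)          # step 112: add to the candidate list
```

The patch is only pooled here, not applied; consistent with the text, application is left to a later maintenance window.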
First, in step 240, confirmation processing is executed. It refers to the failure content list (DB2) to check whether a workaround (shell program) corresponding to the stopped process is registered in the list. If a workaround for the stopped process is registered (Yes in step 242), the workaround is executed in step 244 (as in step 118 of FIG. 9) to extend the life of the process. In the next step 246 (as in step 120 of FIG. 9), the stopped process is restarted and this routine ends.
If no workaround for the stopped process is registered in the failure content list (No in step 242), the flow proceeds to step 248 and startup diagnosis processing is executed. The startup diagnosis processing runs the diagnosis processing shown in FIG. 11. A change in the operating status of the application target system (for example, its CPU load) may make it possible to execute the stopped process. Step 248 therefore attempts the startup diagnosis processing again to find out whether the process can be restarted under the current operating state. The startup diagnosis processing can be retried up to a number of times defined in advance by the user; this retry count can be defined as information in the operating environment of the client terminal 14.
If, as a result of the diagnosis in step 248, the process is to remain stopped (No in step 250), it is determined whether the number of retries has exceeded the defined number. If it has not (No in step 254), the retry count is incremented and step 248 is executed again. If, as a result of the diagnosis in step 248, the process is to be executed (Yes in step 250), then in step 252 the stopped process is deleted from the process interruption list if it is registered there, and this routine ends.
When a retry beyond the defined number of times would be attempted (Yes in step 254), retrying ends, the process is registered in the process interruption list (step 256), the user is notified (step 258), and this routine ends.
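The life extension routine of FIG. 12 (steps 240 to 258) can be sketched as follows. This is a hedged sketch, not the patented implementation: the workaround lookup, the startup diagnosis, restart and notification are passed in as callables, `max_retries` stands in for the user-defined retry count (its default of 3 is an arbitrary assumption), and all names are hypothetical.

```python
from typing import Callable, Dict, List


def life_extension(
    process: str,
    workarounds: Dict[str, Callable[[], None]],  # DB2 modeled as process -> workaround (assumption)
    diagnose_ok: Callable[[str], bool],          # startup diagnosis of FIG. 11; True = may run
    restart: Callable[[str], None],
    interruption_list: List[str],                # process interruption list
    notify: Callable[[str], None],
    max_retries: int = 3,                        # user-defined retry count (hypothetical default)
) -> str:
    workaround = workarounds.get(process)        # steps 240/242
    if workaround is not None:
        workaround()                             # step 244: apply the workaround
        restart(process)                         # step 246: restart the stopped process
        return "restarted"

    retries = 0
    while True:
        if diagnose_ok(process):                 # steps 248/250
            if process in interruption_list:     # step 252
                interruption_list.remove(process)
            return "runnable"
        if retries >= max_retries:               # step 254: next retry would exceed the limit
            if process not in interruption_list:
                interruption_list.append(process)  # step 256
            notify(f"{process} registered in the process interruption list")  # step 258
            return "interrupted"
        retries += 1                             # step 254 No: increment and retry
```

The diagnosis therefore runs once plus up to `max_retries` more times before the process is parked in the interruption list.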
As described above, in this embodiment, when a request is made to start a process that implements a software function, the list is referred to at that moment to diagnose whether the requested process meets the failure occurrence conditions. If the diagnosis shows that the request is for a process that is likely to cause a failure, execution of that process is stopped. Processes likely to cause known failures are thus stopped. As a result, the possibility of a failure can be diagnosed according to the operating state of the system, and failures can be prevented before they occur in the system in operation.
In addition, because only the minimum necessary patches are applied in this embodiment, the man-hours for system maintenance can be reduced.
(Second Embodiment)
Next, a second embodiment will be described. Because this embodiment has substantially the same configuration as the embodiment above, identical parts are given the same reference numerals and their detailed description is omitted.
Software generally contains, at the time of development, unknown defects that may cause failures. When such a defect becomes apparent, it manifests as a bug (an error or defect in the program) that interferes with executing the software. When a bug becomes apparent and the failure becomes known, the vendor provides a patch, that is, data for preventing the software failure. Consequently, a failure caused by a bug that has not yet become apparent cannot be prevented even by applying patches. In the present embodiment, when a defect becomes known and it turns out that a process likely to cause a failure because of the buggy code is already running, that process is stopped and then restored as far as possible.
The items of the failure content list also store a recovery method. A recovery method is a measure for restarting a process without stopping the application target system after that process has been stopped because a newly apparent failure made a failure likely. That is, when a running process is likely to cause a failure, or a failure has occurred, the running process is forcibly stopped. When the process is stopped, it is preferable to reduce the possibility of a failure and restart the process without stopping the application target system. As one example, a shell program that can be executed without restarting the application target system can be used. Like the workaround, the recovery method is acquired together with the latest patches and patch information. That is, the shell program serving as a recovery method is one that the software developer provides together with the patch for the process.
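A failure content list entry carrying both a workaround and a recovery method, and the execution of such a shell program without restarting the system, could be modeled as below. This is only an illustrative sketch: the `FailureEntry` fields, the use of `/bin/sh`, the 60-second timeout and treating exit status 0 as success are all assumptions not stated in the source.

```python
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureEntry:
    """Hypothetical shape of one failure content list (DB2) record."""
    process: str
    patch_number: str
    workaround: Optional[str] = None  # shell program text used at process start
    recovery: Optional[str] = None    # shell program text used after a forced stop


def run_shell_program(script: str, timeout: float = 60.0) -> bool:
    """Run a shell program in a child process; the system itself keeps running.

    Success is assumed to mean exit status 0.
    """
    result = subprocess.run(
        ["/bin/sh", "-c", script],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0
```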
(Failure prevention processing)
Next, the failure occurrence prevention processing in this embodiment will be described.
FIG. 13 is an explanatory diagram showing the flow of the failure occurrence prevention processing according to the present embodiment. The management server, that is, the failure management server 16, uses an update unit 56 to update a list 40 classified by process, which includes the conditions of the system operating state under which a failure occurs and information on the patches that resolve the failure. Specifically, when a new failure becomes apparent, the management server registers a patch 42 that prevents the failure, together with its patch information. When the patch 42 and its patch information are registered, the management server updates the list 40 and notifies the client terminal 14. The list 40 includes DB1 to DB5.
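The server-side behavior (register a patch record, update list 40, notify client terminals) can be sketched with a simple listener pattern. This is an in-memory sketch under stated assumptions: the five lists are modeled as dictionaries keyed "DB1" to "DB5" holding generic per-process records, notification is a direct callback rather than a network message, and `FailureManagementServer`, `subscribe` and `register` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical listener signature: (list_version, name_of_updated_db) -> None
Listener = Callable[[int, str], None]


def _empty_dbs() -> Dict[str, Dict[str, Any]]:
    # List 40 contains DB1 to DB5; their record layouts are not modeled here.
    return {f"DB{i}": {} for i in range(1, 6)}


@dataclass
class FailureManagementServer:
    dbs: Dict[str, Dict[str, Any]] = field(default_factory=_empty_dbs)
    version: int = 0
    listeners: List[Listener] = field(default_factory=list)

    def subscribe(self, listener: Listener) -> None:
        """Register a client terminal to be notified of list updates."""
        self.listeners.append(listener)

    def register(self, db: str, process: str, record: Any) -> None:
        """Update unit 56: store a record (e.g. patch 42 and its information), then notify."""
        if db not in self.dbs:
            raise KeyError(f"unknown list: {db}")
        self.dbs[db][process] = record
        self.version += 1
        for listener in self.listeners:
            listener(self.version, db)
```

Passing the name of the updated DB lets a client decide, as in step 300, whether the update concerns the list it cares about.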
At the client terminal 14, a reception unit 58 receives the notification from the management server (the failure management server 16) indicating that the list 40 has been updated. When the notification is received, the grasping unit 46 grasps the operating state of the application target system, which includes extracting the running processes. The reference unit 44 then refers to the list. The specifying unit 48 determines whether each running process is included in the list 40 referred to by the reference unit 44 and, if it is registered, specifies the conditions for that process. The diagnosis unit 50 diagnoses the possibility of a failure for the running process from the conditions specified by the specifying unit 48 and the operating state grasped by the grasping unit 46. When the diagnosis unit 50 judges that a failure is likely for a running process, the management unit 52 stops that process. The application target system then uses an acquisition unit 54 to acquire the unapplied patch corresponding to the process stopped by the management unit 52.
Accordingly, when the client terminal 14 receives from the failure management server 16 a notification of a process in which a new failure has become apparent, it refers to the list 40 and diagnoses whether the operating state of the application target system meets the failure occurrence conditions. When a running process is diagnosed as likely to cause a failure, it is stopped, and a patch for the stopped process is acquired for the application target system. The acquired patch can be applied to the system at any convenient time. In this way, when a new failure becomes apparent, it can be determined from the operating state of the application target system whether a running process is likely to cause a failure, and such a process is stopped. Failures can therefore be prevented even for processes that are already running.
(Diagnosis processing)
Next, the diagnosis processing executed by the client terminal 14 of this embodiment will be described in detail.
FIG. 14 is a flowchart showing the detailed flow of the diagnosis processing executed by the client terminal 14. In this embodiment, it shows the processing executed when the client terminal 14 receives a notification from the failure management server 16. The processing routine of FIG. 14 is executed in place of the processing shown in FIG. 11.
When the client terminal 14 (application target system) receives from the failure management server 16 a notification that the list 40 has been updated, reference processing is executed in step 300. The list 40 includes DB1 to DB5, and the failure management server 16 notifies the client terminal 14 that the list 40 has been updated whenever any one of DB1 to DB5 is updated. In this embodiment, the processing in step 300 checks whether the prohibited process list has been updated. If the reference processing shows that the prohibited process list has not been updated (No in step 302), this routine ends. If, on the other hand, the prohibited process list has been updated, with a process newly registered or its contents changed (Yes in step 302), the flow proceeds to step 304, where the processes being executed (running) on the application target system are checked.
In the next step 306, one of the running processes is selected. In the next step 308, runtime diagnosis processing is executed. The runtime diagnosis processing of step 308 is the same as the startup diagnosis processing described for step 248 in FIG. 12: the diagnosis processing shown in FIG. 11 is carried out, treating the process selected in step 306 as the process requested to start. It differs from the processing of FIG. 11 only in steps 204 and 232. FIG. 11 targets a process whose start has been requested, whereas this embodiment targets a process that is already running. Therefore, within step 308 of FIG. 14, the processing corresponding to step 204 of FIG. 11 is skipped and the flow continues without any action, and the processing corresponding to step 232 of FIG. 11 forcibly stops the already running process.
When the runtime diagnosis processing is completed, the flow proceeds to step 310, which determines whether any running process remains for which the runtime diagnosis processing has not yet been executed. If such a process remains, the flow returns to step 306 and the runtime diagnosis processing is executed for the remaining running process.
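The runtime diagnosis routine of FIG. 14 (steps 300 to 310) reduces to a loop over a snapshot of the running processes. A minimal sketch, assuming the update check, the diagnosis and the forced stop are supplied as callables; `on_list_updated` and its parameters are hypothetical names.

```python
from typing import Callable, Iterable, List


def on_list_updated(
    prohibited_list_updated: bool,
    running_processes: Iterable[str],
    diagnose_likely: Callable[[str], bool],
    force_stop: Callable[[str], None],
) -> List[str]:
    """Sketch of the FIG. 14 routine; returns the processes that were forcibly stopped."""
    if not prohibited_list_updated:          # steps 300/302: no update, end the routine
        return []
    stopped: List[str] = []
    for process in list(running_processes):  # step 304 snapshot; steps 306/310 loop
        # Step 308: FIG. 11 diagnosis with step 204 skipped (the process already runs).
        if diagnose_likely(process):
            force_stop(process)              # corresponds to step 232 of FIG. 11
            stopped.append(process)
    return stopped
```

Taking a snapshot with `list(...)` keeps the loop stable even if stopping a process changes the live process table.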
(Recovery processing)
Next, the recovery processing executed after a running process is stopped in this embodiment will be described.
FIG. 15 is a flowchart showing the detailed flow of the recovery processing executed by the client terminal 14. In this embodiment, the recovery processing is performed after a process running on the client terminal 14 has been stopped; that is, it is performed after the already running process has been forcibly stopped in step 232 of FIG. 11. The recovery processing aims to keep the system operating. The processing routine of FIG. 15 is executed in place of the processing shown in FIG. 12.
When the process is stopped at the client terminal 14 (application target system) because a failure has been diagnosed as likely, the processing routine of FIG. 15 is executed. To make patch application easier during system maintenance, when the process is stopped, the patch may be acquired (downloaded) from the failure management server 16 and accumulated in the pool area 26, and its patch number added to the application candidate patch list. Then, during system maintenance, applying the patches accumulated in the pool area 26 applies only the minimum set of fixes needed for the failures that are likely to occur on the application target system.
First, in step 320, confirmation processing is executed. It refers to the failure content list (DB2) to check whether a recovery method (shell program) corresponding to the stopped process is registered. If a recovery method for the stopped process is registered (Yes in step 322), the flow proceeds to step 324, where recovery processing using that recovery method is executed to keep the process alive until the next system maintenance. In the next step 326, the stopped process is restarted, and this routine ends.
If no recovery method for the stopped process is registered in the failure content list (No in step 322), the flow proceeds to step 328, where the process is registered in the process interruption list. In the next step 330, the user is notified that the stopped process has been registered in the process interruption list, and this routine ends.
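The recovery routine of FIG. 15 (steps 320 to 330) can be sketched in the same style. This is a hedged sketch: the failure content list is reduced to a mapping from process name to recovery shell program, running the program, restarting and notifying are caller-supplied callables, and what happens if the recovery program itself fails is not described in the source, so it is not handled here. All names are hypothetical.

```python
from typing import Callable, Dict, List


def recover_stopped_process(
    process: str,
    recovery_methods: Dict[str, str],      # DB2 modeled as process -> recovery program (assumption)
    run_recovery: Callable[[str], object], # e.g. executes the shell program
    restart: Callable[[str], None],
    interruption_list: List[str],          # process interruption list
    notify: Callable[[str], None],
) -> str:
    method = recovery_methods.get(process)     # steps 320/322
    if method is not None:
        run_recovery(method)                   # step 324: keep the process alive until maintenance
        restart(process)                       # step 326
        return "restarted"
    if process not in interruption_list:
        interruption_list.append(process)      # step 328
    notify(f"{process} was registered in the process interruption list")  # step 330
    return "interrupted"
```

Unlike the FIG. 12 routine, there is no retry loop here: a process without a registered recovery method goes straight to the interruption list.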
As described above, in this embodiment, when a failure is newly registered for a process that implements a software function, the list is referred to while the process is running to diagnose whether the failure occurrence conditions are met. When a process that is likely to cause the newly registered failure is running, its operation is stopped. Because running processes are stopped as soon as a previously unknown defect becomes known, even processes likely to cause a newly registered failure can be handled quickly, and the failure can be prevented before it occurs.
This embodiment has described, as an example, processing that checks for updates to the prohibited process list, but the processing is not limited to such checks. For example, because the list 40 includes DB1 to DB5, the processing can also be applied when a list other than the prohibited process list is updated, and the information included in DB2 to DB5 may be updated to the latest information.
Users may not want to apply patches. For example, they may avoid or postpone applying a patch because they want to keep the current system running from an operational standpoint, or because they want to minimize system downtime. One view is that even if a function in use has a latent failure, the current system is running without failures, so there is no wish to apply the latest patch. Moreover, if applying a patch causes incompatibility or a level-down (degradation), recovering from it takes many man-hours. Some users therefore prefer not to apply patches proactively merely to prevent known failures. On the other hand, there is also the view that waiting until a failure occurs without patching is too late, so users may hold these two conflicting views at once.
The embodiments described above also work effectively for users who hold the conflicting views that they "do not want to apply patches proactively" and yet "it is too late if a failure occurs without patching." For example, when a request is made to start a process that implements a software function, the list is referred to at process startup to diagnose whether the failure occurrence conditions for the requested process match the operating state of the application target system. If the diagnosis shows that the requested process is likely to cause a failure in the current operating state of the application target system, execution of the requested process is stopped. The possibility of a failure can thus be diagnosed according to the operating state of the system, and failures can be prevented before they occur in the running application target system. Moreover, even for a stopped process that is likely to cause a failure, if a workaround exists that reduces the possibility of failure without stopping the application target system, that workaround is applied, so failures are suppressed without stopping the running system. Furthermore, because patches are acquired when the process is stopped, only the minimum necessary patches are applied, which reduces the man-hours required for system maintenance.
In addition, when a process in which a new failure has become apparent is registered in the prohibited process list, the device diagnoses whether the processes running in the current operating state of the application target system meet the failure occurrence conditions of that process. If a process running on the application target system is likely to cause a failure, its execution is stopped. In this way, when a previously unknown defect becomes a known failure that makes a process likely to fail, that process is stopped, so even processes likely to cause a newly registered failure can be handled quickly and the failure prevented. Furthermore, even when a running process is stopped, if recovery processing exists that reduces the possibility of failure without stopping the system, it is applied, so failures are suppressed without stopping the running system.
Therefore, users need not be torn between "not wanting to apply patches proactively" and "it being too late if a failure occurs without patching."
Furthermore, when all patches matching the system information are extracted and applied, even selecting only the patches the application target system actually needs is insufficient if the selection is based on system information alone. For example, even a serious failure such as a security problem or a panic does not necessarily occur on one's own system, yet for safety more patches than necessary end up being applied. In addition, software vendors sometimes provide patches not only to prevent failures but also to add functions. Consequently, extracting and applying every applicable patch also applies patches the user does not want.
The embodiments described above also work effectively against this problem. They diagnose the possibility of a failure according to the operating state of the system when a process starts or when the latest patch for a process is registered. When the diagnosis shows that a failure is likely, execution of the process is stopped. Failures can thus be diagnosed according to the operating state of the system and prevented before they occur in the running system, so patches the user does not want are not applied.
The above description used a failure management server as an example of the server device and a client terminal as an example of the application target system. However, the configuration is not limited to these, and various improvements and modifications may of course be made without departing from the gist described above.
In addition, the above description covered a form in which the program is stored (installed) in advance in the storage unit of the client terminal, but the processing program may also be provided recorded on a recording medium such as a CD-ROM or DVD-ROM.
All documents, patent applications, and technical standards mentioned in this specification are incorporated herein by reference to the same extent as if each individual document, patent application, or technical standard were specifically and individually indicated to be incorporated by reference.
10 Computer system
14 Client terminal
14B Memory
14C Storage unit
16 Fault management server
Claims (24)
- A failure occurrence prevention device comprising:
a grasping unit that grasps the operating state of an application target system, which is a computer system to which a patch for preventing the occurrence of a failure is to be applied, when a process is started in the application target system;
a reference unit that refers to a list, classified by process, that includes conditions of operating states under which a failure may occur in the application target system;
a specifying unit that specifies, from the list referred to by the reference unit, the operating-state conditions under which a failure may occur for the started process;
a diagnosis unit that diagnoses the possibility that a failure will occur with execution of the started process, based on the operating-state conditions specified by the specifying unit and the operating state of the application target system grasped by the grasping unit; and
a process management unit that stops execution of the started process when the diagnosis unit diagnoses that the possibility of a failure of the application target system occurring with execution of that process is equal to or greater than a predetermined value.
- The failure occurrence prevention device according to claim 1, wherein
the list further includes a workaround that reduces the possibility of a failure occurring with execution of the corresponding process without stopping operation of the application target system,
the specifying unit specifies, from the list, a workaround corresponding to the started process, and
when the diagnosis unit diagnoses that the possibility of a failure occurring with execution of the started process is equal to or greater than the predetermined value, the process management unit applies the corresponding workaround to the application target system and then proceeds with execution of the started process if the list contains a workaround for that process, and stops execution of the started process if the list contains no workaround for that process.
- The failure occurrence prevention device according to claim 1 or 2, wherein the operating-state conditions registered in the list include the name of a process and a start option of the process.
- The failure occurrence prevention device according to any one of claims 1 to 3, wherein the operating-state conditions registered in the list include a command that outputs the operating state of the application target system and an expected value of the result of executing the command.
- The failure occurrence prevention device according to any one of claims 1 to 4, wherein the operating-state conditions registered in the list include the name of a data file used by a process and an expected value of the contents of the data file.
- The failure occurrence prevention device according to any one of claims 1 to 5, wherein the operating-state conditions registered in the list include a program that runs on the application target system and an expected value of the result of executing the program.
- The failure occurrence prevention device according to any one of claims 1 to 6, further comprising a reception unit that receives, from outside, information indicating that the list has been updated, wherein
the reference unit refers to the list when the reception unit receives the information indicating the update of the list,
the grasping unit grasps the operating state of the application target system when the reception unit receives the information indicating the update of the list,
the specifying unit specifies a process that is included in the updated list and is running on the application target system, and specifies, from the updated list, the operating-state conditions under which execution of the specified process may cause a failure in the application target system,
the diagnosis unit diagnoses, for the process specified by the specifying unit, the possibility of a failure occurring, based on the operating-state conditions specified by the specifying unit and the operating state of the application target system grasped by the grasping unit, and
the process management unit aborts execution of the specified process when the diagnosis unit diagnoses that the possibility of a failure occurring with execution of that process is equal to or greater than the predetermined value.
- The failure occurrence prevention device according to any one of claims 1 to 7, wherein
the list further includes a recovery measure for recovering, without stopping operation of the application target system, from a failure that occurred with execution of a process,
the specifying unit specifies, from the list, a recovery measure corresponding to a specific process whose execution is suspended, and
the process management unit resumes execution of the specific process after applying the corresponding recovery measure if the list contains a recovery measure for the specific process specified by the specifying unit, and abandons resuming execution of the specific process if the list contains no such recovery measure.
- The failure occurrence prevention device according to any one of claims 1 to 8, further comprising an acquisition unit that acquires a patch corresponding to a process whose execution has been stopped by the process management unit.
- The failure occurrence prevention device according to claim 7 or 8, further comprising an aborted-process patch acquisition unit that acquires a patch corresponding to a process whose execution has been aborted by the process management unit.
- A failure occurrence prevention method comprising:
a grasping step of grasping the operating state of an application target system, which is a computer system to which a patch for preventing the occurrence of a failure is to be applied, when a process is started in the application target system;
a reference step of referring to a list, classified by process, that includes conditions of operating states under which a failure may occur in the application target system;
a specifying step of specifying, from the list referred to in the reference step, the operating-state conditions under which a failure may occur for the started process;
a diagnosis step of diagnosing the possibility that a failure will occur with execution of the started process, based on the operating-state conditions specified in the specifying step and the operating state of the application target system grasped in the grasping step; and
a process management step of stopping execution of the started process when the diagnosis step diagnoses that the possibility of a failure of the application target system occurring with execution of that process is equal to or greater than a predetermined value.
- The failure occurrence prevention method according to claim 11, wherein
the list further includes a workaround that reduces the possibility of a failure occurring with execution of the corresponding process without stopping operation of the application target system,
the specifying step specifies, from the list, a workaround corresponding to the started process, and
when the diagnosis step diagnoses that the possibility of a failure occurring with execution of the started process is equal to or greater than the predetermined value, the process management step applies the corresponding workaround to the application target system and then proceeds with execution of the started process if the list contains a workaround for that process, and stops execution of the started process if the list contains no workaround for that process.
- The failure occurrence prevention method according to claim 11 or 12, wherein the operating-state conditions registered in the list include the name of a process and a start option of the process.
- The failure occurrence prevention method according to any one of claims 11 to 13, wherein the operating-state conditions registered in the list include a command that outputs the operating state of the application target system and an expected value of the result of executing the command.
- The failure occurrence prevention method according to any one of claims 11 to 14, wherein the operating-state conditions registered in the list include the name of a data file used by a process and an expected value of the contents of the data file.
- The failure occurrence prevention method according to any one of claims 11 to 15, wherein the operating-state conditions registered in the list include a program that runs on the application target system and an expected value of the result of executing the program.
- The failure occurrence prevention method according to any one of claims 11 to 16, further comprising a reception step of receiving, from outside, information indicating that the list has been updated, wherein
the reference step refers to the list when the information indicating the update of the list is received in the reception step,
the grasping step grasps the operating state of the application target system when the information indicating the update of the list is received in the reception step,
the specifying step specifies a process that is included in the updated list and is running on the application target system, and specifies, from the updated list, the operating-state conditions under which execution of the specified process may cause a failure in the application target system,
the diagnosis step diagnoses, for the process specified in the specifying step, the possibility of a failure occurring, based on the operating-state conditions specified in the specifying step and the operating state of the application target system grasped in the grasping step, and
the process management step aborts execution of the specified process when the diagnosis step diagnoses that the possibility of a failure occurring with execution of that process is equal to or greater than the predetermined value.
- The failure occurrence prevention method according to any one of claims 11 to 17, wherein
the list further includes a recovery measure for recovering, without stopping operation of the application target system, from a failure that occurred with execution of a process,
the specifying step specifies, from the list, a recovery measure corresponding to a specific process whose execution is suspended, and
the process management step resumes execution of the specific process after applying the corresponding recovery measure if the list contains a recovery measure for the specific process specified in the specifying step, and abandons resuming execution of the specific process if the list contains no such recovery measure.
- The failure occurrence prevention method according to any one of claims 11 to 18, further comprising an acquisition step of acquiring a patch corresponding to a process whose execution has been stopped by the process management step.
- The failure occurrence prevention method according to claim 17 or 18, further comprising an aborted-process patch acquisition step of acquiring a patch corresponding to a process whose execution has been aborted by the process management step.
- A failure occurrence prevention program for causing a computer to execute processing comprising:
a grasping step of grasping the operating state of an application target system when a process is started in the application target system, the computer being the application target system to which a patch for preventing the occurrence of a failure is to be applied;
a reference step of referring to a list, classified by process, that includes conditions of operating states under which a failure may occur in the application target system;
a specifying step of specifying, from the list referred to in the reference step, the operating-state conditions under which a failure may occur for the started process;
a diagnosis step of diagnosing the possibility that a failure will occur with execution of the started process, based on the operating-state conditions specified in the specifying step and the operating state of the application target system grasped in the grasping step; and
a process management step of stopping execution of the started process when the diagnosis step diagnoses that the possibility of a failure of the application target system occurring with execution of that process is equal to or greater than a predetermined value.
- A failure occurrence prevention program for causing a computer to execute the processing of the failure occurrence prevention method according to any one of claims 10 to 20.
- A computer-readable recording medium storing a failure occurrence prevention program for causing a computer to execute processing comprising:
a grasping step of grasping the operating state of an application target system when a process is started in the application target system, the computer being the application target system to which a patch for preventing the occurrence of a failure is to be applied;
a reference step of referring to a list, classified by process, that includes conditions of operating states under which a failure may occur in the application target system;
a specifying step of specifying, from the list referred to in the reference step, the operating-state conditions under which a failure may occur for the started process;
a diagnosis step of diagnosing the possibility that a failure will occur with execution of the started process, based on the operating-state conditions specified in the specifying step and the operating state of the application target system grasped in the grasping step; and
a process management step of stopping execution of the started process when the diagnosis step diagnoses that the possibility of a failure of the application target system occurring with execution of that process is equal to or greater than a predetermined value.
- A computer-readable recording medium storing a failure occurrence prevention program for causing a computer to execute the processing of the failure occurrence prevention method according to any one of claims 11 to 20.
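The four kinds of operating-state condition in the dependent claims (process name and start option; command and expected output; data file and expected content; program and expected result), together with the workaround branch at process start and the recovery branch for a suspended process, can be illustrated with a hypothetical sketch. The claims only name what each condition contains. The matching rules used here are assumptions: an exact match on the command's stripped output, a substring match on the file contents, and a comparison of the program's exit code. The scoring rule and all function and class names are likewise hypothetical. Workarounds and recovery measures are modeled as plain callables.

```python
from __future__ import annotations

import subprocess
import sys
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Optional

# Each check returns True when the failure-prone operating state is observed.
# The matching rules are assumptions; the claims only name what a condition holds.


def process_option_matches(running: dict[str, list[str]], name: str, option: str) -> bool:
    """Process name + start option: is `name` running with `option`?"""
    return option in running.get(name, [])


def command_output_matches(argv: list[str], expected: str) -> bool:
    """Command + expected value: does the command's stripped stdout equal `expected`?"""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.stdout.strip() == expected


def data_file_matches(path: str, expected: str) -> bool:
    """Data file name + expected content: does the file contain `expected`?"""
    p = Path(path)
    return p.is_file() and expected in p.read_text()


def program_result_matches(argv: list[str], expected_code: int) -> bool:
    """Program + expected result: does the program exit with `expected_code`?"""
    return subprocess.run(argv, capture_output=True).returncode == expected_code


@dataclass
class ListEntry:
    """Hypothetical per-process list entry: checks plus optional remedies."""
    checks: list[Callable[[], bool]] = field(default_factory=list)
    workaround: Optional[Callable[[], None]] = None  # applied before running
    recovery: Optional[Callable[[], None]] = None    # applied before resuming


def risk(entry: ListEntry) -> float:
    """Assumed scoring: fraction of checks that currently hold."""
    if not entry.checks:
        return 0.0
    return sum(1 for check in entry.checks if check()) / len(entry.checks)


def on_start(entry: ListEntry, threshold: float = 1.0) -> str:
    """At process start: run, apply the workaround and run, or stop."""
    if risk(entry) < threshold:
        return "run"
    if entry.workaround is not None:
        entry.workaround()
        return "run after workaround"
    return "stop"


def on_suspended(entry: ListEntry) -> str:
    """For a suspended process: apply the recovery measure and resume, or abort."""
    if entry.recovery is not None:
        entry.recovery()
        return "resume"
    return "abort resume"


if __name__ == "__main__":
    running = {"dbserver": ["--cache=on"]}
    entry = ListEntry(checks=[
        lambda: process_option_matches(running, "dbserver", "--cache=on"),
        lambda: command_output_matches([sys.executable, "-c", "print('2.6.18')"], "2.6.18"),
    ])
    print(on_start(entry))  # prints "stop": both checks hold and no workaround is listed
```

Representing each condition as a zero-argument callable keeps the four condition kinds interchangeable in one list entry. A real agent would more likely load the conditions as data from the list distributed by the server.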
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2011/076835 WO2013076798A1 (en) | 2011-11-21 | 2011-11-21 | Failure generation prevention device, failure generation prevention method, failure generation prevention program and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2011/076835 WO2013076798A1 (en) | 2011-11-21 | 2011-11-21 | Failure generation prevention device, failure generation prevention method, failure generation prevention program and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2013076798A1 (en) | 2013-05-30 |
Family ID: 48469279
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2011/076835 WO2013076798A1 (en) | Failure generation prevention device, failure generation prevention method, failure generation prevention program and medium | 2011-11-21 | 2011-11-21 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2013076798A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2018028760A (en) * | 2016-08-16 | 2018-02-22 | 富士ゼロックス株式会社 | Information processing apparatus and program |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003233512A (en) * | 2002-02-13 | 2003-08-22 | Nec Corp | Client monitoring system with maintenance function, monitoring server, program, and client monitoring/ maintaining method |
JP2003345915A (en) * | 2002-05-22 | 2003-12-05 | Hitachi Ltd | Remote maintenance method, its implementation system and its processing program |
JP2005100203A (en) * | 2003-09-26 | 2005-04-14 | Hitachi Information Systems Ltd | Job information management system, job information management method, and program therefor |
WO2009104268A1 (en) * | 2008-02-21 | 2009-08-27 | 富士通株式会社 | Patch candidate selector, program for selecting patch candidate, and method for selecting patch candidate |
JP2011209857A (en) * | 2010-03-29 | 2011-10-20 | Hitachi Solutions Ltd | Known failure prevention/avoidance system and method |
- 2011-11-21: WO PCT/JP2011/076835 (WO2013076798A1), active Application Filing
Similar Documents
Publication | Title |
---|---|
US8151257B2 (en) | Managing different versions of server components regarding compatibility with collaborating servers |
EP2210183B1 (en) | Managing updates to create a virtual machine facsimile |
JP5970617B2 (en) | Development support system |
US8140907B2 (en) | Accelerated virtual environments deployment troubleshooting based on two level file system signature |
US8661418B2 (en) | Setting program, workflow creating method, and work flow creating apparatus |
US9928059B1 (en) | Automated deployment of a multi-version application in a network-based computing environment |
US20090327815A1 (en) | Process Reflection |
US8132186B1 (en) | Automatic detection of hardware and device drivers during restore operations |
US20110296247A1 (en) | System and method for mitigating repeated crashes of an application resulting from supplemental code |
CN104915263A (en) | Process fault processing method and device based on container technology |
CN112099825B (en) | Method, device, equipment and storage medium for upgrading component |
US9542173B2 (en) | Dependency handling for software extensions |
CN108804239B (en) | Platform integration method and device, computer equipment and storage medium |
CN112965913A (en) | Method for automatically repairing dependency conflict problem of Java software |
CN113703823A (en) | BMC (baseboard management controller) firmware upgrading method and device, electronic equipment and storage medium |
US20130086572A1 (en) | Generation apparatus, generation method and computer readable information recording medium |
US20120096303A1 (en) | Detecting and recovering from process failures |
CN113849200B (en) | Installation optimization method and system for android application in android compatible environment |
CN111698558A (en) | Television software upgrading method, television terminal and computer readable storage medium |
US8689048B1 (en) | Non-logging resumable distributed cluster |
CN115543429A (en) | Project environment building method, electronic equipment and computer readable storage medium |
US9760364B2 (en) | Checks for software extensions |
CN111949290B (en) | Hot patch management method and device, electronic equipment and storage medium |
CN119225925A (en) | Service management method, device, electronic device and storage medium |
WO2013076798A1 (en) | Failure generation prevention device, failure generation prevention method, failure generation prevention program and medium |
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 11876344; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | EP: PCT application non-entry in European phase | Ref document number: 11876344; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: JP |