
WO2009110326A1 - Error analysis device, error analysis method, and recording medium - Google Patents

Error analysis device, error analysis method, and recording medium

Info

Publication number
WO2009110326A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
abnormality
degree
failure
identification information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2009/052992
Other languages
French (fr)
Japanese (ja)
Inventor
慎二 中台
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp
Publication of WO2009110326A1
Anticipated expiration
Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment

Definitions

  • The present invention relates to a failure analysis device, a failure analysis method, and a recording medium, and more particularly to a failure analysis device, a failure analysis method, and a recording medium that can detect and classify system failures without rules or thresholds having to be set.
  • FIG. 1 is a diagram showing an example of a failure analysis apparatus, which is disclosed in Japanese Patent No. 3581934.
  • The failure analysis apparatus 100 includes an abnormal call volume monitoring unit 101, such as an operation measurement record (OM) transfer unit and a failure record transfer unit, a threshold determination unit 115, and a determination result display unit 116.
  • The failure analysis apparatus 100 configured as described above operates as follows.
  • The abnormal call volume monitoring unit 101 monitors the monitoring target devices 131 and 132 for logs indicating the occurrence of an abnormality and, if such a log exists, counts the call volume, which is the traffic volume per hour, for each type of abnormality.
  • When the call volume within a predetermined time exceeds a predetermined threshold, the threshold determination unit 115 notifies the maintenance operator of the abnormality as a failure through the determination result display unit 116.
  • In this way, the failure analysis apparatus 100 shown in FIG. 1 can automatically detect a failure.
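The counting and threshold check performed by the units 101 and 115 can be sketched as follows. This is a minimal illustration in Python, not the implementation of Japanese Patent No. 3581934; the function name `detect_failures`, the `(hour, abnormality_type)` log format, and the threshold value are assumptions made for the example.

```python
from collections import Counter


def detect_failures(log_entries, threshold):
    """Return the (hour, abnormality_type) pairs whose call volume exceeds the threshold.

    log_entries: iterable of (hour, abnormality_type) tuples, one per abnormality log
                 (hypothetical format).
    threshold:   fixed limit on the hourly call volume, set in advance by the operator.
    """
    volume = Counter(log_entries)  # call volume per hour and per abnormality type
    return sorted(key for key, count in volume.items() if count > threshold)


# Hypothetical logs collected from the monitored devices.
logs = [
    (9, "call_loss"), (9, "call_loss"), (9, "call_loss"),
    (10, "call_loss"),
    (9, "timeout"),
]
alerts = detect_failures(logs, threshold=2)
print(alerts)  # [(9, 'call_loss')]
```

The threshold here is fixed by hand, which is exactly the setting burden that the learning-based approaches discussed next try to remove.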
  • FIG. 2 is a diagram showing another example of a failure analysis device, which is disclosed in JING WU, JIAN-GUO ZHOU, PU-LIU YAN, MING WU, "A STUDY ON NETWORK FAULT KNOWLEDGE ACQUISITION BASED ON SUPPORT VECTOR MACHINE", Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, 18-21 August 2005.
  • To manage the monitoring target system 230, which includes the monitoring target apparatuses 231 to 234, the failure analysis apparatus 200 includes an abnormality level monitoring unit 201, an abnormality level storage unit 210, a failure case registration unit 211, a case storage unit 212, a pattern learning unit 213, a knowledge storage unit 214, a pattern determination unit 215, a determination result display unit 216, and a determination correction input unit 217.
  • The failure analysis apparatus 200 configured as described above collects the degree of abnormality, an index indicating the possibility of failure for each device and line, from the monitoring results for the monitoring target devices 231 to 234.
  • FIG. 3 is a diagram showing the values of the degree of abnormality used in the failure analysis apparatus 200 shown in FIG. 2.
  • The degree of abnormality used in the failure analysis apparatus 200 shown in FIG. 2 includes values such as whether or not the link is down, an error rate, a congestion rate, a rejection rate, and a utilization rate.
  • The pattern determination unit 215 uses the knowledge information stored in the knowledge storage unit 214 to determine whether the obtained combination of abnormality degrees indicates that a failure has occurred, and presents the determination result to the maintenance operator through the determination result display unit 216.
  • Knowledge information stored in the knowledge storage unit 214 is generated by the following procedure.
  • The maintenance operator uses the failure case registration unit 211 to register past failure cases in the case storage unit 212.
  • The pattern learning unit 213 generates knowledge information from the combination of the failure cases stored in the case storage unit 212 and the abnormality degrees stored in the abnormality degree storage unit 210, and stores the knowledge information in the knowledge storage unit 214.
  • A failure case is information indicating when and where a failure occurred.
  • The pattern learning unit 213 generates knowledge information by pattern learning performed using a pattern classifier called a Support Vector Machine (SVM).
  • In an SVM, a one-dimensional class is estimated from a multi-dimensional variable. The variables that make up the multi-dimensional variable are called features, and a d-dimensional space spanned by d features is called a feature space R^d.
  • The input variable is a feature variable x (∈ R^d) in this feature space, and the output variable is a class y (∈ {1, -1}).
  • The value of y changes when x leaves a certain region in the feature space.
  • The boundary of the region that causes such a change is called a hyperplane.
  • The knowledge information obtained by the pattern learning unit 213 is the threshold for detecting and classifying failures; in a feature space composed of combinations of abnormality degrees, it is a hyperplane that separates a plurality of classes.
  • The failure analysis device 200 shown in FIG. 2 can therefore detect a failure without a threshold for failure detection and classification having to be set.
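How a learned hyperplane acts as a threshold can be sketched as follows, assuming the simplest case of a linear hyperplane w·x + b = 0. The function name `classify`, the two features, and the values of `w` and `b` are hypothetical; in the apparatus 200 the hyperplane would come from the pattern learning unit 213.

```python
def classify(x, w, b):
    """Return +1 (failure) or -1 (normal) according to the side of the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical hyperplane over two features: (error rate, utilization rate).
w, b = [2.0, 1.0], -1.5

print(classify([0.9, 0.8], w, b))  # 1  (failure side)
print(classify([0.1, 0.3], w, b))  # -1 (normal side)
```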
  • However, the failure analysis devices described above do not reflect the failure detection sensitivity desired by the maintenance operator when generating the failure detection threshold from cases. As a result, even if the maintenance operator's policy is to reduce the number of overlooked failures at the expense of occasionally detecting a normal state as a failure, the generated threshold may be one that yields few false detections but many overlooked failures.
  • The present invention has been made in view of the above-described problems, and its object is to provide a failure analysis apparatus, a failure analysis method, and a recording medium capable of performing failure detection or classification that reflects the failure detection sensitivity desired by a maintenance operator.
  • To achieve this object, the present invention provides a failure analysis device comprising: abnormality degree information receiving means for sequentially receiving abnormality degree information and identification information of the abnormality degree information from a monitoring target apparatus that sequentially outputs the abnormality degree information, which includes a plurality of index values indicating the degree of abnormality of the monitoring target apparatus, together with the identification information of the abnormality degree information;
  • type determination means for comparing each piece of abnormality degree information received by the abnormality degree information receiving means with a predetermined determination criterion and classifying each piece of abnormality degree information by type based on the comparison result;
  • determination result output means for outputting the identification information of each piece of abnormality degree information in association with information indicating the type into which that piece of abnormality degree information was classified;
  • failure case registration means for receiving input of information indicating the true type for the identification information of each piece of abnormality degree information;
  • a case storage unit that stores the identification information of each piece of abnormality degree information in association with the true type;
  • detection sensitivity input means for receiving an input of a setting parameter for updating the determination criterion; and
  • pattern learning means for updating the determination criterion based on each piece of abnormality degree information received by the abnormality degree information receiving means, the information indicating the true type stored in association with the identification information of each piece of abnormality degree information, and the setting parameter.
  • A failure analysis method of the present invention, which uses an information processing apparatus, comprises: the information processing apparatus sequentially receiving abnormality degree information and identification information of the abnormality degree information from a monitoring target apparatus that sequentially outputs the abnormality degree information, which includes a plurality of index values indicating the degree of abnormality of the monitoring target apparatus, together with the identification information of the abnormality degree information;
  • the information processing apparatus comparing each received piece of abnormality degree information with a predetermined determination criterion and classifying each piece of abnormality degree information by type based on the comparison result;
  • the information processing apparatus outputting the identification information of each piece of abnormality degree information in association with information indicating the type into which that piece of abnormality degree information was classified;
  • the information processing apparatus receiving an input of information indicating the true type for the identification information of each piece of abnormality degree information;
  • the information processing apparatus storing the identification information of each piece of abnormality degree information in association with the true type;
  • the information processing apparatus accepting an input of a setting parameter for updating the determination criterion; and
  • the information processing apparatus updating the determination criterion based on each received piece of abnormality degree information, the information indicating the true type stored in association with the identification information of each piece of abnormality degree information, and the setting parameter.
  • A recording medium of the present invention is a recording medium on which a program for operating a computer is written, the program causing the computer to execute a procedure for sequentially receiving abnormality degree information and identification information of the abnormality degree information from a monitoring target apparatus that sequentially outputs the abnormality degree information, which includes a plurality of index values indicating the degree of abnormality of the monitoring target apparatus, together with the identification information of the abnormality degree information.
  • The program further causes the computer to execute a procedure for updating the determination criterion based on each received piece of abnormality degree information, the information indicating the true type stored in association with the identification information of each piece of abnormality degree information, and the setting parameter.
  • The present invention can thus perform failure detection or classification that reflects the failure detection sensitivity desired by the maintenance operator.
  • FIG. 5 is a flowchart for explaining the operation of the failure analysis apparatus shown in FIG. 4.
  • FIG. 5 is a configuration diagram of a monitoring target for explaining an embodiment of the operation of the failure analysis apparatus shown in FIG. 4. It is a diagram showing the feature space for explaining one example of operation
  • FIG. 4 is a block diagram showing an embodiment of the failure analysis apparatus of the present invention.
  • In this embodiment, a computer 300 (a central processing unit, a processor, or a data processing unit), which is an information processing device that operates under program control, is connected to a system 330 including monitoring target devices 331 to 334.
  • The computer 300 includes a failure case registration unit 311, a case storage unit 312, an abnormality degree monitoring unit 301 that is an abnormality degree information receiving unit, an abnormality degree storage unit 310, a pattern learning unit 313, a knowledge storage unit 314, a pattern determination unit 315 that is a type determination unit, a determination result display unit 316 that is a determination result output unit, a determination correction input unit 317, a detection sensitivity input unit 318, and a detection sensitivity storage unit 319.
  • The failure case registration unit 311 is connected to the case storage unit 312, the case storage unit 312 is connected to the failure case registration unit 311 and the pattern learning unit 313, and the detection sensitivity input unit 318 is connected to the detection sensitivity storage unit 319.
  • The detection sensitivity storage unit 319 is connected to the detection sensitivity input unit 318 and the pattern learning unit 313.
  • The pattern learning unit 313 is connected to the abnormality degree storage unit 310, the case storage unit 312, the detection sensitivity storage unit 319, and the knowledge storage unit 314; the abnormality degree storage unit 310 is connected to the pattern learning unit 313 and the abnormality degree monitoring unit 301; and the knowledge storage unit 314 is connected to the pattern learning unit 313 and the pattern determination unit 315.
  • The abnormality degree monitoring unit 301 is connected to the abnormality degree storage unit 310 and the pattern determination unit 315; the pattern determination unit 315 is connected to the knowledge storage unit 314, the abnormality degree monitoring unit 301, and the determination result display unit 316; and the determination result display unit 316 is connected to the pattern determination unit 315.
  • In this embodiment, the knowledge information, threshold value, boundary surface, and hyperplane all refer to the same thing and correspond to the determination criterion of the present invention.
  • The features in this embodiment correspond to the index values of the present invention.
  • The detection sensitivity input by the operator corresponds to the cost shown in the tables of FIGS.
  • The detection sensitivity is a setting parameter for changing the threshold value (determination criterion), and is a cost set for each case as described later.
  • The detection sensitivity corresponds to the setting parameter of the present invention.
  • The failure case registration unit 311 receives input of the failure occurrence time and location from a terminal (not shown) used by the maintenance operator, who is the operator in the present invention.
  • This pair of failure occurrence time and location is called a case. A case may also include the type of failure and the root cause location.
  • More precisely, a case is information that associates a time with a place, either those at which a failure occurred or those at which operation was normal. Both the time and the place stored in a case may have extent, such as a period or a range.
  • Cases include failure cases, which indicate that a failure actually occurred, and normal cases, which indicate that operation was actually normal.
  • A failure case includes a failure occurrence time and place, and a normal case includes a time and place at which operation was normal.
  • A case may include a case type (corresponding to a class or a pattern, and to the true type in the present invention).
  • The case type is either information indicating that the case is normal or information indicating the type of failure.
  • When the case type is included, a failure case includes the failure occurrence time and location and the type of failure.
  • A normal case then includes the time and location at which operation was normal and information indicating that the case is normal.
  • The case type may instead be configured as information independent of the case. In this embodiment, it is assumed that the case type is not included in the case.
  • Alternatively, the case type may be included in the case.
  • The failure case registration unit 311 may receive an input of the case type together with the case.
  • The location may be an identifier for identifying each of the monitoring target devices 331 to 334, or any other information that can identify where a failure has occurred, such as a line name or address.
  • The failure occurrence time and location are included in the identification information of the abnormality degree information of the present invention.
  • The identification information of the abnormality degree information corresponds to a case.
  • The identification information of the abnormality degree information need only contain information that can identify the abnormality degree information, such as an identifier attached to it.
  • The case storage unit 312 receives a case from the failure case registration unit 311 or the determination correction input unit 317 described later, and stores the received case.
  • FIG. 5 is a diagram showing a table in the case storage unit 312 shown in FIG. 4.
  • The case storage unit 312 stores a case number, a time, a place, and a pattern in association with each other.
  • The case number, time, and place are identification information of the abnormality degree information, and the pattern is the case type. Note that the case number, time, and place are not all indispensable; at least one piece of information that can identify the abnormality degree information is sufficient.
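The table held by the case storage unit 312 can be pictured with a small in-memory structure. This is only a sketch under assumptions: the class names `Case` and `CaseStore`, the field types, and the sample values are hypothetical, and the actual table layout of FIG. 5 is not reproduced in this text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Case:
    case_number: int
    time: str                      # occurrence time, or a period
    place: str                     # device identifier, line name, or address
    pattern: Optional[str] = None  # case type (true type); may be absent


class CaseStore:
    """Keeps cases keyed by case number; re-registering a number corrects the entry."""

    def __init__(self):
        self._cases = {}

    def register(self, case):
        self._cases[case.case_number] = case

    def all_cases(self):
        return [self._cases[k] for k in sorted(self._cases)]


store = CaseStore()
store.register(Case(1, "2009-02-20 10:00", "device-331", "failure 1"))
store.register(Case(2, "2009-02-20 11:00", "device-332"))
# A correction entered later replaces the earlier pattern for case 1.
store.register(Case(1, "2009-02-20 10:00", "device-331", "normal"))
```

Keying by case number is one possible choice; as noted above, any information that identifies the abnormality degree information, such as the time and place pair, could serve as the key instead.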
  • The abnormality level monitoring unit 301 acquires abnormality level information including the abnormality level from the monitoring target devices 331 to 334 in the monitoring target system 330.
  • The abnormality degree monitoring unit 301 stores the obtained abnormality degree information in the abnormality degree storage unit 310.
  • The abnormality degree monitoring unit 301 passes to the pattern determination unit 315 either information indicating the time included in the abnormality degree information or information indicating the time at which the abnormality degree monitoring unit 301 received the abnormality degree information.
  • The abnormality degree storage unit 310 stores the abnormality degree, time, place, and value included in the abnormality degree information previously received by the abnormality degree monitoring unit 301 in association with each other. For example, the abnormality degree information may be stored so that it can be retrieved by time and place.
  • When the maintenance operator makes an input through the failure case registration unit 311 or the determination correction input unit 317, or periodically, the pattern learning unit 313 reads from the abnormality degree storage unit 310 the abnormality degree information associated with each case stored in the case storage unit 312.
  • The feature space used by the pattern learning unit 313 is spanned by the abnormality degrees (features) included in each piece of abnormality degree information that was read.
  • The pattern learning unit 313 reads the detection sensitivity for each label (corresponding to the type of the present invention), such as each type of failure case and normal, from the detection sensitivity storage unit 319 described later.
  • The pattern learning unit 313 generates a threshold (hyperplane) for detecting and classifying failures based on the abnormality degree information read from the abnormality degree storage unit 310 and the detection sensitivity read from the detection sensitivity storage unit 319, and stores it in the knowledge storage unit 314.
  • The hyperplane is derived by optimizing Equation 1 under the constraint given in Equation 2 in the feature space R^d.
  • Here ξi, which is described as a slack variable in Hideki Aso, Koji Tsuda, Noboru Murata, "Statistics of Pattern Recognition and Learning", Iwanami Shoten, pp. 107-123, 2005, represents the extent to which case i exceeds the hyperplane.
  • Because ξi is weighted by the cost Cyi determined by the label yi of case i, the learned hyperplane reflects the ratio of the costs Cy between the labels. This cost Cyi is the detection sensitivity.
  • This example shows only two-class classification, but the same approach also applies to multi-class classification, such as multiple failure patterns.
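Equations 1 and 2 are not reproduced in this text. Assuming they follow the standard soft-margin SVM formulation with a class-dependent cost, which is consistent with the description of the slack variable and the cost Cyi above, they would take roughly the following form. The symbols w and b (normal vector and offset of the hyperplane) and n (number of cases) are introduced here for illustration and do not appear in the source.

```latex
% Assumed form of Equation 1: margin term plus cost-weighted slack
\min_{w,\,b,\,\xi} \quad \frac{1}{2}\lVert w \rVert^{2} + \sum_{i=1}^{n} C_{y_i}\,\xi_i

% Assumed form of Equation 2: soft-margin constraints
\text{subject to} \quad y_i\,\bigl(w^{\top} x_i + b\bigr) \ge 1 - \xi_i,
\qquad \xi_i \ge 0, \qquad i = 1, \dots, n
```

Under this form, raising the cost for one label penalizes slack on that label's cases more heavily, so the optimum moves the hyperplane away from them, which is the behavior attributed to the detection sensitivity.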
  • The knowledge storage unit 314 stores the threshold value generated by the pattern learning unit 313.
  • The pattern determination unit 315 receives the abnormality level information from the abnormality level monitoring unit 301. The pattern determination unit 315 then reads out the threshold value stored in the knowledge storage unit 314 and determines whether the abnormality level information received from the abnormality level monitoring unit 301 indicates a failure or a normal state. When it determines that there is a failure, it further determines the type of failure, and passes the identification information of the abnormality level information and the determination result to the determination result display unit 316.
  • The determination result display unit 316 displays to the maintenance operator the determination result received from the pattern determination unit 315 (corresponding to the pattern, the case type, and the type of the present invention) and the identification information (case) of the abnormality level information.
  • When the determination result (corresponding to the pattern, the case type, and the type of the present invention) presented to the maintenance operator by the determination result display unit 316 is incorrect, the determination correction input unit 317 registers the case type considered to be correct (corresponding to the true type of the present invention) together with the case in the case storage unit 312. For example, the case type (true type) may be added to the case storage unit 312 in addition to the time and place (case), or a case already stored in the case storage unit 312 may be modified to what the maintenance operator considers correct.
  • The detection sensitivity input unit 318 receives detection sensitivity input from a terminal (not shown) used by the maintenance operator. The detection sensitivity may be input in association with a true type.
  • The detection sensitivity storage unit 319 receives the detection sensitivity from the detection sensitivity input unit 318 and stores it.
  • The detection sensitivity storage unit 319 may receive the true type together with the detection sensitivity, and store the received detection sensitivity and the true type in association with each other.
  • FIGS. 6 to 9 are flowcharts for explaining the operation of the failure analysis apparatus 300 shown in FIG. 4.
  • The abnormality level monitoring unit 301 acquires abnormality level information including the abnormality level from the monitoring target system 330 (step 401), and passes the acquired abnormality level information to the pattern determination unit 315.
  • The pattern determination unit 315 determines the case type in the monitored system 330 from the abnormality level information received from the abnormality level monitoring unit 301, using the threshold value (hyperplane) held in the knowledge storage unit 314, and passes the determination result (pattern) and the identification information (case) of the abnormality level information to the determination result display unit 316 (step 402).
  • The determination result display unit 316 displays the pattern (type) and the identification information of the abnormality level information received from the pattern determination unit 315 (step 403).
  • The maintenance operator inputs the failure occurrence time or normal time, the location, and the case type, as the case and the true type, to the failure case registration unit 311 or the determination correction input unit 317.
  • The failure case registration unit 311 or the determination correction input unit 317 stores the input case in the case storage unit 312 (step 601).
  • The maintenance operator sets the detection sensitivity for each type in the detection sensitivity storage unit 319 (step 500), and inputs the set detection sensitivity for each type via the detection sensitivity input unit 318 (step 501).
  • A high detection sensitivity for the normal type has the same meaning as making all failures harder to detect, so the input information may consist of a detection sensitivity for each type and a detection sensitivity common to all types.
  • The pattern learning unit 313 generates a threshold for performing failure determination by pattern learning (step 602). This step may be executed separately on an instruction from the maintenance operator.
  • For every case held in the case storage unit 312, the pattern learning unit 313 acquires from the abnormality degree storage unit 310 the system information associated with the time or place included in the case (steps 701 and 702).
  • Using, for each piece of system information, a feature vector composed of the degree of abnormality and the situation information included in the system information associated with each case obtained from the case storage unit 312, the pattern learning unit 313 learns and generates a hyperplane for classifying each piece of system information into the pattern of its case type (step 703).
  • In doing so, the pattern learning unit 313 reads out each detection sensitivity stored in the detection sensitivity storage unit 319 and learns using each detection sensitivity as the weight of the cost given to each case that exceeds the hyperplane.
  • The pattern learning unit 313 stores the learned hyperplane in the knowledge storage unit 314, and the pattern determination unit 315 uses the hyperplane stored in the knowledge storage unit 314 to classify the pattern of each piece of abnormality degree information received from the abnormality level monitoring unit 301 (step 704).
  • Because the detection sensitivity that the maintenance operator has in mind, both for each failure type and overall, is given as the cost incurred when a failure-labeled or normal-labeled case in the feature space exceeds the hyperplane, the threshold value represented by the hyperplane reflects the failure detection policy intended by the maintenance operator.
  • The detection sensitivity value is set based on the severity of each failure as input by the maintenance operator.
  • When the detection sensitivity for a failure type is raised, the cost of its cases exceeding the hyperplane becomes larger relative to cases of other failure types and normal cases, so more of its failure cases fall within the region that the generated hyperplane determines to be a failure.
  • As a result, applying this hyperplane threshold to system monitoring data for failure detection makes it easier for a failure to be determined.
  • Conversely, when the cost given as the failure detection sensitivity value is reduced, it becomes harder for a failure to be determined.
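The effect of the detection sensitivity described above can be illustrated with a deliberately simplified stand-in for the hyperplane learning: a one-dimensional threshold chosen to minimize the cost-weighted number of misclassified cases. This is not the SVM optimization of the embodiment; the function name `learn_threshold`, the sample data, and the cost values are hypothetical.

```python
def learn_threshold(cases, cost):
    """Pick the threshold t that minimizes the cost-weighted number of errors.

    cases: list of (feature_value, label), label being "failure" or "normal".
    cost:  dict mapping each label to its detection sensitivity, used as the
           cost of misclassifying a case with that label.
    A value x is judged to be a failure when x >= t.
    """
    def total_cost(t):
        missed = sum(1 for x, y in cases if y == "failure" and x < t)
        false_alarms = sum(1 for x, y in cases if y == "normal" and x >= t)
        return cost["failure"] * missed + cost["normal"] * false_alarms

    candidates = sorted(x for x, _ in cases)  # try each observed value as t
    return min(candidates, key=total_cost)


# Hypothetical call loss rates; one normal case (0.5) overlaps the failure cases.
cases = [
    (0.1, "normal"), (0.2, "normal"), (0.3, "normal"), (0.5, "normal"),
    (0.4, "failure"), (0.6, "failure"), (0.7, "failure"), (0.8, "failure"),
]

sensitive = learn_threshold(cases, {"failure": 5, "normal": 1})
tolerant = learn_threshold(cases, {"failure": 1, "normal": 5})
print(sensitive, tolerant)  # 0.4 0.6
```

With the higher failure cost the threshold drops to 0.4, so every failure case is detected at the price of one false alarm. With the higher normal cost it rises to 0.6, which removes the false alarm but overlooks one failure.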
  • FIG. 10 is a configuration diagram of a monitoring target for explaining an embodiment of the operation of the failure analysis apparatus 300 shown in FIG. 4. FIGS. 11 to 13 are diagrams showing a feature space for explaining an embodiment of the operation of the failure analysis apparatus 300 shown in FIG. 4.
  • The monitoring target system 330 includes monitoring target devices 901 and 902, which communicate with each other. The management system 300 of the present invention is assumed to acquire, as degrees of abnormality, the call loss rate 904 of the communication between the monitoring target device 901 and the monitoring target device 902 and the CPU usage rate 905 of the monitoring target device 902.
  • The maintenance operator registers the detection sensitivity from the detection sensitivity input unit 318, and a hyperplane in the feature space representing the detection threshold is generated from this information.
  • With the generated hyperplane 1003, the proportion of failure cases that exceed the hyperplane 1003 and lie on the normal region 1005 side is approximately the same as the proportion of normal cases that exceed the hyperplane 1003 and lie on the failure region side.
  • The detection threshold value represented by the generated hyperplane 1103 can determine that all the data having features near the hyperplane 1003 in FIG.
  • The detection threshold value represented by the hyperplane 1203 generated at this time is such that all data having features near the hyperplane 1203 in FIG. it can.
  • FIGS. 14 to 16 are diagrams showing a feature space for explaining another embodiment of the operation of the failure analysis apparatus 300 shown in FIG. 4.
  • The region determined as failure 1 is the region surrounded by the hyperplane 1303, and the region determined as failure 2 is the region surrounded by the hyperplane 1304.
  • When the maintenance operator judges that failure 1 is a serious failure and sets its detection sensitivity high, as with the set value 1410 in FIG. 15, the generated hyperplane 1403 is as shown in FIG.
  • As a result, monitoring results near the boundary of the hyperplane 1303 are also determined to be failure 1, and the rate of overlooking this failure is reduced.
  • When the maintenance operator judges that failure 2 is an important failure and sets its detection sensitivity high, as with the setting value 1510 in FIG. 16, the generated hyperplane 1504 is as shown in FIG.
  • As a result, monitoring results near the boundary of the hyperplane 1304 are also determined to be failure 2, and the rate of overlooking this failure is reduced.
  • the present invention can be applied to applications such as monitoring a system including a computer, a network device, and a communication device, and detecting and classifying a failure.
  • the processing in the failure analyzer is recorded on a recording medium readable by the failure analyzer, in addition to the above-described dedicated hardware.
  • the program recorded on the recording medium may be read by the failure analysis apparatus and executed.
  • Recording media that can be read by the failure analysis device include IC cards, memory cards, transferable storage media such as floppy disks (registered trademark), magneto-optical disks, DVDs, and CDs, as well as built-in failure analysis devices. Refers to the HDD or the like.
  • the program recorded on this recording medium is read by a control block, for example, and the same processing as described above is performed under the control of the control block.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Provided is an error analysis method including steps of: successively receiving error degree information including a plurality of index values indicating an error degree of a device to be monitored together with an identifier of the error degree information; comparing the received error degree information to a predetermined judgment reference; classifying the error degree information into a corresponding type according to the comparison result; correlating each identifier of the error degree information with the corresponding type of the classified error degree information for output; receiving information indicating the true type for each of the identifiers of the error degree information; storing the identifiers of the respective error degree information while correlating them with the true types; receiving an input of a setting parameter used for updating the judgment reference; and updating the judgment reference according to the received respective error degree information, the information indicating the true types stored while being correlated with the identifiers of the respective error degree information, and the setting parameter.

Description

Failure analysis apparatus, failure analysis method, and recording medium

The present invention relates to a failure analysis apparatus, a failure analysis method, and a recording medium, and in particular to a failure analysis apparatus, a failure analysis method, and a recording medium that can detect and classify system failures without requiring rules or thresholds to be set.

FIG. 1 shows an example of a failure analysis apparatus, as disclosed in Japanese Patent No. 3581934.

As shown in FIG. 1, the failure analysis apparatus 100 comprises an abnormal call volume monitoring unit 101, such as an operation measurement (OM) record transfer unit or a failure record transfer unit, a threshold determination unit 115, and a determination result display unit 116.

The failure analysis apparatus 100 configured as described above operates as follows.

The abnormal call volume monitoring unit 101 monitors whether the monitoring target devices 131 and 132 have produced logs indicating the occurrence of an abnormality and, if such logs exist, counts the call volume (the traffic volume per unit time) for each type of abnormality. When the call volume within a fixed period reaches or exceeds a predetermined threshold, the threshold determination unit 115 notifies the maintenance operator of the abnormality as a failure through the determination result display unit 116.
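The fixed-threshold scheme of FIG. 1 can be sketched as follows. This is only a minimal illustration: the log format (timestamp, abnormality type), the function name `detect_failures`, and the threshold values are assumptions made here, not details taken from Japanese Patent No. 3581934.

```python
from collections import Counter

def detect_failures(log_entries, window_start, window_end, thresholds):
    """Return the abnormality types whose call volume in the window reaches its threshold.

    log_entries: iterable of (timestamp, abnormality_type) pairs (hypothetical format).
    thresholds:  per-type thresholds set in advance by the operator.
    """
    counts = Counter(
        kind for timestamp, kind in log_entries
        if window_start <= timestamp < window_end
    )
    # A type with no configured threshold is never reported.
    return sorted(
        kind for kind, n in counts.items()
        if n >= thresholds.get(kind, float("inf"))
    )

logs = [(0, "call_loss"), (5, "call_loss"), (9, "call_loss"),
        (12, "timeout"), (70, "call_loss")]
alerts = detect_failures(logs, 0, 60, {"call_loss": 3, "timeout": 2})
print(alerts)
```

Every threshold in this scheme has to be chosen by hand, which is the burden the SVM-based apparatus of FIG. 2 is meant to remove.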

Through this operation, the failure analysis apparatus 100 shown in FIG. 1 can detect failures automatically.

FIG. 2 shows another example of a failure analysis apparatus, as disclosed in JING WU, JIAN-GUO ZHOU, PU-LIU YAN, MING WU, "A STUDY ON NETWORK FAULT KNOWLEDGE ACQUISITION BASED ON SUPPORT VECTOR MACHINE", Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, 18-21 August 2005.

As shown in FIG. 2, the failure analysis apparatus 200 manages a monitoring target system 230 made up of monitoring target devices 231 to 234. It comprises an abnormality degree monitoring unit 201, an abnormality degree storage unit 210, a failure case registration unit 211, a case storage unit 212, a pattern learning unit 213, a knowledge storage unit 214, a pattern determination unit 215, a determination result display unit 216, and a determination correction input unit 217.

The failure analysis apparatus 200 configured as described above collects, from the results of monitoring the monitoring target devices 231 to 234, abnormality degrees, which are indices representing the possibility of failure for each device or line.

FIG. 3 shows the abnormality degree values used in the failure analysis apparatus 200 shown in FIG. 2.

As shown in FIG. 3, the abnormality degrees used in the failure analysis apparatus 200 shown in FIG. 2 include values such as whether a link is down, the error rate, the congestion rate, the rejection rate, and the utilization rate.

The pattern determination unit 215 applies the knowledge information stored in the knowledge storage unit 214 to the obtained combination of abnormality degrees to determine whether a failure has occurred in the monitoring target system 230, and presents the determination result to the maintenance operator through the determination result display unit 216.

The knowledge information stored in the knowledge storage unit 214 is generated by the following procedure.

First, the maintenance operator uses the failure case registration unit 211 to register past failure cases in the case storage unit 212.

The pattern learning unit 213 generates knowledge information from the failure cases stored in the case storage unit 212 and the combinations of abnormality degrees stored in the abnormality degree storage unit 210, and stores it in the knowledge storage unit 214. Here, a failure case is information indicating when, where, and what kind of failure occurred. The pattern learning unit 213 generates the knowledge information by pattern learning using a pattern classifier called a Support Vector Machine (SVM).

The SVM is described in detail in Hideki Aso, Koji Tsuda, and Noboru Murata, "Statistics of Pattern Recognition and Learning", Iwanami Shoten, pp. 107-123, 2005. In general, pattern learning estimates a one-dimensional class (pattern) from multidimensional variables. The variables used as these multidimensional variables are called features, and the d-dimensional space spanned by d features is called the feature space R^d. If the input variable is a feature variable x (x ∈ R^d) in this feature space and the output variable is a class y (y ∈ {1, -1}), then y changes when x leaves a certain region of the feature space. The boundary of the region that produces this change is called a hyperplane.

This hyperplane can be generated by pattern learning when the output values y_i for n input values x_i (i = 1, 2, ..., n) are given. In pattern learning, the distance between input values having different output values y is called the margin.

The knowledge information obtained by the pattern learning unit 213 is the threshold for detecting and classifying failures; in the feature space formed by the combinations of abnormality degrees, it is a hyperplane that separates the classes.
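As a small illustration of how a hyperplane acts as the detection threshold, the sketch below classifies a feature vector by which side of a linear boundary w·x + b = 0 it falls on. The weights, the bias, and the choice of two features (a call loss rate and a CPU usage rate) are hypothetical values picked for the example, not learned knowledge information.

```python
def classify(x, w, b):
    """Return +1 (failure) or -1 (normal) according to the side of the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

# Hypothetical hyperplane: call_loss_rate + cpu_usage_rate = 1.0
w, b = [1.0, 1.0], -1.0

print(classify([0.8, 0.9], w, b))  # high call loss and high CPU usage
print(classify([0.1, 0.2], w, b))  # both abnormality degrees low
```

In the apparatus described here, w and b are not chosen by hand; they are what pattern learning produces from the registered cases.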

If a failure determination result presented to the maintenance operator by the determination result display unit 216 turns out not to have been a failure, that fact is entered into the case storage unit 212 using the determination correction input unit 217.

Through this operation, the failure analysis apparatus 200 shown in FIG. 2, unlike the failure analysis apparatus 100 shown in FIG. 1, can detect failures without thresholds for failure detection and classification having to be set.

However, the failure analysis apparatuses described above do not reflect the failure detection sensitivity desired by the maintenance operator when generating a failure detection threshold from cases. Consequently, even if the maintenance operator's policy is to reduce overlooked failures at the expense of occasionally misdetecting a normal state as a failure, the generated threshold may be one that yields few false detections but many overlooked failures.

The present invention has been made in view of the above problem, and its object is to provide a failure analysis apparatus, a failure analysis method, and a recording medium capable of failure detection or classification that reflects the failure detection sensitivity desired by the maintenance operator.

To achieve the above object, the present invention comprises:
 abnormality degree information receiving means for sequentially receiving abnormality degree information and its identification information from a monitoring target device that sequentially outputs the abnormality degree information, which includes a plurality of index values indicating the degree of abnormality of the monitoring target device, together with the identification information of the abnormality degree information;
 type determination means for comparing each piece of abnormality degree information received by the abnormality degree information receiving means with a predetermined determination criterion and classifying each piece of abnormality degree information by type based on the comparison result;
 determination result output means for outputting the identification information of each piece of abnormality degree information in association with information indicating the type into which that abnormality degree information was classified;
 failure case registration means for receiving input of information indicating the true type for the identification information of each piece of abnormality degree information;
 a case storage unit that stores the identification information of each piece of abnormality degree information in association with the true type;
 detection sensitivity input means for receiving input of a setting parameter for updating the determination criterion; and
 pattern learning means for updating the determination criterion based on each piece of abnormality degree information received by the abnormality degree information receiving means, the information indicating the true type stored in association with the identification information of each piece of abnormality degree information, and the setting parameter.

The present invention also provides a failure analysis method using an information processing apparatus, comprising the steps of:
 the information processing apparatus sequentially receiving abnormality degree information and its identification information from a monitoring target device that sequentially outputs the abnormality degree information, which includes a plurality of index values indicating the degree of abnormality of the monitoring target device, together with the identification information of the abnormality degree information;
 the information processing apparatus comparing each piece of received abnormality degree information with a predetermined determination criterion and classifying each piece of abnormality degree information by type based on the comparison result;
 the information processing apparatus outputting the identification information of each piece of abnormality degree information in association with information indicating the type into which that abnormality degree information was classified;
 the information processing apparatus accepting input of information indicating the true type for the identification information of each piece of abnormality degree information;
 the information processing apparatus storing the identification information of each piece of abnormality degree information in association with the true type;
 the information processing apparatus accepting input of a setting parameter for updating the determination criterion; and
 the information processing apparatus updating the determination criterion based on each piece of received abnormality degree information, the information indicating the true type stored in association with the identification information of each piece of abnormality degree information, and the setting parameter.

The present invention further provides a recording medium on which a program for operating a computer is written, the program causing the computer to execute:
 a procedure of sequentially receiving abnormality degree information and its identification information from a monitoring target device that sequentially outputs the abnormality degree information, which includes a plurality of index values indicating the degree of abnormality of the monitoring target device, together with the identification information of the abnormality degree information;
 a procedure of comparing each piece of received abnormality degree information with a predetermined determination criterion and classifying each piece of abnormality degree information by type based on the comparison result;
 a procedure of outputting the identification information of each piece of abnormality degree information in association with information indicating the type into which that abnormality degree information was classified;
 a procedure of accepting input of information indicating the true type for the identification information of each piece of abnormality degree information;
 a procedure of storing the identification information of each piece of abnormality degree information in association with the true type;
 a procedure of accepting input of a setting parameter for updating the determination criterion; and
 a procedure of updating the determination criterion based on each piece of received abnormality degree information, the information indicating the true type stored in association with the identification information of each piece of abnormality degree information, and the setting parameter.

The present invention enables failure detection or classification that reflects the failure detection sensitivity desired by the maintenance operator.

FIG. 1 is a diagram showing an example of a failure analysis apparatus.
FIG. 2 is a diagram showing another example of a failure analysis apparatus.
FIG. 3 is a diagram showing the abnormality degree values used in the failure analysis apparatus shown in FIG. 2.
FIG. 4 is a block diagram showing an embodiment of the failure analysis apparatus of the present invention.
FIG. 5 is a diagram showing the table in the case storage unit shown in FIG. 4.
FIGS. 6 to 9 are flowcharts for explaining the operation of the failure analysis apparatus shown in FIG. 4.
FIG. 10 is a configuration diagram of a monitoring target for explaining one example of the operation of the failure analysis apparatus shown in FIG. 4.
FIGS. 11 to 13 are diagrams showing a feature space for explaining one example of the operation of the failure analysis apparatus shown in FIG. 4.
FIGS. 14 to 16 are diagrams showing a feature space for explaining another example of the operation of the failure analysis apparatus shown in FIG. 4.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 4 is a block diagram showing an embodiment of the failure analysis apparatus of the present invention.

As shown in FIG. 4, this embodiment is a computer 300 (comprising at least a central processing unit, a processor, and a data processing device), which is an information processing apparatus that operates under program control and is communicably connected to a system 330 comprising monitoring target devices 331 to 334.

The computer 300 includes a failure case registration unit 311, a case storage unit 312, an abnormality degree monitoring unit 301 serving as the abnormality degree information receiving means, an abnormality degree storage unit 310, a pattern learning unit 313, a knowledge storage unit 314, a pattern determination unit 315 serving as the type determination means, a determination result display unit 316 serving as the determination result output means, a determination correction input unit 317, a detection sensitivity input unit 318, and a detection sensitivity storage unit 319.

The failure case registration unit 311 is connected to the case storage unit 312. The case storage unit 312 is connected to the failure case registration unit 311 and the pattern learning unit 313. The detection sensitivity input unit 318 is connected to the detection sensitivity storage unit 319. The detection sensitivity storage unit 319 is connected to the detection sensitivity input unit 318 and the pattern learning unit 313. The pattern learning unit 313 is connected to the abnormality degree storage unit 310, the case storage unit 312, the detection sensitivity storage unit 319, and the knowledge storage unit 314. The abnormality degree storage unit 310 is connected to the pattern learning unit 313 and the abnormality degree monitoring unit 301. The knowledge storage unit 314 is connected to the pattern learning unit 313 and the pattern determination unit 315. The abnormality degree monitoring unit 301 is connected to the abnormality degree storage unit 310 and the pattern determination unit 315. The pattern determination unit 315 is connected to the knowledge storage unit 314, the abnormality degree monitoring unit 301, and the determination result display unit 316. The determination result display unit 316 is connected to the pattern determination unit 315.

In this embodiment, the knowledge information, threshold, boundary surface, and hyperplane all refer to the same thing and correspond to the determination criterion of the present invention. A feature in this embodiment corresponds to an index value of the present invention. The detection sensitivity entered by the operator corresponds to the cost shown in the tables of FIGS. 11 to 16. The detection sensitivity is a setting parameter for changing the threshold (determination criterion) and is a cost set for each of the cases described later; it corresponds to the setting parameter of the present invention.

The components described above each operate roughly as follows.

The failure case registration unit 311 accepts input of a failure occurrence time and location from a terminal (not shown) used by the maintenance operator, who is the operator in the present invention. This pair of failure occurrence time and location is called a case. It may also include the type of failure and the location of the root cause. A case is information that associates either a failure occurrence time with its location, or a time at which operation was normal with its location. Both the time and the location stored in a case may have extent, such as a period or a range. Cases comprise failure cases, which record what actually was a failure, and normal cases, which record what actually was normal. A failure case contains the failure occurrence time and location; a normal case contains the time and location at which operation was normal. A case may also contain a case type (equivalent to a class or pattern, and corresponding to the true type of the present invention). The case type is information indicating that the case is normal, or information that includes the type of failure. In that situation, a failure case contains the failure occurrence time, location, and failure type, and a normal case contains the time and location at which operation was normal together with information indicating that the case is normal. Alternatively, the case type may be held as information independent of the case. This embodiment assumes that a case does not contain the case type, although it may of course do so.

The failure case registration unit 311 may accept input of the case type together with the case. The location may be an identifier of one of the monitoring target devices 331 to 334, or anything that pinpoints where the failure occurred, such as a line name or an address. The failure occurrence time and location are included in the identification information of the abnormality degree information of the present invention. In this embodiment, the identification information of the abnormality degree information corresponds to a case. The identification information need only contain information by which the abnormality degree information can be identified, for example a uniquely assigned identifier.

The case storage unit 312 receives cases from the failure case registration unit 311 or from the determination correction input unit 317 described later, and stores the received cases.

FIG. 5 shows the table in the case storage unit 312 shown in FIG. 4.

As shown in FIG. 5, the case storage unit 312 stores a case number, time, location, and pattern in association with one another. The case number, time, and location are identification information of the abnormality degree information, and the pattern is the case type. The case number, time, and location are not each required; at least one piece of information capable of identifying the abnormality degree information is sufficient.
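The table of FIG. 5 could be modeled along the following lines. This is a sketch under assumptions: the class names `Case` and `CaseStore`, the field names, and the choice to key the store by case number are illustrative and do not come from the patent. Re-registering a case under the same number stands in for a correction entered through the determination correction input unit 317.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Case:
    case_id: int      # case number
    time: datetime    # when the failure (or normal state) was observed
    place: str        # device identifier, line name, address, etc.
    pattern: str      # case type: "normal" or a failure type

class CaseStore:
    """In-memory stand-in for the case storage unit (hypothetical design)."""

    def __init__(self):
        self._cases = {}

    def register(self, case):
        # A later entry with the same case number overwrites the earlier one.
        self._cases[case.case_id] = case

    def all_cases(self):
        return list(self._cases.values())

store = CaseStore()
observed_at = datetime(2009, 2, 20, 10, 0)
store.register(Case(1, observed_at, "device-331", "failure 1"))
store.register(Case(1, observed_at, "device-331", "normal"))  # operator correction
print(store.all_cases())
```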

The abnormality degree monitoring unit 301 acquires abnormality degree information, which includes abnormality degrees, from the monitoring target devices 331 to 334 in the monitoring target system 330, and stores the acquired abnormality degree information in the abnormality degree storage unit 310. The abnormality degree monitoring unit 301 also passes to the pattern determination unit 315 either the time information contained in the abnormality degree information or information indicating the time at which the abnormality degree monitoring unit 301 received the abnormality degree information.

The abnormality degree storage unit 310 stores, in association with one another, the abnormality degree, time, location, and value contained in the abnormality degree information that the abnormality degree monitoring unit 301 has received in the past. It may, for example, store them in such a way that the abnormality degree information identified by a given time and location can be returned.

The pattern learning unit 313 runs when the maintenance operator makes an entry in the failure case registration unit 311 or the determination correction input unit 317, or periodically, and reads from the abnormality degree storage unit 310 the abnormality degree information associated with each case stored in the case storage unit 312. The abnormality degrees (features) contained in the retrieved abnormality degree information make up the feature space used by the pattern learning unit 313. The pattern learning unit 313 also reads, from the detection sensitivity storage unit 319 described later, the detection sensitivity for each label (corresponding to a type in the present invention), such as a failure type or normal. Based on the abnormality degree information read from the abnormality degree storage unit 310 and the detection sensitivity read from the detection sensitivity storage unit 319, the pattern learning unit 313 generates a threshold (hyperplane) for detecting and classifying failures and stores it in the knowledge storage unit 314.
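The data-gathering part of this step, joining each stored case to its abnormality degree information and attaching the per-label detection sensitivity, might look like the sketch below. The dictionary keyed by (time, place), the tuple layout of a case, and the default sensitivity of 1.0 are assumptions made for illustration only.

```python
def build_training_set(cases, anomaly_store, sensitivities):
    """Join cases with stored abnormality degrees and per-label detection sensitivity.

    cases:          iterable of (time, place, label) tuples (hypothetical layout).
    anomaly_store:  dict mapping (time, place) to a list of abnormality degrees.
    sensitivities:  dict mapping a label to its detection sensitivity (cost).
    """
    samples = []
    for time, place, label in cases:
        features = anomaly_store.get((time, place))
        if features is None:
            continue  # no abnormality degree information recorded for this case
        samples.append((features, label, sensitivities.get(label, 1.0)))
    return samples

cases = [("10:00", "dev902", "failure"),
         ("10:05", "dev902", "normal"),
         ("10:10", "dev901", "failure")]
anomaly_store = {("10:00", "dev902"): [0.4, 0.9],
                 ("10:05", "dev902"): [0.01, 0.2]}
training = build_training_set(cases, anomaly_store, {"failure": 5.0, "normal": 1.0})
print(training)
```

The resulting (features, label, cost) triples are what a cost-weighted learner would consume to produce the hyperplane.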

Here, following the description in Chih-Chung Chang and Chih-Jen Lin, "LIBSVM: a library for support vector machines", 2001 (software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm), a concrete example of pattern learning is given to illustrate how the failure detection sensitivity is reflected.

 The hyperplane is derived by performing the optimization of Equation 1 under the constraints of Equation 2 in the feature space R^d. Here, ξ_i, which is described as a slack variable in Hideki Aso, Koji Tsuda, and Noboru Murata, "Statistics of Pattern Recognition and Learning," Iwanami Shoten, pp. 107-123, 2005, represents the degree to which case i lies beyond the hyperplane during learning. Because ξ_i is weighted by the cost C_{y_i} determined according to the label y_i of case i, the learned hyperplane reflects the ratio of the costs C_y among the labels. This cost C_{y_i} is the detection sensitivity.

 This example shows only two-class classification, but multi-class classification, such as that of multiple failure patterns, can be realized in the same way.
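
Equations 1 and 2 are published as images and are not reproduced in this text. Given the description above (a slack variable ξ_i per case, weighted by a label-dependent cost C_{y_i}) and the cited LIBSVM formulation, they presumably take the standard class-weighted soft-margin form sketched below. The symbols w, b, and φ do not appear in the surrounding text and are assumptions introduced here.

```latex
% Equation 1 (objective), assumed form
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\, w^{\top} w \;+\; \sum_{i} C_{y_i}\, \xi_i

% Equation 2 (constraints), assumed form
\text{subject to}\quad y_i \left( w^{\top} \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0
```

Under this form, raising C_y for one label makes margin violations by cases with that label more expensive, so the learned hyperplane moves away from those cases.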

 (Equation 1: image JPOXMLDOC01-appb-M000001, not reproduced in this text)

 (Equation 2: image JPOXMLDOC01-appb-M000002, not reproduced in this text)

 Note that the SVM provided with the above-mentioned document (Chih-Chung Chang and Chih-Jen Lin, "LIBSVM: a library for support vector machines," 2001, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm) allows this C_{y_i} to be set as a weight. However, conventional failure detection systems using pattern learning, such as that described in this document, do not mention using C_{y_i} to make the failure detection sensitivity variable.

 The knowledge storage unit 314 stores the threshold generated by the pattern learning unit 313.

 The pattern determination unit 315 receives abnormality degree information from the abnormality degree monitoring unit 301. It then reads the threshold stored in the knowledge storage unit 314 and determines whether the received abnormality degree information indicates a failure or a normal state. If it indicates a failure, the unit further determines what kind of failure it is, and passes the identification information of the abnormality degree information and the determination result to the determination result display unit 316.
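
The determination step can be pictured with a small sketch. It assumes linear hyperplanes stored as (weights, bias) pairs, one per failure type, and a one-vs-rest rule in which the failure type with the largest positive decision value wins and "normal" is returned otherwise. These choices, and the names `decision_value`, `classify`, and `planes`, are illustrative assumptions rather than the method fixed by the specification.

```python
from typing import Dict, Sequence, Tuple

Hyperplane = Tuple[Sequence[float], float]  # (weights w, bias b)


def decision_value(plane: Hyperplane, x: Sequence[float]) -> float:
    """Signed score w.x + b; positive means the failure side of the hyperplane."""
    w, b = plane
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(planes: Dict[str, Hyperplane], x: Sequence[float]) -> str:
    """Return the failure type with the largest positive score, else 'normal'."""
    best_label, best_value = "normal", 0.0
    for label, plane in planes.items():
        value = decision_value(plane, x)
        if value > best_value:
            best_label, best_value = label, value
    return best_label


# Hypothetical thresholds over (call loss rate, CPU usage rate).
planes = {
    "failure1": ([1.0, 0.0], -0.5),  # fires when the call loss rate exceeds 0.5
    "failure2": ([0.0, 1.0], -0.8),  # fires when the CPU usage rate exceeds 0.8
}
```

For instance, `classify(planes, [0.9, 0.1])` lands on the failure side of the first hyperplane only, while `classify(planes, [0.1, 0.2])` is on the normal side of both.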

 The determination result display unit 316 displays to the maintenance operator the determination result received from the pattern determination unit 315 (the pattern or type of case, corresponding to the type in the present invention) together with the identification information (case) of the abnormality degree information.

 When a determination result (the pattern or type of case, corresponding to the type in the present invention) presented to the maintenance operator by the determination result display unit 316 is wrong, the determination correction input unit 317 registers in the case storage unit 312 the case together with the type of case that the maintenance operator considers correct (corresponding to the true type in the present invention). For example, the time and place (the case) and the type of case (the true type) may be added to the case storage unit 312, or a case already stored in the case storage unit 312 may be corrected to the one the maintenance operator considers correct.

 The detection sensitivity input unit 318 accepts input of detection sensitivities from a terminal (not shown) used by the maintenance operator. The input may be accepted with a true type associated with each detection sensitivity.

 The detection sensitivity storage unit 319 receives the detection sensitivity from the detection sensitivity input unit 318 and stores it. The detection sensitivity storage unit 319 may also receive a true type together with the detection sensitivity and store the two in association with each other.
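
A per-type sensitivity table of this kind can be sketched briefly. The class name `DetectionSensitivityStore`, the fallback to a value common to all types, and the requirement that sensitivities be positive are assumptions for illustration only.

```python
from typing import Dict


class DetectionSensitivityStore:
    """Hypothetical store: one detection sensitivity per type, plus a common value."""

    def __init__(self, common: float = 1.0) -> None:
        self._common = common                  # sensitivity shared by all types
        self._per_type: Dict[str, float] = {}

    def register(self, true_type: str, sensitivity: float) -> None:
        # Assumed rule: a sensitivity is later used as a cost weight, so it must be positive.
        if sensitivity <= 0:
            raise ValueError("sensitivity must be positive")
        self._per_type[true_type] = sensitivity

    def lookup(self, true_type: str) -> float:
        # Fall back to the common value when no per-type value was entered.
        return self._per_type.get(true_type, self._common)


sensitivities = DetectionSensitivityStore(common=1.0)
sensitivities.register("failure1", 5.0)
```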

 Next, the overall operation of this embodiment is described in detail with reference to the flowcharts of FIGS. 6 to 9.

 FIGS. 6 to 9 are flowcharts for explaining the operation of the failure analysis apparatus 300 shown in FIG. 4.

 First, the abnormality degree monitoring unit 301 acquires abnormality degree information including abnormality degrees from the monitoring target system 330 (step 401) and passes the acquired abnormality degree information to the pattern determination unit 315.

 The pattern determination unit 315 uses the threshold (hyperplane) held in the knowledge storage unit 314 to determine, from the abnormality degree information received from the abnormality degree monitoring unit 301, the type of case in the monitoring target system 330. It passes the determination result (the type of case, that is, the type) and the identification information (case) of that abnormality degree information to the determination result display unit 316 (step 402).

 Next, if the pattern determination unit 315 has determined in step 402 that there is a failure, the determination result display unit 316 displays to the maintenance operator the pattern (type) and the identification information of the abnormality degree received from the pattern determination unit 315 (step 403).

 Next, the maintenance operator inputs to the failure case registration unit 311 or the determination correction input unit 317, as a case and its true type, the time at which a failure occurred or at which the system was normal, the place, and the type of case. The failure case registration unit 311 or the determination correction input unit 317 stores the input case in the case storage unit 312 (step 601). The maintenance operator also sets the detection sensitivity for each type in the detection sensitivity storage unit 319 (step 500), and inputs the detection sensitivity thus set for each type via the detection sensitivity input unit 318 (step 501). Here, a high detection sensitivity for normal has the same meaning as making failures in general harder to detect. The input information may therefore consist of a detection sensitivity for each type and a detection sensitivity common to all types.

 Next, the pattern learning unit 313 generates, by pattern learning, a threshold for making failure determinations (step 602). This step may instead be executed in response to a separate instruction from the maintenance operator.

 To generate the threshold for failure determination from the cases, the pattern learning unit 313 acquires, for every case in the case storage unit 312, the system information associated with the time or place of that case from the abnormality degree storage unit 310 (steps 701 and 702).
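
Steps 701 and 702 amount to joining the registered cases with the stored monitoring data. The sketch below assumes both stores are dictionaries keyed on a (time, place) pair and that cases without stored data are skipped; the function name `build_training_set` and both assumptions are illustrative, not taken from the specification.

```python
from typing import Dict, List, Tuple

Key = Tuple[int, str]  # (time, place), assumed identification of a case


def build_training_set(
    cases: Dict[Key, str],                  # case store: key -> true type label
    observations: Dict[Key, List[float]],   # stored data: key -> feature vector
) -> Tuple[List[List[float]], List[str]]:
    """Pair each registered case with the features recorded at its time and place."""
    xs: List[List[float]] = []
    labels: List[str] = []
    for key in sorted(cases):
        features = observations.get(key)
        if features is None:
            continue  # no monitoring data recorded for this case, so skip it
        xs.append(features)
        labels.append(cases[key])
    return xs, labels


cases = {
    (100, "device-901"): "failure1",
    (200, "device-902"): "normal",
    (300, "device-902"): "failure2",   # no matching observation below
}
observations = {
    (100, "device-901"): [0.4, 0.9],
    (200, "device-902"): [0.0, 0.1],
}
xs, labels = build_training_set(cases, observations)
```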

 Using feature vectors composed of the abnormality degrees and situation information contained in the system information associated with each case obtained from the case storage unit 312, the pattern learning unit 313 learns a hyperplane that classifies each piece of system information into the pattern given by its type of case (step 703), and generates the hyperplane. In doing so, the pattern learning unit 313 reads the detection sensitivities stored in the detection sensitivity storage unit 319 and performs the learning using each detection sensitivity as the weight of the cost incurred when a case lies beyond the hyperplane.
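
The use of detection sensitivities as cost weights can be sketched with a small class-weighted linear SVM. This is a minimal illustration under stated assumptions: a linear kernel, two labels (+1 for failure, -1 for normal), and plain subgradient descent on the weighted hinge loss instead of the solver used by LIBSVM. The function name `train_weighted_svm` and the hyperparameters `epochs` and `lr` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def train_weighted_svm(
    xs: Sequence[Sequence[float]],
    ys: Sequence[int],            # +1 = failure, -1 = normal
    cost: Dict[int, float],       # per-label cost C_y (the detection sensitivity)
    epochs: int = 2000,
    lr: float = 0.01,
) -> Tuple[List[float], float]:
    """Minimise 0.5*||w||^2 + sum_i C_{y_i} * max(0, 1 - y_i*(w.x_i + b))."""
    dim = len(xs[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)          # gradient of the 0.5*||w||^2 term
        grad_b = 0.0
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:      # the case violates the margin (positive slack)
                c = cost[y]       # violations are charged at the label's cost
                for j in range(dim):
                    grad_w[j] -= c * y * x[j]
                grad_b -= c * y
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# One feature (for example a call loss rate); equal sensitivities for both labels.
xs = [[0.0], [0.2], [0.8], [1.0]]
ys = [-1, -1, 1, 1]
w, b = train_weighted_svm(xs, ys, cost={1: 1.0, -1: 1.0})
predictions = [1 if w[0] * x[0] + b > 0 else -1 for x in xs]
```

Passing a larger value for `cost[1]` than for `cost[-1]` charges more for failure cases that fall on the wrong side, which is the mechanism by which a higher failure sensitivity enlarges the region judged to be a failure.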

 The pattern learning unit 313 stores the learned hyperplane in the knowledge storage unit 314, and the pattern determination unit 315 uses the hyperplane stored in the knowledge storage unit 314 to classify the pattern of each piece of abnormality degree information received from the abnormality degree monitoring unit 301 (step 704).

 Next, the effects of this embodiment are described.

 In this embodiment, the detection sensitivity that the maintenance operator assigns to each type of failure, or to failures in general, is given as the cost incurred when a case labeled as that failure or as normal lies beyond the hyperplane in the feature space. The threshold represented by the generated hyperplane therefore reflects the maintenance operator's failure detection policy. This makes it possible to perform failure detection and classification with many false detections but few oversights or, conversely, with few false detections but many oversights.

 Further, according to this embodiment, when a hyperplane separating failures of multiple types or places in the feature space is generated, the detection sensitivity value is made larger according to the severity of each failure input by the maintenance operator, and the larger the value, the larger the cost of exceeding the hyperplane toward cases of other failure types or normal cases. As a result, the region that the generated hyperplane judges to be that failure contains more failure cases. Applying the threshold formed by this hyperplane to system monitoring data for failure detection makes data more likely to be judged a failure. Conversely, the smaller the input failure detection sensitivity value, the smaller the cost, and the less likely data is to be judged a failure.

 The operation of the failure analysis apparatus 300 described above is explained below using specific examples.

 FIG. 10 is a configuration diagram of a monitoring target for explaining one example of the operation of the failure analysis apparatus 300 shown in FIG. 4. FIGS. 11 to 13 are diagrams showing feature spaces for explaining one example of the operation of the failure analysis apparatus 300 shown in FIG. 4.

 As shown in FIG. 10, in this example the monitoring target system 330 contains monitoring target devices 901 and 902, which communicate with each other. The failure analysis apparatus 300 of the present invention acquires, as abnormality degrees, the call loss rate 904 of communication with the monitoring target device 902 from the monitoring target device 901, and the CPU usage rate 905 of the monitoring target device 902 from the monitoring target device 902. With these forming the feature space, the pattern determination unit 315 identifies the type of failure.

 At this time, the maintenance operator registers detection sensitivities through the detection sensitivity input unit 318, and from this information a hyperplane in the feature space representing the detection threshold is generated.

 When the maintenance operator sets the same detection sensitivity value for normal and for failure, as shown by the setting values 1010 in FIG. 11, the generated hyperplane 1003 is such that the proportion of failure cases lying beyond the hyperplane 1003 on the normal region 1005 side is about the same as the proportion of normal cases lying beyond the hyperplane 1003 on the failure region side.

 When failures are detected with the threshold represented by this hyperplane 1003, monitoring results near the hyperplane 1003 are sometimes judged normal and sometimes judged abnormal.

 Next, suppose the maintenance operator finds it burdensome that monitoring results which could be regarded as normal, such as those near the hyperplane 1003, are frequently detected as failures, because confirming whether a failure has actually occurred takes time. The operator decides to lower the detection sensitivity so that only cases that appear truly serious are detected, and sets the detection sensitivity for normal high, as shown by the setting values 1110 in FIG. 12.

 The detection threshold represented by the hyperplane 1103 generated in this case judges all data with features near the hyperplane 1003 of FIG. 11 to be normal, so that fewer failures are detected.

 Conversely, suppose the maintenance operator wants to detect anything even slightly suspected of being a failure and, to raise the detection sensitivity, sets the detection sensitivity for failure high, as shown by the setting values 1210 in FIG. 13.

 The detection threshold represented by the hyperplane 1203 generated in this case judges all data with features near the hyperplane 1203 of FIG. 13 to be failures, so that everything even slightly suspected of being a failure is detected.
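
The effect described for FIGS. 11 to 13 can be reproduced in a deliberately simplified setting. The sketch below is not the SVM of this embodiment: it assumes a single feature, the rule "degree above the threshold means failure", and an exhaustive search that minimizes the cost-weighted count of errors. The sample data, the function name `best_threshold`, and the label strings are all hypothetical.

```python
from typing import Dict, List, Tuple

Case = Tuple[int, str]  # (abnormality degree in percent, true label)


def best_threshold(cases: List[Case], cost: Dict[str, float]) -> int:
    """Pick t minimising cost-weighted errors for the rule 'degree > t means failure'."""
    values = sorted({v for v, _ in cases})
    candidates = [values[0] - 1] + values   # first candidate flags everything as failure
    best_t, best_cost = candidates[0], float("inf")
    for t in candidates:
        misses = sum(1 for v, lab in cases if lab == "failure" and v <= t)
        false_alarms = sum(1 for v, lab in cases if lab == "normal" and v > t)
        total = cost["failure"] * misses + cost["normal"] * false_alarms
        if total < best_cost:               # strict '<' keeps the lowest tied threshold
            best_t, best_cost = t, total
    return best_t


# Overlapping normal and failure cases along one axis (hypothetical data).
cases = [(10, "normal"), (20, "normal"), (40, "normal"), (50, "normal"), (80, "normal"),
         (30, "failure"), (60, "failure"), (70, "failure"), (90, "failure"), (100, "failure")]

equal = best_threshold(cases, {"normal": 1.0, "failure": 1.0})
failure_sensitive = best_threshold(cases, {"normal": 1.0, "failure": 5.0})
normal_sensitive = best_threshold(cases, {"normal": 5.0, "failure": 1.0})
```

With these data the threshold sits in the middle for equal sensitivities, moves down when the failure sensitivity is raised (more data judged to be a failure), and moves up when the normal sensitivity is raised (fewer detections), which mirrors the behavior of the hyperplanes 1003, 1203, and 1103.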

 Next, another example of the operation of the failure analysis apparatus 300 shown in FIG. 4 is described.

 FIGS. 14 to 16 are diagrams showing feature spaces for explaining another example of the operation of the failure analysis apparatus 300 shown in FIG. 4.

 In this example, a system similar to that of FIG. 10 is monitored, but here there are assumed to be two types of failure that raise the CPU usage rate and the call loss rate.

 If the maintenance operator sets detection sensitivities such as the setting values 1310 in FIG. 14, the region judged to be failure 1 is the region enclosed by the hyperplane 1303, and the region judged to be failure 2 is the region enclosed by the hyperplane 1304.

 Next, suppose the maintenance operator judges failure 1 to be a serious failure and sets its detection sensitivity high, as in the setting values 1410 in FIG. 15. The generated hyperplane 1403 then also judges monitoring results near the boundary of the hyperplane 1303 in FIG. 14 to be failure 1, which reduces the rate at which this failure is overlooked.

 In this case, the detection sensitivity for the other failure hardly changes.

 Conversely, suppose the maintenance operator judges failure 2 to be an important failure and sets its detection sensitivity high, as in the setting values 1510 in FIG. 16. The generated hyperplane 1504 then also judges monitoring results near the boundary of the hyperplane 1304 in FIG. 14 to be failure 2, which reduces the rate at which this failure is overlooked.

 Here too, the detection sensitivity for the other failure hardly changes.

 The present invention is applicable to uses such as monitoring a system made up of computers, network equipment, and communication devices, and detecting and classifying failures.

 In the present invention, the processing in the failure analysis apparatus may be realized not only by the dedicated hardware described above but also by recording a program for realizing its functions on a recording medium readable by the failure analysis apparatus, and having the failure analysis apparatus read and execute the program recorded on that medium. A recording medium readable by the failure analysis apparatus refers to a removable recording medium such as an IC card, a memory card, a floppy disk (registered trademark), a magneto-optical disk, a DVD, or a CD, as well as an HDD or the like built into the failure analysis apparatus. The program recorded on the recording medium is read, for example, by a control block, and processing similar to that described above is performed under the control of the control block.

 The invention of the present application has been described above with reference to examples, but it is not limited to those examples. Various changes that those skilled in the art can understand may be made to the configuration and details of the invention within its scope.

 This application claims priority based on Japanese Patent Application No. 2008-058440, filed on March 7, 2008, the entire disclosure of which is incorporated herein.

Claims (6)

 A failure analysis apparatus comprising:
 abnormality degree information receiving means for sequentially receiving abnormality degree information and identification information of the abnormality degree information from a monitoring target apparatus that sequentially outputs the abnormality degree information, which includes a plurality of index values indicating a degree of abnormality of the monitoring target apparatus, together with the identification information of the abnormality degree information;
 type determination means for comparing each piece of abnormality degree information received by the abnormality degree information receiving means with a predetermined determination criterion, and classifying each piece of abnormality degree information by type based on a result of the comparison;
 determination result output means for outputting the identification information of each piece of abnormality degree information in association with information indicating the type into which that abnormality degree information has been classified;
 failure case registration means for receiving, for the identification information of each piece of abnormality degree information, an input of information indicating a true type;
 a case storage unit that stores the identification information of each piece of abnormality degree information in association with the true type;
 detection sensitivity input means for receiving an input of a setting parameter for updating the determination criterion; and
 pattern learning means for updating the determination criterion based on each piece of abnormality degree information received by the abnormality degree information receiving means, the information indicating the true type stored in association with the identification information of each piece of abnormality degree information, and the setting parameter.
 The failure analysis apparatus according to claim 1,
 wherein the information indicating the true type is information indicating whether the monitoring target apparatus is normal or abnormal.
 The failure analysis apparatus according to claim 1,
 wherein the failure case registration means receives the information indicating the true type from a terminal operated by an operator.
 The failure analysis apparatus according to claim 1,
 wherein the detection sensitivity input means receives the detection sensitivity from a terminal operated by an operator.
 A failure analysis method using an information processing apparatus, the method comprising:
 a step in which the information processing apparatus sequentially receives abnormality degree information and identification information of the abnormality degree information from a monitoring target apparatus that sequentially outputs the abnormality degree information, which includes a plurality of index values indicating a degree of abnormality of the monitoring target apparatus, together with the identification information of the abnormality degree information;
 a step in which the information processing apparatus compares each piece of received abnormality degree information with a predetermined determination criterion, and classifies each piece of abnormality degree information by type based on a result of the comparison;
 a step in which the information processing apparatus outputs the identification information of each piece of abnormality degree information in association with information indicating the type into which that abnormality degree information has been classified;
 a step in which the information processing apparatus accepts, for the identification information of each piece of abnormality degree information, an input of information indicating a true type;
 a step in which the information processing apparatus stores the identification information of each piece of abnormality degree information in association with the true type;
 a step in which the information processing apparatus accepts an input of a setting parameter for updating the determination criterion; and
 a step in which the information processing apparatus updates the determination criterion based on each piece of received abnormality degree information, the information indicating the true type stored in association with the identification information of each piece of abnormality degree information, and the setting parameter.
 A recording medium on which a program for operating a computer is written,
 the program causing the computer to execute:
 a procedure of sequentially receiving abnormality degree information and identification information of the abnormality degree information from a monitoring target apparatus that sequentially outputs the abnormality degree information, which includes a plurality of index values indicating a degree of abnormality of the monitoring target apparatus, together with the identification information of the abnormality degree information;
 a procedure of comparing each piece of received abnormality degree information with a predetermined determination criterion, and classifying each piece of abnormality degree information by type based on a result of the comparison;
 a procedure of outputting the identification information of each piece of abnormality degree information in association with information indicating the type into which that abnormality degree information has been classified;
 a procedure of accepting, for the identification information of each piece of abnormality degree information, an input of information indicating a true type;
 a procedure of storing the identification information of each piece of abnormality degree information in association with the true type;
 a procedure of accepting an input of a setting parameter for updating the determination criterion; and
 a procedure of updating the determination criterion based on each piece of received abnormality degree information, the information indicating the true type stored in association with the identification information of each piece of abnormality degree information, and the setting parameter.
PCT/JP2009/052992 2008-03-07 2009-02-20 Error analysis device, error analysis method, and recording medium Ceased WO2009110326A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2008058440A JP2009217381A (en) 2008-03-07 2008-03-07 Failure analysis system, failure analysis method, failure analysis server, and failure analysis program
JP2008-058440 2008-03-07

Publications (1)

Publication Number Publication Date
WO2009110326A1 true WO2009110326A1 (en) 2009-09-11

Family

ID=41055887

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2009/052992 Ceased WO2009110326A1 (en) 2008-03-07 2009-02-20 Error analysis device, error analysis method, and recording medium

Country Status (2)

Country Link
JP (1) JP2009217381A (en)
WO (1) WO2009110326A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5818953B2 (en) * 2010-06-04 2015-11-18 三菱電機株式会社 Broadcast receiving apparatus and broadcast receiving method
JP5541130B2 (en) 2010-12-10 2014-07-09 富士通株式会社 Management device, management method, and management program
JP6207078B2 (en) * 2014-02-28 2017-10-04 三菱重工メカトロシステムズ株式会社 Monitoring device, monitoring method and program

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05120058A (en) * 1991-10-24 1993-05-18 Nec Ibaraki Ltd Faulty device analysis dictionary
JPH0744526A (en) * 1993-07-29 1995-02-14 Kubota Corp Case-based failure diagnosis system for electronic devices
JPH11177549A (en) * 1997-12-09 1999-07-02 Fujitsu Ltd Traffic monitoring device and traffic monitoring method
JP2005020713A (en) * 2003-06-02 2005-01-20 Ricoh Co Ltd Image forming system, image forming method, image forming program, and recording medium
JP2008009842A (en) * 2006-06-30 2008-01-17 Hitachi Ltd Computer system control method and computer system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002318734A (en) * 2001-04-18 2002-10-31 Teamgia:Kk Method and system for processing communication log
JP2005198970A (en) * 2004-01-19 2005-07-28 Konica Minolta Medical & Graphic Inc Medical image processor

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Proceedings of the Fourth International Conference on Machine Learning and Cybernetics", 21 August 2005, article JING WU ET AL.: "A Study on Network Fault Knowledge Acquisition based on Support Vector Machine", pages: 3893 - 3898 *
"The Transactions of the Institute of Electronics, Information and Communication Engineers", vol. J87-B, 1 April 2004, article MIYAMOTO T. ET AL.: "SVM o Mochiita Network Traffic kara no Ijo Kenshutsu", pages: 593 - 598 *
ASADA T. ET AL.: "Kando o Koryo shita SVM ni yoru Chikujiteki Tsuika Gakushu to Bokyaku", IEICE TECHNICAL REPORT, vol. 104, no. 760, 23 March 2005 (2005-03-23), pages 172 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016208159A1 (en) * 2015-06-26 2016-12-29 日本電気株式会社 Information processing device, information processing system, information processing method, and storage medium
US11057399B2 (en) 2015-06-26 2021-07-06 Nec Corporation Information processing device, information processing system, information processing method, and storage medium for intrusion detection by applying machine learning to dissimilarity calculations for intrusion alerts

Also Published As

Publication number Publication date
JP2009217381A (en) 2009-09-24

CN109767544A (en) Image analysis method and image analysis system of securities
KR20230024159A (en) System and method for detecting incorrect contour

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 09717515; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 09717515; Country of ref document: EP; Kind code of ref document: A1)