WO2015145676A1 - Monitoring computer and monitoring method - Google Patents
Monitoring computer and monitoring method
- Publication number
- WO2015145676A1 (PCT/JP2014/058918)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- path
- monitoring
- probe
- paths
- resource
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L12/00—Data switching networks
- H04L12/64—Hybrid switching systems
- H04L12/6418—Hybrid transport
Definitions
- the present invention relates to a technique for measuring the performance of an IT system.
- a method based on baseline analysis is often used as a method for detecting a sign of a performance failure in an IT (Information Technology) system.
- In this method, a program (a probe) for measuring the performance of the IT system is installed in the IT system, and the result measured by the probe is compared with the measurement result at normal time (the baseline).
- Using a predetermined threshold value, it can be determined whether or not the measurement result deviates greatly from the baseline.
- A sign of a performance failure is determined to exist when the measurement result of the probe deviates greatly from the baseline, or when the frequency with which the probe measurement result temporarily shows an outlier outside the baseline (hereinafter referred to as a spike) increases.
- Patent Document 1 describes that a DPI device that performs measurement is arranged at a port of a switch through which a path with a high probability of failure or a path with concentrated flows passes.
- the time interval of measurement by the probe tends to be narrow so that the spike can be detected without missing it. This is because, for example, if spikes that can be detected by finely measuring at intervals of seconds are measured at intervals of minutes, the measurement results are statistically averaged and the spikes cannot be seen.
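The averaging effect described above can be illustrated with a small numeric sketch. All values here (a 5 ms baseline, a 20 ms allowed deviation, one 65 ms sample) are hypothetical and chosen only for illustration; they do not come from the patent.

```python
from statistics import mean

# Hypothetical response times in milliseconds: one sample per second for one minute.
samples_ms = [5.0] * 60
samples_ms[30] = 65.0  # a single one-second spike

baseline_ms = 5.0    # assumed normal-time value
threshold_ms = 20.0  # assumed allowed deviation from the baseline

# Second-order measurement: every sample is compared with the baseline.
spikes = [i for i, v in enumerate(samples_ms) if v - baseline_ms > threshold_ms]

# Minute-order measurement: only the one-minute average is available.
minute_avg = mean(samples_ms)
minute_flagged = minute_avg - baseline_ms > threshold_ms
```

With these numbers the second-order measurement flags sample 30, while the one-minute average is only 6 ms and stays well inside the allowed deviation.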
- This means that the amount of resources used for the measurement, such as CPU (Central Processing Unit) and memory, and the space for storing the measurement results increase.
- In addition, the cause of a spike may not be identified (isolated). For example, even if the response time to I/O (Input/Output) requests to the storage is measured and a spike in which the response time temporarily increases is detected, it may not be possible to isolate, from that single phenomenon alone, which resource in the storage is strained.
- This method can identify the resource that caused the spike with high accuracy.
- To do so, it is necessary to detect more spikes, which requires placing a larger number of probes.
- This amounts to placing probes that measure at time intervals fine enough to detect small spikes on most of the servers connected to the storage, on the order of hundreds to thousands.
- An object of the present invention is to provide a technique for determining a decrease in performance of an IT system at a low cost.
- A management computer according to one aspect manages an information processing system that executes information processing using paths in which a plurality of resource elements are connected.
- A monitoring resource element (a resource element to be monitored) for which the number of passing paths has not reached the minimum number of paths is treated as an uncovered monitoring resource element.
- From the paths that can be monitored, the path that passes through the most uncovered monitoring resource elements is selected, and the selected path is set as a monitoring path, that is, a path to be monitored.
- The management computer has probe management means for setting the probe that monitors the monitoring path as a monitoring probe, and statistical processing means for determining, by isolation based on co-occurrence patterns in the monitoring results of the monitoring probes, the monitoring resource element that caused the performance degradation.
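The selection rule summarized above (repeatedly pick the path that passes through the most uncovered monitoring resource elements) can be sketched as a greedy loop. This is a minimal sketch, not the claimed implementation: the function name, the data layout, and the tie-breaking by path ID are assumptions made for illustration.

```python
def select_monitoring_paths(paths, monitored, min_paths):
    """Greedy sketch. paths: path ID -> set of resource element IDs it passes through.
    monitored: set of resource element IDs to be monitored (hypothetical layout)."""
    passing = {r: 0 for r in monitored}  # number of selected paths through each element
    selected = []
    remaining = dict(paths)
    while remaining:
        # Elements whose passing-path count has not reached the minimum are uncovered.
        uncovered = {r for r in monitored if passing[r] < min_paths}
        if not uncovered:
            break
        # Pick the path through the most uncovered elements (ties: smallest path ID).
        best = max(sorted(remaining), key=lambda p: len(remaining[p] & uncovered))
        if not remaining[best] & uncovered:
            break  # no remaining path improves coverage
        selected.append(best)
        for r in remaining.pop(best) & set(monitored):
            passing[r] += 1
    return selected, passing
```

For example, with three paths `P1` (PT1, PR1), `P2` (PT2, PR1), `P3` (PT3, PR2), monitored elements PR1 and PR2, and a minimum of one path, the sketch selects `P1` and `P3` and leaves `P2` unmonitored.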
- the present invention makes it possible to isolate resources that cause performance problems at the lowest possible cost.
- FIG. 1 is a system configuration diagram of an information processing system according to a first embodiment. A diagram for explaining a path.
- FIG. 3 is a diagram illustrating an example of a resource configuration table 30.
- FIG. 4 is a diagram showing an example of a probe configuration management table 40.
- A diagram showing an example of the monitoring request table.
- FIG. 15 is a supplementary explanatory diagram of the flowcharts of FIGS. A flowchart of the probe reselection process for a change of the path configuration.
- FIG. 17 is a supplementary explanatory diagram of the flowchart of FIG. 16. A flowchart of the probe selection process when a monitoring target is added. A diagram showing a resource selection screen. A diagram showing the presentation screen of a probe selection plan. A diagram showing the monitoring result screen.
- FIG. 1 is a system configuration diagram of an information processing system according to the first embodiment.
- the management computer 1, server 19, and storage 12 are connected via a LAN 11.
- the management computer 1 collects information from the server 19 and the storage 12 via the LAN 11 and transmits operation instructions to the server 19 and the storage 12 via the LAN 11.
- the server 19 and the storage 12 are also connected by a SAN (Storage Area Network) 18.
- the server 19 transmits an I/O processing request to the storage 12 via the SAN 18, and the storage 12 processes the I/O request and returns a response to the server 19.
- the management computer 1 includes a CPU 2, a memory 3, a computer storage 4, a display I/F 8, and an NW I/F 9.
- the display I/F 8 is connected to the display device 10, and the NW I/F 9 is connected to the LAN 11.
- the computer storage 4 stores a probe management program 5, a collection program 6, and a statistical processing program 7. These programs are read into the memory 3 at the time of startup and executed by the CPU 2.
- the memory 3 stores a resource configuration table 30, a probe configuration table 40, a monitoring request table 50, a co-occurrence condition table 60, a spike history table 70, and a resource performance history table 80.
- the resource configuration table 30 stores configuration information of infrastructure resources that are managed by the management computer 1.
- the server 19 includes a CPU 20, a memory 21, a display I/F 24, an in-computer storage 25, an HBA (Host Bus Adapter) 26, and an NW I/F 27.
- the NW I/F 27 is connected to the LAN 11 and the HBA 26 is connected to the SAN 18.
- the application 22 and the probe 23 are stored in the memory 21. These are programs that are executed by the CPU 20.
- the VOL 28 is a logical device (not shown) created by the storage 12. From the server 19, this logical device is recognized as a disk area.
- the application 22 requests the VOL 28 to read or write data. At this time, an I/O request is issued to the storage 12, which holds the actual data of the VOL 28. An I/O request issued from the application 22 is delivered from the HBA 26 to the storage 12 through the SAN 18. The storage 12 processes the I/O request and returns the result to the application 22 by following the same route in reverse.
- The probe 23 measures the processing of this I/O request. For example, the probe measures the response time from when an I/O request is issued to the storage 12 until the result returns. The probe 23 also measures the number of I/O requests processed per second (IOPS). The probe 23 analyzes the data measured in this way by the above-described baseline monitoring technique and detects spikes (measured values that deviate significantly from the average measured values). The detected spikes are collected by the collection program 6 and stored in the spike history table 70.
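As an illustration of the baseline comparison performed by the probe, the following sketch flags response times that lie far above the average of normal-time measurements. The text only says that spikes deviate significantly from the average; the "mean plus k standard deviations" rule, the function name, and the default `k=3.0` are assumptions chosen for this example.

```python
from statistics import mean, stdev

def detect_spikes(history, recent, k=3.0):
    """Return (index, value) pairs in `recent` that exceed the baseline limit.
    history: normal-time response times used as the baseline (hypothetical input).
    k: assumed multiplier on the standard deviation; not specified in the source."""
    base = mean(history)
    limit = base + k * stdev(history)
    return [(i, v) for i, v in enumerate(recent) if v > limit]
```

With a baseline history of roughly 10 ms, a recent series `[10, 12, 40, 11]` yields a single spike at index 2.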
- the probe 23 has a function of changing the measurement interval. That is, the probe 23 can perform the measurement described above at a relatively short time interval of the order of seconds, or can be performed at a relatively long time interval of the order of minutes. It is also possible to simultaneously measure the second order and the minute order, stop the second order measurement, and perform only the minute order measurement. Such a change in the measurement interval can be performed by the determination of the probe 23 itself, or can be changed by the determination of an external program, for example, the probe management program 5 of the management computer 1.
- The probe 23 can be implemented, for example, as a driver of an OS (not shown) that directly measures individual I/O requests, or as a program that periodically collects statistical information on the results of I/O request measurement performed by the OS.
- the storage 12 includes a SW (switch) 13, a Port 14, a Processor 15, a Pool 16, and a Cache 17. The Port 14, Processor 15, Pool 16, and Cache 17 are mutually connected via the SW 13. The Port 14 is connected to the SAN 18.
- Pool 16 is a storage area composed of a plurality of disks and stores data.
- One Pool 16 may be composed of a plurality of types of media (SSD, SAS HDD).
- An I/O request issued from the application 22 to the VOL 28 is processed by the Processor 15 through the Port 14. For example, when the I/O request is a data read, the Processor 15 checks whether the requested data is in the Cache 17, and if not, acquires the data from the Pool 16 storing the data and returns the data to the application 22.
- FIG. 2 is a diagram for explaining the path configuration.
- FIG. 2A is an explanatory diagram of the path.
- the path is obtained by virtually connecting the resources (Port 14, Processor 15, Pool 16, and Cache 17) in the storage 12 used by the VOL 28. Each of these resources is referred to as a resource element.
- I/O requests to the VOL 28 are processed using resource elements on this path.
- the probe 23 monitors an I/O request. This is described as monitoring the path.
- the response time measurement described above is an example of path monitoring.
- a path monitored by the probe is referred to as a monitoring path.
- FIG. 2B is an explanatory diagram of a path group.
- a path group is a group composed of paths having a common part of resource elements.
- FIG. 2B illustrates a path from VOL1 (path 1) and a path from VOL2 (path 2).
- Path 1 and path 2 are common except for Port, and belong to one path group.
- a common part of this path is referred to as a common part or a common path.
- a portion that is not a common portion is described as a single portion (Port corresponds to a single portion in FIG. 2B).
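The path group notion can be sketched as grouping paths by the resource elements they share outside the single part. The data layout (a mapping from resource type to resource ID per path), the function name, and the choice of Port as the single part are assumptions that mirror the FIG. 2B description; the resource IDs are illustrative.

```python
from collections import defaultdict

RESOURCE_TYPES = ("Port", "Processor", "Cache", "Pool")

def group_paths(paths, single_type="Port"):
    """Group paths whose resource elements match everywhere except `single_type`.
    paths: path ID -> {resource type: resource ID} (hypothetical layout)."""
    groups = defaultdict(list)
    for path_id, elements in paths.items():
        # The common part is every resource element except the single part.
        common = tuple(elements[t] for t in RESOURCE_TYPES if t != single_type)
        groups[common].append(path_id)
    return dict(groups)

paths = {
    "path1": {"Port": "PT1", "Processor": "PR1", "Cache": "CA1", "Pool": "PO1"},
    "path2": {"Port": "PT2", "Processor": "PR1", "Cache": "CA1", "Pool": "PO1"},
    "path3": {"Port": "PT3", "Processor": "PR2", "Cache": "CA2", "Pool": "PO2"},
}
groups = group_paths(paths)
```

Here `path1` and `path2` differ only in their Port, so they fall into one path group, while `path3` forms a group of its own.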
- FIG. 2C is an explanatory diagram of a single intersection.
- In FIG. 2B, there are two VOLs 28, and path 1 and path 2 extend from them respectively. These paths intersect at the Processor.
- In this case, the Processor is described as a single intersection.
- Path 1 and path 2 have a single intersection.
- A monitoring path passing through a resource element is referred to as a passing path of that element, and the number of such monitoring paths is referred to as the number of passing paths.
- In this example, the number of passing paths of the Processor is 2, and that of each of the other resource elements is 1. When the number of passing paths of a resource element reaches the specified value, the resource element is said to be covered.
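The number of passing paths can be computed by counting, for each resource element, how many monitoring paths go through it. In this sketch the resource IDs are illustrative, and "covered" is interpreted as "count is at least the minimum number of paths", in line with the threshold checks described later in the text; that interpretation is an assumption.

```python
from collections import Counter

def passing_path_counts(monitoring_paths):
    """monitoring_paths: path ID -> list of resource element IDs (hypothetical layout)."""
    counts = Counter()
    for elements in monitoring_paths.values():
        counts.update(set(elements))  # each path counts once per element
    return counts

def is_covered(counts, resource_id, min_paths):
    # Assumed interpretation: covered when the count reaches the minimum.
    return counts[resource_id] >= min_paths

# Two paths that intersect only at the Processor, as in the single-intersection example.
counts = passing_path_counts({
    "path1": ["PT1", "PR1", "CA1", "PO1"],
    "path2": ["PT2", "PR1", "CA2", "PO2"],
})
```

With a minimum of two paths, only `PR1` is covered in this example; every other element has a count of 1.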
- FIG. 3 is a diagram for explaining a schematic operation of the information processing system according to the present embodiment.
- the collection program 6 collects the configuration information of the resources inside the storage from the storage 12 and stores them in the resource configuration table 30.
- the collection program 6 collects the configuration information of the server 19 and the configuration information of the probe 23 and the VOL 28 from the probe 23 of the server 19 and stores them in the probe configuration table 40.
- the probe management program 5 displays a screen for accepting a monitoring request from the administrator on the display device 10 via the display I/F 8 and stores the input from the administrator in the monitoring request table 50.
- the probe management program 5 refers to the contents stored in the resource configuration table 30 and the probe configuration table 40, determines which probes 23 are to measure at second-order intervals, and instructs those probes 23 to measure. At the same time, the probe management program 5 creates rules (co-occurrence conditions) for identifying which resource in the storage 12 is the cause of a spike from the combination of spikes measured by the probes 23, and stores them in the co-occurrence condition table 60.
- the probe 23 instructed to monitor measures I/O requests at fine intervals and detects spikes.
- the collection program 6 collects spike records detected from the probe 23 and stores them in the spike history table 70. Further, the collection program 6 collects the performance information of each resource (measured in the minute order) from the storage 12 and stores it in the resource performance history table 80.
- the statistical processing program 7 analyzes the spike information stored in the spike history table 70 in accordance with the co-occurrence conditions stored in the co-occurrence condition table 60, identifies the resource causing the spike, and records the result in the spike history table 70.
- the statistical processing program 7 displays the result on the display device 10 and presents it to the administrator.
- FIG. 4 is a diagram illustrating an example of the resource configuration table 30.
- the resource configuration table 30 stores resource configuration information in the storage 12. That is, for each storage 12 (uniquely identified by the storage ID 31), internal resources (uniquely identified by the resource ID 33) are grouped and stored for each resource type 32.
- the resource type 32 includes, for example, Port, Processor, Cache, and Pool. In FIG. 4, the resource ID 33 of Port 14 is written as PT1, PT2, ..., the resource ID 33 of Processor 15 as PR1, PR2, ..., the resource ID 33 of Cache 17 as CA1, CA2, ..., and the resource ID 33 of Pool 16 as PO1, PO2, .... This notation is used in the following description.
- FIG. 5 is a diagram showing an example of the probe configuration management table 40.
- the probe configuration table 40 stores the monitoring contents of each probe 23, that is, which probe 23 (identified by probe ID 41) operates on which server 19 (identified by server ID 43) and which VOL 28 (identified by VOL ID) it monitors. For a target monitored at second-order intervals, the monitoring flag 42 stores “Y”.
- the probe configuration table 40 stores information on the path 46 (identified by the path ID 44), that is, resources in the storage 12 used by each VOL 28.
- the resource ID 48 of the resource configuring each path 46 (the same identifier as the resource ID of the resource configuration table 30) and the resource type 47 are stored.
- a path group ID 49 is stored as attached information of each path 46.
- the path group ID 49 indicates information for distinguishing the single part / common part of the path 46 and the ID of the path group to which the common part belongs.
- When one path 46 includes both null and a value other than null (0 in the figure), as shown in rows 401a and 401b of FIG. 5, a resource whose path group ID is null belongs to the single part.
- A resource whose path group ID is not null belongs to the common part, and the path belongs to path group 0.
- FIG. 6 is a diagram illustrating an example of the monitoring request table 50.
- the monitoring request table 50 stores the administrator's request (identified by a request ID 51), the device designated for monitoring (identified by a device ID 52, which is a server ID 43 or a storage ID 31), the resource designated for monitoring (identified by a resource ID 53), and the minimum number of paths 54.
- the minimum number of paths 54 is the minimum number of monitoring paths that must pass through the same resource.
- the minimum number of paths 54 determines the certainty of monitoring. If a large number of paths are set, the certainty of identifying the causative resource increases, and if it is set small, the certainty decreases. That is, as the number of paths for monitoring one resource increases, the certainty of the cause resource separation increases.
- FIG. 7 is a diagram illustrating an example of the co-occurrence condition table 60.
- the co-occurrence condition table 60 stores path combination conditions for identifying the cause resource of a measured spike. That is, for each condition (identified by condition ID), a path co-occurrence condition 63, the resource (identified by resource ID 62) that is presumed to be the cause of the spike when the co-occurrence condition is satisfied, and a condition creation time 64 are stored. For example, “P1 NOT (P2 & P3 & P4 & P5)” is stored in the co-occurrence condition 63 of the row 65a. This means that a spike has occurred in the path P1 but no spike has occurred in the paths P2 to P5. When the manner in which spikes occur satisfies this condition, it is estimated that the cause is strain on the resource PT1 stored in the resource ID 62.
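How such a condition might be checked against the set of paths that spiked at about the same time can be sketched as follows. The parsing approach, the function names, and the row layout are assumptions; the sketch follows the meaning stated in the text (the paths before NOT all spiked, and none of the paths after NOT spiked) and handles only conditions of that simple shape.

```python
import re

def parse_condition(text):
    """Split e.g. 'P1 NOT (P2 & P3)' into (required paths, excluded paths)."""
    required_part, _, excluded_part = text.partition("NOT")
    required = set(re.findall(r"\w+", required_part))
    excluded = set(re.findall(r"\w+", excluded_part))
    return required, excluded

def estimate_cause(rows, spiking_paths):
    """rows: (condition ID, resource ID, condition text) tuples (hypothetical layout).
    Returns (condition ID, resource ID) of the first matching row, or None."""
    for condition_id, resource_id, text in rows:
        required, excluded = parse_condition(text)
        if required <= spiking_paths and not (excluded & spiking_paths):
            return condition_id, resource_id
    return None

rows = [
    (1, "PT1", "P1 NOT (P2 & P3 & P4 & P5)"),
    (2, "PR2", "P2 & P3"),
]
```

A spike observed only on `P1` maps to `PT1` under condition 1, while spikes on both `P2` and `P3` map to `PR2`; a combination that matches no row would be left as an unknown cause.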
- FIG. 8 is a diagram illustrating an example of the spike history table 70.
- the spike history table 70 stores the spikes measured by the probes 23 and the resources estimated from how the spikes occurred, that is, the analysis results. For each spike (identified by spike ID 71), the occurrence time 72, the VOL where it occurred (identified by VOL ID), and the response time 74 indicating the magnitude of the spike are stored. The table also stores the result of the analysis of these spikes by the statistical processing program 7, that is, the resource estimated to be the cause of the spike (identified by resource ID 75) and the condition of the co-occurrence condition table 60 used for the estimation (identified by condition ID 76).
- the spike identified by the spike ID 0 indicates that the cause resource of the spike is identified as PT1 because the condition ID 76 matches the condition “1”.
- the resource ID 75 and the condition ID 76 being “null” indicate that the spike has not been analyzed yet.
- “unknown cause” may be recorded in the resource ID 75 and the condition ID 76.
- FIG. 9 is a diagram showing an example of a resource performance history table.
- In the resource performance history table 80, the performance of each resource in the storage 12, the access performance to the VOL 28 measured by the probe 23 (measurement records in minute order), and the like are recorded.
- Specifically, the resource whose performance is measured (identified by resource ID 81), a measurement time 82, a measured metric 83, and its value 84 are recorded.
- FIG. 10 is a flowchart of the probe selection process.
- FIG. 11 is a supplementary explanatory diagram of the flowchart of FIG.
- a probe selection process that covers several resource elements (for example, the monitoring request shown in 55b of FIG. 6) specified by the administrator will be described.
- Processors 1 and 2 (hatched resource elements; hereinafter referred to as PR1 and PR2) are designated as monitoring targets. Note that the subject of all of the following processing is the probe management program 5.
- the overlap number is the number of probe groups that include the probe. For example, in FIG. 11, the probe 2 (overlap number 2), which is included in both PG1 and PG2, is selected.
- (S7) Among the paths acquired in S6, only the already selected monitoring paths and the paths that form a single intersection at the specified resource element (PR1 or PR2) are kept. Note that a user or the like may have selected paths to be monitored in advance, so there are usually monitoring paths that are already selected at the start of processing.
- the path remaining in S7 is set as a monitoring path.
- the monitoring flag 42 in the row of the same path stored in the probe configuration table 40 is changed to Y.
- the number of passing paths of the specified resource element is updated.
- the probe management program 5 instructs the probe selected to monitor the path to monitor at a fine time interval (second order).
- the probe management program 5 updates the co-occurrence condition table 60 with the selected paths. For example, in FIG. 11, when the monitoring paths of Processor 2 are set to path 2 and path 3, a row is added to the co-occurrence condition table 60 whose resource ID is Processor 2 and whose co-occurrence condition 63 is “path 2 & path 3”. That is, PR2 is added to the co-occurrence condition table 60 of FIG. 7 with the co-occurrence condition P2 & P3. Thereafter, the processes from S3 to S8 are repeated until the condition of S3 is satisfied.
- probe selection processing when the administrator designates monitoring of all resources in the storage 12 will be described.
- FIG. 12 is an overall flowchart of the probe selection process.
- FIG. 13 is a detailed flowchart of step S13.
- FIG. 14 is a detailed flowchart of step S23.
- FIG. 15 is a supplementary explanatory diagram of the flowcharts of FIGS.
- the number of resources in the storage 12 is counted for each resource type (Port, Processor, etc.), the monitoring paths and their probes 23 are determined in order from the resource type with the largest number of elements to the resource type with the smallest number, and the probe configuration table 40 and the co-occurrence condition table 60 are updated.
- FIG. 12 will be described.
- the monitoring path selected in S13 is recorded in the probe configuration table 40. That is, the monitoring flag 42 of the entry selected for the monitoring path is updated to Y.
- the probes 23 having the monitoring paths selected in S13 are identified with reference to the probe configuration table 40, and those probes are instructed to monitor at a fine monitoring interval.
- FIG. 13 will be described with reference to the supplementary explanatory diagram of FIG.
- (S20) Referring to the resource configuration table 30, the internal resources of the storage 12 designated by the administrator are acquired. Next, the probe configuration table 40 is referred to and the paths passing through those resources are acquired. Next, the common paths of these paths are identified, and path groups are created. At this time, the path groups are created so that the resource type selected in S12 is the single part and the other resource types are the common part.
- Port is the resource type selected in S12.
- the path 1 is composed of resource elements whose numbers are 1, 1, 1, 1 in the order of Port, Processor, Cache, Pool.
- the path 2 is composed of resource elements having numbers 2, 1, 1, 1.
- Path 1 and path 2 pass through the same resource element with a resource type other than Port (common path 1). Therefore, path 1 and path 2 belong to the same path group.
- the path 3 and the path 4 belong to the same path group different from the path groups of the path 1 and the path 2.
- the common paths created in S20 are excluded if the number of passing paths is less than the threshold.
- the threshold value is a numerical value designated by the minimum number of paths 54 in the monitoring request table 50.
- the number of passing paths is 2, respectively. This value is compared with the numerical value designated by the minimum number of paths 54.
- common paths with a small number of passing paths are excluded.
- As the number of passing paths increases, the amount of data used for determining the co-occurrence condition increases, so the accuracy of the co-occurrence determination increases. It also makes the monitoring robust against a decrease in the number of passing paths when a path is changed by a configuration change (described later).
- the numerical value specified by the minimum number of paths 54 may not be used as it is for the threshold value.
- a value obtained by multiplying the minimum number of paths 54 by a constant coefficient, for example, 2 may be used as the threshold value.
- the processing from S21 to S23 may be looped while the coefficient is gradually decreased.
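The threshold filtering and the optional coefficient can be sketched as below. This is a minimal sketch under stated assumptions: the number of passing paths of a common path is taken to be the number of paths in its group, the coefficient sequence `(2.0, 1.5, 1.0)` is hypothetical, and the stopping rule (use the first coefficient that leaves at least one common path) is one possible reading of "looped while the coefficient is gradually decreased", not something the text specifies.

```python
def filter_common_paths(groups, min_paths, coefficient=1.0):
    """Keep common paths whose path count is at least min_paths * coefficient.
    groups: common path name -> list of path IDs (hypothetical layout)."""
    threshold = min_paths * coefficient
    return {g: paths for g, paths in groups.items() if len(paths) >= threshold}

def filter_with_relaxation(groups, min_paths, coefficients=(2.0, 1.5, 1.0)):
    """Try a strict threshold first and relax it step by step (assumed stopping rule)."""
    for c in coefficients:
        kept = filter_common_paths(groups, min_paths, c)
        if kept:
            return c, kept
    return None, {}

groups = {
    "common1": ["p1", "p2", "p3", "p4"],
    "common2": ["p5", "p6"],
    "common3": ["p7"],
}
```

With a minimum of two paths, the plain threshold keeps `common1` and `common2`, whereas a coefficient of 2 keeps only `common1`.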
- FIG. 14 shows processing for selecting a monitoring path from paths belonging to a path group.
- Points are added to the paths belonging to a path group according to conditions, and paths with many points are selected so that the number of paths constituting the path group is equal to or greater than a threshold value.
- In this way, the monitoring paths can be made up of favorable paths.
- Points are added to completely overlapping paths. That is, points are added to paths for which there are multiple paths passing through exactly the same resource elements.
- the path 1 of the probe 1 passes through the resource elements numbered 1, 1, 1, 1 in the order of Port, Processor, Cache, and Pool.
- For the path 2 of the probe 2, the resource elements it passes through completely overlap with those of the path 1.
- the path 1 and the path 2 are completely overlapping paths, and the overlapping number is 2.
- the number of points to be added may be a fixed value or may be proportional to the overlap number. The reason for adding points to the completely duplicated path is to make it possible to easily prepare an alternative path when the path configuration is changed.
- a probe having a large number of paths belonging to any of the common paths is identified, and points are added to those paths.
- the probe 3 has a path 3 belonging to the common path 1 and a path 4 belonging to the common path 2.
- Points are added to path 3 and path 4.
- the number of points to be added may be proportional to the number of paths that one probe has.
- In this example, the number of points may be proportional to 2, the number of paths (path 3 and path 4). This makes it possible to select paths so that the number of probes is reduced as much as possible.
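The two point rules described above (points for completely overlapping paths, and points for paths of a probe that holds several paths) can be sketched as a scoring function. The function name, the data layout, and the choice to make both point values proportional to the respective counts are assumptions; the text allows fixed values as well.

```python
from collections import Counter

def score_paths(paths, probe_of, dup_point=1, probe_point=1):
    """paths: path ID -> tuple of resource element IDs; probe_of: path ID -> probe ID.
    Both layouts and both point weights are hypothetical."""
    overlap = Counter(paths.values())      # how many paths share exactly these elements
    per_probe = Counter(probe_of.values()) # how many paths each probe has
    scores = {}
    for path_id, elements in paths.items():
        score = 0
        if overlap[elements] > 1:          # completely overlapping path
            score += dup_point * overlap[elements]
        if per_probe[probe_of[path_id]] > 1:  # probe with several paths
            score += probe_point * per_probe[probe_of[path_id]]
        scores[path_id] = score
    return scores

paths = {
    "p1": ("PT1", "PR1", "CA1", "PO1"),
    "p2": ("PT1", "PR1", "CA1", "PO1"),
    "p3": ("PT3", "PR1", "CA1", "PO1"),
    "p4": ("PT4", "PR2", "CA2", "PO2"),
    "p5": ("PT5", "PR2", "CA2", "PO2"),
}
probe_of = {"p1": "probe1", "p2": "probe2", "p3": "probe3", "p4": "probe3", "p5": "probe4"}
scores = score_paths(paths, probe_of)
```

Here `p1` and `p2` score for overlapping completely, `p3` and `p4` score because `probe3` holds both of them, and `p5` receives no points.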
- (S33) Select the target path group for determining the monitoring path.
- the path group having the smallest (number of path candidates - number of passing paths) is selected.
- the number of path candidates refers to the number of paths that are not selected as monitoring paths of other path groups among the paths constituting the path group.
- the number of passing paths refers to the number of paths selected as monitoring paths among the paths constituting the path group.
- the common path 2 is composed of a path 4, a path 5, and a path 6 having a processor, cache, and pool as common parts. Paths 4 to 6 pass through Ports 3, 4, and 5, respectively. Further, the path 7 belonging to the common path 3 passes through Port 5 in the same manner as the path 6.
- the number of path candidates for the common path 2 is 2, which is obtained by subtracting 1 (for the path 7) from 3 (the paths 4 to 6).
- If the path 4 has already been selected as a monitoring path for the common path 2, (number of path candidates - number of passing paths) of the common path 2 is 2 - 1 = 1.
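The selection in S33 reduces to picking the path group with the smallest difference between the two counts. In this sketch only the values for common path 2 (2 candidates, 1 passing path) come from the text; the values for the other groups and the tie-breaking by name are hypothetical.

```python
def pick_target_group(groups):
    """groups: group name -> (number of path candidates, number of passing paths).
    Returns the group with the smallest (candidates - passing); ties go to the
    alphabetically first name (an assumed tie-break)."""
    return min(sorted(groups), key=lambda g: groups[g][0] - groups[g][1])

groups = {
    "common path 1": (3, 1),  # hypothetical values
    "common path 2": (2, 1),  # from the example: 2 - 1 = 1
    "common path 3": (4, 0),  # hypothetical values
}
target = pick_target_group(groups)
```

Under these numbers common path 2 has the least slack and is handled first.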
- (S35) It is determined whether the number of passing paths of all path groups is greater than or equal to the threshold.
- the number of passing paths is the number of paths selected as monitoring paths among the paths constituting the path group.
- the threshold is the minimum number of paths 54 in the monitoring request table 50. If all the path groups satisfy this condition, the process proceeds to S36, and if not, the process returns to S33.
- (S36) This step is performed when the condition shown in S35 is satisfied but not all the resource elements belonging to the single part have been covered yet. For example, in FIG. 15C, it is assumed that the path 4 and the path 6 are selected as the monitoring paths among the paths 4, 5, and 6 belonging to the common path 2, and the path 5 is not selected. At this time, Port 4 remains uncovered. In this step, uncovered resource elements belonging to the single part are first identified in this way. Next, the path having the highest score is selected from the paths passing through the resource element, and is set as a monitoring path of the path group to which the path belongs.
- the resource to be monitored is not determined by allocating the path to the path group by selecting the monitoring path so that the number of monitoring paths passing through the monitored resource element is equal to or greater than the threshold.
- the path configuration may change due to the operation of the information processing system. For example, in order to reduce the load on a Pool whose performance is tight, a VOL that places data on the Pool may be moved to another Pool at the administrator's discretion. At this time, the Pool through which this VOL path passes changes. That is, the path configuration changes.
- FIG. 16 is a flowchart of probe reselection processing for a path configuration change.
- the processing subject of this flow is the probe management program 5.
- FIG. 17 is a supplementary explanatory diagram of the flowchart of FIG. 16.
- the monitoring path configuration change is received.
- the collection program 6 periodically collects configuration information and extracts the difference.
- the probe management program 5 receives this difference, refers to the monitoring flag 42 in the probe configuration table 40, and determines whether the changed path is a monitoring path.
- the path 1 passing through the resource element r1 before the configuration change is changed to the path 2 passing through the resource element r2, which belongs to the same resource type as the resource element r1.
- FIG. 17A shows a configuration change in the single part.
- FIG. 17B shows a configuration change in the common part.
- the co-occurrence condition 63 in the co-occurrence condition table 60 is updated. Specifically, path 1 is excluded from the conditions based on the path group to which path 1 belongs. For example, it is assumed that the path group is composed of five paths, path 1 to path 5. At this time, if the co-occurrence condition 63 has a row containing the condition based on this path group (path 1 & path 2 & path 3 & path 4 & path 5), path 1 is excluded from this row, which is updated to (path 2 & path 3 & path 4 & path 5).
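The update of a co-occurrence condition when a path leaves its path group can be sketched as simple string handling. The function name is hypothetical, and the sketch assumes the condition is a plain conjunction of path names joined by `&`, as in the example above.

```python
def remove_path(condition, path):
    """Drop one path from a conjunction such as 'path 1 & path 2 & path 3'."""
    terms = [t.strip() for t in condition.split("&")]
    return " & ".join(t for t in terms if t != path)

updated = remove_path("path 1 & path 2 & path 3 & path 4 & path 5", "path 1")
```

The result is `path 2 & path 3 & path 4 & path 5`, matching the updated row described in the text.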
- A path is transferred from a path group that has a surplus of passing paths, so that the number of passing paths recovers to the threshold value or more. This will be described with reference to FIG. It is assumed that the path 1, which passed through the common part Processor 1, is changed to the path 2. At this time, it is assumed that the number of passing paths of the common path 1 decreases by 1 and falls below the threshold. Then, the single part (Port-n) through which a path (path m) belonging to the common path 1 passes is identified, and the path group (common path n) to which another path (path n) covering that single part belongs is identified. Among such path groups, a path (path n) belonging to the path group with the largest number of passing paths, that is, with the largest margin over the threshold value, is made a new member of the common path 1.
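Choosing the path group that gives up a path can be sketched as picking the group with the largest margin over the threshold. The function name and data layout are hypothetical, and the requirement that a donor must stay at or above the threshold after giving up one path is an assumption added for this sketch.

```python
def pick_donor_group(passing_counts, threshold, exclude):
    """passing_counts: group name -> number of passing paths (hypothetical layout).
    exclude: the group that needs a path. Returns the donor group name or None."""
    margins = {
        g: n - threshold
        for g, n in passing_counts.items()
        if g != exclude and n > threshold  # assumed: donor stays >= threshold afterwards
    }
    if not margins:
        return None
    # Largest margin over the threshold; ties go to the alphabetically first name.
    return max(sorted(margins), key=margins.get)

donor = pick_donor_group(
    {"common path 1": 1, "common path 2": 4, "common path 3": 3},
    threshold=2,
    exclude="common path 1",
)
```

In this example common path 2 has the largest margin and would hand one of its paths to common path 1.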
- the administrator may add resources to be monitored. Next, a process of additionally selecting a monitoring path and the probe 23 when this monitoring target resource is newly added will be described.
- FIG. 18 is a flowchart of probe selection processing when a monitoring target is added.
- the subject of this processing is the probe management program 5.
- the probe management program 5 receives information on the newly designated resource element and checks whether a co-occurrence condition corresponding to the resource element exists. If one exists, it instructs the probe 23 to monitor the paths constituting the co-occurrence condition; if not, it newly selects a path and a probe for monitoring the resource element.
- the monitoring path and probe 23 of the resource element for which monitoring is specified are selected.
- This selection method may be the method of selecting with individual resource elements as starting points (shown in FIG. 9), or the method of selecting monitoring paths and probes 23 for all resource elements in the storage 12 that includes the specified resource element (shown in FIG. 12).
- FIG. 19 is a diagram showing a resource selection screen. On this screen, the administrator designates a resource that requires monitoring at a fine time interval. In FIG. 19, it is assumed that the storage 1 has already been selected from the plurality of storages 12 managed by the management computer 1.
- the resource selection screen includes a server list 190 and a resource list 191.
- server information 192 related to the storage 1 is displayed.
- the administrator designates a server for which detailed monitoring is required with a check box at the left end.
- the administrator can also select all servers by checking the all selection check box 193. Also, the administrator changes the value of the minimum number of paths 194 as appropriate.
- in each line of the resource list 191, a resource in the storage 1 is displayed. As in the server list 190, the administrator can select in this list the resources for which detailed monitoring is required. Further, the minimum number of paths 197 can be set for each resource element, and the administrator can set and change its value for the resources for which detailed monitoring is required.
- the input content is sent to the probe management program 5.
- the probe management program 5 stores the input content in the monitoring request table 50. Thereafter, the probe management program 5 starts probe selection calculation.
- FIG. 20 is a diagram showing a probe selection proposal presentation screen.
- This screen is a screen for presenting the probe selection result calculated by the probe management program 5 so as to satisfy the monitoring request of the administrator.
- This screen includes a probe summary, a cover resource summary 202, and a resource-specific monitoring path configuration 203.
- the number of probes 200 required for monitoring and the number 201 of monitoring paths are displayed.
- the number 201 of monitoring paths required for monitoring is obtained by referring to the probe configuration table 40 and totaling the monitoring paths that pass through the resources of the storage 1.
- the number of probes 200 required for monitoring is obtained by counting the number of probes having those monitoring paths.
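The two totals shown in the probe summary can be computed as in the following sketch. The row layout (keys `probe_id`, `path`, `monitoring_flag`, `resources`) is a hypothetical simplification of the probe configuration table 40, and the sample rows are illustrative, not data from the embodiment.

```python
def summarize_probe_selection(rows, storage_resources):
    """Return (number of probes, number of monitoring paths) for one storage."""
    # Keep rows whose path is monitored and passes through the storage.
    monitored = [
        row for row in rows
        if row["monitoring_flag"] and row["resources"] & storage_resources
    ]
    num_paths = len({row["path"] for row in monitored})
    # A probe is counted once even if it holds several monitoring paths.
    num_probes = len({row["probe_id"] for row in monitored})
    return num_probes, num_paths


# Hypothetical sample rows.
table = [
    {"probe_id": "probe1", "path": "path1", "monitoring_flag": True,
     "resources": {"Port1", "Processor1"}},
    {"probe_id": "probe1", "path": "path2", "monitoring_flag": True,
     "resources": {"Port2", "Processor1"}},
    {"probe_id": "probe2", "path": "path3", "monitoring_flag": True,
     "resources": {"Port2", "Processor2"}},
    {"probe_id": "probe3", "path": "path4", "monitoring_flag": False,
     "resources": {"Port1", "Processor2"}},
]
storage1 = {"Port1", "Port2", "Processor1", "Processor2"}
probes, paths = summarize_probe_selection(table, storage1)
print(probes, paths)
```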
- in the cover resource summary 202, the number of resources for each resource type of the storage 1 and the number of resources covered by the current probe selection (the number of cover resources) are presented.
- the monitoring path configuration corresponding to each monitoring target resource is displayed in each row.
- Each row displays a co-occurrence condition corresponding to the monitoring target resource, a monitoring path or a monitoring path group constituting the co-occurrence condition, and the number of paths (the number of passing paths).
- Such information can be acquired from the co-occurrence condition table 60. Further, as supplementary information, IOPS measurement data representing the flow rate of these paths is displayed together.
- the administrator presses the OK button 204 and approves the selection plan. If there is a problem, the Cancel button 205 is pressed to return to the resource selection screen of FIG.
- FIG. 21 is a diagram showing a monitoring result screen.
- the management computer 1 totals and statistically processes the spikes measured by the probe 23, and the result of extracting spikes caused by the resources designated for monitoring is displayed.
- the administrator reads a spike increase tendency of a resource from the displayed spike history, and determines that a sign of performance failure appears in the resource.
- Port 1 of the storage 1 has already been selected.
- the monitoring result screen is composed of spike statistics 210 and spike history 211.
- the spike statistics 210 displays spike statistical information for one week related to the currently selected resource (Port 1 in FIG. 21). In the spike statistics 210, (a) the number of spikes measured, (b) the number of spikes attributed to other resources, (c) the number of spikes attributed to Port1, and the previous week ratio of (c) are displayed.
- the value of (a) is obtained as follows. The statistical processing program 7 counts, from the spike history table 70, the number of rows corresponding to the VOL related to path 1, that is, the number of spikes measured on path 1. This value is displayed in (a).
- the value of (b) is obtained by aggregating the number of rows whose cause resource ID 75 is other than Port1, that is, other resources that are not Port1, among the rows obtained by the calculation of (a).
- the value of (c) is obtained by the formula (a) - (b). From these values, in particular (c), the number of spikes caused by Port1, and its ratio to the previous week, the administrator can read an increase in spikes caused by Port1 and determine that the performance of Port1 is becoming tight.
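The calculation of (a), (b), and (c) can be sketched as below. The row keys `path` and `cause_resource_id` stand in for the corresponding columns of the spike history table 70, and the interpretation of the previous week ratio as this week's (c) divided by last week's (c) is an assumption; the sample history is hypothetical.

```python
def spike_statistics(history, path, resource_id, previous_week_c=None):
    """Return (a, b, c, ratio) for one path and one monitored resource."""
    on_path = [row for row in history if row["path"] == path]
    a = len(on_path)  # (a) spikes measured on the path
    # (b) spikes on the path whose cause is some other resource
    b = sum(1 for row in on_path if row["cause_resource_id"] != resource_id)
    c = a - b  # (c) spikes attributed to the monitored resource
    ratio = c / previous_week_c if previous_week_c else None
    return a, b, c, ratio


# Hypothetical one-week history.
history = (
    [{"path": "path1", "cause_resource_id": "Port1"}] * 6
    + [{"path": "path1", "cause_resource_id": "Processor1"},
       {"path": "path1", "cause_resource_id": "Pool1"},
       {"path": "path2", "cause_resource_id": "Port1"}]
)
stats = spike_statistics(history, "path1", "Port1", previous_week_c=3)
print(stats)
```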
- the spike history 211 is a graph showing the numerical values shown in the spike statistics.
- the I / O response time measured on path 1 is recorded.
- a total of 6 spikes were measured (spike 212 due to Port 1).
- spikes drawn with dotted lines (spikes 213a and 213b) are spikes caused by other resources. As a result, the administrator can easily read from the graph that these spikes are not caused by Port1.
- the resource and the resource element have been described as one-to-one, but this is not necessarily required.
- a plurality of resources may be handled as one resource element, or one resource may be divided into a plurality of resource elements. Below, such a modification is described.
- FIG. 22 is a diagram showing a resource configuration table 30 in the modification.
- FIG. 22 is different from the resource configuration table 30 shown in FIG. 4 in that information about attributes of resources (attribute 34 and attribute value 35) is added.
- a row 36a in FIG. 22 is information on PT1 which is a Port resource.
- the row 36a records that PT1 belongs to the trunk TR1, which is a group of Ports, as attribute information of PT1. For Ports belonging to the same trunk, traffic is automatically distributed according to the load. In this case, a trunk that is a set of a plurality of resources (Ports) can be handled as one resource element.
- the row 36b is information on PR1, which is a processor resource, and indicates that PR1 belongs to a processor group called PRG1.
- processing is automatically distributed among the processors belonging to the processor group in accordance with the processor load, similarly to the previous trunk. At this time, as in the case of the previous trunk, the processor group can be handled as one resource element.
- the row 36c stores information on PO1, which is a Pool resource.
- PO1 is composed of a plurality of media (SSD and SAS) having different processing speeds.
- performance characteristics such as response time vary greatly depending on the storage medium of data requested by the I / O to be processed.
- Such a Pool can be regarded as having a plurality of resource elements having different performance characteristics for each medium. Therefore, PO1 composed of SSD and SAS may be divided into SSD resource elements and SAS resource elements.
- FIG. 23 is a flowchart of resource set / division processing.
- FIG. 23 shows a process in which the probe management program 5 refers to the resource configuration table 30 to combine a plurality of resources into one resource element or divide one resource into a plurality of resource elements.
- resources belonging to the same load-distributed group are set as one resource element.
- the load-distributed group corresponds to a Port trunk or a processor group.
- when the resource attribute information indicates that the resource is composed of several resources having different performance characteristics, the resource is divided into resource elements for each performance characteristic.
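The set / division processing can be illustrated with the following sketch. It assumes each resource is a small dictionary whose optional `group` and `media` keys stand in for the attribute 34 and attribute value 35 columns; the key names, the naming of divided elements such as `PO1-SSD`, and the resources PT2 and C1 are hypothetical.

```python
def to_resource_elements(resources):
    """Map resource element name -> set of resource IDs it is built from."""
    elements = {}
    for res in resources:
        group = res.get("group")
        media = res.get("media", [])
        if group:
            # Load-distributed group (trunk, processor group): one element.
            elements.setdefault(group, set()).add(res["id"])
        elif len(media) > 1:
            # Media with different performance characteristics: one element each.
            for medium in media:
                elements[f"{res['id']}-{medium}"] = {res["id"]}
        else:
            # Default: resource and resource element are one-to-one.
            elements[res["id"]] = {res["id"]}
    return elements


resources = [
    {"id": "PT1", "group": "TR1"},
    {"id": "PT2", "group": "TR1"},   # hypothetical second Port in the trunk
    {"id": "PR1", "group": "PRG1"},
    {"id": "PO1", "media": ["SSD", "SAS"]},
    {"id": "C1"},                    # hypothetical resource without attributes
]
elements = to_resource_elements(resources)
print(sorted(elements))
```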
- paths are selected until the number of paths passing through each resource element to be monitored (monitoring resource element) reaches a predetermined minimum number of paths. At each step, the probe that monitors the path passing through the largest number of uncovered monitoring resource elements (monitoring resource elements whose number of passing paths has not yet reached the minimum number of paths) is chosen, and the path is selected from the paths that this probe can monitor. As a result, as many monitoring resource elements as possible are monitored with each selected probe, while monitoring paths are added until the minimum number of paths needed to isolate the monitoring resource element that caused the performance degradation is secured. Measurement of the system performance is thus realized at low cost.
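The selection loop summarized here resembles a greedy covering procedure, which can be sketched as follows. The sketch is a simplification: it assumes every path can be monitored by its own probe, so it picks paths directly rather than picking a probe first, and the function name, data layout, and tie-breaking by path name are assumptions rather than the embodiment's processing flow.

```python
def select_monitoring_paths(paths, targets, min_paths):
    """Greedily pick paths until every target has min_paths passing paths.

    paths   : dict mapping path name -> set of resource elements it passes
    targets : set of monitoring resource elements
    """
    count = {t: 0 for t in targets}
    remaining = dict(paths)
    selected = []
    while remaining:
        uncovered = {t for t in targets if count[t] < min_paths}
        if not uncovered:
            break
        # Path passing through the most uncovered monitoring resource elements.
        best = max(sorted(remaining), key=lambda p: len(remaining[p] & uncovered))
        if not remaining[best] & uncovered:
            break  # no remaining path improves coverage
        selected.append(best)
        for t in remaining.pop(best) & targets:
            count[t] += 1
    return selected, count


# Hypothetical configuration with two Ports and two Processors.
paths = {
    "path1": {"Port1", "Processor1"},
    "path2": {"Port1", "Processor2"},
    "path3": {"Port2", "Processor1"},
    "path4": {"Port2", "Processor2"},
}
targets = {"Port1", "Port2", "Processor1", "Processor2"}
selected, count = select_monitoring_paths(paths, targets, min_paths=1)
print(selected)
```

With a minimum number of paths of 1, two of the four paths are enough to cover all four resource elements in this sample, which reflects the cost reduction the passage describes.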
- a path is selected according to a predetermined rule from the paths that can be monitored by the probe that monitors the path passing through the most uncovered monitoring resource elements. Therefore, a more preferable monitoring path can be set.
- when the flow rate per path is taken into account so that a path used for a large amount of processing, for example a path with a large I / O amount per second (IOPS), is prioritized, performance degradation such as spikes is reduced.
- when the paths are selected so that, for each monitoring resource element, the total number of monitoring resource elements through which the selected paths pass is equal to or greater than a predetermined value, the number of monitoring resource elements included in the monitored paths increases, which makes it easier to isolate individual resource elements.
- when the paths are selected so that the total number of monitoring resource elements through which the selected paths pass is equal for each monitoring resource element, the variation in the ease of isolation among resource elements disappears.
- a path to be monitored can easily be prepared in place of that path.
- when the monitoring resource element through which a monitoring path actually passes is changed, if a completely duplicated path exists, it is used as the monitoring path in place of that monitoring path, so the switch can be made easily.
- when the monitoring resource element through which the monitoring path passes is changed and there is no completely duplicated path for the monitoring path, another monitoring path is selected from the paths that pass through the monitoring resource elements included in the monitoring path.
- a path belonging to the same group as the group containing the smallest number of monitoring paths is used as the monitoring path, which at the same time improves the distinguishability of other monitoring resource elements.
- the management computer 1 in the first embodiment is intended to obtain a minimum probe necessary for specifying the cause when the access performance of the storage system deteriorates.
- the management computer 1 in the second embodiment is obtained by replacing the target from a storage system with an application.
- the application is also composed of several program processes. These program processes correspond to resources in the first embodiment.
- the program process is, for example, a process in a program module or a database table (or access process to the database table).
- Application provides some service to application users.
- the service is a service that returns a Web page that matches a specific keyword to the user.
- the user designates a service and sends a processing request (service request) to the application.
- the application executes the request and returns the result to the user.
- in the first embodiment, the probe 23 measures the response time of the I / O request from the server to the storage system, whereas the probe 23 in the second embodiment measures the response time until the application returns a response to the user's service request.
- the service in the second embodiment corresponds to the path in the first embodiment.
- a path is composed of a series of resources that process I / O requests sent to the storage system through the path.
- the service corresponding to the path in the second embodiment is configured by a series of application program processing for processing a service request from the user.
- the second embodiment is similar to the first embodiment, and the monitoring target is changed from a storage system to an application.
- a resource corresponds to a program process
- a path corresponds to a service
- the response time monitored by the probe 23 is a service response time.
- FIG. 24 is a system configuration diagram according to the second embodiment. As can be seen at a glance, most of the parts in FIG. 24 overlap with those of the first embodiment shown in FIG. Therefore, here, a description will be given mainly of the difference.
- the application to be monitored in this embodiment includes the IP switch 102, the Web server 103, and the database server 106. These are connected to the management computer 1 via the LAN 11. These are connected by a business LAN 101 of a different system from the LAN 11 and can communicate with each other.
- the Web server 103 and the database server 106 are ordinary computers provided with a CPU, memory, and a permanent storage device such as an HDD.
- a Web program 104 that is a part of an application operates on the Web server 103.
- the Web server 103 includes a large number of program modules 105 that constitute the Web program 104.
- the database server 106 operates a database program 107 that is also a part of the application.
- the database server 106 also has a database table 108 in which application data is stored.
- the service request sent from the application user first enters the IP switch 102 through the business LAN 101.
- the IP switch 102 sends it to the Web server 103.
- the Web program 104 receives the service request, reads the program module 105 related to the service request, and executes predetermined processing. If the data possessed by the application is necessary for the predetermined processing, the Web program 104 further transmits a service request to the database server 106.
- the database program 107 receives this, executes predetermined data processing on the database table 108 related to the service request, and returns the result to the requesting Web program 104.
- the Web program 104 further executes predetermined processing and returns the result to the user via the IP switch 102.
- the service monitoring server 100 is a computer that includes a CPU, a memory, and a storage device such as an HDD, and executes programs.
- on the service monitoring server 100, a probe 23, which is a kind of program, operates.
- the service monitoring server 100 is connected to the IP switch 102.
- the IP switch 102 copies a service request packet passing through the business LAN 101 and an application response packet to the service request packet, and transmits the duplicated packet to the service monitoring server 100.
- the probe 23 calculates and records the response time for each service from the time difference between these service request / response packets.
- the probe 23 calculates the response time of the service and monitors the value. When a spike is detected in the response time, the service request in which the spike occurred, the detected time, and the response time are recorded. The recorded contents are periodically collected by the collection program 6 operating on the management computer 1 and stored in the spike history table 70.
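The response time calculation and spike recording performed by the probe 23 can be sketched as follows. The packet tuple layout, the request identifier used to match a response to its request, and the spike rule (response time above a fixed multiple of a per-service baseline) are all assumptions made for illustration; times are integers in milliseconds.

```python
def response_times(packets):
    """Pair request/response packets and return (service, request time, response time).

    packets : iterable of (timestamp_ms, kind, request_id, service),
              where kind is "req" or "res"
    """
    pending = {}
    samples = []
    for timestamp, kind, request_id, service in packets:
        if kind == "req":
            pending[request_id] = (timestamp, service)
        elif request_id in pending:
            started, svc = pending.pop(request_id)
            samples.append((svc, started, timestamp - started))
    return samples


def detect_spikes(samples, baseline, factor=3):
    """Return samples whose response time exceeds factor x the service baseline."""
    return [s for s in samples if s[2] > factor * baseline[s[0]]]


# Hypothetical duplicated packets for a service named "search".
packets = [
    (0, "req", 1, "search"),
    (20, "res", 1, "search"),
    (100, "req", 2, "search"),
    (400, "res", 2, "search"),
]
samples = response_times(packets)
spikes = detect_spikes(samples, baseline={"search": 25})
print(spikes)
```

Restricting `packets` to the services designated by the management computer 1 before calling `response_times` would correspond to the load reduction function described for the probe 23.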
- the process of calculating the response time by collating the duplicated packet is a high-load process that consumes a large amount of CPU and memory. Therefore, the probe 23 has a function of limiting the service for which the response time is calculated. Thereby, the consumption of CPU and memory can be reduced.
- the management computer 1 instructs the probe 23 to select a target service.
- FIG. 25 is a diagram showing the contents of the resource configuration table 30 in the second embodiment.
- the resource configuration table 30 in the first embodiment stores the resource configuration inside the storage.
- program processing corresponding to resources operating on the servers is stored in the resource configuration table 30.
- the resource stores a unique identifier (resource ID 33) for each resource type 32.
- FIG. 25 shows that there are resources PM1, PM2, PM3..., whose resource type is a program module, on a server called Web-Sv1 that constitutes an application.
- FIG. 26 is a diagram showing a probe configuration table 40 in the second embodiment.
- the probe configuration table 40 stores the configuration information of the probe 23, the service configuration information, and the monitoring information of the probe 23.
- the configuration information of the probe 23 includes the identifier of the probe 23 (probe ID 41) and the service monitoring server 100 (server ID 43) on which the probe 23 is operating.
- the service configuration information includes a service identifier 410, the resources such as program modules used by the service (service 46, resource type 47, resource ID, path group ID 49), a service URL 411 (when the application is a Web application), and the like.
- the monitoring information of the probe 23 includes the presence / absence of service monitoring by the probe 23 (monitoring flag 42).
- the service configuration information may be input manually by the administrator based on the application design information, or the collection program 6 may collect and analyze the traces and logs output by the application during execution and input the result.
- the probe management program 5 selects the minimum services necessary to identify the performance degradation of the resources (program modules and database tables), and instructs the probe 23 to limit response time monitoring to those services.
- the method for selecting the minimum service is the same as the method for selecting the minimum probe in the first embodiment. Therefore, the processing flow described in the first embodiment can be applied as it is.
- a term such as “path” may be replaced with the corresponding term in the second embodiment.
- SYMBOLS 1 ... Management computer, 10 ... Display apparatus, 100 ... Service monitoring server, 101 ... Business LAN, 102 ... IP switch, 103 ... Web server, 104 ... Web program, 105 ... Program module, 106 ... Database server, 107 ... Database program, 108 ... Database table, 11 ... LAN, 12 ... Storage, 13 ... SW, 14 ... Port, 15 ... Processor, 16 ... Pool, 17 ... Cache, 18 ... SAN, 19 ... Server, 2 ... CPU, 20 ... CPU, 21 ... memory, 25 ... computer storage, 26 ... HBA, 30 ... resource configuration table, 4 ... computer storage
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Debugging And Monitoring (AREA)
Abstract
According to the invention, a management computer for managing an information processing system, which executes information processing via a path composed of a sequence of multiple stages of resource elements, comprises: probe management means that, so that the number of paths passing through a monitoring resource element (a resource element to be monitored) is equal to or greater than a prescribed minimum number of paths, selects a path from among the paths that can be monitored by the probe which, among the probes monitoring paths passing through monitoring resource elements, monitors the path passing through the largest number of uncovered monitoring resource elements (monitoring resource elements for which the number of passing paths has not reached the minimum number of paths), sets the selected path as a monitoring path (a path to be monitored), and sets the probe monitoring the monitoring path as a monitoring probe (a probe designated for monitoring); collection means for collecting the monitoring results of the monitoring probe; and statistical processing means for determining the monitoring resource element that caused a performance degradation through a carving and dividing (isolation) process according to a co-occurrence pattern based on the monitoring results of the monitoring probe.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2014/058918 WO2015145676A1 (fr) | 2014-03-27 | 2014-03-27 | Ordinateur de surveillance et procédé de surveillance |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2014/058918 WO2015145676A1 (fr) | 2014-03-27 | 2014-03-27 | Ordinateur de surveillance et procédé de surveillance |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2015145676A1 (fr) | 2015-10-01 |
Family
ID=54194267
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2014/058918 Ceased WO2015145676A1 (fr) | 2014-03-27 | 2014-03-27 | Ordinateur de surveillance et procédé de surveillance |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2015145676A1 (fr) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106844736A (zh) * | 2017-02-13 | 2017-06-13 | 北方工业大学 | 基于时空网络的时空同现模式挖掘方法 |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2000088925A (ja) * | 1998-09-14 | 2000-03-31 | Toshiba Corp | 半導体デバイスの故障箇所特定方法及びその装置 |
| WO2006137373A1 (fr) * | 2005-06-24 | 2006-12-28 | Nec Corporation | Système de déduction de partie de dégradation de qualité et méthode de déduction de partie de dégradation de qualité |
| JP2008113186A (ja) * | 2006-10-30 | 2008-05-15 | Nec Corp | QoSルーティング方法およびQoSルーティング装置 |
| JP2008158666A (ja) * | 2006-12-21 | 2008-07-10 | Nec Corp | ストレージデバイスのマルチパスシステム、その障害箇所特定方法及びプログラム |
| WO2010122604A1 (fr) * | 2009-04-23 | 2010-10-28 | 株式会社日立製作所 | Ordinateur pour spécifier des origines de génération d'évènement dans un système informatique comprenant une pluralité de dispositifs de noeud |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10877792B2 (en) | Systems and methods of application-aware improvement of storage network traffic | |
| Karumuri et al. | Towards observability data management at scale | |
| CN102713861B (zh) | 操作管理装置、操作管理方法以及程序存储介质 | |
| US8655623B2 (en) | Diagnostic system and method | |
| US8667334B2 (en) | Problem isolation in a virtual environment | |
| US9459942B2 (en) | Correlation of metrics monitored from a virtual environment | |
| US9690645B2 (en) | Determining suspected root causes of anomalous network behavior | |
| JP5546686B2 (ja) | 監視システム、及び監視方法 | |
| Zheng et al. | Co-analysis of RAS log and job log on Blue Gene/P | |
| US9882841B2 (en) | Validating workload distribution in a storage area network | |
| US20160378583A1 (en) | Management computer and method for evaluating performance threshold value | |
| US12141047B1 (en) | Configuring automated workflows for application performance monitoring | |
| US10177984B2 (en) | Isolation of problems in a virtual environment | |
| US8656012B2 (en) | Management computer, storage system management method, and storage system | |
| JP5222876B2 (ja) | 計算機システムにおけるシステム管理方法、及び管理システム | |
| US11853330B1 (en) | Data structure navigator | |
| US9348685B2 (en) | Intermediate database management layer | |
| US20150370619A1 (en) | Management system for managing computer system and management method thereof | |
| KR20150118963A (ko) | 큐 모니터링 및 시각화 | |
| CN120821633B (zh) | 一种批量服务器健康状态监测方法和装置 | |
| WO2021242466A1 (fr) | Analyse de performance de calcul pour des portées dans une architecture à base de microservices | |
| WO2015145676A1 (fr) | Ordinateur de surveillance et procédé de surveillance | |
| JP6622808B2 (ja) | 管理計算機および計算機システムの管理方法 | |
| Borisov et al. | Why Did My Query Slow Down? | |
| JP2018063518A (ja) | 管理サーバ、管理方法及びそのプログラム |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 14886968; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 14886968; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: JP |