CN106294136A

CN106294136A - Online detection method and system for performance changes during parallel program execution

Info

Publication number: CN106294136A
Application number: CN201610620895.8A
Authority: CN
Inventors: Tang Xiongchao (汤雄超); Zhai Jidong (翟季冬); Chen Wenguang (陈文光)
Original assignee: Innovation Center Of Yin Zhou Qinghua Changsanjiao Research Inst Zhejiang
Current assignee: Yangtze Delta Region Institute of Tsinghua University Zhejiang
Priority date: 2016-07-29
Filing date: 2016-07-29
Publication date: 2017-01-04
Anticipated expiration: 2036-07-29
Also published as: CN106294136B

Abstract

A method and system for online detection of performance changes during parallel program execution. The online detection method includes: obtaining a program structure graph; and, based on the program structure graph, inserting runtime performance change detection code at nodes of the program. The performance change detection code includes in-process performance change detection code which, while the program runs, indicates, for multiple iterations of a loop node in the program, which iteration or iterations have changed in performance relative to the other iterations executed so far. The online detection method and system of the embodiments of the present invention can detect performance failures of the runtime environment while the program is running; automatically detect performance changes and report them to the developer promptly after detection; and, using the nodes of the program structure graph as the basic unit of analysis, select the code to be analyzed automatically or in a user-defined manner, without analyzing every part of the program, thereby reducing runtime performance overhead.

Description

Online detection method and system for performance changes during parallel program execution

Technical field

The present invention relates generally to software development, and more particularly to online detection of performance changes during parallel program execution.

Background

Large-scale parallel computing on high-performance computers is widely used in production and daily life, for example in weather forecasting, oil exploration, and drug research and development.

Modern high-performance computers contain many processors (CPUs), and each CPU contains many cores. To make full use of this computing power, a parallel program has these processors and cores work together. In high-performance computing, the Message Passing Interface (MPI) is the de facto standard programming model for parallel programs, and C/C++ and Fortran are the commonly used languages. Such a parallel program consists of several processes; each process performs its own computation, and the processes interact through network communication.

As high-performance computers and parallel programs grow in scale, performance variation (that is, the same program running sometimes faster and sometimes slower) has become an increasingly serious problem.

Conventionally, the following technique exists (hereinafter "existing scheme 1"): pre-built benchmark programs are used to evaluate the performance of each node of a high-performance computer, so that computing nodes with performance failures can be identified and parallel programs are not run on them. See Problem Diagnosis in Large-Scale Computing Environments, Mirgorodskiy et al., SC2006.

A drawback of this approach is that it can only screen nodes before the program runs and cannot handle performance failures that arise while the program is running. It also focuses solely on system performance failures and ignores the behavior of the program itself.

Another conventional technique (hereinafter "existing scheme 2") works at function granularity: the time consumed by every function call is recorded to a file, and after the program ends a visualization tool is used to analyze how performance varied. See, for example, Vampir, ScoreP, and SCALASCA, and "Tools for High Performance Computing", Wolf et al., 2008.

The drawback of this type of method is that recording function times produces a large volume of output files, which causes several problems. First, the output puts heavy pressure on the system and degrades the program's performance. Second, the programmer can analyze the data only after the program ends, and the program may run for a long time. Third, finding useful information in such a large volume of output is time-consuming and laborious.

How to detect a program's performance changes while it is running ("online") without introducing much performance overhead ("lightweight") is an urgent and challenging problem.

Summary of the invention

Performance changes in parallel programs can arise from many sources: for example, multiple parallel programs running on the same high-performance computer compete for shared resources; the operating system's own activity causes interference; and hardware or software faults can affect the program.

Performance variation in parallel programs causes a series of problems. First, the program's performance suffers and falls below expectations. More seriously, because the variation is unpredictable, the same program run several times on the same computer system behaves differently each time, which makes it hard for developers to understand the program's real behavior. For example, on a typical high-performance computer, repeated runs of the same program may vary by 10% in performance, while an optimization is expected to improve performance by 5%. In this case, the developer cannot tell whether an observed improvement comes from the optimization or merely from performance variation.

It is desirable to detect a program's performance changes while it is running ("online") without introducing much performance overhead ("lightweight"), and to give hints about the source of each change.

The present invention has been made in view of these circumstances.

According to one aspect of the present invention, there is provided an online detection method for performance changes during parallel program execution, which may include: obtaining a program structure graph; and, based on the program structure graph, automatically or in a user-defined manner instrumenting nodes of the program with runtime performance change detection code. The performance change detection code includes in-process performance change detection code which, while the program runs, indicates, for multiple iterations of a loop node in the program, which iteration or iterations have changed in performance relative to the other iterations executed so far.

Further, in the online detection method, the iteration or iterations whose performance has changed may include the iteration or iterations that performed worse than the other iterations executed so far.

Further, in the online detection method, the in-process performance change detection code may be configured to compare, while the program runs, the performance of the current iteration with that of the previous iterations in the same run, record in-process performance change information, and provide the user with an in-process performance change report.

Further, in the online detection method, the performance change detection code may also include inter-process performance change detection code, whereby one of the parallel processes running the same task in the parallel program collects the in-process performance change information of every parallel process running that task, compares it, and provides the user with a comparison report on inter-process performance changes.

Further, in the online detection method, the program structure graph may consist of several trees, each corresponding to one function definition body in the code. The graph has three kinds of nodes: loop (LOOP), branch (BR), and call (CALL). Each node is defined by at least three attributes: a globally unique ID; the node type (loop, branch, or call); and the node's position in the source code. The report that the in-process performance change detection code gives the user indicates the loop node's position in the source code and the iterations whose performance changes meet a predetermined criterion.

Further, the online detection method may also include pruning the program structure graph before the runtime performance change detection code is inserted.

Further, in the online detection method, pruning may include automatically removing from the program structure graph the nodes whose share of execution time is below a predetermined threshold, so that no runtime performance change detection code is inserted at such nodes.

Further, in the online detection method, the in-process performance change information may include: the total execution time of the node so far; the total number of times the node has executed so far; and the iteration numbers and times of a predetermined number of the node's longest executions so far.
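As an illustration only, the three items of in-process information could be held per node as in the Python sketch below. The class name `NodeStats`, its field names, and the default of three retained executions are hypothetical; keeping the "predetermined number" of longest executions in a min-heap is an assumed technique, not one stated in the text.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class NodeStats:
    """Hypothetical per-node history kept inside one process."""
    top_k: int = 3                  # "predetermined number" (assumed value)
    total_time: float = 0.0         # sum of all execution times so far
    count: int = 0                  # number of executions so far
    heap: list = field(default_factory=list)  # min-heap of (time, iteration)

    def record(self, iteration: int, elapsed: float) -> None:
        self.total_time += elapsed
        self.count += 1
        entry = (elapsed, iteration)
        if len(self.heap) < self.top_k:
            heapq.heappush(self.heap, entry)
        elif entry > self.heap[0]:
            # Evict the shortest of the current top-k in O(log k).
            heapq.heapreplace(self.heap, entry)

    def mean(self) -> float:
        return self.total_time / self.count if self.count else 0.0

    def longest(self) -> list:
        """(iteration, time) pairs of the slowest executions, slowest first."""
        return [(it, t) for t, it in sorted(self.heap, reverse=True)]


if __name__ == "__main__":
    stats = NodeStats(top_k=2)
    for i, t in enumerate([1.0, 1.1, 3.5, 0.9, 2.0]):
        stats.record(i, t)
    print(stats.count, stats.longest())  # 5 [(2, 3.5), (4, 2.0)]
```

Because only k entries are retained per node, memory stays constant however many iterations the loop runs, which suits the lightweight goal.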

Further, in the online detection method, the inter-process performance change analysis may include determining whether the average running time of one or more processes exceeds that of the other processes by a predetermined degree.
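A minimal sketch of this first inter-process check, assuming each rank's average time for a node has already been collected into a dictionary keyed by rank. The function name and the 20% margin, which stands in for the unspecified "predetermined degree", are assumptions. Each rank is compared with the mean of the other ranks so that a slow rank does not raise its own baseline.

```python
def slow_processes(avg_time_by_rank: dict, ratio: float = 1.2) -> list:
    """Return ranks whose average time exceeds the mean of all *other*
    ranks by more than `ratio` (1.2, i.e. 20%, is an assumed margin)."""
    n = len(avg_time_by_rank)
    if n < 2:
        return []                     # nothing to compare against
    total = sum(avg_time_by_rank.values())
    flagged = []
    for rank, avg in avg_time_by_rank.items():
        others_mean = (total - avg) / (n - 1)   # O(1) per rank
        if avg > ratio * others_mean:
            flagged.append(rank)
    return sorted(flagged)


if __name__ == "__main__":
    print(slow_processes({0: 1.00, 1: 1.05, 2: 1.60, 3: 0.98}))  # [2]
```

Subtracting each rank from a precomputed total keeps the check linear in the number of processes, which matters at the 10,000-process scale mentioned in the text.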

Further, in the online detection method, the inter-process performance change analysis may include determining whether a performance change in one or more iterations of a loop node occurred in every process.
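This second inter-process check can be sketched as a set intersection over the iterations that each process's in-process check flagged. The input layout and the function name are hypothetical; the sketch only separates iterations that were slow on every process from those that were slow on some processes only.

```python
def split_by_spread(flagged_by_rank: dict) -> tuple:
    """flagged_by_rank maps rank -> iterations flagged by that rank's
    in-process check. Returns (iterations flagged on every rank,
    iterations flagged on only some ranks)."""
    sets = [set(its) for its in flagged_by_rank.values()]
    if not sets:
        return set(), set()
    everywhere = set.intersection(*sets)
    somewhere = set.union(*sets) - everywhere
    return everywhere, somewhere


if __name__ == "__main__":
    all_ranks, some_ranks = split_by_spread({0: [3, 7], 1: [3], 2: [3, 9]})
    print(sorted(all_ranks), sorted(some_ranks))  # [3] [7, 9]
```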

According to another aspect of the present invention, there is provided an online detection system for performance changes during parallel program execution, which may include: a program structure graph obtaining device configured to obtain a program structure graph; and an instrumentation device configured to, based on the program structure graph, automatically or in a user-defined manner instrument nodes of the program with runtime performance change detection code. The performance change detection code includes in-process performance change detection code which, while the program runs, indicates, for multiple iterations of a loop node in the program, which iteration or iterations have changed in performance relative to the other iterations executed so far.

Further, in the online detection system, the iteration or iterations whose performance has changed may include the iteration or iterations that performed worse than the other iterations executed so far.

Further, in the online detection system, the in-process performance change detection code may be configured to compare, while the program runs, the performance of the current iteration with that of the previous iterations in the same run, record in-process performance change information, and provide the user with an in-process performance change report.

Further, in the online detection system, the performance change detection code may also include inter-process performance change detection code, whereby one of the parallel processes running the same task in the parallel program collects the in-process performance change information of every parallel process running that task, compares it, and provides the user with a comparison report on inter-process performance changes.

Further, in the online detection system, the program structure graph may consist of several trees, each corresponding to one function definition body in the code. The graph has three kinds of nodes: loop (LOOP), branch (BR), and call (CALL). Each node is defined by at least three attributes: a globally unique ID; the node type (loop, branch, or call); and the node's position in the source code. The report that the in-process performance change detection code gives the user indicates the loop node's position in the source code and the iterations whose performance changes meet a predetermined criterion.

Further, the online detection system may also be configured to prune the program structure graph before the runtime performance change detection code is inserted.

Further, in the online detection system, pruning may include automatically removing from the program structure graph the nodes whose share of execution time is below a predetermined threshold, so that no runtime performance change detection code is inserted at such nodes.

Further, in the online detection system, the in-process performance change information may include: the total execution time of the node so far; the total number of times the node has executed so far; and the iteration numbers and times of a predetermined number of the node's longest executions so far.

Further, in the online detection system, the inter-process performance change analysis may include determining whether the average running time of one or more processes exceeds that of the other processes by a predetermined degree.

Further, in the online detection system, the inter-process performance change analysis may include determining whether a performance change in one or more iterations of a loop node occurred in every process.

According to another aspect of the present invention, there is provided a computing device, which may include a central processing unit and a memory storing computer-executable code that, when executed by the central processing unit, performs the following method: obtaining a program structure graph; and, based on the program structure graph, automatically or in a user-defined manner instrumenting nodes of the program with runtime performance change detection code. The performance change detection code includes in-process performance change detection code which, while the program runs, indicates, for multiple iterations of a loop node in the program, which iteration or iterations have changed in performance relative to the other iterations executed so far.

Compared with existing solutions, the online detection scheme for parallel program performance changes in the embodiments of the present invention has three main advantages:

(1) It can detect performance failures of the runtime environment while the program is running, whereas existing scheme 1 can only detect them before the program runs;

(2) It detects performance changes automatically and reports them to the developer promptly, instead of waiting for the program to finish as existing scheme 2 does. This matters for long-running parallel programs, since real programs may run for hours or days at a time;

(3) Using the nodes of the program structure graph as the basic unit of analysis, it can select the code to be analyzed automatically or in a user-defined manner, so that not every part of the program has to be analyzed, which reduces runtime performance overhead.

Brief description of the drawings

These and/or other aspects and advantages of the present invention will become clearer and easier to understand from the following detailed description of embodiments of the present invention with reference to the accompanying drawings, in which:

FIG. 1 shows an overall flowchart of an online detection method 100 for performance changes during parallel program execution according to an embodiment of the present invention.

FIG. 2 schematically shows an exemplary source program.

FIG. 3 shows the program structure graph built by the compiler's automatic analysis of the example code of FIG. 2.

FIG. 4 shows a flowchart of a method 110 for obtaining a program structure graph from a source program according to an embodiment of the present invention.

FIG. 5 shows the example source program of FIG. 2 after the call to the function compute has been instrumented and instrumentation has been added outside the loop.

FIG. 6(a) schematically shows a history-based in-process performance change detection method for node 1: CALL(5) in the program of FIG. 5.

FIG. 6(b) schematically shows a method for analyzing performance changes across processes.

FIGS. 7(a)-(c) are schematic diagrams of in-process performance change analysis during parallel program execution according to an embodiment of the present invention.

FIGS. 8(a) and 8(b) are schematic diagrams of an inter-process performance change analysis method.

FIG. 9 shows a logical block diagram of a system 200 for online detection of performance changes during parallel program execution according to another embodiment of the present invention.

Detailed description

To help those skilled in the art better understand the present invention, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

First, some terms used in this document are explained.

"Performance changes", "performance variation", "worse than": the performance changes, varies, or is worse by at least a predetermined degree, or ranks within a predetermined number of top places on a given metric.

Program structure: besides sequentially executed computation code, program source code usually contains control structures such as loops, branches, and function calls. In this document, "program structure" refers to the positions of these control structures in the source code and their hierarchical relationships.

Compiler-based source code analysis: source code written in a high-level language must be translated into machine language by a compiler before it can run on a computer. While the compiler translates the source, the programmer can obtain the program structure information of the source code from the compiler through an interface.

Automatic source code instrumentation: a tool adds extra code to the source code according to certain rules, so that the modified program meets specific goals (for example, outputting certain information while it runs).

Process: a process can be understood as a "running instance of a program". For example, given an executable file a.exe, each run of the program creates a corresponding process; if many computers or CPUs run a.exe, there are many a.exe processes. A parallel program is one in which many processes work together at the same time. Tasks are usually divided in single-program, multiple-data (SPMD) fashion. For example, three processes each have to complete 10 iterations but work on different things, much as three people each have to complete 10 exam papers: one does 10 English papers, one does 10 math papers, and the third does 10 Chinese papers. In other words, every process executes the same code but processes different data.

Performance variance: for multiple iterations of a loop node within a process, one iteration performs differently from the earlier iterations; or, across multiple processes, one or more processes perform differently from the others.

Before the embodiments are described, the inventive concept is outlined to help those skilled in the art grasp the invention. The invention combines static and dynamic techniques. During source code compilation, a compiler analyzes the code structure of the program and builds a Program Structure Graph (PSG), and the source code is instrumented automatically based on this graph. While the program runs, the instrumented code collects and analyzes the program's performance data in order to detect performance changes in the parallel program.

In step S110 of method 100 (FIG. 1), a program structure graph is obtained.

As an example, FIG. 2 schematically shows an exemplary source program, and FIG. 3 shows the program structure graph that the compiler's automatic analysis builds from it.

A program structure graph consists of several trees, each corresponding to one function definition body in the code. The graph has three kinds of nodes: loop (LOOP), branch (BR), and call (CALL). Arrows indicate containment relationships between nodes. Each node may have three basic attributes:

1. A globally unique ID, obtained by traversing the trees of the program structure graph; it is not directly related to the source code position described in item 3 below;

2. The node type: LOOP (loop), BR (branch), or CALL (function call);

3. The node's position in the source code; the example in FIG. 3 shows the starting line number. In one example, richer position information can also be recorded, such as the file name and the starting and ending line and column numbers.

For example, in FIG. 3, node 0: LOOP(3) has global node ID 0, type LOOP (loop), and starts at line 3 of the code; node 1: CALL(5) has global node ID 1, type CALL (function call), and starts at line 5; node 2: BR(6) has global node ID 2, type BR (branch), and starts at line 6.
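The node attributes above map naturally onto a small tree type. The Python sketch below is illustrative: `PSGNode` and `assign_ids` are hypothetical names, pre-order traversal is simply one traversal order that reproduces the IDs in the example, and nesting `CALL(5)` and `BR(6)` directly inside `LOOP(3)` is an assumption, since FIG. 3 is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PSGNode:
    kind: str                       # "LOOP", "BR", or "CALL"
    line: int                       # starting source line (file/column omitted)
    children: List["PSGNode"] = field(default_factory=list)
    node_id: Optional[int] = None   # assigned later by traversal

    def label(self) -> str:
        return f"{self.node_id}:{self.kind}({self.line})"


def assign_ids(trees: List[PSGNode]) -> List[PSGNode]:
    """Give every node a globally unique ID with one pre-order walk over
    all trees (one tree per function definition); return nodes by ID."""
    order = []
    stack = list(reversed(trees))
    while stack:
        node = stack.pop()
        node.node_id = len(order)
        order.append(node)
        stack.extend(reversed(node.children))   # visit children left to right
    return order


if __name__ == "__main__":
    loop = PSGNode("LOOP", 3, [PSGNode("CALL", 5), PSGNode("BR", 6)])
    print([n.label() for n in assign_ids([loop])])
    # ['0:LOOP(3)', '1:CALL(5)', '2:BR(6)']
```

Numbering continues across trees, so IDs stay unique across the whole program rather than per function, as item 1 requires.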

In step S111, the program's source code is analyzed automatically by a compiler to obtain an original program structure graph (PSG).

Unlike existing scheme 2, the technical solution of one embodiment of the present invention does not blindly record the timing of every call to every function; instead, it prunes the program structure graph according to certain rules and records information only for key nodes.

The pruning rules may be generated automatically, provided by the user, or a combination of both.

When the user provides the pruning rules, the user may list, in a configuration file, the IDs of the nodes to be kept or removed, and the original program structure graph (PSG) is then pruned according to this node ID list.
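A minimal sketch of user-provided pruning. The text specifies only that a configuration file lists node IDs to keep or remove; the `keep <id>` / `remove <id>` line format, the `#` comments, and the rule that a non-empty keep list restricts the selection are assumptions made for this example.

```python
def parse_prune_config(text: str) -> tuple:
    """Parse a hypothetical format: one 'keep <id>' or 'remove <id>' per
    line; '#' starts a comment. Returns (keep_ids, remove_ids)."""
    keep, remove = set(), set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        action, node_id = line.split()
        if action == "keep":
            keep.add(int(node_id))
        elif action == "remove":
            remove.add(int(node_id))
        else:
            raise ValueError(f"unknown directive {action!r}")
    return keep, remove


def select_nodes(all_ids, keep: set, remove: set) -> list:
    """Start from the keep list if one is given, otherwise from all nodes;
    explicitly removed IDs are always dropped."""
    ids = set(all_ids)
    base = ids & keep if keep else ids
    return sorted(base - remove)


if __name__ == "__main__":
    cfg = "keep 0\nkeep 1\n# node 2 is a cheap branch\nremove 2\n"
    keep, remove = parse_prune_config(cfg)
    print(select_nodes(range(3), keep, remove))  # [0, 1]
```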

For automatic pruning, the program can first be run once, and the nodes that account for a small share of that run's time (for example, nodes taking less than 1% of total program time) are removed. This preliminary run can be small-scale. For example, if the actual problem requires 10,000 processes computing for 24 hours, a 10-minute run with 100 processes may be enough to obtain performance information for most nodes of the PSG and prune it accordingly.
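Automatic pruning reduces to a share-of-time test on the preliminary run. The sketch below assumes that run yields each node's total time in seconds; the function name and the policy of keeping nodes that the short run never reached are assumptions, while the 1% default comes from the text.

```python
def auto_prune(all_ids, node_time: dict, total_time: float,
               threshold: float = 0.01, keep_unseen: bool = True) -> list:
    """Keep nodes whose share of the preliminary run's total time is at
    least `threshold`. node_time maps node ID -> seconds spent in the node.
    Nodes with no measurement are kept by default (an assumed policy)."""
    kept = []
    for nid in all_ids:
        t = node_time.get(nid)
        if t is None:
            if keep_unseen:
                kept.append(nid)
        elif t / total_time >= threshold:
            kept.append(nid)
    return kept


if __name__ == "__main__":
    # A 600 s (10-minute) preliminary run; node 3 was never reached.
    times = {0: 590.0, 1: 540.0, 2: 3.0}
    print(auto_prune([0, 1, 2, 3], times, 600.0))                     # [0, 1, 3]
    print(auto_prune([0, 1, 2, 3], times, 600.0, keep_unseen=False))  # [0, 1]
```

Node 2 is dropped because 3 s out of 600 s is 0.5%, below the 1% threshold.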

After pruning, the program structure graph is reconstructed. This mainly involves merging the structure trees in the graph and adjusting the relationships between nodes to meet the needs of the subsequent runtime analysis.
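The text does not detail the reconstruction step. One relationship adjustment that pruning makes necessary is reconnecting a kept node whose parent was removed. The sketch below assumes the graph is stored as a child-to-parent map and attaches each kept node to its nearest kept ancestor; the function name is hypothetical, and merging of the per-function trees is not shown.

```python
def reparent(parent: dict, kept: set) -> dict:
    """parent maps node ID -> parent ID (None for a tree root). Returns a
    parent map over the kept nodes only, attaching each kept node to its
    nearest kept ancestor, or None if no ancestor survived pruning."""
    new_parent = {}
    for nid in kept:
        p = parent.get(nid)
        while p is not None and p not in kept:
            p = parent.get(p)       # climb past removed ancestors
        new_parent[nid] = p
    return new_parent


if __name__ == "__main__":
    parent = {0: None, 1: 0, 2: 1, 3: 0}
    print(sorted(reparent(parent, {0, 2, 3}).items()))
    # [(0, None), (2, 0), (3, 0)]
```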

Pruning and reconstructing the program structure graph according to such rules reduces the performance impact on the final instrumented run and makes the results easier to read.

Returning to FIG. 1, after step S110 is completed, the method proceeds to step S120.

In step S120, based on the program structure graph, runtime performance change detection code is inserted, automatically or in a user-defined manner, at nodes of the program. The performance change detection code includes in-process performance change detection code which, while the program runs, indicates, for multiple iterations of a loop node in the program, which iteration or iterations have changed in performance relative to the other iterations executed so far.

After the program structure graph is obtained, the program source code is instrumented according to it. The purpose of instrumentation is to provide performance information for the graph's nodes at runtime without changing the program's original algorithm.
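The following Python sketch illustrates source-level instrumentation of a CALL node and of the post-loop point, using the instrumentation names vtx_in, vtx_out, and GLOBAL_ANALYSIS from the text. It rewrites plain text lines for brevity, whereas the text relies on compiler-based analysis. `instrument`, `SAMPLE_SRC`, and the one-statement-per-line assumption are hypothetical; `SAMPLE_SRC` is only a stand-in shaped like the nodes of FIG. 3 (LOOP at line 3, CALL at line 5, BR at line 6), not the actual code of FIG. 2.

```python
# Hypothetical C source standing in for FIG. 2.
SAMPLE_SRC = """int main(void) {
    double t = 0.0;
    for (int i = 0; i < N; i++) {
        t += DT;
        compute(t);
        if (t > T_MAX)
            break;
    }
    return 0;
}"""


def instrument(source: str, calls, loop_end_line: int) -> str:
    """Wrap each selected CALL line with vtx_in/vtx_out and append
    GLOBAL_ANALYSIS() after the loop's closing line.

    calls: (node_id, line) pairs for CALL nodes kept after pruning.
    Lines are 1-based; each call is assumed to fit on one line.
    """
    call_at = {line: nid for nid, line in calls}
    out = []
    for no, text in enumerate(source.splitlines(), start=1):
        indent = text[: len(text) - len(text.lstrip())]  # reuse indentation
        if no in call_at:
            out.append(f"{indent}vtx_in({call_at[no]}, CALL);")
            out.append(text)
            out.append(f"{indent}vtx_out({call_at[no]}, CALL);")
        else:
            out.append(text)
        if no == loop_end_line:
            out.append(f"{indent}GLOBAL_ANALYSIS();")
    return "\n".join(out)


if __name__ == "__main__":
    print(instrument(SAMPLE_SRC, calls=[(1, 5)], loop_end_line=8))
```

Only node 1 is wrapped here; the loop and branch nodes contribute the placement of GLOBAL_ANALYSIS() and nothing else, which mirrors the FIG. 5 description of instrumenting the compute call and the code after the loop.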

In one example, the performance change detection code also includes inter-process performance change detection code, whereby one of the parallel processes running the same task in the parallel program collects the in-process performance change information of every parallel process running that task, compares it, and provides the user with a comparison report on inter-process performance changes.

vtx_in(1,CALL): the pre-instrumentation point, which records information when execution of the compute node begins. The parameters of vtx_in are the node ID and the node type.

vtx_out(1,CALL): the post-instrumentation point, which records information when execution of compute ends. The history-based in-process analysis is performed in the vtx_out function.

GLOBAL_ANALYSIS(): the post-loop instrumentation point, which performs inter-process analysis once all iterations of the loop have completed. This involves two steps: (1) one process collects the data of the other processes; (2) that process analyzes the global performance changes.

As shown in Figure 5, the instrumentation functions vtx_in and vtx_out are inserted before and after the call to compute. They record the start and end of the compute call, that is, the (1:CALL(5)) node in the structure diagram shown in Figure 3. The history-based in-process analysis described above is performed in vtx_out.

As shown in Figure 5, the instrumentation code GLOBAL_ANALYSIS() is also inserted outside the loop (line 10) to perform inter-process analysis. This analysis requires collecting the data that each process recorded locally. In one example, the process with the smallest process number does the collecting. No additional process is needed, so the instrumented program can still run in its original environment.
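The following Python sketch shows how the three instrumentation points could fit together around a loop like the one in Figure 5. It is a single-process simulation under stated assumptions. The class `Instrumentation`, the injectable `clock`, `run_rank` and `global_analysis` are all hypothetical names. In a real MPI program, the collection step would typically be a gather to the lowest-ranked process (for example `MPI_Gather` with root 0), not a dictionary comprehension.

```python
from collections import defaultdict

CALL = "CALL"  # node-type tag, following the PSG notation "1:CALL(5)"


class Instrumentation:
    """Hypothetical per-process runtime behind vtx_in / vtx_out."""

    def __init__(self, rank, clock):
        self.rank = rank
        self.clock = clock                # injectable time source (assumption)
        self.start = {}                   # node_id -> start timestamp
        self.times = defaultdict(list)    # node_id -> per-iteration durations

    def vtx_in(self, node_id, node_type):
        # Pre-instrumentation point: remember when the node started.
        self.start[node_id] = self.clock()

    def vtx_out(self, node_id, node_type):
        # Post-instrumentation point: store this iteration's duration.
        # The history-based in-process analysis would also run here.
        elapsed = self.clock() - self.start.pop(node_id)
        self.times[node_id].append(elapsed)


def global_analysis(per_rank):
    """GLOBAL_ANALYSIS sketch: the lowest-numbered process collects every
    process's local data, so no extra analysis process is needed."""
    root = min(per_rank)
    gathered = {rank: dict(inst.times) for rank, inst in per_rank.items()}
    return root, gathered


def run_rank(rank, timestamps, iterations):
    # Replay a fixed list of clock readings instead of real timing.
    ticks = iter(timestamps)
    inst = Instrumentation(rank, clock=lambda: next(ticks))
    for _ in range(iterations):
        inst.vtx_in(1, CALL)
        # compute() would execute here
        inst.vtx_out(1, CALL)
    return inst


procs = {0: run_rank(0, [0, 1, 1, 3, 3, 4], 3),
         1: run_rank(1, [0, 2, 2, 3, 3, 4], 3)}
root, data = global_analysis(procs)
print(root, data[0][1], data[1][1])  # 0 [1, 2, 1] [2, 1, 1]
```

The timestamps are illustrative only. Injecting the clock keeps the sketch deterministic while still showing where each hook sits relative to the monitored call.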

The user can compile and run the instrumented code with an existing compiler. Performance changes are detected while the program runs, and the analysis can be presented to the user in a suitable form, for example on a display screen or as a voice prompt.

The iteration or iterations whose performance has changed may include those that performed worse than the other iterations in the history.

The in-process performance change detection code may be configured to do three things while the program runs: compare the performance of the current iteration with that of the previous iterations in the same run, record in-process performance change information, and give the user an in-process performance change report.

The in-process performance change information may include the following:
- the sum of the node's historical execution times;
- the total number of times the node has been executed;
- the iteration numbers and times of a predetermined number of the node's slowest executions in the history.

In-process and inter-process performance change detection based on the program structure diagram are illustrated schematically below with reference to Figures 6(a) and 6(b).

If a PSG node lies inside a loop, its execution information should be similar from one iteration to the next. The online detection scheme of one embodiment of the present invention records this information for each iteration of a structure-diagram node inside a loop. After each execution, it compares the new record with the historical information to find abnormal performance changes. When a change is found, the tool can report it immediately. It does not have to wait until the whole program has finished, as "existing solution 2" does.

Take the example of Figure 6(a). Node 1:CALL(5), shown in Figure 5, lies inside loop node 0:LOOP(3). As it executes, the instrumentation code monitors each iteration of node 1:CALL(5) and compares it with the historical information after every execution, so that abnormal performance changes can be found. For example, the numbers of iterations that run significantly slower are identified and recorded. The total number of iterations and the average running time per iteration can also be recorded. In Figure 6(a), shaded rectangles mark iterations with abnormal performance. With iterations numbered 0, 1, 2, ..., iteration 2 is clearly slower than the others. Its number and running time are recorded and reported to the user while the program is still running. The user therefore learns promptly when the running program became abnormal, which makes it easier to analyze the cause and locate the problem.
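Below is a minimal sketch of the history-based check that vtx_out could run for a loop-resident node. The source does not define "significantly slower". This sketch assumes a rule: an iteration is flagged when its time exceeds `factor` times the mean of all earlier iterations, with `factor = 1.5`. The class name `OnlineDetector` and the sample timings are also hypothetical.

```python
class OnlineDetector:
    """Hypothetical vtx_out-side check for one loop-resident PSG node:
    each new iteration is compared with the history of earlier ones."""

    def __init__(self, factor=1.5, report=print):
        self.factor = factor   # assumed slowdown threshold, not from the source
        self.report = report   # where reports go (screen, log, voice, ...)
        self.total = 0.0       # sum of all earlier iteration times
        self.count = 0         # number of earlier iterations

    def observe(self, iteration, elapsed):
        flagged = False
        if self.count > 0:
            mean = self.total / self.count
            if elapsed > self.factor * mean:
                flagged = True
                # Report right away, while the program is still running.
                self.report(f"iteration {iteration}: {elapsed:.2f} "
                            f"vs historical mean {mean:.2f}")
        self.total += elapsed
        self.count += 1
        return flagged


reports = []
det = OnlineDetector(report=reports.append)
flags = [det.observe(i, t) for i, t in enumerate([1.0, 1.0, 3.0, 1.0])]
print(flags)    # [False, False, True, False]
print(reports)  # ['iteration 2: 3.00 vs historical mean 1.00']
```

The flagged iteration is still added to the running total, so one outlier slightly raises the baseline for later iterations. Excluding flagged iterations from the mean would be an equally reasonable choice.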

The in-process detection described above only establishes that a particular process changed in performance at a particular moment. It says nothing about the global picture, such as how many processes changed or by how much. Global analysis can be achieved by comparing the performance changes detected within each process, and it can hint at the root cause. Suppose only a small, concentrated group of processes shows a performance change while the rest behave normally. Then the environment in which those processes run may have changed, and a performance fault may have occurred. This detects a system performance fault arising at runtime, which "existing solution 1" cannot detect. If instead all processes change in performance at the same moment, there may be some global performance interference, such as sudden network congestion.

As shown in Figure 6(b), representative performance change parameters are recorded within each of process 0, process 1, ..., process i, ..., process n. The performance changes detected in each process are then compared to analyze and estimate the global distribution of performance changes. For example, the record for process 0 holds:
- the number of its degraded iteration (2) and that iteration's running time;
- the total number of iterations of process 0;
- the average iteration running time.

Process i has the same fields, with degraded iteration 1. Likewise for the remaining processes, up to process n, whose degraded iteration is 3. A more detailed example of inter-process performance change analysis is given later.

With the online detection method of the embodiments, in-process performance changes are detected in real time while the parallel program runs, and the user is notified of them. Key clues therefore reach the user promptly, instead of the user spending much time and effort searching through voluminous logs and reports as before. The method is also based on the program structure diagram and uses its nodes as the basic unit of analysis. The code to be analyzed can therefore be selected automatically or by the user, so not every part of the program needs to be analyzed and runtime performance overhead is reduced.

An example of in-process performance change analysis during parallel program execution according to an embodiment of the present invention is described below with reference to Figures 7(a)-(c).

As mentioned above, runtime analysis is based on the program structure diagram. In one example, instrumentation code is inserted before and after every node of the program structure diagram during source instrumentation. Runtime performance information for each PSG node, such as its execution time and number of executions, can therefore be obtained.

In-process performance change analysis relies on historical information. Loops in real programs may run a very large number of iterations. It is therefore preferable not to save the information of every iteration, but to record a representative subset of the history. Preferably, the recorded information may include:

(1) the sum of the node's historical execution times;

(2) the total number of times the node has been executed;

(3) the iteration numbers and times of the several executions of the node with the longest execution time (that is, the slowest executions) in the history.

Figure 7(a) shows example source code in which the performance of the bar function called inside the loop is monitored. The left side of Figure 7(c) shows the corresponding program structure diagram, where the in-process performance change of node 1:CALL(3) is monitored. In the label "1:CALL(3)", "1" is the global node ID, CALL (function call) is the node type, and 3 is the line of the code where the node starts. Figure 7(b) shows the in-process execution history of node 1:CALL(3). The bar function was executed 10 times, in iterations numbered 0 through 9, with execution times of 1, 1, 1, 3, 5, 1, 1, 2, 3 and 1 respectively. The three slowest iterations are therefore iterations 3, 4 and 8, with times 3, 5 and 3. Not all ten execution times are kept. The record holds only the total count (10), the total execution time (19) and the three slowest executions, written as (iteration number:time). For example, (4:5) means that iteration 4 had an execution time of 5, as shown on the right side of Figure 7(c).
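The bounded record of items (1)-(3) can be sketched with a min-heap that keeps only the k slowest executions. Memory then stays constant however many iterations the loop runs. The class `NodeHistory` is a hypothetical name, and so is the tie-breaking rule (when times are equal, the earlier iteration is kept). The usage replays the ten timings of Figure 7(b).

```python
import heapq


class NodeHistory:
    """Hypothetical bounded history for one PSG node: total time,
    execution count, and the k slowest (iteration, time) executions."""

    def __init__(self, k=3):
        self.k = k
        self.total_time = 0
        self.count = 0
        # Min-heap of (time, -iteration): the root is the "least slow" entry
        # kept, and on equal times the later iteration ranks lower.
        self._heap = []

    def record(self, iteration, elapsed):
        self.total_time += elapsed
        self.count += 1
        item = (elapsed, -iteration)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, item)
        elif item > self._heap[0]:
            # Slower than the least slow kept entry: replace it.
            heapq.heapreplace(self._heap, item)

    def slowest(self):
        # (iteration, time) pairs, slowest first; ties favour the earlier iteration.
        return [(-neg_it, t) for t, neg_it in sorted(self._heap, reverse=True)]


h = NodeHistory(k=3)
for it, t in enumerate([1, 1, 1, 3, 5, 1, 1, 2, 3, 1]):
    h.record(it, t)
print(h.count, h.total_time, h.slowest())  # 10 19 [(4, 5), (3, 3), (8, 3)]
```

The output matches the record on the right of Figure 7(c): a count of 10, a total of 19, and the slowest executions (4:5), (3:3) and (8:3). The per-node average used later (vtx_avg) is simply `total_time / count`.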

An example method for analyzing inter-process performance changes is described below with reference to Figures 8(a) and 8(b).

The left side of Figure 8(a) (or the top of Figure 8(b)) shows the iterations within each process. The vertical axis gives the process number and the horizontal axis the iteration number.
- The bottom row shows that iterations 0, 1, 2, 3, 4 and 5 of process 0 took 1, 2, 1, 2, 1 and 1 respectively. Shaded squares mark its two slowest iterations, iterations 1 and 3.
- The top row shows that iterations 0 to 5 of process 4 took 1, 4, 1, 3, 5 and 2. Its two slowest iterations are iterations 1 and 4.
- The other rows are read the same way.

In this example, each process reports its few slowest iterations as performance change instances once the loop ends. These "slowest iterations" are then compared across processes to infer the global performance changes.

For example, inter-process performance changes can be analyzed in terms of their spatial distribution (across processes) and their temporal distribution (across iterations).

Figure 8(a) shows an example of detecting the spatial distribution of performance changes. The size of the performance change found may differ from process to process: some processes change a lot, others only a little. In Figure 8(a), each of the five processes recorded the two slowest executions of a certain PSG node in a certain loop as its in-process performance change instances.

The analysis (analysis 1) computes each process's performance average (vtx_avg). This is the total time of all iterations in that process divided by the number of iterations. For process 4, for example, vtx_avg = (1+4+1+3+5+2)/6 = 2.67. The performance averages of processes 4, 3, 2, 1 and 0 are 2.67, 1.50, 1.17, 1.33 and 1.33 respectively, so process 4's average is far higher than the others. Globally, then, the performance change shows spatial locality, and the environment in which process 4 runs may have a performance fault.

Figure 8(a) also shows the performance average over all processes, proc_avg. It is the sum of the per-process averages divided by the number of processes, which here is 8.00/5 = 1.60.
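The spatial analysis can be sketched as follows. Each process summary is a (total time, iteration count) pair. The pairs for processes 0 and 4 follow from the timings listed for Figure 8(a). The pairs for processes 1 to 3 are back-computed from their stated averages over six iterations. The rule for flagging a process as an outlier (vtx_avg above 1.5 times proc_avg) is an assumed threshold that does not come from the source. The function name is hypothetical.

```python
def spatial_analysis(summaries, factor=1.5):
    """summaries: rank -> (total_time, count) for one PSG node.
    Returns per-process averages, their mean, and the outlier ranks."""
    vtx_avg = {rank: total / count for rank, (total, count) in summaries.items()}
    proc_avg = sum(vtx_avg.values()) / len(vtx_avg)
    # Assumed rule: a process is suspicious if it is much slower than average.
    outliers = sorted(r for r, avg in vtx_avg.items() if avg > factor * proc_avg)
    return vtx_avg, proc_avg, outliers


summaries = {0: (8, 6), 1: (8, 6), 2: (7, 6), 3: (9, 6), 4: (16, 6)}
vtx_avg, proc_avg, outliers = spatial_analysis(summaries)
print(round(vtx_avg[4], 2), round(proc_avg, 2), outliers)  # 2.67 1.6 [4]
```

Only process 4 exceeds the assumed threshold (2.67 > 2.40). This matches the conclusion in the text that the slowdown is spatially local to process 4.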

Figure 8(b) is a schematic of detecting the temporal distribution of inter-process performance changes. For each iteration number, it counts how many processes show a performance change. In this example, every process shows a performance change at iteration 1. At the other iterations only individual processes do:
- iteration 2 in process 1 (here meaning it is one of that process's two slowest iterations);
- iteration 3 in process 0;
- iteration 4 in processes 3 and 4;
- iteration 5 in no process.

The global performance change in this example is therefore localized in time. Interference affecting the whole program may have occurred at iteration 1 and disappeared by iteration 2.
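Here is a sketch of the temporal analysis, combined with the interpretation given earlier for global performance changes. If every process slowed down at an iteration, that suggests global interference. If only some processes slowed down, that points to a local cause. The input uses only the slow-iteration sets stated in the text. The second slowest iteration of process 2 is not given, so only iteration 1 is listed for that process. The function name and the labels "global", "local" and "none" are hypothetical.

```python
from collections import Counter


def temporal_analysis(slow_iters, n_iterations):
    """slow_iters: rank -> iterations that rank reported as its slowest.
    Returns per-iteration process counts and a rough classification."""
    n_procs = len(slow_iters)
    # set() guards against a rank listing the same iteration twice.
    counts = Counter(it for its in slow_iters.values() for it in set(its))
    verdict = {}
    for it in range(n_iterations):
        c = counts[it]  # Counter returns 0 for iterations nobody reported
        if c == n_procs:
            verdict[it] = "global"  # every process slowed down at once
        elif c > 0:
            verdict[it] = "local"   # only some processes were affected
        else:
            verdict[it] = "none"
    return counts, verdict


slow = {0: [1, 3], 1: [1, 2], 2: [1], 3: [1, 4], 4: [1, 4]}
counts, verdict = temporal_analysis(slow, n_iterations=6)
print([counts[i] for i in range(6)])  # [0, 5, 1, 1, 2, 0]
print(verdict[1], verdict[4])         # global local
```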

Note that the inter-process analysis methods above are only examples. Analysis items or methods may be changed, removed or added as needed. For instance, one could also analyze the variance of the performance changes, or the mean running time of the iterations whose performance changed.

According to another embodiment of the present invention, an online detection system for performance changes during parallel program execution is also provided.

Figure 9 shows a logical block diagram of an online detection system 200 for performance changes during parallel program execution, according to another embodiment of the present invention. The online detection system 200 may include a program structure diagram obtaining device 210 and an instrumentation device 220.

The program structure diagram obtaining device 210 is configured to obtain the program structure diagram of the parallel program.

The instrumentation device 220 is configured to insert runtime performance change detection code at nodes of the program, automatically or in a user-customized manner, based on the program structure diagram. This code includes in-process performance change detection code. While the program is running, the in-process code identifies, across the multiple iterations of a loop node in the program, which iteration or iterations have changed in performance relative to the other iterations in its history.

When compiled and run, the program instrumented with the runtime performance change detection code reports performance changes promptly, while it is still executing.

For the functions and operation of the program structure diagram obtaining device 210 and the instrumentation device 220, refer to the description given above in connection with Figures 1-8(b). That description is not repeated here.

According to another aspect of the present invention, a computing device is provided. It comprises a central processing unit and a memory storing computer-executable code. When executed by the central processing unit, the code performs the following method:
- obtaining a program structure diagram; and
- inserting runtime performance change detection code at nodes of the program, automatically or in a user-customized manner, based on the program structure diagram.

The performance change detection code includes in-process performance change detection code. While the program is running, the in-process code identifies, across the multiple iterations of a loop node in the program, which iteration or iterations have changed in performance relative to the other iterations in its history.

Compared with existing solutions, the online detection method and system for parallel program performance changes of the embodiments of the present invention have three main advantages:

Various embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The protection scope of the present invention is therefore defined by the claims.

Claims

1. An online detection method for performance changes during parallel program execution, comprising:

Obtaining a program structure diagram; and

Inserting runtime performance change detection code at nodes of the program, automatically or in a user-customized manner, based on the program structure diagram, the performance change detection code including in-process performance change detection code which, while the program is running, indicates, for multiple iterations of a loop node in the program, which iteration or iterations have changed in performance relative to the other iterations in the history.

2. The online detection method according to claim 1, wherein the iteration or iterations whose performance has changed include the iteration or iterations whose performance is worse than that of the other iterations in the history.

3. The online detection method according to claim 1, wherein the in-process performance change detection code is configured to compare, while the program is running, the performance of the current iteration with that of previous iterations in the same run, to record in-process performance change information, and to report in-process performance changes to the user.

4. The online detection method according to claim 1, wherein

the performance change detection code further includes inter-process performance change detection code, whereby one of the parallel processes running the same task in the parallel program collects and compares the in-process performance change information of each parallel process running that task, and gives the user a comparative report on inter-process performance changes.

5. The online detection method according to claim 1, wherein the program structure diagram consists of several trees, each tree corresponding to one function definition body in the code and containing three kinds of nodes: loop (LOOP), branch (BR) and call (CALL); each node is characterized by at least three attributes: a globally unique ID, the node type (loop, branch or call), and the position of the node in the source code; and the in-process performance change detection code reports to the user the position of the loop node in the source code and the iterations whose performance change meets a predetermined criterion.

6. The online detection method according to claim 1, further comprising pruning the program structure diagram before the runtime performance change detection code is inserted.

7. The online detection method according to claim 6, wherein the pruning includes automatically removing from the program structure diagram any node whose share of execution time is below a predetermined threshold, so that no runtime performance change detection code is inserted at such nodes.

8. The online detection method according to claim 4, wherein the in-process performance change information comprises:

The sum of the execution time of this node in history;

The total number of times the node has been executed in history;

The iteration numbers and times of a predetermined number of the node's executions with the longest execution times in the history.

9. An online detection system for performance changes during parallel program execution, comprising:

A program structure diagram obtaining device, configured to obtain a program structure diagram; and

An instrumentation device, configured to insert runtime performance change detection code at nodes of the program, automatically or in a user-customized manner, based on the program structure diagram, the performance change detection code including in-process performance change detection code which, while the program is running, indicates, for multiple iterations of a loop node in the program, which iteration or iterations have changed in performance relative to the other iterations in the history.

10. A computing device comprising a central processing unit and a memory storing computer executable code which, when executed by the central processing unit, performs the following method:

Obtain a program structure diagram; and