[go: up one dir, main page]

CN102486727A - Parallel kraut decomposition method for ultra-large-scale matrix multi-core based on thread building blocks - Google Patents

Parallel kraut decomposition method for ultra-large-scale matrix multi-core based on thread building blocks Download PDF

Info

Publication number
CN102486727A
CN102486727A CN2010105718566A CN201010571856A CN102486727A CN 102486727 A CN102486727 A CN 102486727A CN 2010105718566 A CN2010105718566 A CN 2010105718566A CN 201010571856 A CN201010571856 A CN 201010571856A CN 102486727 A CN102486727 A CN 102486727A
Authority
CN
China
Prior art keywords
parallel
matrix
crout
tbb
decomposition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010105718566A
Other languages
Chinese (zh)
Other versions
CN102486727B (en
Inventor
马健
张丽岩
李克平
孙剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN201010571856.6A priority Critical patent/CN102486727B/en
Publication of CN102486727A publication Critical patent/CN102486727A/en
Application granted granted Critical
Publication of CN102486727B publication Critical patent/CN102486727B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Complex Calculations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a multinuclear parallel crout decomposition method for an ultra-large scale matrix based on a TBB (Treading Building Block), which comprises the following steps: 1) rewriting a parallel computing part according to a traditional Crout method into a standard part meeting TBB requirement; 2) setting a question initial value; 3) entering into line circulation of the Crout method; 4) calling a parallel_reduce parallel module, storing the maximum pivot element in each line into a temporary vector, and meanwhile, calculating and storing a scale factor; 5) entering into row circulation of the Crout method; 6) confirming a new pivot element and the scale factor, modifying TINY, dividing each line by the pivot element; and 7) finishing the decomposition of the ultra-large scale matrix, thereby obtaining a row and line arranging sequence changed due to a local pivot element method. Compared with the prior art, the multinuclear parallel crout decomposition method provided by the invention has the advantages that the running efficiency of matrix LU decomposition is greatly increased, the method is performed on different platforms, the method is extendable, the application is wide, and the like.

Description

基于线程构造块的超大规模矩阵多核并行克劳特分解方法Parallel kraut decomposition method for ultra-large-scale matrix multi-core based on thread building blocks

技术领域 technical field

本发明涉及一种超大规模矩阵分解方法,尤其是涉及基于线程构造块的超大规模矩阵多核并行克劳特(Crout)分解方法。The invention relates to an ultra-large-scale matrix decomposition method, in particular to a multi-core parallel Crout decomposition method of an ultra-large-scale matrix based on a thread building block.

背景技术 Background technique

Intel刚刚推广的线程构建模块(Threading Building Blocks,TBB)是基于C++的多线程并行编程模型,用于支持多核处理器的并行计算,具有成熟的数据结构,支持可扩展的线程嵌套并行,支持可扩展内存分配、支持多种系统平台以及多种编译器支持。TBB的编程模式是使用模板作为并行迭代模型。这使得程序员很方便的处理同步、负载平衡、缓存优化等问题,而能轻松地实现自动调度的并行程序,充分利用CPU的多核运算能力及其他系统资源。TBB提升了程序的伸缩性及可扩展性,并且完全支持嵌套的并行编程。程序员可以创建自己的并行组件,以构建大型的并行程序。同时由于TBB是芯片制造商Intel开发的,对于多核平台有着更好的支持。The Threading Building Blocks (TBB) just promoted by Intel is a multi-threaded parallel programming model based on C++, which is used to support parallel computing of multi-core processors. It has a mature data structure, supports scalable thread nesting parallelism, and supports Scalable memory allocation, support for multiple system platforms, and support for multiple compilers. TBB's programming model is to use templates as a parallel iteration model. This makes it easy for programmers to deal with issues such as synchronization, load balancing, and cache optimization, and to easily implement parallel programs that are automatically scheduled, making full use of the CPU's multi-core computing capabilities and other system resources. TBB improves the scalability and scalability of the program, and fully supports nested parallel programming. Programmers can create their own parallel components to build large parallel programs. At the same time, because TBB is developed by chip manufacturer Intel, it has better support for multi-core platforms.

TBB是一个开源的C++模板库,它可将线程抽象成任务,然后创建可靠、可移植且可扩展的并行应用程序。使用TBB编写基于任务的并行应用程序,有助于提高跨多核平台上运行的可扩展软件的开发效率。通本地线程和线程封装器等其它线程化方法相比,TBB是执行并行应用程序最高效的方式,它能够充分利用多核平台的并行性能。TBB有以下三个显著的优点:1)TBB对线程进行抽象,将其提高到任务的高度,简化了并行应用程序的开发工作;2)基于TBB开发的应用程序的性能可以随着处理器内核数量的增加而自动提升;3)TBB能够减少死锁和资源竞争等常见线程错误,提供了一个跨平台、可扩展的并行解决方案。TBB is an open source C++ templating library that abstracts threads into tasks and then creates reliable, portable, and scalable parallel applications. Writing task-based parallel applications using TBB helps improve the development efficiency of scalable software that runs across multi-core platforms. Compared with other threading methods such as native threads and thread wrappers, TBB is the most efficient way to execute parallel applications, and it can fully utilize the parallel performance of multi-core platforms. TBB has the following three significant advantages: 1) TBB abstracts threads and raises them to the level of tasks, which simplifies the development of parallel applications; 2) The performance of applications developed based on TBB can increase 3) TBB can reduce common thread errors such as deadlock and resource competition, and provides a cross-platform, scalable parallel solution.

目前,在理论研究和实际应用中,矩阵求逆都有广泛的应用,如求解线性方程组、动态规划、计算机图形处理、控制工程等等。目前,矩阵求逆的方法主要有以下几种:(1)伴随矩阵法;(2)初等变换法;(3)高斯-约当消去法;(4)三角分解法。下面分别加以介绍。At present, in theoretical research and practical application, matrix inversion has a wide range of applications, such as solving linear equations, dynamic programming, computer graphics processing, control engineering and so on. At present, there are mainly the following methods for matrix inversion: (1) adjoint matrix method; (2) elementary transformation method; (3) Gauss-Jordan elimination method; (4) triangular decomposition method. To be introduced below.

(1)伴随矩阵法。利用矩阵的伴随矩阵及矩阵的行列式来求得矩阵的逆矩阵。此方法一般适用于维数较小的矩阵(1) Adjoint matrix method. Use the adjoint matrix of the matrix and the determinant of the matrix to find the inverse matrix of the matrix. This method is generally suitable for matrices with small dimensions

(2)初等变换法。原矩阵和单位矩阵同时进行初等行(或列)变换,当原矩阵变成单位矩阵的时候,单位矩阵就变成了原矩阵的逆矩阵。(2) Elementary transformation method. The original matrix and the identity matrix undergo elementary row (or column) transformation at the same time. When the original matrix becomes the identity matrix, the identity matrix becomes the inverse matrix of the original matrix.

(3)高斯-约当消去法:又称高斯消去法、高斯消元法,实际上就是我们俗称的加减消元法。数学上,高斯-约当消去法,由高斯和约当得名,它是线性代数中的一个算法,用于求解线性方程组的解,求解矩阵的秩,以及求解可逆方矩阵的逆。当用于一个矩阵时,高斯消去产生“行消去梯形形式”。(3) Gaussian-Jordan elimination method: also known as Gaussian elimination method and Gaussian elimination method, it is actually what we commonly call addition and subtraction method. Mathematically, the Gauss-Jordan elimination method, named after Gauss and Jordan, is an algorithm in linear algebra for solving the solution of a system of linear equations, solving the rank of a matrix, and solving the inverse of a reversible square matrix. When applied to a matrix, Gaussian elimination produces "row-eliminated trapezoidal form".

在消元的过程中,可能出现主元为零的情况,这时消元法无法进行;即使主元不为零但很小时,若用其作除数,会导致其它元素的数量级严重增长和舍入误差的扩散,最后也使得计算解不可靠。为使高斯-约当消去法具有较好的数值稳定性,可以使用完全主元素消去法,但完全主元素消去法在选主元时要花费较多机器时间,为此又考虑使用列主元消去法,但由于仍须选主元,计算量也较大。对于良态问题,高斯-约当消去法也可能给出很坏的结果,这说明高斯-约当消去法的算法很不稳定。事实上,一般的矩阵都是病态矩阵,采用高斯-约当消去法不能得到满意的结果。In the process of eliminating elements, it may happen that the pivot is zero, and the elimination method cannot be carried out at this time; even if the pivot is not zero but very small, if it is used as a divisor, the order of magnitude of other elements will increase seriously and be discarded. The diffusion of the input error also makes the calculation solution unreliable at last. In order to make the Gauss-Jordan elimination method have better numerical stability, the complete principal element elimination method can be used, but the complete principal element elimination method takes more machine time when selecting the pivot, so consider using the column pivot Elimination method, but because the pivot still needs to be selected, the amount of calculation is also relatively large. For well-conditioned problems, the Gauss-Jordan elimination method may also give very bad results, which shows that the algorithm of the Gauss-Jordan elimination method is very unstable. In fact, general matrices are ill-conditioned matrices, and the Gauss-Jordan elimination method cannot obtain satisfactory results.

(4)三角分解法(4) Triangular decomposition method

三角分解法是将方阵分解成一个上三角矩阵和一个下三角矩阵,该方法又被称为LU分解法。该分解方法的用途主要在简化大矩阵的行列式值的计算,矩阵求逆运算和求解联立方程组。需要注意的是,这种分解方法所得到的上下三角形矩阵不是唯一的,还可找到若干对不同的上下三角矩阵对,它们的乘积也会得到原矩阵。矩阵可以进行LU分解是有条件的,它要求其为方阵且它的所有顺序主子式均不等于零。LU分解的主要优点在于它能用公式表示费时的消去步骤,只对系数矩阵进行操作。LU分解的一个动机是它为矩阵求逆提供了一个有效的方式,也为评估方程组的状态提供了一个方法。The triangular decomposition method is to decompose the square matrix into an upper triangular matrix and a lower triangular matrix, which is also called LU decomposition method. The purpose of this decomposition method is mainly to simplify the calculation of the determinant value of a large matrix, matrix inversion operation and solution of simultaneous equations. It should be noted that the upper and lower triangular matrix obtained by this decomposition method is not unique, and several pairs of different upper and lower triangular matrix pairs can be found, and their product will also obtain the original matrix. It is conditional that a matrix can be decomposed by LU, which requires it to be a square matrix and all its sequential principal subforms are not equal to zero. The main advantage of LU decomposition is that it can formulate the time-consuming elimination step, operating only on the coefficient matrix. One motivation for LU factorization is that it provides an efficient way to invert matrices and also provides a way to evaluate the state of a system of equations.

在上述的矩阵求逆方法中,伴随矩阵法及初等变换法,易于理解,适合手动求解,而对于大规模矩阵的求解,这两种方法不太适合;高斯-约当消去法及LU分解法便于程序的实现,在大多数条件下,能够适应大规模矩阵的求解。但是,对于大规模或超大规模矩阵,高斯-约当消去法及传统的串行LU分解法的计算量往往过大,对计算机的性能要求较高,算法的实时性不高,对于某些时间有严格要求的应用不太适合。Among the above matrix inversion methods, the adjoint matrix method and the elementary transformation method are easy to understand and suitable for manual solution, but for the solution of large-scale matrices, these two methods are not suitable; Gauss-Jordan elimination method and LU decomposition method It is convenient for the implementation of the program, and under most conditions, it can adapt to the solution of large-scale matrices. However, for large-scale or ultra-large-scale matrices, the Gauss-Jordan elimination method and the traditional serial LU decomposition method often require too much calculation, which requires high performance of the computer, and the real-time performance of the algorithm is not high. Applications with strict requirements are not suitable.

发明内容 Contents of the invention

本发明的目的就是为了克服上述现有技术存在的缺陷而提供一种能够大幅度提高矩阵LU分解的运行效率、且能跨平台、可扩展、应用广泛的基于线程构造块的超大规模矩阵多核并行克劳特分解方法。The purpose of the present invention is to provide a super-large-scale matrix multi-core parallelism based on thread building blocks that can greatly improve the operating efficiency of matrix LU decomposition in order to overcome the defects in the above-mentioned prior art, and can be cross-platform, scalable, and widely used. Kraut decomposition method.

本发明的目的可以通过以下技术方案来实现:基于线程构造块的超大规模矩阵多核并行克劳特分解方法,其特征在于,该分解方法包括以下步骤:1)TBB并行计算平台的安装与环境设置;2)分析并抽取传统Crout方法中可以并行计算的部分,利用TBB并行模板类改写为符合TBB需要的规范类;3)设置问题初始值;4)进入Crout方法的行循环;5)调用parallel_reduce并行模块类,将每一行的最大主元保存在临时向量中,同时计算并保存比例因子;6)进入Crout方法的列循环;7)确定新的主元及比例因子,修改TINY,每一行除以主元,判断列循环是否结束,如果判断为否则转到步骤6),判断行循环是否结束,如果判断为否则转到步骤4);8)运行结束,完成超大规模矩阵分解,得到因部分主元法而改变了的行列排列次序。The object of the present invention can be realized through the following technical solutions: the ultra-large-scale matrix multi-core parallel kraut decomposition method based on thread building block, it is characterized in that, this decomposition method comprises the following steps: 1) installation and environment setting of TBB parallel computing platform ;2) Analyze and extract the part that can be calculated in parallel in the traditional Crout method, and use the TBB parallel template class to rewrite it as a specification class that meets the needs of TBB; 3) Set the initial value of the problem; 4) Enter the row loop of the Crout method; 5) Call parallel_reduce Parallel module class, save the maximum pivot of each row in a temporary vector, and calculate and save the scale factor at the same time; 6) Enter the column loop of the Crout method; 7) Determine the new pivot and scale factor, modify TINY, divide each row Use the pivot to judge whether the column cycle is over, if it is judged otherwise, go to step 6), judge whether the row cycle is over, if it is judged otherwise, go to step 4); The order of rows and columns changed by the pivot method.

所述的步骤3)中的问题初始值包括矩阵的规模、矩阵的行数、列数,输入矩阵。The initial value of the problem in the step 3) includes the scale of the matrix, the number of rows and columns of the matrix, and the input matrix.

与现有技术相比,本发明具有效率高、跨平台、可扩展、应用广泛等优点,本发明的并行Crout方法能够大幅度提高矩阵LU分解的运行效率;在此基础上,通过本发明能够大幅度提高矩阵求逆的效率,进而验证了本发明的有效性与高效性。并行Crout方法的适用条件为:如果矩阵的所有子式都是非零的,即可使用本发明来进行LU分解,进而可以求得该矩阵的逆或线性方程组的解等。当矩阵规模达到百万级别的时候,本发明比串行Crout方法约提高了13.7%,并比高斯-约当消去法约提高了77%。故本发明具有非常大的实用价值和广泛的应用前景。Compared with the prior art, the present invention has the advantages of high efficiency, cross-platform, scalability, and wide application, and the parallel Crout method of the present invention can greatly improve the operating efficiency of matrix LU decomposition; on this basis, the present invention can The efficiency of matrix inversion is greatly improved, and the effectiveness and high efficiency of the present invention are further verified. The applicable conditions of the parallel Crout method are: if all the subforms of the matrix are non-zero, the invention can be used for LU decomposition, and then the inverse of the matrix or the solution of the linear equation system can be obtained. When the scale of the matrix reaches one million levels, the invention improves about 13.7% than the serial Crout method and about 77% than the Gauss-Jordan elimination method. Therefore, the present invention has very large practical value and wide application prospect.

附图说明 Description of drawings

图1为本发明的满足TBB的求最大主元类结构图;Fig. 1 satisfies the TBB of the present invention and seeks the maximum pivot class structural diagram;

图2为本发明的并行Crout方法流程图;Fig. 2 is the parallel Crout method flowchart of the present invention;

图3为利用本发明的并行Crout方法求矩阵逆阵的流程图;Fig. 3 is to utilize parallel Crout method of the present invention to ask the flow chart of matrix inverse;

图4为利用本发明求解矩阵逆阵的结果示意图。Fig. 4 is a schematic diagram of the result of solving the matrix inverse by using the present invention.

具体实施方式 Detailed ways

下面结合附图和具体实施例对本发明进行详细说明。The present invention will be described in detail below in conjunction with the accompanying drawings and specific embodiments.

实施例Example

TBB定义了任务的概念,在初始化TBB任务调度时,由任务调度器对象task_scheduler_init实现多任务的分配和并行计算,支持多线程的划分。在调用并行计算的模板类时,由模板类参数指定循环处理的数值范围以及任务粒度参数。任务粒度参数决定了任务划分的粒度,如果粒度太大,不能充分提高运行效率;如果粒度太小,过度的并行化任务分配造成的开销反而降低了运行效率。在无法获得合适任务粒度情况下可以使用TBB提供的自动分配函数auto_partitioner()帮助用户设置合适的任务粒度参数。TBB defines the concept of tasks. When initializing TBB task scheduling, the task scheduler object task_scheduler_init implements multi-task allocation and parallel computing, and supports multi-thread division. When calling the template class for parallel computing, the value range of the loop processing and the task granularity parameters are specified by the template class parameters. The task granularity parameter determines the granularity of task division. If the granularity is too large, the operating efficiency cannot be fully improved; if the granularity is too small, the overhead caused by excessive parallel task allocation will reduce the operating efficiency. If the appropriate task granularity cannot be obtained, the automatic allocation function auto_partitioner() provided by TBB can be used to help users set appropriate task granularity parameters.

以parallel_reduce设计的求解最大主元素模板类为例,详细说明TBB设计并行最大主元素算法的实现过程,以供Crout方法调用。首先,基于parallel_reduce模板将此过程封装成类MaxElement,其作用是求解最大主元素。类MaxElement必须包括的接口为operator和构造函数,具体形式如下:Taking the maximum principal element template class designed by parallel_reduce as an example, the implementation process of the parallel maximum principal element algorithm designed by TBB is described in detail for the Crout method to call. First, based on the parallel_reduce template, this process is encapsulated into a class MaxElement, whose function is to solve the largest principal element. The interfaces that must be included in class MaxElement are operator and constructor, the specific form is as follows:

class MaxElement{class MaxElement{

……...

void operator()(const blocked_range<size_t>&r){void operator()(const blocked_range<size_t>&r){

   void operator()(const blocked_range<size_t>&r){void operator()(const blocked_range<size_t>&r){

      for(size_t i=r.begin();i!=r.end();++i){for(size_t i=r.begin(); i!=r.end();++i){

……...

   }}}}}}

//分出一个支线,如果要访问x,应该保证原子操作//Create a branch, if you want to access x, you should guarantee atomic operation

   MaxElement(MaxElement&x,tbb::split):MaxElement(MaxElement&x, tbb::split):

   my_a(x.my_a),value_of_max(FLT_MIN),index_of_max(-1){……}my_a(x.my_a), value_of_max(FLT_MIN), index_of_max(-1){...}

//合并支线//merge branches

   void join(const MaxElement&y){void join(const MaxElement&y){

……...

}}}}

//构造函数//Constructor

   MaxElement(DOUBLE*a):MaxElement(DOUBLE*a):

   my_a(a),value_of_max(FLT_MIN),index_of_max(-1){}my_a(a), value_of_max(FLT_MIN), index_of_max(-1){}

};};

其中operator接口是并行处理的主要部分,主要功能是进行并行循环优化,将循环的参数修改成TBB定义的blocked_range模板类,能够支持循环体内任务的并行划分。在operator接口中,并行地执行各小块中的比较大小操作,并返回小块中的最大值。当TBB决定分出一个blocked_range时,会调用MaxElement(MaxElement&x,tbb::split),这个tbb::split只是一个点位符,用于和拷贝构造函数相区别。当TBB合并blocked_range时,会调用voidjoin(MaxElement&y),本例中合并就意味着把多个结果进行运算,将最终的结果保存并返回。构造函数主要实现参数的初始化以及从整个任务空间中分离并构建子任务。在完成并行模板类的编写之后,初始化TBB任务调度器,调用上述编写的并行模板,返回计算结果,最后结束TBB任务调度。Among them, the operator interface is the main part of parallel processing. Its main function is to optimize the parallel loop, modify the parameters of the loop to the blocked_range template class defined by TBB, and support the parallel division of tasks in the loop body. In the operator interface, the comparison size operation in each small block is performed in parallel, and the maximum value in the small block is returned. When TBB decides to split a blocked_range, it will call MaxElement(MaxElement&x, tbb::split). This tbb::split is just a point symbol, which is used to distinguish it from the copy constructor. When TBB merges blocked_range, it will call voidjoin(MaxElement&y). In this example, merging means operating multiple results, saving and returning the final result. The constructor mainly implements parameter initialization and separates and constructs subtasks from the entire task space. After completing the writing of the parallel template class, initialize the TBB task scheduler, call the parallel template written above, return the calculation result, and finally end the TBB task scheduling.

利用TBB设计并行Crout方法的基本步骤如下:The basic steps of using TBB to design a parallel Crout method are as follows:

步骤一:TBB并行计算平台的安装与环境设置;Step 1: Installation and environment setting of TBB parallel computing platform;

步骤二:分析并抽取传统Crout方法中可以并行计算的部分,利用TBB并行模板类改写为符合TBB需要的规范类;Step 2: Analyze and extract the part that can be calculated in parallel in the traditional Crout method, and use the TBB parallel template class to rewrite it into a standard class that meets the needs of TBB;

步骤三:设置问题初始值,即矩阵的规模、矩阵的行数、列数,输入矩阵等;Step 3: Set the initial value of the problem, that is, the size of the matrix, the number of rows and columns of the matrix, the input matrix, etc.;

步骤四:进入Crout方法的行循环;Step 4: Enter the line loop of the Crout method;

步骤五:调用parallel_reduce并行模块类,将每一行的最大主元保存在临时向量中,同时计算并保存比例因子,因此采用TBB进行并行设计对提高方法效率至关重要;Step 5: Call the parallel_reduce parallel module class, save the maximum pivot of each row in a temporary vector, and calculate and save the scale factor at the same time, so using TBB for parallel design is very important to improve the efficiency of the method;

步骤六:进入Crout方法的列循环;Step 6: Enter the column loop of the Crout method;

步骤七:确定新的主元及比例因子,修改TINY,每一行除以主元。若列循环未结束则转到步骤六,判断行循环是否结束,如果判断为否则转到步骤4);否则转到步骤四;Step 7: Determine the new pivot and scale factor, modify TINY, and divide each row by the pivot. Go to step 6 if the column cycle is not over, judge whether the row cycle is over, if judged otherwise go to step 4); otherwise go to step 4;

步骤八:结束,得到因部分主元法而改变了的行列排列次序。Step 8: At the end, get the order of ranks and columns changed by the partial pivot method.

为了验证并行Crout方法的正确性、有效性及评价本发明的实用性,下面通过本发明求解矩阵的逆来验证与评价,其基本步骤如下:In order to verify the correctness, effectiveness and evaluation of the present invention's practicability of the parallel Crout method, the inverse of the solution matrix of the present invention is verified and evaluated below, and its basic steps are as follows:

步骤一:TBB并行计算平台的安装与环境设置;Step 1: Installation and environment setting of TBB parallel computing platform;

步骤二:读入文件,其中包括输入矩阵及相关参数设置;Step 2: Read in the file, including the input matrix and related parameter settings;

步骤三:设置问题初始值,即矩阵的行数、列数,输入矩阵;Step 3: Set the initial value of the problem, that is, the number of rows and columns of the matrix, and input the matrix;

步骤四:利用本发明的并行Crout方法对输入矩阵进行LU分解;Step 4: Utilize the parallel Crout method of the present invention to carry out LU decomposition to the input matrix;

步骤五:进入矩阵的行循环;Step 5: Enter the row loop of the matrix;

步骤六:进入矩阵的列循环;Step 6: Enter the column loop of the matrix;

步骤七:通过回代函数将矩阵按列求逆;Step 7: Invert the matrix column by column through the back substitution function;

步骤八:增加列计数,若列循环未结束则转到步骤六;增加行计数,若行循环未结束则转到步骤五;Step 8: Increase the column count, if the column cycle is not over, go to step 6; increase the row count, if the row cycle is not over, go to step 5;

步骤九:结束,得到输入矩阵的逆矩阵,写入到文件。Step 9: End, get the inverse matrix of the input matrix and write it to the file.

如图1、2所示,本发明对传统Crout分解方法进行分析,把其中的关键步骤进行并行化处理,并利用Intel公司的线程构建块来实现并行化。本发明充分利用了线程构建块的灵活性、跨平台性及稳定性,实现基于线程构建块的并行Crout分解方法,并将该方法应用于矩阵求逆,经过大量仿真实验显示,本发明能够大幅度提高传统Crout分解方法的效率,进而大幅度提高使用了本发明的矩阵求逆方法的效率。As shown in Figures 1 and 2, the present invention analyzes the traditional Crout decomposition method, parallelizes the key steps, and utilizes the thread building blocks of Intel Corporation to realize parallelization. The present invention makes full use of the flexibility, cross-platform and stability of the thread building block, realizes the parallel Crout decomposition method based on the thread building block, and applies the method to matrix inversion. A large number of simulation experiments show that the present invention can greatly The efficiency of the traditional Crout decomposition method is greatly improved, and then the efficiency of the matrix inversion method using the present invention is greatly improved.

本发明由一个符合线程构建块规范的求最大主元类、最大主元调用类和利用了最大主元调用类的Crout分解方法组成,符合线程构建块规范的求最大主元类提供了具体的并行处理的部分、任务的线程分支划分及合并等功能;最大主元类调用类提供了对求最大主元类的调用封装,且提供了对其进行线程划分的粒度控制参数;并行化的Crout分解方法利用了线程并行化的优点,充分利用了计算机的资源,提高了传统Crout分解方法的效率。本发明可以运行于多个系统平台,如Windows,Linux,Unix等系统。The present invention is composed of a maximum pivot class, a maximum pivot call class and a Crout decomposition method using the maximum pivot call class that conform to the thread building block specification, and the maximum pivot class that meets the thread construction block specification provides specific Part of parallel processing, task thread branch division and merging and other functions; the maximum main element class call class provides the call encapsulation of the largest main element class, and provides the granularity control parameters for thread division; parallelized Crout The decomposition method takes advantage of thread parallelization, makes full use of computer resources, and improves the efficiency of the traditional Crout decomposition method. The present invention can run on multiple system platforms, such as Windows, Linux, Unix and other systems.

图1为本发明求最大主元类的结构图,以下对图中的各部分的详细描述:Fig. 1 seeks the structural diagram of maximum pivot class for the present invention, the detailed description of each part in the figure below:

在部分101中,实现operator接口,其是并行处理的主要部分,主要功能是进行并行循环优化,将循环的参数修改成TBB定义的blocked_range模板类,能够支持循环体内任务的并行划分;In part 101, the operator interface is implemented, which is the main part of parallel processing, and its main function is to perform parallel loop optimization, and modify the parameters of the loop to the blocked_range template class defined by TBB, which can support the parallel division of tasks in the loop body;

在部分102中,实现带tbb::split的分支创建函数接口,当TBB决定分出一个blocked_range时,会调用MaxElement(MaxElement&x,tbb::split),这个tbb::split只是一个点位符,用于和拷贝构造函数相区别;In part 102, the branch creation function interface with tbb::split is implemented. When TBB decides to split a blocked_range, it will call MaxElement(MaxElement&x, tbb::split). This tbb::split is just a point symbol. Use It is different from the copy constructor;

在部分103中,实现join合并函数接口,当TBB合并blocked_range时,会调用void_join(MaxElement&y),本例中合并就意味着把多个结果进行运算,将最终的结果保存并返回;In part 103, implement the join merge function interface. When TBB merges blocked_range, it will call void_join(MaxElement&y). In this example, merge means to operate multiple results, save and return the final result;

在部分104中,实现构造函数接口,构造函数主要实现参数的初始化以及从整个任务空间中分离并构建子任务。In part 104, the constructor interface is implemented, and the constructor mainly implements parameter initialization and separates and constructs subtasks from the entire task space.

图2为本发明的并行Crout方法流程图,下面对图中的各步骤进行详细描述:Fig. 2 is the parallel Crout method flowchart of the present invention, each step in the figure is described in detail below:

在步骤201中,读入需要分解的矩阵,初始化行值i,列值j,内含比例因子向量vv[i],极小数TINY,临时变量等。然后执行步骤202;In step 201, read in the matrix to be decomposed, initialize the row value i, the column value j, the vector vv[i] containing the scale factor, the very small number TINY, temporary variables and so on. Then execute step 202;

在步骤202中,按行循环,递增行,求内含比例因子。然后执行步骤203;In step 202, loop by row, increment row, and find the contained scale factor. Then execute step 203;

在步骤203中,调用并行的求最大主元类算法:求解各行的主元素和比例因子。然后执行步骤204;In step 203, a parallel algorithm for finding the maximum principal element is invoked: the principal element and the scale factor of each row are calculated. Then execute step 204;

在步骤204中,按列循环,递增列。然后执行步骤205;In step 204, loop by column, incrementing column. Then execute step 205;

在步骤205中,计算新的候选主元素及交换比例因子,替换原有的候选主元素及交换比例因子。然后执行步骤206;In step 205, new candidate principal elements and exchange scale factors are calculated to replace the original candidate principal elements and exchange scale factors. Then execute step 206;

在步骤206中,判断矩阵是否为奇异矩阵,如果是则赋TINY为零,否则TINY保持不变。然后执行步骤207;In step 206, it is judged whether the matrix is a singular matrix, if yes, TINY is assigned zero, otherwise TINY remains unchanged. Then execute step 207;

在步骤207中,矩阵中的每一行除以主元素。如果按列循环未结束,则执行步骤204;如果按列循环结束但按行循环未结束,则执行步骤202;如果按列循环与按行循环都已结束,则执行步骤208;In step 207, each row in the matrix is divided by the principal element. If the circulation by column is not finished, then perform step 204; if the circulation by column is finished but the circulation by row is not finished, then perform step 202; if the circulation by column and the circulation by row have all ended, then perform step 208;

在步骤208中,结束函数:输出因部分主元法而改变了的行排列次序,可以根据此排列次序得到矩阵的LU分解矩阵。In step 208, the end function: output the arrangement order of the rows changed by the partial pivot method, and the LU decomposition matrix of the matrix can be obtained according to the arrangement order.

图3为利用并行Crout方法求矩阵的逆阵的流程图,下面对图中的各步骤进行详细描述:Fig. 3 is the flow chart of using the parallel Crout method to find the inverse matrix of the matrix, and each step in the figure is described in detail below:

在步骤301中,通过对话框选择需要的输入文件,选择的矩阵文件为TXT格式,数据结构格式为:第一行为矩阵的行数(row);第二行为矩阵的列数(col),在本应用中行数等于列数(row=col=n),即这里的矩阵为n维方阵;第三行开始为矩阵的值。然后执行步骤302;In step 301, the required input file is selected through the dialog box, the matrix file selected is in TXT format, and the data structure format is: the row number (row) of the first row matrix; the column number (col) of the second row matrix, in In this application, the number of rows is equal to the number of columns (row=col=n), that is, the matrix here is an n-dimensional square matrix; the third row starts with the value of the matrix. Then execute step 302;

在步骤302中,根据读入的矩阵维数读取需要的数值并初始化相关变量及数据结构。然后执行步骤303;In step 302, the required values are read according to the dimensions of the read-in matrix, and related variables and data structures are initialized. Then execute step 303;

在步骤303中,利用本发明并行Crout方法对读入的矩阵进行分解。然后执行步骤304;In step 303, the read-in matrix is decomposed by using the parallel Crout method of the present invention. Then execute step 304;

在步骤304中,控制矩阵的行循环i,然后执行步骤305;In step 304, the row loop i of the control matrix is executed in step 305;

在步骤305中,控制矩阵的列循环j,然后执行步骤306;In step 305, the column cycle j of the control matrix, and then perform step 306;

在步骤306中,通过回代函数对计算的矩阵进行处理。若i<n且j<n,则返回步骤305;若i<n且j>=n,则返回步骤304;若i>=n且j>=n,执行步骤307;In step 306, the calculated matrix is processed through a back substitution function. If i<n and j<n, then return to step 305; if i<n and j>=n, then return to step 304; if i>=n and j>=n, execute step 307;

在步骤307中,结束矩阵求逆计算,输出矩阵的逆及相关的计算参数:方法的名称、运算时间、方法线程数,方法子矩阵大小等。执行步骤308;In step 307, the matrix inversion calculation is ended, and the inverse of the matrix and related calculation parameters are output: the name of the method, the operation time, the number of method threads, the size of the method sub-matrix, and the like. Execute step 308;

在步骤308中,将结果写入到TXT文件中,供使用者进一步分析。In step 308, the result is written into a TXT file for further analysis by the user.

图4中详细比较了并行crout方法、串行crout方法、高斯-约当消去法在求解矩阵时的效率。由参考线可以看出,当矩阵维数大于等于256时,并行crout方法的运行时间优于串行crout方法,更明显优于高斯-约当消去法的运行时间;并且随着矩阵维数的增加,这种趋势越明显。当矩阵规模达到百万级别的时候,本发明比串行Crout方法约提高了13.7%,并比高斯-约当消去法约提高了77%。即对于大规模或超大规模矩阵而言,并行crout方法明显优于高斯-约当消去法,也优于串行crout方法。因此,本发明具有很高的实用价值与广泛的应用前景。Figure 4 compares the efficiency of the parallel crout method, serial crout method, and Gauss-Jordan elimination method in solving the matrix in detail. It can be seen from the reference line that when the matrix dimension is greater than or equal to 256, the running time of the parallel crout method is better than that of the serial crout method, and is significantly better than that of the Gauss-Jordan elimination method; and as the matrix dimension increases increases, this trend becomes more pronounced. When the scale of the matrix reaches one million levels, the invention improves about 13.7% than the serial Crout method and about 77% than the Gauss-Jordan elimination method. That is, for large-scale or very large-scale matrices, the parallel crout method is obviously better than the Gauss-Jordan elimination method, and it is also better than the serial crout method. Therefore, the present invention has high practical value and wide application prospect.

Claims (2)

1. based on the ultra-large matrix multi-core parallel concurrent Ke Laote decomposition method of thread constructing piece, it is characterized in that this decomposition method may further comprise the steps:
1) installation and the environment setting of thread constructing piece (TBB) parallel computing platform;
2) analyze and extract in traditional Ke Laote (Crout) method can parallel computation part, utilize TBB parallel templates class to be rewritten as the standard class that meets the TBB needs;
3) the problem initial value is set;
4) getting into going of Crout method circulates;
5) call the parallel module class of parallel_reduce, each maximum pivot of going is kept in the interim vector, calculate and preserve scale factor simultaneously;
6) the row circulation of entering Crout method;
7) confirm new pivot and scale factor, revise TINY, each row judges divided by pivot whether the row circulation finishes, if be judged as otherwise forward step 6) to, judges whether the row circulation finishes, if be judged as otherwise forward step 4) to;
8) end of run is accomplished ultra-large matrix decomposition, obtains because of the altered ranks ordering of partial pivot method.
2. the ultra-large matrix multi-core parallel concurrent Ke Laote decomposition method based on the thread constructing piece according to claim 1 is characterized in that the problem initial value in the described step 3) comprises scale, the line number of matrix, the columns of matrix, input matrix.
CN201010571856.6A 2010-12-03 2010-12-03 Multinuclear parallel crout decomposition method for ultra-large scale matrix based on TBB (Treading Building Block) Expired - Fee Related CN102486727B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010571856.6A CN102486727B (en) 2010-12-03 2010-12-03 Multinuclear parallel crout decomposition method for ultra-large scale matrix based on TBB (Treading Building Block)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010571856.6A CN102486727B (en) 2010-12-03 2010-12-03 Multinuclear parallel crout decomposition method for ultra-large scale matrix based on TBB (Treading Building Block)

Publications (2)

Publication Number Publication Date
CN102486727A true CN102486727A (en) 2012-06-06
CN102486727B CN102486727B (en) 2014-10-22

Family

ID=46152226

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010571856.6A Expired - Fee Related CN102486727B (en) 2010-12-03 2010-12-03 Multinuclear parallel crout decomposition method for ultra-large scale matrix based on TBB (Treading Building Block)

Country Status (1)

Country Link
CN (1) CN102486727B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102867090A (en) * 2012-09-13 2013-01-09 冶金自动化研究设计院 Parallel genetic algorithm steam pipe system model auto-calibration system based on TBB (threading building block)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030182518A1 (en) * 2002-03-22 2003-09-25 Fujitsu Limited Parallel processing method for inverse matrix for shared memory type scalar parallel computer
CN101604306A (en) * 2009-06-03 2009-12-16 中国人民解放军国防科学技术大学 Method of column pivoting LU decomposition based on FPGA
CN101639788A (en) * 2009-09-10 2010-02-03 北京航空航天大学 Multi-core parallel method for continuous system simulation based on TBB threading building blocks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030182518A1 (en) * 2002-03-22 2003-09-25 Fujitsu Limited Parallel processing method for inverse matrix for shared memory type scalar parallel computer
CN101604306A (en) * 2009-06-03 2009-12-16 中国人民解放军国防科学技术大学 Method of column pivoting LU decomposition based on FPGA
CN101639788A (en) * 2009-09-10 2010-02-03 北京航空航天大学 Multi-core parallel method for continuous system simulation based on TBB threading building blocks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
孙济洲 等: "改进的并行高斯全主元消去法", 《天津大学学报》 *
牛新 等: "可选主元LU分解流水线算法设计与FPGA实现", 《高技术通讯》 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102867090A (en) * 2012-09-13 2013-01-09 冶金自动化研究设计院 Parallel genetic algorithm steam pipe system model auto-calibration system based on TBB (threading building block)

Also Published As

Publication number Publication date
CN102486727B (en) 2014-10-22

Similar Documents

Publication Publication Date Title
CN104375805A (en) Method for simulating parallel computation process of reconfigurable processor through multi-core processor
US10803087B2 (en) Language interoperable runtime adaptable data collections
Huang et al. Cpp-taskflow: A general-purpose parallel task programming system at scale
CN112783503B (en) Arm architecture-based NumPy operation acceleration optimization method
CN105242929B (en) A kind of design method of binary program automatically parallelizing for multi-core platform
Vinas et al. Improving OpenCL programmability with the heterogeneous programming library
CN106445666B (en) A Parallel Optimization Method for DOACROSS Loop
Huang et al. Concurrent CPU-GPU Task Programming using Modern C++
Jiang et al. Optimizing scientific workflows in the cloud: A montage example
CN116089050B (en) Heterogeneous adaptive task scheduling method
Brown Porting incompressible flow matrix assembly to FPGAs for accelerating HPC engineering simulations
CN110673877B (en) A Parallel Computing Method Based on Manual Vectorization
Michailidis et al. Scientific computations on multi-core systems using different programming frameworks
CN102486727B (en) Multinuclear parallel crout decomposition method for ultra-large scale matrix based on TBB (Treading Building Block)
Luszczek et al. Search space generation and pruning system for autotuners
Wu et al. Task Mapping and Scheduling on RISC-V MIMD Processor With Vector Accelerator Using Model-Based Parallelization
Tarakji et al. The development of a scheduling system GPUSched for graphics processing units
Gorlatch et al. USING THE SPIN MODEL CHECKER FOR AUTO-TUNING HIGH-PERFORMANCE PROGRAMS
Dios et al. High-level template for the task-based parallel wavefront pattern
Tao et al. Compiler-directed scratchpad memory data transfer optimization for multithreaded applications on a heterogeneous many-core architecture.
George et al. An empirical evaluation of design abstraction and performance of thrust framework
CN1316359C (en) User guided program semi-automatic parallelizing method
CN105302577B (en) Drive the machine code generation method and device of execution unit
Hardasmal et al. A parallelisation tale of two languages
Basicevic et al. An approach to parallelization of legacy software

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20141022

Termination date: 20171203

CF01 Termination of patent right due to non-payment of annual fee