US20140298351A1 - Parallel operation method and information processing apparatus - Google Patents
Parallel operation method and information processing apparatus
- Publication number: US20140298351A1
- Authority: United States
- Prior art keywords: matrix, submatrix, thread, threads, vector
- Prior art date
- Legal status: Granted
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
Definitions
- the embodiments discussed herein relate to a parallel operation method and an information processing apparatus.
- Large-scale numerical calculations such as scientific calculations may be carried out using supercomputers or other high performance computers. Such large-scale numerical calculations often involve high-dimensional matrix operations. For example, in the flow analysis field, structural analysis field, and other fields, large-scale simultaneous equations may be solved using a coefficient matrix that represents the coefficients of the simultaneous equations. Another example is that, in the circuit analysis field, vibration analysis field, and other fields, the eigenvalues of a large-scale matrix may be computed. In a matrix operation using a computer, an approximate solution may be computed by iterations of matrix-vector multiplication. For example, an approximate solution of a differential equation, which is analytically difficult to solve, may be computed by iterations of multiplying a coefficient matrix by a vector with the finite element method.
- a matrix to be used in a large-scale numerical calculation may be a sparse matrix with a high percentage of zero-valued elements (zero elements) and a small percentage of non-zero-valued elements (non-zero elements).
- Representing a sparse matrix in a form that stores the zero elements along with the non-zero elements produces a large amount of data and is therefore inefficient.
- For this reason, compressed storage formats are used for representing a sparse matrix as compressed data without zero elements.
- Such compressed storage formats include the Compressed Column Storage (CCS) format and the Compressed Row Storage (CRS) format.
- In the CCS format, the elements included in an N × M matrix are searched in column-major order (i.e., in the order of row 1 and column 1, row 2 and column 1, . . . , row N and column 1, row 1 and column 2, row 2 and column 2, . . . , row 1 and column M, . . . , and row N and column M), and only the non-zero elements are extracted from the matrix. Then, a first array that holds the values of the non-zero elements in the above order, a second array that stores the row numbers of the non-zero elements, and a third array that indicates the locations of the elements that start new columns in the first array are generated.
- In the CRS format, the elements included in an N × M matrix are searched in row-major order (i.e., in the order of row 1 and column 1, row 1 and column 2, . . . , row 1 and column M, row 2 and column 1, row 2 and column 2, . . . , row N and column 1, . . . , and row N and column M), and only the non-zero elements are extracted from the matrix. Then, a first array that holds the values of the non-zero elements, a second array that stores the column numbers of the non-zero elements, and a third array that indicates the locations of the elements that start new rows in the first array are generated.
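The two formats can be sketched as follows in Python. This is an illustrative sketch only; the function names and the 1-based array convention follow the description above and are not part of the patent.

```python
def to_ccs(a):
    """Compressed Column Storage: scan a dense N x M matrix (list of
    rows) in column-major order and emit the three arrays described
    above, using 1-based row numbers and pointers."""
    n, m = len(a), len(a[0])
    val, row, cp = [], [], [1]
    for j in range(m):
        for i in range(n):
            if a[i][j] != 0:
                val.append(a[i][j])      # first array: non-zero values
                row.append(i + 1)        # second array: their row numbers
        cp.append(len(val) + 1)          # third array: start of each column
    return val, row, cp

def to_crs(a):
    """Compressed Row Storage: the same idea in row-major order,
    emitting values, column numbers, and row pointers."""
    n, m = len(a), len(a[0])
    val, col, rp = [], [], [1]
    for i in range(n):
        for j in range(m):
            if a[i][j] != 0:
                val.append(a[i][j])
                col.append(j + 1)
        rp.append(len(val) + 1)
    return val, col, rp
```

Note that for a matrix whose non-zero pattern is symmetric, the CCS arrays of one triangle can be reread as the CRS arrays of the mirrored triangle, a property the description exploits later.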
- a matrix may be divided so as to be assigned to a plurality of threads which are then executed in parallel by a plurality of processors.
- This technique achieves high-speed operation.
- a plurality of threads may perform operations for the same element of a final result, depending on the way of dividing the matrix. Therefore, it is considered that a storage area for storing an intermediate result may be allocated to each thread.
- For example, there is a parallel processing method for multiplying a sparse matrix represented in the compressed column storage format by a column vector with a plurality of processors. In this method, the columns of the matrix are equally divided and assigned to a plurality of threads, and a storage area having the same size as a column vector that is the final result is allocated to each thread. Then, the column vectors that are the intermediate results obtained by the respective threads are added to thereby obtain the final result.
- Another method is that one shared storage area is prepared for a plurality of threads, and the threads sequentially add values to the elements of a resulting vector. However, if a shared storage area is used simply, some threads may possibly add values to the same element (for example, in the same row of the resulting column vector) at the same time, which is an access conflict. If exclusive control is exercised between the threads so as not to cause access conflict, the overhead for memory access may increase and the efficiency of the parallel processing may decrease.
- a non-transitory computer-readable storage medium storing therein a memory allocating program.
- the memory allocating program causes a computer capable of executing a plurality of threads in parallel to perform a process that includes: assigning calculation of a first submatrix included in a matrix to a first thread and calculation of a second submatrix included in the matrix to a second thread, the matrix including zero elements and non-zero elements; comparing a distribution of non-zero elements in rows or columns of the first submatrix with a distribution of non-zero elements in rows or columns of the second submatrix; and determining allocation of storage areas for storing vectors to be respectively used in calculations by the first and second threads, according to a result of the comparing.
- FIG. 1 illustrates an information processing apparatus according to a first embodiment
- FIG. 2 is a block diagram illustrating an example of the hardware configuration of an information processing apparatus
- FIG. 3 illustrates an example of multiplying a sparse matrix by a vector
- FIG. 4 illustrates an example of a compressed storage format for a sparse matrix
- FIG. 5 illustrates an example of division and assignment of a sparse matrix to threads
- FIG. 6 illustrates an example of parallel calculation for multiplying a sparse matrix by a vector
- FIG. 7 illustrates an example of a non-zero element map
- FIG. 8 illustrates another example of the non-zero element map
- FIG. 9 illustrates an example of allocation of work vectors
- FIG. 10 is a first diagram illustrating another example of allocation of work vectors
- FIG. 11 is a second diagram illustrating another example of allocation of work vectors
- FIG. 12 is a third diagram illustrating another example of allocation of work vectors
- FIG. 13 is a fourth diagram illustrating another example of allocation of work vectors
- FIG. 14 is a block diagram illustrating an example of the functions of the information processing apparatus
- FIG. 15 is a flowchart illustrating how to control matrix operation
- FIG. 16 is a flowchart illustrating how to conduct non-zero element check
- FIG. 17 is a flowchart illustrating how to allocate work vectors.
- FIG. 18 is a flowchart illustrating how to perform parallel matrix operation.
- FIG. 1 illustrates an information processing apparatus according to a first embodiment.
- An information processing apparatus 10 of the first embodiment executes a plurality of threads in parallel in order to perform a matrix operation (for example, a matrix-vector multiplication).
- the information processing apparatus 10 includes a plurality of processors, including processors 11 to 13 , and a memory 14 .
- the processors 11 to 13 are computing devices that are capable of executing threads physically in parallel to one another.
- Each of the processors 11 to 13 may be a single processor package such as a Central Processing Unit (CPU) package, or a processor core (may simply be referred to as a core) included in a processor package.
- the processors 11 to 13 execute a program stored in the memory 14 .
- the processor 12 executes a thread 21
- the processor 13 executes a thread 22 .
- the processor 11 executes a thread or process for controlling parallelization of the matrix operation.
- the processor 12 or 13 may be designed to control the parallelization.
- the memory 14 is a shared memory that is accessed from the processors 11 to 13 , and is a Random Access Memory (RAM), for example.
- the memory 14 stores one or more vectors to be used in the operations of the threads 21 and 22 . In these vectors, values obtained in the middle of the operations of the threads 21 and 22 are stored. In addition, in the vectors, the values of intermediate results (values before a final result is obtained) at the time of completion of the operations of the threads 21 and 22 are stored, for example. In the case where a matrix operation is to multiply a matrix by a column vector, the above vectors are column vectors. In the case where a matrix operation is to multiply a row vector by a matrix, the above vectors are row vectors.
- allocation of storage areas of the memory 14 for storing vectors to the threads 21 and 22 is determined as described below.
- the information processing apparatus 10 prepares a shared storage area (storage area 26 ) or individual storage areas (storage areas 26 and 27 ) for the threads 21 and 22 .
- the processor 11 analyzes a matrix 23 including zero elements and non-zero elements, and assigns the calculations of the submatrices included in the matrix 23 to the threads 21 and 22 .
- the matrix 23 is, for example, a sparse matrix with a high percentage of zero elements and a small percentage of non-zero elements.
- the matrix 23 may be represented in a compressed storage format such as compressed column storage format or compressed row storage format.
- the processor 11 assigns the calculation of a submatrix 24 to the thread 21 , and the calculation of a submatrix 25 to the thread 22 . For example, assume that the submatrices 24 and 25 do not overlap in any columns but overlap in some rows. By contrast, for example, it may be assumed that the submatrices 24 and 25 do not overlap in any rows but overlap in some columns.
- the processor 11 also analyzes the matrix 23 , and allocates, to the threads 21 and 22 , storage areas for storing vectors to be used in their operations. To this end, the processor 11 compares the distribution of non-zero elements of the submatrix 24 with the distribution of non-zero elements of the submatrix 25 . For example, in the case where the operation result is a column vector, the processor 11 compares the distributions of non-zero elements in the column direction (over a plurality of rows). On the other hand, in the case where the operation result is a row vector, the processor 11 compares the distributions of non-zero elements in the row direction (over a plurality of columns).
- the processor 11 determines allocation of storage areas to the threads 21 and 22 according to a result of comparing the distributions of non-zero elements. For example, in the case where the submatrices 24 and 25 do not have non-zero elements in the same row (or in the same column), the processor 11 determines that it is possible to allocate a shared storage area to the threads 21 and 22 . This is because, if there are no non-zero elements in the same row, the threads 21 and 22 do not access the element in a row of a column vector used in their operations and therefore no access conflict occurs. Likewise, if there are no non-zero elements in the same column, the threads 21 and 22 do not access the element in a column of a row vector used in their operations and therefore no access conflict occurs.
- In the case where the shared storage area is allocated, the thread 21 stores the values (intermediate result) obtained in the middle of or at the end of the calculation of the submatrix 24 in the storage area 26 , and the thread 22 stores the values (intermediate result) obtained in the middle of or at the end of the calculation of the submatrix 25 in the storage area 26 .
- In the case where the individual storage areas are allocated, the thread 21 stores the values obtained by the calculation of the submatrix 24 in the storage area 26 , and the thread 22 stores the values obtained by the calculation of the submatrix 25 in the storage area 27 . Then, the vectors stored in the storage areas 26 and 27 are added to thereby obtain a final result.
- Access conflict between the threads 21 and 22 is a particular problem in the following matrix operations.
- One example is a matrix operation that multiplies the matrix 23 represented in the compressed column storage format by a column vector. In the case where the matrix is represented in the compressed column storage format, dividing the columns of the matrix makes the operation more efficient. Therefore, in this matrix operation, there is a problem of access conflict to the same row between the threads 21 and 22 .
- Another matrix operation is, for example, to multiply a row vector by the matrix 23 represented in the compressed row storage format. In the case where the matrix is represented in the compressed row storage format, dividing the rows of the matrix makes the operation more efficient. Therefore, there is a problem of access conflict to the same column between the threads 21 and 22 .
- In the case where the matrix 23 is a symmetric matrix, the calculation of a submatrix at a symmetrical position to the submatrix 24 is also assigned to the thread 21 , and the calculation of a submatrix at a symmetrical position to the submatrix 25 is also assigned to the thread 22 . In this matrix operation, there is a problem of conflict between the threads 21 and 22 with respect to either one of the lower triangle part and the upper triangle part of the matrix 23 . When the lower triangle part is divided by columns, the problem is access conflict to the same row of the lower triangle submatrix between the threads 21 and 22 . When the upper triangle part is divided by rows, the problem is access conflict to the same column of the upper triangle submatrix between the threads 21 and 22 .
- the information processing apparatus 10 compares the distribution of non-zero elements of the submatrix 24 with the distribution of non-zero elements of the submatrix 25 . Then, the information processing apparatus 10 determines allocation of storage areas to the threads 21 and 22 according to the result of the comparison. This avoids access conflict to the same element in the same storage area (for example, this prevents access conflict), and removes the load of exclusive control between the threads 21 and 22 . In addition, while preventing an increase in the load due to exclusive control (for example, while eliminating the need of exclusive control), it is possible to reduce the size of the storage area to be prepared in the memory 14 . Accordingly, it is possible to efficiently use the storage area of the memory 14 in the matrix operation.
- FIG. 2 is a block diagram illustrating an example of the hardware configuration of an information processing apparatus.
- An information processing apparatus 100 of the second embodiment is a computer that is capable of performing large-scale matrix operation, and is, for example, a server computer that performs matrix operation in response to a user request.
- the information processing apparatus 100 includes a plurality of CPUs, including CPUs 110 , 110 a , 110 b , and 110 c , and a RAM 120 .
- the plurality of CPUs and the RAM 120 are connected to a system bus 136 .
- the information processing apparatus 100 includes a Hard Disk Drive (HDD) 131 , a video signal processing unit 132 , an input signal processing unit 133 , a media reader 134 , and a communication interface 135 .
- the HDD 131 , video signal processing unit 132 , input signal processing unit 133 , media reader 134 , and communication interface 135 are connected to an input/output bus 137 .
- the system bus 136 and input/output bus 137 are connected with a bridge, for example.
- the CPUs 110 , 110 a , 110 b , and 110 c are processor packages that execute programs.
- the CPUs 110 , 110 a , 110 b , and 110 c load at least some of the instructions of a program or data from the HDD 131 to the RAM 120 , and execute the program.
- Each CPU includes a plurality of cores and a cache memory.
- the CPU 110 includes a plurality of cores, including cores 111 to 114 , and a cache memory 115 .
- the cores 111 to 114 execute threads physically in parallel to one another.
- the cache memory 115 is a volatile memory for temporarily storing the instructions of a program and data read from the RAM 120 , and is, for example, a Static Random Access Memory (SRAM).
- a cache memory may be provided for each core.
- the RAM 120 is a shared memory that is accessed from the CPUs 110 , 110 a , 110 b , and 110 c via the high-speed system bus 136 .
- the RAM 120 temporarily stores the instructions of a program and data.
- the information processing apparatus 100 may be provided with a type of volatile memory other than RAM, or with a plurality of memories.
- the HDD 131 is a non-volatile storage device for storing software programs, such as an Operating System (OS) and application software programs, and data.
- the information processing apparatus 100 may be provided with another type of storage device, such as a flash memory or Solid State Drive (SSD), or with a plurality of non-volatile storage devices.
- the video signal processing unit 132 outputs images to a display 141 connected to the information processing apparatus 100 , according to instructions from any of the CPUs.
- As the display 141 , a Cathode Ray Tube (CRT) display, a Liquid Crystal Display (LCD), a Plasma Display Panel (PDP), an Organic Electro-Luminescence (OEL) display, or another display may be used.
- the input signal processing unit 133 receives an input signal from an input device 142 connected to the information processing apparatus 100 , and outputs the input signal to any one of the CPUs.
- As the input device 142 , a pointing device such as a mouse, touch panel, touchpad, or trackball, a keyboard, a remote controller, a button switch, or another device may be used. Plural types of input devices may be connected to the information processing apparatus 100 .
- the media reader 134 is a driving device that reads programs and data from a recording medium 143 .
- As the recording medium 143 , a magnetic disk such as a flexible disk (FD) or HDD, an optical disc such as a Compact Disc (CD) or Digital Versatile Disc (DVD), a Magneto-Optical disk (MO), a semiconductor memory, or another medium may be used.
- the media reader 134 stores, for example, instructions of a program and data read from the recording medium 143 , in the RAM 120 or HDD 131 .
- the communication interface 135 is connected to a network 144 , and enables communication with another information processing apparatus via the network 144 .
- the communication interface 135 may be a wired communication interface that is connected to a communication device such as a switch or a router with a cable, or a wireless communication interface that is connected to a wireless base station.
- the information processing apparatus 100 may be configured without the media reader 134 . Further, if the user is able to control the information processing apparatus 100 over the network 144 from a terminal device, the information processing apparatus 100 may be configured without the video signal processing unit 132 or input signal processing unit 133 .
- the display 141 and input device 142 may be accommodated as one unit in the casing of the information processing apparatus 100 .
- the cores 111 to 114 are an example of the above-described processors 11 to 13
- the RAM 120 is an example of the above-described memory 14 .
- FIG. 3 illustrates an example of multiplying a sparse matrix by a vector.
- the information processing apparatus 100 performs iterations of multiplying a matrix by an input vector.
- the matrix is, for example, a coefficient matrix that represents the coefficients of simultaneous equations.
- the information processing apparatus 100 multiplies the matrix by an initial input vector.
- the information processing apparatus 100 processes the resulting vector of the first iteration according to a specified algorithm, and takes the resultant as the next input vector.
- the information processing apparatus 100 multiplies the same matrix as used in the first iteration by the input vector obtained by processing the resulting vector of the first iteration. This matrix operation is repeated until a specific completion condition (for example, the number of iterations, the accuracy of values in the resulting vector, etc.) is satisfied.
- a specific completion condition for example, the number of iterations, the accuracy of values in the resulting vector, etc.
- the second embodiment uses a matrix that is symmetric and sparse (symmetric sparse matrix), and a column vector as an input vector. Therefore, the resulting vector, which is the product of the symmetric sparse matrix and the column vector, is a column vector.
- a 6 ⁇ 6 symmetric sparse matrix includes the first row of (4.0, 9.0, 0, 3.0, 0, 0), and an input vector is (0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
- a value in the first row of the resulting vector is calculated as 3.4.
- a 6 ⁇ 6 matrix is used by way of example for simple description.
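The first-row value of 3.4 can be checked directly with a dot product over the numbers given above (a hypothetical verification, not part of the patent):

```python
# first row of the symmetric sparse matrix and the input vector from the example
row1 = [4.0, 9.0, 0.0, 3.0, 0.0, 0.0]
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
y1 = sum(a * b for a, b in zip(row1, x))   # 0.4 + 1.8 + 1.2, approximately 3.4
```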
- FIG. 4 illustrates an example of a compressed storage format for a sparse matrix.
- the lower triangle part of a symmetric sparse matrix is represented in the compressed column storage format.
- the upper triangle part (excluding diagonal elements) of the symmetric sparse matrix may not be included in the matrix data and may be omitted because it is reproducible from the lower triangle part.
- the matrix data includes an element array 121 (Val), a row number array 122 (Row), and a column pointer array 123 (Cp).
- the element array 121 contains the values of the non-zero elements included in the lower triangle part of the symmetric sparse matrix, in column-major order.
- the row number array 122 stores the row numbers of the non-zero elements arranged in the element array 121 .
- the k-th value (Row(k)) of the row number array 122 represents the row number of the k-th non-zero element.
- the length of the element array 121 and row number array 122 is equal to the number of non-zero elements included in the symmetric sparse matrix.
- the column pointer array 123 indicates the locations of the elements that start new columns in the element array 121 .
- the k-th value (Cp(k)) of the column pointer array 123 represents the element number of the first non-zero element in the k-th column.
- the length of the column pointer array 123 is longer by one than the number of columns of the symmetric sparse matrix.
- the end of the column pointer array 123 has a value greater by one than the maximum element number.
- the lower triangle part of a 6 ⁇ 6 symmetric sparse matrix contains the following elements: the first column of (4.0, 9.0, 0, 3.0, 0, 0), the second column of (11.0, 5.0, 0, 0, 0), the third column of (6.0, 0, 0, 0), the fourth column of (1.0, 8.0, 12.0), the fifth column of (2.0, 10.0), and the sixth column of (7.0).
- the element array 121 is (4.0, 9.0, 3.0, 11.0, 5.0, 6.0, 1.0, 8.0, 12.0, 2.0, 10.0, 7.0).
- the row number array 122 is (1, 2, 4, 2, 3, 3, 4, 5, 6, 5, 6, 6).
- the column pointer array 123 is (1, 4, 6, 7, 10, 12, 13).
- the upper triangle part (including diagonal elements) of the symmetric sparse matrix is reproduced by reading the element array 121 , row number array 122 and column pointer array 123 as the arrays for the upper triangle part without any changes to the values. More specifically, the element array 121 is read as an array that contains non-zero elements in row-major order, the row number array 122 is read as a column number array, and the column pointer array 123 is read as a row pointer array.
- the element array, column number array, and row pointer array are a representation of the upper triangle part (including diagonal elements) in the compressed row storage format.
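Putting the two readings together, the full symmetric matrix-vector product can be computed from the lower-triangle arrays alone. The following is a minimal serial sketch (illustrative code, not the patent's implementation), using the example arrays of FIG. 4:

```python
def sym_spmv(val, row, cp, x):
    """y = A x for a symmetric matrix A whose lower triangle is held
    in CCS arrays (1-based indices): each off-diagonal non-zero is
    applied once for the lower triangle and once mirrored for the
    upper triangle."""
    n = len(cp) - 1
    y = [0.0] * n
    for j in range(n):                              # column j of the lower triangle
        for k in range(cp[j] - 1, cp[j + 1] - 1):
            i = row[k] - 1                          # 0-based row of this non-zero
            y[i] += val[k] * x[j]                   # lower-triangle contribution
            if i != j:
                y[j] += val[k] * x[i]               # mirrored upper-triangle element
    return y

# the arrays of FIG. 4 and the input vector of FIG. 3
Val = [4.0, 9.0, 3.0, 11.0, 5.0, 6.0, 1.0, 8.0, 12.0, 2.0, 10.0, 7.0]
Row = [1, 2, 4, 2, 3, 3, 4, 5, 6, 5, 6, 6]
Cp = [1, 4, 6, 7, 10, 12, 13]
X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
Y = sym_spmv(Val, Row, Cp, X)                       # Y[0] is approximately 3.4
```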
- FIG. 5 illustrates an example of division and assignment of a sparse matrix to threads.
- the information processing apparatus 100 divides a symmetric sparse matrix into a plurality of submatrices that are then assigned to a plurality of threads, and executes the threads in parallel with a plurality of cores.
- the information processing apparatus 100 assigns each thread one or two or more successive columns of the lower triangle part of the symmetric sparse matrix.
- the information processing apparatus 100 assigns each thread one or two or more rows of the upper triangle part (excluding diagonal elements) of the symmetric sparse matrix. In the assignment, the j-th column of the lower triangle part and the j-th row of the upper triangle part (excluding diagonal elements), which are at symmetrical positions, are assigned to the same thread.
- the thread pointer array 124 contains the column numbers identifying the first columns of the column groups assigned to the respective threads. Sequential thread numbers are given to the plurality of threads.
- the k-th value (Bp(k)) of the thread pointer array 124 represents the first column of the column group assigned to the thread #k. In this connection, the length of the thread pointer array 124 is longer by one than the number of threads.
- the end of the thread pointer array 124 contains a value greater by one than the number of columns of the symmetric sparse matrix.
- FIG. 6 illustrates an example of parallel calculation for multiplying a sparse matrix by a vector.
- the element array 121 , row number array 122 , column pointer array 123 , and thread pointer array 124 are stored in the RAM 120 .
- a work area 127 (Work), an input vector 128 (X), and a resulting vector 129 (Y) are also stored in the RAM 120 .
- the work area 127 includes one or two or more work vectors that are column vectors, for storing intermediate results. Values obtained by each thread with respect to a submatrix of the lower triangle part are stored in the work area 127 . On the other hand, values obtained by each thread with respect to the upper triangle part (excluding diagonal elements) are stored directly in the resulting vector 129 . After the execution of the plurality of threads is complete, the one or two or more work vectors are added to the resulting vector 129 . The following describes the case where one work vector is prepared for each thread in the work area 127 .
- the thread #1 adds the product of the non-zero element in row 1 and column 1 and the first row of the input vector 128 to the element in the first row of the work vector allocated to the thread #1 (i.e., the element in row 1 and column 1) in the work area 127 .
- the thread #1 adds the product of the non-zero element in row 2 and column 1 and the first row of the input vector 128 to the element in row 2 and column 1 of the work area 127 , and adds the product of the non-zero element in row 4 and column 1 and the first row of the input vector 128 to the element in row 4 and column 1 of the work area 127 .
- the thread #1 adds the product of the non-zero element in row 1 and column 2 and the second row of the input vector 128 to the element in the first row of the resulting vector 129 , and adds the product of the non-zero element in row 1 and column 4 and the fourth row of the input vector 128 to the element in the first row of the resulting vector 129 .
- the thread #2 adds the product of the non-zero element in row 2 and column 2 and the second row of the input vector 128 to the element in row 2 and column 2 of the work area 127 .
- the thread #2 also adds the product of the non-zero element in row 3 and column 2 and the second row of the input vector 128 to the element in row 3 and column 2 of the work area 127 , and adds the product of the non-zero element in row 3 and column 3 and the third row of the input vector 128 to the element in row 3 and column 2 of the work area 127 .
- the thread #2 adds the product of the non-zero element in row 2 and column 3 and the third row of the input vector 128 to the element in the second row of the resulting vector 129 .
- the thread #3 adds values to the elements in row 4 and column 3, row 5 and column 3, and row 6 and column 3 of the work area 127 , and adds a value to the element in the fourth row of the resulting vector 129 .
- the thread #4 adds values to the elements in row 5 and column 4 and row 6 and column 4 of the work area 127 , and adds a value to the element in the fifth row of the resulting vector 129 .
- the information processing apparatus 100 adds the four work vectors included in the work area 127 to the resulting vector 129 .
- the resulting vector 129 thus obtained represents the product of the symmetric sparse matrix and the input vector 128 .
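The per-thread procedure of FIG. 6 can be sketched as follows. The threads are emulated serially here, and the thread pointer array Bp = (1, 2, 4, 5, 7) is inferred from the walkthrough above rather than stated explicitly; the code is an illustrative sketch only.

```python
def run_threads(val, row, cp, bp, x):
    """One work vector per thread: each thread processes its column
    block of the lower triangle into its own work vector, and the
    mirrored upper-triangle elements directly into the resulting
    vector.  Row j of the resulting vector is written only by the
    thread that owns column j, so those writes never conflict."""
    n, n_threads = len(cp) - 1, len(bp) - 1
    work = [[0.0] * n for _ in range(n_threads)]    # work area (Work)
    y = [0.0] * n                                   # resulting vector (Y)
    for t in range(n_threads):                      # each t would run on its own core
        for j in range(bp[t] - 1, bp[t + 1] - 1):   # columns assigned to thread t
            for k in range(cp[j] - 1, cp[j + 1] - 1):
                i = row[k] - 1
                work[t][i] += val[k] * x[j]         # lower triangle -> work vector
                if i != j:
                    y[j] += val[k] * x[i]           # upper triangle -> resulting vector
    for w in work:                                  # finally add the work vectors to Y
        for i in range(n):
            y[i] += w[i]
    return y
```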
- FIG. 7 illustrates an example of a non-zero element map.
- the information processing apparatus 100 analyzes a symmetric sparse matrix to determine how many work vectors to prepare and how the work vectors are allocated to threads.
- the information processing apparatus 100 divides the rows of the symmetric sparse matrix into a plurality of row groups. Preferably, the groups have as equal a number of rows as possible.
- the information processing apparatus 100 previously determines the number of groups (the number of divisions), or changes the number of divisions according to the dimensions of the symmetric sparse matrix. For example, the number of divisions may be set to 100 with respect to a symmetric sparse matrix having the dimensions of about several tens of thousands.
- the information processing apparatus 100 confirms whether a non-zero element exists or not in each block obtained by partitioning the lower triangle part in row direction and column direction, and generates a non-zero element map 125 (Map) representing the distribution of non-zero elements.
- the rows of the non-zero element map 125 correspond to the above row groups, and the columns of the non-zero element map 125 correspond to threads (that is, the column groups of the symmetric sparse matrix assigned to the threads).
- the state of each block is represented by a one-bit flag. A flag of “1” indicates that there is at least one non-zero element in the block. A flag of “0” indicates that there is no non-zero element in the block.
- a 6 ⁇ 4 non-zero element map 125 is generated from the lower triangle part of a 6 ⁇ 6 symmetric sparse matrix that is partitioned as illustrated in FIG. 5 .
- one row of the non-zero element map 125 corresponds to one row of the symmetric sparse matrix.
- a flag of “1” is contained in locations in row 1 and column 1, row 2 and column 1, row 4 and column 1, row 2 and column 2, row 3 and column 2, row 4 and column 3, row 5 and column 3, row 6 and column 3, row 5 and column 4, and row 6 and column 4 of the non-zero element map 125 .
- a flag of “0” is contained in the other locations.
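- A sketch of how such a map could be built from the compressed-column arrays (Row, Cp) and the thread pointer array (Bp), assuming 0-based indices; this is illustrative, not the patented implementation:

```python
def build_nonzero_map(Row, Cp, Bp, group_of, n_groups):
    """Map[g][t] is 1 iff the columns owned by thread t contain at least
    one non-zero element whose row falls in row group g (0-based sketch)."""
    n_threads = len(Bp) - 1
    Map = [[0] * n_threads for _ in range(n_groups)]
    for t in range(n_threads):
        for c in range(Bp[t], Bp[t + 1]):       # columns owned by thread t
            for e in range(Cp[c], Cp[c + 1]):   # non-zeros of column c
                Map[group_of[Row[e]]][t] = 1
    return Map
```

Given the row number array, column pointer array, thread pointer array, and a row-to-group mapping, this sets exactly the flags described above.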
- FIG. 8 illustrates another example of the non-zero element map.
- the information processing apparatus 100 generates an 8×8 non-zero element map 125 , and confirms whether at least one non-zero element exists in each block or not.
- non-zero elements tend to concentrate on the vicinity of a diagonal line.
- the diagonal elements and some elements adjacent to the diagonal elements in the non-zero element map 125 have a flag of “1”, and many other elements in the non-zero element map 125 have a flag of “0”.
- FIG. 9 illustrates an example of allocation of work vectors.
- the information processing apparatus 100 searches for a combination of threads to which a shared work vector is allocable, with reference to the non-zero element map 125 . More specifically, the information processing apparatus 100 searches the non-zero element map 125 for a combination of columns that do not both have a flag of “1” in the same row. Referring to the example of the non-zero element map 125 of FIG. 7 , neither the combination of the first and fourth columns nor the combination of the second and third columns has a conflict regarding a flag of “1”.
- the information processing apparatus 100 allocates a work vector 1 (the first column of the work area 127 ) to the threads #1 and #4, and allocates a work vector 2 (the second column of the work area 127 ) to the threads #2 and #3. That is to say, only two work vectors need to be prepared for the four threads (threads #1 to #4) in the work area 127 .
- After the allocation of the work vectors to the respective threads, the information processing apparatus 100 generates a work pointer array 126 (Up) that contains the vector numbers of the work vectors to be used by the threads.
- the k-th value (Up(k)) of the work pointer array 126 represents the work vector to be used by the thread #k.
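- The column-compatibility test behind this search can be sketched as follows (illustrative Python with 0-based indices; the map literal reproduces the flags of FIG. 7):

```python
def can_share(Map, t1, t2):
    """Threads t1 and t2 may share one work vector iff their columns of the
    non-zero element map never both hold a flag of 1 in the same row."""
    return all(not (row[t1] and row[t2]) for row in Map)

# Non-zero element map of FIG. 7 (rows = row groups, columns = threads #1..#4).
Map = [[1, 0, 0, 0],
       [1, 1, 0, 0],
       [0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
# Threads #1/#4 and #2/#3 are compatible; #1/#2 conflict in the second row,
# so Up = [1, 2, 2, 1] as in the document (1-based work-vector numbers).
Up = [1, 2, 2, 1]
```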
- FIG. 10 is a first diagram illustrating another example of allocation of work vectors.
- Non-zero elements exist in the groups 1 and 2 of a thread #1, the groups 2 and 3 of a thread #2, the groups 3 and 4 of a thread #3, the groups 4 and 5 of a thread #4, the groups 5 and 6 of a thread #5, the groups 6 and 7 of a thread #6, the groups 7 and 8 of a thread #7, and the group 8 of a thread #8.
- Non-zero elements do not exist in the other groups.
- the information processing apparatus 100 allocates a work vector 1 to the threads #1, #3, #5, and #7, and allocates a work vector 2 to the threads #2, #4, #6, and #8. That is to say, only two work vectors need to be prepared for these eight threads (threads #1 to #8).
- a shared work vector may be allocated to three or more threads.
- determining appropriate combinations of threads makes it possible to reduce the number of work vectors prepared in the work area 127 to two or three.
- FIG. 11 is a second diagram illustrating another example of allocation of work vectors.
- Non-zero elements exist in the groups 1, 2, and 4 of a thread #1, the groups 2, 3, and 6 of a thread #2, the groups 3, 4, and 8 of a thread #3, the groups 4 and 5 of a thread #4, the groups 5 and 6 of a thread #5, the groups 6 and 7 of a thread #6, the groups 7 and 8 of a thread #7, and the group 8 of a thread #8.
- Non-zero elements do not exist in the other groups.
- the information processing apparatus 100 allocates a work vector 1 to the threads #1, #5, and #7, allocates a work vector 2 to the threads #2, #4, and #8, and allocates a work vector 3 to the threads #3 and #6. That is to say, only three work vectors need to be prepared in the work area 127 for the eight threads (threads #1 to #8).
- FIG. 12 is a third diagram illustrating another example of allocation of work vectors.
- Non-zero elements exist in the groups 1, 3, and 7 of a thread #1, the groups 2 and 6 of a thread #2, the groups 3 and 5 of a thread #3, the groups 4 and 8 of a thread #4, the groups 5 and 7 of a thread #5, the group 6 of a thread #6, the group 7 of a thread #7, and the group 8 of a thread #8.
- Non-zero elements do not exist in the other groups.
- the information processing apparatus 100 allocates a work vector 1 to the threads #1, #2, and #4, allocates a work vector 2 to the threads #3, #6, #7, and #8, and allocates a work vector 3 to the thread #5. That is to say, only three work vectors need to be prepared in the work area 127 for the eight threads (threads #1 to #8).
- the allocations illustrated in FIGS. 9 to 12 are found by sequentially comparing the already prepared work vectors with the columns of the non-zero element map 125 , in order from the left column, to find a work vector where a conflict regarding the flag of “1” does not occur, and preparing a new work vector if no such work vector exists.
- the information processing apparatus 100 prepares the work vector 1 and allocates the work vector 1 to the thread #1. Then, since it is possible to add the thread #2 to the work vector 1, the information processing apparatus 100 allocates the work vector 1 to the thread #2. Then, since it is not possible to add the thread #3 to the work vector 1, the information processing apparatus 100 prepares the work vector 2 and allocates the work vector 2 to the thread #3.
- the information processing apparatus 100 allocates the work vector 1 to the thread #4. Then, since it is not possible to add the thread #5 to either of the work vectors 1 and 2, the information processing apparatus 100 prepares the work vector 3 and allocates the work vector 3 to the thread #5. Then, since it is not possible to add the thread #6 to the work vector 1 but it is possible to add the thread #6 to the work vector 2, the information processing apparatus 100 allocates the work vector 2 to the thread #6. Subsequently, the information processing apparatus 100 allocates the work vector 2 to the threads #7 and #8 in the same way.
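- The greedy search just described might be sketched like this (an illustrative implementation, not the patented code; 0-based indices assumed):

```python
def allocate_work_vectors(Map):
    """Scan threads left to right; reuse the first work vector whose
    accumulated flags do not conflict with the thread's map column,
    otherwise prepare a new work vector.  Returns (Up, n), where Up[t]
    is the 1-based number of the work vector used by thread t and n is
    the number of work vectors prepared."""
    n_threads = len(Map[0])
    vectors = []                        # accumulated flag column per work vector
    Up = []
    for t in range(n_threads):
        col = [row[t] for row in Map]
        for v, acc in enumerate(vectors):
            if all(not (a and b) for a, b in zip(acc, col)):
                vectors[v] = [a or b for a, b in zip(acc, col)]
                Up.append(v + 1)        # share the existing work vector v
                break
        else:
            vectors.append(col)         # no compatible vector: add a new one
            Up.append(len(vectors))
    return Up, len(vectors)

# Example: the arrangement of FIG. 12 (0-based row-group indices per thread).
groups = [[0, 2, 6], [1, 5], [2, 4], [3, 7], [4, 6], [5], [6], [7]]
Map = [[1 if g in groups[t] else 0 for t in range(8)] for g in range(8)]
Up, n = allocate_work_vectors(Map)
# Up == [1, 1, 2, 1, 3, 2, 2, 2] and n == 3: work vector 1 for threads #1, #2,
# and #4, work vector 2 for threads #3, #6, #7, and #8, and 3 for thread #5.
```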
- the number of work vectors prepared in the work area 127 may be different depending on an employed search algorithm.
- FIG. 13 is a fourth diagram illustrating another example of allocation of work vectors.
- This example describes the case of executing six threads in order to process a symmetric sparse matrix and dividing the rows of the symmetric sparse matrix into eight row groups.
- Non-zero elements exist in the groups 1, 3, and 8 of a thread #1, the groups 2 and 5 of a thread #2, the groups 3, 4, 7, and 8 of a thread #3, the groups 4, 5, and 7 of a thread #4, the group 6 of a thread #5, and the group 6 of a thread #6.
- with the greedy search algorithm described above, the information processing apparatus 100 allocates a work vector 1 to the threads #1, #2, and #5, a work vector 2 to the threads #3 and #6, and a work vector 3 to the thread #4. That is, three work vectors are prepared in the work area 127 .
- with another search algorithm, by contrast, the information processing apparatus 100 allocates a work vector 1 to the threads #1, #4, and #5, and a work vector 2 to the threads #2, #3, and #6. That is, only two work vectors, fewer than with the former search algorithm, are prepared in the work area 127 .
- a search algorithm that minimizes the number of work vectors (an algorithm that provides an optimal solution) needs a larger amount of calculation than the other search algorithms (algorithms that provide suboptimal solutions).
- the information processing apparatus 100 may be designed to select a search algorithm to be used, taking into account a balance between the accuracy of solution and the amount of calculation. In the following description, it is assumed that the information processing apparatus 100 uses a search algorithm that provides a suboptimal solution with a small amount of calculation.
- the following describes the functions and processes of the information processing apparatus 100 .
- FIG. 14 is a block diagram illustrating an example of the functions of an information processing apparatus.
- the information processing apparatus 100 includes a data storage unit 150 , a matrix operation requesting unit 161 , a parallelization control unit 162 , a parallel processing unit 165 , and an OS 168 .
- the data storage unit 150 is implemented as a storage area saved in the RAM 120 .
- the matrix operation requesting unit 161 , parallelization control unit 162 , and parallel processing unit 165 are implemented as software modules.
- the parallelization control unit 162 and parallel processing unit 165 may be implemented as a numerical calculation library.
- the data storage unit 150 includes a matrix storage unit 151 , a control data storage unit 152 , an intermediate data storage unit 153 , and a vector storage unit 154 .
- the matrix storage unit 151 stores matrix data representing a symmetric sparse matrix.
- the matrix data includes an element array 121 , a row number array 122 , and a column pointer array 123 .
- the control data storage unit 152 stores control data for use in controlling parallelization.
- the control data includes a thread pointer array 124 , a non-zero element map 125 , and a work pointer array 126 .
- the intermediate data storage unit 153 includes a work area 127 .
- the vector storage unit 154 stores an input vector 128 and a resulting vector 129 .
- the matrix operation requesting unit 161 stores the element array 121 , row number array 122 , column pointer array 123 , and input vector 128 in the data storage unit 150 , and requests the parallelization control unit 162 to multiply the symmetric sparse matrix by the input vector 128 .
- the matrix operation requesting unit 161 processes the resulting vector 129 for use as the next input vector 128 .
- the matrix operation requesting unit 161 repeatedly requests the parallelization control unit 162 to multiply the symmetric sparse matrix by the input vector 128 until the number of iterations or an operation status such as the accuracy of values included in the resulting vector 129 satisfies a specific completion condition.
- the parallelization control unit 162 controls the parallelization of the matrix operation.
- the parallelization control unit 162 includes a matrix analysis unit 163 and a vector input and output unit 164 .
- the matrix analysis unit 163 analyzes the symmetric sparse matrix and determines a parallelization method.
- the matrix analysis unit 163 divides the symmetric sparse matrix and assigns the divisions to a plurality of threads.
- the number of threads for performing the matrix operation is determined based on, for example, the amount of hardware resources of the information processing apparatus 100 , the current workload of the information processing apparatus 100 , user contracts, and the like.
- the matrix analysis unit 163 determines the number of work vectors to be prepared in the work area 127 , and allocates any of the work vectors to each of the plurality of threads.
- the vector input and output unit 164 initializes the work area 127 and resulting vector 129 each time the input vector 128 is specified by the matrix operation requesting unit 161 (for each iteration of the iterative operation). Then, when the matrix operation for the plurality of threads is completed, the vector input and output unit 164 adds all the work vectors included in the work area 127 to the resulting vector 129 , to thereby obtain a final solution as the product of the symmetric sparse matrix and input vector 128 .
- the parallel processing unit 165 performs the functions of a plurality of threads to be executed in parallel.
- the parallel processing unit 165 includes a non-zero element check unit 166 and a matrix operation unit 167 .
- the non-zero element check unit 166 confirms the distribution of non-zero elements with respect to the submatrix assigned to each thread, and updates the flags for the columns corresponding to the thread in the non-zero element map 125 . That is to say, the generation of the non-zero element map 125 is parallelized using the plurality of threads. In this connection, each thread is able to identify the assigned columns of the symmetric sparse matrix with reference to the thread pointer array 124 .
- the matrix operation unit 167 multiplies a submatrix assigned to each thread by the input vector 128 .
- the matrix operation unit 167 stores operation results in the work vector allocated by the matrix analysis unit 163 .
- the matrix operation unit 167 stores operation results in the resulting vector 129 .
- the OS 168 activates a plurality of threads, and assigns the plurality of threads to a plurality of cores included in the CPUs 110 , 110 a , 110 b , and 110 c . As a rule, different threads are assigned to different cores.
- the OS 168 prepares the work area 127 in the RAM 120 . In this connection, it is preferable that threads or processes corresponding to the parallelization control unit 162 , parallel processing unit 165 , and OS 168 are performed by different cores.
- FIG. 15 is a flowchart illustrating how to control matrix operation.
- the matrix operation requesting unit 161 stores an element array 121 , a row number array 122 , and a column pointer array 123 that represent a symmetric sparse matrix, in the matrix storage unit 151 .
- the matrix analysis unit 163 reads the row number array 122 and column pointer array 123 from the matrix storage unit 151 .
- the matrix analysis unit 163 determines how many threads to use in the matrix operation, and instructs the OS 168 to activate the determined number of threads. At this time, the matrix analysis unit 163 gives sequential thread numbers to the threads. In addition, the matrix analysis unit 163 detects the number of non-zero elements included in the symmetric sparse matrix with reference to the element array 121 , and divides the columns of the symmetric sparse matrix so that the divisions contain as equal a number of non-zero elements as possible. Then, the matrix analysis unit 163 generates and stores a thread pointer array 124 in the control data storage unit 152 .
- the matrix analysis unit 163 generates and stores a non-zero element map 125 in the control data storage unit 152 . Then the matrix analysis unit 163 calls the non-zero element check unit 166 for each thread. For each thread, the non-zero element check unit 166 updates the flags for the columns corresponding to the thread in the non-zero element map 125 . This non-zero element check will be described in detail later.
- the matrix analysis unit 163 determines how many work vectors to prepare in the work area 127 and how to allocate the work vectors to the threads, with reference to the non-zero element map 125 generated at step S 3 . Then, the matrix analysis unit 163 generates and stores a work pointer array 126 in the control data storage unit 152 . This allocation of work vectors will be described in detail later.
- the matrix analysis unit 163 instructs the OS 168 to prepare the work area 127 including as many work vectors as determined at step S 4 , in the intermediate data storage unit 153 . Then, the matrix analysis unit 163 generates and stores an empty resulting vector 129 in the vector storage unit 154 .
- the matrix operation requesting unit 161 stores the input vector 128 in the vector storage unit 154 .
- the initial input vector 128 that is used in the first iteration of the iterative operation may be stored in the vector storage unit 154 at step S 1 .
- the vector input and output unit 164 reads the input vector 128 from the vector storage unit 154 .
- the vector input and output unit 164 calls the matrix operation unit 167 for each thread. For each thread, the matrix operation unit 167 multiplies the submatrix assigned to the thread by the input vector 128 . The vector input and output unit 164 adds all the work vectors included in the work area 127 to the resulting vector 129 . The resulting vector 129 thus obtained represents the product of the symmetric sparse matrix and the input vector 128 . This parallel matrix operation will be described in detail later.
- the matrix operation requesting unit 161 reads the resulting vector 129 from the vector storage unit 154 . Then, the matrix operation requesting unit 161 determines whether an operation status satisfies a specific completion condition or not. If the completion condition is satisfied, the matrix operation requesting unit 161 completes the iterative operation. If not, then the matrix operation requesting unit 161 uses the read resulting vector 129 to generate the next input vector 128 , and the process proceeds back to step S 6 .
- FIG. 16 is a flowchart illustrating how to conduct non-zero element check. This non-zero element check is conducted at step S 3 of the flowchart of FIG. 15 .
- the matrix analysis unit 163 generates a non-zero element map 125 with as many rows as the number of divisions and as many columns as the number of threads, and stores the non-zero element map 125 in the control data storage unit 152 . At this time, the matrix analysis unit 163 initializes all the elements in the non-zero element map 125 to zero. The matrix analysis unit 163 then calls the non-zero element check unit 166 for each thread. Then, the following steps S 32 to S 38 are executed in parallel by the plurality of threads. The following describes the case where the thread (thread #t) with a thread number of t executes steps S 32 to S 38 .
- the non-zero element check unit 166 determines with reference to the column pointer array 123 whether all the non-zero elements in the column selected at step S 32 or S 38 have been selected or not. More specifically, the non-zero element check unit 166 determines whether the element number e identified at step S 33 or S 36 matches Cp(c+1)−1 or not. If all the elements have been selected, the process proceeds to step S 37 . If there is any unselected element, then the process proceeds to step S 36 .
- the non-zero element check unit 166 selects the next non-zero element. More specifically, the non-zero element check unit 166 increments the element number e (adds one to the current element number e). Then, the process proceeds to step S 34 .
- the non-zero element check unit 166 determines with reference to the thread pointer array 124 whether all the columns assigned to the thread #t have been selected or not. More specifically, the non-zero element check unit 166 determines whether the column number c matches Bp(t+1)−1 or not. If all the columns have been selected, the non-zero element check unit 166 completes the non-zero element check, and notifies the matrix analysis unit 163 of the completion. If there is any unselected column, the process proceeds to step S 38 .
- the non-zero element check unit 166 selects the next column. More specifically, the non-zero element check unit 166 increments the column number c (adds one to the current column number c). Then the process proceeds to step S 33 .
- FIG. 17 is a flowchart illustrating how to allocate work vectors. This allocation of work vectors is performed at step S 4 of the flowchart of FIG. 15 .
- the matrix analysis unit 163 selects the thread #2. More specifically, the matrix analysis unit 163 sets the thread number t to two as a variable.
- the matrix analysis unit 163 selects the first one of the currently existing work vectors. More specifically, the matrix analysis unit 163 sets the vector number v to one as a variable.
- the matrix analysis unit 163 determines whether all the currently existing work vectors have been selected or not. More specifically, the matrix analysis unit 163 determines whether the vector number v matches the vector count n or not. If all the work vectors have been selected, the process proceeds to step S 48 . If there is any unselected work vector, then the process proceeds to step S 47 .
- the matrix analysis unit 163 selects the next work vector from the currently existing work vectors. More specifically, the matrix analysis unit 163 increments the vector number v (adds one to the current vector number v). Then, the process proceeds to step S 43 .
- the matrix analysis unit 163 adds one work vector. More specifically, the matrix analysis unit 163 increments the vector count n (adds one to the current vector count n).
- the matrix analysis unit 163 overwrites the flags of the n-th column with the flags of the t-th column in the non-zero element map 125 . Thereby, the n-th column vector becomes identical to the t-th column vector.
- the matrix analysis unit 163 determines whether all the threads have been selected or not. More specifically, the matrix analysis unit 163 determines whether the thread number t matches the number of threads. If all the threads have been selected, the matrix analysis unit 163 completes the allocation of work vectors. As a result, the number of work vectors to be prepared in the work area 127 and the contents of the work pointer array 126 are fixed. If there is any unselected thread, then the process proceeds to step S 52 .
- the matrix analysis unit 163 selects the next thread. More specifically, the matrix analysis unit 163 increments the thread number t (adds one to the current thread number t). Then, the process proceeds to step S 42 .
- FIG. 18 is a flowchart illustrating how to perform parallel matrix operation. This parallel matrix operation is performed at step S 7 of the flowchart of FIG. 15 .
- the vector input and output unit 164 initializes all of the elements of the work vectors in the work area 127 and the elements of the resulting vector 129 to zero. Then, the vector input and output unit 164 calls the matrix operation unit 167 for each thread. Then, the following steps S 71 to S 79 are executed in parallel by the plurality of threads. The following describes the case where the thread (thread #t) with a thread number of t executes the steps S 71 to S 79 .
- the matrix operation unit 167 calculates and stores the product of the value of the non-zero element selected at step S 72 or S 77 and the value in the c-th row of the input vector 128 , in the work vector allocated to the thread #t. More specifically, the matrix operation unit 167 adds Val(e)×X(c) to Work(Row(e), Up(t)).
- the matrix operation unit 167 determines whether the non-zero element selected at step S 72 or S 77 is a diagonal element in the symmetric sparse matrix or not. More specifically, the matrix operation unit 167 determines whether Row(e) matches the column number c or not. If the non-zero element is a diagonal element, the process proceeds to step S 76 . If not, the process proceeds to step S 75 .
- the matrix operation unit 167 calculates and stores the product of the value of the non-zero element selected at step S 72 or S 77 and the value of the row corresponding to the non-zero element in the input vector 128 , in the resulting vector 129 . More specifically, the matrix operation unit 167 adds Val(e)×X(Row(e)) to Y(c).
- the matrix operation unit 167 determines with reference to the column pointer array 123 whether all the non-zero elements included in the column selected at step S 71 or S 79 have been selected or not. More specifically, the matrix operation unit 167 determines whether the element number e identified at step S 72 or S 77 matches Cp(c+1)−1 or not. If all the non-zero elements have been selected, the process proceeds to step S 78 . If there is any unselected element, then the process proceeds to step S 77 .
- the matrix operation unit 167 selects the next non-zero element. More specifically, the matrix operation unit 167 increments the element number e. Then, the process proceeds to step S 73 .
- the matrix operation unit 167 determines with reference to the thread pointer array 124 whether all the columns assigned to the thread #t have been selected or not. More specifically, the matrix operation unit 167 determines whether the column number c matches Bp(t+1)−1 or not. If all the columns have been selected, the matrix operation unit 167 completes the calculation of multiplying the submatrix by the input vector 128 , and notifies the vector input and output unit 164 of the completion. Then, the process proceeds to step S 80 . If there is any unselected column, the process proceeds to step S 79 .
- the matrix operation unit 167 selects the next column. More specifically, the matrix operation unit 167 increments the column number c. Then, the process proceeds to step S 72 .
- the vector input and output unit 164 adds all the work vectors included in the work area 127 to the resulting vector 129 . More specifically, for each row i and each work vector j, the vector input and output unit 164 adds the value in row i and column j of the work area 127 to Y(i). Then, the vector input and output unit 164 notifies the matrix operation requesting unit 161 of the completion of the matrix operation.
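- Putting steps S 70 to S 80 together, a sequential sketch of the parallel matrix operation could look like the following (illustrative Python; the 3×3 matrix in the example is hypothetical, and 0-based indices are assumed):

```python
def symmetric_spmv(Val, Row, Cp, X, Bp, Up, n_vectors):
    """Multiply a symmetric sparse matrix, stored as its lower triangle in
    compressed column storage, by the input vector X.  Each thread t owns
    columns Bp[t]..Bp[t+1]-1 and writes its lower-triangle products into
    the shared work vector Up[t]; the mirrored upper-triangle products go
    straight into the resulting vector Y, whose row c is touched only by
    the thread that owns column c.  The per-thread loops run sequentially
    in this sketch; in the document they run in parallel."""
    n = len(X)
    Work = [[0.0] * n for _ in range(n_vectors)]
    Y = [0.0] * n
    for t in range(len(Bp) - 1):              # one iteration per thread
        w = Work[Up[t] - 1]                   # Up holds 1-based vector numbers
        for c in range(Bp[t], Bp[t + 1]):
            for e in range(Cp[c], Cp[c + 1]):
                w[Row[e]] += Val[e] * X[c]    # lower-triangle part (step S73)
                if Row[e] != c:               # off-diagonal element:
                    Y[c] += Val[e] * X[Row[e]]  # mirrored upper part (S75)
    for w in Work:                            # fold work vectors into Y (S80)
        for i in range(n):
            Y[i] += w[i]
    return Y

# Hypothetical 3x3 symmetric matrix [[2, 1, 0], [1, 3, 4], [0, 4, 5]],
# lower triangle stored column-major:
Val = [2.0, 1.0, 3.0, 4.0, 5.0]
Row = [0, 1, 1, 2, 2]
Cp = [0, 2, 4, 5]
Bp = [0, 1, 3]        # thread #1: column 0; thread #2: columns 1-2
Up = [1, 2]           # the two threads' row sets overlap, so two work vectors
Y = symmetric_spmv(Val, Row, Cp, [1.0, 2.0, 3.0], Bp, Up, 2)
# Y == [4.0, 19.0, 23.0]
```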
- the second embodiment describes the case of multiplying a symmetric sparse matrix by a column vector.
- the above method of allocating work vectors may be applied to the case of multiplying a row vector by a symmetric sparse matrix.
- the work vectors may be used to store the products of submatrices in the upper triangle part of the symmetric sparse matrix and an input vector 128 .
- the symmetric sparse matrix is represented in the compressed column storage format.
- the symmetric sparse matrix may be represented in the compressed row storage format or may be represented without being compressed. In the case where submatrices at symmetrical positions are assigned to the same thread, there is a problem of access conflict between threads in either one of the upper triangle part and the lower triangle part.
- a sparse matrix to be multiplied by a vector is a symmetric matrix.
- a sparse matrix that is not a symmetric matrix may be used.
- there is a problem of access conflict between threads in the case of dividing the columns of the sparse matrix represented in the compressed column storage format and then multiplying the sparse matrix by a column vector.
- there is a problem of access conflict between threads in the case of dividing the rows of the sparse matrix represented in the compressed row storage format and then multiplying a row vector by the sparse matrix.
- the information processing apparatus 100 compares the submatrices assigned to a plurality of threads in terms of their distributions of non-zero elements, and allocates a shared work vector to a combination of threads that do not overlap in the distributions of non-zero elements. This prevents access conflict to the same element of the work vector, which eliminates the need of exclusive control for access to the work vector between the plurality of threads.
- it is possible to reduce (for example, minimize) the number of work vectors prepared in the work area 127 without causing the access conflict. Therefore, the size of the work area 127 is suppressed even if the number of threads increases. That is to say, the information processing apparatus 100 is able to use the storage area of the RAM 120 efficiently and to multiply a large-scale sparse matrix by a vector efficiently.
- the information processing of the first embodiment may be implemented by the information processing apparatus 10 executing a program.
- the information processing of the second embodiment may be implemented by the information processing apparatus 100 executing a program.
- Such a program may be recorded on a computer-readable recording medium (for example, recording medium 143 ).
- As the recording medium, a magnetic disk, optical disk, magneto-optical disk, semiconductor memory, or the like may be used.
- Magnetic disks include FDs and HDDs.
- Optical disks include CDs, CD-R (Recordable), CD-RW (Rewritable), DVD, DVD-R, and DVD-RW.
- the program may be recorded on portable recording media and then distributed. In this case, the program may be copied (installed) from a portable recording medium to another recording medium such as an HDD (for example, HDD 131 ) and then executed.
Description
- This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2013-074443, filed on Mar. 29, 2013, the entire contents of which are incorporated herein by reference.
- The embodiments discussed herein relate to a parallel operation method and an information processing apparatus.
- Large-scale numerical calculations such as scientific calculations may be carried out using supercomputers or other high performance computers. Such large-scale numerical calculations often involve high-dimensional matrix operations. For example, in the flow analysis field, structural analysis field, and other fields, large-scale simultaneous equations may be solved using a coefficient matrix that represents the coefficients of the simultaneous equations. Another example is that, in the circuit analysis field, vibration analysis field, and other fields, the eigenvalues of a large-scale matrix may be computed. In a matrix operation using a computer, an approximate solution may be computed by iterations of matrix-vector multiplication. For example, an approximate solution of a differential equation, which is analytically difficult to solve, may be computed by iterations of multiplying a coefficient matrix by a vector with the finite element method.
- A matrix to be used in a large-scale numerical calculation may be a sparse matrix with a high percentage of zero-valued elements (zero elements) and a small percentage of non-zero-valued elements (non-zero elements). Sparse matrix representation including zero elements based on the matrix structure produces a large amount of data and therefore is inefficient. To deal with this, there are compressed storage formats for representing a sparse matrix as compressed data without zero elements. Such compressed storage formats include Compressed Column Storage (CCS) format and Compressed Row Storage (CRS) format.
- In the compressed column storage format, the elements included in an N×M matrix are searched in column-major order (i.e., in the order of row 1 and column 1, row 2 and column 1, . . . , row N and column 1, row 1 and column 2, row 2 and column 2, . . . , row 1 and column M, . . . , and row N and column M), and only the non-zero elements are extracted from the matrix. Then, a first array that holds the values of the non-zero elements in the above order, a second array that stores the row numbers of the non-zero elements, and a third array that indicates the locations of the elements that start new columns in the first array are generated. With regard to the compressed row storage format, the elements included in an N×M matrix are searched in row-major order (i.e., in the order of row 1 and column 1, row 1 and column 2, . . . , row 1 and column M, row 2 and column 1, row 2 and column 2, . . . , row N and column 1, . . . , and row N and column M), and only the non-zero elements are extracted from the matrix. Then, a first array that holds the values of the non-zero elements, a second array that stores the column numbers of the non-zero elements, and a third array that indicates the locations of the elements that start new rows in the first array are generated.
- By the way, in a large-scale matrix operation, a matrix may be divided so as to be assigned to a plurality of threads which are then executed in parallel by a plurality of processors. This technique achieves high-speed operation. In this case, a plurality of threads may perform operations for the same element of a final result, depending on the way of dividing the matrix. Therefore, it is considered that a storage area for storing an intermediate result may be allocated to each thread.
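- For illustration, the compressed column storage format described above can be sketched as follows (0-based indices; a sketch, not taken from the cited publications):

```python
def to_ccs(A):
    """Compress a dense matrix into the three CCS arrays (0-based sketch):
    Val holds the non-zero values in column-major order, Row their row
    numbers, and Cp[c]..Cp[c+1]-1 the range of Val belonging to column c."""
    Val, Row, Cp = [], [], [0]
    for c in range(len(A[0])):       # column-major search order
        for r in range(len(A)):
            if A[r][c] != 0:
                Val.append(A[r][c])  # first array: values
                Row.append(r)        # second array: row numbers
        Cp.append(len(Val))          # third array: column start locations
    return Val, Row, Cp

Val, Row, Cp = to_ccs([[2, 0],
                       [1, 3]])
# Val == [2, 1, 3], Row == [0, 1, 1], Cp == [0, 2, 3]
```

The compressed row storage format is the symmetric construction with rows and columns exchanged.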
- For example, there has been proposed a parallel processing method for multiplying a sparse matrix represented in the compressed column storage format by a column vector with a plurality of processors. In this parallel processing method, the columns of the matrix are equally divided and assigned to a plurality of threads, and a storage area having the same size as a column vector that is the final result is allocated to each thread. Then, the column vectors that are the intermediate results obtained by the respective threads are added to thereby obtain the final result.
- In a coefficient matrix that represents the coefficients of simultaneous equations, non-zero elements tend to concentrate in the vicinity of the diagonal and in some square areas. Using these characteristics, there has been proposed a high-speed operation method in which a coefficient matrix is divided into a plurality of blocks and the calculations for the blocks are carried out in parallel with a plurality of processors.
- See, for example, Japanese Laid-open Patent Publication No. 2009-199430 and International Publication Pamphlet No. WO2008/026261.
- However, the approach of allocating to each of a plurality of threads a storage area for storing a vector that is an intermediate result consumes a large amount of memory. For example, consider the case of executing 1000 threads in parallel to multiply a sparse matrix with one million rows and one million columns by a column vector. The data of the sparse matrix may be compressed with a compressed storage format, but the aforementioned storage areas as a whole need space for storing 1000 column vectors with one million rows each. The memory consumption of this method therefore grows with the number of threads.
- Another method is to prepare one shared storage area for a plurality of threads and have the threads sequentially add values to the elements of a resulting vector. However, if a shared storage area is used naively, some threads may add values to the same element (for example, in the same row of the resulting column vector) at the same time, which is an access conflict. If exclusive control is exercised between the threads so as not to cause access conflicts, the overhead of memory access may increase and the efficiency of the parallel processing may decrease.
- According to one embodiment, there is provided a non-transitory computer-readable storage medium storing therein a memory allocating program. The memory allocating program causes a computer capable of executing a plurality of threads in parallel to perform a process that includes: assigning calculation of a first submatrix included in a matrix to a first thread and calculation of a second submatrix included in the matrix to a second thread, the matrix including zero elements and non-zero elements; comparing a distribution of non-zero elements in rows or columns of the first submatrix with a distribution of non-zero elements in rows or columns of the second submatrix; and determining allocation of storage areas for storing vectors to be respectively used in calculations by the first and second threads, according to a result of the comparing.
- The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
- It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
- FIG. 1 illustrates an information processing apparatus according to a first embodiment;
- FIG. 2 is a block diagram illustrating an example of the hardware configuration of an information processing apparatus;
- FIG. 3 illustrates an example of multiplying a sparse matrix by a vector;
- FIG. 4 illustrates an example of a compressed storage format for a sparse matrix;
- FIG. 5 illustrates an example of division and assignment of a sparse matrix to threads;
- FIG. 6 illustrates an example of parallel calculation for multiplying a sparse matrix by a vector;
- FIG. 7 illustrates an example of a non-zero element map;
- FIG. 8 illustrates another example of the non-zero element map;
- FIG. 9 illustrates an example of allocation of work vectors;
- FIG. 10 is a first diagram illustrating another example of allocation of work vectors;
- FIG. 11 is a second diagram illustrating another example of allocation of work vectors;
- FIG. 12 is a third diagram illustrating another example of allocation of work vectors;
- FIG. 13 is a fourth diagram illustrating another example of allocation of work vectors;
- FIG. 14 is a block diagram illustrating an example of the functions of the information processing apparatus;
- FIG. 15 is a flowchart illustrating how to control matrix operation;
- FIG. 16 is a flowchart illustrating how to conduct non-zero element check;
- FIG. 17 is a flowchart illustrating how to allocate work vectors; and
- FIG. 18 is a flowchart illustrating how to perform parallel matrix operation.
- Several embodiments will be described below with reference to the accompanying drawings, wherein like reference numerals refer to like elements throughout.
-
FIG. 1 illustrates an information processing apparatus according to a first embodiment. - An
information processing apparatus 10 of the first embodiment executes a plurality of threads in parallel in order to perform a matrix operation (for example, a matrix-vector multiplication). The information processing apparatus 10 includes a plurality of processors, including processors 11 to 13, and a memory 14. - The
processors 11 to 13 are computing devices that are capable of executing threads physically in parallel to one another. Each of the processors 11 to 13 may be a single processor package such as a Central Processing Unit (CPU) package, or a processor core (which may simply be referred to as a core) included in a processor package. For example, the processors 11 to 13 execute a program stored in the memory 14. In this first embodiment, the processor 12 executes a thread 21, whereas the processor 13 executes a thread 22. In addition, the processor 11 executes a thread or process for controlling parallelization of the matrix operation. Alternatively, the processor 12 or 13 may be designed to control the parallelization. - The
memory 14 is a shared memory that is accessed from the processors 11 to 13, and is a Random Access Memory (RAM), for example. The memory 14 stores one or more vectors to be used in the operations of the threads 21 and 22. These vectors store values obtained in the middle of the operations of the threads 21 and 22. The vectors also store, for example, the values of intermediate results (values obtained before a final result) at the time of completion of the operations of the threads 21 and 22. In the case where a matrix operation multiplies a matrix by a column vector, the above vectors are column vectors. In the case where a matrix operation multiplies a row vector by a matrix, the above vectors are row vectors. In the first embodiment, allocation of storage areas of the memory 14 for storing vectors to the threads 21 and 22 is determined as described below. For example, the information processing apparatus 10 prepares a shared storage area (storage area 26) or individual storage areas (storage areas 26 and 27) for the threads 21 and 22. - First, the
processor 11 analyzes a matrix 23 including zero elements and non-zero elements, and assigns the calculations of the submatrices included in the matrix 23 to the threads 21 and 22. The matrix 23 is, for example, a sparse matrix with a high percentage of zero elements and a small percentage of non-zero elements. The matrix 23 may be represented in a compressed storage format such as the compressed column storage format or the compressed row storage format. In the first embodiment, the processor 11 assigns the calculation of a submatrix 24 to the thread 21, and the calculation of a submatrix 25 to the thread 22. For example, assume that the submatrices 24 and 25 do not overlap in any columns but overlap in some rows. Alternatively, it may be assumed that the submatrices 24 and 25 do not overlap in any rows but overlap in some columns. - The
processor 11 also analyzes the matrix 23, and allocates to the threads 21 and 22 storage areas for storing vectors to be used in their operations. To this end, the processor 11 compares the distribution of non-zero elements of the submatrix 24 with the distribution of non-zero elements of the submatrix 25. For example, in the case where the operation result is a column vector, the processor 11 compares the distributions of non-zero elements in the column direction (across a plurality of rows). On the other hand, in the case where the operation result is a row vector, the processor 11 compares the distributions of non-zero elements in the row direction (across a plurality of columns). - Then, the
processor 11 determines allocation of storage areas to the threads 21 and 22 according to a result of comparing the distributions of non-zero elements. For example, in the case where the submatrices 24 and 25 do not have non-zero elements in the same row (or in the same column), the processor 11 determines that it is possible to allocate a shared storage area to the threads 21 and 22. This is because, if there are no non-zero elements in the same row, the threads 21 and 22 do not access the same row element of a column vector used in their operations, and therefore no access conflict occurs. Likewise, if there are no non-zero elements in the same column, the threads 21 and 22 do not access the same column element of a row vector used in their operations, and therefore no access conflict occurs. - For example, in the case where the shared
storage area 26 is allocated to the threads 21 and 22, the thread 21 stores the values (intermediate result) obtained in the middle of or at the end of the calculation of the submatrix 24 in the storage area 26. The thread 22 stores the values (intermediate result) obtained in the middle of or at the end of the calculation of the submatrix 25 in the storage area 26. At this time, there is no need to perform exclusive control between the threads 21 and 22 for access to the storage area 26. By contrast, in the case where individual storage areas 26 and 27 are allocated to the threads 21 and 22, respectively, the thread 21 stores the values obtained by the calculation of the submatrix 24 in the storage area 26, whereas the thread 22 stores the values obtained by the calculation of the submatrix 25 in the storage area 27. Then, the vectors stored in the storage areas 26 and 27 are added to thereby obtain a final result.
- Access conflict between the threads 21 and 22 becomes a problem especially in the following matrix operations. For example, there is a matrix operation that multiplies the matrix 23 represented in the compressed column storage format by a column vector. In the case where the matrix is represented in the compressed column storage format, dividing the columns of the matrix makes the operation more efficient. Therefore, in this matrix operation, there is a problem of access conflict to the same row between the threads 21 and 22. Another matrix operation is, for example, to multiply a row vector by the matrix 23 represented in the compressed row storage format. In the case where the matrix is represented in the compressed row storage format, dividing the rows of the matrix makes the operation more efficient. Therefore, there is a problem of access conflict to the same column between the threads 21 and 22.
- Further, for example, there is yet another matrix operation in which the matrix 23 is a symmetric matrix, the calculation of a submatrix at a symmetrical position to the submatrix 24 is also assigned to the thread 21, and the calculation of a submatrix at a symmetrical position to the submatrix 25 is also assigned to the thread 22. In this matrix operation, there is a problem of conflict between the threads 21 and 22 with respect to either the lower triangle part or the upper triangle part of the matrix 23. In the case of multiplying the matrix 23 by a column vector, the problem is access conflict to the same row of the lower triangle submatrix between the threads 21 and 22. On the other hand, in the case of multiplying a row vector by the matrix 23, the problem is access conflict to the same column of the upper triangle submatrix between the threads 21 and 22.
- According to the first embodiment, the information processing apparatus 10 compares the distribution of non-zero elements of the submatrix 24 with the distribution of non-zero elements of the submatrix 25. Then, the information processing apparatus 10 determines allocation of storage areas to the threads 21 and 22 according to the result of the comparison. This avoids access conflict to the same element in the same storage area (for example, this prevents access conflict), and removes the load of exclusive control between the threads 21 and 22. In addition, while preventing an increase in the load due to exclusive control (for example, while eliminating the need for exclusive control), it is possible to reduce the size of the storage area to be prepared in the memory 14. Accordingly, it is possible to efficiently use the storage area of the memory 14 in the matrix operation. -
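As an illustrative sketch of this decision (not code from the embodiments; the helper name can_share is assumed), two threads may share one storage area when the row sets touched by their submatrices are disjoint:

```python
# Hypothetical sketch of the storage-allocation decision: if the two
# submatrices have no non-zero element in any common row, the threads
# never add to the same element of a shared column vector, so one
# shared storage area is safe and no exclusive control is needed.

def can_share(nonzero_rows_a, nonzero_rows_b):
    """True if a single shared storage area is safe for both threads."""
    return not (set(nonzero_rows_a) & set(nonzero_rows_b))

print(can_share({1, 2, 4}, {5, 6}))  # True: disjoint rows, share one area
print(can_share({1, 2, 4}, {2, 3}))  # False: row 2 conflicts, use two areas
```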
FIG. 2 is a block diagram illustrating an example of the hardware configuration of an information processing apparatus. - An
information processing apparatus 100 of the second embodiment is a computer that is capable of performing large-scale matrix operations, and is, for example, a server computer that performs a matrix operation in response to a user request. - The
information processing apparatus 100 includes a plurality of CPUs, including CPUs 110, 110 a, 110 b, and 110 c, and a RAM 120. The plurality of CPUs and the RAM 120 are connected to a system bus 136. In addition, the information processing apparatus 100 includes a Hard Disk Drive (HDD) 131, a video signal processing unit 132, an input signal processing unit 133, a media reader 134, and a communication interface 135. The HDD 131, video signal processing unit 132, input signal processing unit 133, media reader 134, and communication interface 135 are connected to an input/output bus 137. The system bus 136 and the input/output bus 137 are connected with a bridge, for example. - The
CPUs 110, 110 a, 110 b, and 110 c are processor packages that execute programs. The CPUs 110, 110 a, 110 b, and 110 c load at least some of the instructions of a program or data from the HDD 131 to the RAM 120, and execute the program. Each CPU includes a plurality of cores and a cache memory. - As an example, the
CPU 110 includes a plurality of cores, including cores 111 to 114, and a cache memory 115. The cores 111 to 114 execute threads physically in parallel to one another. The cache memory 115 is a volatile memory for temporarily storing the instructions of a program and data read from the RAM 120, and is, for example, a Static Random Access Memory (SRAM). A cache memory may be provided for each core. - The
RAM 120 is a shared memory that is accessed from the CPUs 110, 110 a, 110 b, and 110 c via the high-speed system bus 136. The RAM 120 temporarily stores the instructions of a program and data. In this connection, the information processing apparatus 100 may be provided with a type of volatile memory other than RAM, or with a plurality of memories. - The
HDD 131 is a non-volatile storage device for storing software programs, such as an Operating System (OS) and application software programs, and data. In this connection, the information processing apparatus 100 may be provided with another type of storage device, such as a flash memory or a Solid State Drive (SSD), or with a plurality of non-volatile storage devices. - The video
signal processing unit 132 outputs images to a display 141 connected to the information processing apparatus 100, according to instructions from any of the CPUs. As the display 141, a Cathode Ray Tube (CRT) display, a Liquid Crystal Display (LCD), a Plasma Display Panel (PDP), an Organic Electro-Luminescence (OEL) display, or another type of display may be used. - The input
signal processing unit 133 receives an input signal from an input device 142 connected to the information processing apparatus 100, and outputs the input signal to any one of the CPUs. As the input device 142, a pointing device such as a mouse, touch panel, touchpad, or trackball, a keyboard, a remote controller, a button switch, or another device may be used. In addition, plural types of input devices may be connected to the information processing apparatus 100. - The
media reader 134 is a driving device that reads programs and data from a recording medium 143. As the recording medium 143, for example, a magnetic disk such as a flexible disk (FD) or an HDD, an optical disc such as a Compact Disc (CD) or Digital Versatile Disc (DVD), a Magneto-Optical disk (MO), or a semiconductor memory may be used. The media reader 134 stores, for example, the instructions of a program and data read from the recording medium 143 in the RAM 120 or the HDD 131. - The
communication interface 135 is connected to a network 144, and enables communication with another information processing apparatus via the network 144. The communication interface 135 may be a wired communication interface that is connected with a cable to a communication device such as a switch or a router, or a wireless communication interface that is connected to a wireless base station. - In this connection, the
information processing apparatus 100 may be configured without the media reader 134. Further, if the user is able to control the information processing apparatus 100 over the network 144 from a terminal device, the information processing apparatus 100 may be configured without the video signal processing unit 132 or the input signal processing unit 133. In addition, the display 141 and the input device 142 may be housed as one unit in the casing of the information processing apparatus 100. In this connection, the cores 111 to 114 are an example of the above-described processors 11 to 13, and the RAM 120 is an example of the above-described memory 14. - The following describes a matrix operation that is performed in the second embodiment.
-
FIG. 3 illustrates an example of multiplying a sparse matrix by a vector. - The
information processing apparatus 100 performs iterations of multiplying a matrix by an input vector. The matrix is, for example, a coefficient matrix that represents the coefficients of simultaneous equations. In the first iteration, the information processing apparatus 100 multiplies the matrix by an initial input vector. The information processing apparatus 100 processes the resulting vector of the first iteration according to a specified algorithm, and takes the result as the next input vector. In the second iteration, the information processing apparatus 100 multiplies the same matrix as used in the first iteration by the input vector obtained by processing the resulting vector of the first iteration. This matrix operation is repeated until a specific completion condition (for example, the number of iterations or the accuracy of values in the resulting vector) is satisfied. - The second embodiment uses a matrix that is symmetric and sparse (a symmetric sparse matrix), and a column vector as an input vector. Therefore, the resulting vector, which is the product of the symmetric sparse matrix and the column vector, is a column vector. For example, in the case where a 6×6 symmetric sparse matrix includes the first row of (4.0, 9.0, 0, 3.0, 0, 0), and an input vector is (0.1, 0.2, 0.3, 0.4, 0.5, 0.6), the value in the first row of the resulting vector is calculated as 3.4. In the following, a 6×6 matrix is used by way of example for simplicity of description. However, in the second embodiment, it is possible to use a large-scale symmetric sparse matrix having dimensions (the number of elements in one direction) of several tens of thousands to several tens of millions.
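The first-row value of 3.4 quoted above is a plain dot product and can be checked directly (variable names are illustrative):

```python
# First row of the 6x6 symmetric sparse matrix and the input vector
# from the example above.
first_row = [4.0, 9.0, 0.0, 3.0, 0.0, 0.0]
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

y1 = sum(a * b for a, b in zip(first_row, x))
print(round(y1, 6))  # 3.4  (= 4.0*0.1 + 9.0*0.2 + 3.0*0.4)
```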
-
FIG. 4 illustrates an example of a compressed storage format for a sparse matrix. - In the second embodiment, the lower triangle part of a symmetric sparse matrix is represented in the compressed column storage format. The upper triangle part (excluding diagonal elements) of the symmetric sparse matrix may not be included in the matrix data and may be omitted because it is reproducible from the lower triangle part.
- The matrix data includes an element array 121 (Val), a row number array 122 (Row), and a column pointer array 123 (Cp). The
element array 121 contains the values of the non-zero elements included in the lower triangle part of the symmetric sparse matrix, in column-major order. The row number array 122 stores the row numbers of the non-zero elements arranged in the element array 121. The k-th value (Row(k)) of the row number array 122 represents the row number of the k-th non-zero element. The length of the element array 121 and row number array 122 is equal to the number of non-zero elements included in the symmetric sparse matrix. The column pointer array 123 indicates the locations of the elements that start new columns in the element array 121. The k-th value (Cp(k)) of the column pointer array 123 represents the element number of the first non-zero element in the k-th column. In this connection, the length of the column pointer array 123 is longer by one than the number of columns of the symmetric sparse matrix. The end of the column pointer array 123 has a value greater by one than the maximum element number.
- In this case, the
element array 121 is (4.0, 9.0, 3.0, 11.0, 5.0, 6.0, 1.0, 8.0, 12.0, 2.0, 10.0, 7.0). Therow number array 122 is (1, 2, 4, 2, 3, 3, 4, 5, 6, 5, 6, 6). Thecolumn pointer array 123 is (1, 4, 6, 7, 10, 12, 13). For example, the second non-zero element from the top of the fourth column is specified based on theelement array 121,row number array 122, andcolumn pointer array 123 in the following manner. First, from Cp(4)=7, the element number of the first non-zero element in the fourth column is specified as 7, and therefore the element number of the second non-zero element in the fourth column is specified as 8. Then, from Val(8)=8.0 and Row(8)=5, it is specified that the second non-zero element from the top of the fourth column exists in the fifth row and has a value of 8.0. - In this connection, the upper triangle part (including diagonal elements) of the symmetric sparse matrix is reproduced by reading the
element array 121,row number array 122 andcolumn pointer array 123 as the arrays for the upper triangle part without any changes to the values. More specifically, theelement array 121 is read as an array that contains non-zero elements in row-major order, therow number array 122 is read as a column number array, and thecolumn pointer array 123 is read as a row pointer array. Thus obtained element array, column number array, and row pointer array are a representation of the upper triangle part (including diagonal elements) in the compressed row storage format. -
FIG. 5 illustrates an example of division and assignment of a sparse matrix to threads. - The
information processing apparatus 100 divides a symmetric sparse matrix into a plurality of submatrices that are then assigned to a plurality of threads, and executes the threads in parallel with a plurality of cores. In the second embodiment, the information processing apparatus 100 assigns each thread one or two or more successive columns of the lower triangle part of the symmetric sparse matrix. In addition, the information processing apparatus 100 assigns each thread one or two or more rows of the upper triangle part (excluding diagonal elements) of the symmetric sparse matrix. In the assignment, the j-th column of the lower triangle part and the j-th row of the upper triangle part (excluding diagonal elements), which are at symmetrical positions, are assigned to the same thread. - Since the symmetric sparse matrix is represented in the compressed column storage format, the
information processing apparatus 100 first determines a mapping between the columns of the lower triangle part and the threads. A mapping between the rows of the upper triangle part (excluding diagonal elements) and the threads is determined automatically as a result. At this time, the information processing apparatus 100 assigns the threads as equal a number of non-zero elements as possible. For example, assuming that the length of the element array 121 is 12 and there are four threads that are executable in parallel, the information processing apparatus 100 divides the element array 121 such that each thread handles a number of non-zero elements as close to 12/4=3 as possible. Referring to the example of FIG. 5, the first column of the lower triangle part is assigned to a thread # 1, the second and third columns are assigned to a thread # 2, the fourth column is assigned to a thread # 3, and the fifth and sixth columns are assigned to a thread # 4. - After dividing the symmetric sparse matrix, the
information processing apparatus 100 generates a thread pointer array 124 (Bp). The thread pointer array 124 contains the column numbers identifying the first columns of the column groups assigned to the respective threads. Sequential thread numbers are given to the plurality of threads. The k-th value (Bp(k)) of the thread pointer array 124 represents the first column of the column group assigned to the thread #k. In this connection, the length of the thread pointer array 124 is longer by one than the number of threads. The end of the thread pointer array 124 contains a value greater by one than the number of columns of the symmetric sparse matrix. -
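The division of FIG. 5 can be reproduced with a simple greedy heuristic; the document does not spell out the exact balancing rule, so the cut-off below and the name assign_columns are assumptions. The function takes the Cp array from FIG. 4 and returns the Bp array described above:

```python
def assign_columns(cp, num_threads):
    """Greedy column split: start a new thread group once the running
    non-zero count reaches total/num_threads (a balancing heuristic,
    not necessarily the document's exact rule).  cp is 1-based as in
    the text; the returned Bp array is 1-based as well."""
    ncols = len(cp) - 1
    target = (cp[-1] - cp[0]) / num_threads
    bp, acc = [1], 0
    for j in range(1, ncols + 1):
        acc += cp[j] - cp[j - 1]          # non-zeros in column j
        if acc >= target and len(bp) < num_threads:
            bp.append(j + 1)              # next group starts at j+1
            acc = 0
    bp.append(ncols + 1)                  # one past the last column
    return bp

print(assign_columns([1, 4, 6, 7, 10, 12, 13], 4))  # [1, 2, 4, 5, 7]
```

With the example arrays, thread #1 gets column 1, thread #2 gets columns 2-3, thread #3 gets column 4, and thread #4 gets columns 5-6, matching the assignment described for FIG. 5.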
FIG. 6 illustrates an example of parallel calculation for multiplying a sparse matrix by a vector. - The
element array 121, row number array 122, column pointer array 123, and thread pointer array 124 are stored in the RAM 120. In addition, a work area 127 (Work), an input vector 128 (X), and a resulting vector 129 (Y) are stored in the RAM 120. - The
work area 127 includes one or two or more work vectors, which are column vectors for storing intermediate results. Values obtained by each thread with respect to a submatrix of the lower triangle part are stored in the work area 127. On the other hand, values obtained by each thread with respect to the upper triangle part (excluding diagonal elements) are stored directly in the resulting vector 129. After the execution of the plurality of threads is complete, the one or two or more work vectors are added to the resulting vector 129. The following describes the case where one work vector is prepared for each thread in the work area 127. - With respect to the lower triangle part, the
thread # 1 adds the product of the non-zero element in row 1 and column 1 and the first row of the input vector 128 to the element in the first row of the work vector allocated to the thread #1 (i.e., the element in row 1 and column 1) in the work area 127. Similarly, the thread # 1 adds the product of the non-zero element in row 2 and column 1 and the first row of the input vector 128 to the element in row 2 and column 1 of the work area 127, and adds the product of the non-zero element in row 4 and column 1 and the first row of the input vector 128 to the element in row 4 and column 1 of the work area 127. With respect to the upper triangle part (excluding diagonal elements), the thread # 1 adds the product of the non-zero element in row 1 and column 2 and the second row of the input vector 128 to the element in the first row of the resulting vector 129, and adds the product of the non-zero element in row 1 and column 4 and the fourth row of the input vector 128 to the element in the first row of the resulting vector 129. - With respect to the lower triangle part, the
thread # 2 adds the product of the non-zero element in row 2 and column 2 and the second row of the input vector 128 to the element in row 2 and column 2 of the work area 127. The thread # 2 also adds the product of the non-zero element in row 3 and column 2 and the second row of the input vector 128 to the element in row 3 and column 2 of the work area 127, and adds the product of the non-zero element in row 3 and column 3 and the third row of the input vector 128 to the element in row 3 and column 2 of the work area 127. With respect to the upper triangle part (excluding diagonal elements), the thread # 2 adds the product of the non-zero element in row 2 and column 3 and the third row of the input vector 128 to the element in the second row of the resulting vector 129. - Similarly, the
thread # 3 adds values to the elements in row 4 and column 3, row 5 and column 3, and row 6 and column 3 of the work area 127, and adds a value to the element in the fourth row of the resulting vector 129. The thread # 4 adds values to the elements in row 5 and column 4 and row 6 and column 4 of the work area 127, and adds a value to the element in the fifth row of the resulting vector 129. After the execution of the threads # 1 to #4 is complete, the information processing apparatus 100 adds the four work vectors included in the work area 127 to the resulting vector 129. The resulting vector 129 thus obtained represents the product of the symmetric sparse matrix and the input vector 128. - In the above example, different work vectors are allocated to different threads. In this method, however, many elements in the
work area 127 remain zero because no value is added to them. For example, referring to FIG. 6, the values in row 4 and column 2, row 5 and column 2, and row 6 and column 2 of the work area 127 remain zero. Even if the locations in row 4 and column 2, row 5 and column 2, and row 6 and column 2 are used by the thread # 3, no access conflict occurs between the threads # 2 and #3. In addition, each work vector is added to the resulting vector 129. Therefore, although the values to be added need to be in the proper rows, the final result is not affected by which columns these values exist in. Considering this, the information processing apparatus 100 of the second embodiment is designed to allocate a shared work vector of the work area 127 to two or more threads if no access conflict occurs. -
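Putting the arrays of FIG. 5 and the steps of FIG. 6 together, the multiplication can be sketched sequentially in Python, with one work vector per thread as in the example above. This is an illustrative reading of the scheme, not the document's implementation, and the name sym_spmv is assumed:

```python
def sym_spmv(val, row, cp, bp, x):
    """Multiply a symmetric sparse matrix (lower triangle in compressed
    column storage, 1-based arrays) by column vector x, simulating the
    per-thread split: lower-triangle products go into each thread's own
    work vector, upper-triangle products go straight into the result."""
    n = len(x)
    nthreads = len(bp) - 1
    work = [[0.0] * n for _ in range(nthreads)]   # one work vector each
    y = [0.0] * n                                 # resulting vector
    for t in range(nthreads):
        for j in range(bp[t] - 1, bp[t + 1] - 1):      # 0-based column
            for k in range(cp[j] - 1, cp[j + 1] - 1):  # its non-zeros
                i = row[k] - 1
                work[t][i] += val[k] * x[j]            # lower triangle
                if i != j:                             # mirrored element
                    y[j] += val[k] * x[i]              # upper triangle
    for w in work:                                     # final reduction
        for i in range(n):
            y[i] += w[i]
    return y

val = [4.0, 9.0, 3.0, 11.0, 5.0, 6.0, 1.0, 8.0, 12.0, 2.0, 10.0, 7.0]
row = [1, 2, 4, 2, 3, 3, 4, 5, 6, 5, 6, 6]
cp = [1, 4, 6, 7, 10, 12, 13]
bp = [1, 2, 4, 5, 7]
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
print([round(v, 6) for v in sym_spmv(val, row, cp, bp, x)])
# [3.4, 4.6, 2.8, 11.9, 10.2, 14.0]
```

In an actual parallel run, each value of t would be handled by its own thread; because every thread owns its work vector, no exclusive control is needed during the loop.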
FIG. 7 illustrates an example of a non-zero element map. - The
information processing apparatus 100 analyzes a symmetric sparse matrix to determine how many work vectors to prepare and how the work vectors are allocated to threads. In the analysis of the symmetric sparse matrix, theinformation processing apparatus 100 divides the rows of the symmetric sparse matrix into a plurality of row groups. Preferably, the groups have as equal a number of rows as possible. Theinformation processing apparatus 100 previously determines the number of groups (the number of divisions), or changes the number of divisions according to the dimensions of the symmetric sparse matrix. For example, the number of divisions may be set to 100 with respect to a symmetric sparse matrix having the dimensions of about several tens of thousands. - Then, with respect to the lower triangle part of the symmetric sparse matrix, the
information processing apparatus 100 confirms whether a non-zero element exists or not in each block obtained by partitioning the lower triangle part in row direction and column direction, and generates a non-zero element map 125 (Map) representing the distribution of non-zero elements. The rows of thenon-zero element map 125 correspond to the above row groups, and the columns of thenon-zero element map 125 correspond to threads (that is, the column groups of the symmetric sparse matrix assigned to the threads). In thenon-zero element map 125, the state of each block is represented by one-bit flag. A flag of “1” indicates that there is at least one non-zero element in the block. A flag of “0” indicates that there is no non-zero element in the block. - For example, a 6×4
non-zero element map 125 is generated from the lower triangle part of a 6×6 symmetric sparse matrix that is partitioned as illustrated in FIG. 5. In this case, one row of the non-zero element map 125 corresponds to one row of the symmetric sparse matrix. For example, a flag of “1” is contained in the locations in row 1 and column 1, row 2 and column 1, row 4 and column 1, row 2 and column 2, row 3 and column 2, row 4 and column 3, row 5 and column 3, row 6 and column 3, row 5 and column 4, and row 6 and column 4 of the non-zero element map 125. A flag of “0” is contained in the other locations. By comparing the columns of the non-zero element map 125 with each other, a combination of threads to which a shared work vector is allocable is detected. -
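The column comparison described above can be sketched in code. The following Python fragment is illustrative only (the patent defines no program code; the names map125 and compatible are assumptions): it encodes the non-zero element map of FIG. 7 and tests whether two threads may share a work vector.

```python
# Non-zero element map of FIG. 7. Rows are row groups of the symmetric
# sparse matrix; columns correspond to threads #1 to #4.
map125 = [
    [1, 0, 0, 0],  # row group 1
    [1, 1, 0, 0],  # row group 2
    [0, 1, 0, 0],  # row group 3
    [1, 0, 1, 0],  # row group 4
    [0, 0, 1, 1],  # row group 5
    [0, 0, 1, 1],  # row group 6
]

def compatible(map_, i, j):
    """Threads i and j (0-based columns) may share a work vector when no
    row group has a flag of 1 in both columns."""
    return all(not (row[i] and row[j]) for row in map_)

# Threads #1 and #4 do not overlap, nor do threads #2 and #3.
print(compatible(map125, 0, 3))  # True
print(compatible(map125, 1, 2))  # True
print(compatible(map125, 0, 1))  # False (both have a flag in row group 2)
```

The check is the logical AND described at step S43 of FIG. 17, expressed over plain lists.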
FIG. 8 illustrates another example of the non-zero element map. - The following describes the case of executing eight threads in parallel in order to process a 2534×2534 symmetric sparse matrix. Assuming that the rows are divided into eight row groups, the number of rows in each group is 317 (obtained by rounding up 2534/8 to the nearest integer). However, the last group contains 315 rows because of the rounding process. The
information processing apparatus 100 generates an 8×8 non-zero element map 125, and confirms whether or not at least one non-zero element exists in each block. In a large-scale symmetric sparse matrix, non-zero elements tend to concentrate in the vicinity of the diagonal. In this case, as illustrated in FIG. 8, the diagonal elements and some elements adjacent to the diagonal elements in the non-zero element map 125 have a flag of “1”, and many other elements in the non-zero element map 125 have a flag of “0”. -
FIG. 9 illustrates an example of allocation of work vectors. - As described earlier, the
information processing apparatus 100 searches for a combination of threads to which a shared work vector is allocable, with reference to the non-zero element map 125. More specifically, the information processing apparatus 100 searches the non-zero element map 125 for a combination of columns that do not have a flag of “1” in the same row. Referring to the example of the non-zero element map 125 of FIG. 7, neither the combination of the first and fourth columns nor the combination of the second and third columns has a conflict regarding a flag of “1”. - In this case, for example, the
information processing apparatus 100 allocates a work vector 1 (the first column of the work area 127) to the threads #1 and #4, and allocates a work vector 2 (the second column of the work area 127) to the threads #2 and #3. That is to say, only two work vectors need to be prepared for the four threads (threads #1 to #4) in the work area 127. After the allocation of the work vectors to the respective threads, the information processing apparatus 100 generates a work pointer array 126 (Up) that contains the vector numbers of the work vectors to be used by the threads. The k-th value (Up(k)) of the work pointer array 126 represents the work vector to be used by the thread #k. -
FIG. 10 is a first diagram illustrating another example of allocation of work vectors. - The following describes the case of executing eight threads in parallel in order to process a symmetric sparse matrix and dividing the rows of the symmetric sparse matrix into eight row groups. Non-zero elements exist in the
groups 1 and 2 of a thread #1, the groups 2 and 3 of a thread #2, the groups 3 and 4 of a thread #3, the groups 4 and 5 of a thread #4, the groups 5 and 6 of a thread #5, the groups 6 and 7 of a thread #6, the groups 7 and 8 of a thread #7, and the group 8 of a thread #8. Non-zero elements do not exist in the other groups. - In this case, even if a shared work vector is allocated to the
threads #1, #3, #5, and #7, access conflict to the same element of a work vector among these threads will not occur. In addition, even if a shared work vector is allocated to the threads #2, #4, #6, and #8, access conflict to the same element of a work vector among these threads will not occur. Therefore, for example, the information processing apparatus 100 allocates a work vector 1 to the threads #1, #3, #5, and #7, and allocates a work vector 2 to the threads #2, #4, #6, and #8. That is to say, only two work vectors need to be prepared for these eight threads (threads #1 to #8). - As described above, a shared work vector may be allocated to three or more threads. In many large-scale symmetric sparse matrices, determining appropriate combinations of threads makes it possible to reduce the number of work vectors prepared in the
work area 127 to two or three. -
FIG. 11 is a second diagram illustrating another example of allocation of work vectors. - The following describes the case of executing eight threads in parallel in order to process a symmetric sparse matrix and dividing the rows of the symmetric sparse matrix into eight row groups. Non-zero elements exist in the
groups 1, 2, and 4 of a thread #1, the groups 2, 3, and 6 of a thread #2, the groups 3, 4, and 8 of a thread #3, the groups 4 and 5 of a thread #4, the groups 5 and 6 of a thread #5, the groups 6 and 7 of a thread #6, the groups 7 and 8 of a thread #7, and the group 8 of a thread #8. Non-zero elements do not exist in the other groups. - In this case, for example, the
information processing apparatus 100 allocates a work vector 1 to the threads #1, #5, and #7, allocates a work vector 2 to the threads #2, #4, and #8, and allocates a work vector 3 to the threads #3 and #6. That is to say, only three work vectors need to be prepared in the work area 127 for the eight threads (threads #1 to #8). -
FIG. 12 is a third diagram illustrating another example of allocation of work vectors. - Similarly to
FIG. 10, the following describes the case of executing eight threads in parallel in order to process a symmetric sparse matrix and dividing the rows of the symmetric sparse matrix into eight row groups. Non-zero elements exist in the groups 1, 3, and 7 of a thread #1, the groups 2 and 6 of a thread #2, the groups 3 and 5 of a thread #3, the groups 4 and 8 of a thread #4, the groups 5 and 7 of a thread #5, the group 6 of a thread #6, the group 7 of a thread #7, and the group 8 of a thread #8. Non-zero elements do not exist in the other groups. - In this case, for example, the
information processing apparatus 100 allocates a work vector 1 to the threads #1, #2, and #4, allocates a work vector 2 to the threads #3, #6, #7, and #8, and allocates a work vector 3 to the thread #5. That is to say, only three work vectors need to be prepared in the work area 127 for the eight threads (threads #1 to #8). - By the way, combinations of threads that do not cause an access conflict may be found with various kinds of algorithms. The combination examples of
FIGS. 9 to 12 are detected by sequentially comparing the already prepared work vectors with the columns of the non-zero element map 125 in order from the left column to find a work vector where a conflict regarding the flag of “1” does not occur, and preparing a new work vector if there is no such work vector. - For example, referring to
FIG. 12, the information processing apparatus 100 prepares the work vector 1 and allocates the work vector 1 to the thread #1. Then, since it is possible to add the thread #2 to the work vector 1, the information processing apparatus 100 allocates the work vector 1 to the thread #2. Then, since it is not possible to add the thread #3 to the work vector 1, the information processing apparatus 100 prepares the work vector 2 and allocates the work vector 2 to the thread #3. - Then, since it is possible to add the
thread #4 to the work vector 1, the information processing apparatus 100 allocates the work vector 1 to the thread #4. Then, since it is not possible to add the thread #5 to either of the work vectors 1 and 2, the information processing apparatus 100 prepares the work vector 3 and allocates the work vector 3 to the thread #5. Then, since it is not possible to add the thread #6 to the work vector 1 but it is possible to add the thread #6 to the work vector 2, the information processing apparatus 100 allocates the work vector 2 to the thread #6. Subsequently, the information processing apparatus 100 allocates the work vector 2 to the threads #7 and #8 in the same way. - It should be noted that, if the distribution of non-zero elements in a symmetric sparse matrix is complex, the number of work vectors prepared in the
work area 127 may differ depending on the search algorithm employed. -
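This order dependence can be illustrated with a small sketch. The following Python fragment is illustrative only (the function name first_fit and the set-based representation of the row groups are assumptions): a first-fit search in the style of FIGS. 9 to 12 is applied to the same thread data in two different orders and yields different work vector counts.

```python
# First-fit allocation sketch: each thread is paired with the set of row
# groups in which its submatrix has non-zero elements.
def first_fit(thread_groups):
    vectors = []   # union of row groups covered by each work vector
    alloc = {}     # thread number -> allocated work vector (1-based)
    for t, rows in thread_groups:
        for v, covered in enumerate(vectors):
            if not (covered & rows):   # no flag-1 overlap: share this vector
                covered |= rows
                alloc[t] = v + 1
                break
        else:                          # conflicts everywhere: new work vector
            vectors.append(set(rows))
            alloc[t] = len(vectors)
    return len(vectors), alloc

# Assumed row-group sets for six threads (chosen to exhibit the effect).
groups = {1: {1, 3, 8}, 2: {2, 5}, 3: {3, 4, 7, 8},
          4: {4, 5, 7}, 5: {6}, 6: {6}}

# Processing the threads in order #1..#6 needs three work vectors...
n, _ = first_fit([(t, groups[t]) for t in [1, 2, 3, 4, 5, 6]])
print(n)  # 3
# ...while another processing order finds a two-vector allocation.
n, _ = first_fit([(t, groups[t]) for t in [1, 4, 5, 2, 3, 6]])
print(n)  # 2
```

Both runs use the same greedy rule; only the visiting order changes, which is one way different search algorithms arrive at different vector counts.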
FIG. 13 is a fourth diagram illustrating another example of allocation of work vectors. - This example describes the case of executing six threads in order to process a symmetric sparse matrix and dividing the rows of the symmetric sparse matrix into eight row groups. Non-zero elements exist in the
groups 1, 3, and 8 of a thread #1, the groups 2 and 5 of a thread #2, the groups 3, 4, 7, and 8 of a thread #3, the groups 4, 5, and 7 of a thread #4, the group 6 of a thread #5, and the group 6 of a thread #6. - According to a certain search algorithm (for example, the aforementioned algorithm), for example, the
information processing apparatus 100 allocates a work vector 1 to the threads #1, #2, and #5, a work vector 2 to the threads #3 and #6, and a work vector 3 to the thread #4. That is, three work vectors are prepared in the work area 127. On the other hand, according to another search algorithm, for example, the information processing apparatus 100 allocates a work vector 1 to the threads #1, #4, and #5, and a work vector 2 to the threads #2, #3, and #6. That is, two work vectors, fewer than with the former search algorithm, are prepared in the work area 127. - In general, a search algorithm that minimizes the number of work vectors (an algorithm that provides an optimal solution) needs a larger amount of calculation than the other search algorithms (algorithms that provide suboptimal solutions). The
information processing apparatus 100 may be designed to select a search algorithm to be used, taking into account the balance between the accuracy of the solution and the amount of calculation. In the following description, it is assumed that the information processing apparatus 100 uses a search algorithm that provides a suboptimal solution with a small amount of calculation. - The following describes the functions and processes of the
information processing apparatus 100. -
FIG. 14 is a block diagram illustrating an example of the functions of an information processing apparatus. - The
information processing apparatus 100 includes a data storage unit 150, a matrix operation requesting unit 161, a parallelization control unit 162, a parallel processing unit 165, and an OS 168. The data storage unit 150 is implemented as a storage area saved in the RAM 120. The matrix operation requesting unit 161, parallelization control unit 162, and parallel processing unit 165 are implemented as software modules. The parallelization control unit 162 and parallel processing unit 165 may be implemented as a numerical calculation library. - The
data storage unit 150 includes a matrix storage unit 151, a control data storage unit 152, an intermediate data storage unit 153, and a vector storage unit 154. The matrix storage unit 151 stores matrix data representing a symmetric sparse matrix. The matrix data includes an element array 121, a row number array 122, and a column pointer array 123. The control data storage unit 152 stores control data for use in controlling parallelization. The control data includes a thread pointer array 124, a non-zero element map 125, and a work pointer array 126. The intermediate data storage unit 153 includes a work area 127. The vector storage unit 154 stores an input vector 128 and a resulting vector 129. - The matrix
operation requesting unit 161 stores the element array 121, row number array 122, column pointer array 123, and input vector 128 in the data storage unit 150, and requests the parallelization control unit 162 to multiply the symmetric sparse matrix by the input vector 128. When obtaining the resulting vector 129, the matrix operation requesting unit 161 processes the resulting vector 129 for use as the next input vector 128. The matrix operation requesting unit 161 repeatedly requests the parallelization control unit 162 to multiply the symmetric sparse matrix by the input vector 128 until the number of iterations or an operation status, such as the accuracy of the values included in the resulting vector 129, satisfies a specific completion condition. - The
parallelization control unit 162 controls the parallelization of the matrix operation. The parallelization control unit 162 includes a matrix analysis unit 163 and a vector input and output unit 164. - When a symmetric sparse matrix is specified by the matrix
operation requesting unit 161 for the first time (when the first iteration of the iterative operation is performed), the matrix analysis unit 163 analyzes the symmetric sparse matrix and determines a parallelization method. The matrix analysis unit 163 divides the symmetric sparse matrix and assigns the divisions to a plurality of threads. The number of threads for performing the matrix operation is determined, for example, based on conditions such as the amount of hardware resources of the information processing apparatus 100, the current workload of the information processing apparatus 100, a user contract, or others. In addition, the matrix analysis unit 163 determines the number of work vectors to be prepared in the work area 127, and allocates one of the work vectors to each of the plurality of threads. - The vector input and
output unit 164 initializes the work area 127 and the resulting vector 129 each time the input vector 128 is specified by the matrix operation requesting unit 161 (for each iteration of the iterative operation). Then, when the matrix operation by the plurality of threads is completed, the vector input and output unit 164 adds all the work vectors included in the work area 127 to the resulting vector 129, to thereby obtain the final solution as the product of the symmetric sparse matrix and the input vector 128. - The
parallel processing unit 165 performs the functions of a plurality of threads to be executed in parallel. The parallel processing unit 165 includes a non-zero element check unit 166 and a matrix operation unit 167. - In response to a request from the
matrix analysis unit 163, the non-zero element check unit 166 confirms the distribution of non-zero elements with respect to the submatrix assigned to each thread, and updates the flags for the columns corresponding to the thread in the non-zero element map 125. That is to say, the generation of the non-zero element map 125 is parallelized using the plurality of threads. In this connection, each thread is able to identify its assigned columns of the symmetric sparse matrix with reference to the thread pointer array 124. - In response to a request from the vector input and
output unit 164, the matrix operation unit 167 multiplies the submatrix assigned to each thread by the input vector 128. With respect to the lower triangle part of the symmetric sparse matrix, the matrix operation unit 167 stores the operation results in the work vector allocated by the matrix analysis unit 163. With respect to the upper triangle part (excluding diagonal elements) of the symmetric sparse matrix, on the other hand, the matrix operation unit 167 stores the operation results in the resulting vector 129. - In response to a request from the
matrix analysis unit 163, the OS 168 activates a plurality of threads, and assigns the plurality of threads to a plurality of cores included in the CPUs 110, 110a, 110b, and 110c. As a rule, different threads are assigned to different cores. In addition, in response to a request from the matrix analysis unit 163, the OS 168 prepares the work area 127 in the RAM 120. In this connection, it is preferable that threads or processes corresponding to the parallelization control unit 162, parallel processing unit 165, and OS 168 be performed by different cores. -
FIG. 15 is a flowchart illustrating how to control matrix operation. - (S1) The matrix
operation requesting unit 161 stores an element array 121, a row number array 122, and a column pointer array 123 that represent a symmetric sparse matrix in the matrix storage unit 151. The matrix analysis unit 163 reads the row number array 122 and column pointer array 123 from the matrix storage unit 151. - (S2) The
matrix analysis unit 163 determines how many threads to use in the matrix operation, and instructs the OS 168 to activate the determined number of threads. At this time, the matrix analysis unit 163 gives sequential thread numbers to the threads. In addition, the matrix analysis unit 163 detects the number of non-zero elements included in the symmetric sparse matrix with reference to the element array 121, and divides the columns of the symmetric sparse matrix so that the divisions contain as equal a number of non-zero elements as possible. Then, the matrix analysis unit 163 generates a thread pointer array 124 and stores it in the control data storage unit 152. - (S3) The
matrix analysis unit 163 generates a non-zero element map 125 and stores it in the control data storage unit 152. Then, the matrix analysis unit 163 calls the non-zero element check unit 166 for each thread. For each thread, the non-zero element check unit 166 updates the flags for the columns corresponding to the thread in the non-zero element map 125. This non-zero element check will be described in detail later. - (S4) The
matrix analysis unit 163 determines how many work vectors to prepare in the work area 127 and how to allocate the work vectors to the threads, with reference to the non-zero element map 125 generated at step S3. Then, the matrix analysis unit 163 generates a work pointer array 126 and stores it in the control data storage unit 152. This allocation of work vectors will be described in detail later. - (S5) The
matrix analysis unit 163 instructs the OS 168 to prepare the work area 127, including as many work vectors as determined at step S4, in the intermediate data storage unit 153. Then, the matrix analysis unit 163 generates an empty resulting vector 129 and stores it in the vector storage unit 154. - (S6) The matrix
operation requesting unit 161 stores the input vector 128 in the vector storage unit 154. In this connection, the initial input vector 128 that is used in the first iteration of the iterative operation may be stored in the vector storage unit 154 at step S1. The vector input and output unit 164 reads the input vector 128 from the vector storage unit 154. - (S7) The vector input and
output unit 164 calls the matrix operation unit 167 for each thread. For each thread, the matrix operation unit 167 multiplies the submatrix assigned to the thread by the input vector 128. The vector input and output unit 164 adds all the work vectors included in the work area 127 to the resulting vector 129. The resulting vector 129 thus obtained represents the product of the symmetric sparse matrix and the input vector 128. This parallel matrix operation will be described in detail later. - (S8) The matrix
operation requesting unit 161 reads the resulting vector 129 from the vector storage unit 154. Then, the matrix operation requesting unit 161 determines whether or not the operation status satisfies a specific completion condition. If the completion condition is satisfied, the matrix operation requesting unit 161 completes the iterative operation. If not, the matrix operation requesting unit 161 uses the read resulting vector 129 to generate the next input vector 128, and the process proceeds back to step S6. -
FIG. 16 is a flowchart illustrating how to conduct the non-zero element check. This non-zero element check is conducted at step S3 of the flowchart of FIG. 15. - (S30) The
matrix analysis unit 163 divides the rows of the symmetric sparse matrix into a plurality of row groups. For example, assuming that the number of groups (the number of divisions) is already determined, the matrix analysis unit 163 calculates the width (the number of rows in one group) using the equation: width w=(the number of rows in the symmetric sparse matrix+the number of divisions−1)/the number of divisions. In the following description, decimal values are truncated in the division. - (S31) The
matrix analysis unit 163 generates a non-zero element map 125 with as many rows as the number of divisions and as many columns as the number of threads, and stores the non-zero element map 125 in the control data storage unit 152. At this time, the matrix analysis unit 163 initializes all the elements in the non-zero element map 125 to zero. The matrix analysis unit 163 then calls the non-zero element check unit 166 for each thread. Then, the following steps S32 to S38 are executed in parallel by the plurality of threads. The following describes the case where the thread (thread #t) with a thread number of t executes steps S32 to S38. - (S32) The non-zero
element check unit 166 selects the first column of the column group assigned to the thread #t with reference to the thread pointer array 124. More specifically, the non-zero element check unit 166 identifies the column number c=Bp(t). - (S33) The non-zero
element check unit 166 selects the first non-zero element in the column selected at step S32 or S38, with reference to the column pointer array 123. More specifically, the non-zero element check unit 166 identifies the element number e=Cp(c). - (S34) The non-zero
element check unit 166 sets a flag of “1” for the non-zero element selected at step S33 or S36 in the non-zero element map 125. More specifically, the non-zero element check unit 166 sets Map((Row(e)−1)/w+1, t)=1. - (S35) The non-zero
element check unit 166 determines, with reference to the column pointer array 123, whether or not all the non-zero elements in the column selected at step S32 or S38 have been selected. More specifically, the non-zero element check unit 166 determines whether or not the element number e identified at step S33 or S36 matches Cp(c+1)−1. If all the elements have been selected, the process proceeds to step S37. If there is any unselected element, the process proceeds to step S36. - (S36) The non-zero
element check unit 166 selects the next non-zero element. More specifically, the non-zero element check unit 166 increments the element number e (adds one to the current element number e). Then, the process proceeds to step S34. - (S37) The non-zero
element check unit 166 determines, with reference to the thread pointer array 124, whether or not all the columns assigned to the thread #t have been selected. More specifically, the non-zero element check unit 166 determines whether or not the column number c matches Bp(t+1)−1. If all the columns have been selected, the non-zero element check unit 166 completes the non-zero element check and notifies the matrix analysis unit 163 of the completion. If there is any unselected column, the process proceeds to step S38. - (S38) The non-zero
element check unit 166 selects the next column. More specifically, the non-zero element check unit 166 increments the column number c (adds one to the current column number c). Then, the process proceeds to step S33. -
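The non-zero element check of steps S30 to S38 can be sketched as follows in Python. The sketch is illustrative only: it runs the per-thread loops sequentially rather than in parallel, uses 1-based arrays with a dummy entry at index 0 to mirror the notation of the text, and the concrete compressed column storage arrays below are assumed values, not taken from the embodiment.

```python
# Build the non-zero element map (Map) from the row number array (Row),
# the column pointer array (Cp), and the thread pointer array (Bp).
def build_map(Row, Cp, Bp, n_rows, n_div):
    w = (n_rows + n_div - 1) // n_div          # width of one row group (S30)
    n_threads = len(Bp) - 2
    Map = [[0] * (n_threads + 1) for _ in range(n_div + 1)]   # S31
    for t in range(1, n_threads + 1):          # executed in parallel per thread
        for c in range(Bp[t], Bp[t + 1]):      # columns of thread #t (S32/S38)
            for e in range(Cp[c], Cp[c + 1]):  # non-zero elements of column c
                Map[(Row[e] - 1) // w + 1][t] = 1             # S34
    return Map

# Assumed 4x4 lower triangle with non-zeros at (1,1), (3,1), (2,2), (4,3),
# and (4,4), stored in compressed column storage; two threads with two
# columns each, and two row groups of width 2.
Row = [0, 1, 3, 2, 4, 4]           # row numbers of the non-zero elements
Cp  = [0, 1, 3, 4, 5, 6]           # first element number of each column
Bp  = [0, 1, 3, 5]                 # thread #1: columns 1-2, #2: columns 3-4
Map = build_map(Row, Cp, Bp, n_rows=4, n_div=2)
print(Map[1][1], Map[2][1])  # 1 1  (thread #1 touches both row groups)
print(Map[1][2], Map[2][2])  # 0 1  (thread #2 touches only row group 2)
```

The integer expression (n_rows + n_div - 1) // n_div reproduces the truncated division of step S30, rounding the group width up.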
FIG. 17 is a flowchart illustrating how to allocate work vectors. This allocation of work vectors is performed at step S4 of the flowchart of FIG. 15. - (S40) The
matrix analysis unit 163 determines to prepare one work vector, and allocates the prepared work vector to the thread #1. More specifically, the matrix analysis unit 163 sets the vector count n to one as a variable, and sets Up(1)=1. - (S41) The
matrix analysis unit 163 selects the thread #2. More specifically, the matrix analysis unit 163 sets the thread number t to two as a variable. - (S42) The
matrix analysis unit 163 selects the first one of the currently existing work vectors. More specifically, the matrix analysis unit 163 sets the vector number v to one as a variable. - (S43) The
matrix analysis unit 163 compares the v-th column with the t-th column in the non-zero element map 125 to determine whether or not they overlap in the distribution of flag=1. More specifically, the matrix analysis unit 163 calculates the logical AND of the v-th column vector and the t-th column vector, and determines whether both the v-th column vector and t-th column vector have a flag of “1” in the same row group. If there is an overlap in the distribution of flag=1, the process proceeds to step S46. If not, the process proceeds to step S44. - (S44) The
matrix analysis unit 163 allocates the work vector selected at step S42 or S47 to the thread selected at step S41 or S52. More specifically, the matrix analysis unit 163 sets Up(t)=v. - (S45) The
matrix analysis unit 163 copies the flags of “1” in the t-th column to the v-th column in the non-zero element map 125. Then, the process proceeds to step S51. - (S46) The
matrix analysis unit 163 determines whether or not all the currently existing work vectors have been selected. More specifically, the matrix analysis unit 163 determines whether or not the vector number v matches the vector count n. If all the work vectors have been selected, the process proceeds to step S48. If there is any unselected work vector, the process proceeds to step S47. - (S47) The
matrix analysis unit 163 selects the next work vector from the currently existing work vectors. More specifically, the matrix analysis unit 163 increments the vector number v (adds one to the current vector number v). Then, the process proceeds to step S43. - (S48) The
matrix analysis unit 163 adds one work vector. More specifically, the matrix analysis unit 163 increments the vector count n (adds one to the current vector count n). - (S49) The
matrix analysis unit 163 allocates the work vector newly added at step S48 to the thread selected at step S41 or S52. More specifically, the matrix analysis unit 163 sets Up(t)=n. - (S50) The
matrix analysis unit 163 overwrites the flags of the n-th column with the flags of the t-th column in the non-zero element map 125. Thereby, the n-th column vector becomes identical to the t-th column vector. - (S51) The
matrix analysis unit 163 determines whether or not all the threads have been selected. More specifically, the matrix analysis unit 163 determines whether or not the thread number t matches the number of threads. If all the threads have been selected, the matrix analysis unit 163 completes the allocation of work vectors. As a result, the number of work vectors to be prepared in the work area 127 and the contents of the work pointer array 126 are fixed. If there is any unselected thread, the process proceeds to step S52. - (S52) The
matrix analysis unit 163 selects the next thread. More specifically, the matrix analysis unit 163 increments the thread number t (adds one to the current thread number t). Then, the process proceeds to step S42. -
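The allocation of steps S40 to S52 can be sketched as follows in Python. The sketch is illustrative only (the function name allocate and the 1-based padding are assumptions); applied to the non-zero element map of FIG. 7, it reproduces the allocation of FIG. 9.

```python
# Allocate work vectors using the non-zero element map (Map), reusing the
# map columns to track the coverage of each work vector, as in FIG. 17.
# Returns the vector count n and the work pointer array Up (Up[t] is the
# work vector of thread #t). Index 0 of every array is a dummy entry.
def allocate(Map, n_threads):
    Up = [0] * (n_threads + 1)
    n = 1                                     # S40: one vector for thread #1
    Up[1] = 1
    for t in range(2, n_threads + 1):         # S41/S52: next thread
        for v in range(1, n + 1):             # S42/S47: next existing vector
            if not any(row[v] and row[t] for row in Map[1:]):   # S43
                Up[t] = v                     # S44: share work vector v
                for row in Map[1:]:           # S45: merge flags into column v
                    row[v] |= row[t]
                break
        else:
            n += 1                            # S48: add a new work vector
            Up[t] = n                         # S49
            for row in Map[1:]:               # S50: copy column t to column n
                row[n] = row[t]
    return n, Up

# Non-zero element map of FIG. 7 (rows: row groups 1-6, columns: threads
# #1 to #4), padded for 1-based indexing.
Map = [[0] * 5,
       [0, 1, 0, 0, 0], [0, 1, 1, 0, 0], [0, 0, 1, 0, 0],
       [0, 1, 0, 1, 0], [0, 0, 0, 1, 1], [0, 0, 0, 1, 1]]
n, Up = allocate(Map, 4)
print(n, Up[1:])  # 2 [1, 2, 2, 1]  (the allocation of FIG. 9)
```

Merging the flags at step S45 is what lets a third or fourth thread join an already shared work vector, since the column then records everything the vector covers.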
FIG. 18 is a flowchart illustrating how to perform the parallel matrix operation. This parallel matrix operation is performed at step S7 of the flowchart of FIG. 15. - (S70) The vector input and
output unit 164 initializes all of the elements of the work vectors in the work area 127 and the elements of the resulting vector 129 to zero. Then, the vector input and output unit 164 calls the matrix operation unit 167 for each thread. Then, the following steps S71 to S79 are executed in parallel by the plurality of threads. The following describes the case where the thread (thread #t) with a thread number of t executes steps S71 to S79. - (S71) The
matrix operation unit 167 selects the first column of the column group assigned to the thread #t with reference to the thread pointer array 124. More specifically, the matrix operation unit 167 identifies the column number c=Bp(t). - (S72) The
matrix operation unit 167 selects the first non-zero element included in the column selected at step S71 or S79, with reference to the column pointer array 123. More specifically, the matrix operation unit 167 identifies the element number e=Cp(c). - (S73) With respect to the lower triangle part, the
matrix operation unit 167 calculates the product of the value of the non-zero element selected at step S72 or S77 and the value in the c-th row of the input vector 128, and stores the product in the work vector allocated to the thread #t. More specifically, the matrix operation unit 167 adds Val(e)X(c) to Work(Row(e), Up(t)). - (S74) The
matrix operation unit 167 determines whether or not the non-zero element selected at step S72 or S77 is a diagonal element of the symmetric sparse matrix. More specifically, the matrix operation unit 167 determines whether or not Row(e) matches the column number c. If the non-zero element is a diagonal element, the process proceeds to step S76. If not, the process proceeds to step S75. - (S75) With respect to the upper triangle part (excluding diagonal elements), the
matrix operation unit 167 calculates the product of the value of the non-zero element selected at step S72 or S77 and the value of the row corresponding to the non-zero element in the input vector 128, and stores the product in the resulting vector 129. More specifically, the matrix operation unit 167 adds Val(e)X(Row(e)) to Y(c). - (S76) The
matrix operation unit 167 determines, with reference to the column pointer array 123, whether or not all the non-zero elements included in the column selected at step S71 or S79 have been selected. More specifically, the matrix operation unit 167 determines whether or not the element number e identified at step S72 or S77 matches Cp(c+1)−1. If all the non-zero elements have been selected, the process proceeds to step S78. If there is any unselected element, the process proceeds to step S77. - (S77) The
matrix operation unit 167 selects the next non-zero element. More specifically, the matrix operation unit 167 increments the element number e. Then, the process proceeds to step S73. - (S78) The
matrix operation unit 167 determines, with reference to the thread pointer array 124, whether or not all the columns assigned to the thread #t have been selected. More specifically, the matrix operation unit 167 determines whether or not the column number c matches Bp(t+1)−1. If all the columns have been selected, the matrix operation unit 167 completes the calculation of multiplying the submatrix by the input vector 128, and notifies the vector input and output unit 164 of the completion. Then, the process proceeds to step S80. If there is any unselected column, the process proceeds to step S79. - (S79) The
matrix operation unit 167 selects the next column. More specifically, the matrix operation unit 167 increments the column number c. Then, the process proceeds to step S72. - (S80) The vector input and
output unit 164 adds all the work vectors included in the work area 127 to the resulting vector 129. More specifically, the vector input and output unit 164 adds the value in row i and column j of the work area 127 to Y(i). Then, the vector input and output unit 164 notifies the matrix operation requesting unit 161 of the completion of the matrix operation. - As described above, the second embodiment describes the case of multiplying a symmetric sparse matrix by a column vector. However, the above method of allocating work vectors may be applied to the case of multiplying a row vector by a symmetric sparse matrix. In this case, the work vectors may be used to store the products of the submatrices in the upper triangle part of the symmetric sparse matrix and an
input vector 128. - Further, in the second embodiment, the symmetric sparse matrix is represented in the compressed column storage format. However, the symmetric sparse matrix may be represented in the compressed row storage format or may be represented without being compressed. In the case where submatrices at symmetrical positions are assigned to the same thread, there is a problem of access conflict between threads in either one of the upper triangle part and the lower triangle part.
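The column-vector operation of FIG. 18 (steps S70 to S80) can be sketched as follows in Python. The sketch is illustrative only: the per-thread loops run sequentially here, arrays are 1-based with a dummy entry at index 0, and the small matrix at the end is an assumed example, not one from the embodiment.

```python
# Multiply a symmetric sparse matrix (lower triangle stored in compressed
# column storage: Val, Row, Cp) by the input vector X, using the thread
# pointer array Bp and the work pointer array Up.
def multiply(Val, Row, Cp, Bp, Up, X, n_vec):
    n = len(X) - 1
    Work = [[0.0] * (n_vec + 1) for _ in range(n + 1)]   # S70: clear work area
    Y = [0.0] * (n + 1)                                  # S70: clear result
    n_threads = len(Bp) - 2
    for t in range(1, n_threads + 1):        # executed in parallel per thread
        for c in range(Bp[t], Bp[t + 1]):    # columns of thread #t (S71/S79)
            for e in range(Cp[c], Cp[c + 1]):
                Work[Row[e]][Up[t]] += Val[e] * X[c]     # S73: lower triangle
                if Row[e] != c:                          # S74: not diagonal
                    Y[c] += Val[e] * X[Row[e]]           # S75: upper triangle
    for i in range(1, n + 1):                # S80: add work vectors to result
        Y[i] += sum(Work[i][1:])
    return Y

# Assumed 3x3 symmetric matrix [[2, 1, 0], [1, 3, 4], [0, 4, 5]]; lower
# triangle non-zeros: (1,1)=2, (2,1)=1, (2,2)=3, (3,2)=4, (3,3)=5.
Val = [0, 2.0, 1.0, 3.0, 4.0, 5.0]
Row = [0, 1, 2, 2, 3, 3]
Cp  = [0, 1, 3, 5, 6]
Bp  = [0, 1, 3, 4]   # thread #1: columns 1-2, thread #2: column 3
# Thread #1 and thread #2 both touch row 3, so they receive separate
# work vectors here.
Up  = [0, 1, 2]
X   = [0, 1.0, 2.0, 3.0]
print(multiply(Val, Row, Cp, Bp, Up, X, n_vec=2))  # [0.0, 4.0, 19.0, 23.0]
```

Step S73 writes only to the work vector named by Up(t), so two threads sharing a work vector never touch the same element as long as their row-group distributions do not overlap; step S75 writes to Y(c), which lies inside the thread's own column group and therefore also cannot conflict.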
- Still further, in the second embodiment, a sparse matrix to be multiplied by a vector is a symmetric matrix. However, a sparse matrix that is not a symmetric matrix may be used. In the case of dividing the columns of the sparse matrix represented in the compressed column storage format and then multiplying the sparse matrix by a column vector, there is a problem of access conflict between threads. Likewise, in the case of dividing the rows of the sparse matrix represented in the compressed row storage format and then multiplying a row vector by the sparse matrix, there is a problem of access conflict between threads.
- According to the second embodiment, the
information processing apparatus 100 compares the submatrices assigned to a plurality of threads in terms of their distributions of non-zero elements, and allocates a shared work vector to a combination of threads whose distributions of non-zero elements do not overlap. This prevents access conflicts on the same element of a work vector, which eliminates the need for exclusive control over access to the work vectors between the plurality of threads. In addition, it is possible to reduce (for example, minimize) the number of work vectors prepared in the work area 127 without causing access conflicts. Therefore, the size of the work area 127 is suppressed even if the number of threads increases. That is to say, the information processing apparatus 100 is able to use the storage area of the RAM 120 efficiently and to multiply a large-scale sparse matrix by a vector efficiently. - As described earlier, the information processing of the first embodiment may be implemented by the
information processing apparatus 10 executing a program. In addition, the information processing of the second embodiment may be implemented by the information processing apparatus 100 executing a program. - Such a program may be recorded on a computer-readable recording medium (for example, recording medium 143). For example, as the recording medium, a magnetic disk, optical disk, magneto-optical disk, semiconductor memory, or the like may be used. Magnetic disks include FDs and HDDs. Optical disks include CDs, CD-R (Recordable), CD-RW (Rewritable), DVD, DVD-R, and DVD-RW. The program may be recorded on portable recording media and then distributed. In this case, the program may be copied (installed) from a portable recording medium to another recording medium such as an HDD (for example, HDD 131) and then executed.
- According to one aspect, it is possible to efficiently use the storage area of a memory in matrix operation.
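The memory saving comes from the work-vector allocation described for the second embodiment: threads whose submatrices have disjoint sets of non-zero rows never write the same element, so they may share one work vector. A rough greedy sketch of such an allocation (illustrative names and data, not the patent's code):

```python
def allocate_work_vectors(nonzero_rows_per_thread):
    """Greedily assign each thread an index into a pool of shared work
    vectors so that threads sharing a vector have disjoint non-zero
    row sets (and therefore never write the same element)."""
    covered_rows = []   # per work vector: rows already claimed by its users
    assignment = []     # per thread: index of the work vector it uses
    for rows in nonzero_rows_per_thread:
        for v, covered in enumerate(covered_rows):
            if not covered & rows:        # no overlap: safe to share
                covered |= rows
                assignment.append(v)
                break
        else:                             # overlaps every vector: add one
            covered_rows.append(set(rows))
            assignment.append(len(covered_rows) - 1)
    return assignment, len(covered_rows)

# Hypothetical non-zero row sets of four threads' submatrices.
rows_per_thread = [{0, 1}, {2, 3}, {1, 2}, {4}]
assignment, n_vectors = allocate_work_vectors(rows_per_thread)
print(assignment)  # [0, 0, 1, 0]
print(n_vectors)   # 2 shared work vectors instead of 4 private ones
```

After all threads finish, each work vector is added element-wise into the resulting vector, mirroring step S80 above.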
- All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Claims (8)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2013-074443 | 2013-03-29 | ||
| JP2013074443A JP6083300B2 (en) | 2013-03-29 | 2013-03-29 | Program, parallel operation method, and information processing apparatus |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20140298351A1 true US20140298351A1 (en) | 2014-10-02 |
| US9418048B2 US9418048B2 (en) | 2016-08-16 |
Family
ID=51622180
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US14/190,623 Active 2034-06-22 US9418048B2 (en) | 2013-03-29 | 2014-02-26 | Apparatus and method for allocating shared storage areas to parallel processors for multiplication of sparse matrix and vector |
Country Status (2)
| Country | Link |
|---|---|
| US (1) | US9418048B2 (en) |
| JP (1) | JP6083300B2 (en) |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9858144B2 (en) | 2015-08-20 | 2018-01-02 | National Technology & Engineering Solutions Of Sandia, Llc | Processor-in-memory-and-storage architecture |
| US9898441B2 (en) * | 2016-02-05 | 2018-02-20 | Google Llc | Matrix processing apparatus |
| US9805001B2 (en) * | 2016-02-05 | 2017-10-31 | Google Inc. | Matrix processing apparatus |
| US10726096B2 (en) * | 2018-10-12 | 2020-07-28 | Hewlett Packard Enterprise Development Lp | Sparse matrix vector multiplication with a matrix vector multiplication unit |
| US11562248B2 (en) * | 2019-04-29 | 2023-01-24 | Advanced Micro Devices, Inc. | Data sparsity monitoring during neural network training |
| WO2022115783A1 (en) * | 2020-11-30 | 2022-06-02 | Google Llc | Systolic array cells with multiple accumulators |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20070198621A1 (en) * | 2006-02-13 | 2007-08-23 | Iu Research & Technology Corporation | Compression system and method for accelerating sparse matrix computations |
| US20110191761A1 (en) * | 2010-02-04 | 2011-08-04 | International Business Machines Corporation | Control Flow Analysis Using Deductive Reaching Definitions |
| US20110307685A1 (en) * | 2010-06-11 | 2011-12-15 | Song William S | Processor for Large Graph Algorithm Computations and Matrix Operations |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP3697992B2 (en) * | 2000-01-25 | 2005-09-21 | 日本電気株式会社 | Matrix vector product operation system, matrix storage system used therefor, and method thereof |
| EP2058740A1 (en) | 2006-08-30 | 2009-05-13 | Fujitsu Limited | High-speed calculation process method of combination equation based on finite element method and boundary element method |
| JP4942095B2 (en) * | 2007-01-25 | 2012-05-30 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Technology that uses multi-core processors to perform operations |
| JP5262177B2 (en) | 2008-02-22 | 2013-08-14 | 富士通株式会社 | Vector product parallel processing method |
| JP5458621B2 (en) | 2009-03-19 | 2014-04-02 | 富士通株式会社 | Method, apparatus, and program for calculating simultaneous linear equations of sparse positive symmetric matrix |
| JP5672902B2 (en) | 2010-09-27 | 2015-02-18 | 富士通株式会社 | ordering generation method, program, and shared memory type scalar parallel computer |
- 2013-03-29: application JP2013074443A filed in Japan; granted as patent JP6083300B2 (active)
- 2014-02-26: application US 14/190,623 filed in the United States; granted as patent US9418048B2 (active)
Cited By (26)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20150067009A1 (en) * | 2013-08-30 | 2015-03-05 | Microsoft Corporation | Sparse matrix data structure |
| US9367519B2 (en) * | 2013-08-30 | 2016-06-14 | Microsoft Technology Licensing, Llc | Sparse matrix data structure |
| US10373057B2 (en) * | 2015-04-09 | 2019-08-06 | International Business Machines Corporation | Concept analysis operations utilizing accelerators |
| US10963794B2 (en) * | 2015-04-09 | 2021-03-30 | International Business Machines Corporation | Concept analysis operations utilizing accelerators |
| US20190354878A1 (en) * | 2015-04-09 | 2019-11-21 | International Business Machines Corporation | Concept Analysis Operations Utilizing Accelerators |
| US9720851B2 (en) * | 2015-08-20 | 2017-08-01 | National Technologies & Engineering Solutions of Sandia, LLC | Method and apparatus for managing access to a memory |
| US20170192818A1 (en) * | 2016-01-04 | 2017-07-06 | Fujitsu Limited | Matrix division method and parallel processing apparatus |
| US10269360B2 (en) | 2016-02-03 | 2019-04-23 | Dolby International Ab | Efficient format conversion in audio coding |
| US12019603B2 (en) * | 2016-10-11 | 2024-06-25 | Nec Corporation | Area allocation device, area allocation method, and non-volatile recording medium |
| JPWO2018070376A1 (en) * | 2016-10-11 | 2019-08-15 | 日本電気株式会社 | Area securing device, area securing method, and area securing program |
| US20210286789A1 (en) * | 2016-10-11 | 2021-09-16 | Nec Corporation | Area allocation device, area allocation method, and non-volatile recording medium |
| US10031806B2 (en) * | 2016-11-01 | 2018-07-24 | Cisco Technology, Inc. | Efficient repair of erasure coded data based on coefficient matrix decomposition |
| CN109983503A (en) * | 2016-11-23 | 2019-07-05 | 罗伯特·博世有限公司 | Corresponding search between matrix element |
| US10146738B2 (en) * | 2016-12-31 | 2018-12-04 | Intel Corporation | Hardware accelerator architecture for processing very-sparse and hyper-sparse matrix data |
| US12009207B2 (en) | 2017-05-12 | 2024-06-11 | Analog Devices, Inc. | Gallium nitride device for high frequency and high power applications |
| US12249631B2 (en) | 2017-05-12 | 2025-03-11 | Analog Devices, Inc. | Gallium nitride device for high frequency and high power applications |
| US10528384B2 (en) * | 2017-05-23 | 2020-01-07 | Fujitsu Limited | Information processing apparatus, multithread matrix operation method, and multithread matrix operation program |
| US20180341517A1 (en) * | 2017-05-23 | 2018-11-29 | Fujitsu Limited | Information processing apparatus, multithread matrix operation method, and multithread matrix operation program |
| CN111095304A (en) * | 2017-10-12 | 2020-05-01 | 三星电子株式会社 | Electronic equipment and control method thereof |
| US11010688B2 (en) * | 2017-11-30 | 2021-05-18 | Microsoft Technology Licensing, Llc | Negative sampling |
| US11429915B2 (en) | 2017-11-30 | 2022-08-30 | Microsoft Technology Licensing, Llc | Predicting feature values in a matrix |
| US11355598B2 (en) | 2018-07-06 | 2022-06-07 | Analog Devices, Inc. | Field managed group III-V field effect device with epitaxial back-side field plate |
| US11132423B2 (en) * | 2018-10-31 | 2021-09-28 | Hewlett Packard Enterprise Development Lp | Partition matrices into sub-matrices that include nonzero elements |
| US11127167B2 (en) * | 2019-04-29 | 2021-09-21 | Nvidia Corporation | Efficient matrix format suitable for neural networks |
| US20210042116A1 (en) * | 2019-08-06 | 2021-02-11 | Facebook, Inc. | Distributed physical processing of matrix sum operation |
| US11010202B2 (en) * | 2019-08-06 | 2021-05-18 | Facebook, Inc. | Distributed physical processing of matrix sum operation |
Also Published As
| Publication number | Publication date |
|---|---|
| JP6083300B2 (en) | 2017-02-22 |
| JP2014199545A (en) | 2014-10-23 |
| US9418048B2 (en) | 2016-08-16 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US9418048B2 (en) | Apparatus and method for allocating shared storage areas to parallel processors for multiplication of sparse matrix and vector | |
| US20190266217A1 (en) | Apparatus and method for matrix computation | |
| US9223820B2 (en) | Partitioning data for parallel processing | |
| US8595732B2 (en) | Reducing the response time of flexible highly data parallel task by assigning task sets using dynamic combined longest processing time scheme | |
| US10102267B2 (en) | Method and apparatus for access control | |
| US20170206089A1 (en) | Information processing apparatus and computational method | |
| CN111340201A (en) | Convolutional neural network accelerator and method for performing convolutional operation thereof | |
| US10496659B2 (en) | Database grouping set query | |
| US11120050B2 (en) | Parallel bootstrap aggregating in a data warehouse appliance | |
| US12271716B2 (en) | Computer-readable recording medium storing conversion program and conversion processing method | |
| US20200285510A1 (en) | High precision load distribution among processors | |
| KR20210103393A (en) | System and method for managing conversion of low-locality data into high-locality data | |
| US20160125009A1 (en) | Parallelized execution of window operator | |
| KR20230057246A (en) | Quantum circuit simulation method and optimizations using storage devices | |
| EP3785105B1 (en) | Systems and methods for rebalancing striped information across multiple storage devices | |
| US11514306B1 (en) | Static memory allocation in neural networks | |
| KR102840566B1 (en) | Systems, methods, and devices for data propagation in graph processing | |
| US20250123846A1 (en) | Parallel processing for sparse matrix linear algebra | |
| JP7239827B2 (en) | Information processing device and compiler program | |
| US20240020154A1 (en) | Management device, arithmetic processing device, and load distribution method and computer-readable recording medium storing program of arithmetic processing device | |
| KR102192048B1 (en) | Triangle enumeration method that reduces network traffic on heterogeneous clusters and apparatus thereof | |
| KR20240051715A (en) | distributed matrix computation method and apparatus for matrix fused operation | |
| Mehrez et al. | Understanding the performances of sparse compression formats using data parallel programming model | |
| CN119003171A (en) | Resource scheduling method, electronic equipment and medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AS | Assignment | Owner name: FUJITSU LIMITED, JAPAN. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:USUI, TETSUZOU;REEL/FRAME:034107/0533. Effective date: 20140116 |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 4 |
| | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 8 |