
WO2008037975A2 - Matrix multiplication - Google Patents

Matrix multiplication

Info

Publication number
WO2008037975A2
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
processor
memory
elements
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/GB2007/003635
Other languages
French (fr)
Other versions
WO2008037975A3 (en)
Inventor
Martin John Thompson
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TRW Ltd
Original Assignee
TRW Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TRW Ltd filed Critical TRW Ltd
Priority to EP07823926A priority Critical patent/EP2067100A2/en
Publication of WO2008037975A2 publication Critical patent/WO2008037975A2/en
Anticipated expiration legal-status Critical
Publication of WO2008037975A3 publication Critical patent/WO2008037975A3/en
Ceased legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
    • G06F9/345Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes of multiple operands or results
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3877Concurrent instruction execution, e.g. pipeline or look ahead using a slave processor, e.g. coprocessor
    • G06F9/3879Concurrent instruction execution, e.g. pipeline or look ahead using a slave processor, e.g. coprocessor for non-native instruction execution, e.g. executing a command; for Java instruction set

Definitions

  • This invention relates to a matrix multiplication processor, processing apparatus using such a matrix multiplication processor, use of a matrix multiplication co-processor, a video processing apparatus using a matrix multiplication processor and a matrix multiplication method.
  • the elements of the product matrix obtained by matrix multiplication are defined by $(AB)_{ij} = \sum_{r=1}^{n} A_{ir} B_{rj}$, where n is the number of columns of the first matrix.
  • Matrix multiplication can therefore be seen as taking a row from the first matrix and a column from the second matrix, matching successive elements from each to form pairs (that is, first element with first, second with second and so on), multiplying each pair of elements thus chosen and adding all the products together. This generates one element of the product matrix, and the process is repeated for all combinations of rows and columns to form the product matrix.
  • Matrix multiplication is especially important in image processing, where images, or data derived therefrom, can be represented by large matrices, each element of a matrix representing the intensity of a pixel within the image or some other similar data. Matrix multiplication is not a very efficient process on conventional processors, due to the large amount of data that must be passed over the processor's external data bus as successive multiplications are performed. It is necessary to interleave mathematical operations with memory accesses.
  • a matrix multiplication processor for multiplying a first matrix with a second matrix to form an output matrix, each matrix comprising a plurality of elements in rows and columns, the matrix processor comprising:
  • a memory having a first area in which, in use, the first matrix is stored, a second area in which, in use, the second matrix is stored and a third area to which, in use, the output matrix is written,
  • a multiplier having an input for two values and an output and arranged to, in use, multiply two values at its input and output the product of the two values at its output;
  • a control unit arranged to control the operation of the matrix multiplication processor; in which the control unit is arranged such that, in use, it repeatedly chooses an element from the first matrix and an element from the second matrix, causes the elements thus chosen to be passed to the multiplier and causes the resultant product to be passed to the accumulator to be added to the running total; wherein the elements of each matrix are stored, in use, in sequential memory locations of the respective memory area in row-wise and/or column-wise order; and in which the control unit selects the elements from the first and second matrix by moving stepwise through the first and second memory areas.
  • Such a processor has been found to be more efficient than the prior art architecture, where elements must be retrieved from memory and passed through a bus, operated upon and then passed back to the memory through the bus for each operation.
  • the stepwise operation through contiguous memory locations means that the control unit need merely step through the elements in the memory; the subsequent use of a dedicated accumulator keeping a running total and passing elements directly from a multiplier to the accumulator has been found to be particularly efficient.
  • the control unit would also be arranged so as to, in use, cause the running total to be written to an appropriate location within the third area of memory.
  • each matrix may be stored such that each row or column is stored in sequential memory locations, the elements within each row also being stored in sequential memory locations. This can improve the efficiency of access of the elements as will be seen below.
  • the elements of the first row are stored in order starting from the top-left element of each matrix, followed by the second row, but any other arrangement preserving the order of elements and rows is possible.
  • the control unit when moving stepwise through the memory to choose elements within whichever of row or column the elements are stored may therefore step through successive memory locations - the memory address chosen by the control unit will increment or decrement by 1.
  • the control unit may step through memory locations separated by the number of rows or columns as appropriate - the memory address chosen by the control unit will increment by a constant (the number of rows or columns). It can be seen from this description that accessing successive elements across rows or columns (as is necessary in matrix multiplication) is a simple process that does not require much processor effort. It should also be noted that decrementing a memory address can also be seen to include incrementing the memory address by a negative amount, so that incrementing should be read to include decrementing also.
  • control unit may comprise a pointer for each matrix, the pointer indicating, in use, the memory address of the element of each matrix that is presently chosen.
  • control unit may also comprise pointer incrementing means arranged to, in use, successively increment each pointer by a constant.
  • the control unit may also comprise a base address store for first and second matrices, which is arranged so as to store, in use, the address of the first element of the row or column through which the control element is stepping at a given time.
  • the control unit may also comprise base address incrementing means, which is arranged to increment the base address by a constant value when switching from one row or column to the next.
  • the base addresses can be updated; when working through a row of output elements, the base address for the second matrix will be incremented by the appropriate increment for moving along a column; when the time comes to start a new row, the base address for the first matrix will be incremented by the equivalent of one row whilst the base address for the second matrix will be reset to the start of the first column; this assumes that a row-wise calculation of the output matrix is performed, but for a column-wise calculation the reverse is true.
  • the accumulator comprises a dedicated memory for the running total; this dedicated memory is preferably separate from the memory in which the matrices are stored in use. Accordingly, by providing an accumulator with a dedicated memory, any problems with interleaving mathematical operations with memory accesses can be reduced.
  • the processor is implemented in a Field Programmable Gate Array (FPGA) circuit.
  • FPGA Field Programmable Gate Array
  • the processor also comprises a port for communicating with an external processor, whereby the processor is arranged such that first and second matrices can be input at the port in use so as to be written to the memory and so that the output matrix, in use, may be output at the port.
  • the processor may be arranged such that it can use the output matrix generated in one operation as a first or second matrix in a subsequent operation, with the output matrix being retained in memory, in use, for this purpose. This allows further calculations to be performed on the matrix without the need for passing the output matrix back out of the processor.
  • the first and second matrices and output matrices are preferably stored in the memory in the same order - that is, they are both stored row-wise or column-wise.
  • the control unit may be provided with a starting address store indicative in use of the first address in each matrix; to perform a calculation on a previous output matrix, the starting address store for the first or second matrix need only be updated to refer to the first address of the previous output matrix.
  • the control unit may be arranged to operate in discrete cycles.
  • control unit may be arranged to cause a pair of elements to be read from the memory and output to the multiplier; to cause the multiplier to multiply the two elements output to it in the previous cycle and output the product to the accumulator; to cause the accumulator to add the product calculated in the previous cycle to the running total; and to cause the running total from the previous cycle to be written to the memory.
  • the control unit may more reliably operate, as it is clear when each operation will occur. Furthermore, this "pipelining" of instructions may allow for faster operation.
  • the memory may comprise a dual port RAM (or a multiport RAM, of which at least two ports are used; herein by dual port we include multi-port although this need not be the case), each port typically having an input and an output.
  • the output of both ports may be operatively coupled to the multiplier; the use of a dual port RAM allows for both values to be multiplied to be read from memory at the same time.
  • An input of one of the ports may be operatively coupled to an output of the accumulator, the output of the accumulator being arranged to output the running total when commanded to do so by the control unit.
  • one or both inputs of the ports of the RAM may be connected to the port of the processor so as to enable the first and second matrices to be loaded into the memory in use.
  • the accumulator may be coupled to the same input as the processor port is; in such a case the processor may further comprise a multiplexer coupling both accumulator and processor port to the input of the port of the RAM.
  • a processing apparatus comprising a processor coupled to a matrix multiplication processor according to a first aspect of the invention. This allows a conventional processor to use the matrix processor to perform matrix multiplication, thereby speeding up the processing of matrices, whilst retaining the usual functions of the conventional processor for other tasks.
  • the processor and the matrix processor may be provided on discrete chips connected by a data bus, or may be implemented on a single chip.
  • one or both of the processor and matrix multiplication processor or the single chip may be implemented in FPGA; this is especially convenient given that standard processors are available in FPGA and so the matrix processor may be designed as an on-chip "peripheral" to the processor.
  • a processing apparatus comprising a first processor and a second processor connected by a data bus, in which the first processor is a general-purpose processor and the second processor is more efficient than the first at calculating matrix multiplication.
  • the first processor is a general-purpose processor and the second processor is more efficient than the first at calculating matrix multiplication.
  • the second processor is used, in use (save perhaps for ancillary matters), only for matrix multiplication.
  • the processing apparatus may be according to the second aspect of the invention.
  • a dedicated matrix-multiplication co-processor in calculating matrix multiplication in combination with a general-purpose processor for other operations.
  • a video processing apparatus comprising the processing apparatus of the second or third aspects of the invention.
  • the video processing apparatus may comprise a video camera, and the video processing apparatus may be arranged to, in use, capture images from the video camera and convert them into matrices, which are then processed by the matrix multiplication processor.
  • the video processing apparatus may be arranged to capture images of a road and determine, possibly by means of a Kalman filter, the position of lane boundaries in the captured images.
  • a method of matrix multiplication in a processing system, the multiplication being of a first matrix with a second matrix to form an output matrix, each matrix comprising a plurality of elements in rows and columns, the method comprising: storing the first matrix in a first area of memory and the second matrix in a second area of memory, the elements of each matrix being stored, in use, in sequential memory locations of the respective memory area in row-wise and/or column-wise order; repeatedly choosing an element from the first matrix and an element from the second matrix, multiplying the elements thus chosen to form a product and adding that to a running total; in which the elements are chosen from the first and second matrix by moving stepwise through the first and second memory areas.
  • the method also comprising writing the running total to an appropriate location within a third area of memory.
  • the elements of each matrix may be stored such that each row or column is stored in sequential memory locations, the elements within each row also being stored in sequential memory locations. This can improve the efficiency of access of the elements as will be seen below.
  • the elements of the first row are stored in order starting from the top-left element of each matrix, followed by the second row, but any other arrangement preserving the order of elements and rows is possible.
  • a pointer to each of first and second matrix may be incremented in one case by 1 and in the other case by the number of columns in the first matrix (which is, of course, equal to the number of rows in the second matrix).
  • base addresses indicating the starting position of that row or column can be updated; when working through a row of output elements, the base address for the second matrix will be incremented by the appropriate increment for moving along a column; when the time comes to start a new row, the base address for the first matrix will be incremented by the equivalent of one row whilst the base address for the second matrix will be reset to the start of the first column; this assumes that a row-wise calculation of the output matrix is performed, but for a column-wise calculation the reverse is true.
  • the method may comprise the step of using the output matrix generated in one operation as a first or second matrix in a subsequent operation, with the output matrix being retained in memory for this purpose. This allows further calculations to be performed on the matrix without the need for passing the output matrix back out of the processor.
  • the first and second matrices and output matrices are preferably stored in the memory in the same order - that is, they are both stored row-wise or column-wise.
  • the control unit may be provided with a starting address store indicative in use of the first address in each matrix; to perform a calculation on a previous output matrix, the starting address store for the first or second matrix being updated in such a case to refer to the first address of the previous output matrix.
  • the method may be carried out in discrete cycles.
  • a pair of elements may be read from the memory; the two elements read in the previous cycle may be multiplied and the product outputted to the accumulator; the product calculated in the previous cycle may be added to the running total; and the running total from the previous cycle may be written to the memory.
  • Figure 1 is a diagram showing schematically a matrix multiplication processor according to an embodiment of the invention
  • Figure 2 is a flow chart showing the operation of the processor of Figure 1;
  • Figure 3 shows an example calculation as performed by the matrix multiplication processor of Figure 1;
  • Figure 4 shows a car fitted with the processor of Figure 1.
  • FIG. 1 of those drawings shows the basic functional blocks of the circuit, where the matrix multiplication processor as a whole is depicted as 1.
  • the processor 1 comprises an area of memory 2 of the form of a dual-port RAM, a multiplier 3, a resettable accumulator 4 and a control state machine 5 arranged to act as a control unit for the other circuits.
  • the matrix multiplication processor 1 is connected by means of a data bus 10 to an external processor 11.
  • When the external processor 11 requires the product of two matrices, it passes first and second matrices across data bus 10 to the matrix multiplication processor 1. They are stored in the memory 2.
  • in other embodiments, there is no reason why the processor 11 and the matrix multiplication processor 1 should not be provided on separate chips, or on different architectures.
  • the matrix multiplication processor 1, the bus 10 and the processor 11 are all typically provided on the same FPGA chip. This is particularly convenient, as common processors are available as FPGA modules, and the matrix multiplication processor can be provided as a peripheral thereto.
  • the matrices are stored in the memory in a row-wise fashion. Taking, as a non-limiting example, the matrices A and B referred to in the introduction above, then the elements can be stored as shown in the right hand three columns of Figure 3 of the accompanying drawings.
  • Matrix A starts at memory location 0 with element a11, the top-leftmost element, followed by the remaining element in that row, a12, followed by the elements of the next row and so on.
  • the elements of matrix A therefore define a first memory area.
  • Matrix B starts at memory location 9 and is set out in the same fashion, with element b11 first, followed by b12, b21 and b22 in order; the elements of B therefore define a second memory area.
  • the memory locations between each matrix may hold other data and are of no great importance. It is to be noted that each matrix could be held anywhere in the memory as long as the elements within the matrix are stored in the correct order.
  • matrices A and B are set as follows:
  • the matrices A and B are written into memory by processor 11 as shown in Figure 3 and several variables are initialised. These are:
  • BaseA, BaseB indicate the starting location in memory of the row in matrix A and the column in matrix B that is currently being worked on; initially these values are set to the location of the first element in each matrix and so in the worked example take the values 0 and 9 respectively.
  • AddrA, AddrB indicate the location in memory of the element from each matrix that is currently being used; initially these values are set to BaseA and BaseB respectively, and so in the worked example take the values 0 and 9 respectively.
  • AddrAinc, AddrBinc indicate the number of memory locations to skip moving along one element in a row of A or a column of B respectively.
  • AddrAinc is set to 1
  • AddrBinc is set to the number of columns of B; in the worked example, this is 2.
  • AddrY indicates the location in memory to which the current element in the output matrix Y is to be written. Initialised to the location of the first element in Y, so in the worked example is initialised to 15.
  • BaseAinc, BaseBinc indicate the increment in the starting location in memory from row to row in A and column to column in B. BaseAinc is equal to the number of columns in the A matrix (that is, the number of elements to skip to move from one row to the next), whereas BaseBinc is equal to 1.
  • Count 0: a decrementing counter which indicates how many elements in the present row of the output matrix are left to be processed. Zero-based, so initialised to the number of elements in a row of A, less 1. The counter is decremented every time an output element is completed; this counter indicating zero implies the start of a new row in the output matrix.
  • Count 1: a decrementing counter which indicates how many pairs of elements in the input matrices are left to be processed within the same output element. Zero-based, so initialised to the number of columns in A, less 1. This counter is decremented every time a pair of values is loaded and multiplied, and it indicating zero implies that an output element has been calculated.
  • Count 2: a decrementing counter which indicates how many output elements remain to be calculated. Zero-based, so initialised to the number of elements in the output matrix less 1.
  • ld_acc: controls whether to load a new value into the accumulator, and hence reset it, or to add the value to the running total. Initialised to "1", that is to load the next value calculated into the accumulator. Uses Count1 as a counter - the "Ld/acc" counter - to determine whether it is time to load or accumulate.
  • the accumulator 4 determines whether the ld_acc flag is at "1" or "0"; if it is at "1" then the product from the multiplier 3 is loaded into the accumulator 4, thereby resetting it. In such a case, once the accumulator 4 has been reset, the ld_acc flag is reset to "0" to enable further values to be added to the running total. If the flag is at "0" then the product is added to the running total stored by the accumulator.
  • at step 110 it is determined whether Count 1 is at zero (and hence whether the calculation of the present output element is complete); if it is not, then Count1 is decremented by 1 at step 112 and the values of AddrA and AddrB are incremented at step 114 by AddrAinc and AddrBinc respectively. If the new value of Count1 is zero, then the element will be completed in the next cycle and the write enable flag WE is set to "1".
  • Count1 is reset to the number of columns in A, less 1. Given that the running total stored by the accumulator will have been stored at step 108, the write enable flag WE may now be lowered to state "0". Count 2 is decremented by 1 as there is now one fewer element to calculate, and AddrY is incremented to point to the next write address in the output matrix Y.
  • if Count 0 is also zero (step 118), the present output row is complete.
  • Count 0 is reset to the number of columns in B, less 1; BaseA is incremented by BaseAinc and BaseB is reset to the starting position for matrix B.
  • the matrix multiplication processor is now ready to calculate the next row. If the row is not complete, then at step 122 BaseB is incremented by BaseBinc (that is, 1) and Count 0 is decremented by 1, so as to be ready to calculate the next element in the row. In either case, at step 124 the read addresses - AddrA and AddrB - are reset to the new values of BaseA and BaseB, ready for calculation of the next element.
  • the method checks whether Count 2 has reached zero (step 130). If it has not, then there are still elements to calculate and so the method returns to step 102 for the next cycle. If it has reached zero, then the matrix multiplication is complete and all elements of the output matrix Y are stored in the appropriate area of memory 2. In the final step 132, a flag done is raised to indicate to processor 11 that the results are ready.
  • Figure 3 of the accompanying drawings shows the operation of this method on a cycle-by-cycle basis. It shows the starting values of BaseA, BaseB, BaseAinc, BaseBinc, Count0, Count1, Count2, AddrAinc, AddrBinc, and AddrY as discussed above in the first two rows. It also shows the contents of the memory 2 in the right-hand two columns.
  • Count 1 counts down through the cycles required to generate an output element; in this case from 1 to 0 as two calculations are required.
  • Count2 counts down the total number of elements. In this case, it is counted down from 6, the total required, to 0 indicative of the method finishing and thereby causing the raising of the done flag.
  • BaseA is incremented every time a row of the output matrix is finished to point at the first element of a new row of A.
  • BaseB is updated every time an output element is completed, so as to point to the first element of the column of B that is currently being processed.
  • the pointers AddrA and AddrB start from the current value of BaseA and BaseB and increment by AddrAinc and AddrBinc.
  • the ld_acc flag is raised for the first calculation within an element, whereas the WE flag is raised for the last.
  • This calculation therefore shows the multiplication and summations required to calculate a matrix multiplication; indeed, the values of the output matrix are shown in the memory area as they would be after the processor had finished. It can be seen that the method used has correctly calculated:
  • the outputs of the control unit 5 are the addresses AddrA and AddrB plus the write enable flag WE, passed to the memory 2, the ld_acc flag passed to the accumulator 4 and the done flag passed to the processor 11.
  • FIG. 4 shows a car 200 fitted with the processor 11 and matrix multiplication processor 1 of Figure 1.
  • the car 200 is fitted with a lane detection system such as that disclosed in EP 1 649 334, comprising a video camera 204 and a processing unit 203.
  • This processing unit is fitted with both the matrix multiplication processor 1 and the processor 11 of Figure 1.
  • the matrix multiplication processor 1, fitted to the car as part of the lane detection system, can then be used to accelerate the necessary processing of matrices within the lane detection algorithm. It has been found that an acceleration of up to five times is possible by using such a processing unit 203 for relatively small matrices and it is expected that the performance gains on bigger matrices could be even greater.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

A matrix multiplication processor (1) for multiplying a first matrix with a second matrix to form an output matrix, each matrix comprising a plurality of elements in rows and columns, the matrix processor comprising: a memory (2) having a first area in which, in use, the first matrix is stored, a second area in which, in use, the second matrix is stored and a third area to which, in use, the output matrix is written, an accumulator (4) having an input and arranged to, in use, store a running total and add values at its input to the running total; a multiplier (3) having an input for two values and an output and arranged to, in use, multiply two values at its input and output the product of the two values at its output; and a control unit (5) arranged to control the operation of the matrix multiplication processor; in which the control unit is arranged such that, in use, it repeatedly chooses (102) an element from the first matrix and an element from the second matrix, causes the elements thus chosen to be passed (104) to the multiplier (3) and causes (106) the resultant product to be passed to the accumulator (4) to be added to the running total; wherein the elements of each matrix are stored, in use, in sequential memory locations of the respective memory area in row-wise and/or column-wise order; and in which the control unit selects the elements from the first and second matrix by moving stepwise (112-130) through the first and second memory areas. Also, a processing apparatus comprising a first processor (11) and a second processor (1) connected by a data bus (10), in which the first processor (11) is a general-purpose processor and the second processor (1) is more efficient than the first at calculating matrix multiplication. The processors thus disclosed may advantageously be used in video processing situations, such as road lane detection from vehicles.

Description

MATRIX MULTIPLICATION
This invention relates to a matrix multiplication processor, processing apparatus using such a matrix multiplication processor, use of a matrix multiplication co-processor, a video processing apparatus using a matrix multiplication processor and a matrix multiplication method.
Matrix multiplication is a well-known mathematical process whereby two matrices are multiplied together to form a third matrix. This can be represented by Y = AB where Y is the product of matrices A and B. It should be noted that matrix multiplication is non-commutative (that is, AB is not generally equal to BA) and that matrix multiplication is only well defined when the number of columns in the first matrix is equal to the number of rows in the second matrix. Multiplication of an m x n matrix by an n x o matrix results in an m x o matrix.
The elements of the product matrix obtained by matrix multiplication are defined by:

$(AB)_{ij} = \sum_{r=1}^{n} A_{ir} B_{rj}$

where A and B are the matrices to be multiplied together, n is the number of columns of A (equal to the number of rows of B), and the subscripts indicate the row and column of the relevant element.
Taking the example of a 3x2 matrix multiplied by a 2x2 matrix, Y = AB, where:

$y_{11} = a_{11}b_{11} + a_{12}b_{21}$   $y_{12} = a_{11}b_{12} + a_{12}b_{22}$
$y_{21} = a_{21}b_{11} + a_{22}b_{21}$   $y_{22} = a_{21}b_{12} + a_{22}b_{22}$
$y_{31} = a_{31}b_{11} + a_{32}b_{21}$   $y_{32} = a_{31}b_{12} + a_{32}b_{22}$
Matrix multiplication can therefore be seen as taking a row from the first matrix and a column from the second matrix, matching successive elements from each to form pairs (that is, first element with first, second with second and so on), multiplying each pair of elements thus chosen and adding all the products together. This generates one element of the product matrix, and the process is repeated for all combinations of rows and columns to form the product matrix.
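By way of illustration only (the patent gives no software; this is a minimal Python sketch of the row-by-column procedure just described, with arbitrary example values), the computation can be expressed as:

```python
def matmul(a, b):
    """Multiply matrix a (m x n) by matrix b (n x o), both given as lists of rows."""
    m, n, o = len(a), len(b), len(b[0])
    assert all(len(row) == n for row in a), "columns of a must equal rows of b"
    y = [[0] * o for _ in range(m)]
    for i in range(m):            # take each row of a in turn
        for j in range(o):        # pair it with each column of b
            total = 0             # running total for one output element
            for r in range(n):    # match elements pairwise and accumulate products
                total += a[i][r] * b[r][j]
            y[i][j] = total
    return y

# A 3x2 matrix times a 2x2 matrix, as in the example above (values are arbitrary):
a = [[1, 2], [3, 4], [5, 6]]
b = [[7, 8], [9, 10]]
print(matmul(a, b))               # [[25, 28], [57, 64], [89, 100]]
```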
Matrix multiplication is especially important in image processing, where images, or data derived therefrom, can be represented by large matrices, each element of a matrix representing the intensity of a pixel within the image or some other similar data. Matrix multiplication is not a very efficient process on conventional processors, due to the large amount of data that must be passed over the processor's external data bus as successive multiplications are performed. It is necessary to interleave mathematical operations with memory accesses. According to a first aspect of the invention, there is provided a matrix multiplication processor for multiplying a first matrix with a second matrix to form an output matrix, each matrix comprising a plurality of elements in rows and columns, the matrix processor comprising:
• a memory having a first area in which, in use, the first matrix is stored, a second area in which, in use, the second matrix is stored and a third area to which, in use, the output matrix is written,
• an accumulator having an input and arranged to, in use, store a running total and add values at its input to the running total;
• a multiplier having an input for two values and an output and arranged to, in use, multiply two values at its input and output the product of the two values at its output; and
• a control unit arranged to control the operation of the matrix multiplication processor; in which the control unit is arranged such that, in use, it repeatedly chooses an element from the first matrix and an element from the second matrix, causes the elements thus chosen to be passed to the multiplier and causes the resultant product to be passed to the accumulator to be added to the running total; wherein the elements of each matrix are stored, in use, in sequential memory locations of the respective memory area in row-wise and/or column-wise order; and in which the control unit selects the elements from the first and second matrix by moving stepwise through the first and second memory areas.
Such a processor has been found to be more efficient than the prior art architecture, where elements must be retrieved from memory and passed through a bus, operated upon and then passed back to the memory through the bus for each operation. The stepwise operation through contiguous memory locations means that the control unit need merely step through the elements in the memory; the subsequent use of a dedicated accumulator keeping a running total and passing elements directly from a multiplier to the accumulator has been found to be particularly efficient.
Typically, the control unit would also be arranged so as to, in use, cause the running total to be written to an appropriate location within the third area of memory.
In use, the elements of each matrix may be stored such that each row or column is stored in sequential memory locations, the elements within each row also being stored in sequential memory locations. This can improve the efficiency of access of the elements as will be seen below. In the preferred embodiment, the elements of the first row are stored in order starting from the top-left element of each matrix, followed by the second row, but any other arrangement preserving the order of elements and rows is possible.
Given the sequential nature of the elements stored within the memory, the control unit, when moving stepwise through the memory to choose elements within whichever of row or column the elements are stored may therefore step through successive memory locations - the memory address chosen by the control unit will increment or decrement by 1. When moving stepwise through the memory to choose elements from successive of the rows or columns that the elements are stored in, the control unit may step through memory locations separated by the number of rows or columns as appropriate - the memory address chosen by the control unit will increment by a constant (the number of rows or columns). It can be seen from this description that accessing successive elements across rows or columns (as is necessary in matrix multiplication) is a simple process that does not require much processor effort. It should also be noted that decrementing a memory address can also be seen to include incrementing the memory address by a negative amount, so that incrementing should be read to include decrementing also.
Indeed, the control unit may comprise a pointer for each matrix, the pointer indicating, in use, the memory address of the element of each matrix that is presently chosen. The control unit may also comprise pointer incrementing means arranged to, in use, successively increment each pointer by a constant.
The control unit may also comprise a base address store for first and second matrices, which is arranged so as to store, in use, the address of the first element of the row or column through which the control element is stepping at a given time. The control unit may also comprise base address incrementing means, which is arranged to increment the base address by a constant value when switching from one row or column to the next.
It can therefore be seen that when calculating one element of the output matrix, the pointer to each of first and second matrix will be incremented in one case by 1 and in the other case by the number of columns in the first matrix (which is, of course, equal to the number of rows in the second matrix). Once one element in the output matrix has been calculated, the base addresses can be updated; when working through a row of output elements, the base address for the second matrix will be incremented by the appropriate increment for moving along a column; when the time comes to start a new row, the base address for the first matrix will be incremented by the equivalent of one row whilst the base address for the second matrix will be reset to the start of the first column; this assumes that a row-wise calculation of the output matrix is performed, but for a column-wise calculation the reverse is true. Preferably, the accumulator comprises a dedicated memory for the running total; this dedicated memory is preferably separate from the memory in which the matrices are stored in use. Accordingly, by providing an accumulator with a dedicated memory, any problems with interleaving mathematical operations with memory accesses can be reduced.
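As a sketch of this addressing scheme (the names AddrA, AddrB, AddrAinc and AddrBinc anticipate the embodiment described later; row-wise storage is assumed), the addresses visited while forming a single output element might be generated as follows:

```python
# Addresses visited while forming one output element, assuming row-wise storage:
# AddrA steps by 1 along a row of A, AddrB steps by the number of columns of B (o)
# down a column of B.
def element_addresses(base_a, base_b, n, o):
    addr_a, addr_b = base_a, base_b           # pointers start at the row/column bases
    addr_a_inc, addr_b_inc = 1, o
    pairs = []
    for _ in range(n):                        # n multiply-accumulate steps per element
        pairs.append((addr_a, addr_b))
        addr_a += addr_a_inc
        addr_b += addr_b_inc
    return pairs

# With A stored from address 0 and B from address 9 (as in the embodiment below):
print(element_addresses(0, 9, n=2, o=2))      # [(0, 9), (1, 11)] -> a11*b11 + a12*b21
```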
Preferably, the processor is implemented in a Field Programmable Gate Array (FPGA) circuit. This is particularly useful, as it is a flexible technology particularly suitable for the generation of new architectures such as that described above. Preferably, the processor also comprises a port for communicating with an external processor, whereby the processor is arranged such that first and second matrices can be input at the port in use so as to be written to the memory and so that the output matrix, in use, may be output at the port.
The processor may be arranged such that it can use the output matrix generated in one operation as a first or second matrix in a subsequent operation, with the output matrix being retained in memory, in use, for this purpose. This allows further calculations to be performed on the matrix without the need for passing the output matrix back out of the processor.
To this end, the first and second matrices and output matrices are preferably stored in the memory in the same order - that is, they are both stored row-wise or column-wise. The control unit may be provided with a starting address store indicative in use of the first address in each matrix; to perform a calculation on a previous output matrix, the starting address store for the first or second matrix need only be updated to refer to the first address of the previous output matrix. The control unit may be arranged to operate in discrete cycles. In one cycle, the control unit may be arranged to cause a pair of elements to be read from the memory and output to the multiplier; to cause the multiplier to multiply the two elements output to it in the previous cycle and output the product to the accumulator; to cause the accumulator to add the product calculated in the previous cycle to the running total; and to cause the running total from the previous cycle to be written to the memory. By operating in discrete cycles, the control unit may more reliably operate, as it is clear when each operation will occur. Furthermore, this "pipelining" of instructions may allow for faster operation.
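A rough software model of this pipelining (an assumption-laden sketch only; the patent describes a hardware state machine, not code) treats each stage as a register that advances once per cycle, so that read, multiply, accumulate and write all operate on data produced in earlier cycles:

```python
# Minimal model of the four-stage cycle: read a new pair while the previously
# read pair is multiplied, the previous product is accumulated, and the final
# running total is what the write stage would store.
def pipelined_dot(a_vals, b_vals):
    read_reg = None     # pair latched by the read stage
    mult_reg = None     # pair presented to the multiplier
    product = None      # product waiting to be accumulated
    total = 0           # accumulator contents
    # Three extra cycles flush the read -> multiply -> accumulate stages.
    for incoming in list(zip(a_vals, b_vals)) + [None] * 3:
        if product is not None:
            total += product                                       # accumulate stage
        product = mult_reg[0] * mult_reg[1] if mult_reg else None  # multiply stage
        mult_reg, read_reg = read_reg, incoming                    # read stage advances
    return total

print(pipelined_dot([1, 2], [7, 9]))    # 25 == 1*7 + 2*9
```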
The memory may comprise a dual port RAM (or a multiport RAM, of which at least two ports are used; herein by dual port we include multi-port although this need not be the case), each port typically having an input and an output. The output of both ports may be operatively coupled to the multiplier; the use of a dual port RAM allows for both values to be multiplied to be read from memory at the same time. However, it is possible to use a single port RAM although with a speed penalty.
An input of one of the ports may be operatively coupled to an output of the accumulator, the output of the accumulator being arranged to output the running total when commanded to do so by the control unit. Furthermore, one or both inputs of the ports of the RAM may be connected to the port of the processor so as to enable the first and second matrices to be loaded into the memory in use.
The accumulator may be coupled to the same input as the processor port is; in such a case the processor may further comprise a multiplexer coupling both accumulator and processor port to the input of the port of the RAM. According to a second aspect of the invention, there is provided a processing apparatus, comprising a processor coupled to a matrix multiplication processor according to a first aspect of the invention. This allows a conventional processor to use the matrix processor to perform matrix multiplication, thereby speeding up the processing of matrices, whilst retaining the usual functions of the conventional processor for other tasks.
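For orientation, the data path described above (memory read ports, multiplier, accumulator and multiplexer) can be sketched structurally as follows; the class and method names are illustrative assumptions, not taken from the patent:

```python
# Structural sketch of the data path: a dual-port memory whose two read ports
# feed the multiplier, and whose write port is driven, via a multiplexer,
# either by the external processor port or by the accumulator.
class DualPortRam:
    def __init__(self, size):
        self.cells = [0] * size

    def read(self, addr_a, addr_b):      # both ports can be read in the same cycle
        return self.cells[addr_a], self.cells[addr_b]

    def write(self, addr, value):        # one port input used for writing
        self.cells[addr] = value

class Datapath:
    def __init__(self, ram):
        self.ram = ram
        self.acc = 0                     # dedicated accumulator register

    def load_from_port(self, addr, value):
        self.ram.write(addr, value)      # external processor side of the multiplexer

    def mac_cycle(self, addr_a, addr_b, load):
        a, b = self.ram.read(addr_a, addr_b)
        self.acc = a * b if load else self.acc + a * b   # load (reset) or accumulate

    def store_result(self, addr_y):
        self.ram.write(addr_y, self.acc) # accumulator side of the multiplexer

ram = DualPortRam(32)
dp = Datapath(ram)
dp.load_from_port(0, 3)
dp.load_from_port(9, 4)
dp.mac_cycle(0, 9, load=True)            # 3 * 4 loaded into the accumulator
dp.store_result(15)                      # running total written back via the mux
print(ram.cells[15])                     # 12
```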
The processor and the matrix processor may be provided on discrete chips connected by a data bus, or may be implemented on a single chip. Conveniently, one or both of the processor and matrix multiplication processor or the single chip may be implemented in FPGA; this is especially convenient given that standard processors are available in FPGA and so the matrix processor may be designed as an on-chip "peripheral" to the processor.
According to a third aspect of the invention, there is provided a processing apparatus comprising a first processor and a second processor connected by a data bus, in which the first processor is a general-purpose processor and the second processor is more efficient than the first at calculating matrix multiplication. Such an apparatus makes use of a special "co-processor" in order to calculate matrix multiplication, an otherwise processor-intensive operation. Preferably, the second processor is used, in use (save perhaps for ancillary matters), only for matrix multiplication. The processing apparatus may be according to the second aspect of the invention.
According to a fourth aspect of the invention, there is provided use of a dedicated matrix-multiplication co-processor in calculating matrix multiplication in combination with a general-purpose processor for other operations.
According to a fifth aspect of the invention, there is provided a video processing apparatus, comprising the processing apparatus of the second or third aspects of the invention. The video processing apparatus may comprise a video camera, and the video processing apparatus may be arranged to, in use, capture images from the video camera and convert them into matrices, which are then processed by the matrix multiplication processor. The video processing apparatus may be arranged to capture images of a road and determine, possibly by means of a Kalman filter, the position of lane boundaries in the captured images.
According to a sixth aspect of the invention, there is provided a method of matrix multiplication in a processing system, the multiplication being of a first matrix with a second matrix to form an output matrix, each matrix comprising a plurality of elements in rows and columns, the method comprising: storing the first matrix in a first area of memory and the second matrix in a second area of memory, the elements of each matrix being stored, in use, in sequential memory locations of the respective memory area in row-wise and/or column-wise order; repeatedly choosing an element from the first matrix and an element from the second matrix, multiplying the elements thus chosen to form a product and adding that to a running total; in which the elements are chosen from the first and second matrix by moving stepwise through the first and second memory areas.
Typically, the method also comprising writing the running total to an appropriate location within a third area of memory. In use, the elements of each matrix may be stored such that each row or column is stored in sequential memory locations, the elements within each row also being stored in sequential memory locations. This can improve the efficiency of access of the elements as will be seen below. In the preferred embodiment, the elements of the first row are stored in order starting from the top-left element of each matrix, followed by the second row, but any other arrangement preserving the order of elements and rows is possible.
Given the sequential nature of the elements stored within the memory, when moving stepwise through the memory to choose elements within whichever of row or column the elements are stored, it is therefore possible to simply step through successive memory locations - the memory address chosen will increment or decrement by 1. When moving stepwise through the memory to choose elements from successive of the rows or columns that the elements are stored in, it is therefore possible to step through memory locations separated by the number of rows or columns as appropriate - the memory address chosen will increment by a constant (the number of rows or columns). It can be seen from this description that accessing successive elements across rows or columns (as is necessary in matrix multiplication) is a simple process that does not require much processor effort. It should also be noted that decrementing a memory address can also be seen to include incrementing the memory address by a negative amount, so that incrementing should be read to include decrementing also.
It can therefore be seen that when calculating one element of the output matrix, a pointer to each of first and second matrix may be incremented in one case by 1 and in the other case by the number of columns in the first matrix (which is, of course, equal to the number of rows in the second matrix). Once one element in the output matrix has been calculated, base addresses indicating the starting position of that row or column can be updated; when working through a row of output elements, the base address for the second matrix will be incremented by the appropriate increment for moving along a column; when the time comes to start a new row, the base address for the first matrix will be incremented by the equivalent of one row whilst the base address for the second matrix will be reset to the start of the first column; this assumes that a row-wise calculation of the output matrix is performed, but for a column-wise calculation the reverse is true.
The method may comprise the step of using the output matrix generated in one operation as a first or second matrix in a subsequent operation, with the output matrix being retained in memory for this purpose. This allows further calculations to be performed on the matrix without the need for passing the output matrix back out of the processor.
To this end, the first and second matrices and output matrices are preferably stored in the memory in the same order - that is, they are both stored row-wise or column-wise. The control unit may be provided with a starting address store indicative in use of the first address in each matrix; to perform a calculation on a previous output matrix, the starting address store for the first or second matrix being updated in such a case to refer to the first address of the previous output matrix.
The method may be carried out in discrete cycles. In one cycle, a pair of elements may be read from the memory; the two elements read in the previous cycle may be multiplied and the product outputted to the accumulator; the product calculated in the previous cycle may be added to the running total; and the running total from the previous cycle may be written to the memory. By operating in discrete cycles, the method may more reliably operate, as it is clear when each operation will occur. There now follows by way of example only a description of an embodiment of the invention, described with reference to the accompanying drawings in which:
Figure 1 is a diagram showing schematically a matrix multiplication processor according to an embodiment of the invention;
Figure 2 is a flow chart showing the operation of the processor of Figure 1;
Figure 3 shows an example calculation as performed by the matrix multiplication processor of Figure 1; and
Figure 4 shows a car fitted with the processor of Figure 1.
A matrix multiplication processor according to an embodiment of the invention is shown in the accompanying drawings. Figure 1 of those drawings shows the basic functional blocks of the circuit, where the matrix multiplication processor as a whole is depicted as 1. The processor 1 comprises an area of memory 2 of the form of a dual-port RAM, a multiplier 3, a resettable accumulator 4 and a control state machine 5 arranged to act as a control unit for the other circuits.
The matrix multiplication processor 1 is connected by means of a data bus 10 to an external processor 11. When the external processor 11 requires the product of two matrices, it passes first and second matrices across data bus 10 to the matrix multiplication processor 1. They are stored in the memory 2. However, in other embodiments, there is no reason why the processor 11 and the matrix multiplication processor 1 should not be provided on separate chips, or on different architectures.
The matrix multiplication processor 1, the bus 10 and the processor 11 are all typically provided on the same FPGA chip. This is particularly convenient, as common processors are available as FPGA modules, and the matrix multiplication processor can be provided as a peripheral thereto.
The matrices are stored in the memory in a row-wise fashion. Taking, as a non-limiting example, the matrices A and B referred to in the introduction above, then the elements can be stored as shown in the right hand three columns of Figure 3 of the accompanying drawings. Matrix A starts at memory location 0 with element a11, the top-leftmost element, followed by the remaining element in that row, a12, followed by the elements of the next row and so on. The elements of matrix A therefore define a first memory area. Matrix B starts at memory location 9 and is set out in the same fashion, with element b11 first, followed by b12, b21 and b22 in order; the elements of B therefore define a second memory area. A third area of memory is reserved for the elements of the product matrix Y = AB, starting with y11 at position 15 and continuing as with the other matrices.
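A small sketch of this memory layout (only the addresses 0, 9 and 15 come from the text; the numeric contents of A and B here are placeholders, not the values used in the patent's figures) is:

```python
# Memory map used in the example: A occupies addresses 0-5, B occupies 9-12,
# and 15-20 are reserved for Y = AB.
memory = [0] * 32

def store_row_wise(memory, base, matrix):
    """Write a matrix into memory row by row, starting at address `base`."""
    addr = base
    for row in matrix:
        for element in row:
            memory[addr] = element
            addr += 1

A = [[1, 2], [3, 4], [5, 6]]      # 3 x 2, stored at addresses 0..5
B = [[7, 8], [9, 10]]             # 2 x 2, stored at addresses 9..12
store_row_wise(memory, 0, A)
store_row_wise(memory, 9, B)
print(memory[0:6], memory[9:13])  # [1, 2, 3, 4, 5, 6] [7, 8, 9, 10]
```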
The memory locations between each matrix may hold other data and are of no great importance. It is to be noted that each matrix could be held anywhere in the memory as long as the elements within the matrix are stored in the correct order.
The operation of the matrix multiplication processor according to the present embodiment may be demonstrated with respect to Figures 2 and 3. For the purposes of this example only, matrices A and B are set as follows:
[Example values for the matrices A and B are given in a figure in the original document.]
These are example values only; the invention is not restricted to the size or values that can be manipulated thereby.
In the first step 100, the matrices A and B are written into memory by processor 11 as shown in Figure 3 and several variables are initialised. These are:
• BaseA, BaseB: indicate the starting location in memory of the row in matrix A and the column in matrix B that is currently being worked on; initially these values are set to the location of the first element in each matrix and so in the worked example take the values 0 and 9 respectively.
• AddrA, AddrB: indicate the location in memory of the element from each matrix that is currently being used; initially these values are set to BaseA and BaseB respectively, and so in the worked example take the values 0 and 9 respectively.
• AddrAinc, AddrBinc: indicate the number of memory locations to skip when moving along one element in a row of A or a column of B respectively. AddrAinc is set to 1, whereas AddrBinc is set to the number of columns of B; in the worked example, this is 2.
• AddrY: indicates the location in memory to which the current element in the output matrix Y is to be written. Initialised to the location of the first element in Y, so in the worked example is initialised to 15.
• BaseAinc, BaseBinc: indicate the increment in the starting location in memory from row to row in A and column to column in B. BaseAinc is equal to the number of columns in the A matrix (that is, the number of elements to skip to move from one row to the next), whereas BaseBinc is equal to 1.
• Count 0: a decrementing counter which indicates how many elements in the present row of the output matrix are left to be processed. Zero-based, so initialised to the number of elements in a row of A, less 1. The counter is decremented every time an output element is completed; this counter indicating zero implies the start of a new row in the output matrix.
• Count 1: a decrementing counter which indicates how many pairs of elements in the input matrices are left to be processed within the same output element. Zero-based, so initialised to the number of columns in A, less 1. This counter is decremented every time a pair of values is loaded and multiplied, and it indicating zero implies that an output element has been calculated.
• Count 2: a decrementing counter which indicates how many output elements remain to be calculated. Zero-based, so initialised to the number of elements in the output matrix less 1.
• ld_acc: controls whether to load a new value into the accumulator, and hence reset it, or to add the value to the running total. Initialised to "1", that is to load the next value calculated into the accumulator. Uses Count1 as a counter - the "Ld/acc" counter - to determine whether it is time to load or accumulate.
• WE: controls writing of the running total to the memory. Is initially set to "0" to inhibit writing to memory until the first value has been calculated. Once all the variables have been initialised, control passes to step 102.
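Gathering these variables together, a software model of the initialisation for the worked example (a sketch only; the dictionary form and the helper name init_state are assumptions, not part of the patent) could look like this:

```python
# Initial variable values for the worked example: A is 3x2 stored from address 0,
# B is 2x2 stored from address 9, and Y starts at address 15.
def init_state(rows_a, cols_a, cols_b, base_a=0, base_b=9, base_y=15):
    return {
        "BaseA": base_a, "BaseB": base_b,   # start of the current row of A / column of B
        "AddrA": base_a, "AddrB": base_b,   # current element pointers
        "AddrAinc": 1, "AddrBinc": cols_b,  # step along a row of A / down a column of B
        "AddrY": base_y,                    # next output location
        "BaseAinc": cols_a, "BaseBinc": 1,  # row-to-row / column-to-column strides
        "Count0": cols_b - 1,               # elements left in this output row, zero-based
        "Count1": cols_a - 1,               # pairs left in this output element, zero-based
        "Count2": rows_a * cols_b - 1,      # output elements left in total, zero-based
        "ld_acc": 1,                        # load (reset) the accumulator first
        "WE": 0,                            # inhibit writes until a value is ready
    }

print(init_state(rows_a=3, cols_a=2, cols_b=2))
```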
The values at the memory locations indicated by AddrA and AddrB are passed to the multiplier 3 and multiplied at step 104. At step 106, the accumulator 4 determines whether the ld_acc flag is at "1" or "0"; if it is at "1" then the product from the multiplier 3 is loaded into the accumulator 4, thereby resetting it. In such a case, once the accumulator 4 has been reset, the ld_acc flag is reset to "0" to enable further values to be added to the running total. If the flag is at "0" then the product is added to the running total stored by the accumulator.
If writing is enabled - that is, if WE is set to "1" - then the running total from the accumulator 4 is stored in the location defined by AddrY at step 108. Otherwise, the running total is kept for the next cycle.
Now that the necessary data have been calculated, it is necessary to update the counters and addresses for the next cycle. At step 110, it is determined whether Count 1 is at zero (and hence whether the calculation of the present output element is complete); if it is not, then Count1 is decremented by 1 at step 112 and the values of AddrA and AddrB are incremented at step 114 by AddrAinc and AddrBinc respectively. If the new value of Count1 is zero, then the element will be completed in the next cycle and the write enable flag WE is set to "1".
If Count1 was zero, then the current element is complete. At step 116, Count1 is reset to the number of columns in A, less 1. Given that the running total stored by the accumulator will have been written at step 108, the write enable flag WE may now be lowered to state "0". Count2 is decremented by 1, as there is now one fewer element to calculate, and AddrY is incremented to point to the next write address in the output matrix Y.
If Count0 is also zero (step 118), the present output row is complete. At step 120, Count0 is reset to the number of columns in B, less 1; BaseA is incremented by BaseAinc and BaseB is reset to the starting position for matrix B. The matrix multiplication processor is now ready to calculate the next row. If the row is not complete, then at step 122 BaseB is incremented by BaseBinc (that is, 1) and Count0 is decremented by 1, so as to be ready to calculate the next element in the row. In either case, at step 124 the read addresses - AddrA and AddrB - are reset to the new values of BaseA and BaseB, ready for calculation of the next element.
Finally, the method checks whether Count2 has reached zero (step 130). If it has not, then there are still elements to calculate and so the method returns to step 102 for the next cycle. If it has reached zero, then the matrix multiplication is complete and all elements of the output matrix Y are stored in the appropriate area of memory 2. In the final step 132, a done flag is raised to indicate to the processor 11 that the results are ready.
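The whole of this control flow can be summarised in a short software emulation. The sketch below (Python, building on the hypothetical init_control_variables helper above) follows the decision structure of steps 102 to 132 on a flat memory list; it illustrates the method rather than the register-transfer behaviour of the hardware, and its termination test on Count2 is written so that the loop completes all output elements given the zero-based initialisation assumed above.

```python
def matrix_multiply(mem, base_a, base_b, base_y, rows_a, cols_a, cols_b):
    """Emulate the control unit: write Y = A * B into mem starting at base_y.

    A (rows_a x cols_a) and B (cols_a x cols_b) are stored row-wise in the
    flat list 'mem' at base_a and base_b.  Assumes cols_a >= 2, as in the
    worked example, so that WE is always raised before an element completes.
    """
    v = init_control_variables(base_a, base_b, base_y, rows_a, cols_a, cols_b)
    accum = 0
    while True:                                      # step 102: start of a cycle
        product = mem[v["AddrA"]] * mem[v["AddrB"]]  # step 104: read both elements and multiply
        if v["ld_acc"]:                              # step 106: load (reset) or accumulate
            accum = product
            v["ld_acc"] = 0
        else:
            accum += product
        if v["WE"]:                                  # step 108: write the completed running total
            mem[v["AddrY"]] = accum
        if v["Count1"] != 0:                         # step 110: element not yet complete
            v["Count1"] -= 1                         # step 112
            v["AddrA"] += v["AddrAinc"]              # step 114
            v["AddrB"] += v["AddrBinc"]
            if v["Count1"] == 0:
                v["WE"] = 1                          # last pair comes next cycle, so enable writing
        else:                                        # current element is complete
            v["Count1"] = cols_a - 1                 # step 116
            v["WE"] = 0
            v["ld_acc"] = 1                          # next product starts a new running total
            v["Count2"] -= 1
            v["AddrY"] += 1
            if v["Count0"] == 0:                     # step 118: output row complete
                v["Count0"] = cols_b - 1             # step 120
                v["BaseA"] += v["BaseAinc"]
                v["BaseB"] = base_b
            else:                                    # step 122: next element in the same row
                v["BaseB"] += v["BaseBinc"]
                v["Count0"] -= 1
            v["AddrA"] = v["BaseA"]                  # step 124
            v["AddrB"] = v["BaseB"]
        if v["Count2"] < 0:                          # step 130: nothing left to calculate; a counter
            return True                              # loaded with the full element count would instead
                                                     # use the "reached zero" test of the description
```

For example, multiplying a hypothetical 3x2 matrix by a 2x2 matrix (values chosen for this sketch, not those of Figure 3):

```python
mem = [1, 2, 3, 4, 5, 6,  7, 8, 9, 10,  0, 0, 0, 0, 0, 0]   # A at 0, B at 6, Y at 10
matrix_multiply(mem, base_a=0, base_b=6, base_y=10, rows_a=3, cols_a=2, cols_b=2)
print(mem[10:16])   # -> [25, 28, 57, 64, 89, 98]
```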
Figure 3 of the accompanying drawings shows the operation of this method on a cycle-by-cycle basis. It shows the starting values of BaseA, BaseB, BaseAinc, BaseBinc, Count0, Count1, Count2, AddrAinc, AddrBinc and AddrY, as discussed above, in the first two rows. It also shows the contents of the memory 2 in the right-hand two columns.
In the main body of Figure 3, the operation of the matrix multiplication processor is shown on a cycle-by-cycle basis. It can be seen that, in each cycle, DataA and DataB are read from the locations in memory 2 indicated by AddrA and AddrB. These are multiplied together in column "*", which is indicative of the output of the multiplier 3. Depending on whether the ld_acc flag is at "0" or "1", this product is either loaded into or added to the running total "Accum", representative of the running total stored in the accumulator 4. If the WE flag is enabled, this running total is written to the area of memory indicated by AddrY, as an element of the output matrix.
The operation of the counters and address variables is also shown in Figure 3. It can be seen that Count1 counts down through the cycles required to generate an output element; in this case from 1 to 0, as two calculations are required. The column "C1 == 0" indicates when Count1 reaches zero. Every time Count1 is reset, Count0 decrements to count down the number of elements in a row of the output matrix; there are two such elements, so Count0 also counts down from 1 to 0. The column "C0 == 0" indicates when both Count1 and Count0 are zero. Count2 counts down the total number of output elements; in this case it is counted down from 6, the total required, to 0, which indicates that the method has finished and thereby causes the done flag to be raised.
BaseA is incremented every time a row of the output matrix is finished, so as to point at the first element of a new row of A. BaseB, however, is updated every time an output element is completed, so as to point to the first element of the column of B that is currently being processed. Within the calculation for each element, the pointers AddrA and AddrB start from the current values of BaseA and BaseB and are incremented by AddrAinc and AddrBinc. The ld_acc flag is raised for the first calculation within an element, whereas the WE flag is raised for the last. The trace therefore shows the multiplications and summations required to calculate a matrix multiplication; indeed, the values of the output matrix are shown in the memory area as they would be after the processor had finished. It can be seen that the method used has correctly calculated:
[Equation image imgf000020_0001: the worked-example matrix product, whose output values match those written to the memory area in Figure 3.]
Of course, in reality the matrix multiplication processor would be used on far larger matrices, but this example suffices to demonstrate the method used.
The fact that the output matrix Y is stored locally is useful; the processor 11 can now demand that further calculations be performed on Y simply by setting the starting address of Y as that of A or B in the next calculation. This avoids having to pass the matrices over the data bus 10 unnecessarily if the intermediate result is not itself required by the processor 11.
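Continuing the software sketch above (with hypothetical values and addresses, not those of the worked example), chaining two multiplications amounts to passing the base address of the first result as an input base address of the second call:

```python
# Memory layout for this sketch: A at 0, B at 6, Y1 at 10, C at 16, Y2 at 20.
mem = ([1, 2, 3, 4, 5, 6]      # A, 3x2, stored row-wise
       + [7, 8, 9, 10]         # B, 2x2
       + [0] * 6               # Y1 = A * B
       + [1, 0, 0, 1]          # C, 2x2 (the identity, so Y2 should equal Y1)
       + [0] * 6)              # Y2 = Y1 * C

matrix_multiply(mem, base_a=0,  base_b=6,  base_y=10, rows_a=3, cols_a=2, cols_b=2)
# The intermediate result Y1 stays in the local memory and is reused directly:
matrix_multiply(mem, base_a=10, base_b=16, base_y=20, rows_a=3, cols_a=2, cols_b=2)
print(mem[20:26])   # -> [25, 28, 57, 64, 89, 98]
```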
It can therefore be seen that the outputs of the control unit 5 are the addresses AddrA and AddrB plus the write enable flag WE, passed to the memory 2, the ld_acc flag passed to the accumulator 4 and the done flag passed to the processor 11.
Once the done flag has been raised, one of the memory ports 12 normally connected to the multiplier 3 is multiplexed with the data bus 10 to form an output for the matrix multiplication processor, so that the processor 11 can read the output matrix in memory if desired.

Figure 4 shows a car 200 fitted with the processor 11 and matrix multiplication processor 1 of Figure 1. The car 200 is fitted with a lane detection system, such as that disclosed in EP 1 649 334, comprising a video camera 204 and a processing unit 203. This processing unit is fitted with both the matrix multiplication processor 1 and the processor 11 of Figure 1. The matrix multiplication processor 1 can then be used to accelerate the necessary processing of matrices within the lane detection algorithm. It has been found that an acceleration of up to five times is possible by using such a processing unit 203 for relatively small matrices, and it is expected that the performance gains on bigger matrices could be even greater.

Claims

1. A matrix multiplication processor for multiplying a first matrix with a second matrix to form an output matrix, each matrix comprising a plurality of elements in rows and columns, the matrix processor comprising:
• a memory having a first area in which, in use, the first matrix is stored, a second area in which, in use, the second matrix is stored and a third area to which, in use, the output matrix is written;
• an accumulator having an input and arranged to, in use, store a running total and add values at its input to the running total;
• a multiplier having an input for two values and an output and arranged to, in use, multiply two values at its input and output the product of the two values at its output; and
• a control unit arranged to control the operation of the matrix multiplication processor;
in which the control unit is arranged, such that in use, it repeatedly chooses an element from the first matrix and an element from the second matrix, causes the elements thus chosen to be passed to the multiplier and causes the resultant product to be passed to the accumulator to be added to the running total; wherein the elements of each matrix are stored, in use, in sequential memory locations of the respective memory area in row-wise and/or column-wise order; and in which the control unit selects the elements from the first and second matrix by moving stepwise through the first and second memory areas.
2. The matrix multiplication processor of claim 1 in which the control unit is arranged so as to, in use, cause the running total to be written to an appropriate location within the third area of memory.
3. The matrix multiplication processor of claim 1 or claim 2 in which the matrix multiplication processor is arranged such that, in use, the elements of each matrix are stored such that each row or column is stored in sequential memory locations, the elements within each row or column also being stored in sequential memory locations.
4. The matrix multiplication processor of claim 3 in which the elements of the first row of each matrix are stored in order starting from the top-left element of each matrix, followed by the second row.
5. The matrix multiplication processor of claim 3 or claim 4 in which the control unit is arranged such that, when moving stepwise through the memory to choose elements within whichever of row or column the elements are stored it steps through successive memory locations.
6. The matrix multiplication processor of any of claims 3 to 5 in which the control unit is arranged such that, when moving stepwise through the memory to choose elements from successive of the rows or columns that the elements are stored in, it steps through memory locations separated by the number of rows or columns as appropriate.
7. The matrix multiplication processor of any preceding claim in which the control unit comprises a pointer for each matrix, the pointer indicating, in use, the memory address of the element of each matrix that is presently chosen.
8. The matrix multiplication processor of claim 7 in which the control unit comprises pointer incrementing means arranged to, in use, successively increment each pointer by a constant.
9. The matrix multiplication processor of claim 7 or claim 8 in which the control unit comprises a base address store for first and second matrices, which is arranged so as to store, in use, the address of the first element of the row or column through which the control unit is stepping at a given time.
10. The matrix multiplication processor of claim 9 in which the control unit comprises base address incrementing means, which is arranged to increment each base address by a constant value when switching from one row or column to the next.
11. The matrix multiplication processor of any preceding claim in which the accumulator comprises a dedicated memory for the running total.
12. The matrix multiplication processor of claim 11 in which the dedicated memory is separate from the memory in which the matrices are stored in use.
13. The matrix multiplication processor of any preceding claim, which is implemented in a Field Programmable Gate Array (FPGA) circuit.
14. The matrix multiplication processor of any preceding claim which comprises a port for communicating with an external processor, whereby the processor is arranged such that first and second matrices can be input at the port in use so as to be written to the memory and so that the output matrix, in use, may be output at the port.
15. The matrix multiplication processor of any preceding claim, arranged such that it can use the output matrix generated in one operation as a first or second matrix in a subsequent operation, with the output matrix being retained in memory, in use, for this purpose.
16. The matrix multiplication processor of any preceding claim, arranged such that the first and second matrices and the output matrix are stored, in use, in the memory in the same order.
17. The matrix multiplication processor of any preceding claim in which the control unit comprises a starting address store indicative in use of the first address in each matrix.
18. The matrix multiplication processor of any preceding claim in which the control unit is arranged to operate in discrete cycles.
19. The matrix multiplication processor of claim 18 in which the control unit is arranged such that, in use in one cycle, the control unit causes a pair of elements to be read from the memory and output to the multiplier; causes the multiplier to multiply the two elements output to it in the previous cycle and output the product to the accumulator; causes the accumulator to add the product calculated in the previous cycle to the running total; and causes the running total from the previous cycle to be written to the memory.
20. The matrix multiplication processor of any preceding claim in which the memory comprises a dual port RAM.
21. The matrix multiplication processor of claim 20 in which each port has an input and an output; the output of both ports being operatively coupled to the multiplier.
22. The matrix multiplication processor of claim 21 in which an input of one of the ports is operatively coupled to an output of the accumulator, the output of the accumulator being arranged to output the running total when commanded to do so by the control unit.
23. The matrix multiplication processor of claim 21 or claim 22 as dependent from claim 14 in which one or both inputs of the ports of the RAM are connected to the port of the processor so as to enable the first and second matrices to be loaded into the memory in use.
24. The matrix multiplication processor of claim 23 as dependent from claim 22 in which the accumulator is coupled to the same input as the processor port is; the processor further comprising a multiplexer coupling both accumulator and processor port to the input of the port of the RAM.
25. A processing apparatus, comprising a processor coupled to a matrix multiplication processor according to any preceding claim.
26. The processing apparatus of claim 25, in which the processor and the matrix processor are provided on discrete chips connected by a data bus.
27. The processing apparatus of claim 25 in which the processor and the matrix processor are implemented on a single chip.
28. The processing apparatus of any of claims 25 to 27 in which one or both of the processor and matrix multiplication processor or the single chip are implemented in FPGA.
29. A processing apparatus comprising a first processor and a second processor connected by a data bus, in which the first processor is a general-purpose processor and the second processor is more efficient than the first at calculating matrix multiplication.
30. The processing apparatus of claim 29 in which the processing apparatus is arranged such that the second processor is used, in use, only for matrix multiplication.
31. The processing apparatus of claim 29 or claim 30, which is also a processing apparatus according to any of claims 25 to 28.
32. Use of a dedicated matrix-multiplication co-processor in calculating matrix multiplication in combination with a general-purpose processor for other operations.
33. A video processing apparatus, comprising the processing apparatus of any of claims 25 to 31.
34. The video processing apparatus of claim 33, comprising a video camera, the video processing apparatus being arranged to, in use, capture images from the video camera and convert such images into matrices, which are then processed by the matrix multiplication processor.
35. The video processing apparatus of claim 34, which is arranged to capture images of a road and determine the position of lane boundaries in the captured images.
36. A method of matrix multiplication in a processing system, the multiplication being of a first matrix with a second matrix to form an output matrix, each matrix comprising a plurality of elements in rows and columns, the method comprising: storing the first matrix in a first area of memory and the second matrix in a second area of memory, the elements of each matrix being stored, in use, in sequential memory locations of the respective memory area in row-wise and/or column-wise order; repeatedly choosing an element from the first matrix and an element from the second matrix, multiplying the elements thus chosen to form a product and adding that product to a running total; in which the elements are chosen from the first and second matrix by moving stepwise through the first and second memory areas.
37. The method of claim 36, further comprising writing the running total to an appropriate location within a third area of memory.
38. The method of claim 36 or claim 37 in which the elements of each matrix are stored such that each row or column is stored in sequential memory locations, the elements within each row or column also being stored in sequential memory locations.
39. The method of any of claims 36 to 38 in which, when moving stepwise through the memory to choose elements within whichever of row or column the elements are stored, the method steps through successive memory locations.
40. The method of any of claims 36 to 39 in which, when moving stepwise through the memory to choose elements from successive of the rows or columns that the elements are stored in, the method steps through memory locations separated by the number of rows or columns as appropriate.
41. The method of any of claims 36 to 40 in which, once one element in the output matrix has been calculated, base addresses indicating the starting position of that row or column are updated.
42. The method of any of claims 36 to 41 in which the method comprises the step of using the output matrix generated in one operation as a first or second matrix in a subsequent operation, with the output matrix being retained in memory for this purpose.
43. The method of any of claims 36 to 42 in which the method is carried out in discrete cycles.
44. The method of claim 43 in which, in one cycle, a pair of elements are read from the memory; the two elements read in the previous cycle are multiplied and the product outputted to the accumulator; the product calculated in the previous cycle is added to the running total; and the running total from the previous cycle is written to the memory.