WO2008037975A2

WO2008037975A2 - Matrix multiplication

Info

Publication number: WO2008037975A2
Application number: PCT/GB2007/003635
Authority: WO
Inventors: Martin John Thompson
Original assignee: TRW Ltd
Current assignee: TRW Ltd
Priority date: 2006-09-26
Filing date: 2007-09-26
Publication date: 2008-04-03
Anticipated expiration: 2009-03-26
Also published as: EP2067100A2; WO2008037975A3; GB0618921D0

Abstract

A matrix multiplication processor (1) for multiplying a first matrix with a second matrix to form an output matrix, each matrix comprising a plurality of elements in rows and columns, the matrix processor comprising: a memory (2) having a first area in which, in use, the first matrix is stored, a second area in which, in use, the second matrix is stored and a third area to which, in use, the output matrix is written, an accumulator (4) having an input and arranged to, in use, store a running total and add values at its input to the running total; a multiplier (3) having an input for two values and an output and arranged to, in use, multiply two values at its input and output the product of the two values at its output; and a control unit (5) arranged to control the operation of the matrix multiplication processor; in which the control unit is arranged, such that in use, it repeatedly chooses (102) an element from the first matrix and an element from the second matrix, causes the elements thus chosen to be passed (104) to the multiplier (3) and causes (106) the resultant product to be passed to the accumulator (4) to be added to the running total; wherein the elements of each matrix are stored, in use, in sequential memory locations of the respective memory area in row-wise and/or column- wise order; and in which the control unit selects the elements from the first and second matrix by moving stepwise (112-130) through the first and second memory areas. Also, a processing apparatus comprising a first processor (11) and a second processor (1) connected by a data bus (10), in which the first processor (11) is a general-purpose processor and the second processor (1) is more efficient than the first at calculating matrix multiplication. The processors thus disclosed may advantageously be used in video processing situations, such as road lane detection from vehicles.

Description

MATRIX MULTIPLICATION

This invention relates to a matrix multiplication processor, processing apparatus using such a matrix multiplication processor, use of a matrix multiplication co-processor, a video processing apparatus using a matrix multiplication processor and a matrix multiplication method.

Matrix multiplication is a well-known mathematical process whereby two matrices are multiplied together to form a third matrix. This can be represented by Y = AB where Y is the product of matrices A and B. It should be noted that matrix multiplication is non-commutative (that is, AB is not generally equal to BA) and that matrix multiplication is only well defined when the number of columns in the first matrix is equal to the number of rows in the second matrix. Multiplication of a m x n matrix by a n x o matrix results in a m x o matrix.

The elements of the product matrix obtained by matrix multiplication are defined by:

(AB) 'ij =∑A_ιrB_rj r=\ where A and B are the matrices to be multiplied together and subscripts indicate the row and column elements of the relevant matrix.

Taking the example of a 3x2 matrix multiplied by a 2x2 matrix:

where:

y_2l = a_2lb_n + a₂₂b₂₁ y₂₂ = a₂φ_l2 + a₂₂b₂₂ ^' y_3l = a₃₁b_n + a₃₂b₂₁

Matrix multiplication can therefore be seen as taking a row from the first matrix and a column from the second matrix, matching successive elements from each to form pairs (that is, first element with first, second with second and so on), multiplying the pairs of elements thus chosen, adding all the products together to generate one element of the product matrix then repeating for all the elements of the row and column then adding together all the products. This generates an element of the product matrix and is repeated for all combinations of rows and columns to form the product matrix.

Matrix multiplication is especially important in image processing, where images, or data derived therefrom, can be represented by large matrices, each element of a matrix representing the intensity of a pixel within the image or some other similar data. Matrix multiplication is not a very efficient process on conventional processors, due to the large amount of data that must be passed over the processor's external data bus as successive multiplications are performed. It is necessary to interleave mathematical operation with memory accesses. According to a first aspect of the invention, there is provided a matrix multiplication processor for multiplying a first matrix with a second matrix to form an output matrix, each matrix comprising a plurality of elements in rows and columns, the matrix processor comprising:

• a memory having a first area in which, in use, the first matrix is stored, a second area in which, in use, the second matrix is stored and a third area to which, in use, the output matrix is written,

• an accumulator having an input and arranged to, in use, store a running total and add values at its input to the running total;

• a multiplier having an input for two values and an output and arranged to, in use, multiply two values at its input and output the product of the two values at its output; and

• a control unit arranged to control the operation of the matrix multiplication processor; in which the control unit is arranged, such that in use, it repeatedly chooses an element from the first matrix and an element from the second matrix, causes the elements thus chosen to be passed to the multiplier and causes the resultant product to be passed to the accumulator to be added to the running total; wherein the elements of each matrix are stored, in use, in sequential memory locations of the respective memory area in row-wise and/or column- wise order; and in which the control unit selects the elements from the first and second matrix by moving stepwise through the first and second memory areas.

Such a processor has been found to be more efficient than the prior art architecture, where elements must be retrieved from memory and passed through a bus, operated upon and then passed back to the memory through the bus for each operation. The stepwise operation through contiguous memory locations allows the control unit to merely need step through the elements in the memory; the subsequent use of a dedicated accumulator keeping a running total and passing elements directly from a multiplier to the accumulator has been found to be particularly efficient.

Typically, the control unit would also be arranged so as to, in use, cause the running total to be written to an appropriate location within the third area of memory.

In use, the elements of each matrix may be stored such that each row or column is stored in sequential memory locations, the elements within each row also being stored in sequential memory locations. This can improve the efficiency of access of the elements as will be seen below. In the preferred embodiment, the elements of the first row are stored in order starting from the top-left element of each matrix, followed by the second row, but any other arrangement preserving the order of elements and rows is possible.

Given the sequential nature of the elements stored within the memory, the control unit, when moving stepwise through the memory to choose elements within whichever of row or column the elements are stored may therefore step through successive memory locations - the memory address chosen by the control unit will increment or decrement by 1. When moving stepwise through the memory to choose elements from successive of the rows or columns that the elements are stored in, the control unit may step through memory locations separated by the number of rows or columns as appropriate - the memory address chosen by the control unit will increment by a constant (the number of rows or columns). It can be seen from this description that accessing successive elements across rows or columns (as is necessary in matrix multiplication) is a simple process that does not require much processor effort. It should also be noted that decrementing a memory address can also be seen to include incrementing the memory address by a negative amount, so that incrementing should be read to include decrementing also.

Indeed, the control unit may comprise a pointer for each matrix, the pointer indicating, in use, the memory address of the element of each matrix that is presently chosen. The control unit may also comprise pointer incrementing means arranged to, in use, successively increment each pointer by a constant.

The control unit may also comprise a base address store for first and second matrices, which is arranged so as to store, in use, the address of the first element of the row or column through which the control element is stepping at a given time. The control unit may also comprise base address incrementing means, which is arranged to increment the base address by a constant value when switching from one row or column to the next.

It can therefore be seen that when calculating one element of the output matrix, the pointer to each of first and second matrix will be incremented in one case by 1 and in the other case by the number of columns in the first matrix (which is, of course, equal to the number of rows in the second matrix) . Once one element in the output matrix has been calculated, the base addresses can be updated; when working through a row of output elements, the base address for the second matrix will be incremented by the appropriate increment for moving along a column; when the time comes to start a new row, the base address for the first matrix will be incremented by the equivalent of one row whilst the base address for the second matrix will be reset to the start of the first column; this assumes that a row-wise calculation of the output matrix is performed, but for a column- wise calculation the reverse is true. Preferably, the accumulator comprises a dedicated memory for the running total; this dedicated memory is preferably separate from the memory in which the matrices are stored in use. Accordingly, by providing an accumulator with a dedicated memory, any problems with interleaving mathematical operations with memory accesses can be reduced.

Preferably, the processor is implemented in a Field Programmable Gate Array (FPGA) circuit. This is particularly useful, as it is a flexible technology particularly suitable for the generation of new architectures such as that described above. Preferably, the processor also comprises a port for communicating with an external processor, whereby the processor is arranged first and second matrices can be input at the port in use so as to be written to the memory and so that the output matrix, in use, may be output at the port.

The processor may be arranged such that it can use the output matrix generated in one operation as a first or second matrix in a subsequent operation, with the output matrix being retained in memory, in use, for this purpose. This allows further calculations to be performed on the matrix without the need for passing the output matrix back out of the processor.

To this end, the first and second matrices and output matrices are preferably stored in the memory in the same order - that is, they are both stored row-wise or column- wise. The control unit may be provided with a starting address store indicative in use of the first address in each matrix; to perform a calculation on a previous output matrix, the starting address store for the first or second matrix need only be updated to refer to the first address of the previous output matrix. The control unit may be arranged to operate in discrete, cycles. In one cycle, the control unit may be arranged to cause a pair of elements to be read from the memory and output to the multiplier; to cause the multiplier to multiply the two elements output to it in the previous cycle and output the product to the accumulator; to cause the accumulator to add the product calculated in the previous cycle to the running total; and to cause the running total from the previous cycle to be written to the memory. By operating in discrete cycles, the control unit may more reliably operate, as it is clear when each operation will occur. Furthermore, this "pipelining" of instructions may allow for faster operation.

The memory may comprise a dual port RAM (or a multiport RAM, of which at least two ports are used; herein by dual port we include multi- port although this need not be the case), each port typically having an input and an output. The output of both ports may be operatively coupled to the multiplier; the use of a dual port RAM allows for both values to be multiplied to be read from memory at the same time. However, it is possible to use a single port RAM although with a speed penalty.

An input of one of the ports may be operatively coupled to an output of the accumulator, the output of the accumulator being arranged to output the running total when commanded to do so by the control unit. Furthermore, one or both inputs of the ports of the RAM may be connected to the port of the processor so as to enable the first and second matrices to be loaded into the memory in use.

The accumulator may be coupled to the same input as the processor port is; in such a case the processor may further comprise a multiplexer coupling both accumulator and processor port to the input of the port of the RAM. According to a second aspect of the invention, there is provided a processing apparatus, comprising a processor coupled to a matrix multiplication processor according to a first aspect of the invention. This allows a conventional processor to use the matrix processor to perform matrix multiplication, thereby speeding up the processing of matrices, whilst remaining the usual functions of conventional processor for other functions .

The processor and the matrix processor may be provided on discrete chips connected by a data bus, or may be implemented on a single chip. Conveniently, one or both of the processor and matrix multiplication processor or the single chip may be implemented in FPGA; this is especially convenient given that standard processors are available in FPGA and so the matrix processor may be designed as an on-chip "peripheral" to the processor.

According to a third aspect of the invention, there is provided a processing apparatus comprising a first processor and a second processor connected by a data bus, in which the first processor is a general-purpose processor and the second processor is more efficient than the first at calculating matrix multiplication. Such an apparatus makes use of a special "co-processor" in order to calculate matrix multiplication, an otherwise processor-intensive operation. Preferably, the second processor is used only for, in use (and save perhaps for ancillary matters) matrix multiplication. The processing apparatus may be according to the second aspect of the invention.

According to a fourth aspect of the invention, there is provided use of a dedicated matrix-multiplication co-processor in calculating matrix multiplication in combination with a general-purpose processor for other operations.

According to a fifth aspect of the invention, there is provided a video processing apparatus, comprising the processing apparatus of the second or third aspects of the invention. The video processing apparatus may comprise a video camera, and the video processing apparatus may be arranged to, in use, capture images from the video camera, convert such into matrices, which are then processed by the matrix multiplication processor. The video processing apparatus may be arranged to capture images of a road and determine, possibly by means of a Kalman filter, the position of lane boundaries in the captured images .

According to a sixth aspect of the invention, there is provided a method of matrix multiplication in a processing system, the multiplication being of a first matrix with a second matrix to form an output matrix, each matrix comprising a plurality of elements in rows and columns, the method comprising: storing the first matrix in a first area of memory and the second matrix in a second area of memory, the elements of each matrix being stored, in use, in sequential memory locations of the respective memory area in row-wise and/or column-wise order; repeatedly choosing an element from the first matrix and an element from the second matrix, multiplying the elements thus chosen to form a product and adding that to a running total; in which the elements are chosen from the first and second matrix by moving stepwise through the first and second memory areas.

Typically, the method also comprising writing the running total to an appropriate location within a third area of memory. In use, the elements of each matrix may be stored such that each row or column is stored in sequential memory locations, the elements within each row also being stored in sequential memory locations. This can improve the efficiency of access of the elements as will be seen below. In the preferred embodiment, the elements of the first row are stored in order starting from the top-left element of each matrix, followed by the second row, but any other arrangement preserving the order of elements and rows is possible.

Given the sequential nature of the elements stored within the memory, when moving stepwise through the memory to choose elements within whichever of row or column the elements it is therefore possible to simply step through successive memory locations - the memory address chosen will increment or decrement by 1. When moving stepwise through the memory to choose elements from successive of the rows or columns that the elements are stored in, it is therefore possible to step through memory locations separated by the number of rows or columns as appropriate - the memory address chosen will increment by a constant (the number of rows or columns) . It can be seen from this description that accessing successive elements across rows or columns (as is necessary in matrix multiplication) is a simple process that does not require much processor effort. It should also be noted that decrementing a memory address can also be seen to include incrementing the memory address by a negative amount, so that incrementing should be read to include decrementing also.

It can therefore be seen that when calculating one element of the output matrix, a pointer to each of first and second matrix may be incremented in one case by 1 and in the other case by the number of columns in the first matrix (which is, of course, equal to the number of rows in the second matrix) . Once one element in the output matrix has been calculated, base addresses indicating the starting position of that row or column can be updated; when working through a row of output elements, the base address for the second matrix will be incremented by the appropriate increment for moving along a column; when the time comes to start a new row, the base address for the first matrix will be incremented by the equivalent of one row whilst the base address for the second matrix will be reset to the start of the first column; this assumes that a row-wise calculation of the output matrix is performed, but for a column- wise calculation the reverse is true.

The method may comprise the step of using the output matrix generated in one operation as a first or second matrix in a subsequent operation, with the output matrix being retained in memory for this purpose. This allows further calculations to be performed on the matrix without the need for passing the output matrix back out of the processor.

To this end, the first and second matrices and output matrices are preferably stored in the memory in the same order - that is, they are both stored row-wise or column- wise. The control unit may be provided with a starting address store indicative in use of the first address in each matrix; to perform a calculation on a previous output matrix, the starting address store for the first or second matrix being updated in such a case to refer to the first address of the previous output matrix.

The method may be carried out in discrete cycles. In one cycle, the a pair of elements may be read from the memory; the two elements read in the previous cycle may be multiplied and the product outputted to the accumulator; the product calculated in the previous cycle may be added to the running total; and the running total from the previous cycle may be written to the memory. By operating in discrete cycles, the method may more reliably operate, as it is clear when each operation will occur. There now follows by way of example only a description of an embodiment of the invention, described with reference to the accompanying drawings in which:

Figure 1 is a diagram showing schematically a matrix multiplication processor according to an embodiment of the invention;

Figure 2 is a flow chart showing the operation of the processor of

Figure 1;

Figure 3 shows an example calculation as performed by the matrix multiplication processor of Figure 1; and

Figure 4 shows a car fitted with the processor of Figure 1.

A matrix multiplication processor according to an embodiment of the invention is shown in the accompanying drawings. Figure 1 of those drawings shows the basic functional blocks of the circuit, where the matrix multiplication processor as a whole is depicted as 1. The processor 1 comprises an area of memory 2 of the form of a dual-port RAM, a multiplier 3, a resettable accumulator 4 and a control state machine 5 arranged to act as a control unit for the other circuits.

The matrix multiplication processor 1 is connected by means of a data bus 10 to an external processor 11. When the external processor 11 requires the product of two matrices, it passes first and second matrices across data bus 10 to the matrix multiplication processor 1. They are stored in the memory 2. However, in other embodiments, there is no reason why the processor 11 and the matrix multiplication processor 1 should not be provided on separate chips, or on different architectures.

The matrix multiplication processor 1, the bus 10 and the processor 11 are all typically provided on the same FPGA chip. This is particularly convenient, as common processors are available as FPGA modules, and the matrix multiplication processor can be provided as a peripheral thereto .

The matrices are stored in the memory in a row- wise fashion. Taking, as a non-limiting example, the matrices A and B referred to in the introduction above, then the elements can be stored as shown in the right hand three columns of Figure 3 of the accompanying drawings. Matrix A starts at memory location 0 with element a_n, the top-leftmost element, followed by the remaining element in that row, a₁₂, followed by the elements of the next row and so on. The elements of matrix A therefore define a first memory area. Matrix B starts at memory location 9 and is set out in the same fashion, with element b_n first, followed by b₁₂, b₂₁ and b₂₂ in order; the elements of B therefore define a second memory area. A third area of memory is reserved for the elements of the product matrix Y = AB, starting with V₁₁ at position 15 and continuing as with the other matrices .

The memory locations between each matrix may hold other data and are of no great importance. It is to be noted that each matrix could be held anywhere in the memory as long as the elements within the matrix as stored in the correct order.

The operation of the matrix multiplication processor according to the present embodiment may be demonstrated with respect to Figures 2 and 3. For the purposes of this example only, matrices A and B are set as follows :

These are example values only; the invention is not restricted to the size or values that can be manipulated thereby.

In the first step 100, the matrices A and B are written into memory by processor 11 as shown in Figure 3 and several variables are initialised. These are:

• BaseA, BaseB: indicate the starting location in memory of the row in matrix A and the column in matrix B that is currently being worked on; initially these values are set to the location of the first element in each matrix and so in the worked example take the values 0 and 9 respectively.

• AddrA, AddrB: indicate the location in memory of the element from each matrix that is currently being used; initially these values are set to BaseA and BaseB respectively, and so in the worked example take the values 0 and 9 respectively.

• AddrAinc, AddrBinc : indicate the number of memory locations to skip moving along one element in a row of A or a column of B respectively. AddrAinc is set to 1, whereas AddrBinc is set to the number of columns of B; in the worked example, this is 2.

• AddrY: indicate the location in memory to which the current element in the output matrix Y is to be written. Initialised to the location of the first element in Y, so in the worked example is initialised to 15.

• BaseAinc, BaseBinc: indicate the increment in the starting location in memory from row to row in A and column to column in B. BaseAinc is equal to the number of columns in the A matrix

(that is, that is the number of elements to skip to move from one row to the next) , whereas BaseBinc is equal to 1.

• Count 0 : a decrementing counter which indicates how many elements in the present row of the output matrix are left to be processed. Zero-based, so initialised to the number of elements in a row of A, less 1. The counter is decremented every time an output element is completed; this counter indicating zero implies the start of a new row in the output matrix.

• Count 1 : a decrementing counter which indicates how many pairs of elements in the input matrices are left to be processed within the same output element. Zero-based, so initialised to the number of columns in A, less 1. This counter is decremented every time a pair of values is loaded and multiplied, and it indicating zero implies that an output element has been calculated. • Count 2 : A decrementing counter which indicates how many output elements remain to be calculated. Zero-based, so initialised to the number of elements in the output matrix less 1.

• ld_acc: controls whether to load a new value into the accumulator, and hence reset it, or to add the value to the running total. Initialised to " 1" , that is to load the next value calculated into the accumulator. Uses Countl as a counter - the "Ld/acc" counter - to determine whether it is time to load or accumulate.

• WE: controls writing of the running total to the memory. Is initially set to "0" to inhibit writing to memory until the first value has been calculated. Once all the variables have been initialised, control passes to step 102.

The values at the memory locations indicated by AddrA and AddrB are passed to the multiplier 3 and multiplied at step 104. At step 106, the accumulator 4 determines whether the ld_acc flag is at "1" or "0"; if it is at "1" then the product from the multiplier 3 is loaded into the accumulator 4, thereby resetting it. In such a case, once the accumulator

4 had been reset, the ld_acc flag is reset to "0" to enable further values to be added to the running total. If the flag is at "0" then the product is added to the running total stored by the accumulator.

If writing is enabled - that is, if WE is set to " 1" , then the running total from the accumulator 4 is stored in the location defined by AddrY at step 108. Otherwise, the running total is kept for the next cycle.

Now that the necessary data have been calculated, it is necessary to update the counters and addresses for the next cycle. At step 110, it is determined whether Count 1 is at zero (and hence whether the calculation of the present output element is complete) ; if it is not, then Countl is decremented by 1 at step 112 and the values of AddrA and AddrB are incremented at step 114 by AddrAinc and AddrBinc respectively. If the new value of Countl is zero, then the element will be completed in the next cycle and the write enable flag WE is set to " 1 " .

If Countl was zero, then the current element is complete. At step 116, Countl is reset to the number of columns in A, less 1. Given that the running total stored by the accumulator will have been stored at step 108, the write enable flag WE may now be lowered to state "0" . Count 2 is decremented by 1 as there are now one fewer elements to calculate, and AddrY is incremented to point to the next write address in the output matrix Y.

If Count 0 is also zero (step 118), the present output row is complete. At step 120, Count 0 is reset to the number of columns in B, less 1; BaseA is incremented by BaseAinc and BaseB is reset to the starting position for matrix B. The matrix multiplication processor is now ready to calculate the next row. If the row is not complete, then at step 122 BaseB is incremented by BaseBinc (that is, 1) and Count 0 is decremented by 1 , so as to be ready to calculate the next element in the row. In either case, at step 124 the read addresses - AddrA and AddrB - are reset to the new values of BaseA and BaseB, ready for calculation of the next element .

Finally, the method checks whether Count 2 has reached zero (step 130) . If it has not, then there are still elements to calculate and so the method returns to step 102 for the next cycle. If it has reached zero, then the matrix multiplication is complete and all elements of the output matrix Y are stored in the appropriate area of memory 2. In the final step 132, a flag done is raised to indicate to processor 11 that the results are ready.

Figure 3 of the accompanying drawings shows the operation of this method on a cycle-by-cycle basis. It shows the starting values of BaseA, BaseB, BaseAinc, BaseBinc , CountO , Countl, Count2 , AddrAinc , AddrB inc, and AddrY as discussed above in the first two rows. It also shows the contents of the memory 2 in the right-hand two columns.

In the main body of Figure 3, the operation of the matrix multiplication processor is shown cycle-by-cycle basis. It can be seen that, in each step DataA and DataB are read from the locations in memory 2 indicated by AddrA and AddrB. These are multiplied together in column "*" , which is indicative of the output of multiplier 3. Depending on whether the ld_acc flag is at "0" or "1", this product will be written to or added to the running total "Accum" , representative of the running total stored in the accumulator 4. If the WE flag is enabled, this running total is written to the area of memory indicated by AddrY, as an element of the output matrix.

The operation of the counters and address variables is also shown in Figure 3. It can be seen that Count 1 counts down through the cycles required to generate an output element; in this case from 1 to 0 as two calculations are required. The column "Cl = = 0" indicates when Count 1 reaches zero. Every time Count 1 is reset, Count 0 decrements to count down the number of elements in a row of the output matrix; this is two elements, so Count 0 counts down from 1 to 0 also. The column "CO = = 0" indicates when both Countl and CountO are zero. Count2 counts down the total number of elements. In this case, it is counted down from 6, the total required, to 0 indicative of the method finishing and thereby causing the raising of the done flag.

BaseA is incremented every time a row of the output matrix is finished to point at the first element of a new row of A. BaseB, however, is updated every time an output element is completed, so as to point to the first element of the column of B that is currently being processed. Within the calculation for each element, the pointers AddrA and AddrB start from the current value of BaseA and BaseB and increment by AddrAinc and AddrB inc . The ld_acc flag is raised for the first calculation within an element, whereas the WE flag is raised for the last. This calculation therefore shows the multiplication and summations required to calculate a matrix multiplication; indeed, the values of the output matrix are shown in the memory area as they would be after the processor had finished. It can be seen that the method used has correctly calculated:

Of course, in reality the matrix multiplication processor would be used on far larger matrices but this example suffices for demonstrating the method used.

The fact that the output matrix Y is stored locally is useful; the processor 11 can now demand further calculations be performed on Y simply by setting the starting address of Y as that of A or B in the next calculation. This avoids having to pass the matrices over the data bus 11 unnecessarily if the intermediate result is not important.

It can therefore be seen that the outputs of the control unit 5 are the addresses AddrA and AddrB plus the write enable flag WE, passed to the memory 2, the ld_acc flag passed to the accumulator 4 and the done flag passed to the processor 11.

Once the done flag has been raised, one of the memory ports 12 normally connected to the multiplier 3 is multiplexed with the data bus 10 to form an output for the matrix multiplication processor, so that the processor 11 can read the output matrix in memory if desired. Figure 4 shows a car 200 fitted with the processor 11 and matrix multiplication processor 1 of Figure 1. The car 200 is fitted with a lane detection system such as that disclosed in EPl 649 334, comprising a video camera 204 and a processing unit 203. This processing unit is fitted with both the matrix multiplication processor 1 and the processor 11 of Figure 1. The lane detection system fitted to the car then can be used to accelerate the necessary processing of matrices within the lane detection algorithm. It has been found that an acceleration of up to five times is possible by using such a processing unit 203 for relatively small matrices and it is expected that the performance gains on bigger matrices could be even greater.

Claims

1. A matrix multiplication processor for multiplying a first matrix with a second matrix to form an output matrix, each matrix comprising a plurality of elements in rows and columns, the matrix processor comprising:

• a memory having a first area in which, in use, the first matrix is stored, a second area in which, in use, the second matrix is stored and a third area to which, in use, the output matrix is written, • an accumulator having an input and arranged to, in use, store a running total and add values at its input to the running total;

• a multiplier having an input for two values and an output and arranged to, in use, multiply two values at its input and output the product of the two values at its output; and • a control unit arranged to control the operation of the matrix multiplication processor; in which the control unit is arranged, such that in use, it repeatedly chooses an element from the first matrix and an element from the second matrix, causes the elements thus chosen to be passed to the multiplier and causes the resultant product to be passed to the accumulator to be added to the running total; wherein the elements of each matrix are stored, in use, in sequential memory locations of the respective memory area in row-wise and/or column-wise order; and in which the control unit selects the elements from the first and second matrix by moving stepwise through the first and second memory areas .

2. The matrix multiplication processor of claim 1 in which the control unit is arranged so as to, in use, cause the running total to be written to an appropriate location within the third area of memory.

3. The matrix multiplication processor of claim 1 or claim 2 in which the matrix multiplication processor is arranged such that, in use, the elements of each matrix are stored such that each row or column is stored in sequential memory locations, the elements within each row also being stored in sequential memory locations.

4. The matrix multiplication processor of claim 3 in which the elements of the first row of each matrix are stored in order starting from the top-left element of each matrix, followed by the second row.

5. The matrix multiplication processor of claim 3 or claim 4 in which the control unit is arranged such that, when moving stepwise through the memory to choose elements within whichever of row or column the elements are stored it steps through successive memory locations.

6. The matrix multiplication processor of any of claims 3 to 5 in which the control unit is arranged such that, when moving stepwise through the memory to choose elements from successive of the rows or columns that the elements are stored in, it steps through memory locations separated by the number of rows or columns as appropriate.

7. The matrix multiplication processor of any preceding claim in which the control unit comprises a pointer for each matrix, the pointer indicating, in use, the memory address of the element of each matrix that is presently chosen.

8. The matrix multiplication processor of claim 7 in which the control unit comprises pointer incrementing means arranged to, in use, successively increment each pointer by a constant.

9. The matrix multiplication processor of claim 7 or claim 8 in which the control unit comprises a base address store for first and second matrices, which is arranged so as to store, in use, the address of the first element of the row or column through which the control element is stepping at a given time.

10. The matrix multiplication processor of claim 9 in which the control unit may comprises base address incrementing means, which is arranged to increment each base address by a constant value when switching from one row or column to the next.

11. The matrix multiplication processor of any preceding claim in which the accumulator comprises a dedicated memory for the running total.

12. The matrix multiplication processor of claim 11 in which the dedicated memory is separate from the memory in which the matrices are stored in use.

13. The matrix multiplication processor of any preceding claim, which is implemented in a Field Programmable Gate Array (FPGA) circuit.

14. The matrix multiplication processor of any preceding claim which comprises a port for communicating with an external processor, whereby the processor is arranged such that first and second matrices can be input at the port in use so as to be written to the memory and so that the output matrix, in use, may be output at the port.

15. The matrix multiplication processor of any preceding claim, arranged such that it can use the output matrix generated in one operation as a first or second matrix in a subsequent operation, with the output matrix being retained in memory, in use, for this purpose.

16. The matrix multiplication processor of any preceding claim, arranged such that the first and second matrices and output matrices are stored, in use, in the memory in the same order

17. The matrix multiplication processor of any preceding claim in which the control unit comprises a starting address store indicative in use of the first address in each matrix.

18. The matrix multiplication processor of any preceding claim in which the control unit is arranged to operate in discrete cycles.

19. The matrix multiplication processor of claim 18 in which the control unit is arranged such that, in use in one cycle, the control unit causes a pair of elements to be read from the memory and output to the multiplier; causes the multiplier to multiply the two elements output to it in the previous cycle and output the product to the accumulator; causes the accumulator to add the product calculated in the previous cycle to the running total; and causes the running total from the previous cycle to be written to the memory.

20. The matrix multiplication processor of any preceding claim in which the memory comprises a dual port RAM

21. The matrix multiplication processor of claim 20 in which each port has an input and an output; the output of both ports being operatively coupled to the multiplier.

22. The matrix multiplication processor of claim 21 in which an input of one of the ports is operatively coupled to an output of the accumulator, the output of the accumulator being arranged to output the running total when commanded to do so by the control unit.

23. The matrix multiplication processor of claim 21 or claim 22 as dependent from claim 14 in which one or both inputs of the ports of the RAM are connected to the port of the processor so as to enable the first and second matrices to be loaded into the memory in use.

24. The matrix multiplication processor of claim 23 as dependent from claim 22 in which the accumulator is coupled to the same input as the processor port is; the processor further comprising a multiplexer coupling both accumulator and processor port to the input of the port of the RAM.

25. A processing apparatus, comprising a processor coupled to a matrix multiplication processor according to any preceding claim.

26. The processing apparatus of claim 25, in which the processor and the matrix processor are provided on discrete chips connected by a data bus

27. The processing apparatus of claim 25 in which the processor and the matrix processor are implemented on a single chip.

28. The processing apparatus of any of claims 25 to 27 in which one or both of the processor and matrix multiplication processor or the single chip are implemented in FPGA.

29. A processing apparatus comprising a first processor and a second processor connected by a data bus, in which the first processor is a general-purpose processor and the second processor is more efficient than the first at calculating matrix multiplication.

30. The processing apparatus of claim 29 in which the processing apparatus is arranged such that the second processor is used only for, in use matrix multiplication.

31. The processing apparatus of claim 29 or claim 30, which is also a processing apparatus according to any of claims 25 to 28.

32. Use of a dedicated matrix-multiplication co-processor in calculating matrix multiplication in combination with a general-purpose processor for other operations.

33. A video processing apparatus, comprising the processing apparatus of any of claims 25 to 31.

34. The video processing apparatus of claim 33, comprising a video camera, the video processing apparatus being arranged to, in use, capture images from the video camera, convert such into matrices, which are then processed by the matrix multiplication processor.

35. The video processing apparatus of claim 34, which is arranged to capture images of a road and determine the position of lane boundaries in the captured images.

36. A method of matrix multiplication in a processing system, the multiplication being of a first matrix with a second matrix to form an output matrix, each matrix comprising a plurality of elements in rows and columns, the method comprising: storing the first matrix in a first area of memory and the second matrix in a second area of memory, the elements of each matrix being stored, in use, in sequential memory locations of the respective memory area in row- wise and/or column- wise order; repeatedly choosing an element from the first matrix and an element from the second matrix, multiplying the elements thus chosen to form a product and adding that to a running total; in which the elements are chosen from the first and second matrix by moving stepwise through the first and second memory areas.

37. The method of claim 36, further comprising writing the running total to an appropriate location within a third area of memory.

38. The method of claim 36 or claim 37 in which the elements of each matrix are stored such that each row or column is stored in sequential memory locations, the elements within each row or column also being stored in sequential memory locations.

39. The method of any of claims 36 to 38 in which, when moving stepwise through the memory to choose elements within whichever of row or column the elements the method steps through successive memory locations

40. The method of any of claims 36 to 39 in which, when moving stepwise through the memory to choose elements from successive of the rows or columns that the elements are stored in, the method steps through memory locations separated by the number of rows or columns as appropriate.

41. The method of any of claims 36 to 40 in which, once one element in the output matrix has been calculated, base addresses indicating the starting position of that row or column are updated.

42. The method of any of claims 36 to 41 in which the method comprises the step of using the output matrix generated in one operation as a first or second matrix in a subsequent operation, with the output matrix being retained in memory for this purpose.

43. The method of any of claims 36 to 42 in which the method is carried out in discrete cycles.

44. The method of claim 43 in which, in one cycle, a pair of elements are read from the memory; the two elements read in the previous cycle are multiplied and the product outputted to the accumulator; the product calculated in the previous cycle is added to the running total; and the running total from the previous cycle is written to the memory.