
HK1077383A - Computationally efficient mathematical engine - Google Patents


Info

Publication number
HK1077383A
HK1077383A (application HK05112006.2A)
Authority
HK
Hong Kong
Prior art keywords
output
memory
shift register
data
receiving
Prior art date
Application number
HK05112006.2A
Other languages
Chinese (zh)
Inventor
里安.S.布杰特
夏伊.S.庭尔曼
史蒂芬.S.苏普利
Original Assignee
美商内数位科技公司
Application filed by 美商内数位科技公司
Publication of HK1077383A


Description

Computationally efficient mathematical engine
Technical Field
The present invention relates to the use of a mathematical engine to compute the output of a complex multiplier array. More particularly, the present invention is a computationally efficient mathematical engine that can be shared to perform a plurality of mathematical calculations.
Background
Recent wireless communication systems typically require a large number of mathematical calculations for signal processing. These calculations are typically performed by processors and application-specific integrated circuits (ASICs).
Standard ASIC designs for receivers must execute many algorithms, which typically require many parallel multiplications to complete within a given period. These algorithms typically include many matrix-by-matrix and matrix-by-vector multiplications, as well as many discrete Fourier transform (DFT) and fast Fourier transform (FFT) computations. Because multipliers take up a lot of space on an ASIC, a design that reuses the same multipliers across several algorithms is desirable.
Such shared computing may be used to support various current wireless technologies such as WCDMA, WTT, CDMA2000, 802.1X, TDSCDMA, FDD, TDD, and future system architectures not yet presented. One type of computation that is commonly performed is dot product multiplication. The dot product is a standard operation between two matrices. For example, performing channel estimation and data estimation requires dot product calculations. In a wideband time division duplex system, the calculations may include prime factor FFT calculations, multiplication of one matrix by another, multiplication of one matrix by its complex conjugate transpose, and multiplication of one matrix by a vector.
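The shared primitive behind all of these operations is the complex dot product. A plain-Python sketch (purely illustrative; the patent describes a hardware multiplier array, and the function names here are invented for the example):

```python
def dot(a, b):
    """Complex dot product: the primitive shared by matrix-matrix,
    matrix-vector, and DFT computations."""
    assert len(a) == len(b)
    acc = 0 + 0j                 # running accumulator
    for x, y in zip(a, b):
        acc += x * y             # one complex multiply-accumulate
    return acc

def matvec(A, v):
    """A matrix-vector product is simply one dot product per matrix row."""
    return [dot(row, v) for row in A]
```

For example, `dot([1+1j, 2+0j], [3+0j, 4-1j])` evaluates to `11+1j`.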
Typically, several dot product calculations must be performed by a single communication device, and therefore the communication device must have sufficient processing power to support the required calculations. Currently, each algorithm uses dedicated hardware to perform its own mathematical function. It would be advantageous to develop systems that reuse hardware to maximize operational efficiency. Operational efficiencies include, but are not limited to, processing time, the silicon area on which processing is performed, and the power required by that silicon.
Disclosure of Invention
According to the present invention, a mathematical engine is provided for performing a plurality of types of mathematical calculations so that hardware can be used efficiently. The present invention includes a memory with a parallel output that stores values which are selectively output in parallel to a calculation section. Where the length of the stored vector exceeds the capacity of the calculation section, the memory is addressed to provide the vector components, called folds, in a logical sequence that facilitates completing the mathematical operation over the full vector.
The different algorithmic results are generated by selectively using enable signals that control data transfer and the appropriate mathematical calculation, thereby controlling the operation of the mathematical engine. This increases the flexibility of the mathematical engine to perform different types of calculations when designing the signal processor, and saves processor circuitry, reducing the amount of semiconductor real estate required.
Drawings
FIG. 1 is a schematic block diagram of a mathematical engine according to the present invention.
FIGS. 2A-C show the system response matrix (A^H), which is used with its complex conjugate transpose (A) to calculate the autocorrelation matrix (A^H A).
FIGS. 3A-G show the mathematical engine of FIG. 1 performing the calculations required for the A^H A function.
FIGS. 4A-D show multiplication of the A^H matrix with the received signal vector (r).
FIGS. 5A-I show the mathematical engine of FIG. 1 performing the calculations required for the A^H r function.
FIGS. 6A-D show the mathematical engine of FIG. 1 performing the calculations required for discrete Fourier transforms.
FIGS. 7A-C show the selective use of input sources.
Detailed Description
The present invention may be described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout.
The present invention is a single mathematical engine that handles multiple independent and distinct algorithms. The mathematical engine may use the same hardware for all algorithms. Because multipliers require significant space on an ASIC, the present invention reduces the amount of ASIC space required. The mathematical engine of the present invention also performs the required calculations very efficiently by keeping the hardware in use a large fraction of the time. The efficiency of the mathematical engine depends on the input matrix size and the number of processing elements.
Typically, a mathematical engine has at least two inputs and an output. The inputs include a serial input and a parallel input, where the parallel input is as wide as the number of processing elements. The number of processing elements may be optimized for an entire vector, a portion of a vector, or a vector of a matrix. Both the parallel and serial inputs may be loaded into a shift register or another type of serial access register for different types of operations. A parallel output shift register is a memory with a parallel output that can quickly output stored data. The parallel outputs of the shift register are multiplexed so that the engine can perform the function of a parallel output shift register whose width is determined by the number of processing elements, perform the function of a parallel output shift register with selectable outputs, or provide access to a secondary parallel input. The primary parallel input, the multiplexed parallel output of the shift register, and the secondary parallel input serve as inputs to the complex multipliers and adder tree, which increases the computational efficiency of the mathematical engine. This allows data to be shifted into the registers as quickly as possible for each operation, and also allows data to be reorganized efficiently for the steps within the operation being executed.
In a preferred embodiment, the parallel output shift register stores values that are selectively output in parallel to the calculation section. Where the length of the stored vector exceeds the capacity of the calculation section, the parallel output shift register is addressed to provide the vector components in a sequence that facilitates completing the mathematical operation over the full vector. As a result, a calculation is made on almost every clock, rather than spending multiple steps preparing the data for an operation. When coupled with an output circuit of a given length, the parallel output shift register is a parallel output n-fold shift register, meaning that its memory storage is n times the data width (i.e., the number of processing elements) of the calculation section.
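The fold mechanism described above can be modeled behaviorally as follows (a sketch, not the hardware; `q` stands for the number of processing elements, and the names are invented):

```python
import math

def folded_dot(a, b, q=20):
    """Model of the n-fold operation: a vector longer than the q processing
    elements is processed q values per 'clock'; each fold's partial sum of
    products is added into the accumulator until the full vector is covered."""
    n_folds = math.ceil(len(a) / q)
    acc = 0 + 0j
    for f in range(n_folds):              # one fold per clock
        lo, hi = f * q, (f + 1) * q
        acc += sum(x * y for x, y in zip(a[lo:hi], b[lo:hi]))
    return acc, n_folds
```

A 44-value vector with 20 processing elements is consumed in 3 folds, i.e. one output accumulated over 3 clocks.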
The adder tree feeds an accumulator, which facilitates many different matrix-by-matrix and matrix-by-vector multiplications and enables efficient calculation of, for example, A^H A and A^H r.
According to the present invention, the dot product calculations performed by the mathematical engine include, but are not limited to, the different types of multiplications required for channel estimation and data estimation, such as prime factor FFTs; multiplication of one matrix by another; multiplication of one matrix by its complex conjugate transpose; and multiplication of one matrix by a vector.
Referring to FIG. 1, a simplified block diagram of a mathematical engine 30 according to the present invention is shown. The mathematical engine 30 includes a primary parallel data input source (PPDIS) 31, a demultiplexer 32, an n-fold shift register 33, a serial data input source (SDIS) 34, a multiplexer 35 and a secondary parallel data input source (SPDIS) 36. Also included are a complex multiplier processing element (PE) array 40, comprising a plurality of array elements (41-44 shown), a complex adder tree 55 and a complex accumulator 56. The complex accumulator 56 includes a start condition input multiplexer 58, a summer 57 and a summer output register 59. The summer output register 59 provides the accumulated output. The complex multiplier processing element array 40 provides a parallel complex multiplication function, including dot product outputs. In operation, the output of the primary parallel data input source 31 contains Q positions. The serial data input source 34 provides a complex sequence of received chips used to load the shift register 33. The shift register 33 has n folds, where each fold has Q positions, each position containing a complex value. Furthermore, any fold may be loaded with values from the Q positions of the primary parallel data input source 31, or the register may shift by X positions with zero fill on each clock. Multiplexer 35 receives the output from the shift register 33 and the complex Q position values from the secondary parallel data input source 36.
The output from the multiplexer 35 is provided to a complex multiplier array 40 which is used to provide corresponding inputs to a complex adder tree 55. The complex adder tree 55 is a Q-input complex adder tree.
The complex adder tree 55 provides its output to an accumulator 56. The accumulator 56 is provided with a start condition at the summer 57, which is provided via the multiplexer 58. The output result of the summer 57 is stored in an output register 59.
The mathematical engine 30 of the present invention shown in FIG. 1 can perform several different types of calculations, thereby reducing the need for separate processors and greatly reducing the amount of ASIC real estate required for the mathematical calculations. Three applications of the preferred embodiment in a wideband TDD communication system are described below, without limitation: 1) generation of the autocorrelation matrix (A^H A); 2) the whitened matched filter (A^H r); and 3) channel estimation via the Steiner algorithm. However, those skilled in the art will appreciate that other algorithms may also be used without departing from the spirit and scope of the invention.
An example of the invention performing the A^H A function will be described with reference to FIGS. 2A-C and FIGS. 3A-G. This example illustrates the use of the mathematical engine of FIG. 1 to perform the A^H A function in accordance with the present invention. The A^H A function is one of the most complex functions computed by the mathematical engine, because it is computed from a partially stored matrix, and the same partially stored matrix is accessed in different ways with only partial computation results. The order of the operations is based not on the logical progression of the matrix, but on how the matrix fits the existing functions of the mathematical engine.
FIG. 2A is a simplified matrix representation of the A^H, A, and A^H A matrices. Only the A^H row block is stored in memory; the whole A^H matrix is represented by an addressing function on the stored row block, which can also be transformed to access the data as the A matrix by conjugating the imaginary portion of the data. The final A^H A matrix is compressed as if it were constructed from the full A^H and A matrices.
Because the A^H sub-block is stored only in a single-access memory (the primary parallel data input source 31 in FIG. 1), the first step copies the contents of the A^H memory into the shift register 33, where it can also be manipulated to present the second and third row blocks as they are seen through the window of data used in the computation. Because an A^H row is spreading factor (SF) + channel response length (W) - 1 values wide and there are only 20 processing elements, each A^H row is handled 20 values at a time. The data is shifted in 20-value blocks called folds (e.g., SF + W - 1 = 44 in this example, so there are 3 folds). FIGS. 3A, B and C show the first three clock cycles, which load the first A^H row into the shift register.
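The fold count follows from a ceiling division over the numbers in this example. A small check (variable names invented; SF = 16 is inferred here from SF + W - 1 = 44 with W = 29, the values the text gives):

```python
import math

SF, W, Q = 16, 29, 20           # spreading factor, channel length, PEs (assumed)
row_len = SF + W - 1            # width of one A^H row: 44 complex values
folds = math.ceil(row_len / Q)  # 44 values over 20 PEs -> 3 folds (last padded)
```

So each row is consumed as 3 folds of 20, with the final fold zero-padded.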
In FIG. 3A, the first fold of the A^H first row block is loaded into the first fold store of the shift register. In FIG. 3B, the second fold of the A^H first row block is loaded into the second fold store of the shift register. In FIG. 3C, the third fold of the A^H first row block is loaded into the third fold store of the shift register.
The matrix components to be calculated are highlighted in FIG. 2B. The entire row is highlighted, but only the matrix elements within the window need be used in this calculation. FIG. 2B shows the first row of the A^H matrix multiplied by the first column of the A matrix to generate the first value of the A^H A matrix.
Because the A^H and A matrix row and column blocks are larger than the number of processing elements (rows and columns are 44 values in this example), the calculation proceeds one fold at a time. The fold segments are multiplied and accumulated in the accumulation register 59 (shown in FIG. 1) until the entire row is processed, at which point the result is stored external to the mathematical engine. The A^H matrix is represented by the values in the shift register, while the A matrix values are produced from the A^H memory by taking their conjugates in the complex multiplier array 40.
FIG. 3D shows the first fold of the first row of the A^H matrix multiplied by the first fold of the first column of the A matrix; the result is stored in the accumulation register 59. FIG. 3E shows the second fold of the first row of the A^H matrix multiplied by the second fold of the first column of the A matrix; the result is added to the previous accumulated value and stored in the accumulation register 59. FIG. 3F shows the third fold of the first row of the A^H matrix multiplied by the third fold of the first column of the A matrix; this result is likewise added to the previous accumulated value and stored in the accumulation register 59. The value now stored in the accumulation register 59 represents the first position of the A^H A matrix and is stored externally for further processing. The next multiplication also takes 3 clocks, one per fold, but only the first clock is presented in FIG. 3G, which shows the first fold of the first row of the A^H matrix multiplied by the first fold of the second valid column of the A matrix. The result is stored in the accumulation register 59, and the process runs for 3 clocks until the accumulated result is stored externally for further processing.
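The fold-by-fold multiply-accumulate of FIGS. 3D-3F can be sketched as below. Since (A^H A)_ij equals the sum over k of (A^H)_ik times the conjugate of (A^H)_jk, the A-matrix column is obtained by conjugating another stored A^H row inside the multiplier array, matching the text's description of avoiding a separate A memory (function name and `q` parameter are invented for this sketch):

```python
import math

def aha_element(ah_row_i, ah_row_j, q=20):
    """One element of A^H A, computed one fold per clock. The second input is
    conjugated in the multiplier array, standing in for the A-matrix column."""
    acc = 0 + 0j
    for f in range(math.ceil(len(ah_row_i) / q)):
        lo, hi = f * q, (f + 1) * q
        acc += sum(x * y.conjugate()
                   for x, y in zip(ah_row_i[lo:hi], ah_row_j[lo:hi]))
    return acc
```

With `q=1` each value is its own fold: `aha_element([1+1j, 2+0j], [3+0j, 1-1j], q=1)` gives `5+5j`.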
In FIG. 2C, it can be seen how 16 zeros are shifted into the left of the shift register 33 so that the second row block of the A^H matrix becomes visible through the input window used for the computation. This shift uses only one clock and can be processed in parallel with the rest of the computation, saving significant time. For the shifted version of the A^H sub-block in the computation window, this shifted version of the first row of the A^H matrix sub-block must be multiplied by all the valid columns of the A matrix sub-block.
After each fold of the first valid A^H row is multiplied by each valid column of the A matrix sub-block contained in the computable input window, the same processing is performed for every other valid row. This out-of-order processing is performed so that the shifted version of the A^H sub-block does not have to be reloaded for each A^H row.
As a second example, the A^H r function performed by the mathematical engine of FIG. 1 will be described with reference to FIGS. 4A-D and 5A-I. The A^H r function multiplies the entire A^H matrix by the received vector (r). The A^H matrix is not stored in its entirety; only a small portion called the A^H row block is actually stored. All other values in the A^H matrix are either repetitions of the A^H row block or zeros. Thus, A^H can be stored as the A^H row block together with a complex addressing scheme.
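The row-block-plus-addressing storage can be illustrated with a toy reconstruction routine (purely illustrative; the 16-chip shift per row block matches the shifts described later in the text, and all names are invented):

```python
def ah_row_from_block(block_row, block_index, total_len, shift=16):
    """Reconstruct one full A^H matrix row: the stored row-block row is
    placed at an offset of block_index * shift chips, zeros elsewhere."""
    offset = block_index * shift
    out = [0] * total_len
    for i, v in enumerate(block_row):
        if offset + i < total_len:
            out[offset + i] = v
    return out
```

For instance, `ah_row_from_block([1, 2, 3], 1, 8, shift=2)` returns `[0, 0, 1, 2, 3, 0, 0, 0]`. In hardware this materialization never happens; the address generator simply reads the stored block at the right time.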
For example, wideband TDD third generation (3G) burst types 1 and 3 have a total of 61 A^H row blocks, while a burst type 2 time slot has a total of 69 A^H row blocks. In fact there are two r vectors; this function is described with respect to one r vector, and the process is then repeated for the second data set vector.
FIG. 4A is an illustration of the A^H r matrix multiplication. The A^H row block is repeated throughout the A^H matrix, with all remaining values zero. Of the four A^H row block rows, two are invalidated by previous processing.
The first step of calculating the A^H r function is to preload the entire shift register 33 of the mathematical engine with r vector values. The shift register 33 (shown in FIG. 1) is loaded serially from the serial data input source 34. The A^H matrix is provided at the primary parallel data input source 31.
The step-by-step process of calculating A^H r begins by computing the dot product of the r vector with the first row of the A^H matrix, shown in FIG. 4B. Because only the first SF + W - 1 values of the A^H row contain non-zero values, only the first part of the row product is calculated. Since there are only 20 processing elements, the multiplication must be performed in multiple steps, called folds; in this case W is 29 and SF + W - 1 is 44. Dividing SF + W - 1 by the number of processing elements (20) and rounding up gives a total of 3 folds, i.e., 3 clocks of multiplying and accumulating over the significant portion of the r vector to compute the dot product of the active part of the A^H row. The first three clocks, shown in FIGS. 5A, B and C, calculate the first value of the dot product of the matrix row and vector in the step process.
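Because only the first SF + W - 1 values of the row are non-zero, only the aligned window of r is touched. The three-clock dot product of FIGS. 5A-C can be modeled as follows (names invented, behavior per the description above):

```python
import math

def ahr_point(ah_row, r, offset=0, q=20):
    """One A^H r output: multiply the active (non-zero) part of an A^H row
    by the r-vector window it overlaps, one q-wide fold per clock."""
    window = r[offset: offset + len(ah_row)]
    acc = 0 + 0j
    for f in range(math.ceil(len(ah_row) / q)):
        lo, hi = f * q, (f + 1) * q
        acc += sum(x * y for x, y in zip(ah_row[lo:hi], window[lo:hi]))
    return acc
```

With a 44-value active row and 20 processing elements this takes 3 folds, i.e. 3 clocks per output point; later row blocks simply use a larger `offset` into r.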
FIG. 5A shows the first fold of the first A^H row multiplied by the first fold of the r vector; the result is stored in the accumulation register 59. FIG. 5B shows the second fold of the first A^H row multiplied by the second fold of the r vector; the result is added to the value currently in the accumulation register 59. FIG. 5C shows the third fold of the first A^H row multiplied by the third fold of the r vector; the result is added to the value currently in the accumulation register 59. The final accumulation is then stored as the first A^H r calculated point.
The next step is to calculate the next active row of the A^H row block for the currently calculated symbol. This involves multiplying the same portion of the r vector by the new A^H row shown in FIG. 4C. The r vector remains the same while a different A^H row is accessed in the A^H memory. Again, three clocks are shown in FIGS. 5D, E and F, which show all three fold multiply-accumulates. FIG. 5D shows the first fold of the third A^H row multiplied by the first fold of the r vector; the result is stored in the accumulation register 59. FIG. 5E shows the second fold of the third A^H row multiplied by the second fold of the r vector; the result is added to the value currently in the accumulation register 59. FIG. 5F shows the third fold of the third A^H row multiplied by the third fold of the r vector; the result is added to the value currently in the accumulation register 59. The final accumulation is then stored as the second A^H r calculated point.
The next step begins the calculation of the second A^H row block with the r vector. This involves multiplying the second row of each valid A^H sub-block row by the r vector. Since the next row block is shifted by 16 values from the last, the r vector portion multiplied by the A^H sub-block is different. In FIG. 4D, the portion involved in the calculation is highlighted.
Because the stored A^H row blocks do not move, the r vector must be shifted to realign the r vector portion multiplied by a row in the new A^H row block. This is done by shifting in 16 new values over 16 clocks using the shift register 33 and dropping the first 16 values of the r vector. The first two clocks of this process are shown in FIGS. 5G and H; the remainder follow the same pattern. Thereafter, the processing is identical to that of the first A^H row block, so the first clock of this repeated operation can be compared with the first clock for the first A^H row block.
FIG. 5G shows a value of the r vector being shifted into the right side of the shift register while the first value put into the shift register is dropped. FIG. 5H shows the second value of the r vector being shifted into the shift register while the second value put into the register is dropped. FIG. 5I shows the first A^H row multiplied by the first fold of the newly shifted r vector; the result is stored in the accumulation register 59 as the first calculation for this A^H r row block. This process continues with each fold of each valid A^H row block row for each A^H row block.
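The 16-clock realignment of FIGS. 5G-5H behaves like a fixed-length queue: each clock one new r value enters one end and the oldest value falls off the other. A minimal model, using Python's `deque` in place of the hardware shift register (names invented):

```python
from collections import deque

def shift_in(reg, new_vals):
    """Shift new values into a fixed-length register, one per clock,
    dropping the oldest value each time (cf. FIGS. 5G-5H)."""
    d = deque(reg, maxlen=len(reg))
    for v in new_vals:
        d.append(v)   # appending to a full deque discards from the far end
    return list(d)
```

For example, `shift_in([0, 1, 2, 3], [4, 5])` yields `[2, 3, 4, 5]`; sixteen such shifts realign the r window by 16 chips.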
As a third example, the Steiner algorithm executed by the mathematical engine of FIG. 1 will be described with reference to FIGS. 6A-D. The figures show the 3-point discrete Fourier transforms used when running a 456-point Steiner algorithm on the mathematical engine. FIGS. 6A-C show the first DFT calculation, with each diagram showing a different clock. As shown, the first three addresses of the DFT have been loaded into memory serially, so they can all be accessed in parallel at one time via the primary parallel data input source 31 (in FIG. 1). The twiddle factors used for the discrete Fourier transforms are input via the secondary parallel data input source 36 (in FIG. 1).
In each of these three clocks, the first DFT inputs at addresses 0, 152, and 304 are multiplied by one of three sets of twiddle factors unique to that DFT. The results of these three calculations are stored external to the mathematical engine of FIG. 1.
In FIG. 6A, the three points of twiddle set 1 are multiplied by the three points of DFT 1 and the result is stored externally. The first point of the second DFT is loaded into the next line of the memory. In FIG. 6B, the three points of twiddle set 2 are multiplied by the three points of DFT 1 and the result is stored externally. The second point of the second DFT is loaded into the next line of the memory. In FIG. 6C, the three points of twiddle set 3 are multiplied by the three points of DFT 1 and the result is stored externally. The third point of the second DFT is loaded into the next line of the memory.
FIG. 6D shows how the first point of the next DFT is applied to the first twiddle factor set, while the other two sets follow on the following clocks. The 64-point discrete Fourier transforms are achieved in a slightly different manner, where each set of discrete Fourier transform points multiplied by a twiddle set uses four consecutive clocks that are accumulated together before storage.
In FIG. 6D, the three points of twiddle set 1 are multiplied by the three points of DFT 2, and the result is stored externally. The first point of the third DFT is loaded into the next line of the memory. This multiplication continues through all three twiddle factor sets for each 3-point discrete Fourier transform. The other discrete Fourier transforms are achieved similarly, with 8 twiddle sets for 8 points, 19 twiddle sets for 19 points, and 64 twiddle sets for 64 points.
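A 3-point DFT of the kind shown in FIGS. 6A-C is just three dot products of the input triple with three twiddle-factor sets; note also that the point counts 3, 8, and 19 are co-prime factors of 456 (3 x 8 x 19 = 456), which is what a prime-factor decomposition requires. A reference sketch (illustrative only, not the hardware schedule):

```python
import cmath

def dft3(x):
    """3-point DFT: output k is the dot product of the three inputs with
    twiddle set k, whose entries are exp(-2*pi*j*k*n/3) for n = 0, 1, 2."""
    twiddles = [[cmath.exp(-2j * cmath.pi * k * n / 3) for n in range(3)]
                for k in range(3)]
    return [sum(t * xn for t, xn in zip(row, x)) for row in twiddles]

# Co-prime point sizes behind the 456-point prime-factor decomposition:
assert 3 * 8 * 19 == 456
```

A constant input concentrates everything in bin 0: `dft3([1, 1, 1])` is approximately `[3, 0, 0]`.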
Referring to FIGS. 7A-C, it can be seen that selective input to the mathematical engine of the present invention allows it to perform the A^H A, A^H r and Steiner functions. As shown in FIG. 7A, for the A^H A function, only the primary parallel data input source is used during the A^H A operation, providing the system response matrix (A^H); the system response matrix is multiplied by its complex conjugate transpose. Because of the parallel load capability of the shift register, and because the complex multiplication array can conjugate one of its inputs, each point is computed in x clock cycles depending on how many folds are needed (where 1 <= x <= n, and n is the maximum number of folds).
Referring to FIG. 7B, for the A^H r function, the primary parallel data input source provides the system response matrix (A^H), and the serial input is used during the operation to provide the received vector (r) of the data field. The system response matrix A^H is multiplied by the data fields of the received vector (r) for a time slot. Using the serial load capability of the shift register, the data field is loaded into the shift register while the system response matrix is provided to the primary parallel input. Each element of the final vector is computed in x clock cycles depending on how many folds are needed (where 1 <= x <= n, and n is the maximum number of folds).
For the Steiner operation (FFT) shown in FIG. 7C, the primary parallel data input source provides the FFT input data set, while the secondary parallel data input source provides the FFT twiddle factors. During an m-point FFT operation, the appropriate m points of the data set are provided to the complex multiplication array by the primary parallel data input source, and the appropriate FFT twiddle factors are provided by the secondary parallel data input source.

Claims (18)

1. A mathematical engine comprising:
a parallel output shift register receiving data to be processed;
a processor including an adder tree using a plurality of Arithmetic Logic Unit (ALU) circuits for receiving the output of the shift register and providing data outputs; whereby the shift register includes a selectable start position for selective output based on the capacity of the processor.
2. The mathematical engine of claim 1 further comprising a multiplexer for receiving the output from the parallel output shift register and providing data from the parallel output shift register to the processor.
3. The mathematical engine of claim 2 further comprising at least one actuation circuit that selectively actuates the output from the shift register.
4. The mathematical engine of claim 1 wherein the data comprises real and imaginary components.
5. A computing unit for performing a plurality of different types of computations, the computing unit comprising:
a parallel output shift register;
a multiplexer for receiving the output from the shift register and providing an output to the adder tree;
the adder tree comprises a plurality of arithmetic logic units; and
a selection circuit selectively actuates the shift register and the multiplexer to apply specific portions of the input data to the adder tree to perform different computations.
6. The computing unit of claim 5, further comprising at least one selectable memory having at least one data width of the adder tree.
7. A mathematical engine for performing calculations, the mathematical engine comprising:
at least one input memory for storing input data;
a selectable memory receiving input data from the at least one input memory and providing selectable outputs via a plurality of folds, wherein each fold includes at least one location within the selectable memory; and
a processor array having a plurality of processors for receiving the output from the selectable memory and selectively providing an output.
8. The mathematical engine of claim 7 further comprising an activation circuit for selectively controlling the at least one input memory and the selectable memory depending on the desired mathematical computation.
9. The mathematical engine of claim 8, further comprising an adder tree having a plurality of arithmetic logic unit circuits for receiving outputs from the processor array and processing the outputs; and an accumulation circuit for receiving and accumulating outputs from the adder tree; whereby the enable circuit further controls at least a portion of the adder tree to support the desired mathematical computation.
10. A computational circuit that can solve complex functions; the calculation circuit includes:
a memory receiving input data for a complex solution;
a storage body for storing the operation factor of the complex function;
a multiplexer for receiving an input from the memory and the operation factor;
a processing array circuit for processing according to a number of bit positions stored by the memory, the processing array circuit receiving an output from the multiplexer and at least some of the input data;
a complex adder tree receiving the outputs from the processing array and providing an added output; and
an accumulator circuit receives the output from the adder tree and provides an accumulated complex output.
11. The computing circuit of claim 10 wherein said memory comprises a parallel output shift register having a selectable start position.
12. The computing circuit of claim 10, further comprising a twiddle factor provided as an operation factor for performing discrete Fourier transforms, wherein said multiplexer receives its inputs from said memory and said operation factor.
13. A computational circuit capable of solving complex functions, comprising:
a memory for receiving complex input data for solution;
a storage body for storing a twiddle factor;
a multiplexer receiving complex input data from the memory and the twiddle factor from the storage body;
a processing circuit for processing data according to a plurality of bit positions stored by the memory, the processing circuit receiving an output from the multiplexer and at least some of the input data;
a complex adder tree receiving the outputs from the processing array and providing an added output; and
an accumulator accumulating the added output.
14. The computing circuit of claim 13 wherein said memory includes a parallel output shift register having a selectable start position.
15. A method of electronically resolving complex functions, the method comprising:
providing input data from a first memory for a complex solution;
providing an operation factor for the complex function from a second memory and multiplying the operation factor by the data from the first memory to provide multiplied data; and
supplying a selected portion of the multiplied data to a processing array as a parallel output, the selected portion being dependent upon the complex function.
16. The method of claim 15 wherein the processing array comprises a plurality of adders.
17. The method of claim 15, further comprising:
providing a twiddle factor as the operation factor for performing a discrete Fourier transform; and
selectively activating at least a portion of the processing array, thereby controlling the size of data processed by the processing array.
18. The method of claim 15 further comprising accumulating data output from the processing array.
HK05112006.2A (priority date 2002-09-24, filing date 2003-09-24): Computationally efficient mathematical engine, HK1077383A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US60/413,164 2002-09-24

Publications (1)

Publication Number Publication Date
HK1077383A (en) 2006-02-10

