METHOD OF PERFORMING A FFT TRANSFORM ON PARALLEL PROCESSORS
The present invention relates to techniques for organising and performing Fast Fourier Transforms (FFTs) on parallel programmable processors.
FFTs are a family of commonly used algorithms that provide an efficient way to convert sampled and digitized time domain signals to the frequency domain and vice versa. Usually, FFTs act on an array of 2^M complex values (where M is a natural number) to produce 2^M complex spectral components, representing amplitude and phase information. FFTs divide the computation into a series of stages, in each of which a number of operations (such as a complex multiplication and an addition/subtraction) are performed on the data input to that stage. The outputs of one stage form the inputs for the next stage, until the final stage is reached.
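Purely by way of illustration, the staged organisation described above may be sketched as follows. This is a textbook iterative radix-2 decimation in time FFT written in Python; all names are illustrative and no particular processor is assumed.

```python
import cmath

def fft_stages(x):
    """Iterative radix-2 FFT: each stage's outputs feed the next stage."""
    n = len(x)  # must be a power of two, i.e. n = 2^M
    bits = n.bit_length() - 1
    # Reorder the input into bit-reversed sequence so that the output
    # emerges in natural order.
    a = [x[int(format(i, f'0{bits}b')[::-1], 2)] for i in range(n)]
    span = 1
    while span < n:  # one iteration of this loop per stage
        w_step = cmath.exp(-2j * cmath.pi / (2 * span))
        for group in range(0, n, 2 * span):
            w = 1.0  # twiddle factor for the first butterfly of the group
            for k in range(span):
                i, j = group + k, group + k + span
                t = w * a[j]                      # complex multiplication
                a[i], a[j] = a[i] + t, a[i] - t   # addition / subtraction
                w *= w_step
        span *= 2
    return a
```

Each pass of the outer `while` loop corresponds to one vertical stage of the topology of Figure 4; the in-place updates of `a[i]` and `a[j]` correspond to a butterfly over-writing its own inputs.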
Some processor architectures provide two or more execution units for processing different instructions (or different parts of an instruction) in parallel. Such processors are referred to as parallel processors. For example, very long instruction word (VLIW) processors are designed such that different instructions from an instruction packet may be processed in parallel on different execution units.
FFTs are frequently a major part of the computational load in signal processing applications, and thus it is desirable to arrange FFTs to be performed as efficiently as possible.
According to a first aspect of the invention there is
provided a method of performing a Fast Fourier Transform on a parallel processor comprising a plurality of execution units, the method comprising: performing the Fast Fourier Transform in a series of stages, wherein different parts of each stage are performed concurrently in different execution units, an output of an execution unit in one stage being used as at least part of the input to that execution unit in a subsequent stage.
The present invention takes advantage of the fact that in some stages the input data required by each execution unit may be the same data that was output by that execution unit in the previous stage. Thus the present invention may allow parts of each stage to be performed in parallel, with no requirement to move data from one execution unit to another in at least one of the stages. This may allow the FFT to be performed more efficiently than would otherwise be the case.
Preferably, the data input to an execution unit in each stage is stored in a local memory of that execution unit. This can allow each execution unit to have direct access to the data that it needs in each stage without accessing a central memory. Data which is output by an execution unit in each stage may also be stored in a local memory of that execution unit. In that case, the output data may overwrite at least part of the data input to the execution unit in that stage. This may allow a more efficient usage of the memory.
Preferably the data on which the Fast Fourier Transform is being performed is distributed between the local memories of the execution units. This may allow a more efficient usage of memory than if each cluster stored
its own version of the entire input data array. Thus the method may further comprise the steps of dividing an input data array into a plurality of blocks of data, and storing each block of data in a local memory of a different one of the execution units.
In some stages of the Fast Fourier Transform, data which is required by one execution unit may be stored in a local memory of another execution unit. Therefore the method may further comprise the step of moving data from a local memory of one execution unit to a local memory of another execution unit. This may be achieved, for example, through use of registers which are visible to two or more execution units.
In each stage, each execution unit may perform at least one butterfly calculation. A butterfly calculation typically involves a multiplication and an addition and/or a subtraction. The butterfly calculation is preferably performed on complex data, although real data may be used.
A copy of a twiddle table may be stored in a local memory of each of the execution units. A twiddle table comprises coefficients which are needed for performing the Fast Fourier Transform. By arranging each cluster to have its own version of the twiddle table, the clusters may obtain the coefficients without the need to access a central memory.
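By way of example only, a twiddle table of the kind described above might be generated as follows (Python; the table layout shown is an assumption made for illustration, not a definition of the invention):

```python
import cmath

def twiddle_table(n):
    """Coefficients W_N^k = exp(-2*pi*j*k/n) for k = 0 .. n/2 - 1.

    A radix-2 FFT of size n needs only the first n/2 roots of unity;
    in the scheme described above, each cluster would hold its own
    copy of a table such as this in its local memory.
    """
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
```

A radix-4 butterfly would index the same table at elements i, 2i and 3i to obtain its three successive twiddle powers.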
Preferably, in at least one stage, all of the data output from an execution unit in that stage is used as input data to that execution unit in the next stage. Furthermore, in at least one stage, all of the data input to an execution unit in that stage may be output by that execution unit in the previous stage. In such
stages it may not be necessary to transfer data between execution units, resulting in improved performance.
In some circumstances it may be required to perform the Fast Fourier Transform on successive input data arrays. In that case, data on which a subsequent Fast Fourier Transform is to be performed may be loaded into local memories of the execution units at the same time as the execution units are performing a Fast Fourier Transform on previous data. This can allow the computation of successive Fast Fourier Transforms to be carried out more efficiently. Furthermore, data from a previous Fast Fourier Transform may be read from the local memories at the same time as the execution units are performing a Fast Fourier Transform on subsequent data.
Data may be loaded into and/or read from local memories of the execution units using a memory access device, such as a direct memory access (DMA) engine, which may be external to the processor. The memory access device may also be arranged to perform a bit reversal operation. A bit reversal operation may be required before or after the Fast Fourier Transform is performed, and may be a significant part of the overall computation. It is therefore advantageous to arrange the memory access device to perform the bit reversal operation rather than the processor itself. The memory access device may then perform a bit reversal operation at the same time as the processor is performing a Fast Fourier Transform on another set of data. Preferably the memory access device performs the bit reversal operation as part of a load or read operation, depending on the type of FFT.
The method may be carried out on a clustered VLIW processor.
The invention extends to a corresponding processor, and thus according to a second aspect of the invention there is provided a parallel processor programmed to perform a Fast Fourier Transform, the processor comprising a plurality of execution units, wherein the processor is programmed to perform the Fast Fourier Transform in a series of stages with different parts of each stage being performed concurrently in different execution units, and with an output of an execution unit in one stage being used as at least part of the input to that execution unit in a subsequent stage.
The invention extends to corresponding computer programs, and thus according to a third aspect of the invention there is provided a computer program for performing a Fast Fourier Transform on a parallel processor comprising a plurality of execution units, the program being arranged such that, when it is executed, the Fast Fourier Transform is performed in a series of stages, with different parts of each stage being performed concurrently in different execution units, and with an output of an execution unit in one stage being used as at least part of the input to that execution unit in a subsequent stage.
According to a fourth aspect of the invention there is provided a computer readable storage medium having stored thereon a computer program for performing a Fast Fourier Transform on a parallel processor comprising a plurality of execution units, the program being arranged such that, when it is executed, the Fast Fourier Transform is performed in a series of stages, with different parts of each stage being performed concurrently in different execution units, and with an output of an execution unit in one stage being used as
at least part of the input to that execution unit in a subsequent stage.
According to a fifth aspect of the invention there is provided a method of organizing a Fast Fourier Transform for execution on a parallel processor comprising a plurality of execution units, the method comprising: dividing the Fast Fourier Transform into a series of stages; dividing each stage into a plurality of parts; and arranging for different parts of each stage to be performed by a different execution unit, with an output of an execution unit in one stage being used as at least part of the input to that execution unit in a subsequent stage.
The method according to the fifth aspect may be carried out by a compiling apparatus such as a computer running a compiler program.
According to a sixth aspect of the invention there is provided a method of performing a Fast Fourier Transform on a parallel processor comprising a plurality of execution units, the method comprising: arranging the Fast Fourier Transform into a series of stages; arranging each stage of the Fast Fourier Transform into a plurality of parts; dividing an input data array into a plurality of blocks of data; allocating each block of data to a separate execution unit; performing a first stage of the Fast Fourier Transform, each execution unit performing a part of the first stage using the block of data allocated to that
execution unit to yield an output block of data; and performing subsequent stages of the Fast Fourier Transform, each execution unit performing a part of each stage using as at least part of an input block of data at least part of the output block of data from that execution unit in a previous stage.
Features of one aspect may be applied to any other aspect; method features may be applied to processor aspects and vice versa.
Preferred features of the present invention will now be described, purely by way of example, with reference to the accompanying drawings, in which:
Figure 1 shows an overview of a parallel processor;
Figure 2 is a block diagram of the master cluster in the processor of Figure 1;
Figure 3 is a block diagram of a slave cluster in the processor of Figure 1;
Figure 4 shows the topology of an example radix-2 FFT;
Figure 5 shows a decimation in time radix-2 butterfly;
Figure 6 shows a decimation in frequency radix-2 butterfly;
Figure 7 shows the topology of an example radix-4 FFT;
Figure 8 shows how the FFT of Figure 4 may be arranged in accordance with an embodiment of the invention;
Figure 9 shows how the FFT of Figure 7 may be arranged in accordance with an embodiment of the invention;
Figure 10 is a flow chart showing the basic steps
in performing an example FFT in an embodiment of the invention;
Figure 11 is a timing diagram showing how the various steps of Figure 10 are performed on different parts of the processor;
Figure 12 shows how a pair of three-legged butterflies may be formed; and
Figure 13 shows the topology of a FFT in another embodiment of the invention.
Overview of a parallel pipelined processor
Figure 1 shows an overview of a parallel processor. The processor 1 comprises instruction issuing unit 10, schedule storage unit 12, first, second, third and fourth processor clusters 14, 16, 18, 20 and system bus 22. The system bus is connected to random access memory (RAM) 24, input/output devices 26, and direct memory access (DMA) engine 28. As will be explained, each of the clusters 14, 16, 18, 20 contains a number of execution units having a shared register file.
The processor 1 is designed to operate in two distinct modes. In the first mode, referred to as scalar mode, instructions are issued to just the first cluster 14, and the second to fourth clusters 16, 18, 20 do not perform any computational tasks. In the second mode, referred to as VLIW mode, instructions are issued in parallel to all of the clusters 14, 16, 18, 20, and these instructions are processed in parallel. The first cluster 14 is referred to as the master cluster and the other clusters 16, 18, 20 are referred to as slave clusters. In practice, the processor architecture may be configured to include any number of slave clusters.
The master cluster 14 controls the overall operation of
the processor 1. The block structure of the master cluster 14 is shown in Figure 2. The master cluster comprises first execution unit (AGU) 30, second execution unit (EXU) 32, control transfer unit (CTU) 34, instruction register 36, I-cache 38, V-cache partition 40, code decompression unit (CDU) 42, local memory 44, data cache 46, system bus interface 48, control and status registers 50, and predicate registers (P-regs) 52.
In operation, when the processor is in scalar mode, instructions are fetched one at a time from the I-cache (instruction cache) 38 and placed in the instruction register 36. The instructions are then executed by one of the execution units 30, 32 or the control transfer unit 34, depending on the type of instruction. The first execution unit 30 executes all address computations, for which there is a file of address registers (A-regs) 54, and performs all load/store operations. The second execution unit 32 executes all arithmetic and other data manipulation instructions, for which it has a file of data registers (D-regs) 56, and a separate file of extended accumulator registers (X-regs) 58. The control transfer unit 34 executes all control transfer instructions.
When the processor is in VLIW mode, two instructions are fetched in parallel from the V-cache partition 40 and are placed in the instruction register 36. The V-cache partition 40 is part of a cache memory which is used to store VLIW instructions which are to be executed by the processor. The two instructions in the instruction register are issued in parallel to the execution units 30, 32 and are executed simultaneously. The V-cache partitions of all clusters are managed by the code decompression unit 42.
Figure 3 shows the block structure of a slave cluster 16. The slave cluster 16 comprises first and second execution units 60, 62, instruction register 64, V-cache partition 66, local memory 68, system bus interface 70, status registers 72, and predicate registers (P-regs) 74. When the processor is in VLIW mode, instruction execution is controlled by the master cluster 14, which broadcasts an address corresponding to the next instruction packet to be issued. The instructions in the instruction packet are read from the V-cache partition in each cluster, and proceed in parallel through the execution units 60, 62.
The clusters are connected in a bidirectional ring topology. Inter-cluster communication is achieved by permitting each cluster to have read-only access to the data registers (D-regs) of the neighbouring clusters. References to non-local D-regs are explicitly encoded in the instruction set in a similar way to references to local registers.
The local memory in each cluster can be read from and written to by DMA engine 28 which is attached to the system bus 22. The DMA engine 28 is able to operate independently of the processor 1 to perform various memory access operations, such as moving data from one area in memory to another. The DMA engine may be set up in advance by the processor to perform memory access operations while the processor is performing other tasks. The DMA engine contends with the processor for the system bus. The DMA engine has a lower priority than the processor for the system bus, and thus the DMA engine uses the system bus while the processor does not require access to the bus.
The instructions which are to be executed by the
processor are usually produced by a compiler. A compiler converts high level instructions (source code) into low level instructions (object code) which are to be executed by the processor. The compiler may schedule instructions to be executed in a particular execution unit in a particular cycle. The compiler may be provided, for example, on a general purpose computer.
The processor is able to implement zero overhead software pipelined loops under control of the predicate registers 52, 74 shown in Figures 2 and 3. Further details on the use of predicate registers in software pipelined loops may be found in United Kingdom patent application number 0014432.9 in the name of Siroyan Limited, the entire subject matter of which is incorporated herein by reference.
Fast Fourier Transforms
Fast Fourier Transforms are efficient algorithms for performing the Discrete Fourier Transform (DFT). The DFT analyses a sampled signal of finite duration by convolving the signal with a set of harmonic waveforms. FFTs take advantage of redundancy in the computation of the convolution between the signal and harmonics to reduce the overall computation relative to the full DFT. FFTs usually act on an array of 2^M complex input values, producing 2^M complex output values, representing amplitude and phase information.
Figure 4 shows the topology of an example FFT. The FFT is divided into a number of stages, in each of which pairs of complex input values are combined to produce pairs of complex output values. Each combination of inputs is known as a butterfly, and involves a complex multiplication, addition and subtraction. Each
butterfly therefore requires a complex coefficient which is known as the twiddle factor. The outputs of one stage form the inputs for the next stage, until the final stage is reached.
The topology shown in Figure 4 may be read either from left to right or from right to left. In the former case, the input data is in a natural sequence while the output data is in a bit reversed sequence. In this case the correct sequence of outputs can be restored by a bit reversal operation, in which each element of the output array is mapped to a new position given by reversing the M-bit pattern formed by the index of its array position. If the topology is read from right to left, the input data is in a bit reversed sequence and the output data is in natural order sequence. In this case the bit reversal operation must be performed on the input data before the transform is started. The bit reversal operation is frequently a significant overhead in completing the transform.
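The bit reversal operation described above may be modelled as follows (Python; the function illustrates only the reordering itself, not the DMA engine's implementation of it):

```python
def bit_reverse_permute(data, m):
    """Move the element at index i to the index obtained by reversing
    i's m-bit pattern.  Note the permutation is its own inverse."""
    assert len(data) == 1 << m
    out = [None] * len(data)
    for i, v in enumerate(data):
        rev = int(format(i, f'0{m}b')[::-1], 2)  # reverse the m-bit index
        out[rev] = v
    return out
```

For example, with M = 3, index 1 (binary 001) maps to index 4 (binary 100), and vice versa.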
The calculations which are performed in each stage of the FFT depend on whether the FFT uses decimation in time or decimation in frequency. Decimation in time involves dividing a signal in the time domain, while decimation in frequency involves dividing a signal in the frequency domain. Figure 5 shows a decimation in time butterfly, and Figure 6 shows a decimation in frequency butterfly. Each butterfly takes a pair of complex input values x1 and x2. In the decimation in time butterfly, x2 is multiplied by a complex twiddle factor T. The complex product is then added to x1 to yield a first output y1, and subtracted from x1 to yield a second output y2. In the decimation in frequency butterfly, x1 is added to x2 to yield y1, and x2 is subtracted from x1 and the result is multiplied by a twiddle factor T to yield y2.
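The two butterflies of Figures 5 and 6 may be expressed as follows (Python; the function names are illustrative):

```python
def butterfly_dit(x1, x2, t):
    """Decimation in time radix-2 butterfly (Figure 5): x2 is multiplied
    by the twiddle factor T; the product is added to and subtracted
    from x1."""
    p = t * x2
    return x1 + p, x1 - p

def butterfly_dif(x1, x2, t):
    """Decimation in frequency radix-2 butterfly (Figure 6): the sum
    passes straight through; the difference is multiplied by the
    twiddle factor T."""
    return x1 + x2, (x1 - x2) * t
```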
The FFT shown in Figure 4 reads from and writes to the same positions of the array in each stage. Such an FFT is known as an in-place FFT. In an in-place FFT, each butterfly over-writes its own inputs without affecting any other butterfly in the same stage. The butterflies in each stage form disjoint groups, as indicated in Figure 4. The twiddle factors in each group are either identical or related by incremental rotations in the complex plane.
The FFT shown in Figure 4 is referred to as a radix-2 FFT, because each butterfly operates on a pair of complex inputs. Figure 7 shows an example radix-4 FFT, in which each butterfly operates on four complex input values to produce four complex output values. The radix-4 algorithm requires fewer complex additions and multiplications than the equivalent sized radix-2 algorithm, and thus may result in a more efficient implementation. Each radix-4 butterfly uses three successive powers of the twiddle factor, obtained from elements i, 2i and 3i of the twiddle table. The radix-4 butterfly may be organized such that the output of the transform is in bit reversed sequence in the same way as the radix-2 algorithm.
Embodiments of the invention
In embodiments of the invention, an FFT is partitioned amongst a number of individual clusters of a VLIW processor. This is done by dividing the input data array into blocks of contiguous data elements, and allocating each block to a separate cluster. The data in a specific cluster's partition are stored in that cluster's local memory. Each cluster also stores a local memory copy of all twiddle factors required for the transform. In each stage, each cluster operates on its block of input data to produce a block of output data, which is then used as input data for the next stage.
Figure 8 shows how the FFT of Figure 4 might be partitioned amongst four clusters. In Figure 8, partition 0 is formed from the first four values of the input data, and this partition is assigned to cluster 0. Partition 1 is formed from the second four values of the input data, and this partition is assigned to cluster 1, and so forth. For simplicity, the same format as Figure 4 has been used in Figure 8, so that in the first stage four butterflies appear in each of partition 1 and partition 2. However it will be appreciated that computation is distributed as evenly as possible, so that in this example each cluster operates on two butterflies in each stage.
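The division of the input array into contiguous per-cluster blocks may be sketched as follows (Python; illustrative only):

```python
def partition(data, num_clusters):
    """Divide the input array into contiguous blocks, one per cluster,
    as in Figure 8 (e.g. 16 points over four clusters gives four
    points per cluster's partition)."""
    block = len(data) // num_clusters
    return [data[c * block:(c + 1) * block] for c in range(num_clusters)]
```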
Figure 9 shows how the FFT of Figure 7 might be partitioned amongst four clusters. In this example, each cluster operates on one radix-4 butterfly in each stage.
It can be seen that the first two (left-hand) stages of Figure 8, and the first stage of Figure 9, have groups which span two or more partitions. In these stages any butterfly calculated by a cluster requires one or more inputs from outside that cluster's own partition. Data from one cluster which are required by another cluster are transferred using inter-cluster moves via data registers which are accessible by neighbouring clusters. In certain embodiments of the invention, there may be vacant slots in the schedule for each butterfly, and inter-cluster moves may be introduced into these vacant slots. This may make it possible to
introduce the inter-cluster moves required for the initial non-partitioned stages of the transform without incurring any overhead.
The combinations of clusters which share data change at each stage, up to a point where an entire group falls within a single partition. At this point each cluster needs only the data in its own local memory, along with the twiddle table.
Depending on the type of FFT, the twiddle factors in either the first stage or the last stage of the FFT are all powers of j (that is, ±1 or ±j). For example, in an FFT using decimation in time with a natural input sequence and bit reversed output sequence, the first stage has these "trivial" twiddle factors, while in a decimation in frequency FFT with the same input and output sequences the final stage has these trivial twiddle factors. Butterflies with trivial twiddle factors require only the addition and/or subtraction of real and imaginary parts, without any complex multiplication. In some embodiments of the invention the parameters of the algorithm are selected such that the stage in which the most inter-cluster data transfers are required is also the stage in which only trivial twiddle factors are required. This may reduce the amount of code which is needed to implement the FFT.
In embodiments of the invention, the code for performing an FFT is organised as nested loops over stages, groups and butterflies. For optimal performance, each loop should be of maximal length, whilst excluding conditional code within the loop. In most stages, this is achieved by arranging the loop over butterflies to be the innermost loop. However, as the number of butterflies per group is reduced at each
stage, and the number of groups increases by the same factor, a point is reached where it is advantageous to reverse the sequence of the inner two loops of the transform, so that the loop over groups is nested inside the loop over butterflies.
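The two loop orderings described above may be illustrated as follows (Python; only the iteration order over (group, butterfly) index pairs is modelled, not the butterfly arithmetic):

```python
def stage_butterflies_inner(num_groups, bflies_per_group):
    """Loop over groups outermost, butterflies innermost: suited to the
    early stages, where each group contains many butterflies."""
    order = []
    for g in range(num_groups):
        for b in range(bflies_per_group):  # long inner loop
            order.append((g, b))
    return order

def stage_groups_inner(num_groups, bflies_per_group):
    """Loop over butterflies outermost, groups innermost: suited to the
    late stages, where there are many small groups.  The b-th butterfly
    of every group is performed in the b-th outer iteration."""
    order = []
    for b in range(bflies_per_group):
        for g in range(num_groups):  # long inner loop
            order.append((g, b))
    return order
```

Both orderings visit the same set of butterflies; reversing the nesting simply keeps the innermost loop as long as possible.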
In a decimation in time FFT with a natural input sequence and bit reversed output sequence, the stages of the FFT can be placed into four categories:
1. Initial not fully-partitioned stage, with inter-cluster moves and trivial twiddle factors. Loops over butterflies only.
2. Subsequent not fully-partitioned stages, with inter-cluster moves and non-trivial twiddle factors. Loops over butterflies innermost.
3. Fully-partitioned stages with loops over butterflies innermost.
4. Fully-partitioned stages with loops over groups innermost.
Depending on the particular implementation, one or more of the above categories may be absent and/or other categories may be present. For example, in a corresponding decimation in frequency FFT, the first non-partitioned stage would not have trivial twiddle factors, but the final stage would. For such an arrangement, the categories would be as follows:
1. Initial not fully-partitioned stage, with inter-cluster moves. Loops over butterflies innermost.
2. Fully-partitioned stages with loops over butterflies innermost.
3. Fully-partitioned stages with loops over groups innermost.
4. Fully-partitioned stage with loop over groups only
(there being a single butterfly in each group) and trivial twiddle factors.
Each of the above categories would normally be coded as a separate routine.
Before the FFT is executed, the data must first be distributed (scattered) to the local memories of each cluster. This operation is carried out by DMA engine 28. The DMA engine 28 also gathers the data from the local memories at the end of the FFT. In addition, the DMA engine 28 may be used to perform the bit reversal operation on the input or output data at the same time as scattering or gathering the data.
Usually it is required to perform the FFT on successive input data arrays. In this case, the DMA engine 28 may be set up to scatter the input data to the clusters' local memories, and to gather data from the clusters' local memories, at the same time as FFT computations are being performed by the various clusters. This is achieved by using alternate buffers in the clusters' local memories for alternate input data arrays. The DMA engine 28 may then be writing to one buffer in a cluster while that cluster is performing the FFT computations on the data in the other data buffer.
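The alternate-buffer scheme may be modelled as follows (Python; the `scatter`, `fft` and `gather` callables are placeholders, and the DMA/processor overlap is only implied, since the calls here run sequentially):

```python
def run_pipelined(arrays, fft, scatter, gather):
    """Model of the double-buffered scheme: input array i is scattered
    into buffer A (i even) or buffer B (i odd), so that the DMA engine
    can be filling one buffer while the clusters transform the other."""
    buffers = {0: None, 1: None}  # 0 = buffer A, 1 = buffer B
    results = []
    for i, arr in enumerate(arrays):
        buf = i % 2                 # alternate buffers on each iteration
        buffers[buf] = scatter(arr)                # DMA scatter
        results.append(gather(fft(buffers[buf])))  # compute, then gather
    return results
```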
When the FFT is run on the processor, the FFT is first initialized using a section of scalar code which runs on the master cluster. As part of this initialization, a table of twiddle coefficients is copied to each cluster's local memory, and the DMA engine 28 is set up to perform scattering of data to and gathering of data from the clusters. The DMA engine 28 may also be used to distribute the twiddle factors, instead of the master cluster, if required. The processor then enters
VLIW mode, in which each cluster performs the FFT computations on its own data partition.
Figure 10 is a flow chart showing the basic steps in performing an FFT in an embodiment of the invention. In this example, the FFT uses decimation in time. Referring to Figure 10, in step 100 the parameters of the FFT are set. In this step the number of stages, the size of the partition in each cluster, and the number of twiddle factors are determined in dependence on the size of the transform and the number of clusters available. This step is carried out by the master cluster in scalar mode.
In step 102, the DMA engine is set up to perform scattering of input data to the clusters' local memories, and gathering of data from the clusters' local memories. In step 104, an identical table of twiddle factors is copied from main memory to each cluster's local memory. This step may be carried out by the master cluster, or by the DMA engine.
In step 106 the DMA engine distributes the first input data array to input buffer A in each cluster. In step 108 the DMA engine distributes the next input data array (if present) to input buffer B in each cluster, or to buffer A if buffer B was last used.
In steps 110 to 116 the FFT computations are performed in the various clusters using the data stored in buffer A of each cluster, or buffer B if buffer A was last used. These steps are carried out in VLIW mode, and at the same time as step 108. In step 110 the first stage of the transform is performed, with inter-cluster data sharing and trivial twiddle factors. In this stage there is only a single group, and each cluster performs
a subset of the group. Each cluster therefore loops over the butterflies which are in its partition.
In step 112 any subsequent not fully-partitioned stages are performed. These stages involve inter-cluster moves and non-trivial twiddle factors.
In step 114 further stages of the transform are performed, in which all clusters operate in parallel using exclusively data from their own partition. This step is performed on those stages in which the number of groups per cluster is less than or equal to the number of butterflies per group. In each stage of this step, each cluster processes one or more groups of butterflies, with the inner loop iterating over the butterflies (i.e. all of the butterflies in one of the groups are performed in each iteration).
In step 116 the final stages of the transform are performed. This step is performed on those stages in which the number of butterflies per group is less than the number of groups per cluster. In each stage of this step, the inner loop in each cluster iterates over the groups (i.e. the ith butterfly of each of the groups is performed in the ith iteration).
In step 118, the data from buffer A of each cluster (or buffer B if that was used in steps 110 to 116) is gathered by the DMA engine. A bit reversal operation is carried out as part of the gather operation, in order to restore the natural sequence of the data. This step is carried out in parallel with the next iteration of steps 110 to 116 (if performed).
In step 120 it is determined whether another iteration of the FFT is to be performed on the next input data
array. If so, the processor repeats steps 108 to 118. In step 108, data is distributed to the buffer alternate to that used the previous time step 108 was carried out.
Figure 11 is a timing diagram which shows how the various steps are performed concurrently on different parts of the processor. At time T=0 the FFT algorithm is started, and the initialization steps (steps 100 to 104 in Figure 10) are performed on the master cluster. At T=1 the first input data array (data 1) is scattered to buffer A in each cluster (step 106 in Figure 10). At T=2, the scatter operation is complete, and the various clusters begin processing the data (steps 110 to 116 in Figure 10). At the same time, the DMA engine starts scattering the data from the second input data array (data 2) to buffer B of each cluster (step 108 in Figure 10). At T=3, the FFT computations in the various clusters are complete. Each cluster then immediately starts processing the data which is in its buffer B (steps 110 to 116 in Figure 10). At the same time the data in the various buffers A are gathered by the DMA engine, and the bit reversal operation is performed (step 118 in Figure 10). At time T=4 the gather operation is completed. The DMA engine then begins scattering the data from the third input data array (data 3) to buffer A of each cluster. This takes place while the various clusters are still processing the data from the second input data array. At time T=5 the computations in the various clusters are complete, and a new cycle begins.
In one implementation of the FFT, the butterflies are arranged so that only the data from the butterfly legs which fall within the same partition are written to that cluster's memory. This is achieved in a radix-2
FFT by providing a pair of butterflies with two input legs and one output leg in those locations where an output of the butterfly would otherwise have to be copied from one cluster to another. This arrangement is shown in Figure 12. Although this results in some calculations being duplicated in different clusters, it reduces the number of inter-cluster moves which are required, and therefore speeds up the overall processing time. Figure 13 shows how the FFT of Figure 8 would be modified in this implementation. Such an implementation might be advantageous where there is a large number of clusters, requiring many inter-cluster moves.
For the very first stage, even the input moves can be eliminated if each execution unit has a copy of the entire transform input array, at the cost of additional memory usage and additional time in distributing all the data to every cluster. The input data could also be ordered in a non-natural input order so that each cluster has all the data it needs for its butterflies, and then writes out some data to the partition in local memory of another cluster.
Any form of FFT may be used when implementing the present invention. For example, radix-2 FFTs, radix-4 FFTs or higher radix FFTs or mixed radix FFTs may be used. The FFT may use either decimation in time or decimation in frequency. The FFT may have natural order inputs and bit reversed outputs, or bit reversed inputs and natural order outputs. The FFT may be partitioned amongst any desired number of clusters. The FFT may operate on an input data array of any size (not necessarily a power of two). The FFT may be other than an in-place algorithm.
In one preferred embodiment, the FFT is a radix-4 decimation in frequency FFT with natural order inputs and bit reversed outputs, which has been found to give good results. For partitioning amongst two or four clusters the inter-cluster moves in the initial stage have been found to introduce zero overhead.
The FFT may be programmed using a high level language, such as C, and compiled to yield the object code which is to be run by the processor. In some implementations, it may not be possible to assign computation directly to a cluster in C-code, this being the job of the compiler. However, the choice of cluster which is made by the compiler may be influenced by associating the variables in the computation with specific clusters. This can be done by specifying that certain variables be stored in the local memory of a specific cluster. The compiler will then assign computation using that variable to that cluster. The compiler's assignment of computation to a specific cluster uses algorithms which aim to optimize the code schedule for performance.
It will be appreciated from the above that some FFT algorithms are amenable to the allocation of different parts of each stage to different execution units by virtue of the existence of independent subsets of the data at the input to some stages, thus leading to independence of the calculations performed by each execution unit (i.e. no inter-cluster moves to introduce overhead and thus degrade performance).
Although the above description relates, by way of example, to a clustered VLIW processor it will be appreciated that the present invention is applicable to any processor having at least two parallel pipelines.
Thus the invention may be applied to parallel processors other than VLIW processors, and to processors not having clustered pipelines. A processor embodying the present invention may be included as a processor "core" in a highly-integrated "system-on-a- chip" (SOC) for use in multimedia applications, network routers, video mobile phones, intelligent automobiles, digital television, voice recognition, 3D games, etc.