Title
Power-efficient Shape Adaptive Discrete Cosine Transformation
Field of the Invention

This invention relates to improving data processing in video object-based texture coding. More particularly, the present invention relates to improving power consumption efficiency when processing data for computing the SA-DCT of MPEG-4 video data.
Background to the Invention
There is an ongoing global trend to shift multimedia applications from more traditional delivery platforms, such as set-top boxes or desktop PCs, to mobile platforms in the so-called pervasive computing era. This is evidenced by increasingly frequent announcements by device manufacturers of new personal digital assistants (PDAs) and mobile telephone handsets with enhanced video functionality. Mobile platforms present their own unique challenges, especially in the domain of multimedia. The devices typically have low computational power, small memory and short battery life, and face ever-increasing miniaturisation requirements. These limitations are exacerbated when running very demanding real-time multimedia applications, such as video encoding.
In particular, the short battery life issue caused by excessive power consumption of mobile devices has become the biggest design challenge facing truly mobile multimedia. In general, power consumption in a CMOS circuit has two components - dynamic power and static power. Traditionally the dynamic component was by far the most dominant, but this is changing due to physical phenomena as process technologies shrink below 90nm. It is estimated that when devices are scaled to 45nm (around the year 2007) the static component will be equal in proportion to the dynamic component. Even so, tackling static power primarily involves working at the circuit and physical levels and depends on the implementation technology employed. Dynamic power consumption is caused by circuit node switching activity, and is regularly modelled with Equation 1, which sums over all nodes S in the circuit.
P_dyn = Σ (i=1 to S) α_i C_i f V²        Equation 1

The parameters in this equation are as follows:
C_i - Capacitive load of node i
f - Circuit operating frequency
V - Circuit operating voltage
α_i - Switching probability of node i
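As a concrete illustration, Equation 1 can be evaluated directly. The node parameters below are invented for illustration only; they are not measurements from any real circuit:

```python
def dynamic_power(nodes, f, v):
    """Evaluate the dynamic power model of Equation 1.

    nodes: iterable of (alpha_i, c_i) pairs, the switching probability
    and capacitive load of each of the S circuit nodes.
    f, v: circuit operating frequency and voltage.
    """
    return sum(alpha_i * c_i * f * v * v for alpha_i, c_i in nodes)

# Illustrative (not measured) node parameters for a tiny circuit.
nodes = [(0.5, 1e-15), (0.1, 2e-15), (0.25, 1.5e-15)]

# Halving both f and V cuts dynamic power by a factor of 8, which is
# why reducing latency (enabling a lower f and V at the same
# throughput) is such an effective energy optimisation.
p_fast = dynamic_power(nodes, f=100e6, v=1.2)
p_slow = dynamic_power(nodes, f=50e6, v=0.6)
```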
A commonly adopted approach to reduce Equation 1 at a system level is to design energy-efficient hardware accelerator peripherals for the most computationally demanding tools. This results in improved throughput since the central processor is relieved of these loads, and the hardware accelerators operate in parallel with the processor. Power dissipation is also improved since the peripherals are dedicated devices and their architectures may be tailored in terms of power consumption to suit the computation. This is not as easy to achieve with general-purpose processors, since by definition they are used to compute many different kinds of operation.
In the context of the above multimedia applications, natural video scenes consist of a stationary background and moving foreground objects that are of arbitrary shape. When encoding texture, a video codec system divides each rectangular video frame into an array of non-overlapping 8x8 texture blocks and processes these sequentially. In previously standardised video coding schemes (e.g. MPEG-1, MPEG-2) an 8x8-pixel Discrete Cosine Transform processes all blocks, regardless of whether they belong to the background or to a foreground object. MPEG-4 uses a Shape-Adaptive DCT to support object-based texture encoding, which in turn allows object manipulation as well as giving improved compression efficiency. In MPEG-4 the object shape description of a frame is termed the alpha-plane or video object plane (VOP). The alpha-plane of a video object can be provided by (semi-)automatic segmentation of the video sequence. This technique is not covered by the MPEG-4 standardization process and depends on the application. For blocks that are located entirely inside the VOP, the SA-DCT behaves identically to the 8x8 DCT. Blocks that are located entirely outside the VOP are skipped to save needless processing. Blocks that lie on the VOP boundary are encoded depending on their shape, and only the opaque pixels within the boundary blocks are actually coded.
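The three block cases just described can be sketched as follows. The function name and alpha encoding are illustrative, not taken from the MPEG-4 reference software:

```python
def classify_block(alpha_block):
    """Classify an 8x8 texture block against the VOP alpha-plane.

    alpha_block is an 8x8 nested list where 0 marks a transparent
    pixel and any non-zero value marks an opaque VOP pixel
    (illustrative encoding).
    """
    opaque = sum(1 for row in alpha_block for a in row if a != 0)
    if opaque == 64:
        return "interior"    # entirely inside the VOP: plain 8x8 DCT
    if opaque == 0:
        return "exterior"    # entirely outside the VOP: skipped
    return "boundary"        # on the VOP boundary: SA-DCT applies

interior = [[255] * 8 for _ in range(8)]
exterior = [[0] * 8 for _ in range(8)]
boundary = [[255] * 4 + [0] * 4 for _ in range(8)]
```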
The SA-DCT algorithm was originally formulated in response to the MPEG-4 requirement for video object-based texture coding, and it builds upon the 8x8 2D DCT computation by including extra processing steps that manipulate the shape information of a video object. The gain is increased compression efficiency at the cost of additional computation. The SA-DCT is one of the most computationally demanding blocks in an MPEG-4 video codec, therefore energy-efficient implementations are important, especially on battery-powered wireless platforms. Power consumption issues require resolution for mobile MPEG-4 hardware solutions to become viable. In summary, the SA-DCT gives improved compression efficiency compared with the regular 8x8 DCT since it leverages the video object information. However, it is more computationally demanding since additional processing steps are necessary, which induce tighter real-time processing constraints. As such, there is a valid case for hardware acceleration of the SA-DCT tool, especially given the relatively reduced performance and computational capability of mobile processors. Any hardware acceleration solution proposed for the SA-DCT on a mobile platform must make power efficiency a primary design feature while matching the other requirements for throughput and low silicon area.
Although there have been numerous approaches to object-based texture coding in the literature, version 2 of the MPEG-4 visual standard has adopted Sikora's SA-DCT algorithm. Two main architectures for implementing the SA-DCT have been proposed.
The first is based on a time-recursive filter structure. The second, which is held in higher favour, is the feed-forward architecture. It offers scalability, modularity and regularity, and claims to avoid the numerical inaccuracy or bit-width explosion that can occur in some of the fast algorithms that are more suitable for a software implementation. Both architectures focus on implementations of an N-point 1D DCT (where N = 0,1,...,8), assuming N has been established for the data vector being processed and the VOP pixels in the vector have been identified. This is not a trivial task, and this invention concentrates on computing N in an efficient manner, optimising accesses to a transpose memory when storing intermediate coefficients and to an external memory when storing the final SA-DCT coefficients.
Object of the Invention
It is an object of the present invention to reduce processing latency without excessive use of parallelism, whereby a device configured according to the present invention can operate at a lower frequency and voltage while maintaining throughput. Power consumption is therefore reduced since, by Equation 1, this corresponds to reducing f and V without a significant increase in S and C_i. To achieve these benefits, this invention uses novel approaches to subsume SA-DCT processing stages 0 and 1 into a single processing stage, as well as subsuming stages 2 and 3 into a single processing stage. Stages 4 and 5 are also partly integrated.
After reducing latency to arrive at an optimal trade-off between {S, C_i} and {f, V} in terms of energy efficiency, it is another object of the present invention to minimise the switching probability α_i for the optimised number of nodes S to be switched. There is scope for this especially since the computational load involved for each of the SA-DCT processing stages is entirely dependent on the shape information. The image data processing unit of the present invention has therefore been designed such that no unnecessary switching occurs.
Summary of the Invention
According to an aspect of the present invention, an apparatus is provided for encoding image data comprising processing means for processing a shape-adaptive discrete cosine transformation, an input buffer, a transpose memory and an output buffer. The apparatus receives said image data as picture screen elements (pixels), each of which is defined with respective luminance, chrominance and alpha values. The input buffer is configured to store said pixels vertically as said image data is received. The processing means is configured to process a plurality of first coefficients from said vertically-stored pixels and a plurality of second coefficients from said first coefficients. The transpose memory is configured to store said plurality of first coefficients horizontally as they are output. The output buffer is configured to output said second coefficients as they become available from said processing means, whereby said second coefficients are output with minimal latency.
According to another aspect of the present invention, a method of encoding image data is provided, said image data being picture screen elements (pixels), each of which being defined with respective luminance, chrominance and alpha values, and said method comprises the steps of storing said pixels vertically upon receiving said image data in an input buffer; processing a plurality of first coefficients from said vertically-stored pixels by way of performing a shape-adaptive discrete cosine transformation thereon; processing a plurality of second coefficients from said first coefficients by way of performing a shape-adaptive discrete cosine transformation thereon; storing said plurality of first coefficients horizontally in a transpose memory as they are output; and outputting said second coefficients from an output buffer as they become available with minimal latency.
According to yet another aspect of the present invention, a processing system is provided which comprises the apparatus for encoding data detailed thereabove. The processing system includes processing means, memory means, image data input means and networking means, wherein said memory means stores instructions which configure said processing means to capture said image data as picture screen elements (pixels), each of which being defined with respective luminance, chrominance and alpha values, and exchange said captured image data with said apparatus for the processing thereof over an input/output bus.
Preferably, the pixels are video object plane (VOP) pixels, and the second coefficients are VOP coefficients. Advantageously, these VOP coefficients are output according to a number N of VOP coefficients. The transpose memory horizontally stores said first coefficients processed by said processing means as they are output therewith and desirably updates said N value iteratively for each row in the transpose memory. The shape-adaptive discrete cosine transformation preferably includes at least two variable N-point 1-dimensional discrete cosine transformations and the processing means advantageously includes at least two variable N-point 1-dimensional discrete cosine transformation-processing modules.
In a preferred embodiment of the present invention, the input buffer, the transpose memory, the output buffer and the processing means are pipelined and interleaved, whereby latency and power consumption of the apparatus thereabove and any device embodying the above method are reduced.
The computational load of the SA-DCT processing means preferably varies with the shape of the image data, whereby only the registers appropriate to a particular processing step are clocked and the dynamic power consumption of said apparatus is reduced.
Advantageously, the networking means of the above system are wireless and said system is a mobile phone handset.
Brief Description of the Drawings
The features and advantages of the invention will be presented in conjunction with the following illustrations listed below: Figure 1 shows a preferred embodiment of the present invention in an environment, including at least one audio-visual data processing terminal and at least one remote terminal;
Figure 2 provides an example of the audio-visual data processing terminal, which includes processing means, memory means, communicating means and an image data processing unit;
Figure 3 illustrates a boundary block of 8 by 8 pixels with a varying N count;
Figure 4 shows a conceptual architecture of the image data processing unit of Figure 2, including an input buffer, a transpose memory and an output buffer;
Figure 5 further details the input buffer of Figure 4;
Figure 6 illustrates the order in which the pixels of Figure 3 are input to the image data processing unit of Figures 2 to 5;
Figure 7 provides an example timing diagram of the operation of the input buffer of
Figure 5;
Figure 8 shows a preferred embodiment of a fixed 8-point 1-dimensional DCT processing element for outputting horizontally-packed data;
Figure 9 shows a preferred embodiment of a variable N-point 1-dimensional DCT processing element for outputting horizontally-packed data;
Figure 10 further details the transpose memory of Figure 4;
Figure 11 provides an example timing diagram of the operation of the transpose memory of Figure 10; and
Figure 12 further details the output buffer of Figure 4.
Detailed Description of the Drawings
This invention tackles the problem of accelerating the MPEG-4 object-based shape adaptive discrete cosine transform (SA-DCT) function with power-efficient hardware, represented hereinbelow as an image data processing unit. The SA-DCT is less regular than the conventional 8x8 block-based DCT, because processing decisions are entirely dependent on the shape information associated with each individual block. The 8x8 DCT requires 16 1D 8-point DCT computations if implemented using the row-column approach. Each 1D transformation has a fixed length of 8, with fixed basis functions. This is amenable to hardware implementation since the data path is fixed and all parameters are constant.
The SA-DCT, however, requires up to 16 1D N-point DCT computations, where N ∈ {2,3,...,8} (N ∈ {0,1} are trivial cases). In general N can vary across the possible 16 computations depending on the shape. With the SA-DCT, the basis functions vary with N, complicating hardware implementation. However, with respect to this invention an efficient implementation of a variable N-point 1D DCT is considered to be a solved problem given the large amount of prior art in the area. Instead, this invention focuses on the logic extensions necessary around the variable N-point 1D DCT modules to compute the SA-DCT.
In summary, the SA-DCT processing stages according to the known prior art are:
Stage 0 - Load input block data from memory
Stage 1 - Vertically shift VOP pels
Stage 2 - Vertical 1D DCT on each column
Stage 3 - Horizontally shift intermediate vertical coefficients
Stage 4 - Horizontal 1D DCT on each row of intermediate coefficients
Stage 5 - Store final coefficient block data to external memory
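The composition of stages 1 to 4 can be sketched behaviourally in Python (stages 0 and 5 being memory transfers). The orthonormal DCT-II normalisation and all names below are assumptions for illustration, not the patent's hardware datapath:

```python
import math

def dct_1d(x):
    """Orthonormal N-point 1D DCT-II, with N = len(x)."""
    n = len(x)
    return [
        (math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
        * sum(x[j] * math.cos((2 * j + 1) * k * math.pi / (2 * n))
              for j in range(n))
        for k in range(n)]

def sa_dct(block, alpha):
    """Behavioural model of stages 1-4 on one 8x8 boundary block."""
    # Stage 1: vertically shift (pack) the VOP pels of each column.
    cols = [[block[r][c] for r in range(8) if alpha[r][c]] for c in range(8)]
    # Stage 2: vertical N-point 1D DCT on each non-empty column.
    cols = [dct_1d(col) if col else [] for col in cols]
    # Stage 3: horizontally shift the intermediate vertical coefficients.
    rows = [[col[r] for col in cols if len(col) > r] for r in range(8)]
    # Stage 4: horizontal N-point 1D DCT on each non-empty row.
    rows = [dct_1d(row) if row else [] for row in rows]
    # Pad to 8x8 for illustration; stage 5 would store only the
    # significant coefficients.
    return [row + [0.0] * (8 - len(row)) for row in rows]
```

For a block lying entirely inside the VOP this reduces to the ordinary 8x8 2D DCT, as the text notes.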
The block-based 8x8 DCT does not require stages 1 and 3. In addition, stages 0 and 5 are somewhat trivial for an 8x8 DCT since the amount of data being loaded and stored is fixed. With the SA-DCT, however, this amount varies depending on the alpha mask, so there is scope for adapting the number of processing steps based on the shape information to achieve minimum processing latency. If the challenge of implementing stages 2 and 4 is considered to be a solved problem, it is clear that this invention targets the remaining processing stages (1, 3 and 5). An alternative embodiment for implementing stages 2 and 4 is nonetheless described in this document, which extends a state-of-the-art static N = 8 1D DCT architecture in order to implement the variable N-point DCT required for the SA-DCT.
A preferred embodiment of the present invention is shown in an environment in Figure 1, which includes at least one network-connected apparatus in the form of a mobile telephone handset 101. Mobile phone 101 is configured with audio-visual data capturing means, such as a built-in camera 102, and may connect to remote terminals over any of a plurality of wireless networks 103, including a low-bandwidth Global System for Mobile Communication ('GSM') network, a higher-bandwidth General Packet Radio Service ('GPRS') network, or a yet higher-bandwidth 'G3' network. Mobile phone 101 receives or emits data encoded as a digital signal over wireless networks 103, wherein said signal is relayed respectively to or from mobile phone 101 by the geographically-closest communication link relay 104 of a plurality thereof. Said plurality of communication link relays allows said digital signal to be routed between mobile phone 101 and its intended recipient or from its remote emitter. Mobile phone 101 may also connect to a remote, but proximate, terminal over a local wireless network 105 such as the medium-bandwidth 'Bluetooth™' network.
Upon establishing such a network connection, mobile phone 101 may therefore wirelessly broadcast (106) voice or audio-visual data captured with said camera, over said wireless networks 103 or 105, to another network-connected terminal 107 such as another mobile communicating device, e.g. mobile telephone handset 107, having a configuration suitable for receiving and locally processing said broadcast data. Mobile phone 101 may also connect to a remote terminal 108, such as a desktop computer or a portable computer, a variation thereof being a personal digital assistant,
over a Wide Area Network ('WAN') 109, such as the Internet, by way of a remote gateway 110. Gateway 110 is for instance a communication network switch coupling digital signal traffic between wireless telecommunication networks, such as the network 103 within which the example wireless data transmission 105 takes place, and said wide area network (WAN) 109. Said gateway 110 further provides protocol conversion if required, for instance if mobile phone 101 broadcasts (106) data to said terminal 108, which is itself only connected to the WAN 109.
Thus, the potential exists for data exchange between any of mobile phone 101, mobile communicating device 107 and terminal 108, by way of wireless data transmission 105 and the Internet 109 interfaced by gateway 110. It will however be readily apparent to those skilled in the art that the above environment is provided by way of example only, and that the present invention may be embodied in any network comprising devices connected thereto exchanging data encoded as described hereinbelow.
An example of the arrangement of components of terminal 101 is shown in Figure 2 in further detail. Mobile phone 101 firstly includes processing means in the form of a general-purpose central processing unit (CPU) 201, which is for instance an Intel ARM X-Scale processor manufactured by the Intel Corporation of Santa Clara, California, USA, acting as the main controller of mobile phone 101 and processing data. Mobile phone 101 next includes memory means 202, which includes non-volatile random-access memory (NVRAM) 203 totalling 512 kilobytes in this embodiment. NVRAM
203 provides non-volatile storage of instructions and initialising data for processing means 201 when not in operation. Memory means 202 preferably also includes volatile random-access memory (RAM) 204 totalling 16 megabytes in this embodiment. RAM
204 provides volatile storage of data for processing means 201 in operation.
CPU 201, NVRAM 203 and RAM 204 are connected by a data input/output bus 205, over which they communicate and to which further components of mobile phone 101 are similarly linked in order to provide wireless communication functionality and receive input data. Network communication functionality is provided by a modem 206, which provides the interface to external communication systems, such as the GSM, GPRS or G3 cellular telephone networks 103 shown in Figure 1. An analogue-to-digital
converter 207 receives analogue voice data from the user of mobile phone 101 through a microphone 208, or from remote devices connected to the GSM network only and processes it into digital data. An aerial 209 is preferably provided to amplify the network communication operation.
Analogue input data may be received from microphone 208 and digital data may be locally input with data input interface 210, which is a keypad. In this embodiment, a third data input interface 102 is provided as a CCD camera configured to capture visual data as a digital video frame defined by a plurality of picture screen elements (pixels), each of which having respective luminance, chrominance and alpha numerical values and wherein said alpha value may be zero or one. In operation, memory means 202 stores instructions which configure CPU 201 to process image data captured by camera 102 as pixels, each having a respective luminance, chrominance and alpha value.
Power may be provided to mobile phone 101 by an electrical converter 212 connected to an internal module battery 213. Output data, in addition to digital output data broadcast over networks 103, includes local visual data output by CPU 201 to a Video Display Unit 214 and audio data output by CPU 201 to a Speaker Unit 215. Said arrangement is described herein by way of a generalised data processing architecture only, in order to not unnecessarily obscure the present description, and it will be readily understood by those skilled in the art that such arrangement may vary to a fairly large extent.
According to the present invention, however, mobile phone 101 also includes an image data processing unit 216 coupled to bus 205 so as to exchange input and output data with CPU 201 and memory means 202. Image processing unit 216 preferably includes an input buffer module 217, a transpose memory module 218, an output buffer module
219 and processing means 220 for processing the shape adaptive discrete cosine transformation computations, shown in the preferred embodiment as two processing modules 220A, 220B.
Digital image data captured by camera 102 comprises pixels having luminance, chrominance and alpha values, wherein respective alpha values of a plurality of co-located pixels define at least one shape of at least one video object. With reference to the encoding of blocks that lie on the VOP boundary depending on their shape, an example of corresponding opaque pixels within a boundary block is shown in Figure 3 with a varying N count. The block measures 8 pixels by 8 pixels, and the 64 pixels thereof may belong to the shape or not. The first instance 301 of the block illustrates a varying value N as the number of shape-belonging pixels per pixel column 302 to 309 when vertical 1D DCT processing is performed, as per stage 2 of the prior art above. The second instance of the block 310 illustrates a varying value N as the number of shape-belonging pixels per pixel row 311 to 318 when horizontal 1D DCT processing is performed, as per stage 4 of the above prior art.
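The two N counts illustrated in Figure 3 can be derived directly from the alpha mask. A sketch, with illustrative names; note that the horizontal N values apply to the packed intermediate coefficients, not to the raw alpha rows:

```python
def vertical_n(alpha):
    """N per column (stage 2): VOP pixels in each of the 8 columns."""
    return [sum(1 for r in range(8) if alpha[r][c]) for c in range(8)]

def horizontal_n(alpha):
    """N per row after vertical packing (stage 4).

    Row r of the packed intermediate block holds one coefficient for
    every column whose vertical N exceeds r, since each column's
    coefficients are packed from the top.
    """
    n_vert = vertical_n(alpha)
    return [sum(1 for n in n_vert if n > r) for r in range(8)]
```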
A conceptual architecture of the image data processing unit 216 of Figure 2 is shown in Figure 4 in relation to the subsuming of the prior art stages by the present invention. The input buffer 217 performs (401) processing stages 0 and 1 simultaneously. As the texture and alpha data is loaded from the input data bus 205, it undergoes vertical packing based on the information on the input alpha bus 402. This loading and packing is achieved without needing more clock cycles than a conventional DCT that does not require packing. The input buffer 217 uses alternating buffers to minimise area requirements. Using a combination of clock gating and switching the data in registers only when necessary eliminates unnecessary power consumption.
The horizontal packing (stage 3) (403) is performed in tandem with the vertical 1D N-point DCT processing (404) by module 220A, such that as soon as stage 2 (404) is complete, the data is stored in the transpose memory 218 in the appropriate horizontally packed formation 310 and the value of N for each row has been evaluated. Therefore horizontal processing (stage 4) (405) by module 220B can begin as soon as the vertical processing 403, 404 is complete. By combining the processing stages in this manner the processing latency (number of pipeline stages) is reduced, which saves power. Again, by using a combination of clock gating and switching the data in registers only when absolutely necessary, power is also saved.
An efficient output buffer 219 performs stage 5 (406), which is no longer trivial because there may in general be a variable number of coefficients V (<64) 407 sent to the external memory 202, where V is the number of VOP pixels in the block (301) being processed. Ideally the number of clock cycles necessary for coefficient storage should decrease linearly with V, and this invention uses a technique to store the coefficients as soon as they are made available by the horizontal processing element in as few clock cycles as necessary. Again this decreases processing latency, which saves power. The input buffer 217 and the logic thereof are used to compute SA-DCT processing stages 0 and 1 simultaneously, namely reading the input data block from the memory means 202 and packing each of the block columns vertically to compute N for each column. A diagram of the architecture is shown in Figure 5.
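The stage-5 storage idea just described, writing only the V significant coefficients rather than always writing 64, can be sketched as follows. The names and sample data are invented for illustration:

```python
def store_coefficients(packed_rows, n_horiz):
    """Sketch of the stage-5 idea: emit only the significant
    coefficients, one per write, instead of always writing all 64.

    packed_rows: horizontally packed coefficient rows.
    n_horiz: the per-row N values established during stage 3.
    The total number of writes equals V, the VOP pixel count.
    """
    out = []
    for row, n in zip(packed_rows, n_horiz):
        out.extend(row[:n])    # only the first n entries are live data
    return out

rows = [[10, 20, 30, 0, 0, 0, 0, 0],
        [5, 15, 0, 0, 0, 0, 0, 0]] + [[0] * 8] * 6
n_horiz = [3, 2, 0, 0, 0, 0, 0, 0]
coeffs = store_coefficients(rows, n_horiz)   # V = 5 writes, not 64
```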
If the data_valid_r input port 501 is at a logic '0' level, the input buffer logic remains in an idle state. Essentially the data_valid_r signal provides clock-gating control to this logic so that no unnecessary switching occurs when the input data is not valid. If the data_valid_r signal is detected at logic '1', the data on the input data buses alpha_in_r (8 bits wide) 502 and data_in_r (9 bits wide) 503 is signalled as valid input data. On each positive edge of the synchronous clock signal a new pixel and its corresponding alpha value are present on the buses. Block data 301 is assumed to be fed in a vertical raster scan of the columns 302 to 309 as illustrated in Figure 6.
Each consecutive burst of 8 input data pixel values is stored (depending on the alpha values) in alternating 8x9-bit buffers (buffer A 504 and buffer B 505). The routing of data_in_r to the buffers is dependent on a single-bit selection signal col_buff_sel_r 506. The signal col_buff_sel_r is controlled by a modulo-8 counter col_member_idx_r 507, and every time col_member_idx_r reaches the value 7 the col_buff_sel_r signal is inverted. This inversion logic, which is controlled by col_member_idx_r, is represented by the logic cloud 508 labelled "Next Buff Select Logic". The col_member_idx_r counter is encoded using Gray coding to minimise the number of registers switched in the counter at each step. This counter is only incremented if data_valid_r 501 is asserted (in normal operation data_valid_r remains asserted for an integer multiple of 8 clock cycles when asserted). If data_valid_r 501 is de-asserted at some clock edge not on an 8-cycle boundary, col_member_idx_r 507 retains its value to allow the transfer to resume if and when data_valid_r 501 is re-asserted.
Data is only stored in a buffer register if the corresponding alpha_in_r value is non-zero, indicating a VOP pixel. The logic controlling this storage is represented by the logic clouds 509, 510 "Next N_i State Logic", where i ∈ {A,B}. If valid VOP pixel data is present on the input data bus, it is stored in location 511 col_buff_i_r[N_vert_buff_i_r] in the next clock cycle. The 4-bit register 512 N_vert_buff_i_r is also incremented by 1 in the same cycle, and represents the number of VOP pels in the current buffer being loaded (i.e. the vertical N value). In this way the values in the buffer registers are only switched if required, since pixels that do not form part of the VOP are not stored. When the buffer multiplexer 513 is switched, the N_vert_buff_i_r register 512 for the buffer about to be written to is reset to 0 so that it may be incremented to 1 on the next clock edge if the first pel happens to be a VOP pel. Simultaneously, the new_col_loaded_r signal 514 is pulsed high for one clock cycle to indicate to the vertical SA-DCT logic that one of the buffers is ready for processing. The transpose memory circuit 218, as further described hereinbelow, also detects this pulse. The appropriate buffer and N value are routed to the vertical SA-DCT logic via 2:1 multiplexors 513, 515. It is not necessary to explicitly clock-gate the new_col_loaded_r signal since it is only pulsed under control of col_member_idx_r 507, which is itself clock gated.
When one buffer is being processed the other buffer is being loaded, and vice versa. The buffer i 504, 505 that is ready for vertical processing has a value in its corresponding N_vert_buff_i_r register in the range [0,8] that indicates the number of VOP pels in buffer i needing processing. These valid VOP pels are packed into the range 511 col_buff_i_r[0:N_vert_buff_i_r-1], except when N_vert_buff_i_r equals 0, in which case there are no VOP pels stored.
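The double-buffered packing behaviour described above can be modelled at the behavioural level. The register names follow the signal names in the text, but timing, clock gating and reset are abstracted away, so this is a sketch of the packing logic rather than the hardware design:

```python
class InputBuffer:
    """Behavioural sketch of the double-buffered input stage.

    Pixels arrive one per data_valid_r cycle in vertical raster order.
    A VOP pixel (alpha != 0) is written at index n_vert and n_vert
    increments, so each column is packed as it is loaded with no
    extra shift cycles, and non-VOP registers are never switched.
    """

    def __init__(self):
        self.buffers = [[0] * 8, [0] * 8]   # buffer A 504 and buffer B 505
        self.n_vert = [0, 0]                # N_vert_buff_i_r registers
        self.sel = 0                        # col_buff_sel_r
        self.idx = 0                        # col_member_idx_r (mod 8)

    def clock(self, data_in, alpha_in):
        """One valid cycle; returns (buffer, N) when a column
        completes (condition C), else None."""
        if alpha_in:                        # only VOP pels switch registers
            self.buffers[self.sel][self.n_vert[self.sel]] = data_in
            self.n_vert[self.sel] += 1
        self.idx = (self.idx + 1) % 8
        if self.idx == 0:                   # condition C: column complete
            done = self.sel
            self.sel ^= 1                   # swap buffers
            self.n_vert[self.sel] = 0       # reset N for the next column
            return self.buffers[done], self.n_vert[done]
        return None
```

Feeding the column of the Figure 7 example (two non-VOP gaps) yields a packed buffer and the column's N without any explicit shifting.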
A sample timing diagram is shown in Figure 7, which illustrates the efficient, synchronous behaviour of the input buffer circuit 217. The data_valid_r 501 signal is asserted without interruption, implying that new data is on the input data buses on every clock cycle. In clock cycle 0 (701) the final data value (final since col_member_idx_r 507 is equal to 7 and data_valid_r 501 is asserted - referred to as "condition C" for clarity) in input column j 702 is being read. Since condition C is true in cycle 0, the col_buff_sel_r signal 506 is inverted in clock cycle 1 (703), therefore routing the data from column j+1 712 read in cycle 1 (703) to buffer B 505.

Condition C coupled with the negative transition on col_buff_sel_r 506 also causes N_vert_buff_B_r 512 to be reset to 0. Also in cycle 1 (703), new_col_loaded_r 514 is pulsed due to condition C to indicate to the vertical DCT circuit that column j 702 has been loaded and shifted in buffer A 504 and is ready for processing, with its N value stored in N_vert_buff_A_r 512. The vertical DCT circuit can read the data from current_column_s[7:0] 515 and the associated N value from current_N_vert_s 513. In clock cycle 1, since alpha_in_r 502 equals 255, the data DB0 704 on data_in_r 503 is stored in buffer B 505 in cycle 2 (705) at location 511 col_buff_B_r[N_vert_buff_B_r], and the value in N_vert_buff_B_r 512 is also incremented by 1.

This process continues in a similar manner until clock cycle 5 (706), where alpha_in_r 502 equals 0. Therefore in cycle 6 (707) the corresponding data DB4 708 is not stored (since it does not form part of the VOP) and N_vert_buff_B_r 512 is not incremented. In cycle 6 (707) alpha_in_r 502 equals 255, indicating that DB5 (709) is part of the VOP and so requires storage in cycle 7 (710). However, since N_vert_buff_B_r 512 was not incremented in cycle 6 (707), DB5 (709) is stored in the next buffer location after the previous VOP pixel, which is DB3 (711). It is clear that no actual circuit register shifting takes place, but the data is stored in the appropriately packed manner. Column j+1 (712) contains 6 VOP pixels and only 6 registers in col_buff_B_r (511) get switched. Registers col_buff_B_r[7:6] are not unnecessarily switched and hence no unnecessary energy is consumed. Although these registers now contain dead data, the subsequent DCT circuitry will know this based on the value of N_vert_buff_B_r and will not access them. In clock cycles 9-16 (713) the vertical DCT processor pipeline processes column j+1 (712). Also during this interval column j+2 (713) is stored in buffer A (504), and in this case all 8 pixels form part of the VOP. This example illustrates how the input buffer circuitry performs the vertical shifting process as required for an SA-DCT calculation in an efficient manner.
According to the present invention, the input buffer circuit 217 is interleaved with the subsequent vertical SA-DCT processing module 220A, and requires no additional register stages to implement the vertical shifting process since only 8 cycles are required to read and shift the entire column from the input ports. The circuit is also energy efficient since only 2 of the 8 input block columns require storage in the input buffer at any one time. Interleaving the design reduces the processing latency and allows the design to operate at a lower frequency to save power while maintaining throughput. As soon as a valid column has been loaded it is immediately forwarded for processing, and the next column is loaded without any wait-states or delay cycles. In the worst-case load situation, where valid data is continuously presented on the input data buses and data_valid_r is held continuously at logic '1', both buffers are alternately loaded or processed and switched every 8 clock cycles. The values in the buffer registers are only overwritten by the number of VOP pels in the next column being loaded. The other registers are not switched, but this will not cause error since the vertical SA-DCT logic uses the value of N to process only the relevant VOP data in the next column.
Said vertical SA-DCT module 220A computes stage 2 of the SA-DCT function; namely an N-point 1D DCT of the N-point data vector presented to it, for the value of N specified. In relation to the present invention, an implementation of stage 2 is considered to be a solved problem and any state-of-the-art implementation can be swapped in to the appropriate block 404 in Figure 4. A preferred embodiment for the 8-point 1D DCT case is nonetheless shown in Figure 8, which extends a NEDA architecture to implement a factorised version of the N=8 1D DCT cosine basis matrix. The factorisation method adopted is even-odd decomposition, which requires less hardware than a NEDA implementation of the un-factorised matrix. The savings are made in the NEDA weight generation logic (26 full adders as opposed to 35 full adders).
The first register stage 801 performs the even-odd decomposition. The second stage 802 forms the 13 binary weights for each coefficient, and these are stored in the second bank of registers 803. The third stage 804 forms each of the final 8 DCT coefficients by using a 13-input carry save adder (CSA) tree 805 to combine the weights for each coefficient.
Ignoring the CSA trees, it is clear that 26 adders are required to form the 13x8 weights (8 for the even-odd decomposition and 18 in the second stage).
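The even-odd decomposition underlying Figure 8 may be illustrated with a floating-point reference model. The sketch below is illustrative only: a NEDA implementation would replace the cosine multiplications with precomputed binary weights combined in adder trees, and the function name dct8_even_odd is chosen for illustration.

```python
import math

def dct8_even_odd(x):
    """8-point 1D DCT computed via even-odd decomposition.

    The 8 butterfly additions below correspond to the stage-1 adders in
    the text: symmetric sums feed the even coefficients and symmetric
    differences feed the odd coefficients, halving the per-coefficient
    work relative to the un-factorised basis matrix.
    """
    assert len(x) == 8
    s = [x[i] + x[7 - i] for i in range(4)]   # even part (4 additions)
    d = [x[i] - x[7 - i] for i in range(4)]   # odd part  (4 subtractions)
    X = [0.0] * 8
    for k in range(8):
        c = math.sqrt(0.125) if k == 0 else math.sqrt(0.25)
        half = s if k % 2 == 0 else d         # even k uses s, odd k uses d
        X[k] = c * sum(half[i] * math.cos((2 * i + 1) * k * math.pi / 16)
                       for i in range(4))
    return X
```

For a constant input vector only the DC term X[0] is non-zero, as expected of any correct DCT model.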
However, an implementation of the SA-DCT requires a variable N-point 1D DCT processing unit capable of computing all possible N-point 1D DCTs (N = 0,1,2,...,8) and not just the N = 8 case. It is possible to apply the even-odd decomposition technique described above for the other values of N to construct similar architectures for these N values. For each individual value of N, the even-odd decomposition adder structure (stage 1) and the adders necessary for the weight generation (stage 2) will be different, since the cosine basis functions vary with N. The specific optimised adder requirements are independently derived for each value of N using a recursive iterative matching optimisation algorithm. Instead of implementing 7 discrete modules to compute the coefficients for each N value (N = 2,...,8 since the N = 0,1 cases are trivial), the present invention proposes an efficient architecture for variable N-point DCT coefficient computation where the data path is multiplexed by N, since the value of N may vary on a per-computation basis. Using this approach, the adders required for stages 1 and 2 can be shared, reducing the hardware cost. The data path taken by a particular N-point data vector is multiplexed through the processing unit based on the N value associated with that vector.
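The variable N-point 1D DCT required per SA-DCT column or row may be expressed as a simple floating-point reference model. This sketch is illustrative only; the circuit of Figure 9 shares multiplexed adders across all N values rather than evaluating cosines, and the function name dct_n is chosen for illustration.

```python
import math

def dct_n(x):
    """Variable N-point 1D DCT, where N = len(x) and 0 <= N <= 8.

    The N = 0 and N = 1 cases are trivial (empty output and identity),
    matching the text; all other N use the standard DCT-II basis whose
    cosine terms vary with N, which is why each N needs its own adder
    structure in a factorised hardware realisation.
    """
    N = len(x)
    if N == 0:
        return []
    if N == 1:
        return [x[0]]      # trivial case, handled outside the shared datapath
    X = []
    for k in range(N):
        c = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        X.append(c * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                         for n in range(N)))
    return X
```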
The proposed architecture is shown in Figure 9 and it is clear that the N value 901 is used to multiplex the adders 902 to perform the even-odd decomposition. There is a similar multiplexed adder structure in stage 2, where there is also a certain amount of overlap in the adders required to compute the weights for each N value 901. Only 31 unique adders are required to implement a variable N-point 1D DCT. The outputs 903 of the adders are multiplexed to form each of the 104 weights (13 for each of the 8 coefficients). Based on the design constraints for a particular implementation (speed, area and power consumption), a state-of-the-art synthesis tool can be used to allocate the appropriate number of adders and set the amount of resource sharing to meet the desired requirements.
The function of the transpose memory 218, shown in further detail in Figure 10, is to store the intermediate coefficients produced by the vertical transform module 220A prior to horizontal transformation 405. The transpose memory 218 of the present invention is pipelined with both the vertical and horizontal SA-DCT processing units 220A and 220B respectively to reduce processing latency. In conventional 8x8 DCT implementations, the transpose memory structure and access scheme is trivial in the sense that there are a fixed number of coefficients to be stored in a fixed order. However, the vertical SA-DCT unit 220A produces up to 64 coefficients in a pattern dependent on the shape of the input block, which complicates the storage scheme. As outlined earlier, the vertical SA-DCT 220A is followed by a horizontal packing stage 403. This invention uses a scheme that performs horizontal packing as soon as the vertical SA-DCT unit 220A produces the data requiring storage, in order to minimise latency.
The transpose memory write access is under the control of a simple finite state machine. When the input buffer logic asserts the new_col_loaded_r signal 514, the state machine is aware that the vertical SA-DCT logic (stage 2) will have valid coefficients stored in the v_coeffs_r[current_N_vert_r-1:0] registers 1001 and the appropriate N value 901 in the current_N_vert_r register 513 after a fixed number of clock cycles (defined by the constant V_COEFFS_RDY). The number of clock cycles depends on the implementation employed for stage 2, and V_COEFFS_RDY can be easily re-defined to suit the implementation. On the V_COEFFS_RDYth clock edge the state machine asserts the vcoeffs_rdy_r pulse 1002 for a single cycle and the N coefficients are stored to the appropriate locations in the transpose memory in the subsequent clock cycle. The vcoeffs_rdy_r signal 1002 essentially acts as a clock-gating signal to the transpose memories 1003, 1004 so that switching only occurs when new data is to be written, thus saving power. Each location in the transpose memories 1003, 1004 is 11.f bits wide, where 11 bits represent the integral part and f bits represent the fractional part. The mathematical properties of a 1D N-point DCT imply that 11 bits is enough to capture the integral part without any loss incurred. Since the cosine functions are theoretically infinitely precise, rounding the fractional part of the result to f bits introduces loss. For example, the MPEG-4 hardware implementation of the 8x8 DCT uses f = 4 (11.4 = 15 bits in total). In this architecture the size of f can be adjusted easily by re-defining a constant.
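The 11.f fixed-point representation may be illustrated as follows. This is a software sketch under the stated assumption of a signed two's-complement word of 11+f bits; the saturating clamp is a defensive assumption for illustration (the text notes that 11 integral bits suffice, so it would not trigger in practice), and the function names are illustrative.

```python
def to_fixed(value, f=4):
    """Round a coefficient to the 11.f fixed-point format described above.

    f fractional bits; f = 4 gives the 15-bit 11.4 format cited for the
    MPEG-4 reference 8x8 DCT. Rounding to f bits is the sole source of
    loss, since the integral part always fits in 11 bits.
    """
    scaled = round(value * (1 << f))          # quantise to steps of 1/2**f
    lo, hi = -(1 << (10 + f)), (1 << (10 + f)) - 1
    return max(lo, min(hi, scaled))           # clamp to the 11+f-bit word

def from_fixed(word, f=4):
    """Recover the real value represented by an 11.f word."""
    return word / (1 << f)
```

With f = 4, a coefficient such as 3.14159 is stored as the nearest multiple of 1/16 (3.125), and widening f simply refines that step, matching the statement that f is adjusted by re-defining a constant.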
There are two 8x8 11.f-bit transpose memories (transpose_buffer_A_r[7:0][7:0] 1003 and transpose_buffer_B_r[7:0][7:0] 1004), and the same multiplexing principle applies as in the case of the input buffer 217. When one of the transpose memories 1003, 1004 is being written to, the other is being read by the horizontal SA-DCT logic for horizontal processing, and vice versa. The write control de-multiplex signal is transpose_buff_sel_r 1005, which is indirectly controlled by the write access state machine 1006. Each time this state machine detects new_col_loaded_r 514 asserted, it waits V_COEFFS_RDY cycles, vcoeffs_rdy_r 1002 is asserted and a modulo-8 counter col_idx_r 1007 is incremented, which represents the horizontal index of the current column about to be stored. In the next cycle (since the transpose memory addressing needs a cycle to detect vcoeffs_rdy_r asserted), column col_idx_r is stored in the appropriate location as described subsequently. After every 8 assertions of vcoeffs_rdy_r (since each block has 8 columns), the state machine 1006 switches the transpose_buff_sel_r signal so the next block is written to the other transpose memory. The novel aspect of this circuit is the way in which the data is stored, since horizontal packing is performed implicitly. A favourable consequence of this scheme is that the number of registers switched depends entirely on the length N of the data being stored. No power is wasted switching irrelevant data. Both transpose memories 1003, 1004 have 8 x 4-bit registers N_horz_buff_i_r[7:0] 1008 associated with them that are used to store the value of N for each row (i.e. the number of VOP pels in each row after horizontal packing). The current_N_vert_r signal is used to update these registers when each column produced by the vertical SA-DCT logic is stored.
When vcoeffs_rdy_r 1002 is asserted, the current values of N_horz_buff_i_r[current_N_vert_r-1:0] 1009, where i ∈ {A,B}, are used to address rows current_N_vert_r-1:0 of transpose_buffer_i_r 1010 in horizontal memory locations N_horz_buff_i_r[current_N_vert_r-1:0]. In the same cycle, registers N_horz_buff_i_r[current_N_vert_r-1:0] are incremented by 1 since another VOP pel has been stored in each of the corresponding rows. After 8 assertions of vcoeffs_rdy_r 1002, the values of N for each row of the vertical coefficients are stored in N_horz_buff_i_r[7:0] 1009 and the data is packed horizontally in transpose_buffer_i_r[7:0][7:0] 1010. Each time transpose_buff_sel_r 1005 switches, the registers N_horz_buff_i_r[7:0] 1009 for the buffer i about to be written to are reset to zero since a new block is about to be packed.
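The implicit horizontal packing performed on each write may be modelled in software as follows. This is an illustrative sketch; the names write_column, tr_buffer and n_horz are chosen for illustration and mirror transpose_buffer_i_r and N_horz_buff_i_r.

```python
def write_column(tr_buffer, n_horz, v_coeffs, current_n_vert):
    """Store one column of current_n_vert vertical coefficients.

    tr_buffer : 8x8 list-of-lists modelling one transpose memory
    n_horz    : per-row packed counts, modelling N_horz_buff_i_r[7:0]
    Each coefficient lands at the next free slot of its own row, so
    after 8 columns every row is packed flush-left and n_horz holds
    the horizontal N for each row; locations beyond the counts are
    never touched, mirroring the switching-free "dead" registers.
    """
    for row in range(current_n_vert):
        tr_buffer[row][n_horz[row]] = v_coeffs[row]  # packed write
        n_horz[row] += 1                             # row count advances
```

Writing, say, a 3-coefficient column followed by a 5-coefficient column leaves the first three rows holding two packed values each and rows 3-4 holding one, with no explicit packing pass required.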
After every 8 assertions of vcoeffs_rdy_r 1002 one of the transpose memories transpose_buffer_i_r[7:0][7:0] 1010 has an entire block of vertical coefficients stored in a packed horizontal manner as required. This data is now ready for horizontal processing, and the read access state machine 1011 controls routing each of the rows of the newly filled transpose memory transpose_buffer_i_r[7:0][7:0] 1010 to the horizontal processing unit. It is worth noting that horizontal processing cannot begin until all 8 columns of vertical coefficients have been stored in the transpose memory, since any row may have pixels in all 8 of the columns and this cannot be known a priori. Hence it is important to have two transpose memories 1003, 1004 to decrease processing latency and avoid stalling the horizontal SA-DCT pipeline stage.
The read control state machine 1011 has two states and stays in the NO_READ state most of the time, except when it detects that a new vertical coefficient block has been fully stored in one of the buffers. When this condition is detected it moves to the TRM_READ state for 8 clock cycles (counted by the modulo-8 counter transpose_mod8_read_cntr_r 1012), since it takes the horizontal SA-DCT logic this amount of time to clock each of the 8 rows into its pipeline, 1 row per cycle. In the TRM_READ state the read_mem_r signal 1013 is asserted (it actually uses the same register as the state variable). The 8 rows 1014 of transpose_buffer_i_r[7:0][7:0], where i ∈ {A,B}, and the appropriate N values 1015 are routed sequentially in 8 cycles to the output ports 1016, 1017 of the memory. The data is addressed by transpose_mod8_read_cntr_r 1012, which is the same counter used to control the read access state machine. This counter has a further use in the output buffer module described in further detail hereinbelow. The read_mem_r signal 1013 is used to indicate to the horizontal 1D N-point DCT processing unit that valid data exists on current_row_r[current_N_horz-1:0] and current_N_horz each cycle read_mem_r 1013 is asserted. In the worst-case scenario, where valid data is continuously presented to the core, transpose_buff_sel_r 1005 will invert (and hence trigger the horizontal SA-DCT logic) every 64 clock cycles, the first 8 of which involve clocking the new transpose memory rows into the horizontal processing pipeline.
A timing diagram showing a typical sequence of operations in the transpose memory 218 is provided in Figure 11. In clock cycle 1 (1101) the input buffer 217 pulses the new_col_loaded_r signal 514, which clocks the current column and N value into the vertical DCT processor pipeline. However, it also triggers the transpose memory write control FSM 1006. Using the constant parameter V_COEFFS_RDY 1102, this FSM 1006 knows how many clock cycles the vertical DCT processor 404 requires to present valid data on v_coeffs_r[7:0] 1001. In this example V_COEFFS_RDY 1102 is 6 cycles, so 6 cycles after new_col_loaded_r 514 is asserted a new column of coefficients is ready for storage in one of the memories. In the example, vcoeffs_rdy_r 1002 is pulsed in cycle 7 (1103) to indicate that a new column is ready, which is written in a horizontally packed fashion to the memory in cycle 8 (1104). Also in cycle 8, the appropriate horizontal N values are incremented depending on the number of VOP coefficients in the column being stored. Each time vcoeffs_rdy_r 1002 is pulsed, col_idx_r 1007 is incremented in a modulo-8 fashion. Therefore, each time col_idx_r is incremented to 7, an entire 8x8 block of coefficients has been stored in the memory and is ready for horizontal SA-DCT processing. When this condition is true the transpose memory read control FSM 1011 is triggered and the filled buffer is routed row-wise to the horizontal DCT circuit in 8 cycles, 1 row per cycle. Also in cycle 15 (1105) the transpose_buff_sel_r signal 1005 is inverted so that while buffer A 1003, storing block i, is being read, buffer B 1004 is targeted to store the first column of block i+1. This example illustrates the efficient manner in which data is written to and read from the transpose memories.
In summary, the transpose memory circuit 218 of the present invention stores a vertical coefficient block produced by the vertical SA-DCT unit 404 while at the same time routing the previously produced block to the horizontal SA-DCT unit 405. The storage scheme developed incorporates the necessary horizontal packing step in a power-efficient manner prior to horizontal processing (405). By lowering the number of processing steps and pipelining the design, the number of clock cycles necessary to process the SA-DCT function is reduced, which in turn leads to power savings. This is achieved with minimal area overhead and by keeping switching to a minimum.
The horizontal SA-DCT module 220B computes stage 4 of the SA-DCT function; namely an N-point 1D DCT on the N-point data vector presented to it, according to the value of N presented. The process being performed is identical to stage 2, described above and with particular reference to Figures 8 and 9, except that the bit-width of the data path is wider. The bit-width of the input data to the vertical circuit is 9 bits (pixel difference values). This is wider for the horizontal circuit since the input data is not pixel differences, but the intermediate vertical DCT coefficients produced by the vertical module 220A. The actual width is not explicitly specified by any standard, but the data must be precise enough to satisfy the standardised accuracy requirements of the final output. For example, the intermediate bit-width used by the MPEG-4 hardware reference 8x8 DCT implementation is 15 bits. In relation to this invention, an implementation of stage 4 is considered to be a solved problem and any state-of-the-art implementation can be swapped in to the appropriate block 405 in Figure 4. In the preferred embodiment of the present invention, this module is identical to the module described in Figure 9, except for the difference in bit-width.
The final stage of the SA-DCT process involves storing the coefficients out to some external memory. This design assumes the coefficients are transmitted serially. The architecture of the output buffer 219 is shown in Figure 12. Two small finite state machines control storage to the output buffer 1201 and transmission of the coefficients to the output port 1202.
The storage control state machine 1203 uses the transpose_mod8_read_cntr_r signal 1012 to detect that the horizontal SA-DCT processor 405 is reading from the transpose memory 218, and will produce eight rows of final coefficients and the corresponding N values over 8 specified clock cycles. These 8 cycles occur at some offset from transpose_mod8_read_cntr_r 1012 defined as H_COEFFS_RDY. The number of clock cycles depends on the implementation employed for stage 4, and H_COEFFS_RDY can be easily re-defined to suit the implementation. When the state machine 1203 detects transpose_mod8_read_cntr_r 1012 equal to H_COEFFS_RDY-1, it knows that in each of the subsequent 8 clock cycles a new row of coefficients will be present on h_coeffs_r[current_N_horz_r-1:0] 1204 and their corresponding N value will be present on current_N_horz_r 1015 on the next clock edge. The state machine uses ob_row_idx_r 1205 to route each row and N value to its appropriate location in the output buffer registers output_buff_r[ob_row_idx_r][current_N_horz_r-1:0] and ob_N_horz_r[ob_row_idx_r] 1206. Each output buffer location is 12 bits wide to comply with the MPEG-4 standard for coefficient size. The signal ob_row_idx_r 1205 is a modulo-8 counter that is incremented every time a new row is presented by the horizontal SA-DCT processor 405.
To minimise latency, as soon as the storage control state machine 1203 has stored the first row, the transmission control state machine 1207 begins to route the N coefficients serially to the output port xf_coeff_out (1208) in the next clock cycle. It does this by detecting the condition transpose_mod8_read_cntr_r 1012 equal to H_COEFFS_RDY.
This is the first cycle after the first row has been stored. In a conventional 8x8 DCT this stage is trivial since there are a fixed number of coefficients requiring storage. With the SA-DCT, only the coefficients created from the VOP require storage. Ideally if there are
V coefficients requiring transmission (V ≤ 64), they are transmitted in V clock cycles to minimise latency. This is achieved by the transmission control state machine 1207 by manipulating ob_N_horz_r[7:0] 1209 and two indices used to indicate the current VOP coefficient being transmitted (ob_row_xmit_idx_r and ob_mem_xmit_idx_r).
The transmission control state machine 1207 asserts xf_new_coeff_rdy 1210 when valid coefficients are presented at the output data port. When the state machine detects that the last coefficient is about to be transmitted, the xf_dct_done signal 1211 is asserted for a single clock cycle and the state machine returns to the OB_XMIT_IDLE state. If there are V VOP coefficients in a particular block, the xf_new_coeff_rdy signal is asserted for
V clock cycles, and xf_dct_done is pulsed in the last of these cycles.
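The V-cycle serial transmission may be modelled in software as follows. This is an illustrative sketch; the generator name transmit and its arguments are chosen for illustration, with the yielded done flag modelling xf_dct_done.

```python
def transmit(output_buff, ob_n_horz):
    """Emit only the VOP coefficients, one per modelled cycle, row by row.

    output_buff : 8x8 output buffer (models output_buff_r)
    ob_n_horz   : per-row VOP counts (models ob_N_horz_r[7:0])
    If the block holds V coefficients (V <= 64), exactly V values are
    yielded, and the final one carries a True done flag, mirroring the
    single-cycle xf_dct_done pulse.
    """
    total = sum(ob_n_horz)                       # V, the VOP coefficient count
    sent = 0
    for row in range(8):
        for col in range(ob_n_horz[row]):        # only VOP coefficients
            sent += 1
            yield output_buff[row][col], sent == total
```

A block with rows holding 2, 0 and 1 coefficients, for example, is transmitted in exactly 3 modelled cycles, with the done flag raised on the third.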
There are some architectural variations possible with this invention depending on the design requirements of a specific implementation. One possibility involves implementing multiple parallel data paths to increase throughput if the operating voltage
V and frequency f are held constant. Such an architecture could be used to process multiple objects in parallel, or indeed multiple frames of the one object in parallel, to meet possible tight real-time processing constraints. An alternative parallel data path implementation could reduce V and f to achieve lower dynamic power consumption (Equation 1) for the same throughput as a singular data path implementation. Obviously the parallel data paths and added control circuitry come at a certain circuit area cost, so depending on specific design requirements, power, area and performance can be traded off when implementing this invention. To achieve higher throughput
without incurring an area overhead, but at the cost of dynamic power, it is possible to increase the operating frequency of the singular data path architecture. This will decrease the processing latency for a single object and allow the possibility of time-sharing the same data path for multiple object processing.
The shape-adaptive nature of the computation means that, depending on the shape of the input data, certain portions of the circuitry will not be needed on a per-computation basis. It is therefore possible to implement this invention in a 2D systolic array structure with some additional datapath control circuitry, which allocates the available computational resources to a number of computations in parallel. As well as increasing throughput, this ensures that all of the hardware is being used at maximum efficiency for a given power consumption level.
The words "comprises/comprising" and the words "having/including" when used herein with reference to the present invention are used to specify the presence of stated features, integers, steps or components, but do not preclude the presence or addition of one or more other features, integers, steps, components or groups thereof.