WO2006008725A1 - Efficient shape-adaptive discrete cosine transform - Google Patents
- Publication number
- WO2006008725A1 (PCT/IE2005/000076)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- coefficients
- processing
- pixels
- image data
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
- G06F17/147—Discrete orthonormal transforms, e.g. discrete cosine transform, discrete sine transform, and variations therefrom, e.g. modified discrete cosine transform, integer transforms approximating the discrete cosine transform
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/20—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/42—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/60—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
- H04N19/649—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding the transform being applied to non rectangular image segments
Definitions
- This invention relates to improving data processing in video object-based texture coding. More particularly, the present invention relates to improving power consumption efficiency when processing data for computing the SA-DCT of MPEG4 video data.
- A commonly adopted approach to reducing Equation 1 at a system level is to design energy-efficient hardware accelerator peripherals for the most computationally demanding tools. This improves throughput, since the central processor is relieved of these loads and the hardware accelerators operate in parallel with the processor. Power dissipation is also improved, since the peripherals are dedicated devices and their architectures may be tailored in terms of power consumption to suit the computation. This is not as easy to achieve with general-purpose processors, since by definition they are used to compute many different kinds of operation.
- the alpha-plane of a video object can be provided by (semi-) automatic segmentation of the video sequence. This technique is not covered by the MPEG-4 standardization process and depends on the application.
- the SA-DCT behaves identically to the 8x8 DCT. Blocks that are located entirely outside the VOP are skipped to save needless processing. Blocks that lie on the VOP boundary are encoded depending on their shape and only the opaque pixels within the boundary blocks are actually coded.
- the SA-DCT algorithm was originally formulated in response to the MPEG-4 requirement for video object based texture coding, and it builds upon the 8x8 2D DCT computation by including extra processing steps that manipulate the shape information of a video object.
- the gain is increased compression efficiency at the cost of additional computation.
- the SA-DCT is one of the most computationally demanding blocks in an MPEG-4 video codec, therefore energy-efficient implementations are important — especially on battery powered wireless platforms. Power consumption issues require resolution for mobile MPEG-4 hardware solutions to become viable.
- the SA-DCT gives improved compression efficiency compared with the regular 8x8 DCT since it leverages the video object information. However, it is more computationally demanding since additional processing steps are necessary, which induce tighter real-time processing constraints. As such, there is a valid case for hardware acceleration of the SA-DCT tool, especially given the relatively reduced performance and computational capability of mobile processors. Any hardware acceleration solution proposed for the SA-DCT on a mobile platform must make power efficiency a primary design feature while also meeting the other requirements for throughput and low silicon area.
- the first is based on a time-recursive filter structure.
- the second, which is held in higher favour, is the feed-forward architecture. It trades off scalability, modularity and regularity and claims to avoid the numerical inaccuracy or bit-width explosion that can occur in some of the fast algorithms that are more suitable for a software implementation. Both architectures focus on implementations of an N-point 1D DCT
- an apparatus for encoding image data comprising processing means for processing a shape-adaptive discrete cosine transformation, an input buffer, a transpose memory and an output buffer.
- the apparatus receives said image data as picture screen elements (pixels), each of which is defined with respective luminance, chrominance and alpha values.
- the input buffer is configured to store said pixels vertically as said image data is received.
- the processing means is configured to process a plurality of first coefficients from said vertically-stored pixels and a plurality of second coefficients from said first coefficients.
- the transpose memory is configured to store said plurality of first coefficients horizontally as they are output.
- the output buffer is configured to output said second coefficients as they become available from said processing means, whereby said second coefficients are output with minimal latency.
- a method of encoding image data is provided, said image data being picture screen elements (pixels), each of which is defined with respective luminance, chrominance and alpha values, and said method comprises the steps of storing said pixels vertically upon receiving said image data in an input buffer; processing a plurality of first coefficients from said vertically-stored pixels by way of performing a shape-adaptive discrete cosine transformation thereon; processing a plurality of second coefficients from said first coefficients by way of performing a shape-adaptive discrete cosine transformation thereon; storing said plurality of first coefficients horizontally in a transpose memory as they are output; and outputting said second coefficients from an output buffer as they become available with minimal latency.
- a processing system comprising the apparatus for encoding data detailed thereabove.
- the processing system includes processing means, memory means, image data input means and networking means, wherein said memory means stores instructions which configure said processing means to capture said image data as picture screen elements (pixels), each of which being defined with respective luminance, chrominance and alpha values, and exchange said captured image data with said apparatus for the processing thereof over an input/output bus.
- the pixels are video object plane (VOP) pixels
- the second coefficients are VOP coefficients.
- these VOP coefficients are output according to a number N of VOP coefficients.
- the transpose memory horizontally stores said first coefficients processed by said processing means as they are output therewith and desirably updates said N value iteratively for each row in the transpose memory.
- the shape-adaptive discrete cosine transformation preferably includes at least two variable N-point 1-dimensional discrete cosine transformations and the processing means advantageously includes at least two variable N-point 1-dimensional discrete cosine transformation-processing modules.
- the processing means, the input buffer, the transpose memory and the output buffer are pipelined and interleaved, whereby latency and power consumption of the apparatus thereabove and any device embodying the above method are reduced.
- Computational load of the SA-DCT processing means varies preferably with the shape image data, whereby only registers appropriate to a particular processing step are clocked and the dynamic power consumption of said apparatus is reduced.
- the networking means of the above system are wireless and said system is a mobile phone handset.
- Figure 1 shows a preferred embodiment of the present invention in an environment, including at least one audio-visual data processing terminal and at least one remote terminal;
- Figure 2 provides an example of the audio-visual data processing terminal, which includes processing means, memory means, communicating means and an image data processing unit;
- Figure 3 illustrates a boundary block of 8 by 8 pixels with a varying N count;
- Figure 4 shows a conceptual architecture of the image data processing unit of Figure 2, including an input buffer, a transpose memory and an output buffer;
- Figure 5 further details the input buffer of Figure 4;
- Figure 6 illustrates the order in which the pixels of Figure 3 are input to the image data processing unit of Figures 2 to 5;
- Figure 7 provides an example timing diagram of the operation of the input buffer of
- Figure 8 shows a preferred embodiment of a fixed 8-point 1-dimensional DCT processing element for outputting horizontally-packed data
- Figure 9 shows a preferred embodiment of a variable N-point 1-dimensional DCT processing element for outputting horizontally-packed data
- Figure 10 further details the transpose memory of Figure 4;
- Figure 11 provides an example timing diagram of the operation of the transpose memory of Figure 10;
- Figure 12 further details the output buffer of Figure 4.
- This invention tackles the problem of accelerating the MPEG-4 object-based shape adaptive discrete cosine transform (SA-DCT) function with power efficient hardware, represented hereinbelow as an image data processing unit.
- SA-DCT is less regular compared to the conventional 8x8 block-based DCT, because processing decisions are entirely dependent on the shape information associated with each individual block.
- the 8x8 DCT requires 16 1D 8-point DCT computations if implemented using the row-column approach. Each 1D transformation has a fixed length of 8, with fixed basis functions. This is amenable to hardware implementation since the data path is fixed and all parameters are constant.
- the SA-DCT however requires up to 16 1D N-point DCT computations, where N ∈ {2, 3, ..., 8} (N ∈ {0, 1} are trivial cases).
- N can vary across the possible 16 computations depending on the shape.
- the basis functions vary with N, complicating hardware implementation.
- an efficient implementation of a variable N-point 1D DCT is considered to be a solved problem given the large amount of prior art in the area. Instead, this invention focuses on the logic extensions that must surround the variable N-point 1D DCT modules to compute the SA-DCT.
- the SA-DCT processing stages are:
- Stage 0 - Load input block data from memory
- Stage 1 - Vertically shift VOP pels
- Stage 2 - Vertical 1D DCT on each column
- Stage 3 - Horizontally shift intermediate vertical coefficients
- Stage 4 - Horizontal 1D DCT on each row of intermediate coefficients
- Stage 5 - Store final coefficient block data to external memory
- the block-based 8x8 DCT does not require stages 1 and 3.
- stages 0 and 5 are somewhat trivial for an 8x8 DCT since the amount of data being loaded and stored is fixed. With the SA-DCT, however, this amount varies depending on the alpha mask so there is scope for adapting the number of processing steps based on the shape information to achieve minimum processing latency. If the challenge of implementing stages 2 and 4 is considered to be a solved problem, it is clear that this invention targets the remaining processing stages (1,3 and 5).
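The six stages can be illustrated with a short software model. This is only a sketch under stated assumptions, not the hardware described in this document: the function names `dct_1d` and `sa_dct` are hypothetical, and an orthonormal DCT-II normalisation is assumed since the text does not fix one.

```python
import math


def dct_1d(vec):
    """N-point DCT-II of a list; the orthonormal scaling is an assumption."""
    n = len(vec)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i, x in enumerate(vec))
        out.append(scale * acc)
    return out


def sa_dct(block, alpha):
    """Software model of stages 1 to 4.

    block, alpha: square lists indexed [row][col]; a non-zero alpha
    value marks a VOP pel. Returns one list of coefficients per row,
    each packed to the left.
    """
    size = len(block)
    # Stages 1 and 2: pack each column's VOP pels upwards, then apply
    # an N-point DCT where N is the number of VOP pels in the column.
    columns = []
    for c in range(size):
        pels = [block[r][c] for r in range(size) if alpha[r][c]]
        columns.append(dct_1d(pels) if pels else [])
    # Stages 3 and 4: pack each row of intermediate coefficients to
    # the left, then apply an N-point DCT along the row.
    rows = []
    for r in range(size):
        packed = [col[r] for col in columns if len(col) > r]
        rows.append(dct_1d(packed) if packed else [])
    return rows
```

With a fully opaque alpha mask the model reduces to the ordinary row-column 8x8 DCT, and for any mask the number of output coefficients equals the number of VOP pels in the block.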
- A preferred embodiment of the present invention is shown in an environment in Figure 1, which includes at least one network-connected apparatus in the form of a mobile telephone handset 101.
- Mobile phone 101 is configured with audio-visual data capturing means, such as a built-in camera 102, and may connect to remote terminals over any of a plurality of wireless networks 103, including a low-bandwidth Global System for Mobile Communication ('GSM') network, a higher-bandwidth General Packet Radio Service ('GPRS') network, or a yet higher-bandwidth 'G3' network.
- Mobile phone 101 receives or emits data encoded as a digital signal over wireless networks 103, wherein said signal is relayed respectively to or from mobile phone 101 by the geographically closest communication link relay 104 of a plurality thereof.
- Said plurality of communication link relays allows said digital signal to be routed between mobile phone 101 and its intended recipient or from its remote emitter.
- Mobile phone 101 may also connect to a remote, but proximate terminal over a local wireless network 105 such as the medium-bandwidth 'Bluetooth™' network.
- Mobile phone 101 may therefore wirelessly broadcast (106) voice or audio-visual data captured with said camera to another network-connected terminal 107, such as another mobile communicating device, e.g. mobile telephone handset 107, having a configuration suitable for receiving and locally processing said broadcast data, over said wireless networks 103 or 105, or to a remote terminal 108, such as a desktop computer or a portable computer, a variation thereof being a personal digital assistant, over a Wide Area Network ('WAN') 109, such as the Internet, by way of a remote gateway 110.
- Gateway 110 is for instance a communication network switch coupling digital signal traffic between wireless telecommunication networks, such as the network 103 within which the example wireless data transmission 105 takes place, and said wide area network (WAN) 109. Said gateway 110 further provides protocol conversion if required, for instance if mobile phone 101 broadcasts (106) data to said terminal 108, which is itself only connected to the WAN 109.
- The potential exists for data exchange between any of mobile phone 101, mobile communicating device 107 and terminal 108, by way of wireless data transmission 105 and the Internet 109 interfaced by gateway 110. It will however be readily apparent to those skilled in the art that the above environment is provided by way of example only, and that the present invention may be embodied in any network comprising devices connected thereto exchanging data encoded as described hereinbelow.
- Mobile phone 101 firstly includes processing means in the form of a general-purpose central processing unit (CPU) 201, which is for instance an Intel ARM X-Scale processor manufactured by the Intel Corporation of Santa Clara, California, USA, for acting as the main controller of mobile phone 101 and processing data.
- Mobile phone 101 next includes memory means 202, which includes non-volatile random-access memory (NVRAM) 203 totalling 512 kilobytes in this embodiment.
- Memory means 202 preferably also includes volatile random-access memory (RAM) 204 totalling 16 megabytes in this embodiment.
- CPU 201, NVRAM 203 and RAM 204 are connected by a data input/output bus 205, over which they communicate and to which further components of mobile phone 101 are similarly linked in order to provide wireless communication functionality and receive input data.
- Network communication functionality is provided by a modem 206, which provides the interface to external communication systems, such as the GSM, GPRS or G3 cellular telephone networks 103 shown in Figure 1.
- An analogue-to-digital converter 207 receives analogue voice data from the user of mobile phone 101 through a microphone 208, or from remote devices connected to the GSM network only and processes it into digital data.
- An aerial 209 is preferably provided to amplify the network communication operation.
- Analogue input data may be received from microphone 208 and digital data may be locally input with data input interface 210, which is a keypad.
- a third data input interface 102 is provided as a CCD camera configured to capture visual data as a digital video frame defined by a plurality of picture screen elements (pixels), each of which having respective luminance, chrominance and alpha numerical values and wherein said alpha value may be zero or one.
- memory means 202 stores instructions which configure CPU 201 to process image data captured by camera 102 as pixels, each having a respective luminance, chrominance and alpha value.
- Output data in addition to digital output data broadcast over networks 103, includes local visual data output by CPU 201 to a Video Display Unit 214 and audio data output by CPU 201 to a Speaker Unit 215. Said arrangement is described herein by way of a generalised data processing architecture only, in order to not unnecessarily obscure the present description, and it will be readily understood by those skilled in the art that such arrangement may vary to a fairly large extent.
- mobile phone 101 also includes an image data processing unit 216 coupled to bus 205 so as to exchange input and output data with CPU 201 and memory means 202.
- Image processing unit 216 preferably includes an input buffer module 217, a transpose memory module 218, an output buffer module 219, and processing means 220 for processing the shape-adaptive discrete cosine transformation computations, shown in the preferred embodiment as two processing modules 220A, 220B.
- Digital image data captured by camera 102 comprises pixels having luminance, chrominance and alpha values, wherein respective alpha values of a plurality of co-located pixels define at least one shape of at least one video object.
- An example of corresponding opaque pixels within a boundary block is shown in Figure 3 with a varying N count.
- the block measures 8 pixels by 8 pixels, and the 64 pixels thereof may belong to the shape or not.
- the first instance 301 of the block illustrates a varying value N as the number of shape-belonging pixels per pixel column 302 to 309 when vertical 1D DCT processing is performed, as per stage 2 of the prior art above.
- the second instance of the block 310 illustrates a varying value N as the number of shape-belonging pixels per pixel row 311 to 318 when horizontal 1D DCT processing is performed, as per stage 4 of the above prior art.
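The two N counts can be derived from the alpha mask alone. The sketch below assumes the mask is an 8x8 list indexed [row][col]; the function names and the example mask are hypothetical and do not reproduce the block drawn in Figure 3.

```python
def vertical_n(alpha):
    """N per column: the number of VOP pels in each pixel column."""
    size = len(alpha)
    return [sum(1 for r in range(size) if alpha[r][c]) for c in range(size)]


def horizontal_n(alpha):
    """N per row after vertical packing.

    Once column c is packed upwards it occupies rows 0 .. N_c - 1, so
    row r holds one coefficient for every column whose N exceeds r.
    """
    n_vert = vertical_n(alpha)
    return [sum(1 for n in n_vert if n > r) for r in range(len(alpha))]


# Hypothetical diamond-shaped object inside an 8x8 boundary block.
example_alpha = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(vertical_n(example_alpha))    # [0, 2, 4, 6, 6, 4, 2, 0]
print(horizontal_n(example_alpha))  # [6, 6, 4, 4, 2, 2, 0, 0]
```

Both lists sum to the same total, the number of VOP pels in the block, since packing moves pels without creating or discarding any.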
- A conceptual architecture of the image data processing unit 216 of Figure 2 is shown in Figure 4, in relation to the prior art stages which the present invention subsumes.
- the input buffer 217 performs (401) processing stages 0 and 1 simultaneously. As the texture and alpha data are loaded from the input data bus 205, they undergo vertical packing based on the information on the input alpha bus 402. This loading and packing is achieved without needing any clock cycles above and beyond the number necessary for a conventional DCT that does not require packing.
- the input buffer 217 uses alternating buffers to minimise area requirements. Using a combination of clock gating and only switching the data in registers when necessary eliminates unnecessary power consumption.
- stage 3 (403) is performed in tandem with the vertical 1D N-point DCT processing (404) by module 220A, such that as soon as stage 2 (404) is complete, the data is stored in the transpose memory 403 in the appropriate horizontally packed formation 310 and the value of N for each row has been evaluated. Therefore horizontal processing (stage 4) (405) by module 220B can begin as soon as the vertical processing 403, 404 is complete.
- the processing latency (number of pipeline stages) is reduced, which saves power.
- An efficient output buffer 219 performs stage 5 (406), which is no longer trivial because there may in general be a variable number of coefficients V (≤ 64) 407 sent to the external memory 202, where V is the number of VOP pixels in the block (301) being processed.
- the number of clock cycles necessary for coefficient storage should decrease linearly with V, and this invention uses a technique to store the coefficients as soon as they are made available by the horizontal processing element in as few clock cycles as necessary. Again this decreases processing latency, which saves power.
- the input buffer 217 and logic thereof is used to compute SA-DCT processing stages 0 and 1 simultaneously, namely reading the input data block from the memory means 202 and packing each of the block columns vertically to compute N for each column.
- a diagram of the architecture is shown in Figure 5
- When the data_valid_r input port 501 is at a logic '0' level, the input buffer logic remains in an idle state. Essentially the data_valid_r signal provides clock-gating control to this logic, so no unnecessary switching occurs when the input data is not valid. If the data_valid_r signal is detected as logic '1', the data on the input data buses alpha_in_r (8 bits wide) 502 and data_in_r (9 bits wide) 503 is signalled as valid input data. On each positive edge of the synchronous clock signal, a new pixel and its corresponding alpha value are present on the buses. Block data 301 is assumed to be fed in a vertical raster scan of the columns 302 to 309 as illustrated in Figure 6.
- Each consecutive burst of 8 input data pixel values is stored (depending on the alpha values) in one of two alternating 8x9-bit buffers (buffer A 504 and buffer B 505).
- the routing of data_in_r to the buffers is dependent on a single bit selection signal col_buff_sel_r 506.
- the signal col_buff_sel_r is controlled by a modulo-8 counter col_member_idx_r 507, and every time col_member_idx_r reaches the value 7 the col_buff_sel_r signal is inverted.
- This inversion logic, which is controlled by col_member_idx_r, is represented by the logic cloud 508 labelled "Next Buff Select Logic".
- the col_member_idx_r counter is encoded using Gray coding to minimise the number of registers switched in the counter in each step. This counter is only incremented if data_valid_r 501 is asserted (in normal operation data_valid_r remains asserted for an integer multiple of 8 clock cycles when asserted). If data_valid_r 501 is de-asserted at some clock edge not on an 8-cycle boundary, col_member_idx_r 507 retains its value to allow the transfer to resume if and when data_valid_r 501 is re-asserted.
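The benefit of Gray coding the modulo-8 counter can be checked by counting register bit toggles per increment. The sketch below uses the standard binary-reflected Gray code; whether the design uses exactly this code is an assumption, and the helper names are hypothetical.

```python
def gray_encode(value):
    """Binary-reflected Gray code of a non-negative integer."""
    return value ^ (value >> 1)


def bits_toggled(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Toggles per step for one full pass of a modulo-8 counter, including
# the wrap from 7 back to 0.
gray_toggles = [bits_toggled(gray_encode(i), gray_encode((i + 1) % 8))
                for i in range(8)]
binary_toggles = [bits_toggled(i, (i + 1) % 8) for i in range(8)]

print(gray_toggles, sum(gray_toggles))      # [1, 1, 1, 1, 1, 1, 1, 1] 8
print(binary_toggles, sum(binary_toggles))  # [1, 2, 1, 3, 1, 2, 1, 3] 14
```

Under this model exactly one register bit switches per step, against an average of 1.75 for a plain binary counter over the same eight steps.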
- VOP pixel data is only stored in a buffer register if the corresponding alpha_in_r value is non-zero, indicating a VOP pixel.
- the logic controlling this storage is represented by the logic clouds 509, 510 "Next N_i State Logic", where i ∈ {A, B}. If valid VOP pixel data is present on the input data bus, it is stored in location 511 col_buff_i_r[N_vert_buff_i_r] in the next clock cycle.
- the 4-bit register 512 N_vert_buff_i_r is also incremented by 1 in the same cycle, which represents the number of VOP pels in the current buffer being loaded (i.e. the vertical N value).
- the values in the buffer registers are only switched if required since pixels that do not form part of the VOP are not stored.
- the N_vert_buff_i_r register 512 for the buffer about to be written to is reset to 0 so that it may be incremented to 1 on the next clock edge if the first pel happens to be a VOP pel.
- the new_col_loaded_r signal 514 is pulsed high for one clock cycle to indicate to the vertical SA-DCT logic that one of the buffers is ready for processing.
- the transpose memory circuit 218 as further described hereinbelow also detects this pulse.
- the appropriate buffer and N value are routed to the vertical SA-DCT logic via 2:1 multiplexors 513, 515. It is not necessary to explicitly clock gate the new_col_loaded_r signal since it is only pulsed under control of col_member_idx_r 507, which itself is clock gated.
- the buffer i 504, 505 that is ready for vertical processing has a value in its corresponding N_vert_buff_i_r register in the range [0,8] that indicates the number of VOP pels in buffer i needing processing.
- These valid VOP pels are packed into the range 511 col_buff_i_r[0:N_vert_buff_i_r-1], except when N_vert_buff_i_r equals 0, in which case there are no VOP pels stored.
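The input buffer behaviour described above can be mimicked with a cycle-level software model. This is a hedged sketch rather than the circuit itself: the class `InputBuffer` and its `clock` method are hypothetical, the attribute names merely echo the signal names in the text, and register-transfer timing (for example the one-cycle delay of the new_col_loaded_r pulse) is collapsed into a return value.

```python
class InputBuffer:
    """Behavioural model of the two alternating column buffers."""

    def __init__(self):
        self.col_buff = {"A": [0] * 8, "B": [0] * 8}
        self.n_vert = {"A": 0, "B": 0}   # vertical N per buffer
        self.sel = "A"                   # buffer currently being loaded
        self.col_member_idx = 0          # modulo-8 position in the column

    def clock(self, data_valid, data_in=0, alpha_in=0):
        """Advance one clock edge.

        Returns (packed_pels, N) on the edge that completes a column,
        otherwise None.
        """
        if not data_valid:
            return None                  # gated: no state changes at all
        buf = self.sel
        if alpha_in:                     # only VOP pels are written
            self.col_buff[buf][self.n_vert[buf]] = data_in
            self.n_vert[buf] += 1
        if self.col_member_idx == 7:     # last pel of the column
            self.col_member_idx = 0
            self.sel = "B" if buf == "A" else "A"
            self.n_vert[self.sel] = 0    # reset N of the buffer loaded next
            n = self.n_vert[buf]
            return self.col_buff[buf][:n], n
        self.col_member_idx += 1
        return None
```

Registers above index N keep whatever they held from an earlier column, since only VOP pels are written; the consumer is expected to read just the first N entries.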
- A sample timing diagram is shown in Figure 7, which illustrates the efficient, synchronous behaviour of the input buffer circuit 217.
- the data_valid_r 501 signal is asserted without interruption, implying that new data is on the input data buses on every clock cycle.
- In clock cycle 0, the final data value in input column j 702 is being read (final since col_member_idx_r 507 is equal to 7 and data_valid_r 501 is asserted, referred to as "condition C" for clarity). Since condition C is true in cycle 0, the col_buff_sel_r signal 506 is inverted in clock cycle 1 (703), therefore routing the data from column j+1 712 read in cycle 1 (703) to buffer B 505.
- Condition C coupled with the negative transition on col_buff_sel_r 506 also causes N_vert_buff_B_r 512 to be reset to 0. Also in cycle 1 (703), new_col_loaded_r 514 is pulsed due to condition C to indicate to the vertical DCT circuit that column j 702 has been loaded and shifted in buffer A 504 and is ready for processing with its N value stored in N_vert_buff_A_r 512.
- the vertical DCT circuit can read the data from current_column_s[7:0] 515 and the associated N value from current_N_vert_s 513.
- This example illustrates how the input buffer circuitry performs the vertical shifting process as required for an SA-DCT calculation in an efficient manner.
- the input buffer circuit 217 is interleaved with the subsequent vertical SA-DCT processing module 220A, and requires no additional register stages to implement the vertical shifting process since only 8 cycles are required to read and shift the entire column from the input ports.
- the circuit is also energy efficient since only 2 out of the 8 input block columns require storage in the input buffer at any one time. Interleaving the design reduces the processing latency and allows the design to operate at a lower frequency to save power while maintaining throughput. As soon as a valid column has been loaded it is immediately forwarded for processing and the next column is loaded without any wait-states or delay cycles.
- both buffers are alternately being loaded or processed and switched every 8 clock cycles.
- the values in the buffer registers are only overwritten by the number of VOP pels in the next column being loaded.
- the other registers are not switched but this will not cause error since the vertical SA-DCT logic uses the value of N to process only relevant VOP data in the next column.
- Said vertical SA-DCT module 220A computes stage 2 of the SA-DCT function; namely an N-point 1D DCT of the N-point data vector presented to it, for the value of N specified.
- stage 2 is considered to be a solved problem and any state-of-the-art implementation can be swapped in to the appropriate block 404 in Figure 4.
- the factorization method adopted is even-odd decomposition, and this requires less hardware compared to a NEDA implementation of the un-factorized matrix. The savings are made in the NEDA weight generation logic (26 full adders as opposed to 35 full adders).
- the first register stage 801 performs the even-odd decomposition.
- the second stage 802 forms the 13 binary weights for each coefficient and these are stored in the second bank of registers 803.
- the third stage 804 forms each of the final 8 DCT coefficients by using a 13 input carry save adder (CSA) tree 805 to combine the weights for each coefficient. Ignoring the CSA trees, it is clear that 26 adders are required to form the 13x8 weights (8 for even-odd decomposition and 18 in the second stage).
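Even-odd decomposition can be demonstrated numerically for the fixed 8-point case. The sketch below is an illustration only: it uses floating-point multiplications in place of the binary weight generation and carry-save adder trees described in the text, it assumes an orthonormal DCT-II scaling, and the function names are hypothetical.

```python
import math


def dct8_direct(x):
    """8-point DCT-II computed from the full 8x8 basis matrix."""
    out = []
    for k in range(8):
        scale = math.sqrt(1.0 / 8) if k == 0 else 0.5
        out.append(scale * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / 16)
                               for i in range(8)))
    return out


def dct8_even_odd(x):
    """Same transform via even-odd decomposition.

    Eight additions and subtractions form the sums and differences of
    mirrored inputs; even-indexed coefficients then depend only on the
    sums and odd-indexed ones only on the differences, so each output
    needs 4 products instead of 8.
    """
    sums = [x[i] + x[7 - i] for i in range(4)]    # 4 additions
    diffs = [x[i] - x[7 - i] for i in range(4)]   # 4 subtractions
    out = []
    for k in range(8):
        scale = math.sqrt(1.0 / 8) if k == 0 else 0.5
        half = sums if k % 2 == 0 else diffs
        out.append(scale * sum(half[i] * math.cos(math.pi * (2 * i + 1) * k / 16)
                               for i in range(4)))
    return out
```

The split works because the basis function of coefficient k evaluated at the mirrored input 7 - i equals (-1)^k times its value at input i. The eight operations in the first step correspond to the 8 adders the text attributes to the even-odd decomposition.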
- the present invention proposes an efficient architecture for variable N-point DCT coefficient computation where the data path is multiplexed by N, since the value of N may vary on a per-computation basis. Using this approach, the adders required for stages 1 and 2 can be shared, reducing the hardware cost. The data path taken by a particular N-point data vector is multiplexed through the processing unit based on the N value associated with that vector.
- the proposed architecture is shown in Figure 9 and it is clear that the N value 901 is used to multiplex the adders 902 to perform the even-odd decomposition.
- There is a similar multiplexed adder structure in stage 2, where there is also a certain amount of overlap in the adders required to compute the weights for each N value 901. Only 31 unique adders are required to implement a variable N-point 1D DCT.
- the outputs 903 of the adders are multiplexed to form each of the 104 weights (13 for each of the 8 coefficients).
- a state-of-the-art synthesis tool can be used to allocate the appropriate number of adders and set the amount of resource sharing to meet the desired requirements.
- the function of the transpose memory 218 shown in further detail in Figure 10 is to store the intermediate coefficients produced by the vertical transform module 220A, prior to horizontal transformation 405.
- the transpose memory 218 of the present invention is pipelined with both the vertical and horizontal SA-DCT processing units 220A and 220B respectively to reduce processing latency.
- for a conventional 8x8 DCT, the transpose memory structure and access scheme are trivial in the sense that there is a fixed number of coefficients to be stored in a fixed order.
- the vertical SA-DCT unit 220A produces up to 64 coefficients in a pattern dependent on the shape of the input block, which complicates the storage scheme.
- the vertical SA-DCT 220A is followed by a horizontal packing stage 403. This invention uses a scheme to perform horizontal packing as soon as the vertical SA-DCT unit 220A produces the data requiring storage in order to minimise latency.
- the transpose memory write access is under the control of a simple finite state machine.
- the state machine is aware that the vertical SA-DCT logic (stage 2) will have valid coefficients stored in the v_coeffs_r[current_N_vert_r-1:0] registers 1001 and the appropriate N value 901 in the current_N_vert_r register 513 after a fixed number of clock cycles (defined by the constant V_COEFFS_RDY).
- the number of clock cycles depends on the implementation employed for stage 2, and V_COEFFS_RDY can be easily re-defined to suit the implementation.
- the state machine asserts the vcoeffs_rdy_r pulse 1002 for a single cycle and the N coefficients are stored to the appropriate locations in the transpose memory in the subsequent clock cycle.
- the vcoeffsj-dy_r 1002 essentially acts as a clock-gating signal to the transpose memories 1003, 1004 so switching only occurs when new data is to be written, thus saving power.
- Each location in the transpose memories 1003, 1004 is ll.f bits wide where 11 bits represent the integral part and f bits are used to represent the fractional part.
- the mathematical properties of a ID //-point DCT imply that 11 bits is enough to capture the integral part without any loss incurred.
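The 11-bit claim can be checked numerically. The sketch below is only an illustration under stated assumptions: it assumes the orthonormal DCT-II normalisation commonly used for the SA-DCT, and 9-bit signed pixel differences in the range [-255, 255]. The function names are hypothetical and do not appear in the patent.

```python
import math


def dct_row(n_points: int, k: int) -> list[float]:
    """Row k of the orthonormal N-point DCT-II basis (assumed normalisation)."""
    scale = math.sqrt(2.0 / n_points) * (math.sqrt(0.5) if k == 0 else 1.0)
    return [scale * math.cos(k * (2 * n + 1) * math.pi / (2 * n_points))
            for n in range(n_points)]


def worst_case_magnitude(n_points: int, max_abs_input: int = 255) -> float:
    """Largest |coefficient| reachable when every input lies in [-max_abs_input, max_abs_input].

    The worst case for a row is hit when each input takes the sign of its weight,
    so the bound is max_abs_input times the sum of absolute weights of that row.
    """
    return max(max_abs_input * sum(abs(w) for w in dct_row(n_points, k))
               for k in range(n_points))


def integral_bits(n_points: int) -> int:
    """Two's-complement bits for the integral part: magnitude bits plus one sign bit."""
    return math.floor(worst_case_magnitude(n_points)).bit_length() + 1


# Widest integral part over all supported transform lengths N = 1..8.
widest = max(integral_bits(n) for n in range(1, 9))
```

Under these assumptions the largest value arises for the DC coefficient with N = 8, where the magnitude is 255·√8 (roughly 721), which fits in 11 signed bits but not in 10.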
- There are two 8x8 11.f-bit transpose memories (transpose_buffer_A_r[7:0][7:0] 1003 and transpose_buffer_B_r[7:0][7:0] 1004), and the same multiplexing principle applies as in the case of the input buffer 217.
- The write control de-multiplex signal is transpose_buff_sel_r 1005, which is indirectly controlled by the write access state machine 1006.
- When this state machine detects that new_col_loaded_r 514 is asserted, it waits V_COEFFS_RDY cycles, after which vcoeffs_rdy_r 1002 is asserted and a modulo-8 counter col_idx_r 1007 is incremented. This counter represents the horizontal index of the current column about to be stored.
- Column col_idx_r is stored in the appropriate location as described subsequently.
- The state machine 1006 switches the transpose_buff_sel_r signal so that the next block is written to the other transpose memory.
- The novel aspect of this circuit is the way in which the data is stored, since horizontal packing is performed implicitly.
- A favourable consequence of this scheme is that the number of registers switched depends entirely on the length N of the data being stored. No power is wasted switching irrelevant data.
- Both transpose memories 1003, 1004 have 8 x 4-bit registers N_horz_buff_i_r[7:0] 1008 associated with them that are used to store the value of N for each row.
- The current_N_vert_r signal is used to update these registers when each column produced by the vertical SA-DCT logic is stored.
- When vcoeffs_rdy_r 1002 is asserted, the current values of N_horz_buff_i_r[current_N_vert_r-1:0] 1009, where i ∈ {A, B}, are used to address rows current_N_vert_r-1:0 of transpose_buffer_i_r 1010 in horizontal memory locations N_horz_buff_i_r[current_N_vert_r-1:0].
- The registers N_horz_buff_i_r[current_N_vert_r-1:0] are then incremented by 1, since another VOP pel has been stored in each of the corresponding rows.
- Driven by vcoeffs_rdy_r 1002, the values of N for each row of the vertical coefficients are stored in N_horz_buff_i_r[7:0] 1009 and the data is packed horizontally in transpose_buffer_i_r[7:0][7:0] 1010.
- When transpose_buff_sel_r 1005 switches, the registers N_horz_buff_i_r[7:0] 1009 for the buffer i about to be written to are reset to zero, since a new block is about to be packed.
- transpose_buffer_i_r[7:0][7:0] 1010 then holds an entire block of vertical coefficients, stored in a packed horizontal manner as required.
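The implicit packing described above can be sketched as a small behavioural model. This is a software illustration only, not the register-transfer design: `store_column` and `pack_block` are hypothetical names, `n_horz` stands in for the N_horz_buff_i_r registers, and each column is passed as the list of its N valid vertical coefficients.

```python
def store_column(transpose_buffer, n_horz, v_coeffs):
    """Store one column of vertical coefficients, packing each row to the left.

    v_coeffs holds the N valid coefficients of the column; coefficient r belongs
    to row r. n_horz[r] counts how many values row r already holds, so it is also
    the next free horizontal location in that row.
    """
    for row, coeff in enumerate(v_coeffs):
        transpose_buffer[row][n_horz[row]] = coeff  # write to the next free slot
        n_horz[row] += 1                            # one more VOP value in this row


def pack_block(columns):
    """Pack up to 8 columns into a fresh 8x8 buffer and return it with the row lengths."""
    transpose_buffer = [[None] * 8 for _ in range(8)]
    n_horz = [0] * 8  # reset because a new block is about to be packed
    for v_coeffs in columns:
        store_column(transpose_buffer, n_horz, v_coeffs)
    return transpose_buffer, n_horz


# Example block: column 0 has 2 coefficients, column 1 none, column 2 three, column 3 one.
columns = [[10, 11], [], [20, 21, 22], [30]] + [[] for _ in range(4)]
buffer_a, n_horz_a = pack_block(columns)
```

Only the N rows touched by a column are written in each call, which mirrors the point that the number of registers switched depends on N alone.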
- The read control state machine 1011 has two states and stays in the NO_READ state most of the time, except when it detects that a new vertical coefficient block has been fully stored in one of the buffers. When this condition is detected it moves to the TRM_READ state for 8 clock cycles (counted by the modulo-8 counter transpose_mod8_read_cntr_r 1012), since it takes the horizontal SA-DCT logic this amount of time to clock each of the 8 rows into its pipeline, 1 row per cycle. In the TRM_READ state the read_mem_r signal 1013 is asserted (it actually uses the same register as the state variable).
- The rows of transpose_buffer_i_r[7:0][7:0], where i ∈ {A, B}, and the appropriate N values 1015 are routed sequentially in 8 cycles to the output ports 1016, 1017 of the memory.
- The data is addressed by transpose_mod8_read_cntr_r 1012, which is the same counter used to control the read access state machine. This counter has a further use in the output buffer module described in further detail below.
- The read_mem_r signal 1013 indicates to the horizontal 1D N-point DCT processing unit that valid data exists on current_row_r[current_N_horz-1:0] and current_N_horz in each cycle in which read_mem_r 1013 is asserted.
- The transpose_buff_sel_r signal 1005 will invert (and hence trigger the horizontal SA-DCT logic) every 64 clock cycles, the first 8 of which involve clocking the new transpose memory rows into the horizontal processing pipeline.
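The two-state read control can be modelled in a few lines. The sketch below is a behavioural approximation under assumptions: the class name `ReadControl` and the `block_stored` input are hypothetical, and the exact cycle on which the state changes relative to detection is assumed rather than taken from the patent.

```python
NO_READ, TRM_READ = "NO_READ", "TRM_READ"


class ReadControl:
    """Behavioural model of the transpose memory read control state machine."""

    def __init__(self):
        self.state = NO_READ
        self.read_cntr = 0  # plays the role of the modulo-8 read counter

    def step(self, block_stored, rows, n_horz):
        """Advance one clock cycle.

        Returns None while read_mem is deasserted, otherwise (row_data, n) for the
        row addressed by the counter, truncated to its n valid coefficients.
        """
        if self.state == NO_READ:
            if block_stored:
                self.state = TRM_READ
            return None
        idx = self.read_cntr
        self.read_cntr = (self.read_cntr + 1) % 8
        if self.read_cntr == 0:          # all 8 rows have been clocked out
            self.state = NO_READ
        n = n_horz[idx]
        return rows[idx][:n], n


# A packed block whose first three rows hold 3, 2 and 1 coefficients.
rows = [[r * 10 + c for c in range(8)] for r in range(8)]
n_horz = [3, 2, 1, 0, 0, 0, 0, 0]
rc = ReadControl()
outputs = [rc.step(cycle == 0, rows, n_horz) for cycle in range(12)]
```

In this model the block is flagged as stored in cycle 0, the 8 rows are presented in cycles 1 to 8, and the machine is idle again from cycle 9.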
- A timing diagram showing a typical sequence of operations in the transpose memory 218 is provided in Figure 11.
- The input buffer 217 pulses the new_col_loaded_r signal 514, which clocks the current column and N value into the vertical DCT processor pipeline.
- Using the constant parameter V_COEFFS_RDY 1102, the transpose memory write control FSM 1006 knows how many clock cycles the vertical DCT processor 404 requires to present valid data on v_coeffs_r[7:0] 1001.
- In this example V_COEFFS_RDY 1102 is 6 cycles, so 6 cycles after new_col_loaded_r 514 is asserted a new column of coefficients is ready for storage in one of the memories.
- vcoeffs_rdy_r 1002 is pulsed in cycle 7 (1103) to indicate that a new column is ready, which is written in a horizontally packed fashion to the memory in cycle 8 (1104). Also in cycle 8 the appropriate horizontal N values are incremented depending on the number of VOP coefficients in the column being stored.
- Each time vcoeffs_rdy_r 1002 is pulsed, col_idx_r 1007 is incremented in a modulo-8 fashion. Therefore, each time col_idx_r is incremented to 7, an entire 8x8 block of coefficients has been stored in the memory and is ready for horizontal SA-DCT processing.
- The transpose memory read control FSM 1011 is then triggered and the filled buffer is routed row-wise to the horizontal DCT circuit in 8 cycles, 1 row per cycle. Also in cycle 15 (1105) the transpose_buff_sel_r signal 1005 is inverted so that when buffer A 1003 storing block i is being read, buffer B 1004 is targeted to store the first column of block i+1.
- This example illustrates the efficient manner in which data is written to and read from the transpose memories.
- The transpose memory circuit 218 of the present invention stores a vertical coefficient block produced by the vertical SA-DCT unit 404 while at the same time routing the previously produced block to the horizontal SA-DCT unit 405.
- The storage scheme developed incorporates the necessary horizontal packing step in a power-efficient manner prior to horizontal processing (405). By lowering the number of processing steps and pipelining the design, the number of clock cycles necessary to process the SA-DCT function is reduced, which in turn leads to power savings. This is achieved with minimal area overhead and by keeping switching to a minimum.
- The horizontal SA-DCT module 220B computes stage 4 of the SA-DCT function, namely an N-point 1D DCT on the data vector presented to it, according to the accompanying value of N.
- The process being performed is identical to stage 2, described above with particular reference to Figures 8 and 9, except that the bit-width of the data path is wider.
- The bit-width of the input data to the vertical circuit is 9 bits (pixel difference values). The input to the horizontal circuit is wider, since it consists not of pixel differences but of the intermediate vertical DCT coefficients produced by the vertical module 220A.
- The actual width is not explicitly specified by any standard, but the data must be precise enough to satisfy the standardised accuracy requirements of the final output.
- The intermediate bit-width used by the MPEG-4 hardware reference 8x8 DCT implementation is 15 bits.
- An implementation of stage 4 is considered to be a solved problem, and any state-of-the-art implementation can be swapped into the appropriate block 405 in Figure 4.
- This module is identical to the module described in Figure 9, except for the difference in bit-width.
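As a point of reference for what stage 4 computes, the following is a floating-point model of a variable-length N-point 1D DCT applied to each packed row. It is a sketch only: it assumes the orthonormal DCT-II normalisation, uses hypothetical function names, and does not reflect the multiplexed adder hardware or its fixed-point bit-widths.

```python
import math


def dct_n(x):
    """Orthonormal N-point DCT-II of x, where N = len(x) (assumed normalisation)."""
    n_points = len(x)
    out = []
    for k in range(n_points):
        scale = math.sqrt(2.0 / n_points) * (math.sqrt(0.5) if k == 0 else 1.0)
        out.append(scale * sum(v * math.cos(k * (2 * n + 1) * math.pi / (2 * n_points))
                               for n, v in enumerate(x)))
    return out


def horizontal_stage(rows, n_horz):
    """Apply an N-point DCT to the first N values of each packed row."""
    return [dct_n(row[:n]) for row, n in zip(rows, n_horz)]


# A constant row of length 3 puts all of its energy in the DC coefficient.
flat = dct_n([4.0, 4.0, 4.0])
# Rows with N = 0 contribute no coefficients at all.
result = horizontal_stage([[1.0, 2.0, 3.0, 0.0], [5.0, 0.0, 0.0, 0.0], [0.0] * 4], [3, 1, 0])
```

Because the transform length follows N row by row, empty rows produce no output and a single-element row passes through unchanged.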
- The final stage of the SA-DCT process involves storing the coefficients out to some external memory. This design assumes the coefficients are transmitted serially.
- The architecture of the output buffer 219 is shown in Figure 12. Two small finite state machines control storage to the output buffer 1201 and transmission of the coefficients to the output port 1202.
- The storage control state machine 1203 uses the transpose_mod8_read_cntr_r signal 1012 to learn that the horizontal SA-DCT processor 405 is reading from the transpose memory 218 and will produce eight rows of final coefficients and the corresponding N values over 8 specified clock cycles. These 8 cycles occur at some offset from transpose_mod8_read_cntr_r 1012, defined as H_COEFFS_RDY. The number of clock cycles depends on the implementation employed for stage 4, and H_COEFFS_RDY can easily be redefined to suit the implementation.
- When the state machine detects transpose_mod8_read_cntr_r 1012 equal to H_COEFFS_RDY-1, it knows that in each of the subsequent 8 clock cycles a new row of coefficients will be present on h_coeffs_r[current_N_horz_r-1:0] 1204 and the corresponding N value will be present on current_N_horz_r 1015 on the next clock edge.
- The state machine uses ob_row_idx_r 1205 to route each row and N value to its appropriate location in the output buffer registers output_buff_r[ob_row_idx_r][current_N_horz_r-1:0] and ob_N_horz_r[ob_row_idx_r] 1206.
- Each output buffer location is 12 bits wide to comply with the MPEG-4 standard for coefficient size.
- The signal ob_row_idx_r 1205 is a modulo-8 counter that is incremented every time a new row is presented by the horizontal SA-DCT processor 405.
- The transmission control state machine 1207 begins to route the N coefficients serially to the output port xf_coeff_out (1208) in the next clock cycle. It does this by detecting the condition transpose_mod8_read_cntr_r 1012 equal to H_COEFFS_RDY.
- If there are V coefficients requiring transmission (V ≤ 64), they are transmitted in V clock cycles to minimise latency. This is achieved by the transmission control state machine 1207 manipulating ob_N_horz_r[7:0] 1209 and two indices used to indicate the current VOP coefficient being transmitted (ob_row_xmit_idx_r and ob_mem_xmit_idx_r).
- The transmission control state machine 1207 asserts xf_new_coeff_rdy 1210 whenever valid coefficients are presented at the output data port.
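The serial transmission order can be sketched as follows. This is an illustrative software model, not the state machine itself: the function name `transmit` is hypothetical, the two loop variables stand in for the row and memory transmit indices, and each yielded value corresponds to one clock cycle in which a valid coefficient is on the output port.

```python
def transmit(output_buff, ob_n_horz):
    """Yield one VOP coefficient per clock cycle, row by row, skipping unused slots."""
    for row_idx, n in enumerate(ob_n_horz):      # row currently being transmitted
        for mem_idx in range(n):                 # position within that row
            yield output_buff[row_idx][mem_idx]


# Output buffer holding 3 + 2 + 1 = 6 VOP coefficients in its first three rows.
output_buff = [[r * 10 + c for c in range(8)] for r in range(8)]
ob_n_horz = [3, 2, 1, 0, 0, 0, 0, 0]
sent = list(transmit(output_buff, ob_n_horz))
```

Since rows with N = 0 are skipped and only the first N entries of each row are visited, a block with V valid coefficients occupies V transmit cycles rather than 64.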
- When the state machine detects that the last coefficient is about to be transmitted, the xf_dct_done signal 1211 is asserted for a single clock cycle and the state machine returns to the OB_XMIT_IDLE state. If there are V VOP coefficients in a particular block the xf_new_coeff_rdy signal is asserted for
- V and frequency f are held constant.
- Such an architecture could be used to process multiple objects in parallel, or indeed multiple frames of the one object in parallel to meet possible tight real-time processing constraints.
- An alternative parallel data path implementation could reduce V and f to achieve lower dynamic power consumption (Equation 1) for the same throughput as a single data path implementation.
- The parallel data paths and added control circuitry come at a certain circuit area cost, so depending on specific design requirements, power, area and performance can be traded off when implementing this invention.
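The trade-off can be illustrated with a short calculation. Equation 1 is not reproduced in this section, so the sketch below assumes it is the usual CMOS dynamic power relation, with C the switched capacitance, V the supply voltage and f the clock frequency. The scaling factor β and the value 0.8 are purely hypothetical, and the overhead of the added control circuitry is ignored.

```latex
% Assumed form of Equation 1 (standard CMOS dynamic power relation)
P_{\text{dyn}} \propto C\,V^{2} f

% Single data path at full voltage and frequency
P_{\text{single}} = C\,V^{2} f

% Two parallel data paths, each clocked at f/2, with the supply scaled to \beta V
P_{\text{parallel}} = 2 \cdot C\,(\beta V)^{2}\,\frac{f}{2} = \beta^{2}\,C\,V^{2} f

% Hypothetical example: \beta = 0.8 gives the same throughput at
P_{\text{parallel}} = 0.64\,P_{\text{single}}
```

The saving comes entirely from the quadratic dependence on V; halving f alone would leave the total power of the two paths unchanged.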
- The shape-adaptive nature of the computation means that, depending on the shape of the input data, certain portions of the circuitry will not be needed on a per-computation basis. It is therefore possible to implement this invention in a 2D systolic array structure with some additional datapath control circuitry, which allocates the available computational resources to a number of computations in parallel. As well as increasing throughput, this ensures that all of the hardware is being used at maximum efficiency for a given power consumption level.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| IE20040498 | 2004-07-23 | | |
| IE2004/0498 | 2004-07-23 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2006008725A1 (fr) | 2006-01-26 |
Family
ID=34972666
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/IE2005/000076 (Ceased), WO2006008725A1 (fr) | 2004-07-23 | 2005-07-23 | Transformation en cosinus discret adaptive de forme performante |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2006008725A1 (fr) |
- 2005-07-23: WO application PCT/IE2005/000076 (patent/WO2006008725A1/fr), status: not active, Ceased
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20020168114A1 (en) * | 2001-02-06 | 2002-11-14 | Valente Stephane Edouard | Preprocessing method applied to textures of arbitrarily shaped objects |
Non-Patent Citations (4)
| Title |
|---|
| KUN-BIN LEE ET AL: "A cost-effective MPEG-4 shape-adaptive DCT with auto-aligned transpose memory organization", CIRCUITS AND SYSTEMS, 2004. ISCAS '04. PROCEEDINGS OF THE 2004 INTERNATIONAL SYMPOSIUM ON VANCOUVER, BC, CANADA 23-26 MAY 2004, PISCATAWAY, NJ, USA,IEEE, US, vol. 2, 23 May 2004 (2004-05-23), pages 777 - 780, XP010720452, ISBN: 0-7803-8251-X * |
| SIKORA T: "Low complexity shape-adaptive DCT for coding of arbitrarily shaped image segments", SIGNAL PROCESSING. IMAGE COMMUNICATION, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 7, no. 4, November 1995 (1995-11-01), pages 381 - 395, XP004047090, ISSN: 0923-5965 * |
| WEE KOK NG ET AL: "A new shape-adaptive DCT for coding of arbitrarily shaped image segments", ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2000. ICASSP '00. PROCEEDINGS. 2000 IEEE INTERNATIONAL CONFERENCE ON 5-9 JUNE 2000, PISCATAWAY, NJ, USA,IEEE, vol. 6, 5 June 2000 (2000-06-05), pages 2115 - 2118, XP010504716, ISBN: 0-7803-6293-4 * |
| YI J-W ET AL: "A new coding algorithm for arbitrarily shaped image segments", SIGNAL PROCESSING. IMAGE COMMUNICATION, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 12, no. 3, June 1998 (1998-06-01), pages 231 - 242, XP004122850, ISSN: 0923-5965 * |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106383859A (zh) * | 2016-08-31 | 2017-02-08 | 北京蓝天航空科技股份有限公司 | A flight test data analysis and processing method |
| CN106383859B (zh) * | 2016-08-31 | 2020-11-03 | 北京蓝天航空科技股份有限公司 | A flight test data analysis and processing method |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AK | Designated states | Kind code of ref document: A1. Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
| | AL | Designated countries for regional patents | Kind code of ref document: A1. Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
| | 122 | Ep: pct application non-entry in european phase | |