Background
Digital video is pervasive in daily life and touches fields as diverse as digital television, personal computers, handheld mobile devices, entertainment and education. For most users, the most basic requirement is high-quality real-time playback (decoding) of video content. However, to achieve a high compression ratio together with good image quality, video compression standards must adopt compression techniques of high computational complexity, so the decoding process consumes a large amount of computational resources.
In most common video compression standards, macroblocks of size 16 × 16 are the basic processing units. Referring to fig. 1, decoding each macroblock requires the following stages in sequence: variable length decoding, inverse quantization, inverse Discrete Cosine Transform (IDCT), motion compensation and color space conversion. Variable length decoding parses the video bitstream and recovers the entropy-coded information of the video, such as the parameters, coefficients and motion vectors of each macroblock; it is a strictly serial bit-level operation. Inverse quantization and IDCT then act on each coefficient block that makes up the macroblock, processing the sparse DCT coefficients to recover the original pixel block, a computationally intensive process. Motion compensation is an effective method to reduce temporal redundancy in video sequences and operates on macroblocks as its basic unit. At the encoding stage, the encoder searches a reference frame for the image block most similar to the macroblock in the current image, i.e., the prediction block; the search result is represented by a motion vector, the difference between the current macroblock and the prediction block is computed, and the difference and the motion vector are encoded. Motion compensation at the decoder is the process of recovering the coded picture from that difference and the motion vectors. Since better prediction tends to yield better coding efficiency, common video coding systems employ techniques such as bi-directional prediction (B-frames) and sub-pixel precision motion vectors to improve the accuracy of motion estimation; although these improve prediction accuracy and compression ratio, they further increase computational complexity. The final color space conversion stage, which multiplies the color vector of every pixel in the image by a transformation matrix to obtain RGB, is also computationally intensive. The decoding process of video is thus a complex system composed of multiple time-consuming processing stages.
Faced with high-quality, high-resolution video and the complex compression techniques introduced by new-generation standards (such as H.264), a software decoder running on the CPU alone in current computer systems cannot meet the requirement of real-time video decoding. Other subsystems are therefore needed to share part of the decoding task and relieve the pressure on the CPU. Dedicated video decoding hardware has been present in computer systems for decades, either as stand-alone boards or integrated within graphics hardware; the spread of the Microsoft DirectX Video Acceleration (DXVA) specification has made the latter the current mainstream. However, such dedicated decoding hardware is usually applicable only to a specific video compression standard (mostly MPEG-2), has very limited extensibility and programmability, and lacks the flexibility to cope with today's diverse video compression formats. Programmable video processing hardware, such as Nvidia's PureVideo and ATI's Avivo technologies, has begun to be integrated on current graphics cards, but it requires additional hardware and higher cost, and there is currently a lack of efficient high-level languages and application programming interfaces for controlling these underlying hardware resources.
On the other hand, with the development and popularization of three-dimensional graphics applications, graphics hardware has evolved into the Graphics Processing Unit (GPU), offering high performance and flexibility; its main programmable parts at present are the Vertex Processor and the pixel processor (Fragment Processor). Together with the rasterizer and the compositing (blending) unit, these two processing units form the GPU's pipeline processing structure. The high performance of graphics processors in massively parallel computation, the programmability offered by mature high-level shading languages, and the support for high-precision data types (32-bit floating point) make the GPU an attractive coprocessor alongside the CPU, capable of solving many general-purpose computing problems outside the graphics domain (GPGPU), such as numerical computation, signal processing and fluid simulation. From an architectural point of view, GPUs are highly parallel stream processors based on vector operations and bear a strong resemblance to some successful dedicated multimedia and video processors. All of this provides strong support for efficient video decoding on GPUs.
However, the GPU was designed and developed to accelerate graphics computation, and the data it processes are relatively regular vertices and pixels, so it cannot be applied directly to the relatively complex, branch-heavy video decoding process. Except for the final color space conversion stage, the texture-based methods commonly used in the GPGPU domain are not applicable to this decoding process. The main reason is that most current video compression standards are organized around macroblocks/coefficient blocks; each macroblock or coefficient block has its own specific parameters and properties, which differ from block to block and are not conveniently represented by a single regular texture. Earlier texture-based approaches, such as DCT/IDCT transforms on a GPU, show no performance advantage over the CPU and also incur considerable data transmission overhead. The document "Accelerate Video Decoding with Generic GPU" (Shen G. et al., IEEE Transactions on Circuits and Systems for Video Technology, May 2005) represents macroblocks with small rectangles to complete the motion compensation stage of decoding; while effective, it still suffers from data redundancy and other problems. These methods do not fully utilize the computational resources of the GPU, resulting in poor performance, and are not suitable for practical video decoding systems.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing software and hardware decoding schemes in performance or flexibility, and provides a compressed-video decoding method based on a GPU. The method combines the high performance of hardware with the flexibility of software, is suitable for various video compression standards, can replace dedicated decoding hardware on personal computers, game consoles, handheld mobile devices and other platforms equipped with a GPU, improves the utilization of hardware resources and reduces cost.
The above object of the present invention is achieved by the following technical solutions:
a digital video decoding method based on a graphics processor comprises the following steps:
1) The CPU performs variable length decoding to obtain macroblocks and coefficient blocks, represents them with the basic graphics primitive "point", and generates a macroblock point set corresponding to the macroblocks and a DCT coefficient point set corresponding to the coefficient blocks;
2) The CPU sends the macroblock point set and the DCT coefficient point set to the GPU in batches in a batch-processing mode;
3) The macroblock point set and the DCT coefficient point set are drawn, and the GPU executes the corresponding vertex and pixel processing programs to complete the video decoding process.
The invention expresses the basic units that make up the video, macroblocks and coefficient blocks, with the basic primitive of graphics rendering, the point, thereby mapping the traditional video decoding process onto the drawing of point sets, fully exploiting the advantages of GPU pipeline processing and large-scale parallel processing, and obtaining higher decoding performance. During the drawing of the point sets, the programmable vertex processor and pixel processor on the GPU are controlled by a vertex program and a pixel program to complete the main stages of the decoding process: inverse quantization, IDCT, motion compensation and color space conversion; part of the computational task is further shared with the compositing (blending) unit and the texture filtering unit on the GPU. The technical scheme specifically comprises the following aspects:
1) Video block information is represented by point primitives rather than rectangles. The principle is to store the type, position, parameters, coefficients, etc. of macroblocks and coefficient blocks in the video in the attributes of points (four-dimensional vectors) such as position, normal and texture coordinates. Macroblocks and coefficient blocks correspond to two different types of point sets: macroblock point sets and DCT coefficient point sets, used for motion compensation and IDCT respectively. The generation of the DCT coefficient point set uses Zigzag scanning to reduce the number of points in the set. Considering the inefficient branch processing of the GPU and the fact that different types of macroblocks or coefficient blocks correspond to different operations, the CPU further subdivides the two types of point sets while generating them, grouping blocks that correspond to the same kind of operation into one subset; for example, all non-predicted (Intra) macroblocks are grouped into one class and all forward-predicted macroblocks into another.
2) The inverse quantization and IDCT stages of decoding are performed by drawing the DCT coefficient point set created in 1) once. Inverse quantization is carried out entirely by the GPU's vertex processor, while the IDCT is carried out mainly in the pixel processor; the two form a pipeline structure to improve execution efficiency. The quantization parameters and DCT coefficients for inverse quantization are fed into the vertex processor through the attributes of the point primitives, and the quantization matrices are preloaded into the constant registers of the vertex processor as uniform parameters. The IDCT is performed by linearly combining the DCT coefficients with their corresponding base images in the pixel processing unit; the base images are preprocessed and stored as a texture in the GPU's video memory. For DCT coefficients that are distributed over several points but belong to the same coefficient block, the results of the multiple point primitives are accumulated into the IDCT output buffer (the residual image texture) using the GPU's blending unit.
3) The motion compensation process is completed by drawing the macroblock point set created in 1): the pixel processing unit samples the reference image texture and the IDCT output texture produced in step 2), accumulates the sampled values and applies a saturation operation. For motion compensation with sub-pixel precision, the sub-pixel interpolation is realized with the bilinear filtering hardware of the GPU texture unit.
The advantages of the invention can be summarized in the following aspects:
1) The method combines the advantages of the CPU and the GPU and lets them work in parallel to accelerate video decoding; it offers the high performance of hardware decoding together with the flexibility of software decoding, and can handle a variety of video compression formats and standards.
2) Compared with dedicated video hardware, the solution can be implemented on top of an upper-layer graphics API (such as OpenGL) and a high-level shading language (such as Cg or GLSL), independent of platform, operating system and the specific underlying hardware, and is suitable for all kinds of systems equipped with a GPU, such as personal computers, game consoles, mobile phones and PDAs. GPUs evolve rapidly, with performance growth far exceeding Moore's law, and new functions and features continually bring more flexible programmability, so in the long run this approach has more potential than CPU software decoding or dedicated hardware.
3) The method uses points to represent macroblocks and coefficient blocks, and is simple to implement and flexible to control. Compared with texture representations, the point-based method transmits only non-zero coefficients; compared with a rectangle representation, it eliminates the large amount of redundant data carried by the four vertices of each rectangle, thus reducing transmission overhead and bandwidth requirements. The point-based method is also flexible to control: un-coded (non-coded) blocks can be conveniently skipped, and zero coefficients are culled automatically during generation of the DCT coefficient point primitives corresponding to the coefficient blocks, reducing unnecessary computation. Moreover, the point representation makes it easy to exploit the vertex processor and rasterization hardware in the GPU processing pipeline, fully mining the GPU's computational resources. Finally, having the CPU divide the points into different point sets eliminates the bottleneck of GPU branch processing and improves performance.
Detailed Description
The preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings.
Fig. 2 illustrates a block diagram of a hardware system according to the present invention. The invention needs the cooperation of the CPU and the GPU to complete the whole decoding process, and the two can execute in parallel, further improving efficiency. The CPU and the GPU are connected by a system bus such as PCIE or AGP. Bus bandwidth is a limited resource, and data transfer overhead is an important factor affecting overall performance; an important advantage of the present invention over prior methods is that useless or redundant data are avoided, significantly reducing the amount of data transferred. The CPU packs the information needed to decode the macroblocks and coefficient blocks of the video into point sets for drawing, temporarily stores them in system memory as vertex arrays or in other forms, and then transmits them to the GPU over the system bus. The GPU is the main execution unit of the decoding task: it completes the bulk of the decoding and requires programmable vertex and pixel processors and video memory of sufficient capacity to store the computation data and intermediate results.
The invention provides a method that realizes video decoding by using point primitives to represent the macroblocks and coefficient blocks of a video and drawing the corresponding point sets on graphics hardware (the GPU). The processing flow of the invention is shown in FIG. 4. The specific steps of implementing video decoding are described in detail below with reference to the accompanying drawings:
1) The CPU performs variable length decoding and generates the point sets corresponding to the macroblocks and coefficient blocks in the video.
First, the CPU completes variable length decoding to obtain the information of the macroblocks and coefficient blocks in the video. This information is then packed into the attributes of point primitives, and the point primitives are classified into different point sets according to the type or processing path of the macroblocks and coefficient blocks. After all video blocks have been processed, the corresponding point sets are sent to the GPU in batches (e.g., as vertex arrays), which improves the parallel and pipelined execution efficiency of the GPU.
The point sets fall into two broad categories: DCT coefficient point sets and macroblock point sets. The main basis for this partitioning is that current compressed video uses a block-based structure, as shown in fig. 3, in which the macroblock is the basic unit of motion compensation and the coefficient blocks that make up the macroblock are the basic units of inverse quantization and IDCT. Both categories can be further divided into subsets according to the type and characteristics of the blocks. For example, the DCT coefficient point set can be divided into a field-DCT-coded point set and a frame-DCT-coded point set according to the DCT coding mode, and the macroblock point sets may be subdivided into non-predicted (Intra) macroblock sets, uni-predicted macroblock sets, bi-predicted macroblock sets, etc., according to macroblock type. Video blocks of different types usually correspond to different decoding procedures; classifying them into subsets on the CPU in advance and sending each subset to the GPU for execution separately avoids time-consuming branch operations on the GPU and improves overall decoding efficiency.
The process of packing the information of macroblocks and of coefficient blocks into point primitives differs, but the basic idea is the same: use the multiple vector attributes of a point primitive, such as position, normal, color and texture coordinates, to store the useful information of the video block, such as its type, parameters and coefficients.
The main information contained in a macroblock is its position, type (intra or inter) and motion vectors, which can be put directly into the vector attributes of a point primitive, thereby converting the macroblock into a point primitive.
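By way of illustration only, the following C++ sketch shows one possible CPU-side layout for such a macroblock point; the structure and field names are assumptions for illustration, not part of the claimed method.

```cpp
// Illustrative layout of a macroblock point primitive: each four-float field
// maps onto one vector attribute of the point (position, texture coordinate, ...).
struct MacroblockPoint {
    float position[4];       // x, y of the macroblock in the frame; type code (intra/inter); unused
    float motionVectors[4];  // forward MV (x, y) and backward MV (x, y)
};
```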
The main information of a coefficient block is its DCT coefficients. Owing to the energy-compaction property of the DCT and the quantization process, only a few of the 64 DCT coefficients of an 8 × 8 coefficient block are non-zero, and they are concentrated in the low-frequency part. Although this small number of coefficients could be placed directly into point attributes, their distribution differs from block to block, which is unfavorable for forming the regular point sets that suit GPU processing, so the coefficients of each block need to be reorganized into a regular structure. We use the Zigzag ordering of the DCT coefficients to generate the corresponding coefficient point primitives, as shown in fig. 5. Zigzag scanning converts the two-dimensional block into a one-dimensional sequence in which the non-zero coefficients are concentrated as much as possible. Based on this one-dimensional Zigzag coefficient array, every four consecutive coefficients form one four-dimensional attribute of a point primitive. To keep the points regular, each point holds one (or a fixed number) of these four-dimensional attributes, together with the index of the coefficient group within the one-dimensional array (the coefficient index) and the position, type and quantization parameter of the coefficient block, forming a DCT coefficient point primitive. A direct consequence of this approach is that each video block may generate several point primitives; the IDCT stage later accumulates the results scattered across these points.
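A hedged C++ sketch of this packing is given below; the DCT coefficient point layout, the function name and the decision to emit one point per group of four zigzag-ordered coefficients (skipping all-zero groups) are illustrative assumptions consistent with the text above.

```cpp
#include <cstdint>
#include <vector>

// Standard zigzag scan order: raster index of the coefficient at each scan position.
const int kZigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10, 17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34, 27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36, 29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46, 53, 60, 61, 54, 47, 55, 62, 63 };

// Illustrative DCT coefficient point primitive: block position, quantization
// parameter, coefficient index and four consecutive zigzag-ordered coefficients.
struct CoeffPoint {
    float blockX, blockY;  // position of the 8x8 block in the frame
    float qp;              // quantization parameter of the block
    float coeffIndex;      // index of the first coefficient of this group (0, 4, 8, ...)
    float coeff[4];        // four consecutive coefficients in zigzag order
};

// Emit one point primitive per non-zero group of four zigzag-ordered coefficients.
void PackCoefficientBlock(const int16_t block[64], float blockX, float blockY,
                          float qp, std::vector<CoeffPoint>& out) {
    for (int i = 0; i < 64; i += 4) {
        float c[4];
        bool allZero = true;
        for (int k = 0; k < 4; ++k) {
            c[k] = static_cast<float>(block[kZigzag[i + k]]);
            if (c[k] != 0.0f) allZero = false;
        }
        if (allZero) continue;  // zero coefficients are culled automatically
        out.push_back(CoeffPoint{blockX, blockY, qp, static_cast<float>(i),
                                 {c[0], c[1], c[2], c[3]}});
    }
}
```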
The above point-primitive generation procedure is applied to all macroblocks and coefficient blocks in each frame. The generated point sets are stored in system memory in the form of Vertex Arrays, and the point sets are then drawn through the graphics API, sending the data to the GPU in batches to complete the subsequent decoding process.
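As a minimal sketch of the batch submission, assuming the CoeffPoint layout above and a legacy OpenGL vertex-array path, the point set could be drawn as follows; the mapping of fields to vertex attributes is an assumption used consistently in the shader sketches below.

```cpp
#include <vector>
#include <GL/gl.h>

// Draw the whole DCT coefficient point set in one batch as GL_POINTS.
void SubmitCoefficientPoints(const std::vector<CoeffPoint>& points) {
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);
    // Interleaved layout: (blockX, blockY, qp, coeffIndex) feeds the position
    // attribute; the four zigzag-ordered coefficients feed texture coordinate set 0.
    glVertexPointer(4, GL_FLOAT, sizeof(CoeffPoint), &points[0].blockX);
    glTexCoordPointer(4, GL_FLOAT, sizeof(CoeffPoint), &points[0].coeff[0]);
    glDrawArrays(GL_POINTS, 0, static_cast<GLsizei>(points.size()));
}
```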
2) The graphics API rendering environment is initialized.
a) The API function that sets the rasterized size of point primitives is called (e.g., glPointSize in OpenGL). The size is set to 8 when drawing the DCT coefficient point set, and point-sprite texture coordinate generation (the ARB point sprite extension) is activated; the size is set to 16 when drawing the macroblock point set. For block structures of variable size, the block size can be stored in a point attribute and written to the PSIZE register in the GPU's vertex processor to obtain different rasterization sizes.
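A minimal OpenGL sketch of this setup, assuming the ARB point sprite extension is available:

```cpp
#include <GL/gl.h>
#include <GL/glext.h>

void SetupCoefficientPointRasterization() {
    // Rasterize each DCT coefficient point as an 8x8 pixel block and let the
    // hardware generate per-pixel texture coordinates inside the block.
    glPointSize(8.0f);
    glEnable(GL_POINT_SPRITE_ARB);
    glTexEnvi(GL_POINT_SPRITE_ARB, GL_COORD_REPLACE_ARB, GL_TRUE);
}

void SetupMacroblockPointRasterization() {
    glPointSize(16.0f);   // macroblocks rasterize to 16x16 pixel blocks
}
```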
b) Off-screen buffer space is allocated on the GPU to store intermediate results. We allocate one IDCT output buffer and three frame buffers. To preserve the accuracy of the IDCT, its output is buffered in a single-channel 16-bit floating-point format (fp16); the layout of the luminance and chrominance components is shown in fig. 7a. Because motion compensation needs to keep reference frames, the three frame buffers store the forward reference frame, the backward reference frame and the current frame respectively; their format is 8-bit three-channel (RGB) unsigned bytes, as shown in fig. 7b, with the luminance component stored in the R channel and the two chrominance components, after interpolation, stored in the G and B channels. Using the GPU's "render to texture" capability, such as the render-to-texture extension or FBO of OpenGL, these buffers can be sampled and accessed directly as textures once rendering is complete. The texture filtering mode of the IDCT output texture is set to "Nearest"; texture filtering of the frame buffers used for motion prediction is set to "Bilinear", so that the texture filtering hardware is activated automatically for sub-pixel-precision motion compensation during sampling; and the texture addressing mode is set to "Clamp" to provide the pixel padding of image edges required for "unconstrained motion vectors".
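A sketch of this allocation, assuming the ARB_texture_float path is available and that width and height denote the frame dimensions; the real layout of the IDCT buffer follows fig. 7a and may differ from this simplified single-plane version.

```cpp
#include <GL/gl.h>
#include <GL/glext.h>

void AllocateDecoderBuffers(int width, int height, GLuint& idctTex, GLuint frameTex[3]) {
    // IDCT output buffer: single-channel fp16, point-sampled ("Nearest").
    glGenTextures(1, &idctTex);
    glBindTexture(GL_TEXTURE_2D, idctTex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_LUMINANCE16F_ARB, width, height, 0,
                 GL_LUMINANCE, GL_FLOAT, nullptr);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

    // Frame buffers (forward reference, backward reference, current frame):
    // 8-bit RGB bytes, bilinear filtering for sub-pixel motion compensation,
    // clamped addressing for edge padding ("unconstrained motion vectors").
    glGenTextures(3, frameTex);
    for (int i = 0; i < 3; ++i) {
        glBindTexture(GL_TEXTURE_2D, frameTex[i]);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB8, width, height, 0,
                     GL_RGB, GL_UNSIGNED_BYTE, nullptr);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
    }
    // Each texture would then be attached to an FBO color attachment for render-to-texture.
}
```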
c) The DCT base image is processed to synthesize base image texture for GPU sampling. The IDCT transform can be viewed as a linear combination of DCT coefficients and their corresponding base images, as shown by the following equation:
x = Σ_{u=0}^{7} Σ_{v=0}^{7} X(u,v) × B(u,v),  with B(u,v) = T(u)^T T(v)

where x denotes the pixel block after the IDCT, X(u,v) denotes the coefficient at (u,v) in the DCT coefficient block, T denotes the DCT transform matrix, T(u) is the u-th row of that matrix, and the base image B(u,v) corresponding to coefficient (u,v) is the outer product of the column vector T(u)^T and the row vector T(v). The computation in this formula consists of scalar-matrix multiplications and a linear combination of matrices. Its main advantage is that each coefficient is processed relatively independently, so zero-valued coefficients can be culled directly to reduce the amount of computation.
The base-image texture generation process is shown in fig. 7. Following the Zigzag scan order, the base images corresponding to every four consecutive coefficients are stored in the RGBA channels of an 8 × 8 texture block; to preserve the accuracy of the IDCT, each color channel has 16-bit precision. The result is a 32 × 32 RGBA texture with 16-bit floating-point precision.
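A CPU-side sketch of how this 32 × 32 base-image texture could be precomputed before upload; the tile layout (four consecutive zigzag-ordered base images per 8 × 8 RGBA tile, tiles arranged 4 × 4) is an assumption consistent with the description above, and kZigzag is the scan table from the packing sketch.

```cpp
#include <cmath>

extern const int kZigzag[64];  // zigzag scan table from the packing sketch above

// T(u, i): element i of row u of the 8-point DCT matrix.
static float DctBasis(int u, int i) {
    const float c = (u == 0) ? std::sqrt(1.0f / 8.0f) : std::sqrt(2.0f / 8.0f);
    return c * std::cos((2 * i + 1) * u * 3.14159265f / 16.0f);
}

// Fill a 32x32 RGBA float image; it is later uploaded as a 16-bit float texture.
void BuildBaseImageTexture(float* rgba /* 32 * 32 * 4 floats */) {
    for (int group = 0; group < 16; ++group) {        // 16 tiles of 4 base images each
        const int tileX = (group % 4) * 8, tileY = (group / 4) * 8;
        for (int k = 0; k < 4; ++k) {
            const int rasterIdx = kZigzag[group * 4 + k];
            const int u = rasterIdx / 8, v = rasterIdx % 8;
            for (int y = 0; y < 8; ++y)
                for (int x = 0; x < 8; ++x) {
                    // Base image B(u,v) = T(u)^T T(v): outer product of two basis rows.
                    const float val = DctBasis(u, y) * DctBasis(v, x);
                    rgba[((tileY + y) * 32 + (tileX + x)) * 4 + k] = val;
                }
        }
    }
}
```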
d) The vertex program (Vertex Program) and pixel program (Fragment Program) used for drawing the DCT coefficient point set are loaded. The quantization matrix is loaded into the vertex program as a uniform parameter for inverse quantization.
3) After the preparation 2) is completed, drawing the DCT coefficient point set generated in step 1) is started, and the GPU completes inverse quantization and IDCT processes in the drawing process, as shown in fig. 9.
a) The vertex processor implements inverse quantization. The inverse quantization process is essentially a multiplication of the quantization step size and the coefficients. The operation process is as follows:
X_iq(u,v) = qp × QM(u,v) × X_q(u,v)

where X_q(u,v) and X_iq(u,v) denote the DCT coefficient before and after inverse quantization respectively; qp is the quantization parameter, placed into the point attributes during the coefficient point primitive generation of step 1); and QM(u,v) is the corresponding entry of the quantization matrix, which was loaded into the constant registers in step 2) d) and is addressed through the coefficient index introduced in step 1). Because the coefficients are stored as a vector, a single vector multiplication in the vertex program completes the inverse quantization of four coefficients.
The vertex processing program can also calculate the texture coordinates of the base image corresponding to the coefficients according to the coefficient indexes and transmit the texture coordinates to the subsequent rasterization stage.
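By way of illustration, a possible vertex program for this stage is sketched below as a legacy GLSL source string embedded in C++; the attribute mapping follows the vertex-array sketch above, the quantization matrix is assumed to be stored in zigzag order so the coefficient index addresses it directly, and an orthographic projection mapping frame coordinates onto the output buffer is assumed.

```cpp
// Illustrative inverse-quantization vertex program (legacy GLSL).
const char* kInverseQuantVertexShader = R"(
    uniform float quantMatrix[64];   // quantization matrix, stored in zigzag order

    // gl_Vertex        : blockX, blockY, qp, coefficient index (0, 4, 8, ...)
    // gl_MultiTexCoord0: four consecutive zigzag-ordered DCT coefficients
    varying vec4 dequantCoeff;       // dequantized coefficients, passed downstream
    varying vec2 baseTileOrigin;     // top-left of the 8x8 tile in the 32x32 base texture

    void main() {
        float qp  = gl_Vertex.z;
        int   idx = int(gl_Vertex.w);
        vec4 qm = vec4(quantMatrix[idx],     quantMatrix[idx + 1],
                       quantMatrix[idx + 2], quantMatrix[idx + 3]);
        dequantCoeff = qp * qm * gl_MultiTexCoord0;   // one vector multiply = 4 coefficients

        float group = gl_Vertex.w / 4.0;              // which tile holds the 4 base images
        baseTileOrigin = vec2(mod(group, 4.0), floor(group / 4.0)) * (8.0 / 32.0);

        // Orthographic projection assumed, so block coordinates land at the right pixels.
        gl_Position = gl_ModelViewProjectionMatrix * vec4(gl_Vertex.xy, 0.0, 1.0);
    }
)";
```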
b) In the rasterization stage, each point primitive is converted into a pixel block of the specified size at the corresponding position, according to the point size set in step 2) a) and the position output by the vertex processor. Every pixel covered by the pixel block inherits the attributes output by the point primitive in the vertex processing stage. For the coefficient point set, after the point-sprite texture coordinate generation of step 2) a) has been activated, each pixel also receives its texture coordinates within the block, in the range (0, 0)–(1, 1).
c) The pixel processor combines the base-image texture coordinates output in a) with the intra-block texture coordinates produced in b), and can thus sample the exact base-image texel corresponding to each pixel. Referring to the IDCT formula of step 2) c), the multiplication between a scalar and a matrix has now been turned into per-pixel operations. Because both the coefficients and the base-image texels are RGBA four-dimensional vectors, a single vector dot product in the pixel program completes the multiply-accumulate of four coefficients, and the result is output to the buffer.
d) The blending function of the GPU hardware is activated and set to Add. Since each coefficient block may have generated several coefficient point primitives in step 1), the result output for each point primitive in this step is accumulated into the output buffer, completing the linear accumulation over all coefficients in the IDCT formula of step 2) c).
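A matching fragment-program and blend-state sketch for steps c) and d), under the same naming assumptions as the vertex program above; the point-sprite coordinate is read here through gl_TexCoord[0], which COORD_REPLACE fills with the per-pixel coordinate inside the block.

```cpp
// Illustrative IDCT fragment program: one dot product accumulates four coefficients.
const char* kIdctFragmentShader = R"(
    uniform sampler2D baseImages;    // 32x32 base-image texture built in step 2) c)
    varying vec4 dequantCoeff;
    varying vec2 baseTileOrigin;

    void main() {
        vec2 uv = baseTileOrigin + gl_TexCoord[0].xy * (8.0 / 32.0);
        vec4 basis = texture2D(baseImages, uv);
        gl_FragColor = vec4(dot(dequantCoeff, basis));   // four multiply-adds at once
    }
)";

// Host side for step d): accumulate all point primitives of a block additively.
void EnableAdditiveAccumulation() {
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);   // "Add": every point's result sums into the IDCT buffer
}
```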
When drawing of the DCT coefficient point set finishes, the inverse-quantized and IDCT-transformed result of every coefficient block in the video is stored in the IDCT output buffer, which serves as the residual image texture for the subsequent motion compensation process.
4) The vertex and pixel programs for motion compensation are loaded, the macroblock point size is set (16), and the macroblock point set is drawn to complete the motion compensation process, as shown in fig. 10.
a) The vertex program mainly pre-processes the motion vectors, generating the appropriate fractional part according to the pixel precision of the motion vector, so that the bilinear filtering hardware of the texture unit automatically completes the pixel interpolation when the texture is sampled. For example, for half-pixel precision the fractional part is 0.5. Figs. 8a and 8b illustrate, in simplified form, the pixel interpolation and texture bilinear filtering processes.
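A sketch of this pre-processing in the vertex stage, assuming half-pixel-precision vectors delivered in half-pel units through a texture-coordinate attribute; the encoding is an assumption for illustration.

```cpp
// Illustrative motion-compensation vertex program (legacy GLSL).
const char* kMotionCompVertexShader = R"(
    // gl_MultiTexCoord0.xy: motion vector in half-pel units (assumed encoding)
    varying vec2 motionVector;       // motion vector in pixels, may carry a .5 fraction

    void main() {
        // Dividing by 2 leaves a 0.5 fractional part for half-pel vectors, so the
        // texture unit's bilinear filter performs the sub-pixel interpolation.
        motionVector = gl_MultiTexCoord0.xy * 0.5;
        gl_Position  = gl_ModelViewProjectionMatrix * gl_Vertex;
    }
)";
```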
b) The rasterization generates a block of pixels of macroblock size, each pixel inheriting the motion vector output in a).
c) In the pixel program, the position of each pixel is obtained from the WPOS register and offset by the motion vector to obtain the texture coordinates of the corresponding reference block. The pixel program samples the reference frame texture and the residual image texture output by the IDCT, accumulates the two samples, applies saturation, and outputs the result to a frame buffer.
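A corresponding fragment-program sketch for step c); gl_FragCoord plays the role of the WPOS register, only the forward-prediction luminance path is shown, and the uniform names are illustrative.

```cpp
// Illustrative motion-compensation fragment program (legacy GLSL).
const char* kMotionCompFragmentShader = R"(
    uniform sampler2D refFrame;      // reference frame texture (bilinear filtering enabled)
    uniform sampler2D residual;      // IDCT output (residual image) texture from step 3)
    uniform vec2 invFrameSize;       // 1.0 / (frame width, frame height)
    varying vec2 motionVector;       // from the vertex stage, in pixels

    void main() {
        vec2 pos = gl_FragCoord.xy;                            // WPOS equivalent
        vec2 refCoord = (pos + motionVector) * invFrameSize;   // shift by the motion vector
        vec2 curCoord = pos * invFrameSize;
        float pred = texture2D(refFrame, refCoord).r;          // prediction sample
        float res  = texture2D(residual, curCoord).r;          // residual sample
        gl_FragColor = vec4(clamp(pred + res, 0.0, 1.0));      // accumulate and saturate
    }
)";
```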
5) If the image in the frame buffer is to be output to a display device, color space conversion is required. This is implemented by drawing an image-sized rectangle and, in the pixel program, sampling the frame buffer output in step 4) c), converting the color space of each pixel, and outputting the result for display. This completes the whole decoding process.
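A sketch of the color-space conversion pass for this full-frame rectangle, assuming the fig. 7b layout (Y in R, Cb in G, Cr in B) and standard BT.601 conversion constants; other standards would use different coefficients.

```cpp
// Illustrative color-space conversion fragment program (legacy GLSL).
const char* kColorConvertFragmentShader = R"(
    uniform sampler2D decodedFrame;  // R = Y, G = Cb, B = Cr, as in fig. 7b

    void main() {
        vec3 ycc = texture2D(decodedFrame, gl_TexCoord[0].xy).rgb;
        float y  = ycc.r;
        float cb = ycc.g - 0.5;
        float cr = ycc.b - 0.5;
        // BT.601 YCbCr -> RGB, i.e. a per-pixel matrix-vector multiply.
        vec3 rgb = vec3(y + 1.402 * cr,
                        y - 0.344136 * cb - 0.714136 * cr,
                        y + 1.772 * cb);
        gl_FragColor = vec4(clamp(rgb, 0.0, 1.0), 1.0);
    }
)";
```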
The above steps give the whole process of completing video decoding with the GPU. The CPU is used only to generate and organize the point sets for drawing; all other decoding stages are completed on the GPU, minimizing the computational burden of the CPU. By representing macroblocks and coefficient blocks in the video as point primitives and efficiently mapping the whole decoding process onto the drawing of these point primitives, the method fully exploits the computational resources of the GPU and, with the parallel computation and pipeline processing of GPU hardware, significantly improves video decoding efficiency.
Although specific embodiments of, and examples for, the invention are disclosed in the accompanying drawings for illustrative purposes and to aid in the understanding of the invention and the manner in which it may be practiced, those skilled in the art will understand that: various substitutions, changes and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. Therefore, the present invention should not be limited to the disclosure of the preferred embodiments and the accompanying drawings.