Third, the invention
The purpose of the invention is to adopt an independently designed video compression coding algorithm system that strikes a balance between network overhead and computational load. The system combines a high compression ratio with a low computational burden: it provides a compression ratio close to that of H.264 while reducing the computational load to a level close to that of H.263.
The video compression coding and decoding method comprises the following procedures for coding a video compression signal. Discrete Cosine Transform (DCT): the DCT is a spatial transform performed block by block to generate blocks of DCT coefficients; for typical images it concentrates the energy of a block on a few low-frequency DCT coefficients. Transformation and quantization: the DCT coefficients are quantized, each coefficient being divided by a quantization step. Different quantization precisions are adopted for the 64 coefficients of a DCT block, so that the block retains as much of the relevant spatial-frequency information as possible without exceeding the required precision. Among the DCT coefficients, the low-frequency coefficients matter more to visual perception and are therefore assigned a finer quantization precision; the high-frequency coefficients matter less and are assigned a coarser precision, so that most high-frequency coefficients in a DCT block become zero after quantization.
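As a minimal illustration of frequency-dependent quantization as described above (the step weighting below is an invented example, not the quantization table of the invention):

```python
def quantize_block(coeffs, base_step=16):
    """coeffs: 8x8 list of DCT coefficients -> 8x8 quantized levels.

    The quantization step grows with spatial frequency (u + v), so
    low-frequency coefficients keep finer precision and most
    high-frequency coefficients quantize to zero.
    """
    out = []
    for u in range(8):
        row = []
        for v in range(8):
            step = base_step * (1 + u + v)   # coarser step as frequency grows (assumed weighting)
            row.append(round(coeffs[u][v] / step))
        out.append(row)
    return out
```

With a typical block whose energy sits in the DC coefficient, everything but the low-frequency corner survives as zero.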
A channel buffer is required before the coded bit stream enters the channel. The channel buffer is written with data from the entropy coder at a variable bit rate, and data is read out into the channel at the nominally constant bit rate of the transmission system. The buffer size (capacity) is fixed, but the instantaneous output bit rate of the encoder is often significantly higher or lower than the channel bandwidth, which may cause buffer overflow or underflow. The buffer therefore needs a control mechanism: the bit rate of the encoder is adjusted by feedback control of the compression algorithm, so that the write rate and read rate of the buffer tend toward balance. The buffer controls the compression algorithm through the quantization step of the quantizer: when the instantaneous output rate of the encoder is too high and the buffer is about to overflow, the quantization step is increased to reduce the encoded data rate, with a corresponding increase in image loss; when the instantaneous output rate is too low and the buffer is about to underflow, the quantization step is decreased to increase the encoded data rate.
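The feedback loop described above can be sketched as follows; the thresholds, step bounds, and class name are illustrative assumptions, not the invention's exact control law:

```python
class BufferController:
    """Buffer-fullness feedback: fullness drives the quantization step."""

    def __init__(self, capacity_bits, channel_rate_bits):
        self.capacity = capacity_bits
        self.rate = channel_rate_bits          # bits drained per frame interval
        self.fullness = capacity_bits // 2     # start half full
        self.qstep = 16

    def update(self, frame_bits):
        # write coded frame in, drain channel rate out
        self.fullness = max(0, self.fullness + frame_bits - self.rate)
        if self.fullness > 0.8 * self.capacity:      # nearing overflow
            self.qstep = min(64, self.qstep * 2)     # coarser -> fewer bits
        elif self.fullness < 0.2 * self.capacity:    # nearing underflow
            self.qstep = max(2, self.qstep // 2)     # finer -> more bits
        return self.qstep
```

A burst of large frames pushes the fullness up and doubles the step; a run of small frames lets the step relax again.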
Motion estimation: when motion estimation is used in inter-frame coding, an estimate of the image being compressed is generated by reference to another frame. Motion estimation is performed macroblock by macroblock, calculating the positional offset between a macroblock of the compressed image and the corresponding macroblock of the reference image. This offset is described by a motion vector with two components, representing the displacements in the horizontal and vertical directions. In motion estimation, the reference frames used by P-frame and B-frame pictures differ. A P-frame uses the most recently decoded I-frame or P-frame as its reference picture; this is called forward prediction. A B-frame uses two pictures as prediction references, which is called bi-directional prediction: one reference frame precedes the coded frame in display order (forward prediction) and the other follows it in display order (backward prediction). The reference frame of a B-frame is in any case an I-frame or a P-frame.
Motion compensation: the motion vector calculated by motion estimation is used to shift a macroblock of the reference frame horizontally and vertically to the corresponding position, generating a prediction of the image being compressed. Since motion in most natural scenes is orderly, the difference between the prediction generated by such motion compensation and the compressed image is small.
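A minimal sketch of the block-copy operation just described, with frames as 2-D lists of luma samples (all names are illustrative):

```python
def motion_compensate(ref, mv, top, left, size=16):
    """Predict one size x size macroblock at (top, left) of the current
    frame by copying from the reference frame at motion-vector offset
    mv = (dy, dx).  Boundary handling is omitted for clarity."""
    dy, dx = mv
    return [[ref[top + dy + i][left + dx + j] for j in range(size)]
            for i in range(size)]
```

The encoder then codes only the residual between this prediction and the actual macroblock.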
The invention is characterized in that: after the frequency-domain motion search algorithm is adopted, the quantization, storage, and motion search of the sampled signal after the DCT transform are all completed by the video encoder in the frequency domain; all computation is done in the frequency domain.
The basis of the invention also includes: in run-length coding, only non-zero coefficients are coded. The code for a non-zero coefficient consists of two parts: the first part is the number of consecutive zero coefficients preceding it (called the run), and the second part is the non-zero coefficient itself. This exploits the advantage of the zig-zag scan: because the scan produces long runs of zeros in most cases, run-length coding is highly efficient. When all remaining DCT coefficients at the end of the one-dimensional sequence are zero, the encoding of the 8 x 8 transform block can be ended with an "end of block" (EOB) flag, and the resulting compression effect is very significant.
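The run/level scheme with an end-of-block flag can be sketched as follows (the "EOB" token is an illustrative stand-in for the actual codeword):

```python
def run_length_encode(coeffs):
    """coeffs: 1-D list of quantized coefficients (zig-zag order)
    -> list of (run-of-zeros, non-zero level) pairs ended by "EOB"."""
    pairs, run = [], 0
    # find the last non-zero coefficient; everything after it is covered by EOB
    last_nz = max((i for i, c in enumerate(coeffs) if c != 0), default=-1)
    for c in coeffs[:last_nz + 1]:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    pairs.append("EOB")
    return pairs
```

For example, a sequence ending in zeros collapses to a couple of pairs plus the EOB marker.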
Subjective evaluation of digital image quality: the conditions for subjective evaluation include the composition of the evaluation panel, viewing distance, test images, ambient illumination, background tint, and the like. The evaluation panel consists of a certain number of observers, with professionals and non-professionals each in a fixed proportion. The viewing distance is 3 to 6 times the diagonal dimension of the display. The test material consists of a number of image sequences with particular image details and motion. Subjective evaluation reflects the statistical average of many observers' assessments of picture quality.
Zig-zag scan and run-length coding: the DCT produces an 8 x 8 two-dimensional array that must be converted to a one-dimensional sequence for transmission. There are two conversion (scan) modes: the zig-zag scan and the alternate scan, of which the zig-zag scan is the more common. Since most non-zero DCT coefficients are concentrated in the upper left corner of the 8 x 8 matrix, i.e. the low-frequency region, the zig-zag scan places these non-zero coefficients at the front of the one-dimensional array, followed by long strings of quantized-to-zero coefficients, which creates the conditions for run-length coding. Entropy coding turns the quantized DCT coefficients into an efficient discrete representation, producing the digital bit stream for transmission. Entropy coding exploits the statistical properties of the coded signal to reduce the average bit rate. The runs and non-zero coefficients can be entropy coded independently or jointly. In the Huffman coding used for entropy coding, a code table is produced once the probabilities of all coded symbols are determined: fewer bits are allocated to frequent, high-probability symbols and more bits to infrequent, low-probability symbols, so that the average length of the whole code stream tends to the minimum.
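The zig-zag conversion from the 8 x 8 array to a one-dimensional sequence can be sketched as follows, using the standard zig-zag order (assumed here; the invention's own scan table is not reproduced in this text):

```python
def zigzag_scan(block):
    """block: 8x8 2-D list -> 64-element 1-D list in zig-zag order.

    Walk the anti-diagonals (constant u + v); odd diagonals run downward
    (increasing row u), even diagonals run upward (increasing column v).
    """
    order = sorted(((u, v) for u in range(8) for v in range(8)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return [block[u][v] for u, v in order]
```

The first entries of the output are the DC and nearest low-frequency coefficients, so the zeros pile up at the tail, as run-length coding requires.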
The invention is characterized in that: the performance of the motion search step itself is improved. In a traditional video coding system, the encoder repeatedly converts between the spatial domain and the frequency domain: motion search uses a spatial-domain algorithm, while residual coding must be performed in the frequency domain so that coefficient energy is concentrated in the low-frequency region for convenient quantization. This frequent switching between the spatial and frequency domains is quite resource-consuming.
The invention is also characterized in that: a complete video conference system is provided. The technical core of video conferencing is the video coding and decoding algorithm system. The invention pursues deep research in this field, proposes an innovative frequency-domain sub-pixel motion search algorithm, and establishes an efficient and stable video coding algorithm system that not only has high coding efficiency but also has far lower computational complexity than comparable algorithms, making it easy to implement on a low-cost hardware platform. In the sub-pixel stage of the motion search, the invention provides an original search algorithm that reduces the computational complexity to below 10 percent of the conventional approach while maintaining sufficient accuracy of the search result.
In addition, the system of the present invention implements the following functions:
Performs sub-pixel motion search in the frequency domain using a novel video coding algorithm.
Provides a remote electronic whiteboard, remote slides, and data sharing.
Provides a 12-inch liquid crystal touch screen on which arbitrary figures can be drawn and text exchanged.
Supports mobile phone access, sending conference data to the mobile terminal in a timely manner.
Provides a built-in web server with a user interface for modifying the coding parameters.
Provides a built-in disk video recorder that records video images over very long periods.
Provides a USB interface, facilitating data exchange and connection of a USB digital camera.
Includes a built-in high-sensitivity motion detection algorithm that can be used for security monitoring.
Fifth, detailed description of the invention
1 video compression coding algorithm
Fig. 1 is a block diagram of a video compression coding algorithm employed in the present invention.
The algorithm modules in the block diagram are introduced as follows:
a. motion search (motion estimation)
Motion search (also called motion estimation) is one of the core technologies in the field of video compression coding, and it is also the algorithm module that consumes the most system computing resources in video coding. The video coding scheme of the invention adopts a conventional hybrid search algorithm for the integer-pixel search; for the sub-pixel search, the invention implements an original search technique, described in detail later.
b. Intra prediction
In a video stream, each frame may be encoded as an I frame (intra-frame predicted frame) or a P frame (inter-frame predicted frame). When a P frame is coded, the information in the frame itself is not used directly as the coded data source; instead, a motion search is carried out in a previously coded image to find motion information, which serves as the basis for inter-frame prediction, and then the difference between the two frames is coded. This greatly reduces the number of bits needed to describe the image, thereby achieving compression. An I frame is encoded without reference to any previous image; instead, pixels of the already-encoded part of the frame are used to predict the values of pixels in the not-yet-encoded part. I frames are less efficient than P frames but are an important building block of a video stream because they provide the ability to resynchronize. If a frame loses packets during transmission, subsequent P frames predicted from it cannot be correctly decoded; but because an I frame is self-contained and does not refer to any previous image, the code stream resynchronizes there, and errors are confined within a limited range. Owing to the importance of I frames, intra-prediction algorithms for I frames are a major research point for any video coding scheme. The invention presents a novel intra-frame prediction algorithm below, providing efficient and stable intra-frame prediction at limited computational cost.
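As a generic illustration of the idea of predicting not-yet-coded pixels from already-coded ones (plain DC prediction from neighbouring samples; this is a textbook stand-in, not the invention's specific predictor, which is described later):

```python
def intra_dc_predict(frame, top, left, size=4):
    """Predict a size x size block at (top, left) as the mean (DC) of the
    already-coded row above and column to the left; fall back to a
    mid-gray value of 128 when no neighbours exist (assumed default)."""
    neighbours = []
    if top > 0:
        neighbours += [frame[top - 1][left + j] for j in range(size)]
    if left > 0:
        neighbours += [frame[top + i][left - 1] for i in range(size)]
    dc = round(sum(neighbours) / len(neighbours)) if neighbours else 128
    return [[dc] * size for _ in range(size)]
```

Only the residual between this prediction and the actual block needs to be coded.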
c. Rate distortion optimization
An optimal scheme is selected among the candidate coding modes. Video coding involves many coding mode and parameter decisions: for example, what value a motion vector should take in inter prediction, or what the search accuracy should be. The selection of these coding parameters and modes depends on the rate-distortion optimization algorithm, which evaluates each candidate coding mode or parameter and then picks the optimal one according to a fixed rule. The selection must weigh both the coding efficiency (i.e., compression performance) and the signal-to-noise ratio after compression. The relationship between these two performance indicators is non-linear; to increase computation speed and reduce the computational overhead of the system, the video compression coding scheme of the invention uses a Lagrange multiplier to realize a linear approximation. The following equation is the Lagrangian cost in this scheme, where D_REC is the distortion, R_REC is the bit rate after prediction, S_k and Q are the candidate coding modes and parameters, and J_MODE is the total cost value; the coding mode and parameter with the minimum J_MODE are the optimal choice.

J_MODE(S_k, Q, λ) = D_REC(S_k, Q) + λ·R_REC(S_k, Q)
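The mode decision by minimal J_MODE can be sketched as follows; the candidate names, distortions, and rates are invented for illustration:

```python
def best_mode(candidates, lam):
    """candidates: list of (name, distortion, rate) triples.
    Returns the name minimising J = D + lambda * R."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])[0]

# invented candidates: (mode name, distortion D, rate R in bits)
modes = [("intra", 120.0, 300),
         ("inter_16x16", 40.0, 520),
         ("skip", 200.0, 10)]
```

A small λ favours low distortion (the inter mode wins); a large λ penalizes rate heavily (the skip mode wins), which is exactly the trade-off the Lagrangian encodes.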
d. Code rate control
Monitors the channel condition and decides the allocation of the bit rate. This module uses a leaky bucket model, as shown in fig. 2, to detect the transmission condition of the channel.
e. Memory management
Handles the logical and physical management of memory and is responsible for reference frame queue management. When encoding a P frame, a motion search must be performed with reference to previously encoded and reconstructed images, so a reference frame queue must be maintained and reference frame data stored during both encoding and decoding. The encoder and decoder use the same memory logic model, each independently maintaining its reference frame queue and exchanging only minimal information for synchronization.
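A minimal sketch of such a bounded reference-frame queue (the depth parameter is an assumed value; pushing a new reconstructed frame evicts the oldest):

```python
from collections import deque

class ReferenceQueue:
    """Bounded queue of reconstructed frames, kept identically by the
    encoder and the decoder so only minimal sync information is needed."""

    def __init__(self, depth=2):
        self.frames = deque(maxlen=depth)

    def push(self, reconstructed_frame):
        self.frames.append(reconstructed_frame)   # oldest dropped automatically

    def last(self):
        return self.frames[-1]                    # forward-prediction reference
```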
f. Entropy coding
The various methods of compressing video sequences center on three aspects: eliminating temporal redundancy, eliminating spatial redundancy, and eliminating statistical redundancy. Inter and intra prediction address temporal and spatial redundancy respectively; the method of eliminating statistical redundancy is known as entropy coding. The video coding algorithm system of the invention adopts the mature Huffman algorithm for entropy coding.
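A compact sketch of the Huffman code construction (a generic implementation of the classical algorithm, not the invention's specific code tables): repeatedly merge the two least probable symbols, so frequent symbols end up with short codes.

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """freqs: {symbol: probability or count} -> {symbol: bitstring}."""
    tie = count()                  # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)    # two least probable subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]
```

With counts {a: 5, b: 2, c: 1, d: 1}, the frequent symbol "a" receives a 1-bit code while the rare symbols receive 3-bit codes, shortening the average code length.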
g. Transformation and quantization
The residual data is time-frequency transformed and quantized in the frequency domain.
1.1 sub-pel motion search
Motion search (also called motion estimation) is one of the core techniques in the field of video compression coding. After analog-to-digital conversion, video signals have a huge data volume and cannot be directly stored or used for communication. However, the natural objects appearing in video images change slowly relative to the high sampling frequency, so the original video information carries great redundancy in both the time domain and the spatial domain. The basic principle of motion search is to examine adjacent images in a video sequence, find the motion information (motion vectors), and replace the original image data with data describing the motion of objects, thereby largely eliminating temporal redundancy and achieving data compression.
The accuracy of modern motion search algorithms is no longer limited to whole pixels. Experiments show that at sub-pixel precision of half a pixel or finer, the coded bit rate drops markedly. Under low-noise conditions, each doubling of the search precision improves the compression ratio by about 0.5 bit/sample, and the average coded bit rate falls by 24.41-36.92%. However, when the search accuracy reaches 1/8 pixel or finer, the gain in compression ratio is no longer significant because noise dominates. The mainstream video coding standards all adopt sub-pixel search to improve coding performance: H.263 and MPEG-2 introduced half-pixel motion search, while MPEG-4 and the newer H.264 use motion search with 1/4-pixel precision.
Among existing sub-pixel search algorithms, the widely used techniques are the spatial-domain full search and its various fast variants. These algorithms search for the best matching block within a search window, pixel block by pixel block, using the sum of squared differences or the sum of absolute differences as the decision criterion. The search requires repeated filtering and interpolation and repeated evaluation of the cost function, so the computational complexity is very high. Experiments show that reaching sub-pixel precision often more than doubles the computational cost of motion search compared with whole-pixel search alone. Moreover, the matching accuracy depends on the precision of the interpolation algorithm, which affects coding efficiency to some extent. The invention provides a novel search algorithm that predicts and searches motion vectors using phase correlation in the frequency domain; the sub-pixel search requires almost no interpolation and no cost function evaluation, greatly reducing the computational expense of the spatial-domain approach, and it is well suited to embedded platforms providing video content services.
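For contrast with the frequency-domain method, the spatial-domain full search with a sum-of-absolute-differences criterion can be sketched as follows (a generic baseline; all names are illustrative):

```python
def full_search_sad(ref, cur_block, top, left, window=4):
    """Test every offset in [-window, window]^2 around (top, left) in the
    reference frame and return the (dy, dx) minimising SAD against
    cur_block.  Cost is O(window^2 * block^2) -- the expense the
    frequency-domain method avoids."""
    size = len(cur_block)
    best = (None, float("inf"))
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > len(ref) or x + size > len(ref[0]):
                continue
            sad = sum(abs(ref[y + i][x + j] - cur_block[i][j])
                      for i in range(size) for j in range(size))
            if sad < best[1]:
                best = ((dy, dx), sad)
    return best
```

Every candidate offset requires a full block comparison, and sub-pixel refinement would additionally require interpolating the reference, which is the cost the invention targets.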
1.1.1 Frequency-domain phase and object-space translation
As is well known, in the fourier transform domain, the change in phase corresponds to a translation of the object in the time/space domain:
F{x(s-τ)} = e^(-jωτ)·F{x(s)}    (1)
In equation (1), F{·} denotes the Fourier transform of a discrete signal and s denotes spatial displacement (in the time domain, t would be substituted; only the spatial domain is discussed below). Through this property of the Fourier transform, motion information in the spatial domain can easily be resolved in the frequency domain. If the Fourier transform were used in a video coding scheme, searching for motion information in the frequency domain would become very convenient and accurate. However, the Fourier transform has poor energy compaction, and spatial redundancy cannot be effectively removed after the transform, which makes it unsuitable for practical video coding algorithms. At present, the video coding standards generally adopt the DCT, which has energy compaction close to that of the K-L transform, concentrates most of the energy in the DC and low-frequency parts, and can preserve image quality at high compression ratios after low-pass filtering. In view of this, the invention adopts the DCT for time-frequency transformation and computes spatial translation from the phase of the DCT domain; owing to the particularity of the DCT, there is no simple correspondence in the DCT domain like that of equation (1) for the Fourier transform.
Suppose there is a one-dimensional discrete signal {x1(n) | n ∈ [0, N-1]}, where N is the size of the search window; after moving it m samples to the right, a signal {x2(n) | n ∈ [0, N-1]} is formed:

x2(n) = x1(n-m) for n ≥ m;  x2(n) = 0 for n < m    (2)
According to [2], the following DCT and DST transforms are defined:
X2^C(k) = (2/N)·C(k)·Σ_{n=0}^{N-1} x2(n)·cos((kπ/N)(n+0.5)),  k ∈ [0, N-1]    (3)

X2^S(k) = (2/N)·C(k)·Σ_{n=0}^{N-1} x2(n)·sin((kπ/N)(n+0.5)),  k ∈ [1, N]    (4)

Z1^C(k) = (2/N)·C(k)·Σ_{n=0}^{N-1} x1(n)·cos((kπ/N)·n),  k ∈ [0, N-1]    (5)

Z1^S(k) = (2/N)·C(k)·Σ_{n=0}^{N-1} x1(n)·sin((kπ/N)·n),  k ∈ [1, N]    (6)
In the above formulas, C(k) is the normalization factor: C(k) = 1/√2 for k = 0 or k = N, and C(k) = 1 otherwise.    (7)
It is readily shown that these four transforms satisfy the following relation:

( X2^C(k) )   ( Z1^C(k)  -Z1^S(k) ) ( g_m^C(k) )
( X2^S(k) ) = ( Z1^S(k)   Z1^C(k) ) ( g_m^S(k) )    (8)

where g_m^S(k) = sin((kπ/N)(m+0.5)) and g_m^C(k) = cos((kπ/N)(m+0.5)). These two frequency-domain quantities contain the translation information m. Given the signals x1(n) and x2(n), if a fast algorithm can be found to solve for g_m^C(k) and g_m^S(k) and extract m from them, motion search in the DCT domain is realized.
Writing (8) as X(k) = Z(k)·Ω(k), where X(k) = (X2^C(k), X2^S(k))^T and Ω(k) = (g_m^C(k), g_m^S(k))^T, it can be shown that Z(k) is orthogonal up to a scale factor:
λ·Z^T(k)·Z(k) = I2,  where λ = 1/((Z1^C(k))² + (Z1^S(k))²)    (9)
Here I2 is the 2 x 2 identity matrix. In this way, the equation can be solved:

Ω(k) = λ·Z^T(k)·X(k)    (10)
from which g_m^C(k) and g_m^S(k) can be obtained.
By the orthogonality of sinusoidal functions, the following relations hold [4]:
(2/N)·Σ_{k=1}^{N} C²(k)·sin((kπ/N)(m+0.5))·sin((kπ/N)(n+0.5)) = δ(m-n) - δ(m+n+1)    (11)

(2/N)·Σ_{k=0}^{N-1} C²(k)·cos((kπ/N)(m+0.5))·cos((kπ/N)(n+0.5)) = δ(m-n) + δ(m+n+1)    (12)
Where δ (n) is a discrete impulse function.
According to the formulas (8) and (10-12), we can obtain:
(2/N)·Σ_{k=1}^{N} C²(k)·g_m^S(k)·sin((kπ/N)(n+0.5)) = δ(m-n) - δ(m+n+1)    (13)

(2/N)·Σ_{k=0}^{N-1} C²(k)·g_m^C(k)·cos((kπ/N)(n+0.5)) = δ(m-n) + δ(m+n+1)    (14)
Analyzing equation (13): when m > 0 and lies within the search window [0, N), a positive δ response is always found at n = m, with a negative δ response at n = -m-1; when m < 0 and lies within the negative mirror [-N, 0) of the search window, a negative δ response appears inside the window at n = -m-1, with a positive δ response at n = m. As shown in fig. 3, the gray area is the search window: when a positive δ response is found at position s in the window, the object has translated to the right with displacement s; when a negative δ response is found at position s, the object has translated to the left with displacement -(s+1). See fig. 3(a), the δ response of an object translated right by s, and fig. 3(b), the δ response of an object translated left by s+1. FIG. 4 is a schematic view of the spatial locations of the sub-pixels.
In the actual computation, a fast inverse transform is used in place of directly evaluating (2/N)·Σ_{k=1}^{N} C²(k)·g_m^S(k)·sin((kπ/N)(n+0.5)), in order to reduce the computational complexity.
1.1.2 frequency domain sub-pixel search algorithm flow
Based on the above derivation, the flow of the frequency domain-based sub-pixel search algorithm is as follows:
1) determining the search window as N, extracting in the x direction to refer to the imageOne-dimensional signal x starting from a pixel point F1(n) and x of the corresponding position in the current image2(n)。
2) According to the formula (3-6), x is calculated1(n) and x2(n) four discrete DCT/DST transform coefficients.
3) Is calculated at [1, N]Interval gm SObtained from the following formulae (3 to 6) and (8):
<math> <mrow> <msubsup> <mi>g</mi> <mi>m</mi> <mi>S</mi> </msubsup> <mrow> <mo>(</mo> <mi>k</mi> <mo>)</mo> </mrow> <mo>=</mo> <mfenced open='{' close='' separators=' '> <mtable> <mtr> <mtd> <mn>1</mn> <mo>,</mo> <mi>k</mi> <mo>=</mo> <mi>N</mi> </mtd> </mtr> <mtr> <mtd> <mrow> <mo>(</mo> <msubsup> <mi>Z</mi> <mn>1</mn> <mi>C</mi> </msubsup> <mrow> <mo>(</mo> <mi>k</mi> <mo>)</mo> </mrow> <mo>·</mo> <msubsup> <mi>X</mi> <mn>2</mn> <mi>S</mi> </msubsup> <mrow> <mo>(</mo> <mi>k</mi> <mo>)</mo> </mrow> <mo>-</mo> <msubsup> <mi>Z</mi> <mn>1</mn> <mi>S</mi> </msubsup> <mrow> <mo>(</mo> <mi>k</mi> <mo>)</mo> </mrow> <mo>·</mo> <msubsup> <mi>X</mi> <mn>2</mn> <mi>C</mi> </msubsup> <mrow> <mo>(</mo> <mi>k</mi> <mo>)</mo> </mrow> <mo>)</mo> </mrow> <mo>/</mo> <mrow> <mo>(</mo> <msup> <mrow> <mo>(</mo> <msubsup> <mi>Z</mi> <mn>1</mn> <mi>C</mi> </msubsup> <mrow> <mo>(</mo> <mi>k</mi> <mo>)</mo> </mrow> <mo>)</mo> </mrow> <mn>2</mn> </msup> <mo>+</mo> <msup> <mrow> <mo>(</mo> <msubsup> <mi>Z</mi> <mn>1</mn> <mi>S</mi> </msubsup> <mrow> <mo>(</mo> <mi>k</mi> <mo>)</mo> </mrow> <mo>)</mo> </mrow> <mn>2</mn> </msup> <mo>)</mo> </mrow> <mo>,</mo> <mi>k</mi> <mo>∈</mo> <mo>[</mo> <mn>1</mn> <mo>,</mo> <mrow> <mi>N</mi> <mo>)</mo> </mrow> </mtd> </mtr> </mtable> <mrow> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>15</mn> <mo>)</mo> </mrow> </mrow> </mfenced> </mrow> </math>
4) From equation (13), obtain the translation direction $d_x$ and displacement $s_x$ in the x direction.
5) Repeat the above steps in the y direction to obtain $d_y$ and $s_y$.
6) With parameters $m_x$ and $m_y$, look up Table 1 to determine the matching point in fig. 4 and the half-pel motion vector.
TABLE 1  $m_x$, $m_y$ and motion vectors

| $m_x$ | $m_y$ | Matching point | Motion vector |
| >0 | >0 | 3 | (0.5, 0.5) |
| >0 | <0 | 8 | (0.5, -0.5) |
| >0 | =0 | 5 | (0.5, 0) |
| <0 | >0 | 1 | (-0.5, 0.5) |
| <0 | <0 | 6 | (-0.5, -0.5) |
| <0 | =0 | 4 | (-0.5, 0) |
| =0 | >0 | 2 | (0, 0.5) |
| =0 | <0 | 7 | (0, -0.5) |
| =0 | =0 | F | (0, 0) |
7) If motion vectors of 1/4-pixel accuracy are required, interpolate the resulting pixel block with a bilinear filter at the motion vector obtained in step 6), and repeat steps 1)–6).
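As an illustrative sketch (not the invention's actual implementation), the table lookup of step 6) reduces to mapping the signs of $m_x$ and $m_y$ to a half-pel vector, per Table 1:

```python
def half_pel_mv(mx: float, my: float) -> tuple[float, float]:
    """Map the signs of (mx, my) to the half-pel motion vector of Table 1."""
    sign = lambda v: (v > 0) - (v < 0)   # -1, 0, or +1
    return (0.5 * sign(mx), 0.5 * sign(my))
```

For example, $m_x > 0$ and $m_y < 0$ corresponds to matching point 8 with vector (0.5, -0.5).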
FIG. 5 compares the computational complexity, i.e., computational performance, of the present algorithm with that of a full-search algorithm for sub-pel search under various standard test sequences. Since each test sequence has a different image composition and computing environment, for convenience the computational complexity of the full-search algorithm in each test sequence is normalized to 1 as the comparison baseline.
DCT stands for Discrete Cosine Transform, which converts a set of light-intensity data into frequency data so that intensity variations can be analyzed. If the high-frequency data are modified and the result is converted back to the original form, the data clearly differ from the original, but human eyes cannot easily perceive the difference. During compression, the original image data are divided into 8 × 8 data matrices; for example, the first matrix of luminance values has the following contents:
y00 y01 y02 y03 y04 y05 y06 y07
y10 y11 y12 y13 y14 y15 y16 y17
y20 y21 y22 y23 y24 y25 y26 y27
y30 y31 y32 y33 y34 y35 y36 y37
y40 y41 y42 y43 y44 y45 y46 y47
y50 y51 y52 y53 y54 y55 y56 y57
y60 y61 y62 y63 y64 y65 y66 y67
y70 y71 y72 y73 y74 y75 y76 y77
JPEG treats the luminance matrix together with the two chrominance matrices Cb and Cr as a basic unit called an MCU. The number of matrices contained in each MCU must not exceed 10. For example, if the row and column sampling ratio is 4:2, each MCU will contain four luminance matrices, one Cb matrix, and one Cr matrix.
After the image data are divided into 8 × 8 matrices, 128 must be subtracted from every value, and the shifted values are then substituted one by one into the DCT transform formula. The subtraction of 128 is necessary because the DCT transform formula accepts numbers in the range −128 to +127.
DCT transform formula:

$$F(u,v)=\frac{1}{4}\,c(u)\,c(v)\sum_{x=0}^{7}\sum_{y=0}^{7} f(x,y)\cos\!\frac{(2x+1)u\pi}{16}\cos\!\frac{(2y+1)v\pi}{16}$$

Here x, y denote the coordinate position of a value within the image data matrix, and f(x, y) denotes a value in that matrix; u, v denote the coordinate position of a value in the DCT-transformed matrix, and F(u, v) denotes a value in that matrix, where

when u = 0 and v = 0: c(u) = c(v) = 1/√2 ≈ 1/1.414
when u > 0 or v > 0: c(u) = c(v) = 1
The values of the matrix after the DCT transform are frequency coefficients. The coefficient F(0, 0) has the largest value and is called DC; the other 63 frequency coefficients are mostly positive and negative floating-point numbers close to 0 and are generally called AC.
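A minimal sketch of the level shift and 8 × 8 DCT described above (a direct, unoptimized evaluation of the transform formula; a real encoder would use a fast DCT algorithm):

```python
import math

def dct8x8(block):
    """Level-shift an 8x8 block by -128, then apply the 2-D DCT formula.
    F(0, 0) is the DC coefficient; the other 63 outputs are the AC coefficients."""
    shifted = [[v - 128 for v in row] for row in block]
    c = lambda u: 1 / math.sqrt(2) if u == 0 else 1.0
    F = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = sum(shifted[x][y]
                    * math.cos((2 * x + 1) * u * math.pi / 16)
                    * math.cos((2 * y + 1) * v * math.pi / 16)
                    for x in range(8) for y in range(8))
            F[u][v] = 0.25 * c(u) * c(v) * s
    return F
```

For a uniform block (e.g., all values 200), all the signal energy lands in DC — F[0][0] = 8 × (200 − 128) = 576 and every AC coefficient is essentially 0 — illustrating the energy-concentration property.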
1.1.3 Summary
For video coding, performing motion search in the frequency domain offers advantages beyond improving the performance of the motion search step itself. In a traditional video coding system, the encoder must repeatedly convert between the spatial domain and the frequency domain: a spatial-domain algorithm is used during motion search, while residual coding must be performed in the frequency domain so that coefficient energy is concentrated in the low-frequency region, which is convenient for quantization. Frequent switching between the spatial and frequency domains consumes considerable resources. With a frequency-domain motion search algorithm, the video encoder completes all computation in the frequency domain; the encoding process is shown in fig. 6. Compared with a video coding process that searches motion vectors in the spatial domain, the quantization, storage, and motion search of the DCT-transformed sample signal in fig. 6 are all completed in the frequency domain, which not only removes the inverse-DCT step of the spatial-domain coding process but also more effectively reduces the storage space required, benefiting the optimization of both encoder and decoder.
1.2 fast selection algorithm for intra prediction mode
1.2.1 Intra coding prediction modes
If there is no strong temporal correlation between the current picture and the previous input picture, the picture is typically encoded as an I-frame using an intra coding mode. In conventional video coding standards, an I-frame image is coded directly without prediction: macroblock data are directly transformed, quantized, coded, and transmitted, so the data volume of a coded I-frame is very large. To improve coding efficiency more effectively, the video coding system of the invention fully exploits the spatial redundancy among pixels within an image and defines 16 × 16 and 4 × 4 prediction units. (Fig. 7 shows the distribution of pixels in a 4 × 4 block and its surrounding pixels.)
In the intra prediction module of the present invention, if the current macroblock's coding mode is intra, the macroblock's prediction values come from neighboring coded and reconstructed macroblocks. The luminance component may use a 16 × 16 macroblock or a 4 × 4 block as the basic unit of intra prediction coding. With a 16 × 16 macroblock as the coding unit, 4 prediction modes are available; with 4 × 4 blocks, 9 prediction modes are available. The two chroma components use an 8 × 8 block as the basic unit of intra prediction coding, with 4 prediction modes available, and the two chroma components must select the same mode. Since 4 × 4 blocks are the most fine-grained, the computational complexity is concentrated in this unit.
The distribution of pixels in a 4 × 4 block and its surrounding pixels is shown in fig. 7, where the lower-case letters a to p denote the 16 pixels inside the block and the upper-case letters A to M denote the pixels surrounding the block. Intra 4 × 4 uses 9 prediction modes, where mode 2 is DC prediction; the directions of the remaining modes are shown in fig. 8 (intra 4 × 4 prediction modes). For example, if mode 1 (horizontal prediction) is selected, the predicted values in the block come from pixels I, J, K, L.
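For instance, mode 1 of the intra 4 × 4 prediction just described copies the left-neighbour pixels I, J, K, L across their rows; a sketch (pixel names follow fig. 7):

```python
def predict_4x4_horizontal(I, J, K, L):
    """Intra 4x4 mode 1 (horizontal): each row of the predicted block is
    filled with the reconstructed pixel to its left (I, J, K or L)."""
    return [[p] * 4 for p in (I, J, K, L)]
```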
1.2.2 fast intra prediction coding mode selection algorithm
The intra-frame prediction mode selection algorithm provided by the invention utilizes the boundary direction histogram, the context model and the prediction coding mode of the small block at the same position of the previous frame to quickly select the available candidate prediction mode, carries out pre-coding according to the pre-selection mode and then utilizes the Lagrangian cost function to select the optimal prediction mode. To further reduce the amount of computation, the raw data is sub-sampled before the boundary direction vectors are computed. Taking intra 4 × 4 as an example, the fast intra prediction mode selection process is shown in fig. 9, which is a flow chart of fast intra 4 × 4 prediction mode selection, and each part of the process will be described separately below.
1.2.2.1 Pixel sub-sampling
The input original pixel data are sub-sampled 2:1; the number of sampled pixels is 1/2 of the original, so the time consumed computing boundary direction vectors on the sampled pixels is about 1/2. The pixel sub-sampling method employed herein is illustrated in fig. 10 by the sub-sampling of a 4 × 4 block, where the filled circles in the sub-sampled image represent the retained sample pixels.
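A sketch of a 2:1 sub-sampling that keeps half the pixels; the exact sampling pattern used by the invention is not specified here, so a checkerboard (quincunx) pattern is assumed for illustration:

```python
def subsample_2to1(block):
    """Keep the pixels at positions where (row + col) is even -- half of all
    pixels, in a quincunx pattern (the precise pattern is an assumption)."""
    return {(i, j): block[i][j]
            for i in range(len(block))
            for j in range(len(block[i]))
            if (i + j) % 2 == 0}
```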
1.2.2.2 mode selection based on boundary direction
Images are spatially continuous and correlated, and the pixels that make up an image exhibit correlation along 8 spatial prediction directions; this can be exploited to reduce spatial redundancy. If the direction with the strongest correlation can be found and the pixel values are intra-predicted along it, the best intra coding effect is achieved. The Sobel operator [3–5] is used herein to compute the boundary direction vector of the sub-sampled pixels; its standard horizontal and vertical kernels,
$$S_x=\begin{bmatrix}-1&0&1\\-2&0&2\\-1&0&1\end{bmatrix},\qquad S_y=\begin{bmatrix}-1&-2&-1\\0&0&0\\1&2&1\end{bmatrix},$$
are used to compute the horizontal and vertical components of the boundary vector, respectively.
For a sub-sampled pixel $p_{i,j}$, the corresponding boundary vector is $\vec{D}_{i,j}=\{dx_{i,j},\,dy_{i,j}\}$, where $dx_{i,j}$ and $dy_{i,j}$ denote the horizontal and vertical components of the boundary vector. They are given in equation (1), where $p_{i-1,j+1}$ etc. refer to the neighboring pixels of $p_{i,j}$ in the original image.
$$dx_{i,j}=p_{i-1,j+1}+2p_{i,j+1}+p_{i+1,j+1}-p_{i-1,j-1}-2p_{i,j-1}-p_{i+1,j-1}$$
$$dy_{i,j}=p_{i+1,j-1}+2p_{i+1,j}+p_{i+1,j+1}-p_{i-1,j-1}-2p_{i-1,j}-p_{i-1,j+1} \tag{1}$$
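Equation (1) evaluated directly in code (a sketch; p is a 2-D array indexed p[i][j] with i the row and j the column, and (i, j) must be an interior position):

```python
def boundary_vector(p, i, j):
    """Horizontal and vertical boundary components dx, dy of equation (1)
    for the pixel p[i][j], using its eight neighbours in the original image."""
    dx = (p[i-1][j+1] + 2 * p[i][j+1] + p[i+1][j+1]
          - p[i-1][j-1] - 2 * p[i][j-1] - p[i+1][j-1])
    dy = (p[i+1][j-1] + 2 * p[i+1][j] + p[i+1][j+1]
          - p[i-1][j-1] - 2 * p[i-1][j] - p[i-1][j+1])
    return dx, dy
```

On a flat region both components are 0; on a horizontal intensity ramp (p[i][j] = j) the result is dx = 8, dy = 0, i.e. a purely horizontal gradient.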
For ease of calculation, the norm of the boundary direction vector is defined as:
$$\mathrm{Amp}(\vec{D}_{i,j})=|dx_{i,j}|+|dy_{i,j}| \tag{2}$$
the direction of the boundary direction vector is:
$$\mathrm{Ang}(\vec{D}_{i,j})=\arctan\!\left(\frac{dy_{i,j}}{dx_{i,j}}\right)$$
the moduli of the vectors in the same direction in the small block are added to obtain a corresponding edge direction histogram (edge direction histogram), the establishment of the 4 × 4 edge direction histogram in the frame is shown in the following formula 3, and the direction with the largest modulus in the direction histogram is selected as a candidate prediction direction.
$$\mathrm{Histo}(k)=\sum_{(m,n)\in \mathrm{SET}(k)} \mathrm{Amp}(\vec{D}_{m,n}),\qquad \mathrm{SET}(k)=\{(i,j)\mid \mathrm{Ang}(\vec{D}_{i,j})\in a_k\} \tag{3}$$

where (angles in degrees)

a0 = (-103.30, -76.70]
a1 = (-13.30, 13.30]
a3 = (35.80, 54.20]
a4 = (-54.20, -35.80]
a5 = (-76.70, -54.20]
a6 = (-35.80, -13.30]
a7 = (54.20, 76.70]
a8 = (13.30, 35.80]
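A sketch of the histogram construction of equations (2)–(3). The fold of angles onto the bin range (−103.3°, 76.7°] is an assumption about the angle convention, since the arctangent alone does not cover these bins:

```python
import math

# Angle bins a_k in degrees; k is the directional intra mode (mode 2 = DC has no bin).
BINS = {0: (-103.30, -76.70), 5: (-76.70, -54.20), 4: (-54.20, -35.80),
        6: (-35.80, -13.30), 1: (-13.30, 13.30), 8: (13.30, 35.80),
        3: (35.80, 54.20), 7: (54.20, 76.70)}

def edge_direction_histogram(vectors):
    """Accumulate Amp = |dx| + |dy| (eq. 2) into the bin containing the
    vector's angle (eq. 3); returns {mode: accumulated amplitude}."""
    histo = {k: 0.0 for k in BINS}
    for dx, dy in vectors:
        ang = math.degrees(math.atan2(dy, dx))  # (-180, 180]
        if ang <= -103.30:                      # fold opposite directions
            ang += 180.0                        # onto (-103.30, 76.70]
        elif ang > 76.70:
            ang -= 180.0
        for k, (lo, hi) in BINS.items():
            if lo < ang <= hi:
                histo[k] += abs(dx) + abs(dy)
                break
    return histo

def candidate_direction(histo):
    """Direction with the largest accumulated modulus (the candidate mode)."""
    return max(histo, key=histo.get)
```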
1.2.2.3 mode selection based on context model
There is spatial correlation between the blocks of an image, so the coding mode of the current block can be predicted from the coding modes of adjacent blocks. As shown in fig. 11, C denotes the current 4 × 4 block, while A and B denote the 4 × 4 blocks above and to the left of the current block. The maximum of the A and B prediction modes is used as a candidate prediction mode for the current block.
1.2.2.4 Mode selection based on the co-located block in the previous frame
Check the coding mode of the 4 × 4 block at the position corresponding to the current block in the previous frame; if that co-located block used an intra coding mode, its coding mode is selected as a candidate coding mode for the current 4 × 4 block. Fig. 12 illustrates the current 4 × 4 block and the co-located 4 × 4 block in the previous frame.
1.2.2.5 precoding and Performance comparison
Pre-coding uses the pixels around the current block: the current block is predictively coded in turn with each selected candidate prediction mode, and the optimal prediction mode is chosen with a Lagrangian cost function:
J(s,c,IMODE|QP,λMODE)=SSD(s,c,IMODE|QP)+λMODE·R(s,c,IMODE|QP) (4)
Here IMODE ranges over the selectable intra prediction directions; SSD is the sum of squared differences between the original pixel values s and the reconstructed pixel values c of the intra 4 × 4 block; and R(s, c, IMODE|QP) is the size of the code stream produced by coding in mode IMODE with variable-length Huffman coding. Peak signal-to-noise ratio (PSNR) is used for quality measurement in video coding; equation (5) gives the peak signal-to-noise ratio:
$$\mathrm{PSNR}=10\log_{10}\frac{255^2}{\mathrm{MSE}} \tag{5}$$
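The mode decision of equation (4), sketched with a hypothetical `precode` callback that is assumed to return the (SSD, bits) pair obtained by pre-coding a candidate mode:

```python
def best_intra_mode(candidates, precode, lam):
    """Pick the candidate mode minimizing J = SSD + lambda * R (equation 4).
    `precode(mode)` is assumed to return (ssd, bits) for that mode."""
    best_mode, best_cost = None, float("inf")
    for mode in candidates:
        ssd, bits = precode(mode)
        cost = ssd + lam * bits     # Lagrangian cost J
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode
```

With a larger λ the decision shifts toward modes with a smaller rate R, trading distortion for bits.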
1.2.3 results of the experiment
The test sequences used in the experiment were Mobile, Tempete, Bus, and Paris, all of QCIF size; only the luminance component was tested. The test results are shown in table 2.
TABLE 2 variation of coding Performance under different test sequences
| Test sequence | Change in first I-frame image encoding time (%) | Average change in per-frame image bit rate (%) | Change in average per-frame image encoding time (%) | Change in image PSNR (dB) |
| Mobile | -70.25 | 0.12 | -33.56 | -0.016 |
| Tempete | -69.78 | 0.26 | -32.14 | -0.014 |
| Bus | -69.58 | 0.39 | -24.34 | -0.024 |
| Paris | -71.03 | 0.42 | -31.76 | -0.021 |
2 System software composition block diagram (FIG. 13 System software composition block diagram)
In the system's software, the core modules are the video encoder and decoder; these two parts form the main body of the whole software architecture and constitute the invention's biggest innovation. The video conference system designed by the invention uses the RTP/RTCP protocols to transmit video and voice data. The real-time transport protocol (RTP) is responsible for packetizing and sending media data, while RTCP connects the sending and receiving sides of the video and voice data streams and carries feedback and time-synchronization information.
3 System hardware block diagram (shown in figure 14); the system adopts an embedded design.
In short, video conferencing is a rapidly growing market, but because the industry standards are not completely unified, western countries have not achieved a monopoly on the core technology, and China faces a great opportunity to grow in this field. At present, some domestically produced video conference network equipment, such as MCUs and gatekeepers, is internationally advanced or even leading in technology; but for video conference terminal equipment, China still lacks competitive products, and that market is almost completely occupied by foreign products. The AVCS-II video conference system developed by the Institute of Applied Physics of Nanjing University is, to a certain extent, a new attempt and breakthrough for our country in the field of video conference terminal products, especially in video codec technology, and is expected to open up the domestic and foreign video conference markets. The frequency-domain sub-pixel motion search algorithm provided by the invention is technically innovative; experiments and practical use by customers have shown that the algorithm has high accuracy and extremely low computational complexity and can rapidly match the optimal motion vector. In addition to its unique video coding system, the system designed by the invention provides a rich set of video conference tools, building a complete video and data interaction platform for users.