HK1088159B - Picture coding method - Google Patents
Description
Technical Field
The present invention relates to a method for buffering multimedia information. The invention also relates to a method of decoding an encoded image stream in a decoder, wherein the encoded image stream is received as a transmission unit comprising multimedia data. The invention further relates to a system, a transmitting device, a receiving device, a computer program product, a signal and a module.
Background
Published video coding standards include ITU-T H.261, ITU-T H.263, ISO/IEC MPEG-1, ISO/IEC MPEG-2, and ISO/IEC MPEG-4 Part 2. These standards are referred to herein as conventional video coding standards.
Video communication system
Video communication systems can be divided into conversational and non-conversational systems. Conversational systems include video conferencing and video telephony. Examples of such systems include ITU-T recommendations H.320, H.323 and H.324, which specify video conferencing/telephony systems operating in ISDN, IP and PSTN networks, respectively. A characteristic of conversational systems is the aim to minimize the end-to-end delay (from audio-video acquisition to far-end audio-video presentation) in order to improve the user experience.
Non-conversational systems include playback of stored content, such as Digital Versatile Disks (DVDs) or video files stored in the mass storage of a playback device, digital television, and streaming. A short review of the most important standards in these technical fields is given below.
The dominant standard in today's digital video consumer electronics is MPEG-2, which includes specifications for video compression, audio compression, storage, and transmission. The storage and transmission of coded video is based on the concept of elementary streams. An elementary stream consists of encoded data from a single source (e.g. video) plus the auxiliary data required for synchronization, identification and characterization of the source information. The elementary stream is packetized into constant-length or variable-length packets to form a Packetized Elementary Stream (PES), each PES packet consisting of a header and stream data, hereinafter referred to as a payload. PES packets from various elementary streams are combined to form a Program Stream (PS) or a Transport Stream (TS). PS is aimed at applications with negligible transmission errors, for example store-and-play type applications. TS is aimed at applications that are susceptible to transmission errors. However, TS assumes that the network throughput is guaranteed to be constant.
The Joint Video Team (JVT) consisting of ITU-T and ISO/IEC has published a draft standard that includes the same standard text as ITU-T recommendation H.264 and ISO/IEC international standard 14496-10(MPEG-4 Part 10). In this document, this draft standard is referred to as the JVT coding standard, and a codec according to the draft standard is referred to as the JVT codec.
The codec specification itself conceptually distinguishes between a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). The VCL includes the signal processing functions of the codec, such as transform, quantization, motion search/compensation, and loop filtering. It follows the general design of most current video codecs: a macroblock-based encoder with motion-compensated inter-picture prediction and transform coding of the residual signal. The output of the VCL is a slice: a bit string containing the macroblock data of an integer number of macroblocks, together with the slice header information (the spatial address of the first macroblock in the slice, the initial quantization parameter, and the like). Macroblocks within a slice are arranged in scan order unless a different macroblock allocation is specified using the so-called flexible macroblock ordering syntax. Intra-picture prediction is used only within a slice.
The NAL encapsulates the slices output by the VCL into Network Abstraction Layer Units (NALUs), which are suitable for transport over packet networks or for use in packet-oriented multiplexing environments. Annex B of the JVT coding standard defines an encapsulation process for transporting NALUs over byte-stream-oriented networks.
The optional reference picture selection mode of H.263 and the NEWPRED coding tool of MPEG-4 Part 2 enable the selection of the motion-compensation reference frame for each picture segment, e.g., for each slice in H.263. In addition, the optional enhanced reference picture selection mode of H.263 and the JVT coding standard enable the selection of a reference frame for each macroblock separately.
Reference picture selection enables many types of temporal scalability schemes. FIG. 1 shows an example of a temporal scalability scheme, referred to herein as recursive temporal scalability. The exemplary scheme can be decoded at three constant frame rates. FIG. 2 depicts a scheme called video redundancy coding, in which a sequence of pictures is divided into two or more independently coded threads in an interleaved manner. The arrows in these and all subsequent figures indicate the direction of motion compensation, and the values under the frames correspond to the relative acquisition and display times of the frames.
Transmission order
In conventional video coding standards, the decoding order and display order of pictures are the same except for B pictures. A block in a conventional B picture can be bidirectionally temporally predicted from two reference pictures, one preceding and the other succeeding it in display order. Only the most recent reference picture in decoding order can succeed a B picture in display order (with one exception: interlaced coding in H.263, where both field pictures of a temporally subsequent reference frame can precede a B picture in decoding order). A conventional B picture cannot be used as a reference picture for temporal prediction, and therefore a conventional B picture can be disposed of without affecting the decoding of any other picture.
The JVT coding standard includes the following novel technical features compared to earlier standards:
-the picture decoding order is decoupled from the display order. The picture number indicates the decoding order, and the picture order count indicates the display order.
- The reference pictures for blocks in a B picture may precede or follow the B picture in display order. Consequently, B picture stands for bi-predictive picture rather than bi-directional picture.
- Pictures that are not used as reference pictures are explicitly marked. A picture of any type (intra, inter, B, etc.) may be a reference picture or a non-reference picture. (Thus, a B picture can be used as a reference picture for temporal prediction of other pictures.)
- A picture may comprise slices coded with different coding types. In other words, a coded picture may consist, for example, of one intra-coded slice and one B-coded slice.
Separating the decoding order from the display order is beneficial from the points of view of compression efficiency and error resiliency.
Figure 3 shows an example of a prediction structure that potentially improves compression efficiency. The boxes represent pictures, the capital letters in the boxes represent coding types, the numbers in the boxes are picture numbers according to the JVT coding standard, and the arrows represent prediction dependencies. It is noted that picture B17 is a reference picture for picture B18. Compression efficiency is potentially improved compared to conventional coding because the reference pictures of B18 are temporally closer compared to conventional coding with PBBP or PBBBP coded picture patterns. Compression efficiency is potentially improved compared to the conventional PBP coded picture mode because part of the reference pictures are bi-directionally predicted.
Fig. 4 shows an example of an intra picture delay method that can be used to improve error resilience. Conventionally, an intra picture is coded immediately after a scene cut or as a response to, for example, an expired intra picture refresh period. In the intra picture delay method, an intra picture is not coded immediately after the need to code an intra picture arises; instead, a temporally subsequent picture is selected as the intra picture. Each picture between the coded intra picture and the conventional position of the intra picture is predicted from the temporally next subsequent picture. As shown in Fig. 4, the intra picture delay method produces two independent inter-picture prediction chains, whereas the conventional coding algorithm produces a single inter-picture chain. Clearly, the two-chain scheme is more robust against errors than the single-chain conventional scheme: if a packet loss occurs in one chain, the other chain may still be received correctly. In conventional coding, the loss of one packet typically causes the error to propagate along the prediction chain to the rest of the pictures.
Two types of sequencing and timing information have traditionally been associated with digital video: decoding and display order. The related art will be described in more detail below.
The Decoding Time Stamp (DTS) indicates the time, relative to a reference clock, at which a data unit should be decoded. If the DTS is coded and transmitted, it serves two purposes: first, if the decoding order of pictures differs from their output order, the DTS indicates the decoding order explicitly. Second, the DTS guarantees correct pre-decoder buffering, provided that the reception rate is close to the transmission rate at all times. In networks where the end-to-end latency varies, this second use of the DTS plays little or no role. Instead, the received data is decoded as soon as possible, provided that there is room in the post-decoder buffer for the uncompressed pictures.
The way the DTS is transmitted depends on the communication system and the video coding standard in use. In MPEG-2 systems, the DTS may optionally be transmitted as an item in the header of a PES packet. In the JVT coding standard, the DTS may optionally be transmitted as part of the Supplemental Enhancement Information (SEI), and it is used in the operation of the optional hypothetical reference decoder. In the ISO base media file format, the DTS has a dedicated box type, the decoding time to sample box. In many systems, such as RTP-based streaming systems, the DTS is not transmitted at all, because the decoding order is assumed to be the same as the transmission order and the exact decoding time does not play an important role.
Optional Annex U and Annex W.6.12 of H.263 specify a picture number that is incremented by 1, in decoding order, relative to the previous reference picture. In the JVT coding standard, the frame number coding element is specified similarly to the picture number of H.263. The JVT coding standard also specifies a particular type of intra picture, called an Instantaneous Decoder Refresh (IDR) picture. No subsequent picture can refer to any picture that is earlier in decoding order than the IDR picture. An IDR picture is typically coded in response to a scene change. In the JVT coding standard, the frame number is reset to 0 at an IDR picture in order to improve error resilience in case the IDR picture is lost, as illustrated in Figs. 5a and 5b. However, it should be noted that the scene information SEI message of the JVT coding standard can also be used to detect scene changes.
The H.263 picture number can be used to recover the decoding order of reference pictures. Similarly, the JVT frame number can be used to recover the decoding order of frames between an IDR picture (inclusive) and the next IDR picture (exclusive) in decoding order. However, because complementary reference field pairs (consecutive pictures coded as fields with different parity) share the same frame number, their decoding order cannot be reconstructed from the frame numbers.
The H.263 picture number or JVT frame number of a non-reference picture is specified to be equal to the picture or frame number of the previous reference picture in decoding order plus 1. If several non-reference pictures are consecutive in decoding order, they share the same picture or frame number. The picture or frame number of a non-reference picture is also the same as the picture or frame number of the next reference picture in decoding order. The decoding order of consecutive non-reference pictures can be recovered using the Temporal Reference (TR) coding element of H.263 or the Picture Order Count (POC) concept of the JVT coding standard.
A Presentation Time Stamp (PTS) indicates a time when a picture should be displayed with respect to a reference clock. The presentation time stamps are also referred to as display time stamps, output time stamps and composition time stamps.
The way the PTS is transmitted depends on the communication system and the video coding standard in use. In MPEG-2 systems, the PTS may optionally be transmitted as an item in the header of a PES packet. In the JVT coding standard, the PTS may optionally be transmitted as part of the Supplemental Enhancement Information (SEI). In the ISO base media file format, the PTS has a dedicated box type, the composition time to sample box, in which the presentation time stamp is coded relative to the corresponding decoding time stamp. In RTP, the RTP timestamp in the RTP packet header corresponds to the PTS.
Conventional video coding standards feature a Temporal Reference (TR) coding element that is similar to the PTS in many respects. In some conventional video coding standards, such as MPEG-2 video, the TR is reset to zero at the beginning of a group of pictures (GOP). In the JVT coding standard, there is no concept of time in the video coding layer. Instead, a Picture Order Count (POC) is specified for each frame and field, and it is used, for example, in direct temporal prediction of B slices, similarly to the TR. The POC is reset to 0 at an IDR picture.
Buffering
Streaming clients typically have a receiver buffer that can store a relatively large amount of data. Initially, when a streaming session is established, the client does not start playing the stream back immediately; rather, it typically buffers the incoming data for a few seconds. This buffering helps to maintain continuous playback, because the client can decode and play buffered data in case of occasionally increased transmission delays or drops in network throughput. Without initial buffering, the client would have to freeze the display, stop decoding, and wait for incoming data. The buffering is also necessary for automatic or selective retransmission at any protocol layer: if part of a picture is lost, a retransmission mechanism may be used to resend the lost data, and if the retransmitted data is received before its scheduled decoding or playback time, the loss is perfectly recovered.
Coded pictures can be ranked according to their importance to the subjective quality of the decoded sequence. For example, non-reference pictures, such as conventional B pictures, are subjectively least important, because their absence does not affect the decoding of any other picture. Subjective ranking can also be done on a data segment or slice group basis. Coded slices and data segments that are subjectively most important can be transmitted earlier than their decoding order indicates, whereas coded slices and data segments that are subjectively least important can be transmitted later than their natural coding order indicates. Consequently, the most important slices and data segments are more likely than the least important ones to be received before their scheduled decoding or playback time.
Buffering before decoder
Pre-decoder buffering refers to the buffering of encoded data before it is decoded. Initial buffering refers to pre-decoder buffering at the beginning of a streaming session. Initial buffering is conventionally done for the reasons explained below.
In conversational packet-switched multimedia systems, such as IP-based video conferencing systems, different types of media are usually carried in separate packets. Moreover, packets are typically carried on top of a best-effort network that cannot guarantee a constant transmission delay; instead, the delay may vary from packet to packet. Consequently, packets with the same presentation (playback) time stamp may not be received at the same time, and the reception interval of two packets may differ from their presentation interval. Thus, in order to maintain playback synchronization between different media types and to maintain the correct playback rate, a multimedia terminal typically buffers received data for a short period (e.g. less than half a second) in order to smooth out delay variations. Herein, this type of buffering component is referred to as a delay jitter buffer. Buffering can take place before and/or after media data decoding.
Delay jitter buffering is also applied in streaming systems. Because streaming is a non-conversational application, the required delay jitter buffer may be considerably larger than in conversational applications. When a streaming player has established a connection to a server and requested a multimedia stream to be downloaded, the server begins to transmit the desired stream. The player does not start playing the stream back immediately; rather, it typically buffers the incoming data for a certain period, typically a few seconds. Herein, this buffering is referred to as initial buffering. Initial buffering helps to eliminate transmission delay variations in a manner similar to the delay jitter buffering of conversational applications. In addition, it may enable the use of link, transport, and/or application layer retransmissions of lost Protocol Data Units (PDUs). The player can decode and play buffered data while retransmitted PDUs may still be received in time to be decoded and played back at their scheduled times.
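The required initial buffering delay can be estimated with a short sketch. The following Python fragment is illustrative only and not part of the invention; the function name and the example figures are assumptions made here for demonstration.

```python
def min_initial_buffering_delay(arrival_times, playout_times):
    """Smallest start-up delay d such that every packet has arrived
    by its delayed playout time, i.e. a_i <= p_i + d for all i."""
    return max(0.0, max(a - p for a, p in zip(arrival_times, playout_times)))

# Packets scheduled one second apart but arriving with delay jitter:
arrivals = [0.0, 1.3, 2.1, 3.6]
playout = [0.0, 1.0, 2.0, 3.0]
delay = min_initial_buffering_delay(arrivals, playout)
# The worst packet is 0.6 s behind schedule, so buffering for roughly
# that long before starting playback smooths out the jitter.
```

The same quantity also bounds how long a player can wait for retransmitted PDUs without stalling playback.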
Initial buffering in streaming clients provides a further advantage that cannot be achieved in conversational systems: it allows the data rate of the media transmitted from the server to vary. In other words, media packets can be temporarily transmitted faster or slower than their playback rate, as long as the receiver buffer does not overflow or underflow. Fluctuations in the data rate can occur for two reasons.
First, the compression efficiency achievable in some media types, such as video, depends on the content of the source data. Consequently, if a stable quality is desired, the bit rate of the resulting compressed bit stream varies. Typically, stable audio-visual quality is subjectively more pleasing than varying quality. Thus, initial buffering enables a more pleasing audio-visual quality than a system without initial buffering, such as a video conferencing system, can achieve.
Second, it is commonly known that packet losses in fixed IP networks occur in bursts. In order to avoid bursty errors and high peak bit and packet rates, well-designed streaming servers schedule the transmission of packets carefully. Packets may not be sent at exactly the rate at which they are played back at the receiving end; instead, the server may try to achieve a steady interval between transmitted packets. The server may also adjust the packet transmission rate in accordance with prevailing network conditions, for example reducing the packet transmission rate when the network becomes congested and increasing it when network conditions allow.
Transmission of multimedia streams
A multimedia streaming system consists of a streaming server and a number of players that access the server over a network. The network is typically packet-oriented and provides little or no means to guarantee quality of service. The players fetch either pre-stored or live multimedia content from the server and play it back in real time while the content is being downloaded. The type of communication can be either point-to-point or multicast. In point-to-point streaming, the server provides a separate connection for each player. In multicast streaming, the server transmits a single data stream to multiple players, and network elements duplicate the stream only where necessary.
When the player has established a connection with the server and requested a multimedia stream, the server starts transmitting the desired stream. The player does not immediately start playing back the stream, but typically buffers the input data for a few seconds. Here, this buffering is referred to as initial buffering. Initial buffering helps to maintain uninterrupted playback because the player can decode and play buffered data in the event of occasional increased transmission delays or network traffic drops.
In order to avoid unlimited transmission delays, reliable transport protocols are generally not favored in streaming systems. Instead, such systems prefer an unreliable transport protocol, such as UDP, which on the one hand provides a more stable transmission delay, but on the other hand suffers from data corruption or loss.
The RTP and RTCP protocols can be used on top of UDP to control real-time communications. RTP provides means to detect losses of transmission packets, to reassemble packets in the correct order at the receiving end, and to associate a sampling time stamp with each packet. RTCP conveys information about how large a proportion of the packets was received correctly and can therefore be used for flow control purposes.
In conventional video coding standards, the decoding order is coupled with the output order. In other words, the decoding order of I and P pictures is the same as their output order, and the decoding order of a B picture immediately follows the decoding order of the reference picture that succeeds the B picture in output order. Consequently, it is possible to recover the decoding order from the known output order. The output order is typically conveyed in the elementary video bit stream in the Temporal Reference (TR) field, and also in the system multiplex layer (e.g. in the RTP header).
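The coupling rule just described can be illustrated with a small sketch. The Python function below is our own illustration (its name is not from any standard): given the pictures in output order with their coding types, each B picture is decoded immediately after the reference picture that succeeds it in output order.

```python
def decoding_order_from_output(pictures):
    """pictures: (name, type) pairs in output order. In conventional
    standards, each B picture is decoded right after the reference
    (non-B) picture that succeeds it in output order."""
    decoding, pending_b = [], []
    for name, ptype in pictures:
        if ptype == "B":
            pending_b.append(name)  # hold back until the next reference
        else:
            decoding.append(name)
            decoding.extend(pending_b)  # B pictures held back so far
            pending_b = []
    decoding.extend(pending_b)
    return decoding

# Output order I0 B1 B2 P3 B4 P5 becomes decoding order I0 P3 B1 B2 P5 B4.
```

Running the transformation both ways shows why no explicit decoding time stamps are needed in these standards.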
Some RTP payload specifications allow the transmission of encoded data out of decoding order. The amount of disorder is typically characterized by a value that is defined similarly in many relevant specifications. For example, in the draft RTP payload format for the transport of MPEG-4 elementary streams, the maximum displacement parameter is specified as follows:
The maximum displacement in time of an access unit (AU, corresponding to a coded picture) is the maximum difference between the time stamp of an AU in the pattern and the time stamp of the earliest AU that is not yet present. In other words, when considering a sequence of interleaved AUs, then:
Maximum displacement = max{TS(i) - TS(j)}, for any i and any j > i,
where i and j indicate the index of an AU in the interleaving pattern and TS denotes the time stamp of the AU.
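The definition above can be written directly as a brute-force computation. The sketch below is illustrative only (the function name is our own); it simply evaluates max{TS(i) - TS(j)} over all pairs with j > i.

```python
def maximum_displacement(ts):
    """ts: AU time stamps listed in (interleaved) transmission order.
    Returns max over all i and all j > i of TS(i) - TS(j)."""
    worst = 0
    for i in range(len(ts)):
        for j in range(i + 1, len(ts)):
            worst = max(worst, ts[i] - ts[j])
    return worst

# For a stream transmitted in time stamp order the displacement is 0;
# any AU sent ahead of earlier-stamped AUs increases it.
```

This makes it easy to evaluate the parameter for concrete interleaving patterns.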
In connection with the present invention it has been noticed that this approach suffers from certain problems and that it yields too large buffer values.
Disclosure of Invention
The following is an example of a case in which the definition of maximum displacement fails to specify the buffering requirements (both the buffering space and the initial buffering duration). The sequence is divided into segments of 15 AUs each; within each segment, the AU that is last in decoding and output order is transmitted first, while all other AUs are transmitted in decoding and output order. The transmission sequence of AUs is thus:
14 0 1 2 3 4 5 6 7 8 9 10 11 12 13 29 15 16 17 18 19 ...
The maximum displacement of this sequence is 14, reached at AU(14 + k*15) for all non-negative integer values of k.
However, this sequence requires buffer space and initial buffering for only one AU.
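The claim that one AU of buffering suffices can be checked with a small simulation. The sketch below (Python; the function name is our own) feeds the transmission order described above to a receiver that consumes AUs in decoding order (0, 1, 2, ...) and records the peak number of AUs that must be held back:

```python
def peak_buffer_occupancy(transmission_order):
    """Peak number of AUs held in the pre-decoder buffer when AUs are
    consumed in decoding order 0, 1, 2, ... as soon as available."""
    buffered, next_au, peak = set(), 0, 0
    for au in transmission_order:
        if au == next_au:
            next_au += 1
            while next_au in buffered:  # drain AUs that are now in order
                buffered.remove(next_au)
                next_au += 1
        else:
            buffered.add(au)  # arrived early; must wait in the buffer
            peak = max(peak, len(buffered))
    return peak

seq = [14] + list(range(14)) + [29] + list(range(15, 29))
# peak_buffer_occupancy(seq) evaluates to 1: only the early AU of each
# 15-AU segment ever waits in the buffer.
```

The maximum displacement of 14 thus overstates the real requirement by a factor of 14 in this example.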
In the draft RTP payload format for H.264 (draft-ietf-avt-rtp-h264-01.txt), the parameter num-reorder-VCL-NAL-units is specified as follows: this parameter may be used to signal the properties of a NAL unit stream or the capabilities of a transmitter or receiver implementation. The parameter specifies the maximum number of VCL NAL units that precede any VCL NAL unit in the NAL unit stream in NAL unit decoding order and follow the VCL NAL unit in RTP sequence number order or in the composition order of the aggregation packet containing the VCL NAL unit. If the parameter is not present, num-reorder-VCL-NAL-units equal to 0 is implied. The value of num-reorder-VCL-NAL-units must be an integer in the range of 0 to 32767, inclusive.
According to the H.264 standard, VCL NAL units are specified as NAL units having nal_unit_type in the range of 1 to 5, inclusive. The standard defines NAL unit types 1 to 5 as follows:
1 Coded slice of a non-IDR picture
2 Coded slice data segment A
3 Coded slice data segment B
4 Coded slice data segment C
5 Coded slice of an IDR picture
The num-reorder-VCL-NAL-units parameter suffers from a problem similar to the one described above for the maximum displacement parameter: it is not possible to conclude the buffering space and initial buffering time requirements from this parameter.
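The problem can be made concrete by computing num-reorder-VCL-NAL-units for the interleaving example given earlier. The sketch below (Python; the helper is our own, not taken from the draft) counts, for each unit, the units that precede it in decoding order but follow it in transmission order, and takes the maximum:

```python
def num_reorder_vcl_nal_units(transmission_order, decoding_order):
    """Maximum, over all units u, of the number of units that precede
    u in decoding order but follow u in transmission order."""
    t = {u: i for i, u in enumerate(transmission_order)}
    d = {u: i for i, u in enumerate(decoding_order)}
    return max(
        sum(1 for v in transmission_order if d[v] < d[u] and t[v] > t[u])
        for u in transmission_order
    )

trans = [14] + list(range(14)) + [29] + list(range(15, 29))
# num_reorder_vcl_nal_units(trans, sorted(trans)) evaluates to 14,
# even though one unit of buffer space is actually sufficient.
```

Like the maximum displacement, the parameter is dominated by the unit that is sent earliest relative to its decoding position, not by the actual buffer occupancy.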
The invention can indicate the size of the receive buffer to the decoder.
In the following, an independent GOP is composed of pictures from one IDR picture (inclusive) to the next IDR picture (exclusive) in decoding order.
In the present invention, a parameter indicating the maximum amount of required buffering is defined more accurately than in prior art systems. In the following, the invention is described using an encoder-decoder based system, but it is obvious that the invention can also be implemented in systems in which the video signal is stored. The stored video signal can be an unencoded signal stored before encoding, an encoded signal stored after encoding, or a decoded signal stored after the encoding and decoding processes. For example, an encoder produces a bit stream in transmission order. A file system receives audio and/or video bit streams, which are encapsulated, e.g. in decoding order, and stored as a file. The file can be stored in a database from which a streaming server can read the NAL units and encapsulate them into RTP packets.
Furthermore, the invention is described below using an encoder-decoder based system, but it is obvious that the invention can also be implemented in a system in which the encoder outputs and sends encoded data in a first order to another component, such as a streaming server, in which the other component reorders the encoded data from the first order into another order, defines the buffer size required for the other order, and forwards the encoded data to the decoder in the reordered form.
According to a first aspect of the present invention, there is provided a method for buffering multimedia information, wherein a parameter is defined to indicate a maximum number of transmission units comprising multimedia data that precede any transmission unit comprising multimedia data in the packet stream in transmission unit transmission order and follow the transmission unit comprising multimedia data in decoding order.
According to a second aspect of the present invention, there is provided a method for decoding an encoded image stream in a decoder, wherein the encoded image stream is received as transmission units comprising multimedia data, and buffering of the encoded images is performed, wherein buffering requirements are indicated to the decoding process as a parameter indicating a maximum number of transmission units comprising multimedia data that precede any transmission unit comprising multimedia data in the packet stream in transmission unit transmission order and follow the transmission unit comprising multimedia data in decoding order.
According to a third aspect of the present invention there is provided a system comprising an encoder for encoding pictures and a buffer for buffering the encoded pictures, wherein the system is arranged to define a parameter indicative of a maximum number of transmission units comprising multimedia data that precede any transmission unit comprising multimedia data in the packet stream in transmission unit transmission order and follow the transmission unit comprising multimedia data in decoding order.
According to a fourth aspect of the invention, there is provided a transmitting device arranged to define a parameter indicating a maximum number of transmission units comprising multimedia data that precede any transmission unit comprising multimedia data in the packet stream in transmission unit transmission order and follow the transmission unit comprising multimedia data in decoding order.
According to a fifth aspect of the present invention there is provided a receiving apparatus for receiving an encoded picture stream as transmission units comprising multimedia data, wherein a parameter is arranged to indicate a maximum number of transmission units comprising multimedia data that precede any transmission unit comprising multimedia data in the packet stream in transmission unit transmission order and follow the transmission unit comprising multimedia data in decoding order.
According to a sixth aspect of the present invention there is provided a computer program product comprising machine executable steps for buffering encoded pictures, wherein the computer program product further comprises machine executable steps for defining a parameter indicative of a maximum number of transmission units comprising multimedia data that precede any transmission unit comprising multimedia data in the packet stream in transmission unit transmission order and follow the transmission unit comprising multimedia data in decoding order.
According to a seventh aspect of the present invention there is provided a signal, wherein the signal comprises an indication of a maximum number of transmission units comprising multimedia data, said transmission units preceding any transmission unit comprising multimedia data in the packet stream in transmission unit transmission order and following the transmission unit comprising multimedia data in decoding order.
According to an eighth aspect of the present invention there is provided a module for receiving an encoded image stream as transmission units comprising multimedia data, wherein a parameter is arranged to indicate a maximum number of transmission units comprising multimedia data that precede any transmission unit comprising multimedia data in the packet stream in transmission unit transmission order and follow the transmission unit comprising multimedia data in decoding order.
In one example embodiment of the present invention, the transport unit comprising multimedia data is a VCL NAL unit.
The invention improves the buffering efficiency of coding systems. By using the invention, a suitable, and indeed the actually required, amount of buffering can be used. There is no need to allocate more memory than necessary for the encoding buffer of the encoding device or for the pre-decoding buffer of the decoding device. Overflow of the pre-decoding buffer can also be avoided.
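For illustration, the parameter defined by the invention can be computed as follows. The Python sketch below uses names of our own choosing; for each transmission unit, it counts the units that precede it in transmission order while following it in decoding order, and takes the maximum:

```python
def max_units_sent_early(transmission_order, decoding_order):
    """Maximum number of units that precede any given unit in
    transmission order while following it in decoding order."""
    t = {u: i for i, u in enumerate(transmission_order)}
    d = {u: i for i, u in enumerate(decoding_order)}
    return max(
        sum(1 for v in transmission_order if t[v] < t[u] and d[v] > d[u])
        for u in transmission_order
    )

trans = [14] + list(range(14)) + [29] + list(range(15, 29))
# max_units_sent_early(trans, sorted(trans)) evaluates to 1, which
# matches the actual buffering requirement of one unit for this
# interleaving pattern, unlike the prior-art parameters.
```

Note how swapping the two order comparisons relative to num-reorder-VCL-NAL-units turns an overestimate into a tight bound on the buffer occupancy.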
Drawings
Fig. 1 shows a recursive temporal hierarchical scheme,
Fig. 2 depicts a scheme, referred to as video redundancy coding, in which a sequence of pictures is divided into two or more independent coding threads in an interleaved manner,
Fig. 3 shows an example of a prediction structure that potentially improves compression efficiency,
Fig. 4 shows an example of an intra picture delay method that can be used to improve error resilience,
Fig. 5 depicts an advantageous embodiment of a system according to the invention,
Fig. 6 depicts an advantageous embodiment of an encoder according to the invention, and
Fig. 7 depicts an advantageous embodiment of a decoder according to the invention.
Detailed Description
The invention will be described in more detail below with reference to the system of Fig. 5, the encoder 1 of Fig. 6 and the decoder 2 of Fig. 7. The images to be encoded may be, for example, images of a video stream from a video source 3, such as a video camera, a video recorder or the like. The images (frames) of the video stream may be divided into smaller portions such as slices. Slices may be further divided into blocks. In the encoder 1 the video stream is encoded to reduce the amount of information to be transmitted over the transmission channel 4 or stored on a storage medium (not shown). The images of the video stream are input into the encoder 1. The encoder has an encoding buffer 1.1 (Fig. 6) for temporarily storing some of the pictures to be encoded. The encoder 1 further comprises a memory 1.3 and a processor 1.2 in which the encoding tasks according to the invention can be performed. The memory 1.3 and the processor 1.2 may be shared with the transmitting apparatus 6, or the transmitting apparatus 6 may have another processor and/or memory (not shown) for its other functions. The encoder 1 performs motion estimation and/or some other tasks to compress the video stream. In motion estimation, similarities between the image to be encoded (the current image) and preceding and/or following images are searched for. If similarities are found, the compared picture, or a part of it, can be used as a reference picture for the picture to be encoded. In JVT coding, the display order and the decoding order of pictures do not have to be the same; a reference picture has to be stored in a buffer (e.g. in the encoding buffer 1.1) for as long as it is used as a reference picture. The encoder 1 also inserts information about the display order of the pictures into the transport stream.
Once the encoding process has started, encoded pictures are moved to the encoded picture buffer 5.2 if necessary. The encoded pictures are transmitted from the encoder 1 to the decoder 2 via the transmission channel 4. In the decoder 2, the encoded pictures are decoded to form uncompressed images corresponding as closely as possible to the original images.
The decoder 2 further comprises a memory 2.3 and a processor 2.2 in which the decoding tasks according to the invention can be performed. The memory 2.3 and the processor 2.2 may be shared with the receiving device 8, or the receiving device 8 may have another processor and/or memory (not shown) for its other functions.
Encoding
Let us now consider the encoding-decoding process in more detail. Pictures from the video source 3 are input to the encoder 1 and advantageously stored in the encoding buffer 1.1. The encoding process does not have to start immediately after the first picture is input to the encoder, but can start after a certain number of pictures are available in the encoding buffer 1.1. The encoder 1 then tries to find suitable candidates among the pictures to be used as reference frames. The encoder 1 then performs encoding to form encoded pictures. The encoded pictures can be, for example, predicted pictures (P), bi-predictive pictures (B), and/or intra-coded pictures (I). Intra-coded pictures can be decoded without using any other pictures, but the other types of pictures require at least one reference picture before they can be decoded. Pictures of any of the above-mentioned types can be used as reference pictures.
The encoder advantageously attaches two time stamps to the pictures: a decoding time stamp (DTS) and an output time stamp (OTS). The decoder can use the time stamps to determine the correct decoding time and the time to output (display) the pictures. However, these time stamps are not necessarily transmitted to the decoder, or the decoder does not necessarily use them.
NAL units may be transmitted in packets of different kinds. In this advantageous embodiment, the different packet formats include simple packets and aggregate packets. The aggregate package is further divided into a single time aggregate package and a multi-time aggregate package.
The RTP payload format allows a number of different payload structures, depending on need. Which structure a received RTP packet contains is evident from the first byte of the payload. This byte is always structured as a NAL unit header, whose NAL unit type field indicates which structure is present. The possible structures are: single NAL unit packet, aggregation packet, and fragmentation unit. A single NAL unit packet contains only one NAL unit in the payload, and the NAL header type field is equal to the original NAL unit type, i.e. in the range of 1 to 23, inclusive. The aggregation packet type is used to aggregate multiple NAL units into a single RTP payload. Four versions of this packet exist: the single-time aggregation packet type A (STAP-A), the single-time aggregation packet type B (STAP-B), the multi-time aggregation packet (MTAP) with 16-bit offset (MTAP16), and the multi-time aggregation packet with 24-bit offset (MTAP24). The NAL unit type numbers assigned to STAP-A, STAP-B, MTAP16 and MTAP24 are 24, 25, 26 and 27, respectively. Fragmentation units are used to fragment a single NAL unit over multiple RTP packets. They exist in two versions, identified by NAL unit type numbers 28 and 29.
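The type numbers above can be illustrated with a short classifier over the first payload byte. This is an illustrative sketch rather than part of the specification; the function name `classify_rtp_payload` and the string labels are our own, and only the type numbers stated above are taken from the text.

```python
def classify_rtp_payload(first_byte):
    """Classify an H.264 RTP payload from the NAL unit type field,
    carried in the low 5 bits of the first payload byte."""
    nal_type = first_byte & 0x1F  # bits 0..4 hold the NAL unit type
    if 1 <= nal_type <= 23:
        return "single NAL unit packet"
    mapping = {
        24: "STAP-A",   # single-time aggregation packet, type A
        25: "STAP-B",   # single-time aggregation packet, type B
        26: "MTAP16",   # multi-time aggregation packet, 16-bit offset
        27: "MTAP24",   # multi-time aggregation packet, 24-bit offset
        28: "FU-A",     # fragmentation unit without DON
        29: "FU-B",     # fragmentation unit with DON
    }
    return mapping.get(nal_type, "undefined")
```

For example, a first payload byte of 0x18 (type 24) marks an STAP-A, while a byte whose low five bits fall in the range 1 to 23 marks a single NAL unit packet.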
There are three cases defined for the packetization mode of RTP packet transmission:
-a single NAL unit mode of the NAL,
-a non-interleaved pattern, and
-an interleaving pattern.
The single NAL unit mode is used for conversational systems that comply with ITU-T recommendation H.241. The non-interleaved mode is used for conversational systems that do not comply with ITU-T recommendation H.241. In the non-interleaved mode, NAL units are transmitted in NAL unit decoding order. The interleaved mode is used in systems that do not require very low end-to-end latency. The interleaved mode allows NAL units to be transmitted in an order other than NAL unit decoding order.
The packetization mode in use may be signalled by the value of the optional packetization-mode MIME parameter or by external means. The packetization mode in use controls which NAL unit types are allowed in RTP payloads.
In interleaved packetization mode, the transmission order of NAL units is allowed to be different from the decoding order of NAL units. The Decoding Order Number (DON) is a field in the payload structure or a derived variable indicating the NAL decoding order.
The interplay between transmission order and decoding order is controlled by the optional interleaving-depth MIME parameter as follows. When the value of the optional interleaving-depth MIME parameter is equal to 0 and transmission of NAL units in an order other than their decoding order is not allowed by external means, the NAL unit transmission order conforms to the NAL unit decoding order. When the value of the optional interleaving-depth MIME parameter is greater than 0, or transmission of NAL units in an order other than their decoding order is allowed by external means,
- the order of NAL units in a multi-time aggregation packet 16 (MTAP16) and a multi-time aggregation packet 24 (MTAP24) is not required to be the NAL unit decoding order, and
- the order of NAL units composed by decapsulating single-time aggregation packets B (STAP-B), MTAPs and fragmentation units (FU) in two consecutive packets is not required to be the NAL unit decoding order.
The RTP payload structures for a single NAL unit packet, an STAP-A and an FU-A do not include a DON. The STAP-B and FU-B structures include a DON, and the structure of an MTAP enables the DON to be derived.
STAP-B packet types may be used if the sender wants to encapsulate one NAL unit per packet and transport the packets in an order other than their decoding order.
In the single NAL unit packetization mode, the transmission order of NAL units is the same as their NAL unit decoding order. In the non-interleaved packetization mode, the transmission order of NAL units in single NAL unit packets, STAP-As and FU-As is the same as their NAL unit decoding order. The NAL units within an STAP appear in NAL unit decoding order.
Since H.264 allows the decoding order to be different from the display order, the values of RTP timestamps may not be monotonically non-decreasing as a function of RTP sequence numbers.
The DON value of the first NAL unit in transmission order can be set to any value. The value of DON is in the range of 0 to 65535, inclusive. After reaching the maximum value, the value of DON wraps around to 0.
A video sequence according to this specification may be any part of a NALU stream that can be decoded independently of other parts of the NALU stream.
The following is an example of a robust packet arrangement.
In the following example figures, time runs from left to right, I denotes an IDR picture, R denotes a reference picture, N denotes a non-reference picture, and the number indicates a relative output time with respect to the previous IDR picture in decoding order. The values below the picture sequences indicate scaled system clock timestamps, which in this example are initialized arbitrarily. For simplicity it is assumed that encoding, transmission and decoding take no time, so each I, R and N picture maps to the same timeline in each processing step.
A subset of the pictures in a plurality of video sequences is depicted below in output order.
...N58 N59 I00 N01 N02 R03 N04 N05 R06...N58 N59 I00 N01 N02...
...-|---|---|---|---|---|---|---|---|-... -|---|---|---|---|-...
...58 59 60 61 62 63 64 65 66 ...128 129 130 131 132...
The coding (and decoding) order of these pictures is from left to right as follows:
...N58 N59 I00 R03 N01 N02 R06 N04 N05...
...-|---|---|---|---|---|---|---|---|-...
...60 61 62 63 64 65 66 67 68...
The decoding order number (DON) of a picture is equal to the value of the DON of the previous picture in decoding order plus one.
For simplicity, we assume that:
-the frame rate of the sequence is constant,
each image is composed of only one slice,
each slice is encapsulated in a single NAL unit packet,
-transmitting pictures in decoding order, and
images are transmitted at constant intervals (i.e. equal to 1/frame rate).
Thus, the pictures are received in decoding order:
...N58 N59 I00 R03 N01 N02 R06 N04 N05...
...-|---|---|---|---|---|---|---|---|-...
...60 61 62 63 64 65 66 67 68 ...
The num-reorder-VCL-NAL-units parameter is set to 0, because no buffering is needed to recover the correct decoding order from the transmission (or reception) order.
The decoder has to initially buffer pictures for one picture interval in its decoded picture buffer in order to organize the pictures from decoding order to output order, as shown below:
.. N58 N59 I00 N01 N02 R03 N04 N05 R06 ...
...-|---|---|---|---|---|---|---|---|-...
... 61 62 63 64 65 66 67 68 69 ...
The amount of initial buffering required in the decoded picture buffer can be signalled in buffering period SEI messages or by the value of the num_reorder_frames syntax element of the H.264 video usability information. num_reorder_frames indicates the maximum number of frames, complementary field pairs, or non-paired fields that precede any frame, complementary field pair, or non-paired field in the sequence in decoding order and follow it in output order.
For simplicity, it is assumed that num_reorder_frames is used to indicate the initial buffering in the decoded picture buffer. In this example, num_reorder_frames is equal to 1.
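The decode-to-output reordering that num_reorder_frames bounds can be sketched as follows. The helper name and the (name, output index) pair representation are our own, and the buffer management is deliberately simplified; it merely illustrates that holding num_reorder_frames pictures before the first output suffices to restore output order.

```python
def reorder_to_output(decoding_order, num_reorder_frames):
    """Reorder pictures from decoding order to output order using a
    decoded picture buffer that holds num_reorder_frames pictures
    before the first picture is output.  Pictures are given as
    (name, output_index) pairs in decoding order."""
    buffer, output = [], []
    for pic in decoding_order:
        buffer.append(pic)
        # Output the earliest picture once the buffer is full enough.
        if len(buffer) > num_reorder_frames:
            buffer.sort(key=lambda p: p[1])
            output.append(buffer.pop(0))
    buffer.sort(key=lambda p: p[1])  # flush remaining pictures
    output.extend(buffer)
    return [name for name, _ in output]
```

With the decoding order of the example (I00 R03 N01 N02 R06 N04 N05) and num_reorder_frames = 1, this reproduces the output order I00 N01 N02 R03 N04 N05 R06 shown above.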
It may be noted that if the IDR picture I00 is lost during transmission and a retransmission request is issued when the system clock value is 62, there is one picture interval of time for receiving the retransmitted IDR picture I00 (until the system clock reaches timestamp 63).
Let us then assume that IDR pictures are transmitted two frame intervals earlier than their decoding position, i.e. pictures are transmitted in the following order:
...I00 N58 N59 R03 N01 N02 R06 N04 N05...
..--|---|---|---|---|---|---|---|---|-...
...62 63 64 65 66 67 68 69 70 ...
the variable id1 is specified according to the prior art (as disclosed in draft-itef-avt-RTP-h264-01. txt), i.e., it specifies the maximum number of VCL NAL units that precede any VCL NAL unit in the NAL unit stream in NAL unit decoding order and follow the VCL NAL units in RTP sequence numbering order or in the synthesis order of the aggregate packets containing the VCL NAL units. The variable id2 is specified according to the present invention, i.e., it specifies the maximum number of VCL NAL units that precede any VCL NAL unit in the NAL unit stream in transmission order and succeed the VCL NAL unit in decoding order.
In the example, the value of id1 is equal to 2 and the value of id2 is equal to 1. As already shown above, the value of id1 is not proportional to the time or buffer space required for initial buffering to rearrange packets from reception order to decoding order. In this example, an initial buffering time equal to one picture interval is required to recover the decoding order as shown below (the figure shows the output of the receiver buffering process). This example also demonstrates that values for the initial buffering time and buffering space can be inferred according to the invention.
...N58 N59 I00 R03 N01 N02 R06 N04 N05...
...-|---|---|---|---|---|---|---|---|-...
...63 64 65 66 67 68 69 70 71 ...
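The id1 and id2 values quoted in this example can be computed mechanically from the transmission-order sequence of DON values. The sketch below is our own illustration, not part of the specification; it ignores DON wraparound and assumes one NAL unit per packet, as in the example.

```python
def interleaving_depths(tx_dons):
    """Compute id1 (prior-art definition) and id2 (this document's
    definition) from a list of decoding order numbers given in
    transmission order.  Assumes no DON wraparound."""
    id1 = id2 = 0
    for i, don_u in enumerate(tx_dons):
        # id1: units earlier in decoding order but later in transmission order
        id1 = max(id1, sum(1 for d in tx_dons[i + 1:] if d < don_u))
        # id2: units earlier in transmission order but later in decoding order
        id2 = max(id2, sum(1 for d in tx_dons[:i] if d > don_u))
    return id1, id2
```

For the example transmission order above (I00 N58 N59 R03 N01 N02 R06 N04 N05, i.e. DONs 2, 0, 1, 3, 4, 5, 6, 7, 8), this yields id1 = 2 and id2 = 1, matching the values stated in the text.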
Furthermore, an initial buffering delay of one picture interval is required to organize from the decoding order to the output order as follows:
...N58 N59 I00 N01 N02 R03 N04 N05 R06...
...-|---|---|---|---|---|---|---|---|-...
...64 65 66 67 68 69 70 71 72 ...
It may be noted that the maximum delay that an IDR picture can undergo during transmission, including possible application-, transport- or link-layer retransmission, is equal to num_reorder_frames + id2. Consequently, the loss resilience of IDR pictures is improved in systems supporting retransmission.
The receiver can organize the images in decoding order based on the DON values associated with each image.
Transmission of
Transmission and/or storage of the coded pictures (and optional virtual decoding) can begin immediately after the first coded picture is ready. This picture does not have to be the first picture in the decoder output order, as the decoding order and output order may not be the same.
Transmission may begin when the first picture of the video stream is encoded. The encoded pictures are optionally stored in the encoded picture buffer 5.2. Transmission may also start later, for example after a certain part of the video stream has been encoded.
The decoder 2 should also output the decoded pictures in the correct order, for example by using an ordering of the picture order counts.
Unpacking
The unpacking process is implementation dependent. Hence, the following is a non-limiting example of a suitable implementation. Other schemes may be used as well. Optimizations relative to the algorithm described are also possible.
The general concept behind these unpacking rules is to reorder NAL units from transmission order to the NAL unit decoding order.
Decoding
The operation of the receiver 8 will be described next. The receiver 8 collects all packets belonging to a picture, bringing them into a reasonable order. The strictness of the order depends on the profile employed. The received packets are stored in the receive buffer 9.1 (pre-decoding buffer). The receiver 8 discards anything that is unusable and passes the rest to the decoder 2. Aggregation packets are handled by unloading their payload into individual RTP packets carrying NALUs. Those NALUs are processed as if they were received in separate RTP packets, in the order in which they were arranged in the aggregation packet.
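The unloading of an aggregation packet can be sketched for the STAP-A case, whose payload consists of the STAP-A header byte followed by repeated pairs of a 16-bit big-endian NAL unit size and the NAL unit itself. The function below is an illustrative sketch with only minimal error handling; the function name is our own.

```python
import struct

def unpack_stap_a(payload):
    """Split an STAP-A payload into its constituent NAL units.
    Layout: 1 STAP-A NAL header byte, then repeated
    (16-bit big-endian size, NAL unit) pairs."""
    units, pos = [], 1  # skip the STAP-A NAL header byte
    while pos + 2 <= len(payload):
        (size,) = struct.unpack_from(">H", payload, pos)
        pos += 2
        if pos + size > len(payload):
            raise ValueError("truncated NAL unit in STAP-A")
        units.append(payload[pos:pos + size])
        pos += size
    return units
```

Each returned NAL unit would then be processed as if it had arrived in its own RTP packet, in the order it appeared in the aggregation packet.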
In the following, let N be the value of the optional num-reorder-VCL-NAL-units parameter (interleaving-depth parameter) that specifies the maximum number of VCL NAL units that precede any VCL NAL unit in the packet stream in NAL unit transmission order and follow the VCL NAL unit in decoding order. If the parameter is not present, a value of 0 is implied.
When a video streaming session is initialized, the receiver 8 allocates memory for storing at least N VCL NAL unit fragments to the receive buffer 9.1. The receiver then starts receiving the video stream and stores the received VCL NAL units into a receive buffer. The initial buffering continues until:
until at least N VCL NAL unit fragments are stored in the reception buffer 9.1, or
- if a max-don-diff MIME parameter is present, until the value of the function don_diff(m, n) is greater than the value of max-don-diff, where n corresponds to the NAL unit having the greatest value of AbsDON among the received NAL units and m corresponds to the NAL unit having the smallest value of AbsDON among the received NAL units, or
-until the initial buffering lasts for a duration equal to or greater than the value of the optional init-buf-time MIME parameter.
The function don_diff(m, n) is specified as follows:
- If DON(m) == DON(n), then don_diff(m, n) = 0
- If (DON(m) < DON(n) and DON(n) - DON(m) < 32768), then don_diff(m, n) = DON(n) - DON(m)
- If (DON(m) > DON(n) and DON(m) - DON(n) >= 32768), then don_diff(m, n) = 65536 - DON(m) + DON(n)
- If (DON(m) < DON(n) and DON(n) - DON(m) >= 32768), then don_diff(m, n) = -(DON(m) + 65536 - DON(n))
- If (DON(m) > DON(n) and DON(m) - DON(n) < 32768), then don_diff(m, n) = -(DON(m) - DON(n))

where DON(i) is the decoding order number of the NAL unit having index i in transmission order.
A positive value of don _ diff (m, n) indicates that a NAL unit with transport order index n follows in decoding order a NAL unit with transport order index m.
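The case analysis above translates directly into code. The following sketch (the function name is ours) operates on 16-bit DON values and returns a positive value when n follows m in decoding order:

```python
def don_diff(don_m, don_n):
    """don_diff(m, n) over 16-bit DON values, following the five
    wraparound-aware cases given above."""
    if don_m == don_n:
        return 0
    if don_m < don_n and don_n - don_m < 32768:
        return don_n - don_m
    if don_m > don_n and don_m - don_n >= 32768:
        return 65536 - don_m + don_n          # wrapped forward
    if don_m < don_n and don_n - don_m >= 32768:
        return -(don_m + 65536 - don_n)       # wrapped backward
    # remaining case: don_m > don_n and don_m - don_n < 32768
    return -(don_m - don_n)
```

For instance, don_diff(65535, 0) is 1, because DON 0 immediately follows DON 65535 across the wraparound.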
AbsDON denotes a decoding order number of the NAL unit that does not wrap around to 0 after 65535. In other words, AbsDON is calculated as follows:
Let m and n be consecutive NAL units in transmission order. For the very first NAL unit in transmission order (whose index is 0), AbsDON(0) = DON(0). For the other NAL units, AbsDON is calculated as follows:
- If DON(m) == DON(n), then AbsDON(n) = AbsDON(m)
- If (DON(m) < DON(n) and DON(n) - DON(m) < 32768), then AbsDON(n) = AbsDON(m) + DON(n) - DON(m)
- If (DON(m) > DON(n) and DON(m) - DON(n) >= 32768), then AbsDON(n) = AbsDON(m) + 65536 - DON(m) + DON(n)
- If (DON(m) < DON(n) and DON(n) - DON(m) >= 32768), then AbsDON(n) = AbsDON(m) - (DON(m) + 65536 - DON(n))
- If (DON(m) > DON(n) and DON(m) - DON(n) < 32768), then AbsDON(n) = AbsDON(m) - (DON(m) - DON(n))

where DON(i) is the decoding order number of the NAL unit having index i in transmission order.
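Equivalently, AbsDON can be accumulated over a transmission-order DON sequence. The helper below is an illustrative sketch (the name is ours) that unfolds the wraparound at 65536 according to the rules above:

```python
def abs_don_sequence(tx_dons):
    """Return the AbsDON value for each NAL unit, given the 16-bit
    DON values in transmission order.  AbsDON(0) = DON(0)."""
    result = [tx_dons[0]]
    for prev, cur in zip(tx_dons, tx_dons[1:]):
        if prev == cur:
            step = 0
        elif prev < cur and cur - prev < 32768:
            step = cur - prev
        elif prev > cur and prev - cur >= 32768:
            step = 65536 - prev + cur          # forward across wraparound
        elif prev < cur and cur - prev >= 32768:
            step = -(prev + 65536 - cur)       # backward across wraparound
        else:  # prev > cur and prev - cur < 32768
            step = -(prev - cur)
        result.append(result[-1] + step)
    return result
```

For example, the DON sequence 65534, 65535, 0, 1 yields AbsDON values 65534, 65535, 65536, 65537, so the ordering survives the wraparound.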
When the receive buffer 9.1 contains at least N VCL NAL units, NAL units are removed from the receive buffer 9.1 one by one and passed to the decoder 2. The NAL units need not be removed from the receive buffer 9.1 in the same order in which they were stored, but according to the DON of the NAL units, as described below. The delivery of packets to the decoder 2 continues until the buffer contains fewer than N VCL NAL units, i.e. N-1 VCL NAL units.
NAL units to be removed from the receiver buffer are determined as follows:
- If the receiver buffer contains at least N VCL NAL units, NAL units are removed from the receiver buffer and passed to the decoder in the order specified below until the buffer contains N-1 VCL NAL units.
- If the max-don-diff MIME parameter is in use, all NAL units m for which don_diff(m, n) is greater than max-don-diff are removed from the receiver buffer and passed to the decoder in the order specified below. Here, n corresponds to the NAL unit having the greatest value of AbsDON among the received NAL units.
- When the first packet of the NAL unit stream is received, the variable ts is set to the value of a system timer that was initialized to 0. If the receiver buffer contains a NAL unit whose reception time tr satisfies ts - tr > init-buf-time, NAL units are passed to the decoder (and removed from the receiver buffer) in the order specified below until the receiver buffer contains no NAL unit whose reception time tr satisfies the specified condition.
The order in which NAL units are transmitted to the decoder is specified as follows.
Let PDON be a variable initialized to 0 at the beginning of an RTP session. For each NAL unit associated with a DON value, the DON distance will be calculated as follows. If the value of DON of a NAL unit is greater than the value of PDON, then the DON distance is equal to DON-PDON. Otherwise, the DON distance is equal to 65535-PDON + DON + 1.
NAL units are transmitted to the decoder in ascending order of DON distance. If several NAL units share the same value of DON distance, they can be transmitted to the decoder in any order. When the required number of NAL units has been delivered to the decoder, the value of PDON is set to the value of DON for the last NAL unit transmitted to the decoder.
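The PDON-based delivery rule above can be sketched as a sort by DON distance. The function name and list representation are ours; Python's stable sort keeps buffer order for NAL units sharing the same DON distance, which matches the "any order" allowance above.

```python
def decoder_delivery_order(dons, pdon=0):
    """Order buffered NAL units for delivery to the decoder by
    ascending DON distance from PDON, per the rule above."""
    def don_distance(don):
        # DON distance = DON - PDON if DON > PDON,
        # else 65535 - PDON + DON + 1 (wraparound case).
        if don > pdon:
            return don - pdon
        return 65535 - pdon + don + 1
    return sorted(dons, key=don_distance)
```

For example, with PDON = 1, buffered DONs 3, 65535, 1, 2 are delivered as 2, 3, 65535, 1: DON 65535 precedes DON 1 because DON 1 lies a full wraparound away from PDON.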
DPB 2.1 contains memory locations for storing multiple pictures. These locations are also referred to as frame stores in the description. The decoder 2 decodes the received pictures in the correct order.
The present invention may be applied to a variety of systems and devices. Advantageously, the transmitting means 6 comprising the encoder 1 further comprise a transmitter 7 for transmitting the encoded image to the transmission channel 4. The receiving means 8 comprise a receiver 9 for receiving the encoded image, a decoder 2 and a display 10 on which the decoded image can be displayed. The transmission channel may be, for example, a landline communication channel and/or a wireless communication channel. The transmitting device and the receiving device also comprise one or more processors 1.2, 2.2 capable of executing the necessary steps for controlling the encoding/decoding process of the video stream according to the invention. The method according to the invention is thus mainly implemented as machine executable steps of a processor. The buffering of the image may be performed in the memory 1.3, 2.3 of the device. The program code 1.4 of the encoder may be stored in the memory 1.3. The program code 2.4 of the decoder can be stored correspondingly in the memory 2.3.
Claims (19)
1. A method for buffering media data in a buffer, said media data being comprised in data transmission units, the data transmission units being ordered in a transmission order, the transmission order being at least partly different from a decoding order of the media data in the data transmission units, characterized by defining a parameter indicating a maximum number of data transmission units preceding any one of the data transmission units in the packet stream in the transmission order and following the any one of the data transmission units in the decoding order to be provided to a decoder for determining buffering needs.
2. The method according to claim 1, characterized in that said media data comprises at least one of the following:
- video data,
-audio data.
3. A method according to claim 1 or 2, wherein said media data comprises slices of a coded picture.
4. The method according to claim 1 or 2, wherein said transport unit comprising media data is a VCL NAL unit.
5. A method for decoding an encoded image stream in a decoder, wherein the encoded image stream is received as data transmission units comprising media data, the data transmission units being ordered into a transmission order, the transmission order being at least partly different from a decoding order of the media data in the data transmission units, buffering of the data transmission units being performed, characterized in that a buffering requirement is indicated to the decoding process as a parameter indicating a maximum number of data transmission units preceding any one of the data transmission units in the packet stream in the transmission order and following the any one of the data transmission units in the decoding order to be provided to the decoder for determining the buffering requirement.
6. A method according to claim 5, characterized in that said parameter is checked and a storage location is reserved for the buffer on the basis of said parameter.
7. Method according to claim 6, characterized in that the pictures are decoded from the received coded picture stream and the coded pictures are buffered using the reserved storage locations.
8. A system comprising an encoder for encoding pictures and a buffer for buffering media data, the media data being contained in data transmission units, the data transmission units being ordered in a transmission order, the transmission order being at least partly different from a decoding order of the media data in the data transmission units, characterized in that the system further comprises a definer for defining a parameter indicating a maximum number of data transmission units preceding any one of the data transmission units in the packet stream in the transmission order and following the any one of the data transmission units in the decoding order to be provided to the decoder for determining the buffering need.
9. System according to claim 8, characterized in that it comprises a decoder for decoding encoded pictures to form decoded pictures and a memory for buffering said decoded pictures, wherein said parameters are set for determining the required number of memory locations reserved in the memory for buffering said decoded pictures.
10. The system of claim 8, wherein said media data comprises slices of encoded pictures.
11. The system of claim 8 wherein said data transport units are VCL NAL units.
12. A transmitting device for transmitting media data contained in data transmission units, the data transmission units being ordered in a transmission order which differs at least partly from a decoding order of the media data in the data transmission units, characterized in that the transmitting device comprises a definer arranged for defining a parameter indicating a maximum number of data transmission units which precede any one of the data transmission units in the packet stream in the transmission order and which follow the any one of the data transmission units in the decoding order to be supplied to a decoder for determining buffering requirements.
13. A transmitting device according to claim 12, characterized in that said media data comprises slices of coded pictures.
14. Transmission apparatus according to claim 12, characterized in that said transmission unit comprising multimedia data is a VCL NAL unit.
15. A receiving device for receiving an encoded image stream as data transmission units of media data, the data transmission units being ordered in a transmission order which differs at least partly from a decoding order of the media data in the data transmission units, characterized in that the receiving device comprises means for determining buffering needs, arranged to perform said determination by using a parameter indicating a maximum number of data transmission units which precede any one of the data transmission units in the packet stream in the transmission order and which follow the any one of the data transmission units in the decoding order.
16. Receiving device according to claim 15, characterized in that it comprises a memory and that said means for determining the need for buffering comprise a definer for checking said parameter and for reserving memory locations for buffering from said memory in dependence on said parameter.
17. Receiving device according to claim 16, characterized in that it comprises a decoder for decoding pictures from the received coded picture stream and means for using the reserved memory locations for buffering the coded pictures.
18. An apparatus for buffering media data in a buffer, said media data being comprised in data transmission units, the data transmission units being ordered in a transmission order, the transmission order being at least partly different from a decoding order of the media data in the data transmission units, characterized in that the apparatus further comprises means for defining a parameter indicating a maximum number of data transmission units, said data transmission units preceding any one of the data transmission units in the packet stream in the transmission order and following the any one of the data transmission units in the decoding order, the parameter being provided to a decoder for determining buffering needs.
19. An apparatus for decoding an encoded image stream in a decoder, wherein the encoded image stream is received as data transmission units comprising media data, the data transmission units are ordered into a transmission order, the transmission order being at least partly different from a decoding order of the media data in the data transmission units, and the apparatus comprises means for buffering the data transmission units, characterized in that the apparatus further comprises means for indicating a buffering need to a decoding process as a parameter indicating a maximum number of data transmission units preceding any one of the data transmission units in the packet stream in the transmission order and following the any one of the data transmission units in the decoding order to be provided to the decoder for determining the buffering need.
Applications Claiming Priority (5)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US44869303P | 2003-02-18 | 2003-02-18 | |
| US60/448,693 | 2003-02-18 | ||
| US48315903P | 2003-06-27 | 2003-06-27 | |
| US60/483,159 | 2003-06-27 | ||
| PCT/FI2004/050016 WO2004075555A1 (en) | 2003-02-18 | 2004-02-17 | Picture coding method |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1088159A1 HK1088159A1 (en) | 2006-10-27 |
| HK1088159B true HK1088159B (en) | 2011-06-17 |