HK1133760B - Carriage of sei messages in rtp payload format - Google Patents
Description
Technical Field
The present invention relates generally to the field of scalable video coding. More particularly, the present invention relates to error resilience in H.264/Advanced Video Coding (AVC) and Scalable Video Coding (SVC).
Background
This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include principles that may be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
Video coding standards include ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual, and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC). In addition, efforts are currently underway to develop new video coding standards. One such standard is SVC, which will become a scalable extension to the H.264/AVC standard. Another standard under development is the multiview video coding standard (MVC), which is also an extension of H.264/AVC. Efforts are also underway to develop a Chinese video coding standard.
The latest draft of SVC is described in JVT-U201, "Joint Draft 8 of SVC Amendment", proposed at the 21st JVT meeting held in Hangzhou, China, in October 2006, and available from ftp3.itu.ch/av-arch/JVT-site/2006_10_Hangzhou/JVT-U201.zip. The latest draft of MVC is described in JVT-U209, "Joint Draft 1.0 on Multiview Video Coding", proposed at the same meeting and available from ftp3.itu.ch/av-arch/JVT-site/2006_10_Hangzhou/JVT-U209.zip. The entire contents of these two documents are incorporated herein by reference.
Scalable media is typically arranged in hierarchical layers of data. The base layer contains an independent representation of the coded media stream, such as a video sequence. The enhancement layers contain refinement data relative to the lower layers in the hierarchy. As enhancement layers are added to the base layer, the quality of the decoded media stream progressively improves. An enhancement layer enhances the temporal resolution (i.e., the frame rate), the spatial resolution, or simply the quality of the video content represented by another layer or part thereof. Each layer, together with all its dependent layers, is one representation of the video signal at a certain spatial resolution, temporal resolution, and quality level. Thus, the term "scalable layer representation" is used herein to describe a scalable layer together with all of its dependent layers. The portion of a scalable bitstream corresponding to a scalable layer representation can be extracted and decoded to produce a representation of the original signal at a certain fidelity.
The concepts of the Video Coding Layer (VCL) and the Network Abstraction Layer (NAL) are inherited from Advanced Video Coding (AVC). The VCL contains the signal-processing functionality of the codec, such as transform, quantization, motion-compensated prediction, loop filtering, and inter-layer prediction. A coded picture of the base layer or an enhancement layer consists of one or more slices. The NAL encapsulates each slice generated by the VCL into one or more NAL units. Each NAL unit comprises a NAL unit header and a NAL unit payload. The NAL unit header includes a NAL unit type indicating whether the NAL unit contains a coded slice, a coded slice data partition, a sequence or picture parameter set, or the like. A NAL unit stream is a concatenation of a number of NAL units. A coded bitstream according to H.264/AVC or its extensions (e.g., SVC) may be a NAL unit stream, or it may be a byte stream formed by prefixing each NAL unit in the NAL unit stream with a start code.
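The byte-stream formation described above can be sketched as follows; the NAL unit contents are placeholder bytes for illustration, not real coded slices, and the four-byte start code is the common Annex B convention.

```python
# Sketch: converting a NAL unit stream into a byte stream by
# prefixing each NAL unit with a start code (H.264/AVC Annex B style).

START_CODE = b"\x00\x00\x00\x01"

def to_byte_stream(nal_units):
    """Concatenate NAL units, prefixing each with a start code."""
    return b"".join(START_CODE + nalu for nalu in nal_units)

# Example: a parameter-set-like unit followed by a slice-like unit
# (placeholder bytes only).
nal_units = [b"\x67\x42", b"\x65\x88"]
byte_stream = to_byte_stream(nal_units)
assert byte_stream.startswith(START_CODE)
assert len(byte_stream) == sum(len(n) + 4 for n in nal_units)
```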
Each SVC layer is formed of NAL units, representing the coded video bits of that layer. A real-time transport protocol (RTP) stream carrying only one layer would carry NAL units belonging to that layer only. An RTP stream carrying a complete scalable video bitstream would carry NAL units for the base layer and one or more enhancement layers. SVC specifies the decoding order of these NAL units.
In some cases, the data in an enhancement layer can be truncated after a certain position, or even at arbitrary positions, where each truncation position may include additional data representing progressively enhanced visual quality. If the distance between truncation points is small, the scalability is said to be "fine-grained", hence the term fine-grained scalability (FGS). In contrast to FGS, the scalability provided by enhancement layers that can only be truncated at certain coarse positions is referred to as coarse-grained scalability (CGS).
According to the H.264/AVC video coding standard, an access unit comprises one primary coded picture. In some systems, detection of access unit boundaries can be simplified by inserting access unit delimiter NAL units into the bitstream. In SVC, an access unit may comprise multiple primary coded pictures, but at most one picture for each unique combination of dependency_id, temporal_level, and quality_level.
The encoded video bitstream may include additional information to enhance the use of the video for various purposes. For example, Supplemental Enhancement Information (SEI) and Video Usability Information (VUI), as defined in H.264/AVC, provide such functionality. The H.264/AVC standard and its extensions support SEI signaling through SEI messages. SEI messages are not required by the decoding process to generate correct sample values in output pictures. Rather, they are helpful for other purposes, e.g., error resilience and display. H.264/AVC contains the syntax and semantics of the specified SEI messages but does not define a process for handling the messages in the receiver. Thus, encoders are required to follow the H.264/AVC standard when creating SEI messages, while decoders conforming to the H.264/AVC standard are not required to process SEI messages for output order conformance. One of the reasons for including the syntax and semantics of SEI messages in H.264/AVC is to allow system specifications (such as the 3GPP multimedia specifications and the DVB specifications) to interpret the supplemental information identically and thus to interoperate. The intention is that a system specification may require the use of particular SEI messages at both the encoding end and the decoding end, and that the process for handling SEI messages in the receiving end may be specified for the application in the system specification.
SVC uses a mechanism similar to the one in H.264/AVC to provide hierarchical temporal scalability. In SVC, a certain set of reference and non-reference pictures can be dropped from the coded bitstream without affecting the decoding of the remaining bitstream. Hierarchical temporal scalability requires multiple reference frames for motion compensation, i.e., there is a reference picture buffer containing multiple decoded pictures from which the encoder can select a reference frame for inter prediction. In H.264/AVC, a feature called sub-sequences enables hierarchical temporal scalability, where each enhancement layer contains sub-sequences, and each sub-sequence contains a number of reference and/or non-reference pictures. A sub-sequence consists of a number of inter-dependent pictures that can be dropped without disturbing any other sub-sequence in any lower sub-sequence layer. The sub-sequence layers are hierarchically arranged based on their dependency on each other. Thus, when the sub-sequences of the highest enhancement layer are dropped, the remaining bitstream remains valid. In H.264/AVC, signaling of temporal scalability information is achieved by using sub-sequence-related Supplemental Enhancement Information (SEI) messages. In SVC, the temporal layer hierarchy is indicated in the header of a Network Abstraction Layer (NAL) unit.
In addition, SVC uses an inter-layer prediction mechanism, whereby certain information can be predicted from layers other than the currently reconstructed layer or the next lower layer. Information that can be inter-layer predicted includes intra texture, motion, and residual data. Inter-layer motion prediction also includes the prediction of block coding modes, header information, and the like, wherein motion information from a lower layer may be used to predict a higher layer. It is also possible to use intra coding in SVC, i.e., prediction from surrounding macroblocks or from co-located macroblocks of lower layers. Such prediction techniques do not use motion information and are hence referred to as intra prediction techniques. Furthermore, residual data from lower layers can also be employed to predict the current layer.
As described above, SVC involves the coding of a "base layer" with a certain minimum quality, as well as the coding of enhancement information that increases the quality up to a maximum level. The base layer of an SVC stream is typically compatible with Advanced Video Coding (AVC). In other words, an AVC decoder can decode the base layer of an SVC stream and ignore the SVC-specific data. This feature has been realized by specifying coded slice NAL unit types that are specific to SVC, are reserved for future use in AVC, and must be skipped according to the AVC specification.
An Instantaneous Decoding Refresh (IDR) picture of H.264/AVC contains only intra-coded slices and causes all reference pictures except the current picture to be marked as "unused for reference". A coded video sequence is defined as a sequence of consecutive access units in decoding order, from an IDR access unit (inclusive) to the next IDR access unit (exclusive) or to the end of the bitstream, whichever appears earlier. A group of pictures (GOP) in H.264/AVC denotes a number of pictures that are consecutive in decoding order, starting with an intra-coded picture and ending with the first picture of the next GOP in decoding order or the end of the coded video sequence (exclusive). All pictures of a GOP that follow the intra picture in output order can be correctly decoded, regardless of whether any previous pictures have been decoded. An open GOP is a group of pictures in which pictures preceding the initial intra picture in output order may not be correctly decodable. An H.264/AVC decoder can recognize an intra picture starting an open GOP from the recovery point SEI message in the H.264/AVC bitstream. A picture starting an open GOP is referred to herein as an Open Decoding Refresh (ODR) picture. A closed GOP is a group of pictures in which all pictures can be correctly decoded. In H.264/AVC, a closed GOP starts with an IDR access unit.
A coded picture may be represented by an index t10_pic_idx. The index t10_pic_idx applies to the NAL units in an SVC bitstream that belong to one access unit and have the same dependency_id and quality_level values, where temporal_level is equal to zero. For an IDR picture with temporal_level equal to zero, the value of t10_pic_idx is equal to zero, or to any value in the range of 0 to N-1, inclusive, where N is a positive integer. For any other picture with temporal_level equal to zero, the value of t10_pic_idx is equal to (t10_pic_idx_0 + 1) % N, where t10_pic_idx_0 is the value of t10_pic_idx of the previous picture with temporal_level equal to 0, and % represents the modulo operation. In the current SVC specification, t10_pic_idx is included as a conditional field in the NAL unit header. A receiver or MANE can check the values of t10_pic_idx to determine whether it has received all key pictures (i.e., pictures with temporal_level equal to 0). If a key picture is lost, feedback may be sent to inform the encoder, which may then take repair measures, e.g., retransmission of the lost key picture.
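The update rule and the loss check described above can be sketched as follows; the modulus N and the index sequences are illustrative assumptions, not values from a real bitstream.

```python
# Sketch of the t10_pic_idx update rule and key-picture loss detection.

N = 256  # assumed wrap-around modulus; any positive integer per the text

def next_index(prev_index):
    """t10_pic_idx of the next picture with temporal_level == 0."""
    return (prev_index + 1) % N

def detect_key_picture_loss(received_indices):
    """Return True if a gap between consecutive t10_pic_idx values
    indicates that a key picture (temporal_level == 0) was lost."""
    for prev, cur in zip(received_indices, received_indices[1:]):
        if cur != next_index(prev):
            return True
    return False

assert not detect_key_picture_loss([0, 1, 2, 3])
assert detect_key_picture_loss([0, 1, 3])       # index 2 missing
assert not detect_key_picture_loss([N - 1, 0])  # legal wrap-around
```

On detecting a gap, a receiver or MANE could then send the feedback mentioned above to request retransmission of the missing key picture.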
The RTP payload format for H.264/AVC is specified in Request for Comments (RFC) 3984 (available at www.rfc-editor.org/rfc/rfc3984.txt), while the draft RTP payload format for SVC is specified in the Internet Engineering Task Force (IETF) Internet draft draft-ietf-avt-rtp-svc-00 (available at tools.ietf.org/id/draft-ietf-avt-rtp-svc-00.txt).
RFC3984 specifies several modes of encapsulation, one of which is the interleaving mode. If interleaved packetization mode is being used, NAL units from more than one access unit can be packetized into one RTP packet. RFC3984 also specifies the concept of Decoding Order Number (DON), which indicates the decoding order of NAL units conveyed in an RTP stream.
In the SVC RTP payload format draft, a new NAL unit type is specified, referred to as the payload content scalability information (PACSI) NAL unit. The PACSI NAL unit, if present, is the first NAL unit in an aggregation packet, and it is not present in other types of packets. The PACSI NAL unit indicates scalability features that are common to all of the remaining NAL units in the payload, thus making it easier for a MANE to decide whether to forward/process/discard the aggregation packet. Senders may create PACSI NAL units, and receivers may ignore them or use them as hints to enable efficient processing of aggregation packets. When the first aggregation unit of an aggregation packet contains a PACSI NAL unit, there is at least one additional aggregation unit present in the same packet. The RTP header fields are set according to the remaining NAL units in the aggregation packet. When a PACSI NAL unit is included in a multi-time aggregation packet, its decoding order number is set to indicate that the PACSI NAL unit is the first NAL unit in decoding order among the NAL units in the aggregation packet, or that it has the same decoding order number as the first NAL unit in decoding order among the remaining NAL units in the aggregation packet. The structure of the PACSI NAL unit is the same as the four-byte SVC NAL unit header (with E equal to 0), as described below.
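The aggregation layout described above can be sketched as follows, in the spirit of the RFC 3984 aggregation format, where each aggregation unit consists of a 16-bit NAL unit size followed by the NAL unit itself. The PACSI and NAL unit bytes here are placeholders, not a real bitstream.

```python
# Sketch: building an aggregation payload whose first aggregation unit
# is a PACSI NAL unit, followed by at least one further NAL unit.
import struct

def build_aggregation_payload(pacsi, nal_units):
    """Each aggregation unit = 16-bit big-endian size + NAL unit bytes."""
    if not nal_units:
        raise ValueError("a PACSI NAL unit requires at least one "
                         "additional aggregation unit in the same packet")
    units = [pacsi] + nal_units
    return b"".join(struct.pack(">H", len(u)) + u for u in units)

pacsi = b"\x1e\x00\x00\x00"            # placeholder four-byte header-like unit
nalus = [b"\x65\x88\x80", b"\x41\x9a"]  # placeholder slice-like units
payload = build_aggregation_payload(pacsi, nalus)
# The first two bytes give the size of the PACSI NAL unit.
assert struct.unpack(">H", payload[:2])[0] == len(pacsi)
```

A MANE could read only the first aggregation unit to decide whether to forward or discard the whole packet.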
Disclosure of Invention
Various embodiments of the present invention provide a method to improve error resilience by conveying the temporal layer 0 picture index (such as t10_pic_idx) in an SEI message, rather than optionally including it in the NAL unit header. Additionally, a mechanism is provided for supporting the repetition of any SEI messages in real-time transport protocol (RTP) packets. Supporting such repetition of any SEI messages facilitates the detection of lost temporal layer 0 pictures based on any received packet.
Transmitting t10_pic_idx in an SEI message results in loss detection that is as straightforward and robust as transmitting t10_pic_idx in the NAL unit header. Furthermore, no changes need to be made to the NAL unit header or the slice header, nor do the semantics of t10_pic_idx need to be changed. In addition, implementing error resilience features such as those described herein does not affect the already-specified decoding processes of H.264/AVC or its current extensions.
Various embodiments provide a method, computer program product and apparatus for packaging an encoded bitstream representative of a video sequence, comprising: packetizing at least a portion of an encoded video sequence into a first packet, wherein the first packet includes information summarizing contents of the at least a portion of the encoded video sequence; and providing supplemental enhancement information associated with the at least a portion of the encoded video sequence in the first packet. Embodiments also provide a method, computer program product and apparatus for unpacking encoded video, comprising: at least a portion of an encoded video sequence is unpacked from a first packet, wherein the first packet includes information summarizing contents of at least a portion of the encoded video sequence. In addition, supplemental enhancement information associated with at least a portion of the encoded video sequence is obtained from the first packet.
Various embodiments provide a method, computer program product and apparatus for packetizing a temporally scalable bitstream representative of a sequence of images, the method comprising: packetizing at least a portion of the sequence of pictures into a first packet, wherein the first packet includes first information summarizing contents of the at least a portion of the encoded sequence of pictures and providing second information in the first packet indicating a decoding order of pictures in a lowest temporal layer of a temporal layer hierarchy. Still other embodiments provide a method, computer program product and apparatus for unpacking encoded video, comprising: at least a portion of an encoded image sequence is unpacked from a first packet, wherein the first packet includes first information summarizing contents of the at least a portion of the encoded image sequence. In addition, second information indicating a decoding order of pictures in a lowest temporal layer in the temporal layer hierarchy is obtained from the first packet.
These and other advantages and features of the invention, together with the manner in which the same is organized and operated, will become apparent from the following detailed description when taken in conjunction with the accompanying drawings, wherein like elements have like numerals throughout the several drawings described below.
Drawings
FIG. 1 illustrates a generic multimedia communication system for use with the present invention;
FIG. 2 is a perspective view of a mobile telephone that may be used in the implementation of the present invention;
FIG. 3 is a schematic diagram of the telephone circuitry of the mobile telephone of FIG. 2; and
FIG. 4 is a diagram of an exemplary temporally scalable bitstream.
Detailed Description
Fig. 1 shows a generic multimedia communication system for use with the present invention. As shown in fig. 1, a data source 100 provides a source signal in an analog format, an uncompressed digital format, or a compressed digital format, or any combination of these formats. The encoder 110 encodes the source signal into an encoded media bitstream. The encoder 110 may be capable of encoding more than one media type, such as audio and video, or more than one encoder 110 may be required to encode different media types of the source signal. The encoder 110 may also have synthetically produced input such as graphics and text, or it may be capable of producing coded bitstreams of synthetic media. In the following, only one coded media bitstream of one media type is considered to be processed in order to simplify the description. It should be noted, however, that typically a real-time broadcast service comprises several streams (typically at least one audio, video and text subtitle stream). It should also be noted that a system may include many encoders, but in the following, without loss of generality, only one encoder 110 is considered to simplify the description.
It should be understood that although the text and examples contained herein specifically describe an encoding process, those skilled in the art will readily appreciate that the same concepts and principles may also be applied to a corresponding decoding process, and vice versa.
The coded media bitstream is transmitted to storage device 120. The storage device 120 may include any type of mass storage to store the coded media bitstream. The format of the coded media bitstream in the storage device 120 may be an elementary self-contained bitstream format, or one or more coded bitstreams may be encapsulated into a container file. Some systems operate "live", i.e., omit the storage device and transfer the coded media bitstream directly from the encoder 110 to the sender 130. The coded media bitstream is then transmitted to the sender 130, also referred to as a server, as needed. The format used in the transmission may be an elementary self-contained bitstream format or a packet stream format, or one or more coded media bitstreams may be encapsulated into a container file. The encoder 110, the storage device 120, and the sender 130 may reside in the same physical device, or they may be included in separate devices. The encoder 110 and sender 130 may operate with live real-time content, in which case the coded media bitstream is typically not stored permanently, but rather buffered for small periods of time in the encoder 110 and/or the sender 130 to smooth out variations in processing delay, transmission delay, and coded media bitrate.
The sender 130 sends the coded media bitstream using a communication protocol stack. The stack may include, but is not limited to, RTP, User Datagram Protocol (UDP), and Internet Protocol (IP). When the communication protocol stack is packet-oriented, the sender 130 encapsulates the coded media bitstream into packets. For example, when RTP is used, the sender 130 encapsulates the coded media bitstream into RTP packets according to an RTP payload format. Typically, each media type has a dedicated RTP payload format. It is again noted that the system may contain more than one sender 130, but for simplicity, the following description only considers one sender 130.
The sender 130 may or may not be connected to the gateway 140 through a communication network. The gateway 140 may perform different types of functions such as translating a packet stream according to one communication protocol stack to another communication protocol stack, merging and forking data streams, and manipulating data streams according to the downlink and/or receiver capabilities, such as controlling the bit rate of the forwarded bit stream according to prevailing downlink network conditions. Examples of gateways 140 include multipoint conference control units (MCUs), gateways between circuit-switched and packet-switched video telephony, push-to-talk over cellular (PoC) servers, IP encapsulators in digital video broadcasting-handheld (DVB-H) systems, or set-top boxes that forward local broadcast transmissions to home wireless networks. When RTP is used, gateway 140 is referred to as an RTP mixer and serves as an endpoint of the RTP connection.
The system includes one or more receivers 150 that are generally capable of receiving, demodulating, and decapsulating the transmitted signal into a coded media bitstream. The coded media bitstream is typically processed further by a decoder 160, whose output is one or more uncompressed media streams. Finally, a renderer 170 may reproduce the uncompressed media streams, for example, through a loudspeaker or a display. The receiver 150, the decoder 160, and the renderer 170 may reside in the same physical device, or they may be included in separate devices. It should be noted that the bitstream to be decoded may be received from a remote device located within virtually any type of network. In addition, the bitstream may be received from local hardware or software.
Scalability in terms of bit rate, decoding complexity and picture size is a desirable property for heterogeneous and error-prone environments. This property is desirable to combat limitations such as limitations on bit rate, display resolution, network throughput, and computational power in the receiving device.
The communications devices of the present invention may communicate using various transmission techniques including, but not limited to, Code Division Multiple Access (CDMA), global system for mobile communications (GSM), Universal Mobile Telecommunications System (UMTS), Time Division Multiple Access (TDMA), Frequency Division Multiple Access (FDMA), transmission control protocol/internet protocol (TCP/IP), Short Messaging Service (SMS), Multimedia Messaging Service (MMS), e-mail, Instant Messaging Service (IMS), bluetooth, IEEE 802.11, and the like. A communication device may communicate using various media including, but not limited to, radio, infrared, laser, cable connection, and the like.
Fig. 2 and 3 illustrate one representative communication device 12 within which the present invention may be implemented. It should be understood, however, that the present invention is not intended to be limited to one particular type of mobile device 12 or other electronic device. Some or all of the features depicted in fig. 2 and 3 may be incorporated into any or all of the devices represented in fig. 1.
The communication device 12 of figures 2 and 3 includes a housing 30, a display 32 in the form of a liquid crystal display, a keypad 34, a microphone 36, an ear-piece 38, a battery 40, an infrared port 42, an antenna 44, a smart card 46 in the form of a UICC according to one embodiment of the invention, a card reader 48, radio interface circuitry 52, codec circuitry 54, a controller 56 and a memory 58. Individual circuits and elements are all of a type well known in the art, for example in the Nokia range of mobile telephones.
Fig. 4 shows an exemplary bitstream with four temporal layers and the value of t10_pic_idx for each picture. Temporal layer 0 is referred to as the lowest temporal layer in the layer hierarchy. According to the semantics of t10_pic_idx, if a picture has a temporal layer equal to 0, then t10_pic_idx is the temporal layer 0 index of the picture itself. Thus, the t10_pic_idx values of the pictures with Picture Order Count (POC) equal to 0, 8, and 16 are equal to 0, 1, and 2, respectively. If the temporal layer of a picture is greater than 0, then t10_pic_idx is the temporal layer 0 index of the preceding picture with temporal layer equal to 0 in decoding order. Thus, the t10_pic_idx values of the pictures with POC equal to 1 to 7 are all equal to 1, since for them the preceding picture with temporal layer 0 in decoding order is the picture with POC equal to 8, while the t10_pic_idx values of the pictures with POC equal to 9 to 15 are all equal to 2, since for them the preceding picture with temporal layer 0 in decoding order is the picture with POC equal to 16.
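The index assignment of Fig. 4 can be sketched as follows. The pictures are listed in decoding order as (POC, temporal_layer) pairs for a four-layer hierarchy with a GOP size of 8; the particular hierarchical decoding order used here is an assumption for illustration, as the figure itself is not reproduced.

```python
# Sketch: assigning t10_pic_idx to each picture given the decoding
# order, per the semantics described in the text.

def assign_t10_pic_idx(decoding_order):
    """Map each POC to its t10_pic_idx value."""
    idx, current = {}, -1
    for poc, layer in decoding_order:
        if layer == 0:
            current += 1   # index of the temporal layer 0 picture itself
        idx[poc] = current  # otherwise: index of the preceding layer-0 picture
    return idx

# Assumed hierarchical-B decoding order for POC 0..16.
decoding_order = [(0, 0), (8, 0), (4, 1), (2, 2), (1, 3), (3, 3),
                  (6, 2), (5, 3), (7, 3), (16, 0), (12, 1), (10, 2),
                  (9, 3), (11, 3), (14, 2), (13, 3), (15, 3)]
idx = assign_t10_pic_idx(decoding_order)
assert idx[0] == 0 and idx[8] == 1 and idx[16] == 2
assert all(idx[poc] == 1 for poc in range(1, 8))
assert all(idx[poc] == 2 for poc in range(9, 16))
```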
Various embodiments of the present invention provide a method to improve error resilience by conveying the temporal layer 0 picture index (such as t10_pic_idx) in an SEI message, rather than optionally including it in the NAL unit header. Additionally, a mechanism is provided for supporting the repetition of any SEI messages in real-time transport protocol (RTP) packets. Supporting such repetition of any SEI messages facilitates the detection of lost temporal layer 0 pictures based on any received packet.
A field representing the index t10_pic_idx may be included in a new SEI message, which may be associated with each coded picture whose temporal_level is equal to 0, or with each coded picture of any temporal_level value. The new SEI message may be referred to, for example, as the t10 picture index SEI message, and may be specified as follows:
| t10_picture_index( payloadSize ) { | C | Descriptor |
| t10_pic_idx | 5 | u(8) |
| } | | |
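The syntax table above can be serialized as an H.264/AVC-style SEI NAL unit: a NAL header byte with nal_unit_type 6, an SEI payload type byte, a payload size byte, the single u(8) index, and the RBSP trailing bits. The payload type number used below is a hypothetical placeholder, since no value is assigned in the text.

```python
# Sketch: encoding the t10 picture index SEI message as an SEI NAL unit.

T10_PIC_IDX_PAYLOAD_TYPE = 53  # hypothetical payload type, for illustration

def encode_t10_picture_index_sei(t10_pic_idx):
    nal_header = bytes([0x06])             # nal_unit_type = 6 (SEI)
    payload = bytes([t10_pic_idx & 0xFF])  # t10_pic_idx, u(8)
    sei = bytes([T10_PIC_IDX_PAYLOAD_TYPE, len(payload)]) + payload
    return nal_header + sei + bytes([0x80])  # rbsp_trailing_bits

nalu = encode_t10_picture_index_sei(5)
assert nalu == bytes([0x06, 53, 0x01, 0x05, 0x80])
```

Note that a full implementation would also need the variable-length payload type/size coding and emulation prevention required by the standard; both are omitted here since the values involved are small.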
Compared to transmitting t10_pic_idx in the NAL unit header, transmitting t10_pic_idx in the new SEI message yields temporal layer 0 picture loss detection that is equally straightforward and robust. Also, no changes need to be made to the NAL unit header or the slice header, nor do the semantics of t10_pic_idx need to be changed. In addition, implementing error resilience features such as those described herein does not affect the already-specified decoding processes of H.264/AVC or its current extensions. In fact, error resilience features similar to t10_pic_idx, such as the sub-sequence information SEI message, which also includes a frame count, have previously been carried as SEI messages rather than in high-level syntax structures such as NAL unit headers and slice headers. Thus, this method of conveying the temporal layer 0 picture index is consistent with the other conventional error resilience features of H.264/AVC.
In addition, the payload content scalability information (PACSI) NAL unit may be modified to include the new SEI message. Currently, a PACSI NAL unit, if present, is the first NAL unit in a packet and contains an SVC NAL unit header that summarizes the packet's content. The payload of a PACSI NAL unit is empty. The NAL unit type of the PACSI NAL unit is selected from the values not specified in the SVC specification and the H.264/AVC RTP payload specification, with the result that the PACSI NAL unit is ignored by H.264/AVC or SVC decoders and H.264/AVC RTP receivers.
Assuming SEI NAL units are allowed in the PACSI NAL unit payload, any SEI NAL unit in the PACSI NAL unit payload may be used to repeat the SEI NAL units of the access unit of the first NAL unit following (rather than nested in) the PACSI NAL unit. In addition, a PACSI NAL unit may include decoded reference picture marking repetition SEI messages and other NAL units that may occur before the first VCL NAL unit in an access unit. This enables detection of the long-term picture index assigned to the first temporal layer 0 picture in decoding order. It should be noted that any additional bit rate overhead incurred by transmitting t10_pic_idx in the new SEI message is negligible.
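The repetition mechanism above can be sketched as follows: the SEI NAL units of the access unit following the PACSI NAL unit are copied into the PACSI payload, which is otherwise empty. The NAL unit bytes are placeholders; the type test (low five header bits equal to 6) follows the H.264/AVC NAL unit header layout.

```python
# Sketch: repeating an access unit's SEI NAL units inside the PACSI
# NAL unit payload so they survive the loss of other packets.

def is_sei(nalu):
    """True if the NAL unit header byte indicates nal_unit_type 6 (SEI)."""
    return (nalu[0] & 0x1F) == 6

def build_pacsi_with_sei(pacsi_header, access_unit):
    """Append copies of the access unit's SEI NAL units to the PACSI
    NAL unit header (whose payload is otherwise empty)."""
    repeats = [n for n in access_unit if is_sei(n)]
    return pacsi_header + b"".join(repeats)

# Placeholder SEI-like and slice-like NAL units.
access_unit = [b"\x06\x05\x01\x07\x80", b"\x65\x88"]
pacsi = build_pacsi_with_sei(b"\x1e\x00\x00\x00", access_unit)
assert pacsi.endswith(b"\x06\x05\x01\x07\x80")
```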
As described above, when the interleaved packetization mode is used, a PACSI NAL unit may contain the SEI message of only the first NAL unit of the RTP payload. Therefore, according to another embodiment of the present invention, a PACSI NAL unit does not encapsulate the new SEI message as such, but rather encapsulates pairs of an SEI NAL unit plus a Decoding Order Number (DON) or DON difference, any other picture identifier, or any other identifier of a NAL unit within the RTP payload, such as the sequence number of the NAL unit within the payload.
According to yet another embodiment of the present invention, a new NAL unit type may be specified in the RTP payload specification, which may be referred to as Interleaved PACSI (IPACSI). This NAL unit can be inserted before any AVC/SVC NAL unit in the RTP payload. In addition, the payload of an IPACSI NAL unit can include repetitions of the SEI NAL units of the access unit to which the AVC/SVC NAL unit belongs.
It should be noted that various embodiments of the present invention do not associate dependency_id and/or quality_level with the t10_pic_idx SEI message, because the t10_pic_idx SEI message may be used in a scalable nesting SEI message when dependency_id > 0 or quality_level > 0. Thus, more than one use of the scalable nesting SEI message is possible, although the parsing process in a Media Aware Network Element (MANE) becomes somewhat more complex. Alternatively, a loop over the different values of dependency_id and quality_level may be implemented in the t10_pic_idx SEI message itself.
It should be noted that problems other than the one addressed here by t10_pic_idx exist. For example, when a temporal layer 1 picture uses more than one temporal layer 0 picture as a prediction reference, t10_pic_idx may not reliably indicate whether the temporal layer 1 picture can be decoded. Therefore, other approaches to the problem addressed by t10_pic_idx may be taken. For example, using different long-term indices in subsequent temporal layer 0 pictures makes it less likely that a picture assigned a particular long-term index will be erroneously referenced. Furthermore, the reference pictures actually used, including long-term ones, may be inferred from the slice header when the reference picture list reordering commands are used. Still alternatively, sub-sequence SEI messages may be used, where sub-sequence layer numbers and sub-sequence identifiers can be used intelligently to infer where sub-sequence layer losses occurred. In some prediction structures, short-term reference pictures may be utilized instead of long-term reference pictures. In yet another alternative, the "transport" layer may provide the basis for solving the conventional t10_pic_idx problem, e.g., with negative acknowledgement (NACK) packets of the RTP audio-visual profile with feedback (AVPF), where a NACK packet may be transmitted whenever a potential loss of a temporal layer 0 picture is detected.
The present invention is described in the general context of method steps, which may be implemented in one embodiment by a program product including computer-executable instructions, such as program code, executed by computers in networked environments. A computer-readable medium may include removable or non-removable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), Compact Discs (CDs), Digital Versatile Discs (DVDs), and the like. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps.
Software and web implementations of the present invention could be accomplished with standard programming techniques with rule-based logic and other logic to accomplish the various database searching steps, correlation steps, comparison steps and decision steps. It should also be noted that the words "component" and "module," as used herein and in the claims, are intended to encompass implementations using one or more lines of software code, and/or hardware implementations, and/or equipment for receiving manual inputs.
The foregoing description of the implementation of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. The embodiments were chosen and described in order to explain the principles of the invention and its practical application to enable one skilled in the art to utilize the invention in various embodiments and with various modifications as are suited to the particular use contemplated. The features of the embodiments described herein may be incorporated in all possible embodiments of methods, apparatus, computer program products and systems.
Claims (20)
1. A method of packetizing an encoded bitstream representative of a video sequence, the method comprising:
packetizing at least a portion of an encoded video sequence into a first packet, wherein the first packet includes information summarizing contents of the at least a portion of the encoded video sequence; and
providing supplemental enhancement information associated with the at least a portion of the encoded video sequence in the first packet, wherein the information summarizing the content of the at least a portion of the encoded video sequence is placed in a data unit that precedes all other data units within the first packet, and wherein the data unit further includes the supplemental enhancement information.
2. The method of claim 1, wherein the supplemental enhancement information is included in a payload content scalability information network abstraction layer unit in the first packet, and wherein a payload portion of the payload content scalability information network abstraction layer unit is used to repeat supplemental enhancement information units of the access unit to which the first network abstraction layer unit in the payload belongs.
3. The method of claim 1, wherein the supplemental enhancement information is included in a payload content scalability information network abstraction layer unit in the first packet, and wherein a payload portion of the payload content scalability information network abstraction layer unit includes a reference picture marking repetition supplemental enhancement information message.
4. The method of claim 1, further comprising:
providing, in the first packet, first information indicating a decoding order of pictures within a lowest temporal layer in a temporal layer hierarchy.
5. The method of claim 4, wherein the first information comprises a temporal layer picture index.
6. An apparatus for packetizing an encoded bitstream representative of a video sequence, the apparatus comprising:
means for packetizing at least a portion of an encoded video sequence into a first packet, wherein the first packet includes information summarizing contents of the at least a portion of the encoded video sequence; and
means for providing supplemental enhancement information associated with the at least a portion of the encoded video sequence in the first packet, wherein the information summarizing the content of the at least a portion of the encoded video sequence is placed in a data unit that precedes all other data units within the first packet, and wherein the data unit further includes the supplemental enhancement information.
7. The apparatus of claim 6, wherein the supplemental enhancement information is included in a payload content scalability information network abstraction layer unit in the first packet, and wherein a payload portion of the payload content scalability information network abstraction layer unit is used to repeat supplemental enhancement information units of the access unit to which the first network abstraction layer unit in the payload belongs.
8. The apparatus of claim 6, wherein the supplemental enhancement information is included in a payload content scalability information network abstraction layer unit in the first packet, and wherein a payload portion of the payload content scalability information network abstraction layer unit includes a reference picture marking repetition supplemental enhancement information message.
9. The apparatus of claim 6, further comprising:
means for providing, in the first packet, first information indicating a decoding order of pictures within a lowest temporal layer in a temporal layer hierarchy.
10. The apparatus of claim 9, wherein the first information comprises a temporal layer picture index.
11. A method for unpacking encoded video, comprising:
unpacking at least a portion of an encoded video sequence from a first packet, wherein the first packet includes information summarizing the content of at least a portion of the encoded video sequence; and
obtaining supplemental enhancement information associated with the at least a portion of the encoded video sequence from the first packet, wherein the information summarizing the content of the at least a portion of the encoded video sequence is in a data unit that precedes all other data units within the first packet, and wherein the data unit further includes the supplemental enhancement information.
12. The method of claim 11, wherein the supplemental enhancement information is included in a payload content scalability information network abstraction layer unit in the first packet, and wherein a payload portion of the payload content scalability information network abstraction layer unit is used to repeat supplemental enhancement information units of the access unit to which the first network abstraction layer unit in the payload belongs.
13. The method of claim 11, wherein the supplemental enhancement information is included in a payload content scalability information network abstraction layer unit in the first packet, and wherein a payload portion of the payload content scalability information network abstraction layer unit includes a reference picture marking repetition supplemental enhancement information message.
14. The method of claim 11, further comprising:
obtaining, from the first packet, first information indicating a decoding order of pictures within a lowest temporal layer in a temporal layer hierarchy.
15. The method of claim 14, wherein the first information comprises a temporal layer picture index.
16. An apparatus for unpacking encoded video, comprising:
means for unpacking at least a portion of an encoded video sequence from a first packet, wherein the first packet includes information summarizing the content of at least a portion of the encoded video sequence; and
means for obtaining supplemental enhancement information associated with the at least a portion of the encoded video sequence from the first packet, wherein the information summarizing the content of the at least a portion of the encoded video sequence is placed in a data unit that precedes all other data units within the first packet, and wherein the data unit further includes the supplemental enhancement information.
17. The apparatus of claim 16, wherein the supplemental enhancement information is included in a payload content scalability information network abstraction layer unit in the first packet, and wherein a payload portion of the payload content scalability information network abstraction layer unit is used to repeat supplemental enhancement information units of the access unit to which the first network abstraction layer unit in the payload belongs.
18. The apparatus of claim 16, wherein the supplemental enhancement information is included in a payload content scalability information network abstraction layer unit in the first packet, and wherein a payload portion of the payload content scalability information network abstraction layer unit includes a reference picture marking repetition supplemental enhancement information message.
19. The apparatus of claim 16, further comprising:
means for obtaining, from the first packet, first information indicating a decoding order of pictures within a lowest temporal layer in a temporal layer hierarchy.
20. The apparatus of claim 19, wherein the first information comprises a temporal layer picture index.
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US88560607P | 2007-01-18 | 2007-01-18 | |
| US60/885,606 | 2007-01-18 | | |
| PCT/IB2008/050174 WO2008087602A1 (en) | 2007-01-18 | 2008-01-17 | Carriage of sei messages in rtp payload format |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1133760A1 (en) | 2010-04-01 |
| HK1133760B (en) | 2012-09-28 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10110924B2 (en) | 2018-10-23 | Carriage of SEI messages in RTP payload format |
| US8699583B2 (en) | | Scalable video coding and decoding |
| US8767836B2 (en) | | Picture delimiter in scalable video coding |
| KR101091792B1 (en) | | Feedback based scalable video coding |
| EP2119187B1 (en) | | Backward-compatible characterization of aggregated media data units |
| HK1133760B (en) | | Carriage of sei messages in rtp payload format |
| HK1237172A1 (en) | | Carriage of sei message in rtp payload format |
| HK1237172B (en) | | Carriage of sei message in rtp payload format |
| HK1136126B (en) | | Scalable video coding and decoding |