HK40009749B - Signaling important video information in network video streaming using MIME type parameters
Description
This application claims the benefit of U.S. Provisional Application No. 62/477,350, filed March 27, 2017, the entire contents of which are hereby incorporated by reference.
Technical Field
The present invention relates to the delivery of encoded media data.
Background
Digital video capabilities can be incorporated into a wide range of devices, including digital televisions, digital direct broadcast systems, wireless broadcast systems, personal digital assistants (PDAs), laptop or desktop computers, digital cameras, digital recording devices, digital media players, video gaming devices, video game consoles, cellular or satellite radio telephones, video teleconferencing devices, and the like. Digital video devices implement video compression techniques, such as those described in the standards defined by MPEG-2, MPEG-4, ITU-T H.263, ITU-T H.264/MPEG-4 Part 10 "Advanced Video Coding (AVC)", and ITU-T H.265 (also referred to as High Efficiency Video Coding (HEVC)), and extensions of such standards, to transmit and receive digital video information more efficiently.
After video data (and other media data) has been encoded, the video data may be packetized for transmission or storage. The video data may be assembled into a video file conforming to any of a variety of standards, such as the International Organization for Standardization (ISO) base media file format and extensions thereof, such as the AVC file format.
Disclosure of Invention
In general, this disclosure describes techniques for signaling important video information about: high dynamic range (HDR) and wide color gamut (WCG) video, virtual reality (VR)/omnidirectional/360-degree video, frame-packed video, video with display orientation changes, video stored using the constrained scheme feature of the ISO base media file format (ISOBMFF), and video using other features that require dedicated post-decoding rendering processing to provide the desired visual experience. In particular, various example MIME type parameters are described that can expose this important video information in a high-level system signaling message body, such as a dynamic adaptive streaming over HTTP (DASH) Media Presentation Description (MPD) file (or another such manifest file), so that the important video information can be conveniently accessed by an application client, such as a DASH client, to make content rejection/selection/acceptance/request decisions. That is, a DASH client may use this information to select an appropriate set of media data (e.g., an appropriate DASH representation), where media data may be deemed "appropriate" when the client device is capable of decoding and rendering it (e.g., when the client device includes a video decoder capable of decoding the media data included in the DASH representation).
In one example, a method of retrieving media data includes: retrieving a manifest file that specifies data for at least one representation of a media presentation, wherein the manifest file includes data that specifies one or more codecs for the at least one representation; extracting the data specifying the one or more codecs from the manifest file, the extracting including: extracting a first element representing a sample entry type code for a track of the at least one representation, wherein the first element represents that the track contains video data stored using a constrained scheme; and extracting a second element representing a constrained scheme type code for the constrained scheme for the track; and retrieving the media data of the at least one representation based on the first element and the second element.
In another example, a device for retrieving media data includes a memory configured to store media data; and one or more processors implemented in the circuitry and configured to: retrieving a manifest file that specifies data for at least one representation of a media presentation, wherein the manifest file includes data that specifies one or more codecs for the at least one representation; extracting, from the manifest file, the data specifying the one or more codecs, the data including a first element representing a sample entry type code for a track of the at least one representation and a second element representing a constrained scheme type code for a constrained scheme for the track, wherein the first element represents that the track includes video data stored using the constrained scheme; and retrieving media data of the at least one representation based on the first element and the second element.
In another example, a device for retrieving media data includes: means for retrieving a manifest file that specifies data for at least one representation of a media presentation, wherein the manifest file includes data that specifies one or more codecs for the at least one representation; means for extracting the data specifying the one or more codecs from the manifest file, the means comprising: means for extracting a first element representing a sample entry type code for a track of the at least one representation, wherein the first element represents that the track contains video data stored using a constrained scheme; and means for extracting a second element representing a constrained scheme type code for the constrained scheme for the track; and means for retrieving media data of the at least one representation based on the first element and the second element.
In another example, a computer-readable storage medium, such as a non-transitory computer-readable storage medium, has stored thereon instructions that, when executed, cause a processor to: retrieve a manifest file that specifies data for at least one representation of a media presentation, wherein the manifest file includes data that specifies one or more codecs for the at least one representation; extract, from the manifest file, the data specifying the one or more codecs, the instructions including instructions that cause the processor to: extract a first element representing a sample entry type code for a track of the at least one representation, wherein the first element represents that the track contains video data stored using a constrained scheme; and extract a second element representing a constrained scheme type code for the constrained scheme for the track; and retrieve media data of the at least one representation based on the first element and the second element.
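To illustrate the retrieval flow summarized above, the following Python sketch parses an @codecs value whose first element is taken to be the sample entry type code and, when that code signals a constrained scheme ("resv"), whose second element is taken to be the constrained scheme type code. The element layout, the concrete codec strings, and the set of supported schemes are illustrative assumptions, not a normative format.

```python
# Hypothetical sketch: decide whether to retrieve a representation based on
# the first element (sample entry type code) and second element (constrained
# scheme type code) of its "codecs" value.

SUPPORTED_SCHEMES = {"stvi"}   # e.g. stereo (frame-packed) video this client can render

def should_retrieve(codecs_value: str) -> bool:
    for per_track_value in codecs_value.split(","):
        elements = per_track_value.strip().split(".")
        sample_entry_code = elements[0]
        if sample_entry_code == "resv":
            # Track holds video stored with a constrained scheme; the second
            # element is taken to be the scheme type code.
            if len(elements) < 2 or elements[1] not in SUPPORTED_SCHEMES:
                return False   # client cannot render this constrained scheme
        # Normal codec capability checks for unrestricted tracks omitted.
    return True

print(should_retrieve("resv.stvi.hvc1.1.6.L93.B0"))  # True  (supported scheme)
print(should_retrieve("resv.odvd.hvc1.1.6.L93.B0"))  # False (unsupported scheme)
```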
The details of one or more examples are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
Drawings
Fig. 1 is a block diagram illustrating an example system that implements techniques for streaming media data over a network.
FIG. 2 is a block diagram illustrating an example set of components of the retrieval unit of FIG. 1 in more detail.
FIG. 3 is a conceptual diagram illustrating elements of example multimedia content.
FIG. 4 is a block diagram illustrating elements of an example video file, which may correspond to a segment of a representation.
FIG. 5 is a flow diagram illustrating an example method in accordance with the techniques of this disclosure.
Detailed Description
In general, this disclosure describes techniques for signaling important video information about: high dynamic range (HDR) and wide color gamut (WCG) video, virtual reality (VR)/omnidirectional/360-degree video, frame-packed video, video with display orientation changes, video stored using the constrained scheme feature of the ISO base media file format (ISOBMFF), and video using other features that require dedicated post-decoding rendering processing to provide the desired visual experience. In particular, various example MIME type parameters are described that can expose such important video information in a high-level system signaling message body, such as a dynamic adaptive streaming over HTTP (DASH) Media Presentation Description (MPD) file (or another such manifest file), such that the important video information can be readily accessed by application clients, such as DASH clients, to make content rejection/selection/acceptance/request decisions. That is, a DASH client may use this information to select an appropriate set of media data (e.g., an appropriate DASH representation), where media data may be deemed "appropriate" when the client device is capable of decoding and rendering it (e.g., when the client device includes a video decoder capable of decoding the media data included in the DASH representation).
For example, this disclosure describes several example methods of signaling important video information about video stored using constrained schemes, HDR/WCG video, VR/omnidirectional/360 video, frame-packed video, and video with display orientation changes, such that the important video information is conveniently accessible by application clients, such as DASH clients, to make content rejection/selection/acceptance/request decisions. One or more of these methods may be performed independently or in any combination.
In the context of this document, "important video information" includes video information that is available for content selection, e.g., selecting a video track or portion thereof for consumption.
Video coding standards include ITU-T H.261; ISO/IEC MPEG-1 Visual; ITU-T H.262 or ISO/IEC MPEG-2 Visual; ITU-T H.263; ISO/IEC MPEG-4 Visual; ITU-T H.264 or ISO/IEC MPEG-4 AVC, including its Scalable Video Coding (SVC) and Multiview Video Coding (MVC) extensions; and High Efficiency Video Coding (HEVC), also known as ITU-T H.265 and ISO/IEC 23008-2, including its scalable coding extension (i.e., Scalable High Efficiency Video Coding, SHVC) and multiview extension (i.e., Multiview High Efficiency Video Coding, MV-HEVC).
Both AVC and HEVC support frame-packed video, indicated by the frame packing arrangement SEI message. HEVC additionally supports a different type of frame-packed video, indicated by the segmented rectangular frame packing arrangement SEI message. For such frame-packed video, the decoder side should apply a dedicated unpacking transform to separate the two constituent views represented in the video bitstream before display.
AVC and HEVC also support video content indicated by display orientation SEI messages for which the decoder side should apply a rotation and/or flip transform to the cropped decoded pictures prior to display. This video is also referred to as video with a change in display orientation.
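As a minimal illustration of the post-decoding transform just described, the Python sketch below applies a flip and/or rotation to a decoded picture represented as a list of rows. The parameter names are simplified stand-ins for the fields of the display orientation SEI message, not the actual syntax element names.

```python
# Illustrative only: apply the rotation/flip signalled for a decoded picture
# before display. The SEI message's actual syntax (e.g. rotation expressed in
# fractional units of a full turn) is not reproduced here.

def apply_display_orientation(picture, hor_flip=False, ver_flip=False,
                              rotation_degrees=0):
    if hor_flip:
        picture = [list(reversed(row)) for row in picture]
    if ver_flip:
        picture = list(reversed(picture))
    for _ in range((rotation_degrees // 90) % 4):
        picture = [list(row) for row in zip(*picture[::-1])]  # one 90-degree turn
    return picture

pic = [[1, 2],
       [3, 4]]
print(apply_display_orientation(pic, rotation_degrees=90))  # [[3, 1], [4, 2]]
```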
The techniques of this disclosure may be applied to video files containing video data encapsulated according to any of: the ISO base media file format, the Scalable Video Coding (SVC) file format, the Advanced Video Coding (AVC) file format, the Third Generation Partnership Project (3GPP) file format, and/or the Multiview Video Coding (MVC) file format, or other similar video file formats.
File format standards include the ISO base media file format (ISOBMFF, ISO/IEC 14496-12) and other formats derived from ISOBMFF, including the MPEG-4 file format (ISO/IEC 14496-14), the 3GPP file format (3GPP TS 26.244), and the file format for the AVC and HEVC families of video codecs (ISO/IEC 14496-15). Draft texts of ISO/IEC 14496-12 and 14496-15 are available at http://phenix.int-evry.fr/mpeg/doc_end_user/documents/111_Geneva/wg11/w15177-v6-w15177.zip and http://wg11.sc29.org/doc_end_user/documents/115_Geneva/wg11/w16169-v2-w16169.zip, respectively.
ISOBMFF is used as the basis for many codec encapsulation formats, such as the AVC file format, and many multimedia container formats, such as the MPEG-4 file format, the 3GPP file format (3GP), and the DVB file format.
In addition to continuous media such as audio and video, static media such as images, as well as metadata, may be stored in a file that conforms to ISOBMFF. Files structured according to ISOBMFF may be used for many purposes, including local media file playback, progressive downloading of remote files, segments for dynamic adaptive streaming over HTTP (DASH), containers for content to be streamed and its packetization instructions, and recording of received real-time media streams.
A box is the basic syntax structure in ISOBMFF, consisting of a four-character coded box type, the byte count of the box, and the payload. An ISOBMFF file consists of a sequence of boxes, and boxes may contain other boxes. The movie box ("moov") contains the metadata for the continuous media streams present in the file, each media stream being represented in the file as a track. The metadata for a track is enclosed in a track box ("trak"), while the media content of the track is either enclosed in a media data box ("mdat") or directly in a separate file. The media content for a track consists of a sequence of samples, e.g., audio or video access units.
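A short Python sketch of that structure follows; it walks the top-level boxes of an ISOBMFF file by reading the 32-bit size and four-character type that begin every box. It handles only the 64-bit "largesize" variant and skips box payloads rather than descending into nested boxes; the file path in the usage comment is a placeholder.

```python
# Minimal top-level box walker for an ISOBMFF file. Each box starts with a
# big-endian 32-bit size and a four-character type; size == 1 means a 64-bit
# "largesize" follows. (size == 0, meaning "extends to end of file", and
# descending into nested boxes are omitted for brevity.)

import struct

def iter_top_level_boxes(path):
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, box_type = struct.unpack(">I4s", header)
            header_len = 8
            if size == 1:
                size = struct.unpack(">Q", f.read(8))[0]
                header_len = 16
            yield box_type.decode("ascii", errors="replace"), size
            f.seek(size - header_len, 1)   # skip the payload

# Usage (path is a placeholder):
# for box_type, size in iter_top_level_boxes("example.mp4"):
#     print(box_type, size)
```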
ISOBMFF specifies the following types of tracks: media tracks, which contain an elementary media stream; hint tracks, which either contain media transmission instructions or represent a received packet stream; and timed metadata tracks, which comprise time-synchronized metadata.
Although originally designed for storage, ISOBMFF has proven to be extremely valuable for streaming, e.g., for progressive download or DASH. For streaming purposes, movie fragments as defined in ISOBMFF may be used.
The metadata for each track includes a list of sample description entries, each entry providing the coding or encapsulation format used in the track and initialization data needed to process that format. Each sample is associated with one of the sample description entries of the track.
ISOBMFF enables the specification of sample-specific metadata through various mechanisms. Specific boxes within the sample table box ("stbl") have been standardized to respond to common needs. For example, the sync sample box ("stss") is used to list the random access samples of the track. The sample grouping mechanism enables mapping of samples, according to a four-character grouping type, into groups of samples sharing the same property, specified as a sample group description entry in the file. Several grouping types have been specified in ISOBMFF.
High dynamic range (HDR) and wide color gamut (WCG) information may be signaled using the ColourInformationBox defined in clause 12.1.5 of the ISOBMFF specification. For example, colour_type may be set equal to "nclx", which indicates that the most important HDR/WCG information is carried in the fields colour_primaries, transfer_characteristics, matrix_coefficients, and full_range_flag.
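The sketch below shows one way a parser might read those fields from the payload of a ColourInformationBox ("colr") when colour_type is "nclx": three 16-bit code points followed by a byte whose most significant bit is full_range_flag. Treat it as an illustrative reading of the clause cited above rather than a normative parser.

```python
# Sketch of reading a 'colr' payload when colour_type is 'nclx'.

import struct

def parse_colr_payload(payload: bytes):
    colour_type = payload[:4].decode("ascii")
    if colour_type != "nclx":
        return {"colour_type": colour_type}       # e.g. ICC-profile based types
    primaries, transfer, matrix, flags = struct.unpack(">HHHB", payload[4:11])
    return {
        "colour_type": colour_type,
        "colour_primaries": primaries,            # e.g. 9  -> BT.2020 primaries (WCG)
        "transfer_characteristics": transfer,     # e.g. 16 -> PQ transfer function (HDR)
        "matrix_coefficients": matrix,
        "full_range_flag": bool(flags >> 7),
    }

# Example payload: 'nclx', primaries 9, transfer 16, matrix 9, full range 0
print(parse_colr_payload(b"nclx" + struct.pack(">HHHB", 9, 16, 9, 0)))
```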
ISOBMFF specifies a constrained scheme design. The constrained scheme design in ISOBMFF is used to handle situations where the file author requires certain actions from the player or renderer, so that a player can simply inspect a file to find such requirements for rendering the bitstream, and so that legacy players are stopped from decoding and rendering files that require further processing. The mechanism is applicable to any type of video codec.
The mechanism is similar to the content protection transform, which indicates encrypted or encapsulated media by hiding the sample entries behind the generic sample entries "encv", "enca", and so on. The analogous mechanism for constrained video uses a transform with the generic sample entry "resv". The mechanism is applicable when the content should be decoded only by players that can render it correctly.
The constrained scheme is specified in clauses 8.15.1 through 8.15.3 of the ISOBMFF specification.
Clause 8.15.4 of the ISOBMFF specification defines a particular constrained scheme type for frame-packed video.
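As a conceptual illustration of the player-side check this design implies, the Python sketch below refuses to decode a track whose sample entry type is "resv" unless the signaled scheme type is one the player can render. The TrackInfo structure and the set of renderable schemes are assumptions for illustration, standing in for whatever a real file parser would produce.

```python
# Conceptual sketch of the constrained scheme check: a 'resv' track should be
# decoded only if the player recognizes the scheme type and can perform the
# required post-decoding processing.

from dataclasses import dataclass
from typing import Optional

@dataclass
class TrackInfo:
    sample_entry_type: str        # e.g. "resv", "hvc1", "avc1"
    scheme_type: Optional[str]    # e.g. "stvi" for frame-packed video

RENDERABLE_SCHEMES = {"stvi"}     # schemes this player can post-process

def may_decode(track: TrackInfo) -> bool:
    if track.sample_entry_type != "resv":
        return True                               # unrestricted track
    return track.scheme_type in RENDERABLE_SCHEMES

print(may_decode(TrackInfo("hvc1", None)))        # True
print(may_decode(TrackInfo("resv", "stvi")))      # True
print(may_decode(TrackInfo("resv", "odvd")))      # False
```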
Dynamic adaptive streaming over HTTP (DASH), specified in ISO/IEC 23009-1, is a standard for HTTP (adaptive) streaming applications. It mainly specifies the format of the Media Presentation Description (MPD), also known as a manifest, and the media segment formats. The MPD describes the media available on the server and lets the DASH client autonomously download the media version at the media time of interest.
A typical process for DASH-based HTTP streaming contains the following steps:
1) The DASH client obtains the MPD of the streaming content, e.g., a movie. The MPD contains information about the different alternative representations of the streaming content, e.g., bit rate, video resolution, frame rate, and audio language, as well as the URLs of the HTTP resources (the initialization segment and the media segments).
2) Based on information in the MPD and the DASH client's local information, such as network bandwidth, decoding/display capabilities, and user preferences, the client requests the desired representation(s), one segment (or a part thereof) at a time.
3) When the DASH client detects a change in network bandwidth, it requests segments of a different representation with a better-matching bit rate, ideally starting from a segment that begins with a random access point.
During an HTTP streaming "session", to respond to a user request to seek backward to a past position or forward to a future position, the DASH client requests past or future segments, starting from a segment that is close to the requested position and that ideally begins with a random access point. The user may also request fast-forwarding of the content, which may be accomplished by requesting data sufficient for decoding only the intra-coded video pictures or only a temporal subset of the video stream.
In HTTP streaming, such as DASH, frequently used operations include HEAD, GET, and partial GET. The HEAD operation retrieves the header of a file associated with a given Uniform Resource Locator (URL) or Uniform Resource Name (URN), without retrieving the payload associated with the URL or URN. The GET operation retrieves the entire file associated with a given URL or URN. The partial GET operation receives a byte range as an input parameter and retrieves a continuous number of bytes of a file, where the number of bytes corresponds to the received byte range. Thus, movie fragments may be provided for HTTP streaming, since a partial GET operation can obtain one or more individual movie fragments. Within a movie fragment, there can be several track fragments of different tracks. In HTTP streaming, a media presentation may be a structured collection of data that is accessible to the client. The client may request and download media data information to present a streaming service to a user.
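A brief Python sketch of these operations using only the standard library is shown below; the segment URL and the byte range are placeholders, and a server that supports range requests is assumed.

```python
# Sketch of the HEAD and partial GET operations described above.

import urllib.request

SEGMENT_URL = "https://example.com/rep1/segment1.m4s"   # hypothetical URL

def http_head(url):
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:            # headers only, no payload
        return dict(resp.headers)

def http_partial_get(url, first_byte, last_byte):
    req = urllib.request.Request(
        url, headers={"Range": f"bytes={first_byte}-{last_byte}"})
    with urllib.request.urlopen(req) as resp:            # expect 206 Partial Content
        return resp.read()

# movie_fragment = http_partial_get(SEGMENT_URL, 0, 1023)  # first 1024 bytes
```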
In an example where 3GPP data is streamed using HTTP streaming, there may be multiple representations of video and/or audio data for multimedia content. As explained below, the different representations may correspond to different coding characteristics (e.g., different profiles or levels of a video coding standard), different coding standards or extensions of a coding standard (e.g., multiview and/or scalable extensions), or different bitrates. The manifest of such representations may be defined in a Media Presentation Description (MPD) data structure. The media presentation may correspond to a structured collection of data accessible to the HTTP streaming client device. An HTTP streaming client device may request and download media data information to present a streaming service to a user of the client device. The media presentation may be described in an MPD data structure, which may include updates to the MPD.
A media presentation may contain a sequence of one or more time periods. Each time period may extend until the start of the next time period, or in the case of the last time period, until the end of the media presentation. Each time period may contain one or more representations for the same media content. The representation may be one of several alternative encoded versions of audio, video, timed text, or other such data. The representations may differ by type of encoding, e.g., by bitrate, resolution, and/or codec for video data, and by bitrate, language, and/or codec for audio data. The term representation may be used to refer to portions of encoded audio or video data that correspond to multimedia content for a particular time period and are encoded in a particular manner.
The representations for a particular time period may be assigned to a group indicated by an attribute in an MPD that indicates an adaptation set to which the representations belong. Representations in the same adaptation set are often considered alternatives to each other in that a client device may dynamically and seamlessly switch between such representations, for example, to perform bandwidth adaptation. For example, each representation of video data for a particular time period may be assigned to the same adaptation set such that any of the representations may be selected for decoding to present media data, such as video data or audio data, of multimedia content for the corresponding time period. In some examples, media content within a time period may be represented by one representation from group 0 (if present), or by a combination of at most one representation from each non-zero group. The timing data for each representation of a time period may be expressed relative to a start time of the time period.
A representation may include one or more segments. Each representation may include an initialization segment, or each segment of a representation may be self-initializing. When present, the initialization segment may contain initialization information for accessing the representation. In general, the initialization segment does not contain media data. A segment may be uniquely referenced by an identifier, such as a Uniform Resource Locator (URL), Uniform Resource Name (URN), or Uniform Resource Identifier (URI). The MPD may provide an identifier for each segment. In some examples, the MPD may also provide byte ranges in the form of a range attribute, which may correspond to the data for a segment within a file accessible by the URL, URN, or URI.
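To make the MPD structure described above concrete, the following Python sketch lists the representations of each adaptation set together with their @mimeType and @codecs values. The element and attribute names follow ISO/IEC 23009-1; the manifest path and the simple attribute-inheritance handling are assumptions for illustration.

```python
# Sketch: enumerate representations with their @mimeType and @codecs.

import xml.etree.ElementTree as ET

NS = {"dash": "urn:mpeg:dash:schema:mpd:2011"}

def list_representations(mpd_path):
    root = ET.parse(mpd_path).getroot()
    for period in root.findall("dash:Period", NS):
        for aset in period.findall("dash:AdaptationSet", NS):
            for rep in aset.findall("dash:Representation", NS):
                yield {
                    "id": rep.get("id"),
                    "bandwidth": int(rep.get("bandwidth", "0")),
                    # @mimeType / @codecs may appear on the Representation or
                    # be inherited from the enclosing AdaptationSet.
                    "mimeType": rep.get("mimeType") or aset.get("mimeType"),
                    "codecs": rep.get("codecs") or aset.get("codecs"),
                }

# for rep in list_representations("manifest.mpd"):   # path is a placeholder
#     print(rep)
```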
Different representations may be selected for substantially simultaneous retrieval of different types of media data. For example, a client device may select an audio representation, a video representation, and a timed text representation from which to retrieve segments. In some examples, the client device may select particular adaptation sets for performing bandwidth adaptation. That is, the client device may select an adaptation set that includes video representations, an adaptation set that includes audio representations, and/or an adaptation set that includes timed text. Alternatively, the client device may select adaptation sets for certain types of media (e.g., video) and directly select representations for other types of media (e.g., audio and/or timed text).
Virtual reality (VR) is the ability to be virtually present in a non-physical world created by the rendering of natural and/or synthetic images and sound correlated with the movements of the immersed user, allowing interaction with that world. With recent progress in rendering devices, such as head-mounted displays (HMDs), and in VR video (often also referred to as 360-degree video) creation, a significant quality of experience can be offered. VR applications include gaming, training, education, sports video, online shopping, video entertainment, and so on.
A typical VR system may include the following components and steps:
1) a camera set, which typically includes a plurality of individual cameras pointing in different directions and ideally collectively encompassing all viewpoints around the camera set.
2) Image stitching, in which video pictures taken by the multiple individual cameras are synchronized in the time domain and stitched in the spatial domain to form a spherical video, which is then mapped to a rectangular format, such as an equirectangular projection (like a world map) or a cube map.
3) Video in the mapped rectangular format is encoded/compressed using a video codec such as h.265/HEVC or h.264/AVC.
4) The compressed video bitstream may be stored and/or encapsulated in a media format and transmitted (possibly covering only a subset of the area visible to the user) over a network to a receiver.
5) The receiver receives a video bitstream, or portion thereof, that may be encapsulated in a format, and sends the decoded video signal, or portion thereof, to a rendering device.
6) The rendering device may be, for example, an HMD, which can track head movement and even eye movement, and render the corresponding part of the video such that an immersive experience is delivered to the user.
The Omnidirectional Media Application Format (OMAF) is being developed by MPEG to define a media application format that allows omnidirectional media applications, focusing on VR applications with 360 degrees of video and associated audio. OMAF specifies a list of projection methods that can be used to convert spherical or 360 ° video into two-dimensional rectangular video, followed by how to store the omnidirectional media and associated metadata using the ISO base media file format (ISOBMFF) and how to encapsulate, signal, and stream the omnidirectional media using dynamic adaptive streaming over HTTP (DASH), and ultimately which video and audio codecs and media coding configurations can be used to compress and play the omnidirectional media signal.
OMAF is intended to be standardized as ISO/IEC 23000-20, and a draft specification known as the OMAF Committee Draft (CD) is available at http://wg11.sc29.org/doc_end_user/documents/117_Geneva/wg11/w16636.zip.
Clause 7.1 of the OMAF CD defines a particular constrained scheme type, "odvd", for VR/omnidirectional/360 video. The OMAF CD specifies that when scheme_type is equal to "odvd", the scheme information box ("schi") must contain either a ProjectedOmnidirectionalVideoBox ("povd") or a FisheyeOmnidirectionalVideoBox ("fovd"). The OMAF CD further specifies that the "povd" box contains a ProjectionFormatBox, which carries geometry_type and projection_type. Per the OMAF CD, geometry_type may indicate, for example, spherical geometry, and projection_type may indicate an equirectangular projection, a cube map projection, or some other projection type. These pieces of information are all important for content selection purposes.
The DASH specification contains definitions of the MPD attributes @mimeType and @codecs, both of which may be conveyed at the level of an adaptation set, representation, or sub-representation.
The @mimeType attribute is defined in clause 5.3.7.2 of the DASH specification as follows:
Furthermore, in clause 7.3.1 of the DASH specification, the semantics of the @mimeType attribute are set forth for ISOBMFF-based media presentations as follows:
"The @mimeType attribute of each representation should be set according to RFC 4337. Additional parameters may be added according to RFC 6381."
The @codecs attribute is defined in clause 5.3.7.2 of the DASH specification as follows:
Annex E of ISO/IEC 14496-15 defines the "codecs" parameter for AVC, HEVC, and their extensions.
The "codecs" parameter is an optional MIME type parameter according to ISO/IEC 14496-15 Annex E and RFC 6381. However, it is not clear from ISO/IEC 14496-15 and RFC 6381 whether the "codecs" parameter can be conveyed as part of the @mimeType attribute.
As specified in RFC 6381, a "codecs" parameter value is a single value or a comma-separated list of values, where each value consists of one or more dot-separated (i.e., period-delimited) elements. The namespace of the first element is determined by the MIME type. The namespace of each subsequent element is determined by the preceding element. For ISOBMFF, the first element of a "codecs" parameter value is the sample entry four-character code.
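The splitting rules just described are simple enough to show directly; the short Python sketch below breaks a "codecs" value into per-track values and dot-separated elements. The example codec strings are commonly seen AVC and AAC values, used here only for illustration.

```python
# Sketch following RFC 6381: the 'codecs' parameter is a single value or a
# comma-separated list, and each value is one or more dot-separated elements.
# For ISOBMFF content the first element is the sample entry four-character code.

def split_codecs(codecs: str):
    for value in codecs.split(","):
        elements = value.strip().split(".")
        yield elements[0], elements[1:]   # (sample entry 4CC, remaining elements)

for fourcc, rest in split_codecs('avc1.640028, mp4a.40.2'):
    print(fourcc, rest)
# avc1 ['640028']
# mp4a ['40', '2']
```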
Existing designs regarding signaling of HDR/WCG video, VR/omni/360 video, frame-packed video, video with display orientation changes, and video stored using constrained schemes may suffer from the following problems:
1) There is a lack of a mechanism to indicate, in MIME type parameters, the use of a constrained scheme and some important details of the constrained scheme used, e.g., for VR/omnidirectional/360 video and frame-packed video. Furthermore, the answers to the following questions are not obvious:
a. How does a DASH client handle an @mimeType attribute with an unrecognized optional parameter? Ignore the unrecognized part and process the rest as if the unrecognized part did not exist? Or ignore the entire adaptation set/representation/sub-representation (i.e., not attempt to request/process the adaptation set/representation/sub-representation containing this @mimeType attribute)?
i. The latter seems to make more sense (see the sketch after this list).
ii. RFC 4337/RFC 6381 are silent on this. This should be explicitly specified somewhere, preferably in an update to RFC 6381 (which would incidentally also update RFC 4337).
b. What should the "codecs" parameter be for video stored using a constrained scheme?
c. Should the constrained scheme used be indicated by the "codecs" parameter, or by a different/separate MIME type parameter? If an additional optional MIME type parameter is defined to indicate the constrained scheme used, should the "codecs" parameter be the same as in the case where no constrained scheme is used?
2) For video with display orientation changes, a dedicated constrained scheme is missing, and the first problem described above also applies.
3) There is a lack of a mechanism to contain important video information for HDR/WCG video as part of the MIME-type parameters.
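The Python sketch below illustrates the stricter of the two client behaviors discussed in item 1.a above: if the @mimeType value carries an optional parameter the client does not recognize, the whole adaptation set/representation/sub-representation is skipped. The set of known parameters and the example values are hypothetical.

```python
# Illustrative sketch: skip any representation whose MIME type carries an
# optional parameter this client does not recognize.

KNOWN_PARAMETERS = {"codecs", "profiles"}

def representation_is_usable(mime_type: str) -> bool:
    # e.g. 'video/mp4; codecs="avc1.640028"; somefutureparam="x"'
    parts = [p.strip() for p in mime_type.split(";")]
    for param in parts[1:]:
        name = param.split("=", 1)[0].strip().lower()
        if name not in KNOWN_PARAMETERS:
            return False        # do not attempt to request/process this one
    return True

print(representation_is_usable('video/mp4; codecs="avc1.640028"'))          # True
print(representation_is_usable('video/mp4; codecs="avc1"; newparam="1"'))   # False
```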
Fig. 1 is a block diagram illustrating an example system 10 that implements techniques for streaming media data over a network. In this example, system 10 includes content preparation device 20, server device 60, and client device 40. Client device 40 and server device 60 are communicatively coupled by a network 74, which network 74 may comprise the internet. In some examples, content preparation device 20 and server device 60 may also be coupled by network 74 or another network, or may be directly communicatively coupled. In some examples, content preparation device 20 and server device 60 may comprise the same device.
In the example of fig. 1, content preparation device 20 includes an audio source 22 and a video source 24. The audio source 22 may, for example, comprise a microphone that generates an electrical signal representative of captured audio data to be encoded by an audio encoder 26. Alternatively, audio source 22 may comprise a storage medium storing previously recorded audio data, an audio data generator such as a computerized synthesizer, or any other source of audio data. Video source 24 may include: a camera that generates video data to be encoded by video encoder 28; a storage medium encoded with previously recorded video data; a video data generation unit, such as a computer graphics source; or any other source of video data. Content preparation device 20 need not be communicatively coupled to server device 60 in all examples, but rather may store multimedia content to a separate medium that is read by server device 60.
Raw audio and video data may include analog or digital data. The analog data may be digitized before being encoded by audio encoder 26 and/or video encoder 28. Audio source 22 may obtain audio data from the speaking participant while the speaking participant is speaking, and video source 24 may simultaneously obtain video data of the speaking participant. In other examples, audio source 22 may comprise a computer-readable storage medium comprising stored audio data; and video source 24 may comprise a computer-readable storage medium comprising stored video data. In this manner, the techniques described in this disclosure may be applied to live, streaming, real-time audio and video data, or to archived, pre-recorded audio and video data.
An audio frame that corresponds to a video frame is generally an audio frame containing audio data that was captured (or generated) by audio source 22 contemporaneously with the video data, captured (or generated) by video source 24, that is contained within the video frame. For example, while a speaking participant generally produces audio data by speaking, audio source 22 captures the audio data, and video source 24 captures video data of the speaking participant at the same time, that is, while audio source 22 is capturing the audio data. Hence, an audio frame may temporally correspond to one or more particular video frames. Accordingly, an audio frame corresponding to a video frame generally corresponds to a situation in which audio data and video data were captured at the same time and for which the audio frame and the video frame comprise, respectively, the audio data and the video data that were captured at the same time.
In some examples, audio encoder 26 may encode a time stamp in each encoded audio frame, the time stamp representing a time at which audio data of the encoded audio frame was recorded; and similarly, video encoder 28 may encode a timestamp in each encoded video frame that represents the time at which the video data for the encoded video frame was recorded. In such examples, an audio frame corresponding to a video frame may comprise: audio frames including timestamps and video frames including the same timestamps. Content preparation device 20 may include an internal clock from which audio encoder 26 and/or video encoder 28 may generate a time stamp, or which audio source 22 and video source 24 may use to associate audio data and video data, respectively, with a time stamp.
In some examples, audio source 22 may send data to audio encoder 26 corresponding to the time at which audio data was recorded, and video source 24 may send data to video encoder 28 corresponding to the time at which video data was recorded. In some examples, audio encoder 26 may encode a sequence identifier in the encoded audio data to indicate a relative temporal ordering of the encoded audio data, but not necessarily an absolute time at which the audio data was recorded, and similarly, video encoder 28 may also use the sequence identifier to indicate a relative temporal ordering of the encoded video data. Similarly, in some examples, the sequence identifier may be mapped or otherwise correlated with a timestamp.
Audio encoder 26 typically generates a stream of encoded audio data, while video encoder 28 generates a stream of encoded video data. Each individual data stream, whether audio or video, may be referred to as an elementary stream. An elementary stream is a single digitally coded (possibly compressed) component of a representation. For example, the coded video or audio portion of the representation may be an elementary stream. An elementary stream may be converted into a Packetized Elementary Stream (PES) before being encapsulated within a video file. Within the same representation, a PES packet belonging to one elementary stream can be distinguished from other PES packets using a stream ID. The basic unit of data of an elementary stream is a Packetized Elementary Stream (PES) packet. Thus, the coded video data generally corresponds to the base video stream. Similarly, the audio data corresponds to one or more respective elementary streams.
Many video coding standards, such as ITU-T H.264/AVC and ITU-T H.265/High Efficiency Video Coding (HEVC), define the syntax, semantics, and decoding process for error-free bitstreams, any of which conform to a certain profile or level. Video coding standards typically do not specify the encoder, but the encoder is tasked with guaranteeing that the generated bitstreams are standard-compliant for a decoder. In the context of video coding standards, a "profile" corresponds to a subset of algorithms, features, or tools and the constraints that apply to them. For example, as defined by the H.264 standard, a "profile" is a subset of the entire bitstream syntax specified by the H.264 standard. A "level" corresponds to limits on decoder resource consumption, such as decoder memory and computation, which relate to picture resolution, bit rate, and block processing rate. A profile may be signaled with a profile_idc (profile indicator) value, while a level may be signaled with a level_idc (level indicator) value.
For example, the H.264 standard recognizes that, within the bounds imposed by the syntax of a given profile, it is still possible to require a large variation in the performance of encoders and decoders depending upon the values taken by syntax elements in the bitstream, such as the specified size of the decoded pictures. The H.264 standard further recognizes that, in many applications, it is neither practical nor economical to implement a decoder capable of dealing with all hypothetical uses of the syntax within a particular profile. Accordingly, the H.264 standard defines a "level" as a specified set of constraints imposed on the values of the syntax elements in the bitstream. These constraints may be simple limits on values. Alternatively, these constraints may take the form of constraints on arithmetic combinations of values (e.g., picture width multiplied by picture height multiplied by the number of pictures decoded per second). The H.264 standard further specifies that individual implementations may support a different level for each supported profile.
A decoder conforming to a profile ordinarily supports all of the features defined in the profile. For example, as a coding feature, B-picture coding is not supported in the baseline profile of H.264/AVC but is supported in other profiles of H.264/AVC. A decoder conforming to a level should be capable of decoding any bitstream that does not require resources beyond the limits defined in the level. Definitions of profiles and levels can be helpful for interoperability. For example, during a video transmission, a pair of profile and level definitions may be negotiated and agreed upon for an entire transmission session. More specifically, in H.264/AVC, a level may define limits on the number of macroblocks that need to be processed, the decoded picture buffer (DPB) size, the coded picture buffer (CPB) size, the vertical motion vector range, the maximum number of motion vectors per two consecutive MBs, and whether a B-block can have sub-macroblock partitions of less than 8x8 pixels. In this way, a decoder may determine whether it is capable of properly decoding the bitstream.
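The kind of arithmetic constraint mentioned above (picture width x picture height x pictures decoded per second) can be checked with a one-line computation; in the Python example below the numeric limit is a hypothetical level limit, not a value taken from the H.264 or HEVC specification.

```python
# Worked example of a level-style arithmetic constraint on decoder load:
# width * height * frames-per-second must not exceed a level-specific limit.

MAX_LUMA_SAMPLES_PER_SECOND = 62_668_800   # hypothetical level limit

def within_level(width, height, fps):
    return width * height * fps <= MAX_LUMA_SAMPLES_PER_SECOND

print(within_level(1920, 1080, 30))   # 62,208,000 samples/s  -> True
print(within_level(1920, 1080, 60))   # 124,416,000 samples/s -> False
```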
In the example of fig. 1, encapsulation unit 30 of content preparation device 20 receives an elementary stream comprising coded video data from video encoder 28 and an elementary stream comprising coded audio data from audio encoder 26. In some examples, video encoder 28 and audio encoder 26 may each include a packetizer for forming PES packets from encoded data. In other examples, video encoder 28 and audio encoder 26 may each interface with a respective packetizer for forming PES packets from encoded data. In still other examples, encapsulation unit 30 may include a packetizer for forming PES packets from encoded audio and video data.
Video encoder 28 may encode video data of multimedia content in a variety of ways, generating different representations of multimedia content at various bitrates and with various characteristics, such as pixel resolution, frame rate, conformance to various coding standards, conformance to various profiles and/or profile levels of various coding standards, representations with one or more views (e.g., for two-dimensional or three-dimensional playback), or other such characteristics. As used in this disclosure, a representation may comprise one of audio data, video data, text data (e.g., for closed captioning), or other such data. A representation may include an elementary stream such as an audio elementary stream or a video elementary stream. Each PES packet may include a stream _ id that identifies the elementary stream to which the PES packet belongs. Encapsulation unit 30 is responsible for the translation of elementary streams into various representations of video files (e.g., segments).
Encapsulation unit 30 receives PES packets of the elementary streams of a representation from audio encoder 26 and video encoder 28 and forms corresponding network abstraction layer (NAL) units from the PES packets. Coded video segments may be organized into NAL units, which provide a "network-friendly" video representation addressing applications such as video telephony, storage, broadcast, or streaming. NAL units can be categorized as video coding layer (VCL) NAL units and non-VCL NAL units. VCL NAL units may contain the core compression engine and may include block, macroblock, and/or slice level data. Other NAL units may be non-VCL NAL units. In some examples, a coded picture in one time instance, normally presented as a primary coded picture, may be contained in an access unit, which may include one or more NAL units.
Non-VCL NAL units may include parameter set NAL units and SEI NAL units, among others. Parameter sets may contain sequence-level header information (in sequence parameter sets (SPSs)) and infrequently changing picture-level header information (in picture parameter sets (PPSs)). With parameter sets (e.g., PPS and SPS), infrequently changing information need not be repeated for each sequence or picture, so coding efficiency may be improved. Furthermore, the use of parameter sets may enable out-of-band transmission of the important header information, avoiding the need for redundant transmissions for error resilience. In out-of-band transmission examples, parameter set NAL units may be transmitted on a different channel than other NAL units, such as SEI NAL units.
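For an H.264/AVC stream, the VCL/non-VCL distinction described here can be read directly from the one-byte NAL unit header; a small Python sketch follows (the mapping shown covers only a few common NAL unit types).

```python
# Classify H.264/AVC NAL units from the one-byte NAL header:
# 1 bit forbidden_zero_bit, 2 bits nal_ref_idc, 5 bits nal_unit_type.
# Types 1-5 are VCL NAL units; 6 is SEI, 7 is SPS, 8 is PPS (non-VCL).

def classify_nal(header_byte: int) -> str:
    nal_unit_type = header_byte & 0x1F
    if 1 <= nal_unit_type <= 5:
        return "VCL"
    return {6: "SEI", 7: "SPS", 8: "PPS"}.get(nal_unit_type, "other non-VCL")

print(classify_nal(0x67))   # 0x67 & 0x1F = 7 -> SPS
print(classify_nal(0x06))   # SEI
print(classify_nal(0x65))   # 0x65 & 0x1F = 5 -> VCL (IDR slice)
```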
Supplemental enhancement information (SEI) may contain information that is not necessary for decoding the coded picture samples from VCL NAL units, but may assist in processes related to decoding, display, error resilience, and other purposes. SEI messages may be contained in non-VCL NAL units. SEI messages are a normative part of some standard specifications and thus are not always mandatory for a standard-compliant decoder implementation. SEI messages may be sequence-level SEI messages or picture-level SEI messages. Some sequence-level information may be contained in SEI messages, such as the scalability information SEI messages in the example of SVC and the view scalability information SEI messages in MVC. These example SEI messages may convey information about, e.g., the extraction of operation points and the characteristics of the operation points. In addition, encapsulation unit 30 may form a manifest file, such as a media presentation descriptor (MPD) that describes characteristics of the representations. Encapsulation unit 30 may format the MPD according to the Extensible Markup Language (XML).
Encapsulation unit 30 may provide data for one or more representations of multimedia content as well as a manifest file (e.g., MPD) to output interface 32. The output interface 32 may comprise a network interface or an interface for writing to a storage medium, such as a Universal Serial Bus (USB) interface, a CD or DVD writer or burner, an interface to a magnetic or flash storage medium, or other interface for storing or transmitting media data. Encapsulation unit 30 may provide data for each of the representations of multimedia content to output interface 32, which may send the data to server device 60 via a network transmission or storage medium. In the example of fig. 1, server device 60 includes a storage medium 62 that stores various multimedia content 64, each multimedia content 64 including a respective manifest file 66 and one or more representations 68A-68N (representations 68). In some examples, output interface 32 may also send data directly to network 74.
In some examples, representation 68 may be divided into several adaptation sets. That is, the various subsets of the representation 68 may include, for example, respective common sets of characteristics for: codecs, profiles and levels, resolution, number of views, file format of segments, text type information that can identify language or other characteristics of text to be displayed with the representation and/or audio data to be decoded and presented (e.g., presented by speakers), camera angle information that can describe camera angles or real world camera perspectives of the scenes of the representations in the adaptation set, rating information that describes content suitability for a particular audience, or the like.
Manifest file 66 may include data indicating a subset of representations 68 that correspond to a particular adaptation set and common characteristics of the adaptation sets. Manifest file 66 may also contain data representing individual characteristics, such as bit rate, of the individual representations of the adaptation sets. In this way, the adaptation set may provide simplified network bandwidth adaptation. Representations in the adaptation set may be indicated using child elements of the adaptation set elements of the manifest file 66.
The server device 60 includes a request processing unit 70 and a network interface 72. In some examples, server device 60 may include multiple network interfaces. Further, any or all of the features of server device 60 may be implemented on other devices of the content delivery network, such as routers, bridges, proxy devices, switches, or other devices. In some examples, an intermediary device of the content delivery network may cache data of multimedia content 64 and include components that substantially conform to those of server device 60. In general, the network interface 72 is configured to send and receive data over a network 74.
Request processing unit 70 is configured to receive network requests for data of storage medium 62 from client devices, such as client device 40. For example, request processing unit 70 may implement Hypertext Transfer Protocol (HTTP) version 1.1, as described in RFC 2616, "Hypertext Transfer Protocol - HTTP/1.1," by R. Fielding et al., Network Working Group, IETF, June 1999. That is, request processing unit 70 may be configured to receive HTTP GET or partial GET requests and provide data of multimedia content 64 in response to the requests. The requests may specify a segment of one of representations 68, e.g., using a URL of the segment. In some examples, the requests may also specify one or more byte ranges of the segment, thus comprising partial GET requests. Request processing unit 70 may further be configured to service HTTP HEAD requests to provide header data of a segment of one of representations 68. In any case, request processing unit 70 may be configured to process the requests to provide requested data to a requesting device, such as client device 40.
Additionally or alternatively, request processing unit 70 may be configured to deliver media data via a broadcast or multicast protocol, such as eMBMS. Content preparation device 20 may generate DASH segments and/or subsections in substantially the same manner as described, but server device 60 may deliver such segments or subsections using eMBMS or another broadcast or multicast network transport protocol. For example, request processing unit 70 may be configured to receive a multicast group join request from client device 40. That is, server device 60 may advertise an Internet Protocol (IP) address associated with a multicast group to client devices 40 associated with particular media content, such as a broadcast of a live event. Client device 40 may in turn submit a request to join the multicast group. This request may be propagated throughout network 74, such as the routers comprising network 74, such that the routers are caused to direct traffic destined for an IP address associated with the multicast group to subscribing client devices, such as client device 40.
As illustrated in the example of fig. 1, the multimedia content 64 includes a manifest file 66, which manifest file 66 may correspond to a Media Presentation Description (MPD). Manifest file 66 may contain descriptions of different alternative representations 68 (e.g., video services having different qualities), and the descriptions may include codec information, profile values, tier values, bit rates, and other descriptive characteristics of representations 68, for example. Client device 40 may retrieve the MPD for the media presentation to determine how to access the segments of representation 68.
In particular, retrieval unit 52 may retrieve configuration data (not shown) of client device 40 to determine the decoding capabilities of video decoder 48 and the rendering capabilities of video output 44. The configuration data may also include any or all of the following: a language preference selected by a user of client device 40, one or more camera perspectives corresponding to a depth preference set by a user of client device 40, and/or a rating preference selected by a user of client device 40. For example, retrieval unit 52 may comprise a web browser or media client configured to submit HTTP GET and partial GET requests. Retrieval unit 52 may correspond to software instructions executed by one or more processors or processing units (not shown) of client device 40. In some examples, all or part of the functionality described with respect to retrieval unit 52 may be implemented in hardware or a combination of hardware, software, and/or firmware, where the necessary hardware may be provided to execute the instructions of the software or firmware.
Retrieval unit 52 may compare the decoding and rendering capabilities of client device 40 to the characteristics of representations 68 indicated by the information of manifest file 66. Retrieval unit 52 may initially retrieve at least a portion of manifest file 66 to determine the characteristics of representations 68. For example, retrieval unit 52 may request a portion of manifest file 66 that describes the characteristics of one or more adaptation sets. Retrieval unit 52 may select a subset of representations 68 (e.g., an adaptation set) having characteristics that can be satisfied by the decoding and rendering capabilities of client device 40. Retrieval unit 52 may then determine the bit rates of the representations in the adaptation set, determine a currently available amount of network bandwidth, and retrieve segments from one of the representations having a bit rate that can be satisfied by the network bandwidth.
In general, a higher bit rate representation may result in higher quality video playback, while a lower bit rate representation may provide sufficient quality video playback when the available network bandwidth is reduced. Thus, when the available network bandwidth is relatively high, retrieval unit 52 may retrieve data from a representation of a relatively high bit rate, and when the available network bandwidth is low, retrieval unit 52 may retrieve data from a representation of a relatively low bit rate. In this manner, client device 40 may stream multimedia data over network 74 while also accommodating the changing network bandwidth availability of network 74.
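A compact Python sketch of that rate-adaptation decision is shown below; the safety margin, representation list, and bandwidth figure are illustrative assumptions.

```python
# Sketch of bandwidth-based representation selection: pick the highest-bitrate
# representation that fits within the measured bandwidth (with a margin).

def select_representation(representations, available_bps, margin=0.8):
    """representations: iterable of (rep_id, required_bps) pairs from the MPD."""
    usable = [r for r in representations if r[1] <= available_bps * margin]
    if not usable:
        return min(representations, key=lambda r: r[1])   # fall back to lowest
    return max(usable, key=lambda r: r[1])

reps = [("rep-low", 500_000), ("rep-mid", 2_000_000), ("rep-high", 6_000_000)]
print(select_representation(reps, available_bps=3_000_000))   # ('rep-mid', 2000000)
```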
Additionally or alternatively, retrieval unit 52 may be configured to receive data in accordance with a broadcast or multicast network protocol, such as eMBMS or IP multicast. In such examples, retrieval unit 52 may submit a request to join a multicast network group associated with particular media content. After joining the multicast group, retrieval unit 52 may receive the data of the multicast group without further requests issued to server device 60 or content preparation device 20. Retrieval unit 52 may submit a request to leave the multicast group when the data of the multicast group is no longer needed, e.g., to stop playback or to change channels to a different multicast group.
Network interface 54 may receive data for the segment of the selected representation and provide the data to retrieval unit 52, which retrieval unit 52 may in turn provide the segment to decapsulation unit 50. De-encapsulation unit 50 may de-encapsulate elements of the video file into constituent PES streams, de-packetize the PES streams to retrieve the encoded data, and send the encoded data to audio decoder 46 or video decoder 48 depending on whether the encoded data is an audio stream or part of a video stream, e.g., as indicated by PES packet headers of the streams. Audio decoder 46 decodes the encoded audio data and sends the decoded audio data to audio output 42; while video decoder 48 decodes the encoded video data and sends the decoded video data, which may include multiple views of the stream, to video output 44.
Video encoder 28, video decoder 48, audio encoder 26, audio decoder 46, encapsulation unit 30, retrieval unit 52, and decapsulation unit 50, as applicable, may each be implemented as any of a variety of suitable processing circuitry, such as one or more microprocessors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), discrete logic circuitry, software, hardware, firmware, or any combinations thereof. Each of video encoder 28 and video decoder 48 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined video encoder/decoder (CODEC). Likewise, each of audio encoder 26 and audio decoder 46 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined CODEC. An apparatus including video encoder 28, video decoder 48, audio encoder 26, audio decoder 46, encapsulation unit 30, retrieval unit 52, and/or decapsulation unit 50 may comprise an integrated circuit, a microprocessor, and/or a wireless communication device, such as a cellular telephone.
Client device 40, server device 60, and/or content preparation device 20 may be configured to operate in accordance with the techniques of this disclosure. For purposes of example, this disclosure describes such techniques with respect to client device 40 and server device 60. However, it should be understood that content preparation device 20 may be configured to perform such techniques instead of (or in addition to) server device 60.
Encapsulation unit 30 may form NAL units comprising a header that identifies the program to which the NAL unit belongs, as well as a payload, e.g., audio data, video data, or data that describes the transport or program stream to which the NAL unit corresponds. For example, in H.264/AVC, a NAL unit includes a 1-byte header and a payload of varying size. A NAL unit that includes video data in its payload may comprise video data at various levels of granularity. For example, a NAL unit may comprise a block of video data, a plurality of blocks, a slice of video data, or an entire picture of video data. Encapsulation unit 30 may receive encoded video data from video encoder 28 in the form of PES packets of elementary streams. Encapsulation unit 30 may associate each elementary stream with a corresponding program.
Encapsulation unit 30 may also assemble access units from a plurality of NAL units. In general, an access unit may comprise one or more NAL units for representing a frame of video data, as well as the audio data corresponding to the frame when such audio data is available. An access unit generally includes all NAL units for one output time instance, e.g., all audio and video data for one time instance. For example, if each view has a frame rate of 20 frames per second (fps), then each time instance may correspond to a time interval of 0.05 seconds. During this time interval, the specific frames for all views of the same access unit (the same time instance) may be rendered simultaneously. In one example, an access unit may comprise a coded picture in one time instance, which may be presented as a primary coded picture.
Accordingly, an access unit may comprise all audio and video frames of a common time instance, e.g., all views corresponding to time instance X. This disclosure also refers to an encoded picture of a particular view as a "view component". That is, a view component may comprise an encoded picture (or frame) for a particular view at a particular time. Accordingly, an access unit may be defined as comprising all view components of a common time instance. The decoding order of access units need not necessarily be the same as the output or display order.
A media presentation may include a Media Presentation Description (MPD) that may contain descriptions of different alternative representations (e.g., video services having different qualities), and the descriptions may include, for example, codec information, profile values, and tier values. An MPD is an example of a manifest file, such as manifest file 66. Client device 40 may retrieve the MPD for the media presentation to determine how to access the movie fragments of the various representations. The movie fragments may be located in movie fragment boxes (moof boxes) of the video files.
Manifest file 66 (which may include, for example, an MPD) may advertise the availability of segments of representations 68. That is, the MPD may include information indicative of the wall clock time at which the first segment of one of representations 68 becomes available, and information indicative of the duration of the segments within representations 68. In this way, retrieval unit 52 of client device 40 may determine the time at which each segment is available based on the start time and the duration of the segment that precedes the particular segment.
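As a rough illustration of this timing rule, the following Python sketch (a hypothetical helper, not part of any DASH library) derives the wall-clock availability time of each segment from the advertised availability time of the first segment and the durations of the segments that precede it.

```python
from datetime import datetime, timedelta

def segment_availability_times(first_available_at: datetime,
                               segment_durations_sec: list[float]) -> list[datetime]:
    """Compute when each segment becomes available, assuming segment i is
    available once the durations of all preceding segments have elapsed."""
    times = []
    offset = timedelta(0)
    for duration in segment_durations_sec:
        times.append(first_available_at + offset)
        offset += timedelta(seconds=duration)
    return times

# Example: three 2-second segments announced in a (hypothetical) MPD.
start = datetime(2018, 3, 26, 12, 0, 0)
print(segment_availability_times(start, [2.0, 2.0, 2.0]))
```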
After encapsulation unit 30 has assembled the NAL units and/or access units into a video file based on the received data, encapsulation unit 30 passes the video file to output interface 32 for output. In some examples, encapsulation unit 30 may store the video file locally, or send the video file to a remote server via output interface 32, rather than sending the video file directly to client device 40. For example, output interface 32 may include a transmitter, transceiver, means for writing data to a computer readable medium such as an optical disk drive, magnetic media drive (e.g., floppy disk drive), Universal Serial Bus (USB) port, network interface, or other output interface. Output interface 32 outputs the video file to a computer readable medium, such as a transmission signal, magnetic media, optical media, memory, a diskette, or other computer readable media.
Network interface 54 may receive NAL units or access units via network 74 and provide the NAL units or access units to decapsulation unit 50 via retrieval unit 52. De-encapsulation unit 50 may de-encapsulate elements of the video file into constituent PES streams, de-packetize the PES streams to retrieve the encoded data, and send the encoded data to audio decoder 46 or video decoder 48 depending on whether the encoded data is an audio stream or part of a video stream, e.g., as indicated by a PES packet header of the stream. Audio decoder 46 decodes the encoded audio data and sends the decoded audio data to audio output 42; while video decoder 48 decodes the encoded video data and sends the decoded video data, which may include multiple views of the stream, to video output 44.
FIG. 2 is a block diagram illustrating an example set of components of retrieval unit 52 of FIG. 1 in more detail. In this example, retrieval unit 52 includes eMBMS middleware unit 100, DASH client 110, and media application 112.
In this example, the eMBMS middleware unit 100 further includes an eMBMS reception unit 106, a cache memory 104, and a local server unit 102. In this example, eMBMS reception unit 106 is configured to receive data via eMBMS, for example according to File Delivery over Unidirectional Transport (FLUTE), described in T. Paila et al., "FLUTE - File Delivery over Unidirectional Transport", Network Working Group, RFC 6726, November 2012, available at http://tools.ietf.org/html/rfc6726. That is, the eMBMS reception unit 106 may receive files via broadcast from, for example, server device 60, which may act as a BM-SC.
When the eMBMS middleware unit 100 receives data for a file, the eMBMS middleware unit may store the received data in the cache memory 104. Cache memory 104 may include a computer-readable storage medium, such as flash memory, a hard disk, RAM, or any other suitable storage medium.
Local server unit 102 may act as a server for DASH client 110. For example, local server unit 102 may provide an MPD file or other manifest file to DASH client 110. Local server unit 102 may advertise availability times for segments in the MPD file, as well as hyperlinks from which the segments can be retrieved. These hyperlinks may include a localhost address prefix corresponding to client device 40 (e.g., 127.0.0.1 for IPv4). In this manner, DASH client 110 may request segments from local server unit 102 using HTTP GET or partial GET requests. For example, for a segment available from the link http://127.0.0.1/rep1/seg3, DASH client 110 may construct an HTTP GET request that includes a request for http://127.0.0.1/rep1/seg3, and submit the request to local server unit 102. Local server unit 102 may retrieve the requested data from cache memory 104 and provide the data to DASH client 110 in response to such requests.
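The following is a minimal sketch of how a DASH client such as DASH client 110 might issue such a request; the host address and segment path mirror the example link above, and the sketch simply assumes an HTTP server (here, the local middleware server) is listening at that address.

```python
import urllib.request

def fetch_segment(path: str, host: str = "127.0.0.1") -> bytes:
    """Issue an HTTP GET for a segment served by the local middleware unit."""
    url = f"http://{host}/{path.lstrip('/')}"
    with urllib.request.urlopen(url) as response:  # plain HTTP GET
        return response.read()

# Example mirroring the link in the text; uncomment only if the local
# server is actually running and serving this path.
# data = fetch_segment("rep1/seg3")
# print(len(data), "bytes received")
```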
In accordance with the techniques of this disclosure, encapsulation unit 30 may signal and retrieval unit 52 may receive important video information regarding any or all of video data stored using constrained schemes, HDR/WCG video, VR/omni/360 video, frame-encapsulated video, and video with display orientation changes, such that the important video information may be conveniently accessed by application clients, such as DASH clients, to make content rejection/selection/acceptance/request decisions. As noted above, important video information may include information that may be used for content selection, e.g., selection of a video track or portion thereof for consumption by retrieval unit 52.
The techniques of the present invention overcome the above-described problems. For example, to address the first problem, encapsulation unit 30 and retrieval unit 52 may be configured to use a new format for the "@codecs" parameter that includes an indication of the use of a constrained scheme. In one example, the value of the "@codecs" parameter is defined such that all information deemed important for video is contained in the "codecs" parameter. In this example, the first element of the "@codecs" parameter is the sample entry type code of the track that uses the constrained scheme (i.e., the track stores media data, such as video data, according to the constrained scheme), e.g., "resv". In this example, the second element is a constrained scheme type code, e.g., "stvi" for frame-packed video, or "odvd" for omnidirectional video.
Alternatively, for omnidirectional video, the second element is "povd" for projected omnidirectional video or "fovd" for ultra-wide-angle omnidirectional video. Alternatively, for projected omnidirectional video, the second element indicates the projection type, e.g., "erp" for equirectangular projection or "cmp" for cube map projection.
More information, including important details for a particular type of constrained scheme, may be included in subsequent elements of the manifest file. For example, if the second element is "odvd", then the third element is present and is "povd" for projected omnidirectional video or "fovd" for ultra-wide-angle omnidirectional video. Alternatively, if the second element is "odvd" and the third element is "povd", then the fourth element is present and indicates the projection type, e.g., "erp" for the equirectangular projection or "cmp" for the cube map projection. For example, the first four elements of the value of the "codecs" parameter for equirectangular projected omnidirectional video may be "resv.odvd.povd.erp".
The above elements may be further followed by the ordinary elements of the "codecs" parameter value as defined in clause E of ISO/IEC 14496-15. For example, the "codecs" parameter value for equirectangular projected omnidirectional video conforming to HEVC, progressive, non-packed, Main profile, Main tier, Level 3.1 may be "resv.odvd.povd.erp.hvc1.1.6.L93.B0".
In this manner, encapsulation unit 30 may signal, and retrieval unit 52 may receive, any or all of the values "resv", "stvi", "odvd", "povd", "fovd", "erp", or "cmp", as discussed above, as part of the "@codecs" parameter, for example within manifest file 66. Furthermore, retrieval unit 52 may determine one of representations 68 to retrieve based on the value of the @codecs parameter signaled in manifest file 66 for that representation 68 and the capabilities of video decoder 48.
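As a sketch of the client-side handling implied here, the following Python fragment splits an @codecs value into its period-delimited elements and recognizes the four-character codes named in this description; the lookup table and example value are illustrative only and are not drawn from any published API.

```python
SCHEME_CODES = {
    "stvi": "frame-packed video",
    "odvd": "omnidirectional video",
    "povd": "projected omnidirectional video",
    "fovd": "ultra-wide-angle omnidirectional video",
    "erp": "equirectangular projection",
    "cmp": "cube map projection",
}

def parse_codecs_value(value: str) -> dict:
    """Interpret an @codecs value whose first element is a sample entry type."""
    elements = value.split(".")
    info = {"sample_entry": elements[0], "schemes": [], "codec_elements": []}
    for element in elements[1:]:
        if element in SCHEME_CODES:
            info["schemes"].append((element, SCHEME_CODES[element]))
        else:
            # Remaining elements form the ordinary codec string,
            # e.g. HEVC profile/tier/level indications.
            info["codec_elements"].append(element)
    return info

print(parse_codecs_value("resv.odvd.povd.erp.hvc1.1.6.L93.B0"))
```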
In a second alternative example, the "codecs" parameter value is defined such that some of the information deemed important for video is contained in the "codecs" parameter, while further details of the information deemed important for video are contained in a different MIME-type parameter. In this example, the first element of the @codecs parameter is the sample entry type code of the track using the constrained scheme, i.e., "resv". In this example, the second element may be a constrained scheme type code, such as "stvi" for frame-packed video or "odvd" for omnidirectional video. Alternatively, for omnidirectional video, the second element may be "povd" for projected omnidirectional video or "fovd" for ultra-wide-angle omnidirectional video.
The above two elements may be further followed by the ordinary elements of the "codecs" parameter value as defined in clause E of ISO/IEC 14496-15. For example, the "codecs" parameter value for equirectangular projected omnidirectional video conforming to HEVC, progressive, non-packed, Main profile, Main tier, Level 3.1 may be "resv.odvd.hvc1.1.6.L93.B0". Alternatively, the "codecs" parameter value for the above video may be "resv.povd.hvc1.1.6.L93.B0". Alternatively, the "codecs" parameter value of the above video may be "resv.
In addition to the new format for the "codecs" parameter discussed above, new optional MIME-type parameters containing more details of a particular type of constrained scheme can also be used. Such optional MIME-type parameters are similar in format to the "codecs" parameter, i.e., they can be a single value or a comma-separated list of values, wherein each value in the comma-separated list includes one or more period-delimited (dot-separated) elements, and the namespace of each element can be determined from the preceding elements. As one example, an optional MIME type parameter "odvdinfo" may contain more details of the omnidirectional video. According to this example, for an "odvdinfo" value, the first element may be "povd" for projected omnidirectional video or "fovd" for ultra-wide-angle omnidirectional video, and in the former case of "povd", the second element is present and indicates the projection type, e.g., "erp" for equirectangular projection or "cmp" for cube map projection. More elements may be added to contain more information. Alternatively, as another example, the optional MIME type parameter "fpvdinfo" may contain more detail of the frame-packed video. For example, "fpvdinfo" may include elements corresponding to stereo_scheme and stereo_indication_type as defined in clause 8.15.4.2 of the ISOBMFF specification.
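To illustrate how such optional parameters might be read alongside the "codecs" parameter, the following sketch splits a full MIME type string into its parameters; the parameter names "codecs" and "odvdinfo" follow the examples above, while the overall MIME string shown is invented for illustration.

```python
def parse_mime_parameters(mime_type: str) -> dict:
    """Split a MIME type such as 'video/mp4; codecs="..."; odvdinfo="..."'
    into its base type and a dictionary of parameters."""
    parts = [p.strip() for p in mime_type.split(";")]
    params = {}
    for part in parts[1:]:
        name, _, value = part.partition("=")
        params[name.strip()] = value.strip().strip('"')
    return {"type": parts[0], "parameters": params}

# Illustrative MIME type carrying both the codecs and odvdinfo parameters.
example = 'video/mp4; codecs="resv.odvd.hvc1.1.6.L93.B0"; odvdinfo="povd.erp"'
parsed = parse_mime_parameters(example)
print(parsed["parameters"].get("odvdinfo", "").split("."))  # ['povd', 'erp']
```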
As another example, to address the second problem (which may be performed in conjunction with any of the techniques above for addressing the first problem), encapsulation unit 30 and retrieval unit 52 may be configured to use a new constrained scheme type indicating that a track of a representation carries video data with display orientation changes. For example, the 4-character code "vdoc" may indicate that the corresponding track carries video with display orientation changes.
In one example, no further information is provided about the display orientation changes, and the SchemeInformationBox may not be present in the RestrictedSchemeInfoBox. In another alternative, either or both of rotation and flipping are further indicated by a new box contained in the SchemeInformationBox. For example, such a new box may include a field named display_orientation_change_type, where a value of 0 indicates that both rotation and flipping are applied, a value of 1 indicates that only rotation is applied, and a value of 2 indicates that only flipping is applied. Thus, encapsulation unit 30 may set the value of the display_orientation_change_type field based on whether the corresponding track includes either or both of rotation and/or flipping, and retrieval unit 52 may determine whether the track of one of representations 68 includes display orientation changes and, if so, determine from the value of the field whether the changes include either or both of rotation and/or flipping.
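The mapping just described could be expressed as simply as the following sketch; the field name display_orientation_change_type and its values are taken from this description and are not fields of any published ISOBMFF box.

```python
def interpret_display_orientation_change(change_type: int) -> dict:
    """Map display_orientation_change_type to the operations a renderer must apply."""
    if change_type == 0:
        return {"rotation": True, "flip": True}   # both rotation and flipping applied
    if change_type == 1:
        return {"rotation": True, "flip": False}  # rotation only
    if change_type == 2:
        return {"rotation": False, "flip": True}  # flipping only
    raise ValueError(f"unknown display_orientation_change_type: {change_type}")

print(interpret_display_orientation_change(1))
```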
Furthermore, the new format for the "codecs" parameter as defined above may also be applied. For example, the "codecs" parameter value for video with display orientation changes conforming to HEVC, progressive, non-packed, Main profile, Main tier, Level 3.1 may be "resv.vdoc.hvc1.1.6.L93.B0". Similarly, as with several alternatives to the examples above, additional information such as display_orientation_change_type may be included in a third element of the "codecs" parameter value, with the remaining elements shifted back by one position accordingly.
As another example, to address the third problem (which may be performed in conjunction with any of the techniques for addressing the first problem and/or the second problem above), encapsulation unit 30 and retrieval unit 52 may be configured to use an optional MIME type parameter "hdrinfo", which may contain important information for HDR/WCG video. The format of this optional MIME-type parameter can be a single value or a comma-separated list of values, where each value includes one or more period-delimited (dot-separated) elements. For example, an "hdrinfo" parameter value may contain four elements in the form "element1.element2.element3.element4", where the four elements 1-4 may be hexadecimal representations of the fields colour_primaries, transfer_characteristics, matrix_coeffs, and full_range_flag, respectively, as defined in clause 12.1.5 of the ISOBMFF specification.
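A minimal sketch of parsing such an "hdrinfo" value follows; the example value is invented for illustration, and the integer codes would in practice come from the video's actual colour description.

```python
def parse_hdrinfo(value: str) -> dict:
    """Decode an 'hdrinfo' value of the form 'elem1.elem2.elem3.elem4',
    where each element is a hexadecimal field value."""
    fields = ("colour_primaries", "transfer_characteristics",
              "matrix_coeffs", "full_range_flag")
    elements = value.split(".")
    if len(elements) != len(fields):
        raise ValueError("hdrinfo value must have exactly four elements")
    return {name: int(elem, 16) for name, elem in zip(fields, elements)}

# Illustrative value only: BT.2020 primaries (9), PQ transfer (0x10 = 16),
# BT.2020 matrix (9), limited range (0).
print(parse_hdrinfo("9.10.9.0"))
```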
Fig. 3 is a conceptual diagram illustrating elements of example multimedia content 120. The multimedia content 120 may correspond to the multimedia content 64 (fig. 1) or another multimedia content stored in the storage medium 62. In the example of fig. 3, multimedia content 120 includes a Media Presentation Description (MPD) 122 and a plurality of representations 124A-124N (representations 124). Representation 124A includes optional header data 126 and segments 128A-128N (segments 128), while representation 124N includes optional header data 130 and segments 132A-132N (segments 132). For convenience, the letter N is used to designate the last movie fragment in each of the representations 124. In some examples, there may be a different number of movie fragments between representations 124.
MPD 122 may include a data structure separate from representations 124. MPD 122 may correspond to manifest file 66 of fig. 1. Likewise, representations 124 may correspond to representations 68 of fig. 1. In general, MPD 122 may include data that generally describes characteristics of representations 124, such as coding and presentation characteristics, adaptation sets, a profile to which MPD 122 corresponds, text type information, camera angle information, rating information, trick mode information (e.g., information indicating representations that include temporal subsequences), and/or information for retrieving remote periods (e.g., for targeted ad insertion into media content during playback).
When present, the header data 126 may describe characteristics of the segments 128, such as temporal locations of Random Access Points (RAPs), which are also referred to as Stream Access Points (SAPs), which of the segments 128 include random access points, byte offsets to random access points within the segments 128, Uniform Resource Locators (URLs) of the segments 128, or other aspects of the segments 128. When present, the header data 130 may describe similar characteristics of the segments 132. Additionally or alternatively, such characteristics may be entirely included within MPD 122.
The segments 128, 132 include one or more coded video samples, each of which may include frames or slices of video data. Each of the coded video samples of the segments 128 may have similar characteristics, such as height, width, and bandwidth requirements. Such characteristics may be described by data of MPD 122, although such data is not illustrated in the example of fig. 3. MPD 122 may include characteristics as described by the 3GPP specifications, with the addition of any or all of the signaling information described in this disclosure.
Each of the segments 128, 132 may be associated with a unique Uniform Resource Locator (URL). Thus, each of the segments 128, 132 may be independently retrieved using a streaming network protocol, such as DASH. In this way, a destination device, such as client device 40, may retrieve segments 128 or 132 using an HTTP GET request. In some examples, client device 40 may retrieve the particular range of bytes of segments 128 or 132 using an HTTP partial GET request.
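To illustrate the partial GET mentioned here, the following sketch requests only a byte range of a segment; the URL is a placeholder, and the server is assumed to support HTTP range requests.

```python
import urllib.request

def fetch_byte_range(url: str, first_byte: int, last_byte: int) -> bytes:
    """Retrieve only bytes [first_byte, last_byte] of a segment via an HTTP partial GET."""
    request = urllib.request.Request(
        url, headers={"Range": f"bytes={first_byte}-{last_byte}"})
    with urllib.request.urlopen(request) as response:
        return response.read()

# Placeholder URL; a real client would take it from the manifest.
# chunk = fetch_byte_range("http://example.com/rep1/seg3.m4s", 0, 1023)
# print(len(chunk), "bytes")
```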
In accordance with the techniques of this disclosure, MPD 122 may include any or all of the various example MIME type information discussed above. For example, MPD 122 may include the @codecs parameter as discussed above, which may indicate, for example, a sample entry type code for a track using a constrained scheme, a constrained scheme type code, and additional information, e.g., a projection type for omnidirectional video, or the like. Additionally or alternatively, MPD 122 may include information indicating whether a display orientation change applies to one of representations 124A-124N, and if so, the type of display orientation change (e.g., either or both of rotation and/or flipping). Additionally or alternatively, display orientation change information may be provided in any or all of the header data 126, 130 and/or the segments 128, 132. Additionally or alternatively, as discussed above, any or all of MPD 122, the header data 126, 130, and/or the segments 128, 132 may contain important information for HDR/WCG video.
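As a sketch of how a client might read such descriptive data out of an MPD, the following parses a small, hand-written MPD fragment and lists the codecs attribute of each Representation; the XML shown is a simplified stand-in rather than a complete, schema-valid MPD.

```python
import xml.etree.ElementTree as ET

MPD_XML = """<MPD xmlns="urn:mpeg:dash:schema:mpd:2011">
  <Period>
    <AdaptationSet mimeType="video/mp4">
      <Representation id="rep1" bandwidth="2000000"
                      codecs="resv.odvd.povd.erp.hvc1.1.6.L93.B0"/>
      <Representation id="rep2" bandwidth="1000000" codecs="hvc1.1.6.L93.B0"/>
    </AdaptationSet>
  </Period>
</MPD>"""

NS = {"mpd": "urn:mpeg:dash:schema:mpd:2011"}
root = ET.fromstring(MPD_XML)
for rep in root.findall(".//mpd:Representation", NS):
    # Each Representation advertises its codecs value, including any
    # constrained-scheme elements such as "resv.odvd.povd.erp".
    print(rep.get("id"), "->", rep.get("codecs"))
```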
Fig. 4 is a block diagram illustrating elements of an example video file 150, which may correspond to a segment of a representation, such as one of the segments 128, 132 of fig. 3. Each of the segments 128, 132 may include data that generally conforms to the arrangement of data illustrated in the example of fig. 4. Video file 150 may be referred to as an encapsulated segment. As described above, video files according to the ISO base media file format and its extensions store data in a series of objects called "boxes". In the example of fig. 4, video file 150 includes a File Type (FTYP) block 152, a Movie (MOOV) block 154, a segment index (sidx) block 162, a movie fragment (MOOF) block 164, and a Movie Fragment Random Access (MFRA) block 166. Although fig. 4 represents an example of a video file, it should be understood that other media files may include other types of media data (e.g., audio data, timed text data, or the like) that are similar in structure to the data of video file 150, in accordance with the ISO base media file format and extensions thereof.
A File Type (FTYP) block 152 generally describes the file type of the video file 150. File type block 152 may contain data identifying a specification that describes the most preferred use of video file 150. The file type block 152 may alternatively be placed before the MOOV block 154, the movie fragment block 164, and/or the MFRA block 166.
In some examples, a segment such as video file 150 may include an MPD update box (not shown) preceding FTYP block 152. The MPD update box may include information indicating that an MPD corresponding to a representation including video file 150 is to be updated, as well as information for updating the MPD. For example, the MPD update box may provide a URI or URL of a resource to be used to update the MPD. As another example, the MPD update box may include data for updating the MPD. In some examples, the MPD update box may immediately follow a Segment Type (STYP) box (not shown) of video file 150, where the STYP box may define the segment type of video file 150. Fig. 7, discussed in more detail below, provides additional information regarding the MPD update box.
In the example of fig. 4, MOOV block 154 includes a movie header (MVHD) block 156, a Track (TRAK) block 158, and one or more movie extension (MVEX) blocks 160. In general, MVHD block 156 may describe general characteristics of video file 150. For example, MVHD block 156 may include data describing when video file 150 was originally created, when video file 150 was last modified, a timescale for video file 150, a duration of playback of video file 150, or other data generally describing video file 150.
TRAK box 158 may contain data for a track of video file 150. TRAK box 158 may include a track header (TKHD) box that describes characteristics of the track corresponding to TRAK box 158. In some examples, TRAK block 158 may include coded video pictures, while in other examples, the coded video pictures of a track may be included in movie fragment 164, which may be referenced by data of TRAK block 158 and/or sidx block 162.
In some examples, video file 150 may include more than one track. Thus, MOOV block 154 may include a number of TRAK blocks equal to the number of tracks in video file 150. TRAK block 158 may describe characteristics of the corresponding track of video file 150. For example, TRAK block 158 may describe temporal and/or spatial information for the corresponding track. When encapsulation unit 30 (fig. 1) includes a parameter set track in a video file, such as video file 150, a TRAK block similar to TRAK block 158 of MOOV block 154 may describe characteristics of the parameter set track. Encapsulation unit 30 may signal the presence of sequence level SEI messages in the parameter set track within the TRAK block describing the parameter set track.
MVEX block 160 may describe characteristics of a corresponding movie fragment 164, e.g., to signal that video file 150 contains movie fragment 164 in addition to video data contained within MOOV block 154 (if present). In the context of streaming video data, coded video pictures may be included in movie fragment 164, rather than in MOOV block 154. Thus, all coded video samples may be included in movie fragment 164, rather than in MOOV block 154.
MOOV block 154 may include a number of MVEX blocks 160 equal to the number of movie fragments 164 in video file 150. Each of MVEX blocks 160 may describe characteristics of a corresponding one of movie fragments 164. For example, each MVEX block may include a movie extension header (MEHD) block that describes the temporal duration of the corresponding one of movie fragments 164.
As noted above, encapsulation unit 30 may store a sequence data set in a video sample that does not include actual coded video data. A video sample may generally correspond to an access unit, which is a representation of a coded picture at a particular time instance. In the context of AVC, a coded picture includes one or more VCL NAL units, which contain the information to construct all the pixels of the access unit, and other associated non-VCL NAL units, such as SEI messages. Accordingly, encapsulation unit 30 may include a sequence data set, which may include sequence level SEI messages, in one of movie fragments 164. Encapsulation unit 30 may further signal the presence of the sequence data set and/or the sequence level SEI messages as being present in the one of movie fragments 164 within the one of MVEX blocks 160 corresponding to the one of movie fragments 164.
The SIDX block 162 is an optional element of the video file 150. That is, video files that conform to the 3GPP file format or other such file formats do not necessarily include the SIDX block 162. According to the example of the 3GPP file format, a SIDX block may be used to identify sub-segments of a segment (e.g., a segment contained within video file 150). The 3GPP file format defines a sub-segment as "a self-contained set of one or more consecutive movie fragment blocks with the corresponding media data block(s), where a media data block containing data referenced by a movie fragment block must follow that movie fragment block and precede the next movie fragment block containing information about the same track". The 3GPP file format also indicates that the SIDX block "contains a sequence of references to sub-segments of the (sub)segment documented by the block. The referenced sub-segments are contiguous in presentation time. Similarly, the bytes referred to by the segment index block are always contiguous within the segment. The referenced size gives a count of the number of bytes in the referenced material".
The SIDX block 162 generally provides information representative of one or more sub-segments of a segment included in the video file 150. For example, such information may include a playout time at which the sub-segment begins and/or ends, a byte offset of the sub-segment, whether the sub-segment includes (e.g., begins at) a Stream Access Point (SAP), a type of the SAP (e.g., whether the SAP is an Instantaneous Decoder Refresh (IDR) picture, a Clean Random Access (CRA) picture, a Broken Link Access (BLA) picture, or the like), a location of the SAP in the sub-segment (in terms of playout time and/or byte offset), and the like.
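The per-sub-segment information listed above could be modeled as a simple record, as in the following sketch; the field names are descriptive only and do not follow the exact syntax of the segment index block.

```python
from dataclasses import dataclass

@dataclass
class SubsegmentReference:
    """Information a segment index may expose for one sub-segment."""
    earliest_presentation_time: float  # playout time at which the sub-segment begins (seconds)
    byte_offset: int                   # offset of the sub-segment within the segment
    size_bytes: int                    # number of bytes referenced
    starts_with_sap: bool              # whether the sub-segment begins at a stream access point
    sap_type: int                      # e.g., 1 for IDR-like stream access points

refs = [
    SubsegmentReference(0.0, 0, 250_000, True, 1),
    SubsegmentReference(2.0, 250_000, 240_000, True, 1),
]
print(sum(r.size_bytes for r in refs), "bytes indexed")
```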
Movie fragments 164 may include one or more coded video pictures. In some examples, movie fragments 164 may include one or more groups of pictures (GOPs), each of which may include a number of coded video pictures, such as frames or pictures. Additionally, as described above, in some examples movie fragments 164 may include sequence data sets. Each of movie fragments 164 may include a movie fragment header box (MFHD, not shown in fig. 4). The MFHD box may describe characteristics of the corresponding movie fragment, such as a sequence number of the movie fragment. Movie fragments 164 may be included in video file 150 in order of sequence number.
MFRA block 166 may describe the random access points within movie fragments 164 of video file 150. This can assist in performing trick modes, such as performing a seek to a particular temporal location (i.e., playback time) within a segment encapsulated by video file 150. In some examples, MFRA block 166 is generally optional and need not be included in a video file. Likewise, a client device, such as client device 40, does not necessarily need to reference MFRA block 166 to correctly decode and display video data of video file 150. MFRA block 166 may include a number of Track Fragment Random Access (TFRA) blocks (not shown) equal to the number of tracks of video file 150 or, in some examples, equal to the number of media tracks (e.g., non-hint tracks) of video file 150.
In some examples, movie fragments 164 may include one or more Stream Access Points (SAPs), such as IDR pictures. Likewise, MFRA block 166 may provide an indication of the locations within video file 150 of the SAPs. Thus, a temporal subsequence of video file 150 may be formed from the SAPs of video file 150. The temporal sub-sequence may also include other pictures, such as P-frames and/or B-frames that depend on the SAPs. The frames and/or slices of the temporal sub-sequence may be arranged within the segment such that frames/slices of the temporal sub-sequence that depend on other frames/slices of the sub-sequence can be properly decoded. For example, in a hierarchical arrangement of data, data used for prediction of other data may also be included in the temporal subsequence.
As discussed above, in accordance with the techniques of this disclosure, MOOV block 154 may include one or more boxes (e.g., a SchemeInformationBox and/or a RestrictedSchemeInfoBox) that indicate whether either or both of a rotation and/or a flip applies to video data included in movie fragments 164. As discussed above, MOOV block 154 may additionally or alternatively contain important information for HDR/WCG video.
Likewise, the video data of a track of video file 150 may be stored according to a constrained scheme, such as an omnidirectional video scheme, a frame-packed video scheme, or the like. As discussed above, a manifest file, such as an MPD, may include information indicating the constrained scheme for the video data of the track.
FIG. 5 is a flow diagram illustrating an example method in accordance with the techniques of this disclosure. The method of fig. 5 is described as being performed by retrieval unit 52 of client device 40 of fig. 1, although it should be understood that other devices may be configured to perform this method or similar methods.
Initially, the retrieval unit 52 retrieves a manifest file (180). The manifest file may correspond to the manifest file 66 of fig. 1. For example, the manifest file may be a Media Presentation Description (MPD). In accordance with the techniques of this disclosure, the manifest file may include data specifying one or more codecs for one or more representations, such as one or more of representations 68 of fig. 1.
Retrieval unit 52 may then extract data specifying one or more codecs. In detail, according to the technique of the present invention, the retrieval unit 52 may extract the sample entry type code from the manifest file (182). The sample entry type code may be for a track corresponding to one of the representations of the manifest file. As discussed above, the sample entry type code may include "resv" to indicate that the track stores video data using a constrained scheme.
Retrieval unit 52 may then extract the constrained scheme type code from the manifest file (184). For example, if the video data of a track is stored using a frame-packed video scheme, retrieval unit 52 may extract "stvi" from the manifest file for the track, indicating that the track stores the video data using the frame-packed video scheme. As another example, retrieval unit 52 may determine that the video data of the track is stored using an omnidirectional video scheme in response to extracting "odvd" from the manifest file for the track. As yet another example, "fovd" may indicate that the video data of the track is stored using an ultra-wide-angle omnidirectional video scheme, "erp" may indicate an equirectangular projection scheme, and "cmp" may indicate a cube map projection scheme.
Although fig. 5 illustrates an example of extracting two elements, it should be understood that retrieval unit 52 may extract additional elements, for example, in period-delimited or comma-separated formats. In some examples, retrieval unit 52 may extract a set of values for one or more @codecs parameters, where the values for the @codecs parameters may be respective period-delimited lists of elements. For example, retrieval unit 52 may extract "resv.odvd.povd.erp" for the track, and determine that "resv" indicates that the track includes video data stored using a constrained scheme, "odvd" indicates that the video data is omnidirectional video data, "povd" indicates that the video data is projected omnidirectional video data, and "erp" indicates that the video data is equirectangular projected.
Furthermore, as discussed above, retrieval unit 52 may extract additional elements of a different MIME-type parameter from the manifest file. For example, the MIME type parameter can be "odvdinfo". In this case, the additional elements may be "povd", "fovd", "erp", "cmp", or the like. Additionally or alternatively, the MIME type parameter can be "fpvdinfo", specifying frame-packed video information, such as a stereo scheme and/or a stereo indication type.
Retrieval unit 52 may then retrieve the media data using the extracted codes (186). For example, retrieval unit 52 may retrieve video data of schemes that are supported by video decoder 48 and video output 44 (fig. 1), and avoid retrieving video data of schemes that are not supported by video decoder 48 and video output 44. For example, if video decoder 48 is capable of decoding, and video output 44 is capable of displaying, video data in an omnidirectional format, retrieval unit 52 may search the manifest file for a representation that includes a track storing video data using an omnidirectional video scheme, e.g., a representation having "resv.odvd" in its @codecs parameter. Likewise, if video decoder 48 is not capable of decoding frame-packed video data, retrieval unit 52 may refrain from retrieving video data of a track indicated in the manifest file as having "resv.stvi" in its @codecs parameter.
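Pulling these pieces together, a client-side selection step could look like the following sketch, which keeps only representations whose constrained-scheme codes are all supported by the decoder and renderer; the capability set and the @codecs values are hypothetical.

```python
def supported(codecs_value: str, supported_schemes: set[str]) -> bool:
    """Return True if every constrained-scheme element in the @codecs value
    is in the client's supported set (non-scheme elements are the codec itself)."""
    scheme_codes = {"stvi", "odvd", "povd", "fovd", "erp", "cmp", "vdoc"}
    elements = codecs_value.split(".")
    if elements[0] != "resv":
        return True  # not a constrained scheme; defer to normal codec checks
    return all(e in supported_schemes for e in elements[1:] if e in scheme_codes)

client_capabilities = {"odvd", "povd", "erp"}  # hypothetical: equirectangular omnidirectional video supported
for rep, codecs in [("rep1", "resv.odvd.povd.erp.hvc1.1.6.L93.B0"),
                    ("rep2", "resv.stvi.hvc1.1.6.L93.B0")]:
    print(rep, "selectable" if supported(codecs, client_capabilities) else "skipped")
```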
In this manner, the method of FIG. 5 represents an example of a method that includes: retrieving a manifest file that specifies data for at least one representation of a media presentation, wherein the manifest file includes data that specifies one or more codecs for the at least one representation; extracting the data specifying the one or more codecs from the manifest file, the extracting including: extracting a first element of a sample entry type code representing a track of the at least one representation, wherein the first element represents that the track contains video data stored using a constrained scheme; and extracting a second element of a constrained scheme type code representing the constrained scheme for the track; and retrieving data of the at least one representation based on the first element and the second element.
In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or program code on a computer-readable medium and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which corresponds to tangible media, such as data storage media, or communication media, including any medium that facilitates transfer of a computer program from one place to another, such as according to a communication protocol. In this manner, a computer-readable medium may generally correspond to (1) a tangible computer-readable storage medium that is non-transitory, or (2) a communication medium such as a signal or carrier wave. A data storage medium may be any available medium that can be accessed by one or more computers or one or more processors to retrieve instructions, program code, and/or data structures for implementing the techniques described in this disclosure. The computer program product may include a computer-readable medium.
By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory tangible storage media. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors, such as one or more Digital Signal Processors (DSPs), general purpose microprocessors, Application Specific Integrated Circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Thus, the term "processor," as used herein may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein. Further, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding or incorporated in a combined codec. Furthermore, the techniques may be fully implemented in one or more circuits or logic elements.
The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including wireless handsets, Integrated Circuits (ICs), or collections of ICs (e.g., chipsets). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit or provided by a collection of interoperability hardware units, including one or more processors as described above, along with suitable software and/or firmware.
Various examples have been described. Such and other examples are within the scope of the following claims.
Claims (41)
1. A method of retrieving media data, the method comprising:
retrieving a manifest file that specifies data for at least one representation of a media presentation, wherein the manifest file includes data that specifies one or more codecs for the at least one representation;
extracting the data specifying the one or more codecs from the manifest file, the extracting including:
extracting a first element of a sample entry type code representing a track of the at least one representation, wherein the first element represents that the track contains video data stored using a constrained scheme; and
extracting a second element representing a constrained scheme type code for the constrained scheme for the track; and
retrieving media data of the at least one representation based on the first element and the second element.
2. The method of claim 1, wherein extracting the data specifying the one or more codecs comprises:
extracting a first set of values for one or more @codecs parameters, wherein the values for the @codecs parameters comprise respective period-delimited lists of elements, the period-delimited lists including the first element and the second element; and
extracting, from a different MIME-type parameter, a second set of values comprising additional information for the @codecs parameter.
3. The method of claim 2, wherein the additional information comprises information indicating a particular type of constrained scheme.
4. The method of claim 2, wherein the different MIME-type parameters comprise "odvdinfo", and wherein extracting the second set of values comprises extracting at least one of "povd", "fovd", "erp", or "cmp".
5. The method of claim 4, further comprising determining that the track of the at least one representation comprises video data that:
include projected omnidirectional video data when the constrained scheme type code comprises "povd";
contain ultra-wide-angle omnidirectional video data when the constrained scheme type code comprises "fovd";
include spherical unwrapped projected omnidirectional video data when the constrained scheme type code comprises "erp"; or
include cube map projected omnidirectional video data when the constrained scheme type code comprises "cmp".
6. The method of claim 2, wherein the additional information comprises information indicating a particular type of frame-encapsulated video data.
7. The method of claim 6, wherein the different MIME-type parameters comprise "fpvdinfo".
8. The method of claim 6, wherein the particular type of frame-encapsulated video data comprises elements corresponding to a stereo scheme and a stereo indication type.
9. The method of claim 1, further comprising:
receiving one or more values for a MIME-type parameter "hdrinfo" for the track, the values for the MIME-type parameter including information for at least one of high dynamic range (HDR) or wide color gamut (WCG) video data for the track,
wherein retrieving said media data of said at least one representation comprises retrieving media data of said track based on said value for said MIME-type parameter.
10. The method of claim 9, wherein the one or more values are a single value for the MIME-type parameter "hdrinfo".
11. The method of claim 9, wherein the one or more values are a comma-separated list of values for the MIME-type parameter "hdrinfo", each of the values in the comma-separated list comprising a single element or a list of period-separated elements.
12. The method of claim 11, wherein at least one of the values in the comma-separated list comprises a list of period-separated elements, the elements comprising hexadecimal representations of HDR or WCG fields including colour_primaries, transfer_characteristics, matrix_coeffs, and full_range_flag.
13. The method of claim 1, wherein the manifest file comprises a dynamic adaptive streaming over HTTP (DASH) Media Presentation Description (MPD).
14. The method of claim 1, wherein the manifest file specifies data for a plurality of representations including the at least one representation, the data including data specifying one or more codecs for each of the plurality of representations.
15. The method of claim 14, further comprising selecting the at least one representation in response to determining that a client device includes a video codec conforming to the sample entry type code and the scheme type code.
16. The method of claim 1, wherein the representation comprises a plurality of segments, each of the segments comprising an individually retrievable file associated with a unique Uniform Resource Locator (URL).
17. The method of claim 1, wherein retrieving the media data of the at least one representation comprises at least one of retrieving a segment of the representation using an HTTP GET request or retrieving a partial segment of the representation using an HTTP partial GET request.
18. The method of claim 1, wherein the sample entry type code comprises "resv".
19. The method of claim 1, wherein the constrained scheme type code comprises at least one of "stvi", "odvd", "povd", "fovd", "erp", or "cmp".
20. The method of claim 19, further comprising determining that the track of the at least one representation comprises video data that:
including frame-packaged video data when the constrained scheme type code comprises "stvi";
contain omnidirectional video data when the constrained scheme type code comprises "odvd";
include projected omnidirectional video data when the constrained scheme type code comprises "povd";
contain ultra-wide-angle omnidirectional video data when the constrained scheme type code comprises "fovd";
include spherical unwrapped projected omnidirectional video data when the constrained scheme type code comprises "erp"; or
include cube map projected omnidirectional video data when the constrained scheme type code comprises "cmp".
21. The method of claim 1, wherein extracting the data specifying the one or more codecs further comprises extracting a third element in response to determining that the constrained scheme type code comprises an "odvd," the third element representing a type of omnidirectional video data for the track.
22. The method of claim 21, further comprising determining that the track of the at least one representation comprises video data that:
include projected omnidirectional video data when the third element includes a value of "povd"; or
include ultra-wide-angle omnidirectional video data when the third element includes a value of "fovd".
23. The method of claim 21, wherein extracting the data specifying the one or more codecs further comprises extracting a fourth element indicating a type of projection in response to determining that the third element comprises a value of "povd".
24. The method of claim 23, further comprising, in response to determining that the video data of the track comprises projected omnidirectional video data, determining that the projected omnidirectional video data:
includes a spherical unwrapped projection when the fourth element has a value of "erp"; or
includes a cube map projection when the fourth element has a value of "cmp".
25. The method of claim 1, wherein extracting the data specifying the one or more codecs comprises extracting a plurality of elements following the first element and the second element.
26. The method of claim 1, wherein extracting the data specifying the one or more codecs comprises extracting values for one or more @codecs parameters, wherein the values for the @codecs parameters comprise respective period-delimited lists of elements.
27. The method of claim 1, further comprising extracting data specifying that the track includes video data with a display orientation change, wherein the track is included in a media file of the at least one representation, and wherein retrieving the media data of the at least one representation comprises retrieving media data of the track based on the data specifying that the track includes the video data with the display orientation change.
28. The method of claim 27, wherein the data specifying that the track includes the video data with the display orientation change comprises a "vdoc" value for a constrained scheme type of the track.
29. The method of claim 27, wherein extracting comprises extracting the value for the constrained scheme type from a RestrictedSchemeInfoBox for the track.
30. The method of claim 27, further comprising, in response to determining that the track includes the video data with the display orientation change, extracting data from a SchemeInformationBox indicating whether the display orientation change includes either or both of a rotation or a flip.
31. The method of claim 30, further comprising:
determining that the display orientation change includes both rotation and flipping when the SchemeInformationBox has a value of 0;
determining that the display orientation change comprises a rotation when the SchemeInformationBox has a value of 1; or
determining that the display orientation change comprises a flip when the SchemeInformationBox has a value of 2.
32. An apparatus for retrieving media data, the apparatus comprising:
a memory configured to store media data; and
one or more processors implemented in circuitry and configured to:
retrieving a manifest file that specifies data for at least one representation of a media presentation, wherein the manifest file includes data that specifies one or more codecs for the at least one representation;
extracting, from the manifest file, the data specifying the one or more codecs, the data including:
a first element representing a sample entry type code for a track of the at least one representation, wherein the first element represents that the track contains video data stored using a constrained scheme; and a second element representing a constrained scheme type code for the constrained scheme for the track; and
retrieving the media data of the at least one representation based on the first element and the second element.
33. The device of claim 32, wherein to extract the data specifying the one or more codecs, the one or more processors are configured to:
extract a first set of values for one or more @codecs parameters, wherein the values for the @codecs parameters comprise respective period-delimited lists of elements; and
extract, from a different MIME-type parameter, a second set of values comprising additional information for the @codecs parameter.
34. The device of claim 33, wherein the different MIME-type parameters comprise "odvdinfo", and wherein extracting the second set of values comprises extracting at least one of "povd", "fovd", "erp", or "cmp", and wherein the one or more processors are further configured to determine that the track of the at least one representation comprises video data that:
include projected omnidirectional video data when the constrained scheme type code comprises "povd";
contain ultra-wide-angle omnidirectional video data when the constrained scheme type code comprises "fovd";
include spherical unwrapped projected omnidirectional video data when the constrained scheme type code comprises "erp"; or
include cube map projected omnidirectional video data when the constrained scheme type code comprises "cmp".
35. The device of claim 33, wherein the additional information comprises information indicating a particular type of frame-encapsulated video data, wherein the different MIME-type parameters comprise "fpvdinfo", and wherein the particular type of frame-encapsulated video data comprises elements corresponding to stereo_scheme and stereo_indication_type.
36. The device of claim 32, wherein the one or more processors are further configured to receive one or more values for a MIME-type parameter "hdrinfo" for the track, the values for the MIME-type parameter including information for at least one of high dynamic range (HDR) or wide color gamut (WCG) video data for the track, wherein to retrieve the media data of the at least one representation, the one or more processors are configured to retrieve media data of the track based on the values for the MIME-type parameter.
37. The device of claim 36, wherein the one or more values comprise one of a single value for the MIME-type parameter "hdrinfo" or a comma-separated list of values for the MIME-type parameter "hdrinfo", each of the values in the comma-separated list comprising a single element or a list of period-separated elements.
38. The apparatus of claim 32, wherein the manifest file comprises a dynamic adaptive streaming over HTTP (DASH) Media Presentation Description (MPD).
39. The device of claim 32, wherein the device comprises at least one of:
an integrated circuit;
a microprocessor; or
a wireless communication device.
40. An apparatus for retrieving media data, the apparatus comprising:
means for retrieving a manifest file that specifies data for at least one representation of a media presentation, wherein the manifest file includes data that specifies one or more codecs for the at least one representation;
means for extracting the data specifying the one or more codecs from the manifest file, the means comprising:
means for extracting a first element of a sample entry type code representing a track of the at least one representation, wherein the first element represents that the track contains video data stored using a constrained scheme; and
means for extracting a second element representing a constrained scheme type code for the constrained scheme for the track; and
means for retrieving media data of the at least one representation based on the first element and the second element.
41. A computer-readable storage medium having instructions stored thereon that, when executed, cause a processor to:
retrieve a manifest file that specifies data for at least one representation of a media presentation, wherein the manifest file includes data that specifies one or more codecs for the at least one representation;
extract, from the manifest file, the data specifying the one or more codecs, wherein the instructions that cause the processor to extract the data include instructions that cause the processor to:
extract a first element of a sample entry type code representing a track of the at least one representation, wherein the first element represents that the track contains video data stored using a constrained scheme; and
extract a second element representing a constrained scheme type code for the constrained scheme for the track; and
retrieve media data of the at least one representation based on the first element and the second element.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US62/477,350 | 2017-03-27 | ||
| US15/935,553 | 2018-03-26 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK40009749A HK40009749A (en) | 2020-06-26 |
| HK40009749B true HK40009749B (en) | 2022-11-04 |