GB2567485A - Method and device for exchanging data between a web application and an associated web engine - Google Patents
- Publication number: GB2567485A
- Authority: GB (United Kingdom)
- Legal status: Granted
Classifications
All under H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]:
- H04N21/26258—Content or additional data distribution scheduling for generating a list of items to be played back in a given order, e.g. playlist
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream
- H04N21/23439—Reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements, for generating different versions
- H04N21/435—Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
- H04N21/4728—End-user interface for selecting a Region Of Interest [ROI], e.g. for requesting a higher resolution version of a selected region
- H04N21/84—Generation or processing of descriptive data, e.g. content descriptors
- H04N21/8456—Structuring of content by decomposing the content in the time domain, e.g. in time segments
Abstract
Parsing (de-encapsulating), by a web engine 440, initialization data and initializing video components, wherein encoded video data is organized into a plurality of video components and at least one video component is able to be simultaneously displayed with another video component; obtaining, by the web engine from the initialization data, information about the ability of each video component to be simultaneously selected with another video component; signalling this information to a web application 400; the web application selecting part of the video components based on the information and indicating to the web engine the video components to be decoded. The method may allow a web browser to control the decoding of a region of interest (ROI) or a virtual reality stream based on multiple video tracks. Categories may be determined for the video components, indicating whether they require other video components to be decoded.
Description
The present disclosure concerns a method and a device for the adaptive streaming of a video sequence in a Web application. It concerns more particularly the Media Source Extension Application Programming Interface (MSE API) used by a Web application in the browser to interact with the decoding engine of a media presentation.
Adaptive streaming protocols such as DASH or HLS provide a manifest file that describes several alternative representations of the same media, for instance with different bitrates and encoding characteristics. A Web application dynamically selects one of these alternatives according to its decoding capabilities and network resources and requests the selected alternatives for streaming.
The media is divided into individual media segments having a duration of a few seconds. This allows a Web application to reselect a suitable representation at each media segment. A media segment contains audio and/or video elementary streams encapsulated in a standard file format, typically the ISO Base Media File Format (ISO BMFF, ISO/IEC 14496-12) for the H.264 and H.265 video codecs. The W3C standardized the Media Source Extensions (MSE) API to allow Web applications to control the decoder of a web browser. The MSE API provides an interface to control which video track and which audio tracks of the media segment should be decoded. The specification provides directives and signaling means to decode simultaneously one video track and several audio tracks. This means that the known MSE API allows the decoding of only one video track at a time.
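For illustration, a minimal sketch of this standard MSE usage follows; the codec string and segment URL are placeholders, not taken from this document, and track-list support varies across browsers:

```javascript
// Minimal sketch of standard MSE usage; codec string and URL are illustrative.
const video = document.querySelector('video');
const mediaSource = new MediaSource();
video.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener('sourceopen', async () => {
  const sb = mediaSource.addSourceBuffer('video/mp4; codecs="hvc1.1.6.L93.B0"');
  // The initialization segment is appended first; it lets the engine discover
  // the tracks and populate video.videoTracks / video.audioTracks.
  const init = await fetch('init.mp4').then(r => r.arrayBuffer());
  sb.appendBuffer(init);
});

video.addEventListener('loadedmetadata', () => {
  // With the known API, at most one video track may have selected === true.
  for (let i = 0; i < video.videoTracks.length; i++) {
    console.log(video.videoTracks[i].id, video.videoTracks[i].kind);
  }
});
```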
A server providing media content typically proposes several versions of the same media provided with different encoding parameters. These versions typically exhibit different resolutions and/or encoding bitrates in order to adapt to different clients and network capabilities. These different versions of a same media are called different representations of the media.
Adaptive streaming makes it possible for a Web application to choose one representation of the video sequence that provides the best compromise between the consumption of network and processing resources and the quality of experience of the user. The final objective is to provide the user with the best possible quality depending on the streaming conditions and client capabilities.
Video codecs provide coding mechanisms that can be advantageously used for efficient adaptive streaming. For instance, one approach to encoding high-resolution video sequences consists in splitting each frame into independently decodable portions. The H.265/HEVC codec provides motion-constrained tile and slice coding structures, which make it possible to subdivide each frame into spatial parts that can be encoded and decoded independently and in parallel. This kind of approach is commonly used for the generation of virtual reality or immersive content, which forms a panorama of a scene in very high resolution to cover the surrounding environment with a 360° field of view. The field of view is divided into different tiles. To render a particular angle of view, only the tiles covered by that angle of view need to be decoded instead of the complete 360° field of view.
A second approach to providing several representation alternatives of the same content to an adaptive streaming client relies on scalable encoding, typically the scalable extension of H.265/HEVC. Scalable encoding allows the efficient generation of bitstream alternatives with different bitrates. The quality, temporal and/or spatial resolution of the decoded stream is progressively increased by retrieving first the base layer (first layer) followed by the successive enhancement layers.
The MPEG standardization community has recently published an extension of the ISO BMFF specification that addresses in particular the new coding characteristics of H.265/HEVC, which supports tile encoding, and the layered HEVC extension. The general approach to describing the different layers and spatial parts is to split the bitstream into several tracks. For instance, ISO BMFF signals independently decodable spatial subparts (e.g. HEVC tiles) in a video sequence by defining one track for each tile. The dependencies between the tracks are described in the ISO BMFF file format.
A Web application that decodes several layers or spatial subparts to obtain a pre-determined quality level requires the decoding of several video tracks in parallel. As the known MSE API allows the decoding of only one video track, there is a need to adapt the known method to allow a web browser to control the decoding of a region of interest (ROI) or of virtual reality streams based on multiple video tracks.
The present invention has been devised to address one or more of the foregoing concerns. It concerns an extension of the MSE API. The browser signals to the Web application the relationships between all video tracks present in the media. For example, it indicates that the track corresponding to a spatial subpart depends on or is referenced by another video track. The Web application can then determine which tracks can be selected and choose the appropriate tracks to respond to the user's requests.
In another embodiment, the decoder of the browser relies on the signaling, meaning information provided in addition to the bitstream, to select the spatial subparts dynamically when the decoder supports it. The web browser provides additional signaling to indicate whether the Web application can select the spatial subpart track. When the Web application selects one spatial subpart track, the web browser automatically activates the decoding of dependent streams and notifies the Web application. The invention applies not only to streams with spatial subpart access but also to layered HEVC streams.
According to a first aspect of the invention there is provided a method for exchanging data between a Web application and an associated Web engine, said data being related to encoded video data organized into a plurality of video components, at least one video component being able to be simultaneously displayed with another video component, the method comprising:
- parsing, by the Web engine, initialization data and initializing said video components, wherein the method further comprises
- obtaining, by the Web engine, from the initialization data, information about the ability of each video component to be simultaneously selected with another video component;
- signalling said information by the Web engine to the Web application;
- selecting part of the video components by the Web application based on said information, and
- indicating, by the Web application to the Web engine, the selected video components to be decoded by the Web engine.
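By way of illustration only, the exchange could look as follows from the Web application side; how the simultaneity information is surfaced is an assumption here, since the later embodiments describe several alternative attributes, and the track indices are illustrative:

```javascript
// Hypothetical sketch of the claimed exchange; attribute usage is illustrative.
video.addEventListener('loadedmetadata', () => {
  const tracks = video.videoTracks; // initialized by the Web engine (parsing step)
  for (let i = 0; i < tracks.length; i++) {
    // The Web engine signals, per component, whether it can be selected
    // simultaneously with another component (signalling step).
    console.log(tracks[i].id, tracks[i].kind);
  }
  // The Web application selects part of the components (selection step) and
  // indicates them to the Web engine (indication step).
  tracks[1].selected = true;
  tracks[2].selected = true; // would be allowed under the proposed extension
});
```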
In an embodiment, the step of indicating by the Web application to the Web engine the selected video components to be decoded comprises:
- associating a selected state with the video component, said selected state indicating whether the video component is selected or not.
In an embodiment, the information comprises an attribute, and signalling said information comprises:
- determining a “composite” category for a video component that requires other video components to be decoded;
- determining a “composition reference” category for a video component required by a “composite” video component;
- setting the attribute of the video component to the determined category.
In an embodiment, the method further comprises:
- predefining, by the Web engine, as selected the selected state of the first “composition reference” component in presentation order and the selected state of all components that depend on this first “composition reference” component.
In an embodiment, the method further comprises:
- forbidding, by the Web engine, the Web application from changing the predefined selected state of a “composite” component.
In an embodiment, the method further comprises:
- setting, by the Web engine, as selected the selected state of all components that depend on a “composition reference” component set as selected by the Web application.
In an embodiment, the information comprises an attribute, and signalling said information comprises:
- determining a “Tile track” category for a video component that contains samples corresponding to a spatial part of a video;
- determining a “Composite track” category for a video component that refers to other video components to compose an image;
- determining a “Tile base track” category for a video component common to one or more tile components that contains data shared among these one or more components;
- determining a “Base layer” category for a video component containing data of a base layer of a layered stream that can be decoded independently of other components;
- determining an “Enhancement layer” category for a video component containing data of an enhancement layer of a layered stream which depends on another component;
- determining a “Reference” category for a video component used only as a decoding reference and not intended to be displayed;
- setting the attribute of the video component to the determined category.
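A hedged sketch of how a Web engine might derive these categories from parsed ISOBMFF track information is given below. The trackInfo shape is a hypothetical internal representation; ‘hvt1’, ‘sabt’, ‘scal’ and ‘sbas’ are real ISO/IEC 14496-15 codes, but this particular mapping is an assumption for illustration, not the patent's prescribed algorithm:

```javascript
// Hedged sketch: deriving the category attribute from parsed ISOBMFF data.
function determineCategory(trackInfo) {
  if (trackInfo.sampleEntryType === 'hvt1') return 'Tile track';
  if (trackInfo.references.includes('sabt')) return 'Tile base track';
  if (trackInfo.references.includes('scal')) return 'Composite track'; // extractors
  if (trackInfo.references.includes('sbas')) return 'Enhancement layer';
  if (trackInfo.notIntendedForPresentation)  return 'Reference';
  return 'Base layer'; // independently decodable layer
}
```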
In an embodiment, the method further comprises:
- forbidding, by the Web engine, the Web application from changing a predefined selected state of a “Tile track” or “Reference” component.
In an embodiment, the information comprises an attribute, and signalling said information comprises:
- determining a “selectable” attribute for a video component as true when:
o the video component is a layer intended to be displayed;
o the video component is a composite or a tile base component; or
o the video component does not require data from another component to be decoded;
- determining the “selectable” attribute for the video component as false otherwise.
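Expressed as code, and assuming the category attribute above plus a hypothetical dependency flag set by the Web engine during parsing, this rule could be sketched as:

```javascript
// Minimal sketch of the "selectable" rule; kind values and the dependency
// flag are assumptions about the engine's internal track representation.
function isSelectable(track) {
  return track.isDisplayableLayer                // a layer intended for display
      || track.kind === 'Composite track'
      || track.kind === 'Tile base track'
      || !track.requiresOtherComponents;         // independently decodable
}
```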
In an embodiment, the method further comprises:
- forbidding, by the Web engine, the Web application from changing the selected state of a component whose “selectable” attribute is false.
In an embodiment, the information comprises an attribute, and signalling said information comprises:
- determining a “selectionstate” attribute for a video component as “selectable” when the video component is a layer intended to be displayed, when the video component is a composite or a tile base component, or when the video component can be independently displayed;
- determining the “selectionstate” attribute for the video component as “Combined” when the component needs to be combined with other video components;
- determining the “selectionstate” attribute for the video component as “NotSelectable” otherwise.
In an embodiment, the method further comprises:
- forbidding, by the Web engine, the Web application from changing the selected state of a component with a “selectionstate” attribute set to “NotSelectable”.
In an embodiment, the method further comprises:
- setting as false the selected state of all components that depend on a component whose selected state has been set to false by the Web application.
In an embodiment, the Web engine further provides a “position” attribute for video components corresponding to a subpart of a video, the “position” attribute indicating the position of the subpart.
In an embodiment, the Web engine further provides a “layer_id” attribute for video components corresponding to a layer of a layered video, the “layer_id” attribute indicating the layer identifier of the layer.
In an embodiment, indicating the relationships between the plurality of video components comprises:
- providing a “selectVideoTrackPosition” method as an interface to allow selection of the components at a given location.
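The patent names only the method; its signature is not specified here, so the call below is a hypothetical usage sketch under the assumption that the method takes pixel coordinates in the full-frame reference:

```javascript
// Hypothetical usage of selectVideoTrackPosition; the (x, y) signature is an
// assumption. Coordinates are expressed in the full-frame reference.
const x = 640, y = 360;
video.videoTracks.selectVideoTrackPosition(x, y);
// The Web engine would then select the tile component covering (x, y) along
// with any components it depends on (e.g. its tile base track).
```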
According to another aspect of the invention there is provided a computer program product for a programmable apparatus, the computer program product comprising a sequence of instructions for implementing a method according to the invention, when loaded into and executed by the programmable apparatus.
According to another aspect of the invention there is provided a computer-readable storage medium storing instructions of a computer program for implementing a method according to the invention.
According to another aspect of the invention there is provided a client device configured to execute the steps of the method of the invention.
According to another aspect of the invention there is provided a streaming system comprising a server device and a client device according to the invention.
At least parts of the methods according to the invention may be computer implemented. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a circuit, module or system. Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
Since the present invention can be implemented in software, the present invention can be embodied as computer readable code for provision to a programmable apparatus on any suitable carrier medium. A tangible, non-transitory carrier medium may comprise a storage medium such as a floppy disk, a CD-ROM, a hard disk drive, a magnetic tape device or a solid state memory device and the like. A transient carrier medium may include a signal such as an electrical signal, an electronic signal, an optical signal, an acoustic signal, a magnetic signal or an electromagnetic signal, e.g. a microwave or RF signal.
Embodiments of the invention will now be described, by way of example only, and with reference to the following drawings in which:
Figure 1 illustrates a general principle of media streaming over HTTP, on which embodiments of the invention are based;
Figure 2 illustrates an example of a video stream considered in an embodiment of the current invention;
Figure 3 illustrates an example of the hierarchical content of a DASH manifest file;
Figure 4 illustrates the architecture for the client in an embodiment of the invention;
Figure 5 illustrates the network process of the Web application in an embodiment of the invention;
Figure 6 illustrates the processing of the interface module of the Web application in an embodiment of the invention;
Figure 7 illustrates the processing of the Web Engine in the browser in an embodiment of the invention;
Figure 8a illustrates an example of the decoding pipeline operated in the Web browser;
Figure 8b is a second example of decoding pipeline;
Figure 9 illustrates an example of file format representation in tracks for a video sequence subdivided into four parts;
Figure 10 is a schematic block diagram of a computing device for implementation of one or more embodiments of the invention.
As illustrated, media server 100 comprises media presentations, among which, in particular, media presentation 105 contains different media content components, e.g. audio and video data streams. Audio and video streams can be interleaved or stored independently. The media presentation can propose alternative versions of media content components (with different bitrate, quality, resolution, sampling rate, etc.). A media content component (or media component) is typically a video component, an audio component or any other multimedia component. A media content component can also be a sub-part of a video, typically a spatial part of the video stream, one layer of a scalable or multiview video sequence, or a sub-temporal layer.
For example, the media content components of this media presentation are encapsulated according to the ISO Base Media File Format and DASH recommendations. As a result of the encapsulation step, each alternative version (or Representation in the DASH context, e.g. Representation 1 and Representation 2) is temporally split into small independent and consecutive temporal media segments (e.g. temporal media segments 110-1 to 110-3 and 111-1 to 111-3, respectively), for example media segments conforming to the standard (ISO/IEC 23009-1), that can be addressed and downloaded independently. Each media segment may contain one or more media content components. The server 100 determines the addresses (i.e., HTTP URL addresses in the illustrated example) for all the media segments and creates a manifest as described herein below by reference to Figure 3.
A manifest, for example a MPD, is a document, typically an XML file (or even a plain text file, for HTTP Live Streaming), that describes all the media content components that can be accessed for a given media presentation. Such a description may comprise the types of the media content components (for example audio, video, audio-video, metadata, or text), the durations of the media segments, and the addresses (e.g. the URL) associated with the media segments, that is to say the addresses from which the media content components can be obtained.
Typically, an MPD is based on a hierarchical data model as depicted in Figure 3. It consists of one or multiple periods (reference 300 in Figure 3), each period having a starting time and a duration and consisting of one or multiple adaptation sets (reference 301 in Figure 3). An adaptation set provides information about one or multiple media content components and their various encoded alternatives (reference 302 in Figure 3), each encoded alternative of the same media content component being referred to as a Representation. In turn, each Representation typically consists of one or multiple media and/or initialization segments (reference 303 in Figure 3).
For the sake of illustration, audio and video streams of media presentation 105 are considered interleaved. These interleaved audio and video data streams are proposed as two alternative versions, each version being split into consecutive temporal media segments, for example into three consecutive temporal media segments 110-1 to 110-3 and 111-1 to 111-3 corresponding to three consecutive periods of time. The manifest file describes the media presentation as composed of at least one adaptation set (not represented) that comprises at least two versions that contain several media segments. Server 100 determines the addresses of these segments. These addresses and other items of information relative to the media content components and to media segments 110-1 to 110-3 and 111-1 to 111-3 are accessible in manifest 115 corresponding to media presentation 105.
A client 120 requests this manifest file (step 125). The client is typically a Web application that relies on JavaScript and Web APIs to control the adaptive streaming session with the server. The description of Figure 4 details an example of architecture for the client. After having been received, manifest file 115 is analyzed by client 120 to determine which presentations are available and which media segments 110-1 to 110-3 and 111-1 to 111-3 of media presentation 105 are accessible. Manifest file 115 is also used to determine the HTTP addresses of these media segments and the relations between these media segments. Moreover, manifest file 115 gives items of information about the content of the media presentation (i.e. interleaved audio and video in the given example). These items of information may comprise a resolution, a bit-rate, and similar information.
In view of this information, the adaptation logic module 150 (generally implemented in JavaScript) of the client 120 can therefore select media segments from appropriate versions and emit corresponding HTTP requests (step 130) for downloading these media segments. In response, server 100 transmits the requested temporal media segments (step 135). These temporal media segments, received in HTTP response 135, can be parsed (de-encapsulated) and then decoded in the appropriate media decoder 140 (typically one decoder per media type) and displayed on display 145. In particular, displaying may include a transformation process, for instance to project a panorama image into a new frame reference (display frame reference). The client may request several temporal media segments at once, for instance to retrieve two adjacent spatial parts of the video stream.
It is noted that server 100 may consist of separate servers or devices, each performing one or more of the following steps: generation of the media content, encapsulation of the media stream in file format, generation of the streaming manifest or playlist file, transmission of the media presentation, and transmission of media content, most often as content segments.
The client may thus issue requests for the manifest to a first server, for example an application server, and requests for the media content to one or more other servers, for example media servers or streaming servers. The server that transmits the media samples may also be different, for example if media is delivered through a CDN (Content Delivery Network).
The video sequence 200 of Figure 2 illustrates an example of a video stream considered in an embodiment of the current invention. The server generates, from an input video sequence, a video stream that efficiently responds to the adaptation requirements of adaptive streaming. The video sequence is encoded with two different quality levels. For instance, a scalable codec (the scalable extension of H.265/HEVC) compresses the video sequence into two scalable layers. The base layer 210 is composed of a set of images, 205-1 up to 205-n. Each frame is split into independent regions through tile partitioning. The tile 211 is encoded independently of other tiles, which makes it possible to decode the region R1 of the base layer independently from data encoding the other regions R2 to R6.
The enhancement layer 220 refines the resolution of the base layer 210. In addition, each image (225-1 to 225-n) is split into tiles to provide spatial access. The resulting stream is coding-efficient since the enhancement layer is predicted from the base layer. This introduces a dependency between the enhancement layer 220 and the base layer 210. The server uses at least one slice and at least one tile per spatial region to ensure that each region can be requested independently.
A slice in HEVC is a set of slice segments, with at least the first slice segment being an independent slice segment, the others, if any, being dependent slice segments. A slice segment contains an integer number of consecutive (in raster scan order) CTUs. A slice does not necessarily have a rectangular shape (it is thus less appropriate than tiles for spatial sub-part representations). A slice segment is encoded in the HEVC bitstream as a slice_segment_header followed by slice_segment_data. Independent slice segments (ISS) and dependent slice segments (DSS) differ by their header: the dependent slice segment has a shorter header because some information from the independent slice segment's header is not duplicated. Both independent and dependent slice segments contain a list of entry points in the bitstream.
When a video bitstream is encoded with tiles, the tiles can be motion-constrained to ensure that they do not depend on neighboring tiles in the same picture (spatial dependency) or on neighboring tiles in previous reference pictures (temporal dependency). Thus, motion-constrained tiles are independently decodable.
Alternatively, the packed image can be split into several spatial sub-pictures before encoding, each sub-picture being encoded independently, forming for instance an independent encoded HEVC bitstream.
Therefore, as a result of the encoding step, the video sequence 200 can be represented by one or more independent encoded bitstreams or by at least one encoded bitstream composed of one or more independently encoded sub-bitstreams.
Those encoded bitstreams and sub-bitstreams are then encapsulated in a file or in small temporal segment files according to an encapsulation file format, for instance according to the ISO Base Media File Format defined by the MPEG standardization organization. The resulting file or segment files can be an mp4 file or m4s segments. During the encapsulation, an audio stream may be added to the video bitstream, as well as metadata tracks providing information on the video or audio streams.
The ISO Base Media File Format (ISO BMFF, ISO/IEC 14496-12) is a well-known flexible and extensible format that describes encoded timed media data bitstreams either for local storage or for transmission via a network or another bitstream delivery mechanism. This file format is object-oriented. It is composed of building blocks called boxes that are sequentially or hierarchically organized and that define parameters of the encoded timed media data bitstream such as timing and structure parameters. In the file format, the overall presentation is called a movie. It is logically divided into tracks. Each track represents a timed sequence of an encoded version of a media component (samples corresponding to video frames, for example). It is to be noted that a component corresponds to a track and a track to a component; the two wordings may be used to refer to the same object. Within each track, each timed unit of data is called a sample; this might be a frame of video or audio. Samples are implicitly numbered in sequence. The movie can be organized temporally as a list of movie and track fragments. The actual samples are in boxes called MediaDataBoxes. Within a movie fragment, there is a set of track fragments, zero or more per track. The track fragments in turn contain zero or more track runs, each of which documents a contiguous run of samples for that track.
An encoded bitstream (e.g. HEVC) and possibly its sub-bitstreams (e.g. tiled HEVC, MV-HEVC, scalable HEVC) can be encapsulated as one single track. Alternatively, multiple encoded bitstreams that are spatially related (i.e. are sub-spatial parts of a projected image) can be encapsulated as several sub-picture tracks. Alternatively, an encoded bitstream (e.g. tiled HEVC, MV-HEVC, scalable HEVC) comprising several sub-bitstreams (tiles, views, layers) can be encapsulated as multiple sub-picture tracks.
A sub-picture track is a track embedding data for a sub-part of a picture or image. A sub-picture track may be related to other sub-picture tracks or to the track that describes the full picture from which the sub-picture is extracted. For example, a sub-picture track can be a tile track. It can be represented by an AVC track, an HEVC track, an HEVC tile track or any compressed video bitstream encapsulated as a sequence of samples.
A tile track (TT) is a sequence of timed video samples corresponding to a spatial part of an image or to a sub-picture of an image or picture. It can be, for example, a region of interest in an image or an arbitrary region in the image. The data corresponding to a tile track can come from a video bitstream or from a sub-part of a video bitstream. For example, a tile track can be an AVC or HEVC compliant bitstream, or a sub-part of AVC or HEVC or any encoded bitstream, like for example HEVC tiles. In a preferred embodiment a tile track is independently decodable (the encoder took care to remove motion prediction from other tiles). When a tile track corresponds to a video bitstream encoded in HEVC with tiles, it can be encapsulated into an HEVC tile track denoted as an ‘hvt1’ track as described in ISO/IEC 14496-15 4th edition. It can then refer to a tile base track to obtain parameter sets and high-level information to set up the video decoder. It can also be encapsulated into an HEVC ‘hvc1’ or ‘hev1’ track. A tile track can be used for spatial composition of sub-pictures into a bigger image or picture.
A tile base track (or TBT) is a track needed by one or more tile tracks, containing data or metadata that is shared among these one or more tracks. A tile base track may contain instructions to compose images from one or more tile tracks. Tile tracks may depend on a tile base track for complete decoding or rendering. When a tile base track derives from a video bitstream encoded in HEVC with tiles, it is encapsulated into an HEVC track denoted as an ‘hvc2’ or ‘hev2’ track. In addition, it is referenced by HEVC tile tracks via a track reference ‘tbas’, and it shall indicate the tile ordering using a ‘sabt’ track reference to the HEVC tile tracks as described in ISO/IEC 14496-15 4th edition.
A composite track (also denoted reference track or CT) is a track that refers to other tracks to compose an image. One example of a composite track is, in the case of video tracks, a track composing sub-picture tracks into a bigger image. This can be done by a post-decoding operation, for example in a track deriving from video tracks that provides transformations and transformation parameters to compose the images from each video track into a bigger image. A composite track can also be a track with extractor NAL units providing instructions to extract NAL units from other video tracks or tile tracks in order to form, before decoding, a bitstream resulting from sub-bitstream concatenation. A composite track can also be a track that implicitly provides composition instructions, for example through track references to other tracks.
ISO/IEC 14496-12 provides a box located at track level to describe groups of tracks, where each group shares a particular characteristic or the tracks within a group have a particular relationship. The particular characteristic or the relationship is indicated by the box type (track_group_type) of the contained boxes. The contained boxes include an identifier (track_group_id), which can be used to determine the tracks belonging to the same track group. All the tracks having a track group box with the same track_group_type and track_group_id values are part of the same track group. The MPEG OMAF standard proposes a specific track group for spatial composition as a TrackGroupTypeBox of type ‘spco’. The samples of each track in an ‘spco’ track group can be spatially composed with samples (at the same composition or decoding time) from other tracks in this same group to produce a bigger image.
Depending on encoded bitstreams and sub-bitstreams resulting from the encoding, several variants of encapsulation in file format are possible.
In ISOBMFF, three ways to place Layered HEVC (L-HEVC) streams in tracks are feasible. The first is to store all the layers in one track. This method is inconvenient for adaptive streaming since it forbids independent access to the different layers: the whole set of layers is transported and provided to the decoder at once. The second way is to place each layer or sub-layer in a different track. This approach fits adaptive streaming needs. For instance, the base layer is placed in one track and the enhancement layers are placed in separate tracks. The client may request the tracks individually and activate the decoding of one or more enhancement layers.
The last method is to place sets of layers in individual tracks to reflect the operating points (i.e. the sets of layers that could be output by the decoder) of the stream.
Some video sequences include depth information. The depth information is generally encapsulated in one track and is considered auxiliary information. This depth map track is meaningless when decoded alone and is dependent on the video track, which contains the texture samples. The depth map is typically an image where each pixel represents the depth, meaning distance information, of the object in the scene represented by this pixel in a video.
An ISOBMFF writer may generate video tracks which are not intended to be displayed. Typically, the data of such a track is placed in a separate track because it is used by several other tracks or may not be rendered by all displays. This avoids duplicating data in several tracks and/or presenting meaningless information not useful for a decoder. One example is the depth map track.
Another usage is when the author of the media presentation designs predetermined regions of interest. Each of these regions of interest is composed of a different set of subparts. In the file format, each of these ROIs is represented by one composite track that refers to several tile tracks. The author of the media presentation may indicate in the file format that the tile track selection is fixed. The Web application should not change this selection. For instance, the track includes a parameter which states that the track is not intended to be displayed, for instance a “non-displayable” or “track_not_intended_for_presentation_alone” parameter declared in the initialization segment (typically the track header).
In this document, we call a reference track a track that relies on other tracks (dependent tracks) to provide a decoded media sample. A reference track may contain the information to initialize a decoder and dependency information to retrieve the dependent tracks that generally contain the media samples. In the context of subpart decoding, we consider tile base tracks and composite tracks as reference tracks, and tile tracks as dependent tracks.
In the context of scalable encoding, the base layer is a reference track and the dependent enhancement layers are dependent tracks. The depth map track is a dependent track.
Figure 3 illustrates an example of the hierarchical content of a DASH manifest file. More precisely, it illustrates the content of a media presentation available at the server and the relation between each media component, also called media data, and the HTTP addresses.
For the sake of illustration, the media presentation may be temporally split into coarse-grained periods, called periods (splicing of arbitrary content).
A “period” at the MPD level describes all the media components that are available for a period of time (which could be the complete duration of the media presentation if there is only one period). Within this period, a media content component can be composed of several data segments corresponding to the small periods previously mentioned, to allow easy streaming, random accessing, and switching.
The MPD (e.g. an XML MPD) contains all the data corresponding to each period. Therefore, when receiving this information, a client is aware of the content of each period. For example, media presentation 300 is divided into several elements, each one corresponding to a period. Still for the sake of illustration, the second period spans from 100 s to 294 s.
Each media presentation’s period contains data that describes the available media content component for the corresponding period. One of the media presentation’s period denoted 301 is illustrated in more detail.
Several “adaptation set” elements are incorporated: one for the video description and one for the audio description. Each adaptation set is associated with a given track. In this example, the first adaptation set is associated with the video track and the second adaptation set is associated with the audio track corresponding to the video track for the considered period.
As illustrated, an adaptation set structure 302 contains information about the different possible Representations (i.e. versions) of the encoded video available at the server. In this example, the first Representation is a video having a spatial resolution of 640x480 that is encoded at the bit rate of 500 kbit/s. The field “Segment Info” 303 gives more parameters.
The second Representation is the same video that is encoded at a rate of 250 kbit/s. It may represent a decrease in quality compared to the first Representation for instance. The client will be able to switch between those Representations depending on the available bandwidth on the network.
The Representation may include a Role attribute that describes the intended use of the representation.
Each of these Representations can be downloaded by HTTP requests if the client knows the HTTP addresses related to the video. The association between the content of each Representation and an HTTP address is done by using an additional temporal sub-layer.
As illustrated with reference 303, the video Representation 302 is split into temporal segments (of 10 seconds in this example).
Each temporal segment 303 is a content stored at the server that is accessible through an HTTP address. In addition, an initialization segment is available. This initialization segment contains MP4 initialization information (if the video has been encapsulated by using the ISO BMFF or its extensions) describing the MP4 content of the encapsulated video. For example, it helps the client to instantiate the decoding algorithms related to the video. In particular, the initialization segment describes the tracks present in the media content.
The HTTP addresses of the initialization segment and the media segments are given in the MPD (or description) file.
In another example, the media presentation subdivides the frame into sub-parts. The server encapsulates each sub-part, typically a tile, in one track and associates one adaptation set with each sub-part to make it possible for the client to perform region of interest selection based on the MPD. The spatial relationships between the adaptation sets are described using the SRD descriptor in the MPD document. As a result, the client may select a pre-determined set of adaptation sets to decode one ROI which covers a set of sub-parts of the media presentation.
Figure 4 illustrates the architecture of the client in an embodiment of the invention. It contains two main processing units. The first unit is the Web application. The service provider develops a Web application that the server delivers. It implements the adaptive streaming process and the graphical user interface. For instance, three main processing modules constitute the Web application 400.
The network module 410 handles all the network communications. In particular, it performs the HTTP requests 130 (in Figure 1) and receives in return the HTTP responses 135 of the server. In the preferred embodiment, the client performs the HTTP communications with either the XMLHttpRequest or the Fetch API (from the Web Hypertext Application Technology Working Group [WHATWG]). The Web application may rely on any other API to perform the media segment requests. The network module 410 is detailed with reference to Figure 5.
The Adaptive Streaming (AS) controller module 440 handles the adaptation logic 150, which consists in determining the media segments that provide the best compromise between available resources and quality. In addition, it schedules the timing of the HTTP requests performed by the network module to ensure on-time delivery of media segments for continuous playing of the media. The Adaptation Logic (AL) may for instance follow a simple process: the AL retrieves from the Web application a default set of adaptation sets to play according to the Web browser capabilities (i.e. based on the supported codec formats and the resolution of the display). This default set of adaptation sets may be changed according to the user's custom choices. One Adaptation Logic algorithm may consist in probing the network bandwidth and allocating an equal rate distribution among the set of pre-selected adaptation sets. Based on the rate allocation target for each adaptation set, the AL selects the representation that has the best quality with a bandwidth below or ideally equal to the rate target. The AS controller orders requests for the media segments of the selected representation in each adaptation set of the set of pre-selected adaptation sets such that the occupancy of the reception buffer remains at a pre-determined level.
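As a rough sketch of this simple Adaptation Logic (the data shapes for adaptation sets and representations are illustrative, not from the patent):

```javascript
// Sketch of the simple AL: split the measured bandwidth equally among the
// pre-selected adaptation sets, then pick the best representation that fits.
function selectRepresentations(adaptationSets, measuredBandwidth) {
  const target = measuredBandwidth / adaptationSets.length; // equal allocation
  return adaptationSets.map(set => {
    const fitting = set.representations.filter(r => r.bandwidth <= target);
    const pool = fitting.length ? fitting : set.representations;
    // Best quality under the rate target, or the cheapest one if none fits.
    return pool.reduce((best, r) =>
      fitting.length ? (r.bandwidth > best.bandwidth ? r : best)
                     : (r.bandwidth < best.bandwidth ? r : best));
  });
}
```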
The invention is also compatible with more advanced AL algorithms of the prior art.
The interface module 420 manages both the display 145 and the graphical user interface presented to the user. For instance, the GUI is implemented using Web technologies such as HTML, CSS and JavaScript. This module is described with reference to Figure 6.
The second main processing unit of the client is the Web Engine, which is part of the Web browser. The Web Engine processes the Web resources provided by the Web application and renders them. In particular, it includes a decoder 140 to decode media resources. The Web Engine 440 of the Web browser includes a media decoding module which is able to decode at least one video coding format and preferably also audio coding formats. In the preferred embodiment, the web browser supports the H.265/HEVC format and its extensions, and tiles encapsulated in the ISO BMFF file format.
The Web application controls the decoder of the Web browser thanks to the Media Source Extensions (MSE) API 430. The Media Source Extensions API, standardized by the W3C, provides an interface to control the decoding of the media presentation 105 and to retrieve information on the coding characteristics of the media presentation.
In the Web context, HTMLMediaElement is an HTML element that represents a media element. It can be video, audio or both. The media resource that represents the data of the media element contains several audio and video tracks. When the browser decodes the media resource, it initializes a structure in the HTMLMediaElement that lists all video, audio and text tracks available or detected in the media resource.
The HTMLMediaElement comprises the three attributes audioTracks, videoTracks and textTracks that the decoder of the web browser initializes when decoding the initialization segment of the media resource associated with the HTMLMediaElement. These attributes are lists of tracks that can be accessed with an index value. Each track item is an AudioTrack (resp. VideoTrack, resp. TextTrack) object for the audioTracks (resp. videoTracks, resp. textTracks) list. These objects gather a set of description attributes.
The videoTracks, audioTracks and textTracks share the following attributes with the same semantic: “id”, “kind”, “label” and “language”.
The “id” attribute is the identifier of the track. The “kind” attribute is the category of the track. For instance, the specification describes the alternative, main, captions, descriptions, main-desc, sign, subtitles, translation and commentary categories. The main category refers to the primary audio or video track, while the alternative category corresponds to a possible alternative to the main track: typically, a different take of a song for audio tracks, or a different capture angle of the same content for video tracks.
The “label” attribute is a free string of characters that labels the track, or an empty string if no label is available. The “language” attribute encodes the language of the track.
Each AudioTrack object further comprises a boolean attribute named “enabled”. Each VideoTrack object further comprises a boolean attribute named “selected” defining a selected state of the track.
When accessed for reading, the “enabled” attribute of an audio track and the “selected” attribute of a video track indicate whether the Web Engine decodes and renders the corresponding track. Conversely, when the value of the attribute is modified, it is used to enable or disable the decoding of the track. The difference between the “enabled” and “selected” attributes is that at most one video track can be selected at once. That means that at most one VideoTrack object should have its selected attribute equal to true among all the video tracks described in the HTMLMediaElement.
When the Web application sets the “selected” attribute of one video track, all other video tracks are automatically deselected. For audio tracks, it is possible to activate the decoding of multiple audio tracks at once by setting the “enabled” attribute of several audio tracks.
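The difference in semantics can be illustrated as follows (a minimal sketch against the standard track lists):

```javascript
// Current HTML semantics: selecting one video track deselects the others,
// whereas several audio tracks may be enabled at the same time.
video.videoTracks[0].selected = true; // any other selected video track is deselected
for (let i = 0; i < video.audioTracks.length; i++) {
  video.audioTracks[i].enabled = true; // multiple audio tracks allowed
}
```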
One objective of the invention is to relax this constraint to support the decoding of multiple video tracks at once when the tracks are dependent on each other.
The MSE API defines the MediaSource object as a source of media data for an HTMLMediaElement. Each MediaSource element handles at least one SourceBuffer object. A Web application appends encoded data to a SourceBuffer as segments of media. SourceBuffer comprises several attributes, including an audioTracks attribute, a videoTracks attribute and a textTracks attribute. These audioTracks, videoTracks and textTracks attributes are the same as for HTMLMediaElement.
When the application appends a buffer containing the data of a media segment to the SourceBuffer, the web engine parses the bitstream. The application first appends the initialization segment. The web engine parses the headers of the file format and extracts the track parameters. It uses this information to set the audioTracks, videoTracks and textTracks attributes.
In the following, we describe embodiments that make it possible to activate several video tracks when the tracks of the media resource correspond to sub-bitstreams, for instance a sub-part track or an enhancement layer track.
Figure 5 illustrates the network process 410 of the Web application in an embodiment of the invention. This process starts with the creation of the
MediaSource and associated SourceBuffer instances in step 520. The network module creates one MediaSource instance and one SourceBuffer for each media element.
In a second step 500, the network module performs the media segment requests. Each request corresponds to one of the media segments selected by the AS controller 440 from the parsing of the MPD, the user choices and the analysis of the streaming conditions. The media segment request is typically performed using the Fetch API. The content of the Fetch response is a buffer of data corresponding to the media segment 510. It is appended in step 530 to the SourceBuffer in order to start the decoding of the stream. The first appended media segment is the initialization segment, which permits initializing the decoder, which itself initializes the SourceBuffer attributes, in particular by specifying the track configuration of the stream.
Particular coding and storage configurations of media require the network module to request several media segments simultaneously to play the media element. For instance, data corresponding to audio and video tracks may be stored at different locations. The URLs to retrieve the audio segment and the video segment are different, and the network module thus sends two HTTP requests in parallel. In such a case, it is more convenient to instantiate two SourceBuffer objects, one for audio segments and one for video segments. The HTTP response to the video segment request is appended to a first SourceBuffer in charge of video, and the data of the audio segment is appended to the second SourceBuffer.
The same approach applies to a media presentation which comprises sub-bitstreams. The network module may request several media segments, one for each sub-bitstream. For instance, the user may choose to view one region of interest of the video sequence, which corresponds to two tiles of the video streams. The AS controller selects the corresponding media segments and the network module requests them simultaneously. The network module creates as many SourceBuffers as requested sub-bitstreams during initialization step 520 and applies steps 500, 510 and 530 in parallel for each sub-bitstream.
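A hedged sketch of this parallel behaviour follows; the codec string is illustrative, error handling is omitted, and the initialization segment is assumed to be the first URL of each list:

```javascript
// One SourceBuffer per requested sub-bitstream (step 520); segments are
// fetched (steps 500/510) and appended (step 530) in parallel. Assumes the
// MediaSource is already in the 'open' state.
async function streamSubBitstreams(mediaSource, segmentUrlLists) {
  await Promise.all(segmentUrlLists.map(async urls => {
    const sb = mediaSource.addSourceBuffer('video/mp4; codecs="hvc1.1.6.L93.B0"');
    for (const url of urls) {
      const data = await fetch(url).then(r => r.arrayBuffer());
      sb.appendBuffer(data);
      // Wait for the engine to finish processing before the next append.
      await new Promise(res => sb.addEventListener('updateend', res, { once: true }));
    }
  }));
}
```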
Figure 6 illustrates the processing of the interface module of the Web application in an embodiment of the invention.
The interface module of the Web application is the interface between the user and the Web application. It is in charge of retrieving the user's choices for playing the media. Depending on these choices, it configures the Web application and directs the network and AS controller modules to select, download and decode the media components.
Once the Web Engine decodes the initialization segment, it lists and describes the tracks of the media. The interface module retrieves this information in step 600. For example, with the MSE API, this consists in waiting for the HTMLMediaElement to change its readyState to HAVE_METADATA, indicating that the initialization segment has been processed. Then, the Web application accesses the audioTracks, videoTracks and textTracks attributes to identify the available tracks in step 600.
In step 610, the Web application determines a set of tracks that can be decoded. This set is a subset of the tracks listed in step 600. Indeed, not all the tracks may be selectable, as detailed in the following. The Web application determines in step 620 which tracks should effectively be decoded, according to user playing choices and/or coding dependencies between the tracks.
The set of tracks selected in step 620 corresponds to a subset of the data retrieved from the server. In particular, in step 630, the Web application notifies the AS controller module of the tracks selected in step 620. By doing so, the AS controller module avoids requesting media segments that do not contain data required to decode the selected tracks.
Finally, the Web application configures the decoder through the Media Source Extensions API to decode only the selected tracks. In step 640, the interface module successively sets the “selected” attribute of the video tracks in the set of tracks selected in step 620. Similarly, it sets the “mode” attribute of the selected text tracks and the “enabled” attribute of the audio tracks.
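A sketch of step 640, under the same caveat about track-list typings; the set of selected video track identifiers is assumed to come from step 620.

```ts
function activateSelection(video: HTMLVideoElement, selectedVideoIds: Set<string>): void {
  const videoTracks = (video as any).videoTracks;
  for (let i = 0; i < videoTracks.length; i++) {
    videoTracks[i].selected = selectedVideoIds.has(videoTracks[i].id);  // "selected" attribute
  }
  const audioTracks = (video as any).audioTracks;
  for (let i = 0; i < audioTracks.length; i++) {
    audioTracks[i].enabled = true;          // "enabled" attribute of audio tracks
  }
  for (let i = 0; i < video.textTracks.length; i++) {
    video.textTracks[i].mode = 'showing';   // "mode" attribute of text tracks
  }
}
```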
Figure 7 illustrates the processing of the Web Engine 440 in the browser in an embodiment of the invention. It concerns the actions performed when the Web application appends the first media segment, the initialization segment, to the SourceBuffer. This comprises the initialization of the attributes of the HTMLMediaElement, SourceBuffer and MediaSource objects to indicate the coding structure of the media segment to the Web application, but also the configuration of the decoder of the browser according to the Web application requests performed in step 640 of Figure 6.
After the media segment 701 is retrieved from the SourceBuffer object memory, the browser parses the media segment. This consists in parsing, in step 700, the syntax elements from the file format headers to extract the information related to the different tracks of the media. In particular, it extracts the description of each track, which is used to initialize the track attributes: the identifier, the category, the label and the language are extracted from the file format in steps 710 to 730 for each track described in the initialization segment. For instance, in ISOBMFF, the identifier of the track corresponds to the track ID. The category may be set in the ‘kind’ box of the track. The language attribute may be defined in the Media Header Box of the track or in the Extended Language Tag box. The label of the track may be defined in one UserData box attribute.
The browser successively processes each track of the initialization segment in a processing loop composed of steps 710 to 760. The objective is to provide information on the video tracks so that the Web application is able to determine how the video tracks should be decoded. This makes it possible to determine which video tracks can be selected. In the following description, several alternatives detail the relations between the different video tracks.
The decoding of a tile track differs depending on the types of tracks present in the media presentation. In a first example, the media presentation contains a set of tile tracks, one for each spatial subpart, and one or more composite tracks. Composite tracks refer to the tile tracks to compose an image. For example, the server presents a media presentation which contains a single layer subdivided into tiles, as represented in Figure 2 by the set 210 of frames. The media presentation contains six tile tracks. The author of the media presentation provides access to three different decoding alternatives. The first one contains all six subparts. The second contains the subparts R1, R2, R4 and R5, while the third one is composed of R2, R3, R5 and R6. The server generates three composite tracks, one for each alternative. These composite tracks refer to the tile tracks to form the encoded data for each alternative. A tile base track provides the same features.
Each composite track relies on NAL unit extractors to compose the new video sequences. Specific NAL unit extractors make it possible to replace a range of encoded data with other data. This permits, in particular, replacing the headers of slices in the tile tracks when the composite track reorders the subparts in the new video sequence or changes the size of the video sequence.
As a result, even if the composite tracks share the same tile tracks, each composite track may have a different size and forms a new video sequence. The resulting elementary stream distributed to the video decoder forms a complete video sequence of size equal to the size of the set of subparts present in the stream. For example, a decoder that decodes the elementary stream corresponding to the composite track of the second alternative generates decoded frames of size equal to the size of the region composed of the R1, R2, R4 and R5 subparts.
For such a kind of media presentation, the set of available subpart combinations corresponds to the set of composite tracks present in the media presentation. Nevertheless, it is possible to decode only one subpart by decoding one composite track which contains the subpart, and performing an a posteriori cropping of the composite track. For instance, the decoder may decode the second composite track and then trim the output frame to the region R1. The drawback of this method is that data is uselessly decoded, since it is cropped later. The resources of the client are thus wasted.
To overcome this disadvantage, one approach is to avoid transmitting the data corresponding to the useless regions. The decoder then only has the data corresponding to the one region and cannot decode the other regions. It may consider that the video sequence is corrupted, for instance due to transmission errors, and then decode only the one subpart. We refer to this approach as partial subpart decoding. The main issue is that the behavior of a decoder in the presence of transmission errors or a corrupted video sequence is unspecified. For instance, it may return an error and freeze the decoding until it receives a complete frame.
In a second example, the media presentation contains a set of tile tracks and one tile base track. The tile base track contains all the metadata shared among the tile tracks. Tile base tracks are very similar to composite tracks in terms of description. The main difference is that the media samples of the different tile tracks addressed in the tile base track are retrieved through a tile track reference, while composite tracks use NAL unit extractors.
In the current version of ISOBMFF, it is not possible to replace encoded data of one tile track with other data in a tile base track, since the address of the first coding unit in the tile cannot be adapted to form a smaller video sequence. With tile base tracks, it is assumed that the decoder will consider missing tile tracks in a media presentation, which are referred to in a tile base track, as intentional. The decoder is supposed to support partial subpart decoding. It may crop the pixel areas corresponding to the areas covered by the tile tracks that are not provided, so as to output decoded frames of size equal to the areas covered by the available tile tracks. Partial subpart decoding may also be used for composite tracks which are not using bitstream rewriting through NAL unit extractors. Alternatively, the composite track may include NAL units that rewrite the headers of the tile tracks. This forms a new version of the video sequence which has the appropriate size, in which case the decoder does not need to support partial subpart decoding.
Nevertheless, future versions of ISOBMFF may describe new mechanisms that would make it possible to generate new video sequences with effects similar to those of composite tracks.
A layered HEVC stream may interleave all the layers in the same track. Such an approach is not suitable, since it does not allow easily selecting which layers the decoder has to decode. The preferred approach is to encapsulate each layer in one track.
An enhancement layer refers to a reference layer at a lower scalability level (for instance the base layer) to encode the video stream efficiently. Thus, the elementary stream sent to the decoder contains samples from the reference layer in addition to samples from the enhancement layer that is currently decoded. When the reference layers and the enhancement layers are in different tracks, the decoder activates the decoding of several tracks.
The Web application retrieves a media segment from a media source and stores the media segment in a source buffer. The Web engine then parses the media segment and initializes the components found in the media segment, the components comprising a plurality of video components. According to a main aspect of the invention, the Web engine signals the relationships between the plurality of video components to the Web application. Then, based on the signalled information, the Web application indicates to the Web engine the video components to be decoded. These selected video components are decoded by the Web engine and then simultaneously displayed by the Web application.
In one embodiment, the web browser uses the “kind” attribute to specify the dependencies between the video tracks. The web browser annotates the video tracks with two additional categories, namely “composite” and “composition reference”.
The “composite” category refers to a video track that is dependent on or requires other tracks to be decoded. This corresponds typically to reference tracks such as tile tracks and enhancement layers that are not intended to be displayed for scalable codec.
A “composition reference” video track is a track referenced by or required by one or more composite video tracks to form a video.
The Web Browser lists all the video tracks declared in the initialization segment and successively applies the following steps:
First, it retrieves the track information to fill the VideoTrack attributes in step 720. This includes the “id”, “label” and “language” attributes. The values of these attributes are set as in the file format description.
Second, it sets, in step 730, the “kind” attribute (i.e. the category) equal to “composite” when the video track is a tile track or a layer track not intended to be displayed. This can be determined based on the brand value of the track and the “notdisplayed” parameters of the track headers. When the track is a displayable layer track, a composite track or a tile base track, it sets the “kind” attribute to “composition reference”.
Third, it sets, in step 740, the “selected” attribute to true if the video track is the first “composition reference” track in the presentation order of the file format track description. As an alternative, it selects the default track when one is indicated in the file format header. A “composition reference” track relies on other tracks. The browser sets the “selected” attribute of these related tracks to true to indicate to the Web application that these tracks are played in addition to the “composition reference” track.
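The loop of steps 720 to 740 might be sketched as follows; the parsed-track model and its helper fields are assumptions for illustration, and the “composite”/“composition reference” kinds are the categories proposed in this embodiment, not standard values.

```ts
type Kind = 'composite' | 'composition reference';

// Hypothetical view of the information parsed from the file format headers.
interface ParsedTrack {
  id: string; label: string; language: string;
  isDisplayableLayer: boolean;
  isCompositeTrack: boolean;
  isTileBaseTrack: boolean;
  dependencies: string[];  // ids of tracks this track relies on
}

interface EngineVideoTrack {
  id: string; label: string; language: string; kind: string; selected: boolean;
}

function initializeVideoTracks(parsed: ParsedTrack[]): EngineVideoTrack[] {
  const tracks = parsed.map(p => ({
    id: p.id, label: p.label, language: p.language,          // step 720
    kind: (p.isDisplayableLayer || p.isCompositeTrack || p.isTileBaseTrack
      ? 'composition reference' : 'composite') as Kind,      // step 730
    selected: false,
  }));
  // Step 740: select the first "composition reference" track and its dependencies.
  const first = tracks.find(t => t.kind === 'composition reference');
  if (first) {
    first.selected = true;
    const deps = parsed.find(p => p.id === first.id)?.dependencies ?? [];
    for (const t of tracks) if (deps.includes(t.id)) t.selected = true;
  }
  return tracks;
}
```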
In step 750, the web browser determines whether tracks can be selected. In this step, the web browser is configured to throw an exception, for instance InvalidModificationError, when the Web application changes the “selected” attribute value of a video track of the “composite” category. This ensures that a Web application does not attempt to enable or disable the decoding of a track that cannot be decoded without other tracks. On the other hand, when the Web application selects a “composition reference” video track, the following processing applies: the “selected” attribute of the tracks that the “composition reference” track depends on or references is set to true. The web browser decodes only the subpart described by the “composition reference” video track.
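From the Web application side, this guard could be probed defensively as sketched below; throwing InvalidModificationError on selection of a “composite” track is the behavior proposed in this embodiment, not a standard DOM guarantee.

```ts
function trySelect(track: { kind: string; selected: boolean }): boolean {
  try {
    track.selected = true;  // may throw for "composite" tracks in this design
    return true;
  } catch (e) {
    if (e instanceof DOMException && e.name === 'InvalidModificationError') {
      console.warn('track cannot be selected on its own');
      return false;
    }
    throw e;  // anything else is unexpected
  }
}
```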
When the browser supports partial subpart decoding of the “composition reference” track of the media presentation, the processing allows an application to change the “selected” attribute of a “composite” track. When the application disables the decoding of one “composite” track, the web browser decodes the tracks associated with the “composition reference” track except the tracks for which the “selected” attribute is equal to false. In this embodiment, the last step 760 is not used. Optional step 760 is only used in particular embodiments and will be described later in relation with those embodiments.
In another embodiment, the categorization of the tracks is extended to specify more precisely the type of each track.
For instance, the “kind” attribute may have the following categories. These categories describe precisely the type of the track, which makes it possible to determine which tracks can be selected simultaneously.
A first category, where the “kind” attribute can take the value “Tile track”, corresponds to a track that contains samples corresponding to a spatial part of a video.
A second category, where the “kind” attribute can take the value “Composite track”, corresponds to a track that refers to other video tracks to compose an image.
A third category, where the “kind” attribute can take the value “Tile base track”, corresponds to a track common to one or more tile tracks that contains data shared among these one or more tracks.
A fourth category, where the “kind” attribute can take the value “Base layer”, corresponds to a track containing data of a base layer of a layered stream that can be decoded independently of other tracks.
A fifth category, where the “kind” attribute can take the value “Enhancement layer”, corresponds to a track containing data of an enhancement layer of a layered stream which depends on another track.
A sixth category, where the “kind” attribute can take the value “Reference”, corresponds to a track used only as a decoding reference and not intended to be displayed.
The Web application checks the category of each video track before changing the value of the “selected” attribute. The Web browser forbids the Web application to change the value of the “selected” attribute of tracks whose kind is “Tile track” or “Reference”. When the decoder supports partial subpart decoding, however, it allows changing the value of the “selected” attribute of tracks whose kind is “Tile track”.
When the server 100 generates one track per media segment, the server extracts the category of the track and reports it in the Role attribute of the Representation that contains the media segment in the MPD 115.
In another embodiment, the web browser provides more information to the Web application. It provides a new syntax element that indicates whether the tracks can be selected, or more precisely whether the value of the “selected” attribute can be modified. The table below provides an example of syntax that includes a new attribute of the videoTrack object to convey this information. This new attribute is named “selectable” in the preferred embodiment. When equal to false, this new boolean “selectable” attribute indicates that the web application should consider the “selected” attribute as read-only, i.e. it cannot change its value. This new attribute is set in additional step 760. In step 760, the decoder sets the “selectable” attribute to true when one of the following conditions is valid, and otherwise sets it to false: the video track is a layer intended to be displayed; the video track is a composite or a tile base track; or the video track does not require data from another track to be decoded.
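A minimal TypeScript rendering of such an extended videoTrack object, based on the prose description; only the “selectable” attribute name comes from this embodiment, the remaining shape is assumed.

```ts
interface ExtendedVideoTrack {
  readonly id: string;
  readonly kind: string;
  readonly label: string;
  readonly language: string;
  // Writable only when "selectable" is true; read-only otherwise.
  selected: boolean;
  // New boolean attribute set by the decoder in step 760.
  readonly selectable: boolean;
}
```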
When the browser supports partial subpart decoding, the browser may set the “selectable” attribute of tile tracks to true. More generally, a decoder able to continue decoding a reference track when dependent tracks are missing may set the “selectable” attribute of these dependent tracks to true.
The advantage of this embodiment is that each decoder, and by extension the browser, is able to indicate whether it supports video track selection for multi-video-track playing. The support of the media presentation covers not only the browser capabilities but also the encoding configuration of the media presentation (including both the video format and the file format encapsulation configuration).
In yet another embodiment, the web browser provides a different kind of information to the Web application on each video track. The interface of the VideoTrack is extended to include a new attribute, called “selectionstate” in the preferred embodiment. This new attribute describes the selection characteristics more precisely than the “selectable” attribute of the previous embodiment. This attribute may take several values:
A first value, for example “Selectable”, indicates that the track corresponds to a complete video sequence, meaning that the track can be independently decoded. It may be, for example, a reference track or a tile track that does not depend on its neighbors or on metadata in a reference track to be decoded. It allows the decoder of the web browser to indicate that the Web application can select the track and will obtain a complete video sequence. The Web application can freely change the “selected” attribute of the video track.
A second value, for example “Combined”, allows the decoder to notify the Web application that it can enable or disable the decoding of the track with the “selected” attribute. Nevertheless, it also indicates that this track needs to be combined with other video tracks (with any one of the selection states) to be correctly decoded. This applies to dependent video tracks that need to refer to another video track (a reference track) to be decodable.
A third value, for example “NotSelectable”, indicates that the track is not selectable. This typically concerns tracks that are not intended to be displayed or that require multiple track activations to be decoded. For instance, it may be a tile track that is not used for the current decoding but is needed by another tile track for its decoding due to coding dependencies.
The browser starts by initializing the value of the selectionstate attribute of each track in step 760:
The browser sets the “selectionstate” attribute value to “NotSelectable” when one of the following conditions is valid: when the “not-displayed” indication is present in one of the track headers; or when the video track is a dependent track, such as a tile track or a track that needs to activate the decoding of another track to be correctly decoded.
The browser sets the “selectionstate” attribute value to “Selectable” when one of the following conditions is verified: when the video track is a layer intended to be displayable; when the video track is a composite or a tile base track; or when the video track does not require data from another track to be decoded.
In processing step 760 of the last track of the initialization segment, the decoder determines the track that was selected by default. It is the track which has a “selected” attribute equal to true, set in step 740. The decoder determines the list of tracks referred to by the selected track. If the decoder correctly supports partial subpart decoding, it sets the “selectionstate” attribute to “Combined” for the dependent tracks (e.g. tile tracks) referred to by the selected track.
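The initialization of step 760 might be sketched as follows; the track model and predicates are assumptions for illustration, while the three “selectionstate” values are those defined above.

```ts
type SelectionState = 'Selectable' | 'Combined' | 'NotSelectable';

// Hypothetical per-track description derived from the file format headers.
interface TrackDesc {
  id: string;
  notDisplayed: boolean;
  isDependent: boolean;        // e.g. a tile track, or a track needing another track
  isDisplayableLayer: boolean;
  isCompositeOrTileBase: boolean;
  selected: boolean;           // set in step 740 for the default track
  references: string[];        // tracks referred to by this track
}

function initSelectionState(
  tracks: TrackDesc[], supportsPartialDecoding: boolean,
): Map<string, SelectionState> {
  const state = new Map<string, SelectionState>();
  for (const t of tracks) {
    if (t.notDisplayed || t.isDependent) state.set(t.id, 'NotSelectable');
    else if (t.isDisplayableLayer || t.isCompositeOrTileBase) state.set(t.id, 'Selectable');
    else state.set(t.id, 'NotSelectable');
  }
  // On the last track: upgrade the dependents of the default selected track.
  const defaultTrack = tracks.find(t => t.selected);
  if (defaultTrack && supportsPartialDecoding) {
    for (const id of defaultTrack.references) state.set(id, 'Combined');
  }
  return state;
}
```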
Figure 9 illustrates an example of a file format representation 902 in tracks for a video sequence 901 subdivided into four parts. The media segment 902 comprises the tracks 900 to 960. Each tile of the frame is encapsulated in one of the tile tracks (930, 940, 950, 960). The author of the media presentation proposes three associations of subparts as three composite tracks 900, 910 and 920. The composite track 900 contains the four subparts T1, T2, T3 and T4 and thus refers to the four tile tracks. The composite track 910 (resp. 920) is composed of subparts T1 (resp. T3) and T2 (resp. T4) and thus refers to tile track 930 (resp. 950) and tile track 940 (resp. 960). The default track is the composite track 900.
At the end of the processing of Figure 7, the composite track 900 has a “selected” attribute equal to true, as do all the tile tracks 930 to 960. The selection state of all composite tracks is equal to “Selectable” and is equal to “Combined” for all tile tracks.
In the following paragraphs, we describe the browser processing when the Web application changes the value of the “selected” attribute of one track in step 640.
The browser throws an exception when the Web application attempts to change the value of the “selected” attribute of one track with a “selectionstate” equal to “NotSelectable”.
When the Web application changes the “selected” attribute of one track from true to false, the browser stops the decoding of the track.
If the track is a reference track, all its dependent tracks may also be currently played and thus have the “selected” attribute equal to true. The reference track and its dependent tracks are then stopped and all their “selected” attributes are set to false. Continuing with the previous example, if the Web application sets the “selected” attribute of the composite track 900 to false, the same attribute for the tile tracks 930 to 960 is set to false by the Web engine, and it stops playing all these tile tracks.
If the track is a dependent track, the dependent track is stopped and only its “selected” attribute is set to false. The reference track that initially started the playing of the dependent track remains unchanged, as do the other dependent tracks. In some particular cases, a dependent track may rely on another dependent track, typically in scalable encoding when multiple enhancement layers are present. When the Web application stops such a dependent track, the browser behaves as for a reference track: it stops all dependent tracks of the stopped dependent track and sets their “selected” attributes to false accordingly.
When the Web application changes the “selected” attribute of one track with a “selectionstate” equal to “Combined” from false to true, the browser activates the decoding of the track.
When the Web application changes the “selected” attribute of one track with a “selectionstate” equal to “Selectable” from false to true, the browser stops the decoding of all other video tracks. It resets the “selectionstate” attributes as initially done when decoding the initialization segment in processing step 760, the default track being the newly selected track. Continuing with the previous example, the decoder decodes the composite track 900 and its tile tracks 930 to 960. These tracks have a “selected” attribute equal to true. The Web application sets the “selected” attribute of the composite track 910 to true. At the end of processing step 760, the composite track 910 and the tile tracks 930 and 940 have a “selected” attribute equal to true, while all the other tracks have a “selected” attribute equal to false. The composite tracks 900 to 920 have a “selectionstate” attribute equal to “Selectable”. Tile tracks 930 and 940 have a “selectionstate” equal to “Combined”, and it is equal to “NotSelectable” for tile tracks 950 and 960 since they cannot be played with the composite track 910.
The Web application notifies the possible selection choices to the user in step 630 of Figure 6 according to the value of the “selectionstate” attribute of each track of the media presentation. It proposes only the tracks whose “selectionstate” attribute has a value equal to “Selectable” or “Combined” (the intent is not to signal tracks with the “selectionstate” attribute equal to “NotSelectable”).
In yet another embodiment, the browser provides additional attributes to the VideoTrack interface. These attributes describe information that facilitates the creation of the GUI offered to the user to select the track. This information characterizes in particular the hierarchical relations between the tracks. For instance, these attributes describe the position information of a tile track. The interface module parses these attributes to generate a GUI adapted to the content. Typically, when the VideoTrack is a tile track, the spatial coordinates of the subpart are specified. These coordinates may be defined in pixels or in proportion relative to the size of the decoded frame.
For example, the videoTrack object may be provided with a “position” attribute. The “position” attribute is of type TrackPosition. The TrackPosition object includes a set of attributes, “x”, “y”, “width” and “height”, representing the position of the subpart. The “x” and “y” attributes are the coordinates in pixels of the subpart in the video sequence, and the “width” and “height” attributes represent the size of the subpart in pixels. In one alternative, the values of these attributes are scaled to the size of the video sequence including the subparts.
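Rendered as TypeScript interfaces, the attribute described above might look like the following sketch; the attribute names come from this embodiment, the overall shape is assumed.

```ts
interface TrackPosition {
  x: number;       // horizontal coordinate of the subpart, in pixels
  y: number;       // vertical coordinate of the subpart, in pixels
  width: number;   // width of the subpart, in pixels
  height: number;  // height of the subpart, in pixels
}

interface PositionedVideoTrack {
  readonly id: string;
  selected: boolean;
  // Position of the subpart within the currently decoded composition.
  readonly position: TrackPosition;
}
```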
The browser initializes this object each time a composite track or a tile base track is decoded. Indeed, one tile track may be used by several composite tracks. The position of the tile track in the decoded video thus depends on the composite track or tile base track currently being played by the browser.
In one alternative, the videoTrack object is provided with hierarchical information representing the scalable hierarchy of the track. Typically, the VideoTrack defines an attribute, for example called “layer_id”, indicating the identifier of the layer. In an example regarding quality scalability, the attribute may be called “quality_id”, which indicates the quality level of the layer. In one alternative, the browser provides more level attributes to indicate the spatial, temporal and/or SNR scalability levels.
In another embodiment, instead of simple attributes, the browser provides methods as an interface (at the HTMLMediaElement, MediaSource or SourceBuffer level) to select the tracks at a predetermined location.
For example, in an extension of the VideoTrack object, it includes a method, for example called “selectVideoTrackPosition”, taking the parameters “x”, “y”, “width” and “height”, to select the tracks at the location parameterized by the arguments of the method. The arguments “x” and “y” correspond to the coordinates of the decoding position. The “width” and “height” arguments specify the size of the decoding position.
When this method is invoked, the browser determines the subpart tracks (typically tile tracks) which contain samples in this decoding position. Then it determines the composite tracks or tile base tracks that reference all these tracks. When several composite or tile base tracks are identified, the browser preferably selects the track which refers to the smallest number of tile tracks encompassing the determined subpart tracks. This ensures that the selected track has the smallest size encompassing the decoding position. The browser then sets the “selected” attributes of the selected composite or tile base track and its tile tracks.
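A sketch of this selection logic is given below; the track model, the intersection test and the tie-breaking by tile count are illustrative assumptions consistent with the description above.

```ts
interface Rect { x: number; y: number; width: number; height: number; }

interface TileTrack { id: string; position: Rect; }
interface CompositeTrack { id: string; tileRefs: string[]; }

function intersects(a: Rect, b: Rect): boolean {
  return a.x < b.x + b.width && b.x < a.x + a.width &&
         a.y < b.y + b.height && b.y < a.y + a.height;
}

function selectVideoTrackPosition(
  region: Rect, tiles: TileTrack[], composites: CompositeTrack[],
): CompositeTrack | undefined {
  // Tile tracks whose samples fall inside the requested decoding position.
  const needed = tiles.filter(t => intersects(t.position, region)).map(t => t.id);
  // Composite (or tile base) tracks referencing all the needed tile tracks...
  const candidates = composites.filter(c => needed.every(id => c.tileRefs.includes(id)));
  // ...preferring the one with the fewest tile tracks, i.e. the smallest
  // composition still encompassing the decoding position.
  return candidates.sort((a, b) => a.tileRefs.length - b.tileRefs.length)[0];
}
```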
The VideoTrack interface may also provide a method that allows selecting the level of the layer to be decoded. Typically, this method may have at least one argument that is the level value of the layer to be decoded.
Figure 8a illustrates an example of the decoding pipeline operated in the Web browser. The Web application fills the SourceBuffer instances 800 with media segments, which contain encoded data. The format of the media segment 810 follows a file format specification, typically ISOBMFF.
The decoder of the browser is in charge of extracting the elementary streams from the media segment 810 and providing them to the elementary stream decoders 840 and 845. In this example, we represented a video decoder 840 and an audio decoder 845, but a metadata decoder may also decode metadata. The invention is thus not limited to media presentations comprising only video and/or audio samples but also covers metadata samples. Typically, this comprises text such as subtitles or captions.
The extraction of elementary streams consists in applying a de-multiplexing operation in step 820. This operation generates buffers of data which contain the data of the different tracks present in the media segment. For example, the media segment 810 of Figure 8a contains three tracks. The first two tracks, 830 and 831, are video tracks (i.e. they contain video samples compressed with a video codec, typically H.265/HEVC) and the third track 832 contains audio samples and thus is an audio track.
In the context of the invention, the media segment includes several video tracks that may or should be played together. For instance, each of the video tracks may correspond either to one layer of a layered HEVC media or to a spatial part of the media samples. As a result, the two video tracks may be decoded simultaneously, as represented in the example decoding pipeline configuration of Figure 8a. The video decoder 840 receives as input the samples of the several tracks and outputs the decoded samples 850. These decoded samples contain the data encoded by the two tracks, which may correspond to an enhancement layer in a Layered-HEVC stream or to two spatial parts of the media presentation.
The audio decoder 845 decodes the audio track 832. In this example, only one audio track is decoded, but the same audio decoder can also decode multiple audio tracks.
The decoded video and audio samples are then provided to the renderer 860 for final presentation to the user.
Figure 8b is a second example of a decoding pipeline. In this example, the Web application provides the media segments 810 and 811 to two SourceBuffer instances. The first SourceBuffer receives the media segment 810, which includes the same two video tracks 830 and 831 as in the example of Figure 8a; similar elements with the same references are not described again. The audio track is not represented in this example. The second SourceBuffer receives the media segment 811, which includes data corresponding to a third video track 834. The decoder analyses the initialization segment to determine whether the same decoder can decode the third track 834 along with the first two video tracks 830 and 831. Typically, this is the case when this third video track corresponds, for instance, to a second enhancement layer or a third spatial part of the media. In such a case, the output of the demuxer 821 is redirected to the video decoder 840.
Figure 10 is a schematic block diagram of a computing device 1000 for implementation of one or more embodiments of the invention. The computing device 1000 may be a device such as a micro-computer, a workstation or a light portable device. The computing device 1000 comprises a communication bus connected to:
- a central processing unit 1001, such as a microprocessor, denoted CPU;
- a random access memory 1002, denoted RAM, for storing the executable code of the method of embodiments of the invention as well as the registers adapted to record variables and parameters necessary for implementing the method according to embodiments of the invention, the memory capacity thereof can be expanded by an optional RAM connected to an expansion port for example;
- a read only memory 1003, denoted ROM, for storing computer programs for implementing embodiments of the invention;
- a network interface 1004 is typically connected to a communication network over which digital data to be processed are transmitted or received. The network interface 1004 can be a single network interface, or composed of a set of different network interfaces (for instance wired and wireless interfaces, or different kinds of wired or wireless interfaces). Data packets are written to the network interface for transmission or are read from the network interface for reception under the control of the software application running in the CPU 1001;
- a user interface 1005 may be used for receiving inputs from a user or to display information to a user;
- a hard disk 1006 denoted HD may be provided as a mass storage device;
- an I/O module 1007 may be used for receiving/sending data from/to external devices such as a video source or display.
The executable code may be stored either in read only memory 1003, on the hard disk 1006 or on a removable digital medium such as for example a disk.
According to a variant, the executable code of the programs can be received by means of a communication network, via the network interface 1004, in order to be stored in one of the storage means of the communication device 1000, such as the hard disk 1006, before being executed.
The central processing unit 1001 is adapted to control and direct the execution of the instructions or portions of software code of the program or programs according to embodiments of the invention, which instructions are stored in one of the aforementioned storage means. After powering on, the CPU 1001 is capable of executing instructions from main RAM memory 1002 relating to a software application after those instructions have been loaded from the program ROM 1003 or the hard-disc (HD) 1006 for example. Such a software application, when executed by the CPU 1001, causes the steps of the flowcharts of the invention to be performed.
Any step of algorithms of the invention may be implemented in software by execution of a set of instructions or program by a programmable computing machine, such as a PC (“Personal Computer”), a DSP (“Digital Signal Processor”) or a microcontroller; or else implemented in hardware by a machine or a dedicated component, such as an FPGA (“Field-Programmable Gate Array”) or an ASIC (“Application-Specific Integrated Circuit”).
Although the present invention has been described hereinabove with reference to specific embodiments, the present invention is not limited to the specific embodiments, and modifications which lie within the scope of the present invention will be apparent to a person skilled in the art.
Many further modifications and variations will suggest themselves to those versed in the art upon making reference to the foregoing illustrative embodiments, which are given by way of example only and which are not intended to limit the scope of the invention, that being determined solely by the appended claims. In particular the different features from different embodiments may be interchanged, where appropriate.
In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. The mere fact that different features are recited in mutually different dependent claims does not indicate that a combination of these features cannot be advantageously used.
Claims (20)
1. A method for exchanging data between a Web application and an associated Web engine, said data being related to encoded video data organized into a plurality of video components, at least one video component being able to be simultaneously displayed with another video component, the method comprising:
- parsing, by the Web engine, initialization data and initializing said video components, wherein the method further comprises
- obtaining by the Web engine, from the initialization data, information about the ability for each video component to be simultaneously selected with another video component;
- signalling said information by the Web engine to the Web Application,
- selecting part of the video components by the Web application based on said information, and
- indicating by the Web Application to the Web engine the selected video components to be decoded by the Web engine.
2. The method of claim 1, wherein the step of indicating by the Web application to the Web engine the selected video components to be decoded comprises:
- associating a selected state to the video component, said selected state indicating whether the video component is selected or not.
3. The method of claim 2 wherein the information comprises an attribute, and signalling said information comprises:
- determining a “composite” category for a video component that requires other video components to be decoded;
- determining a “composition reference” category for a video component required by a “composite” video component;
- setting the attribute of the video component to the determined category.
4. The method of claim 3 further comprising:
- predefining by the Web engine the selected state of the first “composition reference” component in presentation order and the selected state of all components that depend on this first “composition reference” component as selected.
5. The method of claim 4 further comprising:
- forbidding by the Web engine the Web application to change the predefined selected state of a “composite” component.
6. The method of claim 4 further comprising:
- setting, by the Web engine, as selected the state for all components that depend on a “composition reference” component set as selected by the Web application.
7. The method of claim 2 wherein the information comprises an attribute, and signalling said information comprises:
- determining a “Tile track” category for a video component that contains samples corresponding to a spatial part of a video;
- determining a “Composite track” category for a video component that refers to other video components to compose an image;
- determining a “Tile base track” category for a video component common to one or more tile components that contains data that is shared among these one or more components;
- determining a “Base layer” category for a video component containing data of a base layer of a layered stream that can be decoded independently of other components;
- determining an “Enhancement layer” category for a video component containing data of an enhancement layer of a layered stream which depends on another component;
- determining a “Reference” category for a video component used only as a decoding reference and not intended to be displayed;
- setting the attribute of the video component to the determined category.
8. The method of claim 7 further comprising:
- forbidding by the Web engine the Web application to change a predefined selected state of a “Tile track” or “Reference” component.
9. The method of claim 2 wherein the information comprises a “selectable” attribute, and signalling said information comprises:
- determining the “selectable” attribute for a video component as true when:
o the video component is a layer intended to be displayed;
o the video component is a composite or a tile base component; or
o the video component is not requiring data from another component to be decoded;
- determining the “selectable” attribute for the video component as false otherwise.
10. The method of claim 9 further comprising:
- forbidding by the Web engine the Web application to change the selected state of a component whose “selectable” attribute is false.
11. The method of claim 2 wherein the information comprises a “selectionstate” attribute, and signalling said information comprises:
- determining the “selectionstate” attribute for a video component as “selectable” when the video component is a layer intended to be displayable; when the video component is a composite or a tile base component; or when the video component can be independently displayed;
- determining the “selectionstate” attribute for the video component as “Combined” when the component needs to be combined with other video components;
- determining the “selectionstate” attribute for the video component as “NotSelectable” otherwise.
12. The method of claim 11 further comprising:
- forbidding by the Web engine the Web application to change the selected state of a component with a “selectionstate” attribute set to “NotSelectable”.
13. The method of claim 12 further comprising:
- setting as false the selected state of all components that depend on a component which selected state has been set to false by the Web application.
14. The method of any one of claims 1 to 12 wherein the Web engine further provides a “position” attribute for video components corresponding to a subpart of a video, the “position” attribute allowing indicating the position of the subpart.
15. The method of any one of claims 1 to 12 wherein the Web engine further provides a “layer_id” attribute for video components corresponding to a layer of a layered video, the “layer_id” attribute allowing indicating the layer identifier of the layer.
16. The method of claim 1, wherein indicating the relationships between the plurality of video components comprises:
- providing a “selectVideoTrackPosition” method as interface to allow selecting the components at a given location.
17. A computer program product for a programmable apparatus, the computer program product comprising a sequence of instructions for implementing a method according to any one of claims 1 to 16, when loaded into and executed by the programmable apparatus.
18. A computer-readable storage medium storing instructions of a computer program for implementing a method according to any one of claims 1 to 16.
19. A client device configured to execute the steps of the method of any one of claims 1 to 16.
20. A streaming system comprising a server device and a client device according to claim 19.
Amendments to claims have been filed as follows
1. A method for exchanging data between a Web application and an associated Web engine, said data being related to encoded video data organized into a plurality of video components, at least one video component being able to be simultaneously displayed with another video component, the method comprising:
- parsing, by the Web engine, initialization data and initializing said video components,
wherein the method further comprises
- obtaining by the Web engine, from the initialization data, information about the ability for each video component to be simultaneously selected with another video component;
- signalling said information by the Web engine to the Web Application,
- selecting part of the video components by the Web application based on said information, and
- indicating by the Web Application to the Web engine the selected video components to be decoded by the Web engine.
2. The method of claim 1, wherein the step of indicating by the Web application to the Web engine the selected video components to be decoded comprises:
- associating a selected state to the video component, said selected state indicating whether the video component is selected or not.
3. The method of claim 2 wherein the information comprises an attribute, and signalling said information comprises:
- determining a “composite” category for a video component that requires other video components to be decoded;
- determining a “composition reference” category for a video component required by a “composite” video component;
- setting the attribute of the video component to the determined category.
4. The method of claim 3 further comprising:
- predefining by the Web engine the selected state of the first “composition reference” component in presentation order and the selected state of all components that depend on this first “composition reference” component as selected.
5. The method of claim 4 further comprising:
- forbidding by the Web engine the Web application to change the predefined selected state of a “composite” component.
6. The method of claim 4 further comprising:
- setting, by the Web engine, as selected the state for all components that depend on a “composition reference” component set as selected by the Web application.
7. The method of claim 2 wherein the information comprises an attribute, and signalling said information comprises:
- determining a “Tile track” category for a video component that contains samples corresponding to a spatial part of a video;
- determining a “Composite track” category for a video component that refers to other video components to compose an image;
- determining a “Tile base track” category for a video component common to one or more tile components that contains data that is shared among these one or more components;
- determining a “Base layer” category for a video component containing data of a base layer of a layered stream that can be decoded independently of other components;
- determining an “Enhancement layer” category for a video component containing data of an enhancement layer of a layered stream which depends on another component;
- determining a “Reference” category for a video component used only as a decoding reference and not intended to be displayed;
- setting the attribute of the video component to the determined category.
8. The method of claim 7 further comprising:
- forbidding by the Web engine the Web application to change a predefined selected state of a “Tile track” or “Reference” component.
9. The method of claim 2 wherein the information comprises a “selectable” attribute, and signalling said information comprises:
- determining the “selectable” attribute for a video component as true when:
o the video component is a layer intended to be displayed;
o the video component is a composite or a tile base component; or
o the video component is not requiring data from another component to be decoded;
- determining the “selectable” attribute for the video component as false otherwise.
10. The method of claim 9 further comprising:
- forbidding by the Web engine the Web application to change the selected state of component which “selectable” attribute is false.
11. The method of claim 2 wherein the information comprises a “selectionstate” attribute, and signalling said information comprises:
- determining the “selectionstate” attribute for a video component as “selectable” when the video component is a layer intended to be displayable; when the video component is a composite or a tile base component; or when the video component can be independently displayed;
- determining the “selectionstate” attribute for the video component as “Combined” when the component needs to be combined with other video components;
- determining the “selectionstate” attribute for the video component as “NotSelectable” otherwise.
12. The method of claim 11 further comprising:
- forbidding by the Web engine the Web application to change the selected state of a component with the “selectionstate” attribute set to “NotSelectable”.
13. The method of claim 12 further comprising:
- setting as false the selected state of all components that depend on a component which selected state has been set to false by the Web application.
14. The method of any one of claims 1 to 12 wherein the Web engine further provides a “position” attribute for video components corresponding to a subpart of a video, the “position” attribute allowing indicating the position of the subpart.
15. The method of any one of claims 1 to 12 wherein the Web engine further provides a “layer_id” attribute for video components corresponding to a layer of a layered video, the “layer_id” attribute allowing indicating the layer identifier of the layer.
16. The method of claim 1, wherein signalling the information comprises:
- providing a “selectVideoTrackPosition” method as interface to allow selecting the components at a given location.
17. A computer program product for a programmable apparatus, the computer program product comprising a sequence of instructions for implementing a method according to any one of claims 1 to 16, when loaded into and executed by the programmable apparatus.
18. A computer-readable storage medium storing instructions of a computer program for implementing a method according to any one of claims 1 to 16.
19. A client device configured to execute the steps of the method of any one of claims 1 to 16.

20. A streaming system comprising a server device and a client device according to claim 19.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB1716906.1A GB2567485B (en) | 2017-10-13 | 2017-10-13 | Method and device for exchanging data between a web application and an associated web engine |
Publications (3)
| Publication Number | Publication Date |
|---|---|
| GB201716906D0 GB201716906D0 (en) | 2017-11-29 |
| GB2567485A true GB2567485A (en) | 2019-04-17 |
| GB2567485B GB2567485B (en) | 2020-07-29 |
Family
ID=60419341
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| GB1716906.1A Active GB2567485B (en) | 2017-10-13 | 2017-10-13 | Method and device for exchanging data between a web application and an associated web engine |
Country Status (1)
| Country | Link |
|---|---|
| GB (1) | GB2567485B (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112714360A (en) * | 2019-10-25 | 2021-04-27 | 上海哔哩哔哩科技有限公司 | Media content playing method and system |
| WO2021105370A1 (en) * | 2019-11-28 | 2021-06-03 | Dolby International Ab | Methods and devices for providing personalized audio to a user |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2014111547A1 (en) * | 2013-01-18 | 2014-07-24 | Canon Kabushiki Kaisha | Method, device, and computer program for encapsulating partitioned timed media data |
| GB2512880A (en) * | 2013-04-09 | 2014-10-15 | Canon Kk | Method, device, and computer program for encapsulating partitioned timed media data |
| GB2509956B (en) * | 2013-01-18 | 2016-05-04 | Canon Kk | Method, device and computer program for efficient encapsulation of timed tiled media data |
Also Published As
| Publication number | Publication date |
|---|---|
| GB2567485B (en) | 2020-07-29 |
| GB201716906D0 (en) | 2017-11-29 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP6591009B2 (en) | Information processing apparatus, information processing method, and program | |
| JP6924266B2 (en) | Methods and devices for encoding video data, including generated content | |
| CN109804635B (en) | Method and apparatus for improving rendered display during timed media data streaming | |
| US10862943B2 (en) | Methods, devices, and computer programs for improving streaming of partitioned timed media data | |
| GB2550912B (en) | Method, device and computer program for encapsulating and parsing timed media data | |
| JP6572222B2 (en) | Media file generation method, generation device, and program | |
| KR102254414B1 (en) | Method, device, and computer program for encapsulating and parsing timed media data | |
| US11638066B2 (en) | Method, device and computer program for encapsulating media data into a media file | |
| JP2016521497A5 (en) | ||
| US20160029091A1 (en) | Method of displaying a region of interest in a video stream | |
| CN108886639A (en) | Scene Part and Region of Interest Processing in Video Streaming | |
| CN113574903A (en) | Method and apparatus for late binding in media content | |
| KR101944601B1 (en) | Method for identifying objects across time periods and corresponding device | |
| GB2567485A (en) | Method and device for exchanging data between a web application and an associated web engine | |
| HK40064165A (en) | Method and apparatus for late binding in media content | |
| HK40064165B (en) | Method and apparatus for late binding in media content |