
HK1258771B - Interactive audio metadata handling - Google Patents

Interactive audio metadata handling

Info

Publication number
HK1258771B
HK1258771B
Authority
HK
Hong Kong
Prior art keywords
audio
command
encoded
based audio
signals
Prior art date
Application number
HK19101102.2A
Other languages
Chinese (zh)
Other versions
HK1258771A1 (en)
Inventor
P. L. Maness
M. R. Johnson
Original Assignee
DTS, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US Application No. 15/078,945 (US10027994B2)
Application filed by DTS, Inc.
Publication of HK1258771A1
Publication of HK1258771B

Description

Interactive audio metadata handling
Cross Reference to Related Applications
Priority is claimed in this application to United States Patent Application No. 15/078,945, entitled "INTERACTIVE AUDIO METADATA HANDLING," filed on March 23, 2016, which is hereby expressly incorporated by reference in its entirety.
Technical Field
The present disclosure relates generally to audio processing/handling and, more particularly, to interactive audio metadata processing/handling.
Background
A source device, such as a set-top box or an Optical Disk (OD) player, may transmit an encoded audio stream to a sink device, such as an Audio Video (AV) receiver or a television. If the user wants to modify the audio stream (e.g., modify the volume associated with audio objects in the audio stream, add/remove objects in the audio stream), the source device may decode the audio stream, modify the audio stream accordingly, and then re-encode the audio stream for transmission to the sink device. Alternative methods for modifying an audio stream are desirable.
Disclosure of Invention
In an aspect of the present disclosure, a method and an apparatus for processing an object-based audio signal for reproduction by a playback system are provided. The apparatus receives a plurality of object-based audio signals in at least one audio frame. In addition, the apparatus receives at least one audio object command associated with at least one of the plurality of object-based audio signals. The apparatus then processes the at least one object-based audio signal based on the received at least one audio object command. Further, the apparatus renders a set of object-based audio signals of the plurality of object-based audio signals to a set of output signals based on the at least one audio object command.
In an aspect of the present disclosure, a method and an apparatus for processing an object-based audio signal for reproduction by a playback system are provided. The apparatus receives user selection information indicating at least one audio object command associated with at least one object-based audio signal. The apparatus obtains the at least one audio object command based on the received user selection information. In addition, the apparatus receives a plurality of object-based audio signals. Further, the apparatus transmits the at least one audio object command with the received plurality of object-based audio signals.
Drawings
Fig. 1 is a block diagram illustrating a first method associated with interactive audio metadata handling/processing.
Fig. 2 is a block diagram illustrating a second method associated with interactive audio metadata handling/processing.
Fig. 3 is a block diagram illustrating a third method associated with interactive audio metadata handling/processing.
Fig. 4 is a block diagram illustrating a fourth method associated with interactive audio metadata handling/processing.
Fig. 5 is a diagram illustrating an audio frame when an audio object command chunk is in-band with an audio chunk in the audio frame.
Fig. 6 is a diagram for illustrating audio objects related to the head of a listener and modifications to such audio objects by audio object commands.
Fig. 7 is a flow diagram of a method of processing an object-based audio signal for reproduction by a playback system.
Fig. 8 is a flow diagram of a method of processing an object-based audio signal for reproduction by a playback system.
FIG. 9 is a conceptual data flow diagram illustrating data flow between different components/assemblies in an exemplary device.
FIG. 10 is a diagram illustrating an example of a hardware implementation of a device employing a processing system.
FIG. 11 is a conceptual data flow diagram illustrating data flow between different components/assemblies in an exemplary device.
FIG. 12 is a diagram illustrating an example of a hardware implementation of a device employing a processing system.
Detailed Description
The detailed description set forth below in connection with the appended drawings is intended as a description of various configurations and is not intended to represent the only configurations in which the concepts described herein may be practiced. The detailed description includes specific details for the purpose of providing a thorough understanding of various concepts. It will be apparent, however, to one skilled in the art that these concepts may be practiced without these specific details. In some instances, well-known structures and components are shown in block diagram form in order to avoid obscuring such concepts. The apparatus and methods are described in the following detailed description and may be illustrated by various blocks, components, circuits, steps, processes, algorithms, elements, and the like, in the figures.
As discussed previously, a source device, such as a set-top box (STB), which is also known as a set-top unit (STU) or integrated receiver/decoder (IRD), or OD player, may transmit an encoded audio stream to a sink device, such as an AV receiver or television. If the user wants to modify the audio stream, such as modifying the volume of audio objects in the audio stream and/or adding/removing audio objects from the audio stream, the source device may decode the audio stream, modify the audio stream accordingly, and then re-encode the audio stream for transmission to the sink device. With respect to user interactivity, modification of the audio stream may be more efficient if handled by the sink device rather than the source device.
Fig. 1 is a block diagram 100 illustrating a first method associated with interactive audio metadata handling/processing. As shown in fig. 1, the sink device 104 (which may be an AV receiver, television, etc.) receives an audio object command 108. In addition, the sink device 104 receives one or more audio frames 110 (also referred to as object-based audio signals) that include audio objects from the source device 102. The sink device 104 may periodically receive the audio frame(s) 110, once per time period T (e.g., T may be about 10.67 ms). The source device 102 may be, for example, an STB or OD player. Alternatively, the source device 102 may be a mobile phone, tablet, streaming stick, media Personal Computer (PC), or other source device. The source device 102 receives audio objects in one or more audio frames 140 and provides the received audio objects in one or more audio frames 110 to the sink device 104. The sink device 104 decodes the audio objects received in the audio frame(s) 110 and processes 112 one or more of the decoded audio objects based on the received audio object command 108. The sink device 104 may perform additional processing (e.g., amplification) on the audio objects and may then render/generate an audio signal for a channel 114 of the sound/playback system 106. The sink device 104 then transmits the processed audio signal 114 to the sound/playback system 106. The sound/playback system 106 (e.g., a loudspeaker) converts the received electrical audio signals into corresponding sounds.
An audio object is one or more audio waveforms having dynamic or static object-specific metadata that describes certain characteristics of the waveforms. Audio objects are typically associated with particular objects, such as particular dialogs, sound effects, particular instruments, etc. The characteristics may include location in three-dimensional (3D) space at a given point in time, measured loudness, properties of the audio object (such as instrument, effect, music, background, or dialog), dialog language, how the audio object is displayed, and metadata in the form of instructions on how to process, render, or playback the audio object. Within an audio stream comprising a set of audio frames, there may be hundreds to thousands of different audio objects. An audio frame may include a subset of such audio objects, depending on which audio objects may be rendered for playback within the audio frame. Audio objects are not necessarily mapped to a particular channel. The sink device 104 may process the audio objects individually. Subsequently, in the rendering process, the AV receiver may map the audio objects to the channels by converting and/or mixing specific audio objects for each channel corresponding to the sound/playback system 106.
The audio object commands 108 may include commands associated with: modifying the volume of an audio object, spatially repositioning an object (see, e.g., below in connection with fig. 6), turning an audio object on/off, adding/removing/replacing an audio object, adjusting a listener location/position in connection with a loudspeaker/playback configuration, or otherwise adjusting a parameter, configuration, or property associated with an audio object. In one aspect, an audio object may include audio waveform data and object metadata associated with the audio waveform data. The audio object commands 108 may include one or more commands associated with modifying object metadata associated with the audio waveform data.
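The notion that an audio object carries waveform data plus object metadata, and that an audio object command modifies that metadata, can be sketched as follows. This is a minimal illustration: the `AudioObject` structure and the command dictionary shape are assumptions for exposition, not the actual bitstream structures.

```python
from dataclasses import dataclass, field

@dataclass
class AudioObject:
    # Hypothetical structure: waveform samples plus object-specific metadata.
    object_id: int
    samples: list
    metadata: dict = field(default_factory=dict)

def apply_command(obj, command):
    """Apply an audio object command by updating the object's metadata.

    A command is modeled here as {"object_id": ..., "set": {key: value}};
    the real command format is bitstream-specific.
    """
    if command["object_id"] == obj.object_id:
        obj.metadata.update(command["set"])
    return obj

# A volume command targeting object 7: set its gain to -6 dB.
obj = AudioObject(object_id=7, samples=[0.0] * 4, metadata={"gain_db": 0.0})
apply_command(obj, {"object_id": 7, "set": {"gain_db": -6.0}})
```

Because only metadata changes, the waveform data itself never needs to be decoded and re-encoded, which is the efficiency argument made above.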
Fig. 2 is a block diagram 200 illustrating a second method associated with interactive audio metadata handling/processing. As shown in fig. 2, the user selection device 208 may receive audio object user selection command information 210. The user selection device 208 may receive the audio object user selection command information 210 from the user, such as through an application and/or interface provided on the user selection device 208. The user selection device 208 processes 212 the audio object user selection command information 210 to generate user selection information 214 for the source device 202. The source device 202 may be, for example, an STB or an OD player. Alternatively, the source device 202 may be a mobile phone, tablet, streaming stick, media PC, or other source device. In a first configuration, the source device 202 generates an audio object command based on the received user selection information 214. In a second configuration, the source device 202 provides user selection information 220 to the network host 218, which generates a corresponding audio object command 222 and provides the generated audio object command 222 to the source device 202. Once the source device 202 has obtained (e.g., generated and/or received) the audio object command corresponding to the user selection information 214 and/or 220, the source device 202 may prepare 216 to send the audio object command to the sink device 204 along with the audio objects received from the network host 218 in one or more audio frames 240. The sink device 204 may be an AV receiver and/or a television. The source device 202 may also determine in which audio frame(s) the audio object command is to be included, because the source device 202 may receive an audio object command for an audio object to be sent by the source device 202 to the sink device 204 at a later time.
In a first configuration, where audio object commands are sent in-band with audio chunks, the source device 202 may append the audio object commands as audio object command chunks behind the encoded/compressed audio chunks within the audio frame(s). In such a configuration, the source device 202 may send the audio chunks (in 224) and the audio object command chunks 226 together in one or more audio frames 224. As such, although the arrows 226, 224 are shown as separate arrows, the audio object commands and audio objects are transmitted together, simultaneously, in the same frequency band, and within the same audio frame(s) 224. In a second configuration, where audio object commands are sent out-of-band with respect to the audio chunks, the source device 202 may send the audio object commands 226 and the audio frame(s) 224 separately, in different frequency bands, to the sink device 204.
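The in-band layout described above, with an audio object command chunk appended behind the encoded audio chunk within a frame, might be sketched like this. The 4-byte length prefixes are an assumed framing device standing in for whatever sync/size headers the real bitstream uses.

```python
import struct

def pack_frame(audio_chunk: bytes, command_chunk: bytes) -> bytes:
    """In-band frame layout (illustrative): the encoded audio chunk first,
    then the audio object command chunk appended behind it. Each chunk is
    preceded by a big-endian 4-byte length field."""
    return (struct.pack(">I", len(audio_chunk)) + audio_chunk
            + struct.pack(">I", len(command_chunk)) + command_chunk)

def unpack_frame(frame: bytes):
    """Split a packed frame back into (audio_chunk, command_chunk)."""
    (alen,) = struct.unpack_from(">I", frame, 0)
    audio = frame[4:4 + alen]
    (clen,) = struct.unpack_from(">I", frame, 4 + alen)
    commands = frame[8 + alen:8 + alen + clen]
    return audio, commands

frame = pack_frame(b"ENCODED-AUDIO", b"CMD:volume-3dB")
audio, commands = unpack_frame(frame)
```

In the out-of-band configuration, `audio_chunk` and `command_chunk` would instead travel on separate transports, and the sink would pair them up by frame timing.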
Upon receiving the audio frame(s) 224 including the plurality of audio objects and the one or more audio object commands 226, the sink device 204 may process 228 the audio objects based on the one or more audio object commands 226. Subsequently, after processing the one or more audio objects based on the one or more audio object commands, the sink device 204 renders/maps the audio objects to the channels 230 for playback by the sound/playback system 206.
Referring again to fig. 2, in the first configuration, the user selection device 208 may be a separate stand-alone device separate from the source device 202 and the sink device 204, such as a cellular telephone, tablet, STB remote, OD player remote, or other device for receiving user input associated with audio object commands. In a second configuration, the user selection device 208 and the source device 202 may be the same device. That is, the source device 202 itself may provide a mechanism for receiving user input associated with an audio object command. In a third configuration, the user selection device 208 and the television may be the same device. In such a configuration, the sink device 204 may be an AV receiver, and the television itself may provide a mechanism for receiving user input associated with audio object commands (e.g., via a television remote control, a touch screen display, etc.).
Fig. 3 is a block diagram 300 illustrating a third method associated with interactive audio metadata handling/processing. As shown in fig. 3, the user selection device 308 may receive audio object user selection command information 310. The user selection device 308 may receive the audio object user selection command information 310 from a user, such as through an application and/or interface provided on the user selection device 308. The user selection device 308 processes 312 the audio object user selection command information 310 to generate user selection information 314 for the source device 302. The source device 302 may be, for example, an STB, an OD player, or a television. Alternatively, the source device 302 may be a mobile phone, tablet, streaming stick, media PC, or other source device. In a first configuration, the source device 302 generates an audio object command based on the received user selection information 314. In a second configuration, the source device 302 provides user selection information 320 to the network host 318, which generates a corresponding audio object command 322 and provides the generated audio object command 322 to the source device 302. Once the source device 302 has obtained (e.g., generated and/or received) the audio object commands corresponding to the user selection information 314 and/or 320, the source device 302 may prepare 316 to send the audio object commands to the sink device 304 along with the audio objects received from the network host 318 in one or more audio frames 340. The sink device 304 may be an AV receiver. The source device 302 may also determine in which audio frame(s) the audio object command is to be included, because the source device 302 may receive an audio object command for an audio object to be sent by the source device 302 to the sink device 304 at a later time.
In a first configuration, where audio object commands are sent in-band with audio chunks, the source device 302 may append the audio object commands as audio object command chunks behind the encoded/compressed audio chunks within the audio frame(s). In such a configuration, the source device 302 may send the audio chunks (in 324) together with the audio object command chunks 326 in one or more audio frames 324. As such, although the arrows 326, 324 are shown as separate arrows, the audio object commands and audio objects are transmitted together, simultaneously, in the same frequency band, and within the same audio frame(s) 324. In a second configuration, where audio object commands are sent out-of-band with respect to the audio chunks, the source device 302 may send the audio object commands 326 and the audio frame(s) 324 separately, in different frequency bands, to the sink device 304.
Upon receiving audio frame(s) 324 comprising a plurality of audio objects and one or more audio object commands 326, sink device 304 may process 328 the audio objects based on the one or more audio object commands 326. Subsequently, after processing the one or more audio objects based on the one or more audio object commands, sink device 304 renders/maps the audio objects to the channels 330 for playback by sound/playback system 306.
Referring again to fig. 3, in the first configuration, the user selection device 308 may be a separate stand-alone device separate from the source device 302 and the sink device 304, such as a cellular telephone, tablet, STB remote, OD player remote, television remote, or other device for receiving user input associated with audio object commands. In a second configuration, the user selection device 308 and the source device 302 may be the same device. That is, the source device 302 itself may provide a mechanism for receiving user input associated with an audio object command.
Fig. 4 is a block diagram 400 illustrating a fourth method associated with interactive audio metadata handling/processing. As shown in fig. 4, the user selection device 408 may receive audio object user selection command information 410. The user selection device 408 may receive the audio object user selection command information 410 from a user, such as through an application and/or interface provided on the user selection device 408. The user selection device 408 processes 412 the audio object user selection command information 410 to generate user selection information 414 for the source device 402. The source device 402 may be, for example, an STB or an OD player. Alternatively, the source device 402 may be a mobile phone, tablet, streaming stick, media PC, or other source device. In a first configuration, the source device 402 generates an audio object command based on the received user selection information 414. In a second configuration, the source device 402 provides user selection information 420 to the network host 418, which generates a corresponding audio object command 422 and provides the generated audio object command 422 to the source device 402. Once the source device 402 has obtained (e.g., generated and/or received) the audio object commands corresponding to the user selection information 414 and/or 420, the source device 402 may prepare 416 to send the audio object commands to the television 432 along with the audio objects received from the network host 418 in one or more audio frames 440. The source device 402 may also determine in which audio frame(s) the audio object command is to be included, because the source device 402 may receive an audio object command for an audio object to be sent by the source device 402 to the sink device 404 at a later time.
In a first configuration, where audio object commands are sent in-band with audio chunks, the source device 402 may append the audio object commands as audio object command chunks behind the encoded/compressed audio chunks within the audio frame(s). In such a configuration, the source device 402 may send the audio chunks (in 424) together with the audio object command chunks 426 in one or more audio frames 424. As such, although the arrows 426, 424 are shown as separate arrows, the audio object commands and audio objects are transmitted together, simultaneously, in the same frequency band, and within the same audio frame(s) 424. In a second configuration, where audio object commands are sent out-of-band with respect to the audio chunks, the source device 402 may send the audio object commands 426 and the audio frame(s) 424 separately, in different frequency bands, to the television 432.
The television 432 receives the audio object commands and audio objects and forwards them to the sink device 404. The sink device 404 may be an AV receiver. The television 432 may send the audio object commands and audio objects in-band or out-of-band, depending on how the audio object commands and audio objects were received by the television 432. For example, if the television 432 receives the audio object commands and audio objects in-band together in one or more audio frames from the source device 402, the television 432 may forward the audio object commands and audio objects in-band together in one or more audio frames to the sink device 404. As another example, if the television 432 receives the audio object commands and audio objects separately, out-of-band, from the source device 402, the television 432 may forward the audio object commands and audio objects separately, out-of-band, to the sink device 404.
Upon receiving the audio frame(s) 424 comprising the plurality of audio objects and the one or more audio object commands 426, the sink device 404 may process 428 the audio objects based on the one or more audio object commands 426. Subsequently, after processing the one or more audio objects based on the one or more audio object commands, sink device 404 renders/maps the audio objects to the channels 430 for playback by sound/playback system 406.
Referring again to fig. 4, in the first configuration, the user selection device 408 may be a separate stand-alone device separate from the source device 402 and the sink device 404, such as a cellular telephone, tablet, STB remote, OD player remote, or other device for receiving user input associated with audio object commands. In a second configuration, the user selection device 408 and the source device 402 may be the same device. That is, the source device 402 itself may provide a mechanism for receiving user input associated with an audio object command.
FIG. 5 is a diagram illustrating an audio frame when audio object command chunks and audio chunks are in-band in the audio frame. As shown in FIG. 5, the audio frame 502 includes an audio chunk and an audio object command chunk. The audio chunk comprises a plurality (n) of audio objects, where the n audio objects are a subset of the total number of audio objects available within the audio stream. For example, the audio stream may include audio for a full-length movie. Such an audio stream may comprise thousands to tens of thousands of audio objects, if not more. The audio stream may include 500k or more audio frames. A particular audio frame may carry n audio objects, depending on which audio objects may be rendered for playback in that frame. The audio object command chunk may comprise m audio object commands x_1, x_2, …, x_m, where m ≥ 0. An audio object command x_i may correspond to one or more of the n audio objects. For example, audio object command x_i may be a command to change the volume associated with one or more audio objects. As another example, audio object command x_i may be a command to replace one audio object with another (e.g., replace an English-speaking announcer with a Spanish-speaking announcer during a sporting event). As another example, audio object command x_i may be a command to include an audio object for processing, rendering, and playback, such as when a user wants another audio stream (e.g., a phone call) overlaid on the initial audio stream (e.g., a full-length movie).
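The point that a frame carries only the subset of audio objects renderable during that frame can be illustrated with a simple interval-overlap test. The `start`/`end` activity fields and the overlap rule are assumptions for illustration; the actual selection logic belongs to the encoder.

```python
def objects_in_frame(all_objects, frame_start, frame_len):
    """Select the subset of audio objects that may be rendered during a
    frame: those whose active interval overlaps the frame. Times are in
    seconds; the interval fields are hypothetical."""
    frame_end = frame_start + frame_len
    return [o for o in all_objects
            if o["start"] < frame_end and o["end"] > frame_start]

# A stream with three objects; one 10.67 ms frame starting at t = 1.0 s.
stream = [
    {"id": 1, "start": 0.0, "end": 2.0},    # active through the frame
    {"id": 2, "start": 1.005, "end": 5.0},  # begins inside the frame
    {"id": 3, "start": 3.0, "end": 4.0},    # entirely after the frame
]
active = objects_in_frame(stream, frame_start=1.0, frame_len=0.01067)
```

Here the frame would carry n = 2 of the stream's audio objects (ids 1 and 2), while object 3 waits for a later frame.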
In one configuration, the audio object command may be applied to the corresponding audio object(s) until the command is undone. In another configuration, the audio object command may be applicable to the corresponding audio object(s) for a particular period of time. In such a configuration, the audio object command may include a time period for which the audio object command applies.
Diagram 500 illustrates an audio frame comprising n audio objects and m audio object commands. As discussed previously, one or more audio frames may be received within the time period (e.g., 10.67 ms) corresponding to one audio frame. Suppose q audio frames are received within the same (current) time period, where the i-th audio frame includes n_i audio objects and m_i audio object commands; such a time period may then be associated with n_1 + n_2 + … + n_q audio objects and m_1 + m_2 + … + m_q audio object commands.
Fig. 6 is a diagram 600 illustrating audio objects related to a listener's head and modifications to such audio objects by audio object commands. The audio object 602 may be "positioned" at a particular location relative to the listener's head 604. As shown in fig. 6, the audio object 602 is positioned at an angle θ to the forward direction F of the listener's head 604 along the xy plane, and at an angle φ to the forward direction F of the listener's head 604 in the z direction. The expression "positioned" means that a listener having a head position as indicated by the listener's head 604 may perceive the audio object 602 as being at such a spatial location relative to the listener's head 604 when the audio object 602 is rendered and played by the sound/playback system. An audio object command may change the position/spatial location of an audio object in 3D space by providing new values of θ and φ for a given listener position, or by providing information indicating variations of θ and φ relative to a given listener position. Further, an audio object command may replace the audio object 602 with another audio object. For example, as shown in fig. 6, the audio object 602 is audio object 1. An audio object command may replace audio object 1 with any one of audio objects 2 to p. For a specific example, assuming that the audio stream is a sporting event, the p audio objects may be commentary from play callers in different languages, and the user may select one of the p audio objects depending on which language the listener wants to hear.
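The angle-based positioning above can be made concrete by converting the azimuth and elevation angles into a direction vector relative to the listener's head. This is a sketch only; the exact angle conventions and axis orientation are assumptions, since the figure defines them visually.

```python
import math

def object_direction(theta_deg: float, phi_deg: float):
    """Unit vector toward an audio object positioned at azimuth theta
    (in the xy plane, measured from the forward direction F) and
    elevation phi (toward the z axis). Conventions assumed:
    x = forward, y = left, z = up."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (math.cos(p) * math.cos(t),   # x: forward component
            math.cos(p) * math.sin(t),   # y: lateral component
            math.sin(p))                 # z: height component

# A repositioning command supplies new (theta, phi) values outright:
x, y, z = object_direction(90.0, 0.0)  # directly to the listener's left
```

A command carrying *variations* of the angles would instead add deltas to the object's current (θ, φ) before this conversion.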
Rendering is the processing of an object-based audio signal, based on audio object metadata (e.g., θ, φ, and other parameters), to generate an output audio signal. For example, rendering may be performed by a multi-dimensional audio (MDA) reference renderer, such as a Vector Base Amplitude Panning (VBAP) renderer. VBAP is a method for positioning virtual sources in a particular direction using a setup of multiple loudspeakers, e.g., an International Telecommunication Union (ITU) 5.1/7.1 loudspeaker layout configuration or some other loudspeaker layout configuration. When rendering, the MDA/VBAP renderer renders a set of object-based audio signals to a set of output signals based on one or more audio object commands and on the audio object metadata (e.g., θ, φ, and other parameters).
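A minimal two-loudspeaker VBAP computation, following the standard pairwise formulation (gains g1, g2 solve p = g1·l1 + g2·l2, then are power-normalized), can be sketched as follows. This is an illustration of the VBAP principle, not the MDA reference renderer.

```python
import math

def vbap_2d(source_deg, spk1_deg, spk2_deg):
    """Pairwise 2D VBAP: find gains g1, g2 so that g1*l1 + g2*l2 points
    toward the source direction, then normalize for constant power.
    The 2x2 linear system is solved by hand (Cramer's rule)."""
    def unit(deg):
        r = math.radians(deg)
        return (math.cos(r), math.sin(r))
    p = unit(source_deg)               # desired source direction
    l1, l2 = unit(spk1_deg), unit(spk2_deg)  # loudspeaker directions
    det = l1[0] * l2[1] - l1[1] * l2[0]
    g1 = (p[0] * l2[1] - p[1] * l2[0]) / det
    g2 = (l1[0] * p[1] - l1[1] * p[0]) / det
    norm = math.hypot(g1, g2)          # power normalization
    return g1 / norm, g2 / norm

# A source midway between front speakers at -30 and +30 degrees
# should receive equal gain from both.
g1, g2 = vbap_2d(0.0, -30.0, 30.0)
```

In a full renderer, this pairwise (or, in 3D, triplet-wise) gain computation is repeated per audio object, after any audio object commands have updated the object's θ and φ.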
Fig. 7 is a flow diagram 700 of a method of processing an object-based audio signal for reproduction by a playback system. The method may be performed by a device such as an AV receiver or a television. At 702, the apparatus receives a plurality of object-based audio signals in at least one audio frame. The device may receive the at least one audio frame from one of a set-top box, an OD player, or a television. Alternatively, the apparatus may receive the at least one audio frame from a mobile phone, tablet, streaming stick, media PC, or other source device. For example, referring to fig. 1-4, the sink device 104, 204, 304, 404 receives a plurality of object-based audio signals in an audio frame 110, 224, 324, 424. At 704, the apparatus receives at least one audio object command associated with at least one object-based audio signal of the plurality of object-based audio signals. For example, referring to fig. 1-4, the sink device 104, 204, 304, 404 receives at least one audio object command 108, 226, 326, 426 associated with at least one of the plurality of object-based audio signals. At 706, the device processes the at least one object-based audio signal based on the received at least one audio object command. For example, referring to fig. 1-4, the sink device 104, 204, 304, 404 processes 112, 228, 328, 428 the at least one object-based audio signal based on the received at least one audio object command 108, 226, 326, 426. At 708, the device renders a set of object-based audio signals of the plurality of object-based audio signals to a set of output signals based on the at least one audio object command. For example, referring to fig. 1-4, the sink device 104, 204, 304, 404 renders a set of object-based audio signals of the plurality of object-based audio signals to a set of output signals 114, 230, 330, 430 based on the at least one audio object command 108, 226, 326, 426.
For a particular example, referring to fig. 1-4, the sink device 104, 204, 304, 404 may receive a plurality of object-based audio signals in at least one audio frame. The object-based audio signals may comprise object-based audio signals s_1, s_2, …, s_n. The sink device 104, 204, 304, 404 may also receive audio object commands x_1, x_2, …, x_m associated with the object-based audio signals s_1, s_2, …, s_n. For example, audio object command x_1 may specify that the object-based audio signal s_1 is to be replaced by the object-based audio signal s_2 when rendered. As another example, audio object command x_2 may specify changing the volume of the object-based audio signal s_3. The sink device 104, 204, 304, 404 may then process the object-based audio signals s_1, s_2, s_3 based on the received audio object commands x_1, x_2. Specifically, the sink device 104, 204, 304, 404 may process the object-based audio signals s_1, s_2, s_3 by removing the object-based audio signal s_1, adding the object-based audio signal s_2, and changing the volume of the object-based audio signal s_3. Subsequently, the sink device 104, 204, 304, 404 may render, based on the audio object commands x_1, x_2, …, x_m, a set of the object-based audio signals s_1, s_2, …, s_n (which includes at least s_2 and s_3 but does not include s_1) to a set of output signals.
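The worked example above (replace s_1 with s_2 when rendering, and change the volume of s_3) can be sketched as follows. The command tuple formats are invented for illustration; real commands are carried in the audio object command chunk.

```python
def process_objects(signals, commands):
    """Apply audio object commands to the render set before rendering.
    Command formats (invented for illustration):
      ("replace", old_id, new_id) - swap one signal for another,
      ("remove", sig_id)          - drop a signal from the render set,
      ("gain", sig_id, factor)    - scale a signal's samples."""
    out = dict(signals)  # signal id -> list of samples
    for cmd in commands:
        if cmd[0] == "remove":
            out.pop(cmd[1], None)
        elif cmd[0] == "replace":
            out.pop(cmd[1], None)               # remove the old signal
            out.setdefault(cmd[2], signals[cmd[2]])  # ensure the new one
        elif cmd[0] == "gain":
            out[cmd[1]] = [s * cmd[2] for s in out[cmd[1]]]
    return out

signals = {"s1": [1.0, 1.0], "s2": [0.5, 0.5], "s3": [0.2, 0.2]}
# x1: replace s1 with s2 when rendering; x2: double the volume of s3.
rendered = process_objects(signals, [("replace", "s1", "s2"),
                                     ("gain", "s3", 2.0)])
```

The resulting render set contains s_2 and the volume-adjusted s_3, but not s_1, matching the example in the text.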
In one configuration, at 704, the at least one audio object command is received in the audio frame(s) with the plurality of object-based audio signals. For example, as discussed previously with respect to fig. 2-4, the audio object commands may be received in-band with the object-based audio signals in the audio frame(s). In such a configuration, the at least one audio object command may be appended to the end of the plurality of object-based audio signals in the audio frame(s).
In one configuration, at 704, the at least one audio object command is received separately from the audio frame(s) comprising the plurality of object-based audio signals. The at least one audio object command may be received before, after, or simultaneously with the audio frame(s) comprising the plurality of object-based audio signals. For example, as discussed previously with respect to fig. 2-4, the audio object commands may be received out-of-band with respect to the audio frame(s) carrying the object-based audio signals.
In one configuration, each object-based audio signal of the plurality of object-based audio signals includes audio waveform data and object metadata associated with the audio waveform data. In such a configuration, to process the at least one object-based audio signal based on the received at least one audio object command, the device may modify object metadata of the at least one object-based audio signal based on the at least one audio object command. For example, to process the at least one object based audio signal, the device may modify object metadata associated with the audio waveform data to change the volume of the audio waveform data, reposition a perceived spatial location associated with the audio waveform data, add/remove audio waveform data, adjust a listener location/orientation related to a loudspeaker/playback configuration, or otherwise adjust a parameter, configuration, or property associated with the audio waveform data.
In one configuration, at 706, to process the at least one object-based audio signal based on the received at least one audio object command, the device may modify a volume associated with the at least one object-based audio signal, remove the at least one object-based audio signal from the set of object-based audio signals so that it is not rendered, add the at least one object-based audio signal to the set of object-based audio signals for rendering, replace a first object-based audio signal of the at least one object-based audio signal with a second object-based audio signal of the at least one object-based audio signal when rendering the set of object-based audio signals, modify a spatial location of the at least one object-based audio signal, or change metadata/rendering properties of the at least one object-based audio signal.
Fig. 8 is a flow diagram 800 of a method of processing an object-based audio signal for reproduction by a playback system. The method may be performed by a device such as a set-top box, an OD player, or a television. At 802, the device receives user selection information indicating at least one audio object command associated with at least one object-based audio signal. For example, referring to fig. 2-4, the source device 202, 302, 402 receives user selection information 214, 314, 414, the user selection information 214, 314, 414 indicating at least one audio object command associated with at least one object-based audio signal. At 804, the device obtains the at least one audio object command based on the received user selection information. For example, referring to fig. 2-4, in one configuration, to obtain the at least one audio object command, the source device 202, 302, 402 may generate the at least one audio object command based on the received user selection information 214, 314, 414. As another example, in one configuration, to obtain the at least one audio object command, the source device 202, 302, 402 may send user selection information 220, 320, 420 to the network host 218, 318, 418. In response, the source device 202, 302, 402 may receive the at least one audio object command 222, 322, 422 from the network host 218, 318, 418. At 806, the device receives a plurality of object-based audio signals. For example, referring to fig. 2-4, the device may receive a plurality of object-based audio signals in at least one audio frame 240, 340, 440 from the network host 218, 318, 418. At 808, when the at least one audio object command is transmitted in-band with the plurality of object-based audio signals, the source device 202, 302, 402 may append the at least one audio object command to the end of the plurality of object-based audio signals.
In such a configuration, the source device 202, 302, 402 may transmit the at least one audio object command and the plurality of object-based audio signals in at least one audio frame. At 810, the device transmits (serially or in parallel/simultaneously) the at least one audio object command with the received plurality of object based audio signals. For example, referring to fig. 2-4, the source device 202, 302, 402 transmits the at least one audio object command 226, 326, 426 with the plurality of object-based audio signals 224, 324, 424.
In one configuration, the at least one audio object command is transmitted with the plurality of object-based audio signals in at least one audio frame. For example, as previously discussed with respect to fig. 2-4, the audio object commands 226, 326, 426 may be transmitted in-band with the object-based audio signal within at least one audio frame 224, 324, 424. In one configuration, the at least one audio object command 226, 326, 426 is transmitted separately from at least one audio frame comprising the plurality of object-based audio signals. For example, as previously discussed, the audio object commands 226, 326, 426 may be transmitted out-of-band with the audio frame(s) 224, 324, 424 comprising the object-based audio signal. The source device 202, 302, 402 may transmit the at least one audio object command and the plurality of object-based audio signals to one of an AV receiver or a television.
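The source-device flow of steps 802-810 can be sketched end to end. The command encoding (a plain text tag) and byte payloads are purely illustrative assumptions.

```python
# Hypothetical source-side flow: derive an (unencoded) command from the
# user selection, append it in-band after the encoded object-based audio
# signals, and hand back the frame that would be transmitted to the sink.

def source_send(user_selection, encoded_signals):
    # 802/804: obtain a command from the received user selection
    command = ("volume:%s:%s" % (user_selection["object"],
                                 user_selection["level"])).encode()
    # 806/808: append the command to the end of the encoded signals
    frame = b"".join(encoded_signals) + command
    # 810: this frame (audio plus in-band command) is what gets transmitted
    return frame

frame = source_send({"object": "dialog", "level": 0.8},
                    [b"\x10\x11", b"\x20\x21"])
```

Note the source never decodes or re-encodes the audio; it only concatenates the command onto the already-encoded payload.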
Fig. 9 is a conceptual data flow diagram 900 illustrating the flow of data between different components/assemblies in an exemplary device 902. The device 902 processes the object-based audio signal for reproduction by the playback system. The device 902 includes a receiving component 904, a processor component 906, and a renderer component 908. The receiving component 904 is configured to receive a plurality of object based audio signals 920 in at least one audio frame. Additionally, the receiving component 904 is configured to receive at least one audio object command 922 associated with at least one of the plurality of object-based audio signals 920. The receiving component 904 is configured to provide the object based audio signal 920 and the at least one audio object command 922 to the processor component 906. The processor component 906 is configured to process the at least one object-based audio signal based on the received at least one audio object command 922. The processor component 906 is configured to provide the processed object-based audio signal to the renderer component 908. The renderer component 908 is configured to render a set of object-based audio signals of the plurality of object-based audio signals to a set of output signals 924 based on the at least one audio object command. The set of output signals 924 may be provided to a sound/playback system (e.g., to drive a loudspeaker).
The at least one audio object command may be received in an audio frame(s) having the plurality of object-based audio signals. The at least one audio object command may be appended to the ends of the plurality of object-based audio signals in the audio frame(s). The at least one audio object command may be received separately from the audio frame(s) comprising the plurality of object-based audio signals. Each object-based audio signal of the plurality of object-based audio signals includes audio waveform data and object metadata associated with the audio waveform data. To process the at least one object-based audio signal based on the received at least one audio object command, the processor component 906 may be configured to modify object metadata of the at least one object-based audio signal based on the at least one audio object command. To process the at least one object-based audio signal based on the received at least one audio object command, the processor component 906 may be configured to modify a volume associated with the at least one object-based audio signal, remove the at least one object-based audio signal from the set of object-based audio signals so that it is not rendered, add the at least one object-based audio signal to the set of object-based audio signals for rendering, replace a first object-based audio signal of the at least one object-based audio signal with a second object-based audio signal of the at least one object-based audio signal when rendering the set of object-based audio signals, modify a spatial location of the at least one object-based audio signal, or change metadata/rendering properties of the at least one object-based audio signal. The audio frame(s) may be received from one of a set-top box, an OD player, or a television. The device may be an AV receiver or a television.
Fig. 10 is a diagram 1000 illustrating an example of a hardware implementation of a device 902' employing a processing system 1014. The processing system 1014 may be implemented with a bus architecture, represented generally by the bus 1024. The bus 1024 may include any number of interconnecting buses and bridges depending on the specific application of the processing system 1014 and the overall design constraints. The bus 1024 links together various circuits including one or more processors and/or hardware components, represented by the processor 1004, the components 904, 906, 908, and the computer-readable medium/memory 1006. The bus 1024 may also link various other circuits that are well known in the art, such as timing sources, peripherals, voltage regulators, and power management circuits, which therefore will not be described any further.
The processing system 1014 includes a processor 1004 coupled to a computer-readable medium/memory 1006. The processor 1004 is responsible for general processing, including the execution of software stored on the computer-readable medium/memory 1006. The software, when executed by the processor 1004, causes the processing system 1014 to perform the various functions described supra for any particular apparatus. The computer-readable medium/memory 1006 may also be used for storing data that is manipulated by the processor 1004 when executing software. The processing system 1014 further includes at least one of the components 904, 906, 908. The components may be software components running in the processor 1004, resident/stored in the computer readable medium/memory 1006, one or more hardware components coupled to the processor 1004, or some combination thereof.
In one configuration, an apparatus for processing an object-based audio signal for reproduction by a playback system is provided. The apparatus comprises means for receiving a plurality of object based audio signals in at least one audio frame. In addition, the apparatus includes means for receiving at least one audio object command associated with at least one of the plurality of object-based audio signals. In addition, the apparatus comprises means for processing the at least one object-based audio signal based on the received at least one audio object command. Furthermore, the apparatus comprises means for rendering a set of object based audio signals of the plurality of object based audio signals to a set of output signals based on the at least one audio object command. In one configuration, each object-based audio signal of the plurality of object-based audio signals includes audio waveform data and object metadata associated with the audio waveform data. In such a configuration, the means for processing the at least one object-based audio signal based on the received at least one audio object command is configured to modify object metadata of the at least one object-based audio signal based on the at least one audio object command. 
In one configuration, the means for processing the at least one object-based audio signal based on the received at least one audio object command is configured to perform at least one of the following operations: modifying a volume associated with the at least one object-based audio signal, removing the at least one object-based audio signal from the set of object-based audio signals so that it is not rendered, adding the at least one object-based audio signal to the set of object-based audio signals for rendering, replacing a first object-based audio signal of the at least one object-based audio signal with a second object-based audio signal of the at least one object-based audio signal when rendering the set of object-based audio signals, modifying a spatial location of the at least one object-based audio signal, or changing metadata/rendering properties of the at least one object-based audio signal.
Fig. 11 is a conceptual data flow diagram 1100 illustrating data flow between different components/assemblies in an exemplary device. The device 1102 processes the object-based audio signal for reproduction by the playback system. The device 1102 includes a receiving component 1104, a commanding component 1106, and a transmitting component 1108. The receiving component 1104 is configured to receive user selection information 1122, the user selection information 1122 indicating at least one audio object command associated with at least one object-based audio signal. The command component 1106 is configured to obtain the at least one audio object command based on the received user selection information. The receiving component 1104 is configured to receive a plurality of object-based audio signals 1120. The receiving component 1104 is configured to provide the plurality of object based audio signals 1120 to the transmitting component 1108. The command component 1106 is configured to provide the at least one audio object command to the transmitting component 1108. The transmitting component 1108 is configured to transmit the at least one audio object command with the received plurality of object-based audio signals.
In one configuration, the transmitting component 1108 is configured to append the at least one audio object command to the ends of the plurality of object-based audio signals. In such a configuration, the at least one audio object command and the plurality of object based audio signals are transmitted in at least one audio frame. The command component 1106 may be configured to obtain the at least one audio object command based on the received user selection information by generating the at least one audio object command based on the received user selection information. The command component 1106 may be configured to obtain the at least one audio object command by sending the received user selection information to the network host and receiving the at least one audio object command from the network host. The at least one audio object command is based on the transmitted user selection information.
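The two ways the command component may obtain a command, generating it locally or deferring to a network host, can be sketched as below. The callable standing in for the network host, and the command/selection field names, are assumptions for illustration.

```python
# Sketch of the command component's two configurations: build the command
# from the user selection locally, or send the selection to a network host
# and receive the command back (modeled here as a plain callable).

def obtain_command(user_selection, network_host=None):
    if network_host is not None:
        # Send the user selection to the host; the host returns the command.
        return network_host(user_selection)
    # Otherwise generate the command locally from the selection.
    return {"op": user_selection["action"], "target": user_selection["object"]}

local = obtain_command({"action": "mute", "object": "crowd"})
hosted = obtain_command({"action": "mute", "object": "crowd"},
                        network_host=lambda sel: {"op": "remove",
                                                  "target": sel["object"]})
```

Either way, the resulting command is then handed to the transmitting component unchanged.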
Fig. 12 is a diagram 1200 illustrating an example of a hardware implementation of a device 1102' employing a processing system 1214. The processing system 1214 may be implemented with a bus architecture, represented generally by the bus 1224. The bus 1224 may include any number of interconnecting buses and bridges depending on the specific application of the processing system 1214 and the overall design constraints. The bus 1224 links together various circuits including one or more processors and/or hardware components, represented by the processor 1204, the components 1104, 1106, 1108, and the computer-readable medium/memory 1206. The bus 1224 may also link various other circuits well known in the art, such as timing sources, peripherals, voltage regulators, and power management circuits, which therefore will not be described any further.
The processing system 1214 includes a processor 1204 coupled to a computer-readable medium/memory 1206. The processor 1204 is responsible for general processing, including the execution of software stored on the computer-readable medium/memory 1206. The software, when executed by the processor 1204, causes the processing system 1214 to perform the various functions described supra for any particular apparatus. The computer-readable medium/memory 1206 may also be used for storing data that is manipulated by the processor 1204 when executing software. The processing system 1214 further includes at least one of the components 1104, 1106, 1108. The components may be software components running in the processor 1204, resident/stored in the computer readable medium/memory 1206, one or more hardware components coupled to the processor 1204, or some combination thereof.
In one configuration, an apparatus is provided for processing an object-based audio signal for reproduction by a playback system. The apparatus comprises means for receiving user selection information indicative of at least one audio object command associated with at least one object-based audio signal. The apparatus further comprises means for obtaining the at least one audio object command based on the received user selection information. The apparatus further includes means for receiving a plurality of object based audio signals. The apparatus further comprises means for transmitting the at least one audio object command with the received plurality of object based audio signals. The apparatus may further comprise means for appending the at least one audio object command to the end of the plurality of object based audio signals. The at least one audio object command and the plurality of object based audio signals may be transmitted in at least one audio frame. In one configuration, the means for obtaining the at least one audio object command based on the received user selection information is configured to generate the at least one audio object command based on the received user selection information. In one configuration, the means for obtaining the at least one audio object command based on the received user selection information is configured to send the received user selection information to the network host and receive the at least one audio object command from the network host, the at least one audio object command being based on the sent user selection information.
The various illustrative logical blocks, components, methods, algorithmic processes, and sequences described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, components, and process actions have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. The described functionality may be implemented in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of this document.
The various illustrative logical blocks and components described in connection with the embodiments disclosed herein may be implemented or performed with a machine such as a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be a controller, microcontroller, or state machine, combinations of these, or the like. A processor may also be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
Embodiments of the interactive audio metadata handling system and method described herein are operational with numerous types of general purpose or special purpose computing system environments or configurations. In general, a computing environment may include any type of computer system, including but not limited to one or more microprocessor-based computer systems, mainframe computers, digital signal processors, portable computing devices, personal organizers, device controllers, computing engines within an appliance, mobile telephones, desktop computers, mobile computers, tablet computers, smart phones, AV receivers, televisions, STBs, OD players, appliances with embedded computers, to name a few.
Such computing devices may typically be found in devices having at least some minimal computing power, including but not limited to personal computers, server computers, hand-held computing devices, laptop or mobile computers, communication devices (such as cellular telephones and PDAs), multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, audio or video media players, and the like. In some embodiments, the computing device will include one or more processors. Each processor may be a specialized microprocessor, such as a DSP, Very Long Instruction Word (VLIW), or other microcontroller, or may be a conventional CPU having one or more processing cores, including specialized Graphics Processing Unit (GPU) based cores in a multi-core CPU.
The processing acts of a method, process, or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software component executed by a processor, or in any combination of the two. The software components may be embodied in a computer-readable medium that may be accessed by a computing device. Computer-readable media include volatile and nonvolatile media that are removable, non-removable, or some combination thereof. Computer-readable media are used to store information such as computer-readable or computer-executable instructions, data structures, program components, or other data. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media.
Computer storage media includes, but is not limited to, computer or machine readable media or storage devices, such as optical storage, blu-ray disc (BD), Digital Versatile Discs (DVD), Compact Discs (CD), floppy disks, tape drives, hard drives, optical drives, solid state memory devices, Random Access Memory (RAM) memories, ROM memories, EPROM memories, EEPROM memories, flash memory or other memory technology, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other device that can be used to store the desired information and that can be accessed by one or more computer devices.
The software components may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of non-transitory computer-readable storage medium, or physical computer storage known in the art. An exemplary storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
The phrase "non-transitory" as used in this document means "persistent or long-lived". The phrase "non-transitory computer readable medium" includes any and all computer readable media, with the sole exception of transitory, propagating signals. By way of example, and not limitation, this includes non-transitory computer-readable media such as register memory, processor cache, and RAM.
The maintenance of information such as computer-readable or computer-executable instructions, data structures, program components, etc. may also be implemented by encoding one or more modulated data signals, electromagnetic waves (such as carrier waves), or other transmission mechanisms or communication protocols using various communication media, and includes any wired or wireless information delivery mechanisms. In general, these communication media refer to signals having one or more of their characteristics set or changed in such a manner as to encode information or instructions in the signal. For example, communication media includes wired media such as a wired network or direct-wired connection carrying one or more modulated data signals, and wireless media such as acoustic, Radio Frequency (RF), infrared, laser, and other wireless media for transmitting, receiving, or both transmitting and receiving one or more modulated data signals or electromagnetic waves. Combinations of any of the above should also be included within the scope of communication media.
Further, one or any of the software, programs, computer program products, or portions thereof, which implement some or all of the various embodiments of the interactive audio metadata handling systems and methods described herein may be stored, received, transmitted, or read from a computer or machine-readable medium or any desired combination of storage and communication media in the form of computer-executable instructions or other data structures.
Embodiments of the interactive audio metadata handling systems and methods described herein may be further described in the general context of computer-executable instructions, such as program components, being executed by a computing device. Generally, program components include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The embodiments described herein may also be practiced in distributed computing environments where tasks are performed by one or more remote processing devices or within a cloud of one or more devices that are linked through one or more communications networks. In a distributed computing environment, program components may be located in both local and remote computer storage media including media storage devices. Still further, the foregoing instructions may be implemented partially or wholly as hardware logic circuits, which may or may not include a processor.
As used herein, conditional language (such as "can," "might," "may," and the like) is generally intended to convey that certain embodiments include, but other embodiments do not include, certain features, elements and/or states unless specifically stated otherwise or otherwise understood within the context in which it is used. Thus, such conditional language is not generally intended to imply that features, elements, and/or states are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without author input or prompting, whether these features, elements, and/or states are included or are to be performed in any particular embodiment. The terms "comprising," "including," "having," and the like, are synonymous, are used inclusively in an open-ended fashion, and do not exclude additional elements, features, acts, operations, or the like. Furthermore, the term "or" is used in its inclusive sense (and not its exclusive sense) such that, when used in connection with, for example, a list of elements, the term "or" means one, some or all of the elements in the list.
While the above detailed description has shown, described, and pointed out novel features as applied to various embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the device or algorithm illustrated may be made without departing from the spirit of the disclosure. As will be recognized, certain embodiments of the interactive audio metadata handling systems and methods described herein may be embodied within a form that does not provide all of the features and benefits set forth herein, as some features may be used or practiced separately from others.
Furthermore, although the subject matter has been described in language specific to structural features and methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean "one and only one" unless specifically so stated, but rather "one or more." The word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any aspect described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects. The term "some" means one or more unless specifically stated otherwise. Combinations such as "at least one of A, B, or C", "at least one of A, B, and C", and "A, B, C, or any combination thereof" include any combination of A, B, and/or C, and may include a plurality of A, B, or C. Specifically, combinations such as "at least one of A, B, or C", "at least one of A, B, and C", and "A, B, C, or any combination thereof" may be A only, B only, C only, A and B, A and C, B and C, or A and B and C, wherein any such combination may comprise one or more members of A, B, or C. Any structural and functional equivalents to the elements of the aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed as a means plus function unless the element is expressly recited using the phrase "means for".

Claims (50)

1. A method of processing an object based audio signal for reproduction by a playback system, the method comprising:
receiving a plurality of encoded object-based audio signals in at least one audio frame;
receiving at least one audio object command associated with at least one of the plurality of encoded object-based audio signals, wherein the at least one audio object command is not encoded;
sending at least some of the plurality of encoded object-based audio signals to the playback system with the at least one audio object command that is not encoded;
processing the at least one object-based audio signal based on the received at least one audio object command; and
rendering a set of object-based audio signals of the plurality of encoded object-based audio signals to a set of output signals based on the at least one audio object command.
2. The method of claim 1, wherein the at least one audio object command is received with the plurality of encoded object-based audio signals in the at least one audio frame.
3. The method of claim 2, wherein the at least one audio object command is appended to the end of the plurality of encoded object-based audio signals in the at least one audio frame.
4. The method of claim 1, wherein the at least one audio object command is received separately from the at least one audio frame comprising the plurality of encoded object-based audio signals.
5. The method of claim 1, wherein each object-based audio signal of the plurality of encoded object-based audio signals comprises audio waveform data and object metadata associated with the audio waveform data, the processing of the at least one object-based audio signal based on the received at least one audio object command comprising modifying the object metadata of the at least one object-based audio signal based on the at least one audio object command.
6. The method of claim 1, wherein processing the at least one object-based audio signal based on the received at least one audio object command comprises at least one of:
modifying a volume associated with the at least one object-based audio signal;
removing the at least one object-based audio signal from rendering in the set of object-based audio signals;
adding the at least one object-based audio signal to the set of object-based audio signals for rendering;
replacing a first object-based audio signal of the at least one object-based audio signal with a second object-based audio signal of the at least one object-based audio signal when rendering the set of object-based audio signals;
modifying a spatial location of the at least one object-based audio signal; or
Changing a property of the at least one object-based audio signal.
7. The method of claim 1, wherein the at least one audio frame is received from one of a set-top box, an optical disc player, or a television.
8. The method of claim 1, wherein the method is performed by one of an Audio Video (AV) receiver or a television.
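As an illustrative, non-normative sketch of the sink-side method of claims 1-8: the frame layout below (a 4-byte big-endian payload length, the encoded audio payload, then a plain-JSON command appended at the end of the frame) and the command schema are assumptions for illustration only, not the actual bitstream format. The point the claims make is preserved: the appended command is unencoded and is applied to object metadata without decoding the audio waveform data.

```python
import json
import struct

def unpack_frame(frame: bytes):
    """Split a received frame into the encoded audio payload and the
    appended, unencoded audio object command (None if absent)."""
    (payload_len,) = struct.unpack_from(">I", frame, 0)
    payload = frame[4:4 + payload_len]
    tail = frame[4 + payload_len:]
    return payload, (json.loads(tail) if tail else None)

def apply_command(object_metadata: dict, command: dict) -> dict:
    """Modify per-object metadata according to one command, without
    touching the encoded waveform data: change volume, remove an object
    from rendering, or move its spatial location (cf. claim 6)."""
    obj = object_metadata[command["object_id"]]
    if command["op"] == "set_volume":
        obj["gain"] = float(command["value"])
    elif command["op"] == "remove":
        obj["active"] = False
    elif command["op"] == "move":
        obj["position"] = tuple(command["value"])
    return object_metadata
```

Rendering would then mix only the objects whose metadata is still active, using the possibly modified gains and positions.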
9. A method of processing an object-based audio signal for reproduction by a playback system, the method comprising:
receiving user selection information indicative of at least one audio object command associated with at least one object-based audio signal;
obtaining the at least one audio object command based on the received user selection information, wherein the at least one audio object command is unencoded;
receiving a plurality of encoded object-based audio signals; and
sending, with the received plurality of encoded object-based audio signals, the at least one audio object command that is not encoded.
10. The method of claim 9, wherein the at least one audio object command is transmitted with the plurality of encoded object-based audio signals in at least one audio frame.
11. The method of claim 9, further comprising appending the at least one audio object command to the end of the plurality of encoded object-based audio signals, the at least one audio object command and the plurality of encoded object-based audio signals being sent in at least one audio frame.
12. The method of claim 9, wherein the at least one audio object command is transmitted separately from at least one audio frame comprising the plurality of encoded object-based audio signals.
13. The method of claim 9, wherein obtaining the at least one audio object command based on the received user selection information comprises generating the at least one audio object command based on the received user selection information.
14. The method of claim 9, wherein obtaining the at least one audio object command based on the received user selection information comprises:
transmitting the received user selection information to a network host; and
receiving the at least one audio object command from the network host, the at least one audio object command being based on the transmitted user selection information.
15. The method of claim 9, wherein the at least one audio object command and the plurality of encoded object-based audio signals are transmitted to one of an Audio Video (AV) receiver or a television.
16. The method of claim 9, wherein the method is performed by one of a set-top box, an optical disc player, or a television.
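A source-side sketch of claims 9-16 (illustration only; the command schema, field names, and frame layout are assumptions): the user selection is turned into an unencoded command and appended to the still-encoded signals, so the encoded payload passes through byte-for-byte and no audio decode/re-encode is needed at the source.

```python
import json
import struct

def command_from_selection(selection: dict) -> dict:
    """Generate an unencoded audio object command from user selection
    info, e.g. a remote-control press that removes a commentary object.
    The field names here are hypothetical."""
    return {
        "object_id": selection["object"],
        "op": selection["action"],
        "value": selection.get("value"),
    }

def pack_frame(encoded_payload: bytes, command: dict) -> bytes:
    """Append the plain-JSON command to the end of the length-prefixed,
    still-encoded object-based audio payload (cf. claim 11). The encoded
    bytes are forwarded unchanged."""
    cmd = json.dumps(command).encode("utf-8")
    return struct.pack(">I", len(encoded_payload)) + encoded_payload + cmd
```

Alternatively, per claim 14, the selection information could be sent to a network host that returns the command, with `pack_frame` unchanged.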
17. An apparatus for processing an object-based audio signal for reproduction by a playback system, the apparatus comprising:
a memory; and
at least one processor coupled to the memory and configured to:
receive a plurality of encoded object-based audio signals in at least one audio frame;
receive at least one audio object command associated with at least one of the plurality of encoded object-based audio signals, wherein the at least one audio object command is not encoded;
send, with the at least one audio object command that is not encoded, at least some of the plurality of encoded object-based audio signals to the playback system;
process the at least one object-based audio signal based on the received at least one audio object command; and
render a set of object-based audio signals of the plurality of encoded object-based audio signals to a set of output signals based on the at least one audio object command.
18. The apparatus of claim 17, wherein the at least one audio object command is received with the plurality of encoded object-based audio signals in the at least one audio frame.
19. The apparatus of claim 18, wherein the at least one audio object command is appended to the end of the plurality of encoded object-based audio signals in the at least one audio frame.
20. The apparatus of claim 17, wherein the at least one audio object command is received separately from the at least one audio frame comprising the plurality of encoded object-based audio signals.
21. The apparatus of claim 17, wherein each object-based audio signal of the plurality of encoded object-based audio signals comprises audio waveform data and object metadata associated with the audio waveform data, and wherein to process the at least one object-based audio signal based on the received at least one audio object command, the at least one processor is configured to modify the object metadata of the at least one object-based audio signal based on the at least one audio object command.
22. The apparatus of claim 17, wherein to process the at least one object-based audio signal based on the received at least one audio object command, the at least one processor is configured to perform at least one of:
modifying a volume associated with the at least one object-based audio signal;
removing the at least one object-based audio signal from rendering in the set of object-based audio signals;
adding the at least one object-based audio signal to the set of object-based audio signals for rendering;
replacing a first object-based audio signal of the at least one object-based audio signal with a second object-based audio signal of the at least one object-based audio signal when rendering the set of object-based audio signals;
modifying a spatial location of the at least one object-based audio signal; or
changing a property of the at least one object-based audio signal.
23. The apparatus of claim 17, wherein the at least one audio frame is received from one of a set-top box, an optical disc player, or a television.
24. The apparatus of claim 17, wherein the apparatus is one of an Audio Video (AV) receiver or a television.
25. An apparatus for processing an object-based audio signal for reproduction by a playback system, the apparatus comprising:
a memory; and
at least one processor coupled to the memory and configured to:
receive user selection information indicative of at least one audio object command associated with at least one object-based audio signal;
obtain the at least one audio object command based on the received user selection information, wherein the at least one audio object command is unencoded;
receive a plurality of encoded object-based audio signals; and
send, with the received plurality of encoded object-based audio signals, the at least one audio object command that is not encoded.
26. The apparatus of claim 25, wherein the at least one audio object command is transmitted with the plurality of encoded object-based audio signals in at least one audio frame.
27. The apparatus of claim 25, wherein the at least one processor is further configured to append the at least one audio object command to the end of the plurality of encoded object-based audio signals, the at least one audio object command and the plurality of encoded object-based audio signals being transmitted in at least one audio frame.
28. The apparatus of claim 25, wherein the at least one audio object command is transmitted separately from at least one audio frame comprising the plurality of encoded object-based audio signals.
29. The apparatus of claim 25, wherein to obtain the at least one audio object command based on the received user selection information, the at least one processor is configured to generate the at least one audio object command based on the received user selection information.
30. The apparatus of claim 25, wherein to obtain the at least one audio object command based on the received user selection information, the at least one processor is configured to:
transmit the received user selection information to a network host; and
receive the at least one audio object command from the network host, the at least one audio object command being based on the transmitted user selection information.
31. The apparatus of claim 25, wherein the at least one audio object command and the plurality of encoded object-based audio signals are transmitted to one of an Audio Video (AV) receiver or a television.
32. The apparatus of claim 25, wherein the apparatus is one of a set-top box, an optical disc player, or a television.
33. An apparatus for processing an object-based audio signal for reproduction by a playback system, the apparatus comprising:
means for receiving a plurality of encoded object-based audio signals in at least one audio frame;
means for receiving at least one audio object command associated with at least one of the plurality of encoded object-based audio signals, wherein the at least one audio object command is unencoded;
means for transmitting, with the at least one audio object command that is not encoded, at least some of the plurality of encoded object-based audio signals to the playback system;
means for processing the at least one object-based audio signal based on the received at least one audio object command; and
means for rendering a set of object-based audio signals of the plurality of encoded object-based audio signals to a set of output signals based on the at least one audio object command.
34. The apparatus of claim 33, wherein the at least one audio object command is received with the plurality of encoded object-based audio signals in the at least one audio frame.
35. The apparatus of claim 34, wherein the at least one audio object command is appended to the end of the plurality of encoded object-based audio signals in the at least one audio frame.
36. The apparatus of claim 33, wherein the at least one audio object command is received separately from the at least one audio frame comprising the plurality of encoded object-based audio signals.
37. The apparatus of claim 33, wherein each object-based audio signal of the plurality of encoded object-based audio signals comprises audio waveform data and object metadata associated with the audio waveform data, and wherein the means for processing the at least one object-based audio signal based on the received at least one audio object command is configured to modify the object metadata of the at least one object-based audio signal based on the at least one audio object command.
38. The apparatus of claim 33, wherein the means for processing the at least one object-based audio signal based on the received at least one audio object command is configured to perform at least one of:
modifying a volume associated with the at least one object-based audio signal;
removing the at least one object-based audio signal from rendering in the set of object-based audio signals;
adding the at least one object-based audio signal to the set of object-based audio signals for rendering;
replacing a first object-based audio signal of the at least one object-based audio signal with a second object-based audio signal of the at least one object-based audio signal when rendering the set of object-based audio signals;
modifying a spatial location of the at least one object-based audio signal; or
changing a property of the at least one object-based audio signal.
39. The apparatus of claim 33, wherein the at least one audio frame is received from one of a set-top box, an optical disc player, or a television.
40. The apparatus of claim 33, wherein the apparatus is one of an Audio Video (AV) receiver or a television.
41. An apparatus for processing an object-based audio signal for reproduction by a playback system, the apparatus comprising:
means for receiving user selection information indicative of at least one audio object command associated with at least one object-based audio signal;
means for obtaining the at least one audio object command based on the received user selection information, wherein the at least one audio object command is unencoded;
means for receiving a plurality of encoded object-based audio signals; and
means for transmitting, with the received plurality of encoded object-based audio signals, the at least one audio object command that is not encoded.
42. The apparatus of claim 41, wherein the at least one audio object command is transmitted in at least one audio frame together with the plurality of encoded object-based audio signals.
43. The apparatus of claim 41, further comprising means for appending the at least one audio object command to the end of the plurality of encoded object-based audio signals, the at least one audio object command and the plurality of encoded object-based audio signals being sent in at least one audio frame.
44. The apparatus of claim 41, wherein the at least one audio object command is transmitted separately from at least one audio frame comprising the plurality of encoded object-based audio signals.
45. The apparatus of claim 41, wherein the means for obtaining the at least one audio object command based on the received user selection information is configured to generate the at least one audio object command based on the received user selection information.
46. The apparatus of claim 41, wherein the means for obtaining the at least one audio object command based on the received user selection information is configured to:
transmit the received user selection information to a network host; and
receive the at least one audio object command from the network host, the at least one audio object command being based on the transmitted user selection information.
47. The apparatus of claim 41, wherein the at least one audio object command and the plurality of encoded object-based audio signals are transmitted to one of an Audio Video (AV) receiver or a television.
48. The apparatus of claim 41, wherein the apparatus is one of a set-top box, an optical disc player, or a television.
49. A computer-readable medium storing computer-executable code for processing object-based audio signals for reproduction by a playback system, the computer-readable medium comprising code for:
receiving a plurality of encoded object-based audio signals in at least one audio frame;
receiving at least one audio object command associated with at least one of the plurality of encoded object-based audio signals, wherein the at least one audio object command is not encoded;
sending, with the at least one audio object command that is not encoded, at least some of the plurality of encoded object-based audio signals to the playback system;
processing the at least one object-based audio signal based on the received at least one audio object command; and
rendering a set of object-based audio signals of the plurality of encoded object-based audio signals to a set of output signals based on the at least one audio object command.
50. A computer-readable medium storing computer-executable code for processing object-based audio signals for reproduction by a playback system, the computer-readable medium comprising code for:
receiving user selection information indicative of at least one audio object command associated with at least one object-based audio signal;
obtaining the at least one audio object command based on the received user selection information, wherein the at least one audio object command is unencoded;
receiving a plurality of encoded object-based audio signals; and
transmitting, with the received plurality of encoded object-based audio signals, the at least one audio object command that is not encoded.
HK19101102.2A 2016-03-23 2017-03-14 Interactive audio metadata handling HK1258771B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US15/078,945 US10027994B2 (en) 2016-03-23 2016-03-23 Interactive audio metadata handling
US15/078,945 2016-03-23
PCT/US2017/022355 WO2017165157A1 (en) 2016-03-23 2017-03-14 Interactive audio metadata handling

Publications (2)

Publication Number Publication Date
HK1258771A1 HK1258771A1 (en) 2019-11-22
HK1258771B true HK1258771B (en) 2022-05-20
