
WO2013088208A1 - An audio scene alignment apparatus - Google Patents


Info

Publication number
WO2013088208A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
audio
signal
combined
audio signals
Prior art date
Application number
PCT/IB2011/055692
Other languages
French (fr)
Inventor
Juha Petteri Ojanpera
Original Assignee
Nokia Corporation
Priority date
Filing date
Publication date
Application filed by Nokia Corporation
Priority to PCT/IB2011/055692
Publication of WO2013088208A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302 Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307 Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N21/43072 Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen of multiple content streams on the same device
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B27/034 Electronic editing of digitised analogue information signals, e.g. audio or video signals on discs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/21805 Source of audio or video content, e.g. local disk arrays enabling multiple viewpoints, e.g. using a plurality of cameras
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/2187 Live feed
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/27 Server based end-user applications
    • H04N21/274 Storing end-user multimedia data in response to end-user request, e.g. network recorder
    • H04N21/2743 Video hosting of uploaded data from client
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45 Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/462 Content or additional data management, e.g. creating a master electronic program guide from data received from the Internet and a Head-end, controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
    • H04N21/4622 Retrieving content or additional data from different sources, e.g. from a broadcast channel and the Internet
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81 Monomedia components thereof
    • H04N21/8106 Monomedia components thereof involving special audio data, e.g. different tracks for different languages
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85 Assembly of content; Generation of multimedia applications
    • H04N21/854 Content authoring
    • H04N21/8547 Content authoring involving timestamps for synchronizing content
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/414 Specialised client platforms, e.g. receiver in car or embedded in a mobile appliance
    • H04N21/41407 Specialised client platforms, e.g. receiver in car or embedded in a mobile appliance embedded in a portable device, e.g. video client on a mobile phone, PDA, laptop
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42202 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS] environmental sensors, e.g. for detecting temperature, luminosity, pressure, earthquakes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42203 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS] sound input device, e.g. microphone
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/4223 Cameras

Definitions

  • the present application relates to apparatus for the processing of audio, and additionally audio-video, signals to enable the alignment of audio signals.
  • the invention further relates to, but is not limited to, apparatus for processing audio and additionally audio-video signals from mobile devices.
  • Multiple 'feeds' may be found in sharing services for video and audio signals (such as those employed by YouTube).
  • Such systems are known and are widely used to share user generated content recorded and uploaded or up-streamed to a server and then downloaded or down-streamed to a viewing/listening user.
  • Such systems rely on users recording and uploading or up-streaming a recording of an event using the recording facilities at hand to the user. This may typically be in the form of the camera and microphone arrangement of a mobile device such as a mobile phone.
  • the viewing/listening end user may then select one of the up-streamed or uploaded data streams to view or listen to.
  • an apparatus comprising at least one processor and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured to, with the at least one processor, cause the apparatus to at least perform: generating a combined audio signal from at least two first audio signals; comparing the combined audio signal to at least one second audio signal; determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and associating the alignment value to the at least one second audio signal so as to temporally align the at least two first audio signals with the at least one second audio signal.
  • the apparatus may be further caused to perform: receiving the at least two first audio signals from an audio scene; receiving the at least one second audio signal from the audio scene.
  • the apparatus may be further caused to perform aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
  • the apparatus may be further caused to render the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
  • Generating a combined audio signal from at least two first audio signals may cause the apparatus to perform: generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and appending to the combined audio signal any available part of the at least two first audio signals otherwise.
  • Comparing the combined audio signal to at least one second audio signal may cause the apparatus to perform a cross-correlation between the combined audio signal and the at least one second audio signal.
  • Determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal may cause the apparatus to perform determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
  • Associating the alignment value to the at least two first audio signals may cause the apparatus to perform: assigning the at least two first audio signals to a first group of audio signals; assigning the at least one second audio signal to a second group of audio signals; assigning a null alignment value to the first group of audio signals; and assigning the alignment value to the second group of audio signals.
  • the apparatus may be further caused to perform: generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value; comparing the further combined audio signal to at least one further second audio signal; determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal; and associating the alignment value to the at least one further second audio signal so as to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
  • the first audio signals may comprise timestamp information and the second audio signals lack the timestamp information.
  • a method comprising: generating a combined audio signal from at least two first audio signals; comparing the combined audio signal to at least one second audio signal; determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and associating the alignment value to the at least one second audio signal so as to temporally align the at least two first audio signals with the at least one second audio signal.
  • the method may further comprise: receiving the at least two first audio signals from an audio scene; receiving the at least one second audio signal from the audio scene.
  • the method may further comprise aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
  • the method may further comprise rendering the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
  • Generating a combined audio signal from at least two first audio signals may comprise: generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and appending to the combined audio signal any available part of the at least two first audio signals otherwise.
  • Comparing the combined audio signal to at least one second audio signal may comprise a cross-correlation between the combined audio signal and the at least one second audio signal.
  • Determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal may comprise determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
  • Associating the alignment value to the at least two first audio signals may comprise: assigning the at least two first audio signals to a first group of audio signals; assigning the at least one second audio signal to a second group of audio signals; assigning a null alignment value to the first group of audio signals; and assigning the alignment value to the second group of audio signals.
  • the method may further comprise: generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value; comparing the further combined audio signal to at least one further second audio signal; determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal; and associating the alignment value to the at least one further second audio signal so as to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
  • the first audio signals may comprise timestamp information and the second audio signals lack the timestamp information.
  • an apparatus comprising: a signal combiner configured to generate a combined audio signal from at least two first audio signals; a combined signal comparator configured to compare the combined audio signal to at least one second audio signal; a signal aligner configured to determine an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and a full signal aligner configured to associate the alignment value to the at least one second audio signal so as to temporally align the at least two first audio signals with the at least one second audio signal.
  • the apparatus may further comprise: a first signal receiver configured to receive the at least two first audio signals from an audio scene; and a second signal receiver configured to receive the at least one second audio signal from the audio scene.
  • the apparatus may further comprise a delay configured to align at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
  • the apparatus may further comprise a signal renderer configured to render the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
  • the signal combiner may comprise: an averager configured to generate a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and an appender configured to append to the combined audio signal an available at least two first audio signal part otherwise.
  • the signal comparator may comprise a cross-correlator configured to generate a cross-correlation product between the combined audio signal and the at least one second audio signal.
  • the alignment determiner may comprise an offset determiner configured to determine a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
  • the full signal aligner may comprise: a first assigner configured to assign the at least two first audio signals to a first group of audio signals; a second assigner configured to assign the at least one second audio signal to a second group of audio signals; a null assigner configured to assign a null alignment value to the first group of audio signals; and a delay assigner configured to assign the alignment value to the second group of audio signals.
  • the signal combiner may be configured to generate a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value.
  • the comparator may be configured to further compare the further combined audio signal to at least one further second audio signal.
  • the aligner may be configured to further determine a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal.
  • the full signal aligner may be further configured to associate the alignment value to the at least one further second audio signal so as to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
  • the first audio signals may comprise timestamp information and the second audio signals lack the timestamp information.
  • an apparatus comprising: means for generating a combined audio signal from at least two first audio signals; means for comparing the combined audio signal to at least one second audio signal; means for determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and means for associating the alignment value to the at least one second audio signal so as to temporally align the at least two first audio signals with the at least one second audio signal.
  • the apparatus may further comprise: means for receiving the at least two first audio signals from an audio scene; means for receiving the at least one second audio signal from the audio scene.
  • the apparatus may further comprise means for aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
  • the apparatus may further comprise means for rendering the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
  • the means for generating a combined audio signal from at least two first audio signals may comprise: means for generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and means for appending to the combined audio signal any available part of the at least two first audio signals otherwise.
  • the means for comparing the combined audio signal to at least one second audio signal may comprise means for generating a cross-correlation product between the combined audio signal and the at least one second audio signal.
  • the means for determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal may comprise means for determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
  • the means for associating the alignment value to the at least two first audio signals may comprise: means for assigning the at least two first audio signals to a first group of audio signals; means for assigning the at least one second audio signal to a second group of audio signals; means for assigning a null alignment value to the first group of audio signals; and means for assigning the alignment value to the second group of audio signals.
  • the means for combining may further comprise means for generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value.
  • the means for comparing may further comprise means for comparing the further combined audio signal to at least one further second audio signal.
  • the means for determining an alignment value may comprise means for determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal.
  • the means for associating the alignment value may comprise means for associating the alignment value to the at least one further second audio signal so as to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
  • the first audio signals may comprise timestamp information and the second audio signals lack the timestamp information.
  • a computer program product stored on a medium may cause an apparatus to perform the method as described herein.
  • An electronic device may comprise apparatus as described herein.
  • a chipset may comprise apparatus as described herein.
  • Embodiments of the present application aim to address problems associated with the state of the art.
  • Figure 1 shows schematically a multi-user free-viewpoint service sharing system which may encompass embodiments of the application;
  • Figure 2 shows schematically an apparatus suitable for being employed in embodiments of the application;
  • Figure 3 shows schematically an audio signal system according to some embodiments;
  • Figure 4 shows a flow diagram of the operation of the audio signal system as shown in Figure 3;
  • Figure 5 shows schematically the time stamp determiner shown in Figure 3 in further detail according to some embodiments;
  • Figure 6 shows the operation of the time stamp determiner shown in Figure 5 according to some embodiments; and
  • Figure 7 shows an example set of time stamped and non-time stamped audio signals to be aligned according to embodiments.
  • audio signals and audio capture signals are described. However it would be appreciated that in some embodiments the audio signal/audio capture is a part of an audio-video system.
  • the concept of this application is related to assisting in the production of immersive person-to-person communication and can include video. It would be understood that the space within which the devices record the audio signal can be arbitrarily positioned within an event space.
  • the captured signals as described herein are transmitted or alternatively stored for later consumption where the end user can select the listening point based on their preference from the reconstructed audio space.
  • the rendering part then can provide one or more down mixed signals from the multiple recordings that correspond to the selected listening point.
  • each recording device can record the event seen and upload or upstream the recorded content.
  • the uploading or up-streaming process can implicitly include positioning information about where the content is being recorded.
  • audio signal content can be uploaded in non-real time operations.
  • the media content in the form of a captured audio signal can be uploaded a few minutes, hours, days or weeks after the event.
  • the amount of content that represents each particular event and held in the server can fluctuate as a function of time.
  • uploaded audio signal content typically does not employ common time keeping or time stamping and therefore the newly uploaded or streamed content needs to be aligned to use a common time stamping before any down mixed signal is provided for consumption.
  • as new content can be uploaded to the server at any time, a situation could arise where the existing content already uses a common time stamping, whereas new content lacks this time stamp and should be transformed to this time stamping mode.
  • the concept of this application therefore is to provide an enabler for the case where some of the audio signal content for the event is already converted to a common time stamping and the conversion is to be applied to any new uploaded audio signal content.
  • a common time base can be achieved with a dedicated synchronisation signal
  • the capture devices are equipped to receive a specific beacon signal or timing information obtained through a network or other received data such as positioning satellite timing data (such as from a GPS satellite)
  • the use of a beacon signal typically requires special hardware and/or software installations to the recording or capture apparatus which limits the applicability of multiuser sharing services as recording devices become too expensive for mass use or limits the use of existing devices.
  • although GPS or other satellite synchronisation timing signals can be used, the device requires a GPS or other satellite positioning receiver to receive the signal, and furthermore such signals could not be used in circumstances where a GPS or satellite positioning signal is not available - such as, for example, indoors, in heavily built-up urban areas, or in woodland or forest regions.
  • in a further approach the recording devices synchronise their recordings against a network time protocol (NTP) reference.
  • an NTP reference requires a network connection, which may not be available in all situations, and typically timing errors can be introduced to the time stamps due to transmission delays.
  • the audio space 1 can have located within it at least one recording or capturing device or apparatus 19 which are arbitrarily positioned within the audio space to record suitable audio scenes.
  • the apparatus 19 shown in Figure 1 are represented as microphones with a polar gain pattern 101 showing the directional audio capture gain associated with each apparatus.
  • the apparatus 19 in Figure 1 are shown such that some of the apparatus are capable of attempting to capture the audio scene or activity 103 within the audio space.
  • the activity 103 can be any event the user of the apparatus wishes to capture. For example the event could be a music event or audio of a "news worthy" event.
  • although the apparatus 19 is shown having a directional microphone gain pattern 101, it would be appreciated that in some embodiments the microphone or microphone array of the recording apparatus 19 has an omnidirectional gain or a different gain profile to that shown in Figure 1.
  • Each recording apparatus 19 can in some embodiments transmit or alternatively store for later consumption the captured audio signals via a transmission channel 107 to an audio scene server 109.
  • the recording apparatus 19 in some embodiments can encode the audio signal to compress the audio signal in a known way in order to reduce the bandwidth required in "uploading" the audio signal to the audio scene server 109.
  • the recording apparatus 19 in some embodiments can be configured to estimate and upload via the transmission channel 107 to the audio scene server 109 an estimation of the location and/or the orientation or direction of the apparatus.
  • the position information can be obtained, for example, using GPS coordinates, cell-ID or a-GPS or any other suitable location estimation methods and the orientation/direction can be obtained, for example using a digital compass, accelerometer, or gyroscope information.
  • the recording apparatus 19 can be configured to capture or record one or more audio signals, for example the apparatus in some embodiments has multiple microphones each configured to capture the audio signal from a different direction. In such embodiments the recording device or apparatus 19 can record and provide more than one signal from the different directions/orientations and further supply position/direction information for each signal.
  • each captured or recorded audio signal can be defined as an audio or sound source.
  • each audio source can be defined as having a position or location which can be an absolute or relative value.
  • the audio source can be defined as having a position relative to a desired listening location or position.
  • the audio source can be defined as having an orientation, for example where the audio source is a beamformed processed combination of multiple microphones in the recording apparatus, or a directional microphone.
  • the orientation may have both a directionality and a range, for example defining the 3dB gain range of a directional microphone.
  • the capturing and encoding of the audio signal and the estimation of the position/direction of the apparatus is shown in Figure 1 by step 1001.
  • the uploading of the audio and position/direction estimate to the audio scene server 109 is shown in Figure 1 by step 1003.
  • the audio scene server 109 furthermore can in some embodiments communicate via a further transmission channel 111 to a listening device 113.
  • the listening device 113, which is represented in Figure 1 by a set of headphones, can prior to or during downloading via the further transmission channel 111 select a listening point, in other words select a position such as indicated in Figure 1 by the selected listening point 105.
  • the listening device 113 can communicate via the further transmission channel 111 to the audio scene server 109 the request.
  • the audio scene server 109 can as discussed above in some embodiments receive from each of the recording apparatus 19 an approximation or estimation of the location and/or direction of the recording apparatus 19.
  • the audio scene server 109 can in some embodiments from the various captured audio signals from recording apparatus 19 produce a composite audio signal representing the desired listening position and the composite audio signal can be passed via the further transmission channel 111 to the listening device 113.
  • the listening device 113 can request a multiple channel audio signal or a mono-channel audio signal. This request can in some embodiments be received by the audio scene server 109 which can generate the requested multiple channel data.
  • the audio scene server 109 in some embodiments can receive each uploaded audio signal and can keep track of the positions and the associated direction/orientation associated with each audio source.
  • the audio scene server 109 can provide a high level coordinate system which corresponds to locations where the uploaded/upstreamed content source is available to the listening device 113. The "high level" coordinates can be provided for example as a map to the listening device 113 for selection of the listening position.
  • the listening device (end user or an application used by the end user) can in such embodiments be responsible for determining or selecting the listening position and sending this information to the audio scene server 109.
  • the audio scene server 109 can in some embodiments receive the selection/determination and transmit the downmixed signal corresponding to the specified location to the listening device.
  • the listening device/end user can be configured to select or determine other aspects of the desired audio signal, for example signal quality, number of channels of audio desired, etc.
  • the audio scene server 109 can provide in some embodiments a selected set of downmixed signals which correspond to listening points neighbouring the desired location/direction and the listening device 113 selects the audio signal desired.
  • Figure 2 shows a schematic block diagram of an exemplary apparatus or electronic device 10, which may be used to record (or operate as a recording device 19) or listen (or operate as a listening device 113) to the audio signals (and similarly to record or view the audio-visual images and data). Furthermore in some embodiments the apparatus or electronic device can function as the audio scene server 109.
  • the electronic device 10 may for example be a mobile terminal or user equipment of a wireless communication system when functioning as the recording device or listening device 113.
  • the apparatus can be an audio player or audio recorder, such as an MP3 player, a media recorder/player (also known as an MP4 player), or any suitable portable device for recording audio or audio/video, such as a camcorder or a memory audio or video recorder.
  • the apparatus 10 can in some embodiments comprise an audio subsystem.
  • the audio subsystem for example can comprise in some embodiments a microphone or array of microphones 11 for audio signal capture.
  • the microphone or array of microphones can be a solid state microphone, in other words capable of capturing audio signals and outputting a suitable digital format signal.
  • the microphone or array of microphones 11 can comprise any suitable microphone or audio capture means, for example a condenser microphone, capacitor microphone, electrostatic microphone, Electret condenser microphone, dynamic microphone, ribbon microphone, carbon microphone, piezoelectric microphone, or microelectrical-mechanical system (MEMS) microphone.
  • the microphone 11 or array of microphones can in some embodiments output the captured audio signal to an analogue-to-digital converter (ADC) 14.
  • the apparatus can further comprise an analogue-to-digital converter (ADC) 14 configured to receive the analogue captured audio signal from the microphones and to output the captured audio signal in a suitable digital form.
  • the analogue-to-digital converter 14 can be any suitable analogue-to- digital conversion or processing means.
  • the apparatus 10 audio subsystem further comprises a digital-to-analogue converter 32 for converting digital audio signals from a processor 21 to a suitable analogue format.
  • the digital-to-analogue converter (DAC) or signal processing means 32 can in some embodiments be any suitable DAC technology.
  • the audio subsystem can comprise in some embodiments a speaker 33.
  • the speaker 33 can in some embodiments receive the output from the digital-to-analogue converter 32 and present the analogue audio signal to the user.
  • the speaker 33 can be representative of a headset, for example a set of headphones, or cordless headphones.
  • although the apparatus 10 is shown having both audio capture and audio presentation components, it would be understood that in some embodiments the apparatus 10 can comprise only one of the audio capture and audio presentation parts of the audio subsystem, such that in some embodiments only the microphone (for audio capture) or only the speaker (for audio presentation) is present.
  • the apparatus 10 comprises a processor 21.
  • the processor 21 is coupled to the audio subsystem and specifically in some examples the analogue-to-digital converter 14 for receiving digital signals representing audio signals from the microphone 11, and the digital-to-analogue converter (DAC) 32 configured to output processed digital audio signals.
  • the processor 21 can be configured to execute various program codes.
  • the implemented program codes can comprise for example audio classification and audio scene mapping code routines.
  • the program codes can be configured to perform audio scene event detection and device selection indicator generation, wherein the audio scene server 109 can be configured to determine events from multiple received audio recordings to assist the user in selecting an audio recording which is meaningful and does not require the listener to carry out undue searching of all of the audio recordings.
  • the apparatus further comprises a memory 22.
  • the processor is coupled to memory 22.
  • the memory can be any suitable storage means.
  • the memory 22 comprises a program code section 23 for storing program codes implementable upon the processor 21.
  • the memory 22 can further comprise a stored data section 24 for storing data, for example data that has been encoded in accordance with the application or data to be encoded via the application embodiments as described later.
  • the implemented program code stored within the program code section 23, and the data stored within the stored data section 24 can be retrieved by the processor 21 whenever needed via the memory-processor coupling.
  • the apparatus 10 can comprise a user interface 15.
  • the user interface 15 can be coupled in some embodiments to the processor 21.
  • the processor can control the operation of the user interface and receive inputs from the user interface 15.
  • the user interface 15 can enable a user to input commands to the electronic device or apparatus 10, for example via a keypad, and/or to obtain information from the apparatus 10, for example via a display which is part of the user interface 15.
  • the user interface 15 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the apparatus 10 and further displaying information to the user of the apparatus 10.
  • the apparatus further comprises a transceiver 13, the transceiver in such embodiments can be coupled to the processor and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network.
  • the transceiver 13 or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
  • the coupling can, as shown in Figure 1, be the transmission channel 107 (where the apparatus is functioning as the recording device 19 or audio scene server 109) or the further transmission channel 111 (where the device is functioning as the listening device 113 or audio scene server 109).
  • the transceiver 13 can communicate with further devices by any suitable known communications protocol, for example in some embodiments the transceiver 13 or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
  • the apparatus comprises a position sensor 16 configured to estimate the position of the apparatus 10.
  • the position sensor 16 can in some embodiments be a satellite positioning sensor such as a GPS (Global Positioning System), GLONASS or Galileo receiver.
  • the positioning sensor can be a cellular ID system or an assisted GPS system.
  • the apparatus 10 further comprises a direction or orientation sensor.
  • the orientation/direction sensor can in some embodiments be an electronic compass, accelerometer, a gyroscope or be determined by the motion of the apparatus using the positioning estimate.
  • the structure of the electronic device 10 could be supplemented and varied in many ways.
  • the above apparatus 10 in some embodiments can be operated as an audio scene server 109.
  • the audio scene server 109 can comprise a processor, memory and transceiver combination.
  • Figure 3 shows an example alignment apparatus for content time stamping for crowd-sourced media content.
  • the apparatus is configured to create a combined signal from the received time stamped content, to align the non-time stamped content signal and the combined signal, and then to further align both the time stamped and the non-time stamped content signals using the information from the operation of aligning the non-time stamped content signal and the combined signal.
  • the alignment apparatus comprises in some embodiments a non-time stamped content input 201 or suitable input means.
  • the non-time stamped content input 201 is configured to receive audio signals in any suitable format and pass these to the time stamp determiner 205.
  • the non-time stamped content input 201 can be configured to pre-process the non-time stamped content to deliver the audio data in a format suitable for processing by the time stamp determiner 205.
  • the alignment apparatus comprises a time stamped content input 203 or suitable second input means.
  • the time stamped content input 203 can be configured to receive the time stamped audio data in any suitable format and pass the time stamped content audio data to the time stamp determiner 205 for further processing.
  • the time stamped content input 203 and non-time stamp content input 201 can be implemented by a transceiver element of an audio scene server receiving audio signals from various capture devices recording audio signals within the audio scene and/or from the memory or store associated with the audio scene server.
  • in some embodiments the alignment apparatus represents a combination of an audio content recorder or capturer, audio scene server and listener apparatus, and thus receives the time stamped content via a communications coupling such as a wireless communications link.
  • the alignment apparatus comprises a time stamp determiner 205 or suitable alignment means.
  • the time stamp determiner 205 is configured to receive the audio data from the non-time stamp content input 201 and also the audio data from the time stamp content input 203 and determine (or align) a time stamp for the non-time stamp content input based on the time stamped content audio signals.
  • the time stamp determiner 205 can output the audio data, both the received time stamped content input and the aligned non-time stamp content to a content renderer 207.
  • the alignment apparatus comprises a content renderer 207 or rendering means.
  • the content renderer 207 is configured in some embodiments to receive the audio signals and render these into an audio signal suitable for consumption.
  • the content renderer 207 can be configured to generate a multi-channel audio signal from the input content audio signals suitable for passing to a listener apparatus.
  • the content renderer 207 can be implemented in the audio scene server, in the content recorder or capturer, or in the content listener apparatus.
  • the implementation of content rendering is generally known and will not be described any further.
  • the apparatus further comprises a content processor 209 configured to receive the rendered audio signal data from the content renderer 207 and process it in order that it can be displayed or listened to by the end user.
  • the content processor 209 can in some embodiments control the content renderer 207 to produce a suitable rendered audio signal at a determined position, or configuration.
  • the content processor 209 can in some embodiments be implemented within a listener apparatus.
  • The operation of processing the content, or consuming the content, is shown in Figure 4 by step 309.
  • with respect to Figure 5 the time stamp determiner 205 is shown in further detail. Furthermore with respect to Figure 6 the operation of the time stamp determiner as shown in Figure 5 is described in further detail.
  • the time stamp determiner 205 comprises a signal combiner 401 or combiner means.
  • the signal combiner 401 is configured to receive at least two of the time stamped audio content signals and combine these signals to generate a single combined time stamp signal.
  • the time stamp audio signals 651 can be combined in the signal combiner 401 to generate a combined signal.
  • the combined signal can be created by a suitable averaging and appending means according to the following mathematical expression,
  • x_CS(n) = (1 / nClip_tIdx) * sum_{m=0}^{nClip_tIdx - 1} x_{clipIdx_tIdx(m)}(n), for t_start_tIdx <= n < t_end_tIdx
  • where x is the audio signal (or representative signal derived from the audio signal) of the timestamped crowd-sourced content,
  • nClip_tIdx and clipIdx_tIdx describe the number of content items and the content item indices for the tIdx-th time segment, respectively, and
  • t_start_tIdx and t_end_tIdx describe the start and end times of the segment for the tIdx-th time segment, respectively.
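  • As a minimal illustration of this averaging-and-appending rule, the following Python sketch averages timestamped clips where they overlap and passes single-clip regions through unchanged; the function name, clip layout and sample values are assumptions for illustration, not taken from the patent:

    import numpy as np

    # Sketch of the combining rule above: where timestamped clips overlap in
    # time their samples are averaged; where only one clip covers a region
    # its samples are appended unchanged. Clip start positions are given in
    # samples on the common timeline and are assumed to fit in timeline_len.
    def combine_signals(clips, timeline_len):
        """clips: list of (start_sample, samples) pairs on a common time base."""
        combined = np.zeros(timeline_len)
        count = np.zeros(timeline_len)           # clips covering each sample
        for start, samples in clips:
            end = start + len(samples)
            combined[start:end] += samples
            count[start:end] += 1
        covered = count > 0
        combined[covered] /= count[covered]      # average concurrent clips
        return combined

    # Example: two overlapping clips and one disjoint clip
    fs = 8000
    rng = np.random.default_rng(0)
    clips = [(0, rng.standard_normal(2 * fs)),
             (fs, rng.standard_normal(2 * fs)),  # overlaps the first clip
             (4 * fs, rng.standard_normal(fs))]  # appended, no averaging
    cs = combine_signals(clips, 5 * fs)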
  • the combined signal 652 can then be output by the signal combiner 401 to a combined signal aligner determiner 403.
  • in some embodiments the time stamp determiner comprises a combined signal aligner determiner 403.
  • the combined signal aligner determiner 403 is configured to receive the combined signal from the signal combiner 401 containing a time stamp and also the non-time stamped content audio signals and attempt to align these signals.
  • the combined signal 606 CX and a non-time stamped content signal 607 D are shown.
  • the combined signal aligner 403 can be configured to determine the time offset between the signals, in other words whether the non-time stamped content audio signal is delayed with respect to these combined signals with time stamped content values or vice versa.
  • the determination of the time offset can be carried out via a time offset determiner or suitable signal aligner means or function.
  • the output tOffset therefore in such embodiments contains corresponding time offset values for each of the input signals.
  • the determination of time offset can be carried out by any suitable function, such as a correlation analysis as indicated by Carter, Nuttall and Cable in "The Smoothed Coherence Transform", Proceedings of the IEEE, Vol. 61, No. 10, October 1973.
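  • As a rough sketch of such an offset search, the following Python uses plain cross-correlation rather than the smoothed coherence transform cited above; all names are illustrative assumptions:

    import numpy as np

    def time_offset(combined, candidate, max_lag):
        """Return the lag (in samples) of `candidate` relative to `combined`
        that maximises their normalised cross-correlation, searched over
        lags in [-max_lag, max_lag]."""
        best_lag, best_score = 0, -np.inf
        for lag in range(-max_lag, max_lag + 1):
            if lag >= 0:
                a, b = combined[lag:], candidate   # candidate starts later
            else:
                a, b = combined, candidate[-lag:]  # candidate starts earlier
            n = min(len(a), len(b))
            if n == 0:
                continue
            score = float(np.dot(a[:n], b[:n])) / n
            if score > best_score:
                best_lag, best_score = lag, score
        return best_lag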
  • the output of the combined signal aligner determiner 403 can then be passed in some embodiments to the full signal aligner 405.
  • the time stamp determiner 205 comprises a full signal aligner 405 or suitable full signal alignment means configured to align or time stamp the non-time stamped content signal with respect to the time stamped content, in other words to map the output time offsets to actual input content items. For example using the signals A, B, C and D from Figure 7, the following offsets can be determined,
  • D_offset = tOffset(1)
  • tOffset2 = {A_offset, B_offset, C_offset, D_offset}
  • the time offset for all of these is equal to the time offset of CX.
  • time stamps of previously time stamped and non-time stamped content items can be updated to use a common time base. This can be performed, for example in some embodiments, by having the following information for each content item, Table 1:
  • grpID - group identification value describing the continuous timeline the content item belongs to
  • startTS - start timestamp of the content item within the continuous timeline
  • stopTS - stop timestamp of the content item within the continuous timeline
  • the grpID is used to identify crowd-sourced content items that are continuous in time.
  • content audio signals A, B, and C could in such embodiments be assigned the same grpID.
  • the start and end timestamps identify the positions of the content within the continuous timeline either in absolute terms or in relative terms (with respect to content A in the example Figure 7 as that appears first in the continuous timeline).
  • the following information can be generated from the input content items, the creation of the combined signal(s) and the output of the time offset function, Table 2:
  • nGroups - number of groups that share the same continuous timeline
  • trackIdx - number of combined signals
  • trackIdxLeft - indices of the non-timestamped content items
  • grpCombID - group index for the signals in Equation (2)
  • grpCombIdxNum - number of overlapping clips per content signal
  • trackIdxLeft describes the indices of the non-timestamped content items in the order they appear in the input content. For example, trackIdxLeft for the example signal set in Figure 7 would be trackIdxLeft = {D}, as content D is the first content item that appears after the combined signals.
  • grpCombID indicates which of the input signals in the time offset function share the same continuous timeline. In some embodiments it can be assumed that the time_offset() function can provide this information as an output value. Where for example signals do not share the same continuous timeline, the time offset function can be configured to assign a time offset in such a way that the different continuous timelines do not overlap.
  • grpCombIdxNum indicates the number of clips (including the content signal itself) that the particular content signal is overlapping with.
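  • The bookkeeping above can be pictured as a small record; this layout is an illustrative assumption, not the patent's data structure:

    from dataclasses import dataclass, field

    @dataclass
    class GroupingInfo:
        # Illustrative container for the Table 2 values described above;
        # the field names follow the text but the layout is an assumption.
        nGroups: int = 0            # groups sharing a continuous timeline
        trackIdx: int = 0           # number of combined signals
        trackIdxLeft: list = field(default_factory=list)   # non-timestamped item indices
        grpCombID: list = field(default_factory=list)      # timeline group per input signal
        grpCombIdxNum: list = field(default_factory=list)  # overlapping clip counts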
  • the alignment or mapping can be performed using such generated values as described herein according to the following pseudo code,
  • M is the number of input content items which includes the time stamped and non-time stamped content audio signals that have been specified for processing.
  • lines 6 to 51 are repeated for each of the non-time stamped content items.
  • the group identification value grpID of the non-time stamped input content is checked to determine if it is unknown, as shown in line 8 of the above pseudo code. Furthermore, after determining the time offset, the pseudo code checks whether the non-time stamped audio signal overlaps with some of the other input signals, as shown on line 10.
  • lines 12 to 15 determine whether the signal is overlapping with any of the combined signals. Where no overlapping group is found, the signal parameters, that is the input content item index (line 20) and the time offset (line 21), are appended to the output grouping data as shown in lines 19 to 22 of the pseudo code.
  • as shown in lines 24 onwards, the full signal aligner checks the input content item to determine whether the audio signal item belongs to the combined signal (line 29). Where this is the situation the values are appended to the output grouping data as shown in lines 38 to 40. Furthermore the input content item that is overlapping with the combined signal is also appended, as shown in lines 45 to 48. Lines 31 to 34 of the pseudo code check that the signal parameters are appended only once to the output grouping data.
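  • A hedged Python sketch of this mapping step, paraphrasing the prose above (the patent's own pseudo code is not reproduced in this text, so the structure and names here are assumptions): each non-timestamped item either joins the group of a combined signal whose span it overlaps, or starts a new timeline group.

    # items: indices of non-timestamped content; offsets: per-item lags from
    # the time offset function; combined_groups: {group_id: (start, end)}
    # spans of each combined signal on its timeline. All names illustrative.
    def map_offsets(items, offsets, combined_groups):
        groups = {gid: [] for gid in combined_groups}
        next_gid = max(combined_groups, default=-1) + 1
        for idx, off in zip(items, offsets):
            overlap = next((gid for gid, (s, e) in combined_groups.items()
                            if s <= off < e), None)
            if overlap is None:                  # no overlapping combined signal:
                groups[next_gid] = [(idx, off)]  # start a new timeline group
                next_gid += 1
            elif (idx, off) not in groups[overlap]:
                groups[overlap].append((idx, off))  # append only once
        return groups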
  • the full signal aligner can then in some embodiments finalise the mapping by updating the timings and the group identification values of those input content items that have been identified as sharing continuous timelines. In some embodiments this finalisation operation can be described with regard to the following pseudo code,
  • lines 2 to 41 of the above code are repeated for each group that shares a continuous timeline. Where the group has more than one content signal, as determined by line 4, the corresponding input content information is updated.
  • lines 6 to 11 determine the reference group identification value for the content signals.
  • lines 13 to 16 determine the start time for the corresponding group.
  • lines 18 to 26 update the start time of each content item within the group to match the relative time differences as defined by the offset time value described herein.
  • Lines 28 to 38 of the code update the start and stop times and the group identification value information for each content item within the identified group.
  • the variable tServerStatus in line 35 is used to indicate that the timing information for the content item is obtained through the combined signal processing mode.
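  • A minimal sketch of this finalisation pass, under the assumption that each item carries startTS, stopTS and an offset on the group timeline (the field names follow the text, but the update rule is a plausible reading rather than the patent's pseudo code):

    # Shift every item in a group so start times become offsets relative to
    # the earliest item, preserve durations, and stamp the shared grpID.
    def finalise(group_items, grp_id):
        base = min(item['offset'] for item in group_items)
        for item in group_items:
            duration = item['stopTS'] - item['startTS']
            item['startTS'] = item['offset'] - base   # relative to group start
            item['stopTS'] = item['startTS'] + duration
            item['grpID'] = grp_id
            item['tServerStatus'] = True  # timing obtained via combined-signal mode
        return group_items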
  • the time stamped and non-time stamped content items can be aligned based on the guidance information as indicated herein.
  • the guidance information can in some embodiments include the variable tServerStatus and the corresponding startTS information for each input content item where available.
  • the alignment can be carried out according to the following expression,
  • [startTS, stopTS, grpID] = time_align(startTS, stopTS, grpID, <input_content_items>)
  • time_align() defines the function that determines the alignment for the specified input.
  • the function can take as an input the start and end times and the group identification value along with the actual media content.
  • the output of such a function would be the updated timings and group identification values for each of the input content.
  • the operation of determining the time offset can be any suitable approach such as shown previously with regards to correlation analysis.
  • the start and stop times of the signal pair are updated such that the time offset window (toWindow) for the pair can be limited to a small value.
  • the value can be set to 1 second, indicating that the signals in the pair are aligned within + or - 1 second with respect to each other.
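  • For illustration, restricting the lag search to such a window might look as follows in Python (a plain windowed cross-correlation; the signals, sample rate and window value are made up for the example):

    import numpy as np

    fs = 4000
    toWindow = 1.0                       # pair assumed pre-aligned to +/- 1 s
    max_lag = int(toWindow * fs)         # only these lags are searched

    rng = np.random.default_rng(0)
    combined = rng.standard_normal(5 * fs)
    candidate = combined[int(0.4 * fs):]     # true offset 0.4 s, inside window

    # full cross-correlation, then keep only the lags within the window
    full = np.correlate(combined, candidate, mode="full")
    lags = np.arange(-(len(candidate) - 1), len(combined))
    inside = (lags >= -max_lag) & (lags <= max_lag)
    best = lags[inside][np.argmax(full[inside])]
    print(best / fs, "seconds")              # ~0.4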
  • this provides a major operational efficiency, as the final time offset needs to be searched only over a very limited time period.
  • the combined signal represents the entire event scene, and it is therefore highly unlikely that the time offsets returned by the expressions herein are not optimal, even though the geographical area of the event scene may cover a large area where different content signals can be located quite far from each other.
  • the steps of processing as described herein are designed to overcome the challenges the size of the event scene may bring to determining the time offsets between various content in a robust and reliable manner.
  • the combined signal may be computed more than once. This for example can be carried out where the number of content items to be covered is large.
  • a suitable combination, for example a random or fixed combination, of the already time stamped content can be used to create the combined signal and then, following the steps as described herein, the final time stamps in the common time base can be determined.
  • the combined signal can in some embodiments be recreated from time to time with differences in composition, including some new content items compared to the previously created combined signal or replacing some content items with new items in the combined signal, and then repeating the processing steps for alignment.
  • the output values from each of these combined signal iterations can then be saved and the final time differences for the content items can be computed from the saved values.
  • the data analysis can be performed to determine the time offset value for each content item in a converging form. This converging form can be extracted using mean and standard variance calculations, where the final output is a mean value which excludes any outlier values identified by the standard variance (see the sketch following this list).
  • embodiments may also be applied to audio-video signals, where the audio signal components of the recorded data are processed to determine the base signal and the time alignment factors for the remaining signals, and the video signal components may be synchronised using the above embodiments of the invention.
  • the video parts may be synchronised using the audio synchronisation information.
  • user equipment is intended to cover any suitable type of wireless user equipment, such as mobile telephones, portable data processing devices or portable web browsers.
  • PLMN public land mobile network
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
  • While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • the embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
  • any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
  • the software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVD and the data variants thereof, and CD.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate. Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules.
  • the resultant design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
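As a concrete reading of the converging mean and standard variance calculation mentioned in the list above, the following Python sketch iteratively discards time offset estimates lying more than one standard deviation from the mean until the estimate stabilises. The one-sigma threshold, the iteration cap and the sample values are illustrative assumptions, not taken from the patent.

import statistics

def converged_offset(offsets, max_rounds=10):
    # Iteratively average the saved per-iteration offsets for one content
    # item, discarding values more than one standard deviation from the mean.
    vals = list(offsets)
    for _ in range(max_rounds):
        mu = statistics.mean(vals)
        sd = statistics.pstdev(vals)
        kept = [v for v in vals if abs(v - mu) <= sd] or vals
        if len(kept) == len(vals):
            return mu  # no outliers removed: the estimate has converged
        vals = kept
    return statistics.mean(vals)

# Offsets for one item from several combined signal iterations (illustrative)
print(converged_offset([1.49, 1.51, 1.50, 3.2, 1.48]))  # approximately 1.5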


Abstract

An apparatus comprising: a signal combiner configured to generate a combined audio signal from at least two first audio signals; a combined signal comparator configured to compare the combined audio signal to at least one second audio signal; a signal aligner configured to determine an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and a full signal aligner configured to associate the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal.

Description

AN AUDIO SCENE ALIGNMENT APPARATUS
Field

The present application relates to apparatus for the processing of audio and additionally audio-video signals to enable alignment of audio signals. The invention further relates to, but is not limited to, apparatus for processing audio and additionally audio-video signals from mobile devices.

Background
Viewing recorded or streamed audio-video or audio content is well known. Commercial broadcasters covering an event often have more than one recording device (video-camera/microphone) and a programme director will select a 'mix' where an output from a recording device or combination of recording devices is selected for transmission.
Multiple 'feeds' may be found in sharing services for video and audio signals (such as those employed by YouTube). Such systems are known and widely used to share user generated content recorded and uploaded or up-streamed to a server and then downloaded or down-streamed to a viewing/listening user. Such systems rely on users recording and uploading or up-streaming a recording of an event using the recording facilities at hand to the user. This may typically be in the form of the camera and microphone arrangement of a mobile device such as a mobile phone.
Often the event is attended and recorded from more than one position by different recording users at the same time. The viewing/listening end user may then select one of the up-streamed or uploaded data to view or listen.
Summary

Aspects of this application thus provide an audio signal alignment process whereby multiple devices can record audio signals and these audio signals can be aligned to permit audio source selection. There is provided according to a first aspect an apparatus comprising at least one processor and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured to with the at least one processor cause the apparatus to at least perform: generating a combined audio signal from at least two first audio signals; comparing the combined audio signal to at least one second audio signal; determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and associating the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal.
The apparatus may be further caused to perform: receiving the at least two first audio signals from an audio scene; receiving the at least one second audio signal from the audio scene. The apparatus may be further caused to perform aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
The apparatus may be further caused to render the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
Generating a combined audio signal from at least two first audio signals may cause the apparatus to perform: generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and appending to the combined audio signal an available at least two first audio signal part otherwise. Comparing the combined audio signal to at least one second audio signal may cause the apparatus to perform a cross-correlation between the combined audio signal and the at least one second audio signal. Determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal may cause the apparatus to perform determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal. Associating the alignment value to the at least two first audio signals may cause the apparatus to perform: assigning the at least two first audio signals to a first group of audio signals; assigning the at least one second audio signal to a second group of audio signals; assigning a null alignment value to the first group of audio signals; and assigning the alignment value to the second group of audio signals.
The apparatus may be further caused to perform: generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value; comparing the further combined audio signal to at least one further second audio signal; determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal; and associating the alignment value to the at least one further second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
The first audio signals may comprise timestamp information and the second audio signals lack the timestamp information.
According to a second aspect there is provided a method comprising: generating a combined audio signal from at least two first audio signals; comparing the combined audio signal to at least one second audio signal; determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and associating the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal.
The method may further comprise: receiving the at least two first audio signals from an audio scene; receiving the at least one second audio signal from the audio scene.
The method may further comprise aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
The method may further comprise rendering the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
Generating a combined audio signal from at least two first audio signals may comprise: generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and appending to the combined audio signal an available at least two first audio signal part otherwise.
Comparing the combined audio signal to at least one second audio signal may comprise a cross-correlation between the combined audio signal and the at least one second audio signal.
Determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal may comprise determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
Associating the alignment value to the at least two first audio signals may comprise: assigning the at least two first audio signals to a first group of audio signals; assigning the at least one second audio signal to a second group of audio signals; assigning a null alignment value to the first group of audio signals; and assigning the alignment value to the second group of audio signals.
The method may further comprise: generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value; comparing the further combined audio signal to at least one further second audio signal; determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal; and associating the alignment value to the at least one further second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
The first audio signals may comprise timestamp information and the second audio signals lack the timestamp information.
According to a third aspect there is provided an apparatus comprising: a signal combiner configured to generate a combined audio signal from at least two first audio signals; a combined signal comparator configured to compare the combined audio signal to at least one second audio signal; a signal aligner configured to determine an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and a full signal aligner configured to associate the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal.
The apparatus may further comprise: a first signal receiver configured to receive the at least two first audio signals from an audio scene; and a second signal receiver configured to receive the at least one second audio signal from the audio scene. The apparatus may further comprise a delay configured to align at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value. The apparatus may further comprise a signal renderer configured to render the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
The signal combiner may comprise: an averager configured to generate a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and an appender configured to append to the combined audio signal an available at least two first audio signal part otherwise. The signal comparator may comprise a cross-correlator configured to generate a cross-correlation product between the combined audio signal and the at least one second audio signal.
The alignment determiner may comprise an offset determiner configured to determine a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
The full signal aligner may comprise: a first assigner configured to assign the at least two first audio signals to a first group of audio signals; a second assigner configured to assign the at least one second audio signal to a second group of audio signals; a null assigner configured to assign a null alignment value to the first group of audio signals; and a delay assigner configured to assign the alignment value to the second group of audio signals. The signal combiner may be configured to further combine an audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value. The comparator may be configured to further compare the further combined audio signal to at least one further second audio signal.
The aligner may be configured to further determine a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal.
The full signal aligner may be further configured to associate the alignment value to the at least one further second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
The first audio signals may comprise timestamp information and the second audio signals lack the timestamp information.
According to a fourth aspect there is provided an apparatus comprising: means for generating a combined audio signal from at least two first audio signals; means for comparing the combined audio signal to at least one second audio signal; means for determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and means for associating the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal. The apparatus may further comprise: means for receiving the at least two first audio signals from an audio scene; means for receiving the at least one second audio signal from the audio scene.
The apparatus may further comprise means for aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value. The apparatus may further comprise means for rendering the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting. The means for generating a combined audio signal from at least two first audio signals may comprise: means for generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and means for appending to the combined audio signal an available at least two first audio signal part otherwise.
The means for comparing the combined audio signal to at least one second audio signal may comprise means for generating a cross-correlation product between the combined audio signal and the at least one second audio signal. The means for determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal may comprise means for determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
The means for associating the alignment value to the at least two first audio signals may comprise: means for assigning the at least two first audio signals to a first group of audio signals; means for assigning the at least one second audio signal to a second group of audio signals; means for assigning a null alignment value to the first group of audio signals; and means for assigning the alignment value to the second group of audio signals.
The means for combining may further comprise means for generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value.
The means for comparing may further comprise means for comparing the further combined audio signal to at least one further second audio signal. The means for determining an alignment value may comprise means for determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal. The means for associating the alignment value may comprise means for associating the alignment value to the at least one further second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal. The first audio signals may comprise timestamp information and the second audio signals lack the timestamp information.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein. Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures

For better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically a multi-user free-viewpoint service sharing system which may encompass embodiments of the application;
Figure 2 shows schematically an apparatus suitable for being employed in embodiments of the application;
Figure 3 shows schematically an audio signal system according to some embodiments;

Figure 4 shows a flow diagram of the operation of the audio signal system as shown in Figure 3;
Figure 5 shows schematically the time stamp determiner shown in Figure 3 in further detail according to some embodiments;
Figure 6 shows the operation of the time stamp determiner shown in Figure 5 according to some embodiments; and
Figure 7 shows an example set of time stamped and non-time stamped audio signals to be aligned according to embodiments.

Embodiments of the Application
The following describes in further detail suitable apparatus and possible mechanisms for the provision of effective audio signal alignment. In the following examples, audio signals and audio capture signals are described. However it would be appreciated that in some embodiments the audio signal/audio capture is a part of an audio-video system.
The concept of this application is related to assisting in the production of immersive person-to-person communication and can include video. It would be understood that the space within which the devices record the audio signal can be arbitrarily positioned within an event space. The captured signals as described herein are transmitted or alternatively stored for later consumption, where the end user can select the listening point based on their preference from the reconstructed audio space. The rendering part can then provide one or more downmixed signals from the multiple recordings that correspond to the selected listening point. It would be understood that each recording device can record the event scene and upload or upstream the recorded content. The upload or upstream process can implicitly include positioning information about where the content is being recorded.
However in such systems some audio signal content can be uploaded in non-real time. In other words the media content in the form of a captured audio signal can be uploaded a few minutes, hours, days or weeks after the event. Thus the amount of content that represents each particular event and is held in the server can fluctuate as a function of time. Furthermore uploaded audio signal content typically does not employ common time keeping or time stamping, and therefore newly uploaded or streamed content needs to be aligned to use a common time stamping before any downmixed signal is provided for consumption. As new content can be uploaded to the server at any time, a situation could arise where the existing content already uses a common time stamping but new content lacks this time stamp and should be transformed to this time stamping mode.
The concept of this application therefore is to provide an enabler for the case where some of the audio signal content for the event is already converted to a common time stamping and the conversion is to be applied to any new uploaded audio signal content.
Although a common time base can be achieved with a dedicated synchronisation signal, for example where the capture devices are equipped to receive a specific beacon signal or timing information obtained through a network or other received data such as positioning satellite timing data (such as from a GPS satellite), this can be problematic. For example the use of a beacon signal typically requires special hardware and/or software installations in the recording or capture apparatus, which limits the applicability of multiuser sharing services as recording devices become too expensive for mass use, or limits the use of existing devices. Although GPS or other satellite synchronisation timing signals can be used, the device requires a GPS or other satellite positioning receiver to receive the signal, and furthermore could not be used in circumstances where a GPS or satellite positioning signal is not available - such as, for example, indoors, in heavily built-up urban areas, or in woodland or forest regions. In some synchronisation systems a network time protocol (NTP) is used to time stamp the recorded content from multiple users. In these examples the recording devices synchronise the recordings against an NTP reference. However an NTP reference requires a network connection, which may not be available in all situations, and typically timing errors can be introduced into the time stamps due to transmission delays.
Furthermore although signal-to-signal synchronisation using correlation could be employed, this is impractical for applications where the number of recordings increases, as the processing requirement increases exponentially rather than linearly. Furthermore any time skew between multiple content recordings has to be limited to an order of tens of seconds, otherwise the computational complexity of the correlation approach becomes a problem.
With respect to Figure 1 an overview of a suitable system within which embodiments of the application can be located is shown. The audio space 1 can have located within it at least one recording or capturing device or apparatus 19 arbitrarily positioned within the audio space to record suitable audio scenes. The apparatus 19 shown in Figure 1 are represented as microphones with a polar gain pattern 101 showing the directional audio capture gain associated with each apparatus. The apparatus 19 in Figure 1 are shown such that some of the apparatus are capable of attempting to capture the audio scene or activity 103 within the audio space. The activity 103 can be any event the user of the apparatus wishes to capture. For example the event could be a music event or audio of a "news worthy" event. Although the apparatus 19 are shown having a directional microphone gain pattern 101, it would be appreciated that in some embodiments the microphone or microphone array of the recording apparatus 19 has an omnidirectional gain or a different gain profile to that shown in Figure 1.
Each recording apparatus 19 can in some embodiments transmit or alternatively store for later consumption the captured audio signals via a transmission channel 107 to an audio scene server 109. The recording apparatus 19 in some embodiments can encode the audio signal to compress the audio signal in a known way in order to reduce the bandwidth required in "uploading" the audio signal to the audio scene server 109. The recording apparatus 19 in some embodiments can be configured to estimate and upload via the transmission channel 107 to the audio scene server 109 an estimation of the location and/or the orientation or direction of the apparatus. The position information can be obtained, for example, using GPS coordinates, cell-ID or a-GPS or any other suitable location estimation methods and the orientation/direction can be obtained, for example using a digital compass, accelerometer, or gyroscope information.
In some embodiments the recording apparatus 19 can be configured to capture or record one or more audio signals, for example the apparatus in some embodiments has multiple microphones each configured to capture the audio signal from a different direction. In such embodiments the recording device or apparatus 19 can record and provide more than one signal from the different directions/orientations and further supply position/direction information for each signal. With respect to the application described herein an audio or sound source can be defined as each of the captured or audio recorded signals. In some embodiments each audio source can be defined as having a position or location which can be an absolute or relative value. For example in some embodiments the audio source can be defined as having a position relative to a desired listening location or position. Furthermore in some embodiments the audio source can be defined as having an orientation, for example where the audio source is a beamformed processed combination of multiple microphones in the recording apparatus, or a directional microphone. In some embodiments the orientation may have both a directionality and a range, for example defining the 3dB gain range of a directional microphone.
The capturing and encoding of the audio signal and the estimation of the position/direction of the apparatus is shown in Figure 1 by step 1001. The uploading of the audio and position/direction estimate to the audio scene server 109 is shown in Figure 1 by step 1003. The audio scene server 109 furthermore can in some embodiments communicate via a further transmission channel 111 to a listening device 113.
In some embodiments the listening device 113, which is represented in Figure 1 by a set of headphones, can prior to or during downloading via the further transmission channel 111 select a listening point, in other words select a position such as indicated in Figure 1 by the selected listening point 105. In such embodiments the listening device 113 can communicate via the further transmission channel 111 to the audio scene server 109 the request.
The selection of a listening position by the listening device 113 is shown in Figure 1 by step 1005.
The audio scene server 109 can as discussed above in some embodiments receive from each of the recording apparatus 19 an approximation or estimation of the location and/or direction of the recording apparatus 19. The audio scene server 109 can in some embodiments from the various captured audio signals from recording apparatus 19 produce a composite audio signal representing the desired listening position and the composite audio signal can be passed via the further transmission channel 111 to the listening device 113.
The generation or supply of a suitable audio signal based on the selected listening position indicator is shown in Figure 1 by step 1007. In some embodiments the listening device 113 can request a multiple channel audio signal or a mono-channel audio signal. This request can in some embodiments be received by the audio scene server 109 which can generate the requested multiple channel data. The audio scene server 109 in some embodiments can receive each uploaded audio signal and can keep track of the positions and the associated direction/orientation associated with each audio source. In some embodiments the audio scene server 109 can provide a high level coordinate system which corresponds to locations where the uploaded/upstreamed content source is available to the listening device 113. The "high level" coordinates can be provided for example as a map to the listening device 113 for selection of the listening position. The listening device (end user or an application used by the end user) can in such embodiments be responsible for determining or selecting the listening position and sending this information to the audio scene server 109. The audio scene server 109 can in some embodiments receive the selection/determination and transmit the downmixed signal corresponding to the specified location to the listening device. In some embodiments the listening device/end user can be configured to select or determine other aspects of the desired audio signal, for example signal quality, number of channels of audio desired, etc. In some embodiments the audio scene server 109 can provide a selected set of downmixed signals which correspond to listening points neighbouring the desired location/direction and the listening device 113 selects the audio signal desired.
In this regard reference is first made to Figure 2 which shows a schematic block diagram of an exemplary apparatus or electronic device 10, which may be used to record (or operate as a recording device 19) or listen (or operate as a listening device 113) to the audio signals (and similarly to record or view the audio-visual images and data). Furthermore in some embodiments the apparatus or electronic device can function as the audio scene server 109.
The electronic device 10 may for example be a mobile terminal or user equipment of a wireless communication system when functioning as the recording device or listening device 113. In some embodiments the apparatus can be an audio player or audio recorder, such as an MP3 player, a media recorder/player (also known as an MP4 player), or any suitable portable device suitable for recording audio or audio/video camcorder/memory audio or video recorder.
The apparatus 10 can in some embodiments comprise an audio subsystem. The audio subsystem for example can comprise in some embodiments a microphone or array of microphones 11 for audio signal capture. In some embodiments the microphone or array of microphones can be a solid state microphone, in other words capable of capturing audio signals and outputting a suitable digital format signal. In some other embodiments the microphone or array of microphones 11 can comprise any suitable microphone or audio capture means, for example a condenser microphone, capacitor microphone, electrostatic microphone, Electret condenser microphone, dynamic microphone, ribbon microphone, carbon microphone, piezoelectric microphone, or microelectrical-mechanical system (MEMS) microphone. The microphone 11 or array of microphones can in some embodiments output the audio captured signal to an analogue-to-digital converter (ADC) 14.
In some embodiments the apparatus can further comprise an analogue-to-digital converter (ADC) 14 configured to receive the analogue captured audio signal from the microphones and outputting the audio captured signal in a suitable digital form. The analogue-to-digital converter 14 can be any suitable analogue-to-digital conversion or processing means.
In some embodiments the apparatus 10 audio subsystem further comprises a digital-to-analogue converter 32 for converting digital audio signals from a processor 21 to a suitable analogue format. The digital-to-analogue converter (DAC) or signal processing means 32 can in some embodiments be any suitable DAC technology.
Furthermore the audio subsystem can comprise in some embodiments a speaker 33. The speaker 33 can in some embodiments receive the output from the digital-to-analogue converter 32 and present the analogue audio signal to the user. In some embodiments the speaker 33 can be representative of a headset, for example a set of headphones, or cordless headphones. Although the apparatus 10 is shown having both audio capture and audio presentation components, it would be understood that in some embodiments the apparatus 10 can comprise one or the other of the audio capture and audio presentation parts of the audio subsystem such that in some embodiments of the apparatus the microphone (for audio capture) or the speaker (for audio presentation) are present.
In some embodiments the apparatus 10 comprises a processor 21. The processor 21 is coupled to the audio subsystem and specifically in some examples the analogue-to-digital converter 14 for receiving digital signals representing audio signals from the microphone 11, and the digital-to-analogue converter (DAC) 12 configured to output processed digital audio signals. The processor 21 can be configured to execute various program codes. The implemented program codes can comprise for example audio classification and audio scene mapping code routines. In some embodiments the program codes can be configured to perform audio scene event detection and device selection indicator generation, wherein the audio scene server 109 can be configured to determine events from multiple received audio recordings to assist the user in selecting an audio recording which is meaningful and does not require the listener to carry out undue searching of all of the audio recordings.
In some embodiments the apparatus further comprises a memory 22. In some embodiments the processor is coupled to memory 22. The memory can be any suitable storage means. In some embodiments the memory 22 comprises a program code section 23 for storing program codes implementable upon the processor 21. Furthermore in some embodiments the memory 22 can further comprise a stored data section 24 for storing data, for example data that has been encoded in accordance with the application or data to be encoded via the application embodiments as described later. The implemented program code stored within the program code section 23, and the data stored within the stored data section 24 can be retrieved by the processor 21 whenever needed via the memory-processor coupling. In some further embodiments the apparatus 10 can comprise a user interface 15. The user interface 15 can be coupled in some embodiments to the processor 21. In some embodiments the processor can control the operation of the user interface and receive inputs from the user interface 15. In some embodiments the user interface 15 can enable a user to input commands to the electronic device or apparatus 10, for example via a keypad, and/or to obtain information from the apparatus 10, for example via a display which is part of the user interface 15. The user interface 15 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the apparatus 10 and further displaying information to the user of the apparatus 10.
In some embodiments the apparatus further comprises a transceiver 13, the transceiver in such embodiments can be coupled to the processor and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver 13 or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The coupling can, as shown in Figure 1, be the transmission channel 107 (where the apparatus is functioning as the recording device 19 or audio scene server 109) or further transmission channel 111 (where the device is functioning as the listening device 113 or audio scene server 109). The transceiver 13 can communicate with further devices by any suitable known communications protocol, for example in some embodiments the transceiver 13 or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
In some embodiments the apparatus comprises a position sensor 16 configured to estimate the position of the apparatus 10. The position sensor 16 can in some embodiments be a satellite positioning sensor such as a GPS (Global Positioning System), GLONASS or Galileo receiver.
In some embodiments the positioning sensor can be a cellular ID system or an assisted GPS system. In some embodiments the apparatus 10 further comprises a direction or orientation sensor. The orientation/direction sensor can in some embodiments be an electronic compass, accelerometer, a gyroscope or be determined by the motion of the apparatus using the positioning estimate.
It is to be understood again that the structure of the electronic device 10 could be supplemented and varied in many ways. Furthermore it could be understood that the above apparatus 10 in some embodiments can be operated as an audio scene server 109. In some further embodiments the audio scene server 109 can comprise a processor, memory and transceiver combination.

Figure 3 shows an example alignment apparatus for content time stamping for crowd-sourced media content. In these circumstances the apparatus is configured to create a combined signal for the received time stamped content, to align the non-time stamped content signal and the combined signal and then to further align both the time stamped and the non-time stamped content signal using the information from the operation of aligning the non-time stamped content signal and the combined signal.
The alignment apparatus comprises in some embodiments a non-time stamped content input 201 or suitable input means. The non-time stamped content input 201 is configured to receive audio signals in any suitable format and pass these to the time stamp determiner 205. In some embodiments the non-time stamped content input 201 can be configured to pre-process the non-time stamped content to deliver the audio data in a format suitable for processing by the time stamp determiner 205.
The operation of receiving the non-time stamped content is shown in Figure 4 by step 301. Furthermore in some embodiments the alignment apparatus comprises a time stamped content input 203 or suitable second input means. The time stamped content input 203 can be configured to receive the time stamped audio data in any suitable format and pass the time stamped content audio data to the time stamp determiner 205 for further processing.
The operation of receiving the time stamped content is shown in Figure 4 by step 303. In some embodiments the time stamped content input 203 and non-time stamp content input 201 can be implemented by a transceiver element of an audio scene server receiving audio signals from various capture devices recording audio signals within the audio scene and/or from the memory or store associated with the audio scene server. However it would be appreciated that in some embodiments the alignment apparatus represents a combination of an audio content recorder or capturer and thus receives the time stamped content via a communications coupling such as a wireless communications link, audio scene server and listener apparatus. In some embodiments the alignment apparatus comprises a time stamp determiner 205 or suitable alignment means. The time stamp determiner 205 is configured to receive the audio data from the non-time stamp content input 201 and also the audio data from the time stamp content input 203 and determine (or align) a time stamp for the non-time stamp content input based on the time stamped content audio signals.
In some embodiments the time stamp determiner 205 can output the audio data, both the received time stamped content input and the aligned non-time stamp content to a content renderer 207.
The operation of determining time stamps for the non-time stamp content is shown in Figure 4 by step 305. In some embodiments the alignment apparatus comprises a content renderer 207 or rendering means. The content renderer 207 is configured in some embodiments to receive the audio signals and render these into an audio signal suitable for consumption. For example in some embodiments the content renderer 207 can be configured to generate a multi-channel audio signal from the input content audio signals suitable for passing to a listener apparatus.
In some embodiments the content renderer 207 can be implemented in the audio scene server, in the content recorder or capturer, or in the content listener apparatus. The implementation of content rendering is generally known and will not be described any further.
The operation of rendering the content of the audio signals is shown in Figure 4 by step 307.
In some embodiments the apparatus further comprises a content processor 209 configured to receive the rendered audio signal data from the content renderer 207 and process it in order that it can be displayed or listened to by the end user. As described herein, the content processor 209 can in some embodiments control the content renderer 207 to produce a suitable rendered audio signal at a determined position, or configuration. The content processor 209 can in some embodiments be implemented within a listener apparatus.
The operation of processing the content, or consuming the content is shown in Figure 4 by step 309.
With respect to Figure 5, the timestamp determiner 205 is shown in further detail. Furthermore with respect to Figure 6 the operation of the time stamp determiner as shown in Figure 5 is described in further detail.
In some embodiments the time stamp determiner 205 comprises a signal combiner 401 or combiner means. The signal combiner 401 is configured to receive at least two of the time stamped audio content signals and combine these signals to generate a single combined time stamp signal.
With respect to Figure 7, an example sequence of time stamped audio signals is shown: a first time stamped signal 601A which has a range from T=0 to T=4, a second time stamped signal 603B with a range from T=2 to T=5, and a third time stamped signal 605C with a range from T=1 to T=6. The time stamped audio signals 651 can be combined in the signal combiner 401 to generate a combined signal.
In some embodiments the combined signal can be created by a suitable averaging and appending means according to the following mathematical expression,

CX(t) = (1 / nClip_tIdx) * sum_{i=0}^{nClip_tIdx - 1} x_{clipIdx_tIdx(i)}(t), for t_start_tIdx ≤ t < t_end_tIdx

where x is the audio signal (or representative signal derived from the audio signal) of the timestamped crowd-sourced content, nClip_tIdx and clipIdx_tIdx describe the number of content items and the content item indices for the tIdx-th time segment, respectively, and t_start_tIdx and t_end_tIdx describe the start and end times of the tIdx-th time segment, respectively. In some embodiments the signal combiner 401 can be configured to repeat the combining expression for 0 ≤ tIdx < nT, where nT is the number of time segments in the timeline. Using the example shown in Figure 7, the data for the combined signal would be,

tIdx = 0: clipIdx_tIdx = {A}, nClip_tIdx = 1, t_start_tIdx = 0, t_end_tIdx = 1
tIdx = 1: clipIdx_tIdx = {A, C}, nClip_tIdx = 2, t_start_tIdx = 1, t_end_tIdx = 2
...
tIdx = 4: clipIdx_tIdx = {C}, nClip_tIdx = 1, t_start_tIdx = 5, t_end_tIdx = 6

The combined signal generated by the signal combiner 401 therefore mixes the overlapping content for each segment to obtain a signal which represents the average signal covering the whole of the time stamped content range.
As shown in Figure 7 the combined signal 606 CX has a range from T=0 to T=6. The combined signal 652 can then be output by the signal combiner 401 to a combined signal aligner determiner 403.
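To illustrate the combining step, the following is a minimal Python sketch (not the patent's implementation): it averages the clips sample by sample wherever they overlap, which is equivalent to the per-segment 1/nClip_tIdx average above, and simply carries over the samples where only a single clip covers the timeline. The sample rate, clip contents and function name are illustrative assumptions.

import numpy as np

FS = 8000  # assumed common sample rate in Hz

def combine(clips):
    # clips: dict mapping name -> (start_time_in_seconds, array of samples)
    start = min(t for t, _ in clips.values())
    end = max(t + len(x) / FS for t, x in clips.values())
    n = int(round((end - start) * FS))
    acc = np.zeros(n)  # running sum of overlapping samples
    cnt = np.zeros(n)  # number of clips covering each sample
    for t, x in clips.values():
        i = int(round((t - start) * FS))
        acc[i:i + len(x)] += x
        cnt[i:i + len(x)] += 1
    cnt[cnt == 0] = 1  # guard against gaps in the timeline
    return start, acc / cnt  # per-sample average, i.e. CX(t)

# Figure 7 example: A covers T=0..4, B covers T=2..5 and C covers T=1..6
rng = np.random.default_rng(0)
clips = {"A": (0.0, rng.standard_normal(4 * FS)),
         "B": (2.0, rng.standard_normal(3 * FS)),
         "C": (1.0, rng.standard_normal(5 * FS))}
t0, cx = combine(clips)  # cx spans T=0..6, matching the combined signal CX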
The operation of creating a combined signal is shown in Figure 6 by step 501.

In some embodiments the time stamp determiner comprises a combined signal aligner determiner 403. The combined signal aligner determiner 403 is configured to receive the combined signal from the signal combiner 401 containing a time stamp and also the non-time stamped content audio signals and attempt to align these signals. With respect to Figure 7, the combined signal 606 CX and a non-time stamped content signal 607D are shown. The combined signal aligner 403 can be configured to determine the time offset between the signals, in other words whether the non-time stamped content audio signal is delayed with respect to the combined signal with time stamped content values or vice versa.
In some embodiments the determination of the time offset can be carried out via a time offset determiner or suitable signal aligner means or function. The time offset between the audio signals can be mathematically represented as follows,

tOffset = time_offset(CX, D)

where time_offset() defines the function that determines the alignment for the specified input signals, CX being the combined audio signal and D the non-time stamped audio signal. The output tOffset therefore in such embodiments contains corresponding time offset values for each of the input signals. The determination of the time offset can be carried out by any suitable function, such as the correlation analyses indicated by Carter, Nuttall and Cable in "The smoothed coherence transform", Proceedings of the IEEE, Vol. 61, No. 10, pages 1497 to 1498, by Cusani in "Performance of fast time delay estimators", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 37, No. 5, pages 757 to 759, or by Chen et al. in "Performance of GCC and AMDF Based Time Delay Estimation in Practical Reverberant Environments", Journal on Applied Signal Processing, Vol. 1, pages 25 to 36.
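As a concrete, hedged sketch of such a time_offset() function, the following Python code performs a plain cross-correlation peak search; the cited smoothed coherence transform and GCC estimators additionally weight the cross-spectrum, which this simplified variant omits. The sample rate and test signals are illustrative assumptions.

import numpy as np

def time_offset(cx, d, fs):
    # Return the delay in seconds of signal d relative to signal cx,
    # taken as the lag that maximises the cross-correlation product.
    corr = np.correlate(d, cx, mode="full")  # correlation at every lag
    lag = int(np.argmax(corr)) - (len(cx) - 1)
    return lag / fs

# d is cx delayed by 1.5 s plus a little noise (low rate keeps this cheap)
fs = 1000
rng = np.random.default_rng(1)
cx = rng.standard_normal(6 * fs)
d = np.concatenate([np.zeros(int(1.5 * fs)), cx])
d = d + 0.01 * rng.standard_normal(len(d))
print(time_offset(cx, d, fs))  # approximately 1.5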
The output of the combined signal aligner determiner 403 can then be passed in some embodiments to the full signal aligner 405.
The operation of combined signal aligning with the non-time stamped content is shown in Figure 6 by step 503.
In some embodiments the time stamp determiner 205 comprises a full signal aligner 405 or suitable full signal alignment means configured to align or time stamp the non-time stamped content signal with respect to the time stamped content, in other words to map the output time offsets to the actual input content items. For example, using the signals A, B, C and D from Figure 7, the following offsets can be determined,

A_offset = tOffset(0)
B_offset = tOffset(0)
C_offset = tOffset(0)
D_offset = tOffset(1)

tOffset2 = {A_offset, B_offset, C_offset, D_offset}

As the combined signal CX is a combination of A, B, and C, the time offset for all of these is equal to the time offset of CX.
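The mapping above is a simple broadcast of the combined signal's offset onto its constituent clips, which might be sketched as follows; the numeric offset values are illustrative assumptions.

# tOffset(0) applies to CX (and hence to A, B and C), tOffset(1) applies to D
t_offset = [0.0, 1.5]
members_of_cx = ["A", "B", "C"]
t_offset2 = {name: t_offset[0] for name in members_of_cx}
t_offset2["D"] = t_offset[1]
# t_offset2 == {'A': 0.0, 'B': 0.0, 'C': 0.0, 'D': 1.5}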
Thus the time stamps of previously time stamped and non-time stamped content items can be updated to use a common time base. This can be performed, for example in some embodiments, by having the following information, Table 1:
Table 1: for each content item, the group identification value (grpID) and the start and end timestamps within the continuous timeline.
The grpID is used to identify crowd-sourced content items that are continuous in time. In the example shown in Figure 7, content audio signals A, B, and C could in such embodiments be assigned the same grpID. The start and end timestamps identify the positions of the content within the continuous timeline either in absolute terms or in relative terms (with respect to content A in the example of Figure 7, as that appears first in the continuous timeline). Furthermore, similar information related to the grouping IDs can be generated in some embodiments corresponding to the input signals of the time offset function. For the example signals shown in Figure 7, this information could be

grpID_Comb = {grpID of A, grpID of D}
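One way of holding the per-item bookkeeping implied by Table 1 and grpID_Comb is sketched below; the record type and field names are illustrative assumptions rather than the patent's data layout.

from dataclasses import dataclass

@dataclass
class ContentItem:
    name: str
    start_ts: float   # startTS within the continuous timeline
    stop_ts: float    # stopTS within the continuous timeline
    grp_id: int = -1  # -1 marks an item not yet assigned to a timeline group

items = [ContentItem("A", 0.0, 4.0, grp_id=0),
         ContentItem("B", 2.0, 5.0, grp_id=0),
         ContentItem("C", 1.0, 6.0, grp_id=0),
         ContentItem("D", 0.0, 3.0)]  # non-timestamped, so grp_id stays -1

grp_id_comb = [items[0].grp_id, items[3].grp_id]  # {grpID of A, grpID of D}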
In addition in some embodiments the following information can be generated from the input content items, the creation of the combined signal(s) and the output of the time offset function:

Table 2:

nGroups        Number of groups that share the same continuous timeline
trackIdx       Number of combined signals
nTracks        Number of total signals (combined + non-timestamped signals)
trackIdxLeft   Indices of non-timestamped signals
grpCombIdxNum  Number of signals assigned to each group after determining the time offsets in Equation (2)
grpCombID      Group index for signals in Equation (2)

where trackIdxLeft describes the indices of the non-timestamped content items in the order they appear in the input content. For example, trackIdxLeft for the example signal set in Figure 7 would be trackIdxLeft = {D}, as content D is the first content item that appears after the combined signals. Furthermore, grpCombID indicates which of the input signals in the time offset function share the same continuous timeline. In some embodiments it can be assumed that the time_offset() function can provide this information as an output value. Where for example signals do not share the same continuous timeline, the time offset function can be configured to assign the time offsets in such a way that the different continuous timelines do not overlap. Based on the returned time offsets and the content durations it is possible in some embodiments to derive which input signals in the time offset function create a continuous timeline (a sketch of this derivation follows the example values below). Where there is more than one continuous timeline then in some embodiments different group indices (for example, starting from index 0 onwards) can be used. Furthermore the variable grpCombIdxNum indicates the number of clips (including also the content signal itself) that the particular content signal is overlapping with. Using the audio signals from Figure 7 again as an example, the values of grpCombID and grpCombIdxNum would be grpCombID = {0, 0} and grpCombIdxNum = {2, 2}, respectively.
For the signals in Figure 7, the following information would therefore be,

trackIdx = 1
nTracks = 2
trackIdxLeft = {3 (index number of content D)}
grpCombIdxNum = {2, 2} % CX overlaps with D and D overlaps with CX
grpCombID = {0, 0} % Both input signals share the same continuous timeline, thus belong to the same group (group 0 in the example case)
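As noted above, which inputs of the time offset function share a continuous timeline can be derived from the returned offsets and the content durations; the sketch below does this by testing whether the offset-shifted spans overlap pairwise. The grouping rule is an illustrative simplification of that derivation.

def comb_groups(offsets, durations):
    # Assign a group index to each input of the time offset function: two
    # inputs share a group when their offset-shifted spans overlap in time.
    spans = [(o, o + d) for o, d in zip(offsets, durations)]
    grp = [-1] * len(spans)
    nxt = 0
    for i, (s0, e0) in enumerate(spans):
        if grp[i] == -1:
            grp[i] = nxt
            nxt += 1
        for j in range(i + 1, len(spans)):
            s1, e1 = spans[j]
            if s0 < e1 and s1 < e0:  # the two intervals overlap
                grp[j] = grp[i]
    return grp

# CX spans 0..6 s and D spans 1.5..4.5 s after alignment: one shared timeline
print(comb_groups([0.0, 1.5], [6.0, 3.0]))  # [0, 0], i.e. grpCombID = {0, 0}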
In some embodiments the alignment or mapping can be performed using such generated values as described herein according to the following pseudo code,
1  Vector of size trackIdxPerGroup[0,...,M-1] = 0
2  Matrix of size clipIndices[0,...,M-1][0,...,M-1] = 0
3  Matrix of size lagIndices[0,...,M-1][0,...,M-1] = 0
4
5  for(i = clipIdx; i < nTracks; i++)
6  {
7    sIdx = i - clipIdx; cIdx = trackIdxLeft[sIdx];
8    if(grpIdx[cIdx] == -1)
9    {
10     if(grpCombIdxNum[i] > 1)
11     {
12       for(j = 0, tGrpIdx = -1; j < clipIdx; j++)
13         if(j != i)
14           if(grpCombIdxID[j] == grpCombIdxID[i])
15             tGrpIdx = j; Break for loop;
16
17       if(tGrpIdx == -1)
18       {
19         curIdx = trackIdxPerGroup[grpCombIdxID[i]];
20         clipIndices[grpCombIdxID[i]][curIdx] = cIdx;
21         lagIndices[grpCombIdxID[i]][curIdx] = tOffset2[cIdx];
22         trackIdxPerGroup[grpCombIdxID[i]] += 1;
23       }
24       else
25       {
26         refGrpIdx = grpID_Comb[tGrpIdx];
27         for(j = 0; j < M; j++)
28         {
29           if(grpID[j] == refGrpIdx)
30           {
31             curIdx = trackIdxPerGroup[grpCombIdxID[i]];
32             for(k = 0, isfound = 0; k < curIdx; k++)
33               if(clipIndices[grpCombIdxID[i]][k] == j)
34                 isfound = 1; Break for loop;
35
36             if(isfound == 0)
37             {
38               clipIndices[grpCombIdxID[i]][curIdx] = j;
39               lagIndices[grpCombIdxID[i]][curIdx] = tOffset2[j];
40               trackIdxPerGroup[grpCombIdxID[i]] += 1;
41             }
42           }
43         }
44
45         curIdx = trackIdxPerGroup[grpCombIdxID[i]];
46         clipIndices[grpCombIdxID[i]][curIdx] = cIdx;
47         lagIndices[grpCombIdxID[i]][curIdx] = tOffset2[cIdx];
48         trackIdxPerGroup[grpCombIdxID[i]] += 1;
49       }
50     }
51   }
52 }

where M is the number of input content items, which includes the time stamped and non-time stamped content audio signals that have been specified for processing. In the above code, lines 6 to 51 are repeated for each of the non-time stamped content items. The group identification value grpID of the non-time stamped input content is checked to determine whether it is unknown, as shown in line 8 of the above pseudo code. Furthermore, after determining the time offset, line 10 checks whether the non-time stamped audio signal overlaps with some of the other input signals.
Where these conditions are met the content signal information can be processed further. For example, lines 12 to 15 determine whether the signal is overlapping with any of the combined signals. Where no overlapping group is found, the signal parameters (the input content item index, as shown in line 20, and the time offset, as shown in line 21) are appended to the output grouping data, as shown in lines 19 to 22 of the pseudo code.
Where an overlapping group is found, the full signal aligner, as shown in lines 24 onwards, checks the input content item to determine whether the audio signal item belongs to the combined signal, as shown in line 29. Where this is the situation the values are appended to the output grouping data as shown in lines 38 to 40. Furthermore the input content item that is overlapping with the combined signal is also appended, as shown in lines 45 to 48. Lines 31 to 34 of the pseudo code ensure that the signal parameters are appended only once to the output grouping data.
In some embodiments the maximum group identification value can be determined according to the following expression,

maxGrpID = max(grpID[0], ..., grpID[M-1])

where max() returns the maximum value of the specified input values. The full signal aligner can then in some embodiments finalise the mapping by updating the timings and the group identification values of those input content items that have been identified to share continuous timelines. In some embodiments this finalisation operation can be described with regard to the following pseudo code,
1  for(i = 0; i < nGroups; i++)
2  {
3      nClips = trackIdxPerGroup[i]
4      if(nClips > 1)
5      {
6          refGrpIdx = grpID[clipIndices[i][0]]
7          for(j = 0; j < nClips; j++)
8          {
9              if(grpID[clipIndices[i][j]] != -1)
10                 refGrpIdx = grpID[clipIndices[i][j]]
11         }
12
13         for(j = 0, minPos = -1; j < nClips; j++)
14             if(grpID[clipIndices[i][j]] == refGrpIdx)
15                 if(minPos == -1 || minPos > startTS[clipIndices[i][j]])
16                     minPos = startTS[clipIndices[i][j]]
17
18         for(j = 0; j < nClips; j++)
19         {
20             if(grpID[clipIndices[i][j]] != -1)
21             {
22                 diffPos = startTS[clipIndices[i][j]] - minPos
23                 startTS[clipIndices[i][j]] = diffPos
24             }
25             else startTS[clipIndices[i][j]] = 0
26         }
27
28         for(j = 0; j < nClips; j++)
29         {
30             cIdx = clipIndices[i][j]
31             lagVal = lagIndices[i][j]
32             contentDuration = stopTS[cIdx] - startTS[cIdx]
33             if(grpID[cIdx] == -1)
34                 grpID[cIdx] = (refGrpIdx == -1) ? maxGrpID : refGrpIdx
35             tServerStatus[cIdx] = 2
36             startTS[cIdx] += minPos + lagVal
37             stopTS[cIdx] = startTS[cIdx] + contentDuration
38         }
39     }
40     maxGrpID = maxGrpID + 1
41 }
In such embodiments lines 2 to 41 of the above code are repeated for each group that shares a continuous timeline. Where the group has more than one content signal, as determined by line 4, the corresponding input content information is updated. In the above code, lines 6 to 11 determine the reference group identification value for the content signals. Furthermore, lines 13 to 16 determine the start time for the corresponding group. In such embodiments lines 18 to 26 update the start times of the content items within the group to match the relative time differences as defined by the offset time value defined herein. Lines 28 to 38 of the code update the start and stop times and the group identification value information for each content item within the identified group. The variable tServerStatus in line 35 is used to indicate that the timing information for the content item is obtained through the combined signal processing mode.
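A condensed Python sketch of this finalisation step is given below; the tracks are modelled as dictionaries with start_ts, stop_ts and grp_id fields, and all names are illustrative assumptions rather than the claimed implementation.

def finalise_group(tracks, members, lags, max_grp_id):
    # Normalise timings for one group of tracks sharing a continuous
    # timeline (cf. lines 6 to 38 of the pseudo code above).
    ref_grp = -1
    for i in members:                        # reference group id (lines 6-11)
        if tracks[i]['grp_id'] != -1:
            ref_grp = tracks[i]['grp_id']
    anchored = [tracks[i]['start_ts'] for i in members
                if tracks[i]['grp_id'] == ref_grp]
    min_pos = min(anchored) if anchored else 0.0  # group start (lines 13-16)
    for i, lag in zip(members, lags):        # update timings (lines 18-38)
        t = tracks[i]
        duration = t['stop_ts'] - t['start_ts']
        if t['grp_id'] == -1:
            t['grp_id'] = max_grp_id if ref_grp == -1 else ref_grp
            t['start_ts'] = min_pos + lag    # unknown items start from zero
        else:
            t['start_ts'] = t['start_ts'] + lag  # keep relative position
        t['stop_ts'] = t['start_ts'] + duration
        t['t_server_status'] = 2             # timing came from combined mode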
Furthermore, in some embodiments the time stamped and non-time stamped content items can be aligned based on the guidance information as indicated herein. The guidance information can in some embodiments include the variable tServerStatus and the corresponding startTS information for each input content item, where available. In some embodiments the alignment can be carried out according to the following expression,

(startTS, stopTS, grpID) = time_align(startTS, stopTS, grpID, <input_content_items>)

where time_align() defines the function that determines the alignment for the specified input. In such embodiments the function can take as an input the start and end times and the group identification value along with the actual media content. The output of such a function would be the updated timings and group identification values for each of the input content items. The operation of determining the time offset can use any suitable approach, such as shown previously with regard to correlation analysis. When two signals X and Y are about to be aligned, such as described in the expression above, and they satisfy the condition x.tServerStatus == 2 and y.tServerStatus == 2, the alignment parameters can be adjusted for an efficient time offset search by defining the start and stop times, and the time offset window, for the signal pair. This adjustment is in some embodiments performed according to the following code:

Pseudo-code 3:
1  toWindow = 1s
2  x.duration = x.stopTS - x.startTS
3  y.duration = y.stopTS - y.startTS
4
5  if x.startTS < y.startTS
6      x.startTSNew = y.startTS - x.startTS - toWindow
7      if x.startTSNew < 0
8          x.startTSNew = 0
9      x.stopTS = x.startTSNew + (x.duration - (x.startTSNew - x.startTS))
10     x.startTS = x.startTSNew
11 else
12     y.startTSNew = x.startTS - y.startTS - toWindow
13     if y.startTSNew < 0
14         y.startTSNew = 0
15     y.stopTS = y.startTSNew + (y.duration - (y.startTSNew - y.startTS))
16     y.startTS = y.startTSNew
17 End
18 toWindow = toWindow * 2
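As a rough executable illustration of pseudo-code 3 above, the same window adjustment can be written in Python as follows; the Clip fields and the helper name are assumptions for the sketch only, not the claimed implementation.

from dataclasses import dataclass

@dataclass
class Clip:
    start_ts: float   # position on the common timeline, in seconds
    stop_ts: float
    t_server_status: int = 0

def narrow_search_window(x: Clip, y: Clip, to_window: float = 1.0) -> float:
    # Trim the earlier-starting clip so that the subsequent time offset
    # search only needs to cover roughly +/- to_window seconds.
    x_dur = x.stop_ts - x.start_ts
    y_dur = y.stop_ts - y.start_ts
    if x.start_ts < y.start_ts:
        new_start = max(y.start_ts - x.start_ts - to_window, 0.0)
        x.stop_ts = new_start + (x_dur - (new_start - x.start_ts))
        x.start_ts = new_start
    else:
        new_start = max(x.start_ts - y.start_ts - to_window, 0.0)
        y.stop_ts = new_start + (y_dur - (new_start - y.start_ts))
        y.start_ts = new_start
    return to_window * 2.0   # total width of the offset search window

x = Clip(start_ts=0.0, stop_ts=120.0, t_server_status=2)
y = Clip(start_ts=90.0, stop_ts=200.0, t_server_status=2)
if x.t_server_status == 2 and y.t_server_status == 2:
    window = narrow_search_window(x, y)   # offset search limited to ~2 s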
In such embodiments it can therefore be seen that the start and stop times of the signal pair are updated such that the time offset window (toWindow) for the pair can be limited to a small value. In some embodiments the value can be set to 1 second, indicating that the signals in the pair are aligned within plus or minus 1 second with respect to each other. This provides a major gain in operational efficiency, as the final time offset needs to be searched only within a very limited time period. Furthermore, in such embodiments the combined signal represents the entire event scene, and it is therefore highly unlikely that the time offsets returned by the expressions herein are not optimal, even though the geographical area of the event scene may cover a large area where different content signals can be located quite far from each other. The processing steps described herein are designed to overcome the challenges the size of the event scene may bring to determining the time offsets between the various content items in a robust and reliable manner.

In some embodiments the combined signal may be computed more than once. This can for example be carried out where the number of content items to be covered is large. In such examples a suitable combination, for example a random or fixed combination, of the already time stamped content can be used to create the combined signal and then, following the steps as described herein, the final time stamps in the common time data can be determined. The combined signal can in some embodiments be recreated with differences in composition, including some new content items compared to the previously created combined signal, or replacing some content items with new items in the combined signal, and the processing steps for alignment then repeated. The output values from each of these combined signal iterations can then be saved and the final time differences for the content items determined from the saved values. In some examples data analysis can be performed to determine the time offset value towards each content item in a converging form. This converging form can be extracted using mean and standard deviation calculations, where the final output is a mean value which excludes any outlier values identified by the standard deviation.
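One way to realise this converging estimate is sketched below in Python; offsets gathered from several combined signal iterations are pooled for a content item, values further than one standard deviation from the mean are discarded, and the mean of the remainder is returned. The function name and the rejection threshold are illustrative assumptions.

from statistics import mean, stdev

def converged_offset(offsets, n_sigma=1.0):
    # Combine per-iteration time offsets for one content item,
    # discarding outliers with a mean/standard deviation test.
    if len(offsets) < 2:
        return offsets[0] if offsets else None
    m, s = mean(offsets), stdev(offsets)
    kept = [o for o in offsets if abs(o - m) <= n_sigma * s]
    return mean(kept) if kept else m

# Offsets from four combined signal iterations, in seconds; the
# 9.10 s estimate is rejected as an outlier.
print(converged_offset([2.40, 2.43, 9.10, 2.38]))   # ~2.40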
Although the above has been described with regard to audio signals, or audio-visual signals, it would be appreciated that embodiments may also be applied to audio-video signals where the audio signal components of the recorded data are processed in terms of determining the base signal and the time alignment factors for the remaining signals, and the video signal components may be synchronised using the above embodiments of the invention. In other words the video parts may be synchronised using the audio synchronisation information.
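As a brief hypothetical illustration of this, the offset found for a clip's audio track can simply be applied to the timestamps of its video frames:

def align_video_to_audio(video_frame_ts, audio_offset):
    # Shift video frame timestamps by the clip's audio alignment
    # offset so that audio and video remain in sync.
    return [t + audio_offset for t in video_frame_ts]

aligned = align_video_to_audio([0.00, 0.04, 0.08], 2.40)   # 25 fps frames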
It shall be appreciated that the term user equipment is intended to cover any suitable type of wireless user equipment, such as mobile telephones, portable data processing devices or portable web browsers.
Furthermore elements of a public land mobile network (PLMN) may also comprise apparatus as described above. In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof. The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVD and the data variants thereof, and CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate. Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

CLAIMS:
1. Apparatus comprising at least one processor and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured to, with the at least one processor, cause the apparatus to at least perform:
generating a combined audio signal from at least two first audio signals; comparing the combined audio signal to at least one second audio signal; determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and
associating the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal.
2. The apparatus as claimed in claim 1, further caused to perform:
receiving the at least two first audio signals from an audio scene; and receiving the at least one second audio signal from the audio scene.
3. The apparatus as claimed in claims 1 and 2, further caused to perform aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
4. The apparatus as claimed in claim 3, further caused to render the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
5. The apparatus as claimed in claims 1 to 4, wherein generating a combined audio signal from at least two first audio signals causes the apparatus to perform: generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and appending to the combined audio signal an available at least two first audio signal part otherwise.
6. The apparatus as claimed in claims 1 to 5, wherein comparing the combined audio signal to at least one second audio signal causes the apparatus to perform a cross-correlation between the combined audio signal and the at least one second audio signal.
7. The apparatus as claimed in claim 6, wherein determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal causes the apparatus to perform determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
8. The apparatus as claimed in claims 1 to 7, wherein associating the alignment value to the at least two first audio signals causes the apparatus to perform:
assigning the at least two first audio signals to a first group of audio signals;
assigning the at least one second audio signal to a second group of audio signals;
assigning a null alignment value to the first group of audio signals; and assigning the alignment value to the second group of audio signals.
9. The apparatus as claimed in claims 1 to 8, further caused to perform:
generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value;
comparing the further combined audio signal to at least one further second audio signal;
determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal; and associating the alignment value to the at least one further second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
10. The apparatus as claimed in claims 1 to 9, wherein the first audio signals comprise timestamp information and the second audio signals lack the timestamp information.
11. A method comprising:
generating a combined audio signal from at least two first audio signals; comparing the combined audio signal to at least one second audio signal; determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and
associating the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal.
12. The method as claimed in claim 11, further comprising:
receiving the at least two first audio signals from an audio scene; and receiving the at least one second audio signal from the audio scene.
13. The method as claimed in claims 11 and 12, further comprising aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
14. The method as claimed in claim 13, further comprising rendering the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
15. The method as claimed in claims 11 to 14, wherein generating a combined audio signal from at least two first audio signals comprises:
generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and appending to the combined audio signal an available at least two first audio signal part otherwise.
16. The method as claimed in claims 11 to 15, wherein comparing the combined audio signal to at least one second audio signal comprises a cross-correlation between the combined audio signal and the at least one second audio signal.
17. The method as claimed in claim 16, wherein determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal comprises determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
18. The method as claimed in claims 11 to 17, wherein associating the alignment value to the at least two first audio signals comprises:
assigning the at least two first audio signals to a first group of audio signals;
assigning the at least one second audio signal to a second group of audio signals;
assigning a null alignment value to the first group of audio signals; and assigning the alignment value to the second group of audio signals.
19. The method as claimed in claims 11 to 18, further comprising:
generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value;
comparing the further combined audio signal to at least one further second audio signal;
determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal; and associating the alignment value to the at least one further second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
20. The method as claimed in claims 11 to 19, wherein the first audio signals comprise timestamp information and the second audio signals lack the timestamp information.
21. An apparatus comprising:
a signal combiner configured to generate a combined audio signal from at least two first audio signals;
a combined signal comparator configured to compare the combined audio signal to at least one second audio signal;
a signal aligner configured to determine an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and
a full signal aligner configured to associate the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal.
22. The apparatus as claimed in claim 21, further comprising:
a first signal receiver configured to receive the at least two first audio signals from an audio scene; and
a second signal receiver configured to receive the at least one second audio signal from the audio scene.
23. The apparatus as claimed in claims 21 and 22, further comprising a delay configured to align at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
24. The apparatus as claimed in claim 23, further comprising a signal renderer configured to render the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
25. The apparatus as claimed in claims 21 to 24, wherein the signal combiner comprises: an averager configured to generate a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and
an appender configured to append to the combined audio signal an available at least two first audio signal part otherwise.
26. The apparatus as claimed in claims 21 to 25, wherein the signal comparator comprises a cross-correlator configured to generate a cross-correlation product between the combined audio signal and the at least one second audio signal.
27. The apparatus as claimed in claim 26, wherein the alignment determiner comprises an offset determiner configured to determine a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
28. The apparatus as claimed in claims 21 to 27, wherein the full signal aligner comprises:
a first assigner configured to assign the at least two first audio signals to a first group of audio signals;
a second assigner configured to assign the at least one second audio signal to a second group of audio signals;
a null assigner configured to assign a null alignment value to the first group of audio signals; and
a delay assigner configured to assign the alignment value to the second group of audio signals.
29. The apparatus as claimed in claims 21 to 28, wherein:
the signal combiner is configured to further combine an audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value;
the comparator is configured to further compare the further combined audio signal to at least one further second audio signal; the aligner is configured to further determine a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal; and
the full signal aligner is further configured to associate the alignment value to the at least one further second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
30. The apparatus as claimed in claims 21 to 29, wherein the first audio signals comprise timestamp information and the second audio signals lack the timestamp information.
31. An apparatus comprising:
means for generating a combined audio signal from at least two first audio signals;
means for comparing the combined audio signal to at least one second audio signal;
means for determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and
means for associating the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal.
32. The apparatus as claimed in claim 31, further comprising:
means for receiving the at least two first audio signals from an audio scene; and
means for receiving the at least one second audio signal from the audio scene.
33. The apparatus as claimed in claims 31 and 32, further comprising means for aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
34. The apparatus as claimed in claim 33, further comprising means for rendering the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
35. The apparatus as claimed in claims 31 to 34, wherein the means for generating a combined audio signal from at least two first audio signals comprises:
means for generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and
means for appending to the combined audio signal an available at least two first audio signal part otherwise.
36. The apparatus as claimed in claims 31 to 35, wherein the means for comparing the combined audio signal to at least one second audio signal comprises means for generating a cross-correlation product between the combined audio signal and the at least one second audio signal.
37. The apparatus as claimed in claim 36, wherein the means for determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal comprises means for determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
38. The apparatus as claimed in claims 31 to 37, wherein the means for associating the alignment value to the at least two first audio signals comprises: means for assigning the at least two first audio signals to a first group of audio signals;
means for assigning the at least one second audio signal to a second group of audio signals;
means for assigning a null alignment value to the first group of audio signals; and means for assigning the alignment value to the second group of audio signals.
39. The apparatus as claimed in claims 31 to 38, further comprising:
the means for combining further comprising means for generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value;
the means for comparing further comprising means for comparing the further combined audio signal to at least one further second audio signal;
the means for determining an alignment value comprising means for determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal; and
the means for associating the alignment value comprising means for associating the alignment value to the at least one further second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
40. The apparatus as claimed in claims 31 to 39, wherein the first audio signals comprise timestamp information and the second audio signals lack the timestamp information.
41. A computer program product stored on a medium for causing an apparatus to perform the method of any of claims 11 to 20.
42. An electronic device comprising apparatus as claimed in claims 1 to 10 and 21 to 40.
43. A chipset comprising apparatus as claimed in claims 1 to 10 and 21 to 40.