
WO2013088208A1 - An audio scene alignment apparatus - Google Patents


Info

Publication number
WO2013088208A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
audio
signal
combined
audio signals
Prior art date
Application number
PCT/IB2011/055692
Other languages
French (fr)
Inventor
Juha Petteri Ojanpera
Original Assignee
Nokia Corporation
Priority date
Filing date
Publication date
Application filed by Nokia Corporation
Priority to PCT/IB2011/055692
Publication of WO2013088208A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302 Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307 Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N21/43072 Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen of multiple content streams on the same device
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B27/034 Electronic editing of digitised analogue information signals, e.g. audio or video signals on discs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/21805 Source of audio or video content, e.g. local disk arrays enabling multiple viewpoints, e.g. using a plurality of cameras
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/2187 Live feed
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/27 Server based end-user applications
    • H04N21/274 Storing end-user multimedia data in response to end-user request, e.g. network recorder
    • H04N21/2743 Video hosting of uploaded data from client
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45 Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/462 Content or additional data management, e.g. creating a master electronic program guide from data received from the Internet and a Head-end, controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
    • H04N21/4622 Retrieving content or additional data from different sources, e.g. from a broadcast channel and the Internet
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81 Monomedia components thereof
    • H04N21/8106 Monomedia components thereof involving special audio data, e.g. different tracks for different languages
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85 Assembly of content; Generation of multimedia applications
    • H04N21/854 Content authoring
    • H04N21/8547 Content authoring involving timestamps for synchronizing content
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/414 Specialised client platforms, e.g. receiver in car or embedded in a mobile appliance
    • H04N21/41407 Specialised client platforms, e.g. receiver in car or embedded in a mobile appliance embedded in a portable device, e.g. video client on a mobile phone, PDA, laptop
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42202 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS] environmental sensors, e.g. for detecting temperature, luminosity, pressure, earthquakes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42203 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS] sound input device, e.g. microphone
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/4223 Cameras

Definitions

  • the present application relates to apparatus for the processing of audio, and additionally audio-video, signals to enable the alignment of audio signals.
  • the invention further relates to, but is not limited to, apparatus for processing audio and additionally audio-video signals from mobile devices.
  • Multiple 'feeds' may be found in sharing services for video and audio signals (such as those employed by YouTube).
  • Such systems are known and are widely used to share user generated content recorded and uploaded or up-streamed to a server and then downloaded or down-streamed to a viewing/listening user.
  • Such systems rely on users recording and uploading or up-streaming a recording of an event using the recording facilities at hand to the user. This may typically be in the form of the camera and microphone arrangement of a mobile device such as a mobile phone.
  • the viewing/listening end user may then select one of the up-streamed or uploaded data streams to view or listen to.
  • an apparatus comprising at least one processor and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured to, with the at least one processor, cause the apparatus to at least perform: generating a combined audio signal from at least two first audio signals; comparing the combined audio signal to at least one second audio signal; determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and associating the alignment value to the at least one second audio signal so as to temporally align the at least two first audio signals with the at least one second audio signal.
  • the apparatus may be further caused to perform: receiving the at least two first audio signals from an audio scene; receiving the at least one second audio signal from the audio scene.
  • the apparatus may be further caused to perform aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
  • the apparatus may be further caused to render the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
  • Generating a combined audio signal from at least two first audio signals may cause the apparatus to perform: generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and appending to the combined audio signal any available part of the at least two first audio signals otherwise.
  • Comparing the combined audio signal to at least one second audio signal may cause the apparatus to perform a cross-correlation between the combined audio signal and the at least one second audio signal.
  • Determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal may cause the apparatus to perform determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
  • Associating the alignment value to the at least two first audio signals may cause the apparatus to perform: assigning the at least two first audio signals to a first group of audio signals; assigning the at least one second audio signal to a second group of audio signals; assigning a null alignment value to the first group of audio signals; and assigning the alignment value to the second group of audio signals.
  • the apparatus may be further caused to perform: generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value; comparing the further combined audio signal to at least one further second audio signal; determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal; and associating the alignment value to the at least one further second audio signal so as to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
  • the first audio signals may comprise timestamp information and the second audio signals lack the timestamp information.
  • a method comprising: generating a combined audio signal from at least two first audio signals; comparing the combined audio signal to at least one second audio signal; determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and associating the alignment value to the at least one second audio signal so as to temporally align the at least two first audio signals with the at least one second audio signal.
  • the method may further comprise: receiving the at least two first audio signals from an audio scene; receiving the at least one second audio signal from the audio scene.
  • the method may further comprise aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
  • the method may further comprise rendering the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
  • Generating a combined audio signal from at least two first audio signals may comprise: generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and appending to the combined audio signal any available part of the at least two first audio signals otherwise.
  • Comparing the combined audio signal to at least one second audio signal may comprise a cross-correlation between the combined audio signal and the at least one second audio signal.
  • Determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal may comprise determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
  • Associating the alignment value to the at least two first audio signals may comprise: assigning the at least two first audio signals to a first group of audio signals; assigning the at least one second audio signal to a second group of audio signals; assigning a null alignment value to the first group of audio signals; and assigning the alignment value to the second group of audio signals.
  • the method may further comprise: generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value; comparing the further combined audio signal to at least one further second audio signal; determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal; and associating the alignment value to the at least one further second audio signal so as to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
  • the first audio signals may comprise timestamp information and the second audio signals lack the timestamp information.
  • an apparatus comprising: a signal combiner configured to generate a combined audio signal from at least two first audio signals; a combined signal comparator configured to compare the combined audio signal to at least one second audio signal; a signal aligner configured to determine an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and a full signal aligner configured to associate the alignment value to the at least one second audio signal so as to temporally align the at least two first audio signals with the at least one second audio signal.
  • the apparatus may further comprise: a first signal receiver configured to receive the at least two first audio signals from an audio scene; and a second signal receiver configured to receive the at least one second audio signal from the audio scene.
  • the apparatus may further comprise a delay configured to align at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
  • the apparatus may further comprise a signal renderer configured to render the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
  • the signal combiner may comprise: an averager configured to generate a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and an appender configured to append to the combined audio signal an available at least two first audio signal part otherwise.
  • the signal comparator may comprise a cross-correlator configured to generate a cross-correlation product between the combined audio signal and the at least one second audio signal.
  • the alignment determiner may comprise an offset determiner configured to determine a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
  • the full signal aligner may comprise: a first assigner configured to assign the at least two first audio signals to a first group of audio signals; a second assigner configured to assign the at least one second audio signal to a second group of audio signals; a null assigner configured to assign a null alignment value to the first group of audio signals; and a delay assigner configured to assign the alignment value to the second group of audio signals.
  • the signal combiner may be configured to generate a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value.
  • the comparator may be configured to further compare the further combined audio signal to at least one further second audio signal.
  • the aligner may be configured to further determine a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal.
  • the full signal aligner may be further configured to associate the alignment value to the at least one further second audio signal so as to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
  • the first audio signals may comprise timestamp information and the second audio signals lack the timestamp information.
  • an apparatus comprising: means for generating a combined audio signal from at least two first audio signals; means for comparing the combined audio signal to at least one second audio signal; means for determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and means for associating the alignment value to the at least one second audio signal so as to temporally align the at least two first audio signals with the at least one second audio signal.
  • the apparatus may further comprise: means for receiving the at least two first audio signals from an audio scene; means for receiving the at least one second audio signal from the audio scene.
  • the apparatus may further comprise means for aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
  • the apparatus may further comprise means for rendering the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
  • the means for generating a combined audio signal from at least two first audio signals may comprise: means for generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and means for appending to the combined audio signal any available part of the at least two first audio signals otherwise.
  • the means for comparing the combined audio signal to at least one second audio signal may comprise means for generating a cross-correlation product between the combined audio signal and the at least one second audio signal.
  • the means for determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal may comprise means for determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
  • the means for associating the alignment value to the at least two first audio signals may comprise: means for assigning the at least two first audio signals to a first group of audio signals; means for assigning the at least one second audio signal to a second group of audio signals; means for assigning a null alignment value to the first group of audio signals; and means for assigning the alignment value to the second group of audio signals.
  • the means for combining may further comprise means for generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value.
  • the means for comparing may further comprise means for comparing the further combined audio signal to at least one further second audio signal.
  • the means for determining an alignment value may comprise means for determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal.
  • the means for associating the alignment value may comprise means for associating the alignment value to the at least one further second audio signal so as to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
  • the first audio signals may comprise timestamp information and the second audio signals lack the timestamp information.
  • a computer program product stored on a medium may cause an apparatus to perform the method as described herein.
  • An electronic device may comprise apparatus as described herein.
  • a chipset may comprise apparatus as described herein.
  • Embodiments of the present application aim to address problems associated with the state of the art.
  • Figure 1 shows schematically a multi-user free-viewpoint service sharing system which may encompass embodiments of the application;
  • Figure 2 shows schematically an apparatus suitable for being employed in embodiments of the application;
  • Figure 3 shows schematically an audio signal system according to some embodiments;
  • Figure 4 shows a flow diagram of the operation of the audio signal system as shown in Figure 3;
  • Figure 5 shows schematically the time stamp determiner shown in Figure 3 in further detail according to some embodiments;
  • Figure 6 shows the operation of the time stamp determiner shown in Figure 5 according to some embodiments; and
  • Figure 7 shows an example set of time stamped and non-time stamped audio signals to be aligned according to embodiments.
  • audio signals and audio capture signals are described. However it would be appreciated that in some embodiments the audio signal/audio capture is a part of an audio-video system.
  • the concept of this application is related to assisting in the production of immersive person-to-person communication and can include video. It would be understood that the space within which the devices record the audio signal can be arbitrarily positioned within an event space.
  • the captured signals as described herein are transmitted or alternatively stored for later consumption where the end user can select the listening point based on their preference from the reconstructed audio space.
  • the rendering part then can provide one or more down mixed signals from the multiple recordings that correspond to the selected listening point.
  • each recording device can record the event seen and upload or upstream the recorded content.
  • the uploading or up-streaming process can implicitly include positioning information about where the content is being recorded.
  • audio signal content can be uploaded in non-real time operations.
  • the media content in the form of a captured audio signal can be uploaded a few minutes, hours, days or weeks after the event.
  • the amount of content that represents each particular event and held in the server can fluctuate as a function of time.
  • uploaded audio signal content typically does not employ common time keeping or time stamping and therefore the newly uploaded or streamed content needs to be aligned to use a common time stamping before any down mixed signal is provided for consumption.
  • as new content can be uploaded to the server at any time, a situation could arise where the existing content already uses a common time stamping, whereas new content lacks this time stamp and should be transformed to this time stamping mode.
  • the concept of this application therefore is to provide an enabler for the case where some of the audio signal content for the event is already converted to a common time stamping and the conversion is to be applied to any new uploaded audio signal content.
  • a common time base can be achieved with a dedicated synchronisation signal
  • the capture devices are equipped to receive a specific beacon signal or timing information obtained through a network or other received data such as positioning satellite timing data (such as from a GPS satellite)
  • the use of a beacon signal typically requires special hardware and/or software installations to the recording or capture apparatus which limits the applicability of multiuser sharing services as recording devices become too expensive for mass use or limits the use of existing devices.
  • although GPS or other satellite synchronisation timing signals can be used, the device requires a GPS or other satellite positioning receiver to receive the signal, and furthermore such signals could not be used in circumstances where a GPS or satellite positioning signal is not available - such as, for example, indoors, in heavily built-up urban areas, or in woodland or forest regions.
  • in a further approach the recording devices synchronise their recordings against a network time protocol (NTP) reference.
  • an NTP reference requires a network connection, which may not be available in all situations, and typically timing errors can be introduced to the time stamps due to transmission delays.
  • the audio space 1 can have located within it at least one recording or capturing device or apparatus 19 which are arbitrarily positioned within the audio space to record suitable audio scenes.
  • the apparatus 19 shown in Figure 1 are represented as microphones with a polar gain pattern 101 showing the directional audio capture gain associated with each apparatus.
  • the apparatus 19 in Figure 1 are shown such that some of the apparatus are capable of attempting to capture the audio scene or activity 103 within the audio space.
  • the activity 103 can be any event the user of the apparatus wishes to capture. For example the event could be a music event or audio of a "news worthy" event.
  • although the apparatus 19 is shown having a directional microphone gain pattern 101, it would be appreciated that in some embodiments the microphone or microphone array of the recording apparatus 19 has an omnidirectional gain or a different gain profile to that shown in Figure 1.
  • Each recording apparatus 19 can in some embodiments transmit or alternatively store for later consumption the captured audio signals via a transmission channel 107 to an audio scene server 109.
  • the recording apparatus 19 in some embodiments can encode the audio signal to compress the audio signal in a known way in order to reduce the bandwidth required in "uploading" the audio signal to the audio scene server 109.
  • the recording apparatus 19 in some embodiments can be configured to estimate and upload via the transmission channel 107 to the audio scene server 109 an estimation of the location and/or the orientation or direction of the apparatus.
  • the position information can be obtained, for example, using GPS coordinates, cell-ID or a-GPS or any other suitable location estimation methods and the orientation/direction can be obtained, for example using a digital compass, accelerometer, or gyroscope information.
  • the recording apparatus 19 can be configured to capture or record one or more audio signals, for example the apparatus in some embodiments has multiple microphones each configured to capture the audio signal from a different direction. In such embodiments the recording device or apparatus 19 can record and provide more than one signal from the different directions/orientations and further supply position/direction information for each signal.
  • each captured or recorded audio signal can be defined as an audio or sound source.
  • each audio source can be defined as having a position or location which can be an absolute or relative value.
  • the audio source can be defined as having a position relative to a desired listening location or position.
  • the audio source can be defined as having an orientation, for example where the audio source is a beamformed processed combination of multiple microphones in the recording apparatus, or a directional microphone.
  • the orientation may have both a directionality and a range, for example defining the 3dB gain range of a directional microphone.
  • the capturing and encoding of the audio signal and the estimation of the position/direction of the apparatus is shown in Figure 1 by step 1001.
  • the uploading of the audio and position/direction estimate to the audio scene server 109 is shown in Figure 1 by step 1003.
  • the audio scene server 109 furthermore can in some embodiments communicate via a further transmission channel 111 to a listening device 113.
  • the listening device 113, which is represented in Figure 1 by a set of headphones, can prior to or during downloading via the further transmission channel 111 select a listening point, in other words select a position such as indicated in Figure 1 by the selected listening point 105.
  • the listening device 113 can communicate via the further transmission channel 111 to the audio scene server 109 the request.
  • the audio scene server 109 can as discussed above in some embodiments receive from each of the recording apparatus 19 an approximation or estimation of the location and/or direction of the recording apparatus 19.
  • the audio scene server 109 can in some embodiments from the various captured audio signals from recording apparatus 19 produce a composite audio signal representing the desired listening position and the composite audio signal can be passed via the further transmission channel 111 to the listening device 113.
  • the listening device 113 can request a multiple channel audio signal or a mono-channel audio signal. This request can in some embodiments be received by the audio scene server 109 which can generate the requested multiple channel data.
  • the audio scene server 109 in some embodiments can receive each uploaded audio signal and can keep track of the positions and the associated direction/orientation associated with each audio source.
  • the audio scene server 109 can provide a high level coordinate system which corresponds to locations where the uploaded/upstreamed content source is available to the listening device 113. The "high level" coordinates can be provided for example as a map to the listening device 113 for selection of the listening position.
  • the listening device (end user or an application used by the end user) can in such embodiments be responsible for determining or selecting the listening position and sending this information to the audio scene server 109.
  • the audio scene server 109 can in some embodiments receive the selection/determination and transmit the downmixed signal corresponding to the specified location to the listening device.
  • the listening device/end user can be configured to select or determine other aspects of the desired audio signal, for example signal quality, number of channels of audio desired, etc.
  • the audio scene server 109 can provide in some embodiments a selected set of downmixed signals which correspond to listening points neighbouring the desired location/direction and the listening device 113 selects the audio signal desired.
  • Figure 2 shows a schematic block diagram of an exemplary apparatus or electronic device 10, which may be used to record (or operate as a recording device 19) or listen (or operate as a listening device 113) to the audio signals (and similarly to record or view the audio-visual images and data). Furthermore in some embodiments the apparatus or electronic device can function as the audio scene server 109.
  • the electronic device 10 may for example be a mobile terminal or user equipment of a wireless communication system when functioning as the recording device or listening device 113.
  • the apparatus can be an audio player or audio recorder, such as an MP3 player, a media recorder/player (also known as an MP4 player), or any suitable portable device for recording audio or audio/video, such as a camcorder or a memory audio or video recorder.
  • the apparatus 10 can in some embodiments comprise an audio subsystem.
  • the audio subsystem for example can comprise in some embodiments a microphone or array of microphones 11 for audio signal capture.
  • the microphone or array of microphones can be a solid state microphone, in other words capable of capturing audio signals and outputting a suitable digital format signal.
  • the microphone or array of microphones 11 can comprise any suitable microphone or audio capture means, for example a condenser microphone, capacitor microphone, electrostatic microphone, Electret condenser microphone, dynamic microphone, ribbon microphone, carbon microphone, piezoelectric microphone, or microelectrical-mechanical system (MEMS) microphone.
  • the microphone 11 or array of microphones can in some embodiments output the captured audio signal to an analogue-to-digital converter (ADC) 14.
  • the apparatus can further comprise an analogue-to-digital converter (ADC) 14 configured to receive the analogue captured audio signal from the microphones and to output the captured audio signal in a suitable digital form.
  • the analogue-to-digital converter 14 can be any suitable analogue-to- digital conversion or processing means.
  • the apparatus 10 audio subsystem further comprises a digital-to-analogue converter 32 for converting digital audio signals from a processor 21 to a suitable analogue format.
  • the digital-to-analogue converter (DAC) or signal processing means 32 can in some embodiments be any suitable DAC technology.
  • the audio subsystem can comprise in some embodiments a speaker 33.
  • the speaker 33 can in some embodiments receive the output from the digital-to-analogue converter 32 and present the analogue audio signal to the user.
  • the speaker 33 can be representative of a headset, for example a set of headphones, or cordless headphones.
  • although the apparatus 10 is shown having both audio capture and audio presentation components, it would be understood that in some embodiments the apparatus 10 can comprise only one of the audio capture and audio presentation parts of the audio subsystem, such that in some embodiments only the microphone (for audio capture) or only the speaker (for audio presentation) is present.
  • the apparatus 10 comprises a processor 21.
  • the processor 21 is coupled to the audio subsystem and specifically in some examples the analogue-to-digital converter 14 for receiving digital signals representing audio signals from the microphone 11, and the digital-to-analogue converter (DAC) 32 configured to output processed digital audio signals.
  • the processor 21 can be configured to execute various program codes.
  • the implemented program codes can comprise for example audio classification and audio scene mapping code routines.
  • the program codes can be configured to perform audio scene event detection and device selection indicator generation, wherein the audio scene server 109 can be configured to determine events from multiple received audio recordings to assist the user in selecting an audio recording which is meaningful and does not require the listener to carry out undue searching of all of the audio recordings.
  • the apparatus further comprises a memory 22.
  • the processor is coupled to memory 22.
  • the memory can be any suitable storage means.
  • the memory 22 comprises a program code section 23 for storing program codes implementable upon the processor 21.
  • the memory 22 can further comprise a stored data section 24 for storing data, for example data that has been encoded in accordance with the application or data to be encoded via the application embodiments as described later.
  • the implemented program code stored within the program code section 23, and the data stored within the stored data section 24 can be retrieved by the processor 21 whenever needed via the memory-processor coupling.
  • the apparatus 10 can comprise a user interface 15.
  • the user interface 15 can be coupled in some embodiments to the processor 21.
  • the processor can control the operation of the user interface and receive inputs from the user interface 15.
  • the user interface 15 can enable a user to input commands to the electronic device or apparatus 10, for example via a keypad, and/or to obtain information from the apparatus 10, for example via a display which is part of the user interface 15.
  • the user interface 15 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the apparatus 10 and further displaying information to the user of the apparatus 10.
  • the apparatus further comprises a transceiver 13, the transceiver in such embodiments can be coupled to the processor and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network.
  • the transceiver 13 or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
  • the coupling can, as shown in Figure 1, be the transmission channel 107 (where the apparatus is functioning as the recording device 19 or audio scene server 109) or the further transmission channel 111 (where the device is functioning as the listening device 113 or audio scene server 109).
  • the transceiver 13 can communicate with further devices by any suitable known communications protocol, for example in some embodiments the transceiver 13 or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
  • the apparatus comprises a position sensor 16 configured to estimate the position of the apparatus 10.
  • the position sensor 16 can in some embodiments be a satellite positioning sensor such as a GPS (Global Positioning System), GLONASS or Galileo receiver.
  • the positioning sensor can be a cellular ID system or an assisted GPS system.
  • the apparatus 10 further comprises a direction or orientation sensor.
  • the orientation/direction sensor can in some embodiments be an electronic compass, accelerometer, a gyroscope or be determined by the motion of the apparatus using the positioning estimate.
  • the structure of the electronic device 10 could be supplemented and varied in many ways.
  • the above apparatus 10 in some embodiments can be operated as an audio scene server 109.
  • the audio scene server 109 can comprise a processor, memory and transceiver combination.
  • Figure 3 shows an example alignment apparatus for content time stamping for crowd-sourced media content.
  • the apparatus is configured to create a combined signal from the received time stamped content, to align the non-time stamped content signal and the combined signal, and then to further align both the time stamped and the non-time stamped content signals using the information from the operation of aligning the non-time stamped content signal and the combined signal.
  • the alignment apparatus comprises in some embodiments a non-time stamped content input 201 or suitable input means.
  • the non-time stamped content input 201 is configured to receive audio signals in any suitable format and pass these to the time stamp determiner 205.
  • the non-time stamped content input 201 can be configured to pre-process the non-time stamped content to deliver the audio data in a format suitable for processing by the time stamp determiner 205.
  • the alignment apparatus comprises a time stamped content input 203 or suitable second input means.
  • the time stamped content input 203 can be configured to receive the time stamped audio data in any suitable format and pass the time stamped content audio data to the time stamp determiner 205 for further processing.
  • the time stamped content input 203 and non-time stamp content input 201 can be implemented by a transceiver element of an audio scene server receiving audio signals from various capture devices recording audio signals within the audio scene and/or from the memory or store associated with the audio scene server.
  • in some embodiments the alignment apparatus represents a combination of an audio content recorder or capturer, audio scene server and listener apparatus, and thus receives the time stamped content via a communications coupling such as a wireless communications link.
  • the alignment apparatus comprises a time stamp determiner 205 or suitable alignment means.
  • the time stamp determiner 205 is configured to receive the audio data from the non-time stamp content input 201 and also the audio data from the time stamp content input 203 and determine (or align) a time stamp for the non-time stamp content input based on the time stamped content audio signals.
  • the time stamp determiner 205 can output the audio data, both the received time stamped content input and the aligned non-time stamp content to a content renderer 207.
  • the alignment apparatus comprises a content renderer 207 or rendering means.
  • the content renderer 207 is configured in some embodiments to receive the audio signals and render these into an audio signal suitable for consumption.
  • the content renderer 207 can be configured to generate a multi-channel audio signal from the input content audio signals suitable for passing to a listener apparatus.
  • the content renderer 207 can be implemented in the audio scene server, in the content recorder or capturer, or in the content listener apparatus.
  • the implementation of content rendering is generally known and will not be described any further.
  • the apparatus further comprises a content processor 209 configured to receive the rendered audio signal data from the content renderer 207 and process it in order that it can be displayed or listened to by the end user.
  • the content processor 209 can in some embodiments control the content renderer 207 to produce a suitable rendered audio signal at a determined position, or configuration.
  • the content processor 209 can in some embodiments be implemented within a listener apparatus.
  • The operation of processing the content, or consuming the content, is shown in Figure 4 by step 309.
  • with respect to Figure 5 the time stamp determiner 205 is shown in further detail. Furthermore with respect to Figure 6 the operation of the time stamp determiner as shown in Figure 5 is described in further detail.
  • the time stamp determiner 205 comprises a signal combiner 401 or combiner means.
  • the signal combiner 401 is configured to receive at least two of the time stamped audio content signals and combine these signals to generate a single combined time stamp signal.
  • the time stamp audio signals 651 can be combined in the signal combiner 401 to generate a combined signal.
  • the combined signal can be created by a suitable averaging and appending means according to the following mathematical expression,
  • x_CS(n) = (1 / nClip_tIdx) * sum_{m=0}^{nClip_tIdx - 1} x_{clipIdx_tIdx(m)}(n), for t_start_tIdx <= n < t_end_tIdx
  • where x is the audio signal (or representative signal derived from the audio signal) of the timestamped crowd-sourced content,
  • nClip_tIdx and clipIdx_tIdx describe the number of content items and the content item indices for the tIdx-th time segment, respectively, and
  • t_start_tIdx and t_end_tIdx describe the start and end times of the segment for the tIdx-th time segment, respectively.
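  • As a minimal illustration of this averaging-and-appending rule, the following Python sketch averages timestamped clips where they overlap and passes single-clip regions through unchanged; the function name, clip layout and sample values are assumptions for illustration, not taken from the patent:

    import numpy as np

    # Sketch of the combining rule above: where timestamped clips overlap in
    # time their samples are averaged; where only one clip covers a region
    # its samples are appended unchanged. Clip start positions are given in
    # samples on the common timeline and are assumed to fit in timeline_len.
    def combine_signals(clips, timeline_len):
        """clips: list of (start_sample, samples) pairs on a common time base."""
        combined = np.zeros(timeline_len)
        count = np.zeros(timeline_len)           # clips covering each sample
        for start, samples in clips:
            end = start + len(samples)
            combined[start:end] += samples
            count[start:end] += 1
        covered = count > 0
        combined[covered] /= count[covered]      # average concurrent clips
        return combined

    # Example: two overlapping clips and one disjoint clip
    fs = 8000
    rng = np.random.default_rng(0)
    clips = [(0, rng.standard_normal(2 * fs)),
             (fs, rng.standard_normal(2 * fs)),  # overlaps the first clip
             (4 * fs, rng.standard_normal(fs))]  # appended, no averaging
    cs = combine_signals(clips, 5 * fs)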
  • the combined signal 652 can then be output by the signal combiner 401 to a combined signal aligner determiner 403.
  • in some embodiments the time stamp determiner comprises a combined signal aligner determiner 403.
  • the combined signal aligner determiner 403 is configured to receive the combined signal from the signal combiner 401 containing a time stamp and also the non-time stamped content audio signals and attempt to align these signals.
  • the combined signal 606 CX and a non-time stamped content signal 607 D are shown.
  • the combined signal aligner 403 can be configured to determine the time offset between the signals, in other words whether the non-time stamped content audio signal is delayed with respect to these combined signals with time stamped content values or vice versa.
  • the determination of the time offset can be carried out via a time offset determiner or suitable signal aligner means or function.
  • the output tOffset therefore in such embodiments contains corresponding time offset values for each of the input signals.
  • the determination of time offset can be carried out by any suitable function, such as a correlation analysis as indicated by Carter, Nuttall and Cable in "The Smoothed Coherence Transform", Proceedings of the IEEE, Vol. 61, No. 10, October 1973.
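  • As a rough sketch of such an offset search, the following Python uses plain cross-correlation rather than the smoothed coherence transform cited above; all names are illustrative assumptions:

    import numpy as np

    def time_offset(combined, candidate, max_lag):
        """Return the lag (in samples) of `candidate` relative to `combined`
        that maximises their normalised cross-correlation, searched over
        lags in [-max_lag, max_lag]."""
        best_lag, best_score = 0, -np.inf
        for lag in range(-max_lag, max_lag + 1):
            if lag >= 0:
                a, b = combined[lag:], candidate   # candidate starts later
            else:
                a, b = combined, candidate[-lag:]  # candidate starts earlier
            n = min(len(a), len(b))
            if n == 0:
                continue
            score = float(np.dot(a[:n], b[:n])) / n
            if score > best_score:
                best_lag, best_score = lag, score
        return best_lag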
  • the output of the combined signal aligner determiner 403 can then be passed in some embodiments to the full signal aligner 405.
  • the time stamp determiner 205 comprises a full signal aligner 405 or suitable full signal alignment means configured to align or time stamp the non-time stamped content signal with respect to the time stamped content, in other words to map the output time offsets to actual input content items. For example using the signals A, B, C and D from Figure 7, the following offsets can be determined,
  • D_offset = tOffset(1)
  • tOffset2 = {A_offset, B_offset, C_offset, D_offset}
  • the time offset for all of these is equal to the time offset of CX.
  • time stamps of previously time stamped and non-time stamped content items can be updated to use a common time base. This can be performed, for example in some embodiments, by having the following information for each content item, Table 1:
  • grpID - group identification value describing the continuous timeline the content item belongs to
  • startTS - start timestamp of the content item within the continuous timeline
  • stopTS - stop timestamp of the content item within the continuous timeline
  • the grpID is used to identify crowd-sourced content items that are continuous in time.
  • content audio signals A, B, and C could in such embodiments be assigned the same grpID.
  • the start and end timestamps identify the positions of the content within the continuous timeline either in absolute terms or in relative terms (with respect to content A in the example Figure 7 as that appears first in the continuous timeline).
  • the following information can be generated from the input content items, the creation of the combined signal(s) and the output of the time offset function, Table 2:
  • nGroups - number of groups that share the same continuous timeline
  • trackIdx - number of combined signals
  • trackIdxLeft - indices of the non-timestamped content items
  • grpCombID - group index for the signals in Equation (2)
  • grpCombIdxNum - number of overlapping clips per content signal
  • trackIdxLeft describes the indices of the non-timestamped content items in the order they appear in the input content. For example, trackIdxLeft for the example signal set in Figure 7 would be trackIdxLeft = {D}, as content D is the first content item that appears after the combined signals.
  • grpCombID indicates which of the input signals in the time offset function share the same continuous timeline. In some embodiments it can be assumed that the time_offset() function can provide this information as an output value. Where for example signals do not share the same continuous timeline, the time offset function can be configured to assign a time offset in such a way that the different continuous timelines do not overlap.
  • grpCombIdxNum indicates the number of clips (including the content signal itself) that the particular content signal is overlapping with.
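  • The bookkeeping above can be pictured as a small record; this layout is an illustrative assumption, not the patent's data structure:

    from dataclasses import dataclass, field

    @dataclass
    class GroupingInfo:
        # Illustrative container for the Table 2 values described above;
        # the field names follow the text but the layout is an assumption.
        nGroups: int = 0            # groups sharing a continuous timeline
        trackIdx: int = 0           # number of combined signals
        trackIdxLeft: list = field(default_factory=list)   # non-timestamped item indices
        grpCombID: list = field(default_factory=list)      # timeline group per input signal
        grpCombIdxNum: list = field(default_factory=list)  # overlapping clip counts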
  • the alignment or mapping can be performed using such generated values as described herein according to the following pseudo code,
  • M is the number of input content items which includes the time stamped and non-time stamped content audio signals that have been specified for processing.
  • lines 6 to 51 are repeated for each of the non-time stamped content items.
  • the group identification value grpID of the non-time stamped input content is checked to determine if it is unknown, as shown in line 8 of the above pseudo code. Furthermore, after determining the time offset, the pseudo code checks whether the non-time stamped audio signal overlaps with some of the other input signals, as shown on line 10.
  • lines 12 to 15 determine whether the signal is overlapping with any of the combined signals. Where no overlapping group is found, the signal parameters, that is the input content item index (line 20) and the time offset (line 21), are appended to the output grouping data as shown in lines 19 to 22 of the pseudo code.
  • as shown in lines 24 onwards, the full signal aligner checks the input content item to determine whether the audio signal item belongs to the combined signal (line 29). Where this is the situation the values are appended to the output grouping data as shown in lines 38 to 40. Furthermore the input content item that is overlapping with the combined signal is also appended, as shown in lines 45 to 48. Lines 31 to 34 of the pseudo code check that the signal parameters are appended only once to the output grouping data.
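  • A hedged Python sketch of this mapping step, paraphrasing the prose above (the patent's own pseudo code is not reproduced in this text, so the structure and names here are assumptions): each non-timestamped item either joins the group of a combined signal whose span it overlaps, or starts a new timeline group.

    # items: indices of non-timestamped content; offsets: per-item lags from
    # the time offset function; combined_groups: {group_id: (start, end)}
    # spans of each combined signal on its timeline. All names illustrative.
    def map_offsets(items, offsets, combined_groups):
        groups = {gid: [] for gid in combined_groups}
        next_gid = max(combined_groups, default=-1) + 1
        for idx, off in zip(items, offsets):
            overlap = next((gid for gid, (s, e) in combined_groups.items()
                            if s <= off < e), None)
            if overlap is None:                  # no overlapping combined signal:
                groups[next_gid] = [(idx, off)]  # start a new timeline group
                next_gid += 1
            elif (idx, off) not in groups[overlap]:
                groups[overlap].append((idx, off))  # append only once
        return groups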
  • the full signal aligner can then in some embodiments finalise the mapping by updating the timings and the group identification values of those input content items that have been identified as sharing continuous timelines. In some embodiments this finalisation operation can be described with regard to the following pseudo code,
  • lines 2 to 41 of the above code are repeated for each group that shares a continuous timeline. Where the group has more than one content signal, as determined by line 4, the corresponding input content information is updated.
  • lines 6 to 11 determine the reference group identification value for the content signals.
  • lines 13 to 16 determine the start time for the corresponding group.
  • lines 18 to 26 update the start time of each content item within the group to match the relative time differences as defined by the offset time value described herein.
  • Lines 28 to 38 of the code update the start and stop times and the group identification value information for each content item within the identified group.
  • the variable tServerStatus in line 35 is used to indicate that the timing information for the content item is obtained through the combined signal processing mode.
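  • A minimal sketch of this finalisation pass, under the assumption that each item carries startTS, stopTS and an offset on the group timeline (the field names follow the text, but the update rule is a plausible reading rather than the patent's pseudo code):

    # Shift every item in a group so start times become offsets relative to
    # the earliest item, preserve durations, and stamp the shared grpID.
    def finalise(group_items, grp_id):
        base = min(item['offset'] for item in group_items)
        for item in group_items:
            duration = item['stopTS'] - item['startTS']
            item['startTS'] = item['offset'] - base   # relative to group start
            item['stopTS'] = item['startTS'] + duration
            item['grpID'] = grp_id
            item['tServerStatus'] = True  # timing obtained via combined-signal mode
        return group_items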
  • the time stamped and non-time stamped content items can be aligned based on the guidance information as indicated herein.
  • the guidance information can in some embodiments include the variable tServerStatus and the corresponding startTS information for each input content item where available.
  • the alignment can be carried out according to the following expression,
  • [startTS, stopTS, grpID] = time_align(startTS, stopTS, grpID, <input_content_items>)
  • time_align() defines the function that determines the alignment for the specified input.
  • the function can take as an input the start and end times and the group identification value along with the actual media content.
  • the output of such a function would be the updated timings and group identification values for each of the input content.
  • the operation of determining the time offset can be any suitable approach such as shown previously with regards to correlation analysis.
  • the start and stop times of the signal pair are updated such that the time offset window (toWindow) for the pair can be limited to a small value.
  • the value can be set to 1 second, indicating that the signals in the pair are aligned within + or - 1 second with respect to each other.
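  • For illustration, restricting the lag search to such a window might look as follows in Python (a plain windowed cross-correlation; the signals, sample rate and window value are made up for the example):

    import numpy as np

    fs = 4000
    toWindow = 1.0                       # pair assumed pre-aligned to +/- 1 s
    max_lag = int(toWindow * fs)         # only these lags are searched

    rng = np.random.default_rng(0)
    combined = rng.standard_normal(5 * fs)
    candidate = combined[int(0.4 * fs):]     # true offset 0.4 s, inside window

    # full cross-correlation, then keep only the lags within the window
    full = np.correlate(combined, candidate, mode="full")
    lags = np.arange(-(len(candidate) - 1), len(combined))
    inside = (lags >= -max_lag) & (lags <= max_lag)
    best = lags[inside][np.argmax(full[inside])]
    print(best / fs, "seconds")              # ~0.4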
  • this provides a major operational efficiency, as the final time offset needs to be searched only over a very limited time period.
  • the combined signal represents the entire event scene, and it is therefore highly unlikely that the time offsets returned by the expressions herein are not optimal, even though the geographical area of the event scene may cover a large area where different content signals can be located quite far from each other.
  • the steps of processing as described herein are designed to overcome the challenges the size of the event scene may bring to determining the time offsets between various content in a robust and reliable manner.
  • the combined signal may be computed more than once. This for example can be carried out where the number of content items to be covered is large.
  • a suitable combination, for example a random or fixed combination, of the already time stamped content can be used to create the combined signal and then, following the steps as described herein, the final time stamps in the common time base can be determined.
  • the combined signal can in some embodiments be recreated from time to time with differences in composition, including some new content items compared to the previously created combined signal or replacing some content items with new items in the combined signal, and then repeating the processing steps for alignment.
  • the output values from each of these combined signal iterations can then be saved and the final time differences for the content items can be computed from the saved values.
  • the data analysis can be performed to determine the time offset value for each content item in a converging form. This converging form can be extracted using mean and standard variance calculations, where the final output is a mean value which excludes any outlier values identified by the standard variance (see the sketch following this list).
  • embodiments may also be applied to audio-video signals, where the audio signal components of the recorded data are processed to determine the base signal and the time alignment factors for the remaining signals, and the video signal components may be synchronised using the above embodiments of the invention.
  • the video parts may be synchronised using the audio synchronisation information.
  • user equipment is intended to cover any suitable type of wireless user equipment, such as mobile telephones, portable data processing devices or portable web browsers.
  • PLMN public land mobile network
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
  • While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • the embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
  • any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
  • the software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVD and the data variants thereof, and CD.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate. Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules.
  • the resultant design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
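As a concrete reading of the converging mean and standard variance calculation mentioned in the list above, the following Python sketch iteratively discards time offset estimates lying more than one standard deviation from the mean until the estimate stabilises. The one-sigma threshold, the iteration cap and the sample values are illustrative assumptions, not taken from the patent.

import statistics

def converged_offset(offsets, max_rounds=10):
    # Iteratively average the saved per-iteration offsets for one content
    # item, discarding values more than one standard deviation from the mean.
    vals = list(offsets)
    for _ in range(max_rounds):
        mu = statistics.mean(vals)
        sd = statistics.pstdev(vals)
        kept = [v for v in vals if abs(v - mu) <= sd] or vals
        if len(kept) == len(vals):
            return mu  # no outliers removed: the estimate has converged
        vals = kept
    return statistics.mean(vals)

# Offsets for one item from several combined signal iterations (illustrative)
print(converged_offset([1.49, 1.51, 1.50, 3.2, 1.48]))  # approximately 1.5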


Abstract

An apparatus comprising: a signal combiner configured to generate a combined audio signal from at least two first audio signals; a combined signal comparator configured to compare the combined audio signal to at least one second audio signal; a signal aligner configured to determine an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and a full signal aligner configured to associate the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal.

Description

AN AUDIO SCENE ALIGNMENT APPARATUS
Field

The present application relates to apparatus for the processing of audio and additionally audio-video signals to enable alignment of audio signals. The invention further relates to, but is not limited to, apparatus for processing audio and additionally audio-video signals from mobile devices.

Background
Viewing recorded or streamed audio-video or audio content is well known. Commercial broadcasters covering an event often have more than one recording device (video-camera/microphone) and a programme director will select a 'mix' where an output from a recording device or combination of recording devices is selected for transmission.
Multiple 'feeds' may be found in sharing services for video and audio signals (such as those employed by YouTube). Such systems are known and widely used to share user generated content recorded and uploaded or up-streamed to a server and then downloaded or down-streamed to a viewing/listening user. Such systems rely on users recording and uploading or up-streaming a recording of an event using the recording facilities at hand to the user. This may typically be in the form of the camera and microphone arrangement of a mobile device such as a mobile phone.
Often the event is attended and recorded from more than one position by different recording users at the same time. The viewing/listening end user may then select one of the up-streamed or uploaded data to view or listen.
Summary

Aspects of this application thus provide an audio signal alignment process whereby multiple devices can record audio signals and these audio signals can be aligned to permit audio source selection. There is provided according to a first aspect an apparatus comprising at least one processor and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured to with the at least one processor cause the apparatus to at least perform: generating a combined audio signal from at least two first audio signals; comparing the combined audio signal to at least one second audio signal; determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and associating the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal.
The apparatus may be further caused to perform: receiving the at least two first audio signals from an audio scene; receiving the at least one second audio signal from the audio scene. The apparatus may be further caused to perform aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
The apparatus may be further caused to render the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
Generating a combined audio signal from at least two first audio signals may cause the apparatus to perform: generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and appending to the combined audio signal an available at least two first audio signal part otherwise. Comparing the combined audio signal to at least one second audio signal may cause the apparatus to perform a cross-correlation between the combined audio signal and the at least one second audio signal. Determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal may cause the apparatus to perform determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal. Associating the alignment value to the at least two first audio signals may cause the apparatus to perform: assigning the at least two first audio signals to a first group of audio signals; assigning the at least one second audio signal to a second group of audio signals; assigning a null alignment value to the first group of audio signals; and assigning the alignment value to the second group of audio signals.
The apparatus may be further caused to perform: generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value; comparing the further combined audio signal to at least one further second audio signal; determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal; and associating the alignment value to the at least one further second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
The first audio signals may comprise timestamp information and the second audio signals lack the timestamp information.
According to a second aspect there is provided a method comprising: generating a combined audio signal from at least two first audio signals; comparing the combined audio signal to at least one second audio signal; determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and associating the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal.
The method may further comprise: receiving the at least two first audio signals from an audio scene; receiving the at least one second audio signal from the audio scene.
The method may further comprise aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
The method may further comprise rendering the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
Generating a combined audio signal from at least two first audio signals may comprise: generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and appending to the combined audio signal an available at least two first audio signal part otherwise.
Comparing the combined audio signal to at least one second audio signal may comprise a cross-correlation between the combined audio signal and the at least one second audio signal.
Determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal may comprise determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
Associating the alignment value to the at least two first audio signals may comprise: assigning the at least two first audio signals to a first group of audio signals; assigning the at least one second audio signal to a second group of audio signals; assigning a null alignment value to the first group of audio signals; and assigning the alignment value to the second group of audio signals.
The method may further comprise: generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value; comparing the further combined audio signal to at least one further second audio signal; determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal; and associating the alignment value to the at least one further second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
The first audio signals may comprise timestamp information and the second audio signals lack the timestamp information.
According to a third aspect there is provided an apparatus comprising: a signal combiner configured to generate a combined audio signal from at least two first audio signals; a combined signal comparator configured to compare the combined audio signal to at least one second audio signal; a signal aligner configured to determine an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and a full signal aligner configured to associate the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal.
The apparatus may further comprise: a first signal receiver configured to receive the at least two first audio signals from an audio scene; and a second signal receiver configured to receive the at least one second audio signal from the audio scene. The apparatus may further comprise a delay configured to align at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value. The apparatus may further comprise a signal renderer configured to render the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
The signal combiner may comprise: an averager configured to generate a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and an appender configured to append to the combined audio signal an available at least two first audio signal part otherwise. The signal comparator may comprise a cross-correlator configured to generate a cross-correlation product between the combined audio signal and the at least one second audio signal.
The alignment determiner may comprise an offset determiner configured to determine a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
The full signal aligner may comprise: a first assigner configured to assign the at least two first audio signals to a first group of audio signals; a second assigner configured to assign the at least one second audio signal to a second group of audio signals; a null assigner configured to assign a null alignment value to the first group of audio signals; and a delay assigner configured to assign the alignment value to the second group of audio signals. The signal combiner may be configured to further combine an audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value. The comparator may be configured to further compare the further combined audio signal to at least one further second audio signal.
The aligner may be configured to further determine a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal.
The full signal aligner may be further configured to associate the alignment value to the at least one further second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
The first audio signals may comprise timestamp information and the second audio signals lack the timestamp information.
According to a fourth aspect there is provided an apparatus comprising: means for generating a combined audio signal from at least two first audio signals; means for comparing the combined audio signal to at least one second audio signal; means for determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and means for associating the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal. The apparatus may further comprise: means for receiving the at least two first audio signals from an audio scene; means for receiving the at least one second audio signal from the audio scene.
The apparatus may further comprise means for aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value. The apparatus may further comprise means for rendering the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting. The means for generating a combined audio signal from at least two first audio signals may comprise: means for generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and means for appending to the combined audio signal an available at least two first audio signal part otherwise.
The means for comparing the combined audio signal to at least one second audio signal may comprise means for generating a cross-correlation product between the combined audio signal and the at least one second audio signal. The means for determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal may comprise means for determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
The means for associating the alignment value to the at least two first audio signals may comprise: means for assigning the at least two first audio signals to a first group of audio signals; means for assigning the at least one second audio signal to a second group of audio signals; means for assigning a null alignment value to the first group of audio signals; and means for assigning the alignment value to the second group of audio signals.
The means for combining may further comprise means for generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value.
The means for comparing may further comprise means for comparing the further combined audio signal to at least one further second audio signal. The means for determining an alignment value may comprise means for determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal. The means for associating the alignment value may comprise means for associating the alignment value to the at least one further second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal. The first audio signals may comprise timestamp information and the second audio signals lack the timestamp information.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein. Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures

For better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically a multi-user free-viewpoint service sharing system which may encompass embodiments of the application;
Figure 2 shows schematically an apparatus suitable for being employed in embodiments of the application;
Figure 3 shows schematically an audio signal system according to some embodiments;

Figure 4 shows a flow diagram of the operation of the audio signal system as shown in Figure 3;
Figure 5 shows schematically the time stamp determiner shown in Figure 3 in further detail according to some embodiments;
Figure 6 shows the operation of the time stamp determiner shown in Figure 5 according to some embodiments; and
Figure 7 shows an example set of time stamped and non-time stamped audio signals to be aligned according to embodiments.

Embodiments of the Application
The following describes in further detail suitable apparatus and possible mechanisms for the provision of effective audio signal alignment. In the following examples, audio signals and audio capture signals are described. However it would be appreciated that in some embodiments the audio signal/audio capture is a part of an audio-video system.
The concept of this application is related to assisting in the production of immersive person-to-person communication and can include video. It would be understood that the space within which the devices record the audio signal can be arbitrarily positioned within an event space. The captured signals as described herein are transmitted or alternatively stored for later consumption, where the end user can select the listening point based on their preference from the reconstructed audio space. The rendering part can then provide one or more downmixed signals from the multiple recordings that correspond to the selected listening point. It would be understood that each recording device can record the event scene and upload or upstream the recorded content. The upload or upstream process can implicitly include positioning information about where the content is being recorded.
However in such systems some audio signal content can be uploaded in non-real time. In other words the media content in the form of a captured audio signal can be uploaded a few minutes, hours, days or weeks after the event. Thus the amount of content that represents each particular event and is held in the server can fluctuate as a function of time. Furthermore uploaded audio signal content typically does not employ common time keeping or time stamping, and therefore newly uploaded or streamed content needs to be aligned to use a common time stamping before any downmixed signal is provided for consumption. As new content can be uploaded to the server at any time, a situation could arise where the existing content already uses a common time stamping but new content lacks this time stamp and should be transformed to this time stamping mode.
The concept of this application therefore is to provide an enabler for the case where some of the audio signal content for the event is already converted to a common time stamping and the conversion is to be applied to any new uploaded audio signal content.
Although a common time base can be achieved with a dedicated synchronisation signal, for example where the capture devices are equipped to receive a specific beacon signal or timing information obtained through a network or other received data such as positioning satellite timing data (such as from a GPS satellite), this can be problematic. For example the use of a beacon signal typically requires special hardware and/or software installations in the recording or capture apparatus, which limits the applicability of multiuser sharing services as recording devices become too expensive for mass use, or limits the use of existing devices. Although GPS or other satellite synchronisation timing signals can be used, the device requires a GPS or other satellite positioning receiver to receive the signal, and furthermore could not be used in circumstances where a GPS or satellite positioning signal is not available - such as, for example, indoors, in heavily built-up urban areas, or in woodland or forest regions. In some synchronisation systems a network time protocol (NTP) is used to time stamp the recorded content from multiple users. In these examples the recording devices synchronise the recordings against an NTP reference. However an NTP reference requires a network connection, which may not be available in all situations, and typically timing errors can be introduced into the time stamps due to transmission delays.
Furthermore although signal-to-signal synchronisation using correlation could be employed, this is impractical for applications where the number of recordings increases, as the processing requirement increases exponentially rather than linearly. Furthermore any time skew between multiple content recordings has to be limited to an order of tens of seconds, otherwise the computational complexity of the correlation approach becomes a problem.
With respect to Figure 1 an overview of a suitable system within which embodiments of the application can be located is shown. The audio space 1 can have located within it at least one recording or capturing device or apparatus 19 arbitrarily positioned within the audio space to record suitable audio scenes. The apparatus 19 shown in Figure 1 are represented as microphones with a polar gain pattern 101 showing the directional audio capture gain associated with each apparatus. The apparatus 19 in Figure 1 are shown such that some of the apparatus are capable of attempting to capture the audio scene or activity 103 within the audio space. The activity 103 can be any event the user of the apparatus wishes to capture. For example the event could be a music event or audio of a "news worthy" event. Although the apparatus 19 are shown having a directional microphone gain pattern 101, it would be appreciated that in some embodiments the microphone or microphone array of the recording apparatus 19 has an omnidirectional gain or a different gain profile to that shown in Figure 1.
Each recording apparatus 19 can in some embodiments transmit or alternatively store for later consumption the captured audio signals via a transmission channel 107 to an audio scene server 109. The recording apparatus 19 in some embodiments can encode the audio signal to compress the audio signal in a known way in order to reduce the bandwidth required in "uploading" the audio signal to the audio scene server 109. The recording apparatus 19 in some embodiments can be configured to estimate and upload via the transmission channel 107 to the audio scene server 109 an estimation of the location and/or the orientation or direction of the apparatus. The position information can be obtained, for example, using GPS coordinates, cell-ID or a-GPS or any other suitable location estimation methods and the orientation/direction can be obtained, for example using a digital compass, accelerometer, or gyroscope information.
In some embodiments the recording apparatus 19 can be configured to capture or record one or more audio signals, for example the apparatus in some embodiments has multiple microphones each configured to capture the audio signal from a different direction. In such embodiments the recording device or apparatus 19 can record and provide more than one signal from the different directions/orientations and further supply position/direction information for each signal. With respect to the application described herein an audio or sound source can be defined as each of the captured or audio recorded signals. In some embodiments each audio source can be defined as having a position or location which can be an absolute or relative value. For example in some embodiments the audio source can be defined as having a position relative to a desired listening location or position. Furthermore in some embodiments the audio source can be defined as having an orientation, for example where the audio source is a beamformed processed combination of multiple microphones in the recording apparatus, or a directional microphone. In some embodiments the orientation may have both a directionality and a range, for example defining the 3dB gain range of a directional microphone.
The capturing and encoding of the audio signal and the estimation of the position/direction of the apparatus is shown in Figure 1 by step 1001. The uploading of the audio and position/direction estimate to the audio scene server 109 is shown in Figure 1 by step 1003. The audio scene server 109 furthermore can in some embodiments communicate via a further transmission channel 111 to a listening device 113.
In some embodiments the listening device 113, which is represented in Figure 1 by a set of headphones, can prior to or during downloading via the further transmission channel 111 select a listening point, in other words select a position such as indicated in Figure 1 by the selected listening point 105. In such embodiments the listening device 113 can communicate via the further transmission channel 111 to the audio scene server 109 the request.
The selection of a listening position by the listening device 113 is shown in Figure 1 by step 1005.
The audio scene server 109 can as discussed above in some embodiments receive from each of the recording apparatus 19 an approximation or estimation of the location and/or direction of the recording apparatus 19. The audio scene server 109 can in some embodiments from the various captured audio signals from recording apparatus 19 produce a composite audio signal representing the desired listening position and the composite audio signal can be passed via the further transmission channel 111 to the listening device 113.
The generation or supply of a suitable audio signal based on the selected listening position indicator is shown in Figure 1 by step 1007. In some embodiments the listening device 113 can request a multiple channel audio signal or a mono-channel audio signal. This request can in some embodiments be received by the audio scene server 109 which can generate the requested multiple channel data. The audio scene server 109 in some embodiments can receive each uploaded audio signal and can keep track of the positions and the associated direction/orientation associated with each audio source. In some embodiments the audio scene server 109 can provide a high level coordinate system which corresponds to locations where the uploaded/upstreamed content source is available to the listening device 113. The "high level" coordinates can be provided for example as a map to the listening device 113 for selection of the listening position. The listening device (end user or an application used by the end user) can in such embodiments be responsible for determining or selecting the listening position and sending this information to the audio scene server 109. The audio scene server 109 can in some embodiments receive the selection/determination and transmit the downmixed signal corresponding to the specified location to the listening device. In some embodiments the listening device/end user can be configured to select or determine other aspects of the desired audio signal, for example signal quality, number of channels of audio desired, etc. In some embodiments the audio scene server 109 can provide a selected set of downmixed signals which correspond to listening points neighbouring the desired location/direction and the listening device 113 selects the audio signal desired.
In this regard reference is first made to Figure 2 which shows a schematic block diagram of an exemplary apparatus or electronic device 10, which may be used to record (or operate as a recording device 19) or listen (or operate as a listening device 113) to the audio signals (and similarly to record or view the audio-visual images and data). Furthermore in some embodiments the apparatus or electronic device can function as the audio scene server 109.
The electronic device 10 may for example be a mobile terminal or user equipment of a wireless communication system when functioning as the recording device or listening device 113. In some embodiments the apparatus can be an audio player or audio recorder, such as an MP3 player, a media recorder/player (also known as an MP4 player), or any suitable portable device suitable for recording audio or audio/video camcorder/memory audio or video recorder.
The apparatus 10 can in some embodiments comprise an audio subsystem. The audio subsystem for example can comprise in some embodiments a microphone or array of microphones 11 for audio signal capture. In some embodiments the microphone or array of microphones can be a solid state microphone, in other words capable of capturing audio signals and outputting a suitable digital format signal. In some other embodiments the microphone or array of microphones 11 can comprise any suitable microphone or audio capture means, for example a condenser microphone, capacitor microphone, electrostatic microphone, Electret condenser microphone, dynamic microphone, ribbon microphone, carbon microphone, piezoelectric microphone, or microelectrical-mechanical system (MEMS) microphone. The microphone 11 or array of microphones can in some embodiments output the audio captured signal to an analogue-to-digital converter (ADC) 14.
In some embodiments the apparatus can further comprise an analogue-to-digital converter (ADC) 14 configured to receive the analogue captured audio signal from the microphones and outputting the audio captured signal in a suitable digital form. The analogue-to-digital converter 14 can be any suitable analogue-to-digital conversion or processing means.
In some embodiments the apparatus 10 audio subsystem further comprises a digital-to-analogue converter 32 for converting digital audio signals from a processor 21 to a suitable analogue format. The digital-to-analogue converter (DAC) or signal processing means 32 can in some embodiments be any suitable DAC technology.
Furthermore the audio subsystem can comprise in some embodiments a speaker 33. The speaker 33 can in some embodiments receive the output from the digital-to-analogue converter 32 and present the analogue audio signal to the user. In some embodiments the speaker 33 can be representative of a headset, for example a set of headphones, or cordless headphones. Although the apparatus 10 is shown having both audio capture and audio presentation components, it would be understood that in some embodiments the apparatus 10 can comprise one or the other of the audio capture and audio presentation parts of the audio subsystem such that in some embodiments of the apparatus the microphone (for audio capture) or the speaker (for audio presentation) are present.
In some embodiments the apparatus 10 comprises a processor 21. The processor 21 is coupled to the audio subsystem and specifically in some examples the analogue-to-digital converter 14 for receiving digital signals representing audio signals from the microphone 11, and the digital-to-analogue converter (DAC) 12 configured to output processed digital audio signals. The processor 21 can be configured to execute various program codes. The implemented program codes can comprise for example audio classification and audio scene mapping code routines. In some embodiments the program codes can be configured to perform audio scene event detection and device selection indicator generation, wherein the audio scene server 109 can be configured to determine events from multiple received audio recordings to assist the user in selecting an audio recording which is meaningful and does not require the listener to carry out undue searching of all of the audio recordings.
In some embodiments the apparatus further comprises a memory 22. In some embodiments the processor is coupled to memory 22. The memory can be any suitable storage means. In some embodiments the memory 22 comprises a program code section 23 for storing program codes implementable upon the processor 21. Furthermore in some embodiments the memory 22 can further comprise a stored data section 24 for storing data, for example data that has been encoded in accordance with the application or data to be encoded via the application embodiments as described later. The implemented program code stored within the program code section 23, and the data stored within the stored data section 24 can be retrieved by the processor 21 whenever needed via the memory-processor coupling. In some further embodiments the apparatus 10 can comprise a user interface 15. The user interface 15 can be coupled in some embodiments to the processor 21. In some embodiments the processor can control the operation of the user interface and receive inputs from the user interface 15. In some embodiments the user interface 15 can enable a user to input commands to the electronic device or apparatus 10, for example via a keypad, and/or to obtain information from the apparatus 10, for example via a display which is part of the user interface 15. The user interface 15 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the apparatus 10 and further displaying information to the user of the apparatus 10.
In some embodiments the apparatus further comprises a transceiver 13, the transceiver in such embodiments can be coupled to the processor and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver 13 or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The coupling can, as shown in Figure 1, be the transmission channel 107 (where the apparatus is functioning as the recording device 19 or audio scene server 109) or further transmission channel 111 (where the device is functioning as the listening device 113 or audio scene server 109). The transceiver 13 can communicate with further devices by any suitable known communications protocol, for example in some embodiments the transceiver 13 or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
In some embodiments the apparatus comprises a position sensor 16 configured to estimate the position of the apparatus 10. The position sensor 16 can in some embodiments be a satellite positioning sensor such as a GPS (Global Positioning System), GLONASS or Galileo receiver.
In some embodiments the positioning sensor can be a cellular ID system or an assisted GPS system. In some embodiments the apparatus 10 further comprises a direction or orientation sensor. The orientation/direction sensor can in some embodiments be an electronic compass, accelerometer, a gyroscope or be determined by the motion of the apparatus using the positioning estimate.
It is to be understood again that the structure of the electronic device 10 could be supplemented and varied in many ways. Furthermore it could be understood that the above apparatus 10 in some embodiments can be operated as an audio scene server 109. In some further embodiments the audio scene server 109 can comprise a processor, memory and transceiver combination.

Figure 3 shows an example alignment apparatus for content time stamping for crowd-sourced media content. In these circumstances the apparatus is configured to create a combined signal for the received time stamped content, to align the non-time stamped content signal and the combined signal and then to further align both the time stamped and the non-time stamped content signal using the information from the operation of aligning the non-time stamped content signal and the combined signal.
The alignment apparatus comprises in some embodiments a non-time stamped content input 201 or suitable input means. The non-time stamped content input 201 is configured to receive audio signals in any suitable format and pass these to the time stamp determiner 205. In some embodiments the non-time stamped content input 201 can be configured to pre-process the non-time stamped content to deliver the audio data in a format suitable for processing by the time stamp determiner 205.
The operation of receiving the non-time stamped content is shown in Figure 4 by step 301. Furthermore in some embodiments the alignment apparatus comprises a time stamped content input 203 or suitable second input means. The time stamped content input 203 can be configured to receive the time stamped audio data in any suitable format and pass the time stamped content audio data to the time stamp determiner 205 for further processing.
The operation of receiving the time stamped content is shown in Figure 4 by step 303. In some embodiments the time stamped content input 203 and non-time stamp content input 201 can be implemented by a transceiver element of an audio scene server receiving audio signals from various capture devices recording audio signals within the audio scene and/or from the memory or store associated with the audio scene server. However it would be appreciated that in some embodiments the alignment apparatus represents a combination of an audio content recorder or capturer and thus receives the time stamped content via a communications coupling such as a wireless communications link, audio scene server and listener apparatus. In some embodiments the alignment apparatus comprises a time stamp determiner 205 or suitable alignment means. The time stamp determiner 205 is configured to receive the audio data from the non-time stamp content input 201 and also the audio data from the time stamp content input 203 and determine (or align) a time stamp for the non-time stamp content input based on the time stamped content audio signals.
In some embodiments the time stamp determiner 205 can output the audio data, both the received time stamped content input and the aligned non-time stamp content to a content renderer 207.
The operation of determining time stamps for the non-time stamp content is shown in Figure 4 by step 305. In some embodiments the alignment apparatus comprises a content renderer 207 or rendering means. The content renderer 207 is configured in some embodiments to receive the audio signals and render these into an audio signal suitable for consumption. For example in some embodiments the content renderer 207 can be configured to generate a multi-channel audio signal from the input content audio signals suitable for passing to a listener apparatus.
In some embodiments the content renderer 207 can be implemented in the audio scene server, in the content recorder or capturer, or in the content listener apparatus. The implementation of content rendering is generally known and will not be described any further.
The operation of rendering the content of the audio signals is shown in Figure 4 by step 307.
In some embodiments the apparatus further comprises a content processor 209 configured to receive the rendered audio signal data from the content renderer 207 and process it in order that it can be displayed or listened to by the end user. As described herein, the content processor 209 can in some embodiments control the content renderer 207 to produce a suitable rendered audio signal at a determined position, or configuration. The content processor 209 can in some embodiments be implemented within a listener apparatus.
The operation of processing the content, or consuming the content is shown in Figure 4 by step 309.
With respect to Figure 5, the timestamp determiner 205 is shown in further detail. Furthermore with respect to Figure 6 the operation of the time stamp determiner as shown in Figure 5 is described in further detail.
In some embodiments the time stamp determiner 205 comprises a signal combiner 401 or combiner means. The signal combiner 401 is configured to receive at least two of the time stamped audio content signals and combine these signals to generate a single combined time stamp signal.
With respect to Figure 7, an example sequence of time stamped audio signals is shown: a first time stamped signal 601A which has a range from T=0 to T=4, a second time stamped signal 603B with a range from T=2 to T=5, and a third time stamped signal 605C with a range from T=1 to T=6. The time stamped audio signals 651 can be combined in the signal combiner 401 to generate a combined signal.
In some embodiments the combined signal can be created by a suitable averaging and appending means according to the following mathematical expression,

CX(t) = (1 / nClip_tIdx) * sum_{i=0}^{nClip_tIdx - 1} x_{clipIdx_tIdx(i)}(t), for t_start_tIdx ≤ t < t_end_tIdx

where x is the audio signal (or representative signal derived from the audio signal) of the timestamped crowd-sourced content, nClip_tIdx and clipIdx_tIdx describe the number of content items and the content item indices for the tIdx-th time segment, respectively, and t_start_tIdx and t_end_tIdx describe the start and end times of the tIdx-th time segment, respectively. In some embodiments the signal combiner 401 can be configured to repeat the combining expression for 0 ≤ tIdx < nT, where nT is the number of time segments in the timeline. Using the example shown in Figure 7, the data for the combined signal would be,

tIdx = 0: clipIdx_tIdx = {A}, nClip_tIdx = 1, t_start_tIdx = 0, t_end_tIdx = 1
tIdx = 1: clipIdx_tIdx = {A, C}, nClip_tIdx = 2, t_start_tIdx = 1, t_end_tIdx = 2
...
tIdx = 4: clipIdx_tIdx = {C}, nClip_tIdx = 1, t_start_tIdx = 5, t_end_tIdx = 6

The combined signal generated by the signal combiner 401 therefore mixes the overlapping content for each segment to obtain a signal which represents the average signal covering the whole of the time stamped content range.
As shown in Figure 7 the combined signal 606 CX has a range from T=0 to T=6. The combined signal 652 can then be output by the signal combiner 401 to a combined signal aligner determiner 403.
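To illustrate the combining step, the following is a minimal Python sketch (not the patent's implementation): it averages the clips sample by sample wherever they overlap, which is equivalent to the per-segment 1/nClip_tIdx average above, and simply carries over the samples where only a single clip covers the timeline. The sample rate, clip contents and function name are illustrative assumptions.

import numpy as np

FS = 8000  # assumed common sample rate in Hz

def combine(clips):
    # clips: dict mapping name -> (start_time_in_seconds, array of samples)
    start = min(t for t, _ in clips.values())
    end = max(t + len(x) / FS for t, x in clips.values())
    n = int(round((end - start) * FS))
    acc = np.zeros(n)  # running sum of overlapping samples
    cnt = np.zeros(n)  # number of clips covering each sample
    for t, x in clips.values():
        i = int(round((t - start) * FS))
        acc[i:i + len(x)] += x
        cnt[i:i + len(x)] += 1
    cnt[cnt == 0] = 1  # guard against gaps in the timeline
    return start, acc / cnt  # per-sample average, i.e. CX(t)

# Figure 7 example: A covers T=0..4, B covers T=2..5 and C covers T=1..6
rng = np.random.default_rng(0)
clips = {"A": (0.0, rng.standard_normal(4 * FS)),
         "B": (2.0, rng.standard_normal(3 * FS)),
         "C": (1.0, rng.standard_normal(5 * FS))}
t0, cx = combine(clips)  # cx spans T=0..6, matching the combined signal CX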
The operation of creating a combined signal is shown in Figure 6 by step 501.

In some embodiments the time stamp determiner comprises a combined signal aligner determiner 403. The combined signal aligner determiner 403 is configured to receive the combined signal from the signal combiner 401 containing a time stamp and also the non-time stamped content audio signals and attempt to align these signals. With respect to Figure 7, the combined signal 606 CX and a non-time stamped content signal 607D are shown. The combined signal aligner 403 can be configured to determine the time offset between the signals, in other words whether the non-time stamped content audio signal is delayed with respect to the combined signal with time stamped content values or vice versa.
In some embodiments the determination of the time offset can be carried out via a time offset determiner or suitable signal aligner means or function. The time offset between the audio signals can be mathematically represented as follows,

tOffset = time_offset(CX, D)

where time_offset() defines the function that determines the alignment for the specified input signals, CX being the combined audio signal and D the non-time stamped audio signal. The output tOffset therefore in such embodiments contains corresponding time offset values for each of the input signals. The determination of the time offset can be carried out by any suitable function, such as the correlation analyses indicated by Carter, Nuttall and Cable in "The smoothed coherence transform", Proceedings of the IEEE, Vol. 61, No. 10, pages 1497 to 1498, by Cusani in "Performance of fast time delay estimators", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 37, No. 5, pages 757 to 759, or by Chen et al. in "Performance of GCC and AMDF Based Time Delay Estimation in Practical Reverberant Environments", Journal on Applied Signal Processing, Vol. 1, pages 25 to 36.
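As a concrete, hedged sketch of such a time_offset() function, the following Python code performs a plain cross-correlation peak search; the cited smoothed coherence transform and GCC estimators additionally weight the cross-spectrum, which this simplified variant omits. The sample rate and test signals are illustrative assumptions.

import numpy as np

def time_offset(cx, d, fs):
    # Return the delay in seconds of signal d relative to signal cx,
    # taken as the lag that maximises the cross-correlation product.
    corr = np.correlate(d, cx, mode="full")  # correlation at every lag
    lag = int(np.argmax(corr)) - (len(cx) - 1)
    return lag / fs

# d is cx delayed by 1.5 s plus a little noise (low rate keeps this cheap)
fs = 1000
rng = np.random.default_rng(1)
cx = rng.standard_normal(6 * fs)
d = np.concatenate([np.zeros(int(1.5 * fs)), cx])
d = d + 0.01 * rng.standard_normal(len(d))
print(time_offset(cx, d, fs))  # approximately 1.5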
The output of the combined signal aligner determiner 403 can then be passed in some embodiments to the full signal aligner 405.
The operation of combined signal aligning with the non-time stamped content is shown in Figure 6 by step 503.
In some embodiments the time stamp determiner 205 comprises a full signal aligner 405 or suitable full signal alignment means configured to align or time stamp the non-time stamped content signal with respect to the time stamped content, in other words to map the output time offsets to the actual input content items. For example, using the signals A, B, C and D from Figure 7, the following offsets can be determined,

A_offset = tOffset(0)
B_offset = tOffset(0)
C_offset = tOffset(0)
D_offset = tOffset(1)

tOffset2 = {A_offset, B_offset, C_offset, D_offset}

As the combined signal CX is a combination of A, B, and C, the time offset for all of these is equal to the time offset of CX.
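The mapping above is a simple broadcast of the combined signal's offset onto its constituent clips, which might be sketched as follows; the numeric offset values are illustrative assumptions.

# tOffset(0) applies to CX (and hence to A, B and C), tOffset(1) applies to D
t_offset = [0.0, 1.5]
members_of_cx = ["A", "B", "C"]
t_offset2 = {name: t_offset[0] for name in members_of_cx}
t_offset2["D"] = t_offset[1]
# t_offset2 == {'A': 0.0, 'B': 0.0, 'C': 0.0, 'D': 1.5}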
Thus the time stamps of previously time stamped and non-time stamped content items can be updated to use a common time base. This can be performed, for example in some embodiments, by having the following information, Table 1:
Table 1: for each content item, the group identification value (grpID) and the start and end timestamps within the continuous timeline.
The grpID is used to identify crowd-sourced content items that are continuous in time. In the example shown in Figure 7, content audio signals A, B, and C could in such embodiments be assigned the same grpID. The start and end timestamps identify the positions of the content within the continuous timeline either in absolute terms or in relative terms (with respect to content A in the example of Figure 7, as that appears first in the continuous timeline). Furthermore, similar information related to the grouping IDs can be generated in some embodiments corresponding to the input signals of the time offset function. For the example signals shown in Figure 7, this information could be

grpID_Comb = {grpID of A, grpID of D}
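One way of holding the per-item bookkeeping implied by Table 1 and grpID_Comb is sketched below; the record type and field names are illustrative assumptions rather than the patent's data layout.

from dataclasses import dataclass

@dataclass
class ContentItem:
    name: str
    start_ts: float   # startTS within the continuous timeline
    stop_ts: float    # stopTS within the continuous timeline
    grp_id: int = -1  # -1 marks an item not yet assigned to a timeline group

items = [ContentItem("A", 0.0, 4.0, grp_id=0),
         ContentItem("B", 2.0, 5.0, grp_id=0),
         ContentItem("C", 1.0, 6.0, grp_id=0),
         ContentItem("D", 0.0, 3.0)]  # non-timestamped, so grp_id stays -1

grp_id_comb = [items[0].grp_id, items[3].grp_id]  # {grpID of A, grpID of D}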
In addition in some embodiments the following information can be generated from the input content items, the creation of the combined signal(s) and the output of the time offset function:

Table 2:

nGroups        Number of groups that share the same continuous timeline
trackIdx       Number of combined signals
nTracks        Number of total signals (combined + non-timestamped signals)
trackIdxLeft   Indices of non-timestamped signals
grpCombIdxNum  Number of signals assigned to each group after determining the time offsets in Equation (2)
grpCombID      Group index for signals in Equation (2)

where trackIdxLeft describes the indices of the non-timestamped content items in the order they appear in the input content. For example, trackIdxLeft for the example signal set in Figure 7 would be trackIdxLeft = {D}, as content D is the first content item that appears after the combined signals. Furthermore, grpCombID indicates which of the input signals in the time offset function share the same continuous timeline. In some embodiments it can be assumed that the time_offset() function can provide this information as an output value. Where for example signals do not share the same continuous timeline, the time offset function can be configured to assign the time offsets in such a way that the different continuous timelines do not overlap. Based on the returned time offsets and the content durations it is possible in some embodiments to derive which input signals in the time offset function create a continuous timeline (a sketch of this derivation follows the example values below). Where there is more than one continuous timeline then in some embodiments different group indices (for example, starting from index 0 onwards) can be used. Furthermore the variable grpCombIdxNum indicates the number of clips (including also the content signal itself) that the particular content signal is overlapping with. Using the audio signals from Figure 7 again as an example, the values of grpCombID and grpCombIdxNum would be grpCombID = {0, 0} and grpCombIdxNum = {2, 2}, respectively.
For the signals in Figure 7, the following information would therefore be,

trackIdx = 1
nTracks = 2
trackIdxLeft = {3 (index number of content D)}
grpCombIdxNum = {2, 2} % CX overlaps with D and D overlaps with CX
grpCombID = {0, 0} % Both input signals share the same continuous timeline, thus belong to the same group (group 0 in the example case)
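As noted above, which inputs of the time offset function share a continuous timeline can be derived from the returned offsets and the content durations; the sketch below does this by testing whether the offset-shifted spans overlap pairwise. The grouping rule is an illustrative simplification of that derivation.

def comb_groups(offsets, durations):
    # Assign a group index to each input of the time offset function: two
    # inputs share a group when their offset-shifted spans overlap in time.
    spans = [(o, o + d) for o, d in zip(offsets, durations)]
    grp = [-1] * len(spans)
    nxt = 0
    for i, (s0, e0) in enumerate(spans):
        if grp[i] == -1:
            grp[i] = nxt
            nxt += 1
        for j in range(i + 1, len(spans)):
            s1, e1 = spans[j]
            if s0 < e1 and s1 < e0:  # the two intervals overlap
                grp[j] = grp[i]
    return grp

# CX spans 0..6 s and D spans 1.5..4.5 s after alignment: one shared timeline
print(comb_groups([0.0, 1.5], [6.0, 3.0]))  # [0, 0], i.e. grpCombID = {0, 0}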
In some embodiments the alignment or mapping can be performed using such generated values as described herein according to the following pseudo code,
1  Vector of size trackIdxPerGroup[0,...,M-1] = 0
2  Matrix of size clipIndices[0,...,M-1][0,...,M-1] = 0
3  Matrix of size lagIndices[0,...,M-1][0,...,M-1] = 0
4
5  for(i = clipIdx; i < nTracks; i++)
6  {
7    sIdx = i - clipIdx; cIdx = trackIdxLeft[sIdx];
8    if(grpIdx[cIdx] == -1)
9    {
10     if(grpCombIdxNum[i] > 1)
11     {
12       for(j = 0, tGrpIdx = -1; j < clipIdx; j++)
13         if(j != i)
14           if(grpCombIdxID[j] == grpCombIdxID[i])
15             tGrpIdx = j; Break for loop;
16
17       if(tGrpIdx == -1)
18       {
19         curIdx = trackIdxPerGroup[grpCombIdxID[i]];
20         clipIndices[grpCombIdxID[i]][curIdx] = cIdx;
21         lagIndices[grpCombIdxID[i]][curIdx] = tOffset2[cIdx];
22         trackIdxPerGroup[grpCombIdxID[i]] += 1;
23       }
24       else
25       {
26         refGrpIdx = grpID_Comb[tGrpIdx];
27         for(j = 0; j < M; j++)
28         {
29           if(grpID[j] == refGrpIdx)
30           {
31             curIdx = trackIdxPerGroup[grpCombIdxID[i]];
32             for(k = 0, isfound = 0; k < curIdx; k++)
33               if(clipIndices[grpCombIdxID[i]][k] == j)
34                 isfound = 1; Break for loop;
35
36             if(isfound == 0)
37             {
38               clipIndices[grpCombIdxID[i]][curIdx] = j;
39               lagIndices[grpCombIdxID[i]][curIdx] = tOffset2[j];
40               trackIdxPerGroup[grpCombIdxID[i]] += 1;
41             }
42           }
43         }
44
45         curIdx = trackIdxPerGroup[grpCombIdxID[i]];
46         clipIndices[grpCombIdxID[i]][curIdx] = cIdx;
47         lagIndices[grpCombIdxID[i]][curIdx] = tOffset2[cIdx];
48         trackIdxPerGroup[grpCombIdxID[i]] += 1;
49       }
50     }
51   }
52 }

where M is the number of input content items, which includes the time stamped and non-time stamped content audio signals that have been specified for processing. In the above code, lines 6 to 51 are repeated for each of the non-time stamped content items. The group identification value grpID of the non-time stamped input content is checked to determine whether it is unknown, as shown in line 8 of the above pseudo code. Furthermore, after determining the time offset, line 10 checks whether the non-time stamped audio signal overlaps with some of the other input signals.
Where these conditions are met the content signal information can be processed further. For example, lines 12 to 15 determine whether the signal is overlapping with any of the combined signals. Where no overlapping group is found, the signal parameters (the input content item index, as shown in line 20, and the time offset, as shown in line 21) are appended to the output grouping data, as shown in lines 19 to 22 of the pseudo code.
Where an overlapping group is found, the full signal aligner, as shown in lines 24 onwards, checks the input content item to determine whether the audio signal item belongs to the combined signal, as shown in line 29. Where this is the situation the values are appended to the output grouping data as shown in lines 38 to 40. Furthermore the input content item that is overlapping with the combined signal is also appended, as shown in lines 45 to 48. Lines 31 to 34 of the pseudo code ensure that the signal parameters are appended only once to the output grouping data.
In some embodiments the maximum group identification value can be determined according to the following expression,

maxGrpID = max(grpID[0], ..., grpID[M-1])

where max() returns the maximum value of the specified input values. The full signal aligner can then in some embodiments finalise the mapping by updating the timings and the group identification values of those input content items that have been identified to share continuous timelines. In some embodiments this finalisation operation can be described with regard to the following pseudo code,
1  for(i = 0; i < nGroups; i++)
2  {
3      nClips = trackIdxPerGroup[i]
4      if(nClips > 1)
5      {
6          refGrpIdx = grpID[clipIndices[i][0]]
7          for(j = 0; j < nClips; j++)
8          {
9              if(grpID[clipIndices[i][j]] != -1)
10                 refGrpIdx = grpID[clipIndices[i][j]]
11         }
12
13         for(j = 0, minPos = -1; j < nClips; j++)
14             if(grpID[clipIndices[i][j]] == refGrpIdx)
15                 if(minPos == -1 || minPos > startTS[clipIndices[i][j]])
16                     minPos = startTS[clipIndices[i][j]]
17
18         for(j = 0; j < nClips; j++)
19         {
20             if(grpID[clipIndices[i][j]] != -1)
21             {
22                 diffPos = startTS[clipIndices[i][j]] - minPos
23                 startTS[clipIndices[i][j]] = diffPos
24             }
25             else startTS[clipIndices[i][j]] = 0
26         }
27
28         for(j = 0; j < nClips; j++)
29         {
30             cIdx = clipIndices[i][j]
31             lagVal = lagIndices[i][j]
32             contentDuration = stopTS[cIdx] - startTS[cIdx]
33             if(grpID[cIdx] == -1)
34                 grpID[cIdx] = (refGrpIdx == -1) ? maxGrpID : refGrpIdx
35             tServerStatus[cIdx] = 2
36             startTS[cIdx] += minPos + lagVal
37             stopTS[cIdx] = startTS[cIdx] + contentDuration
38         }
39     }
40     maxGrpID = maxGrpID + 1
41 }
In such embodiments lines 2 to 41 of the above code are repeated for each group that shares a continuous timeline. Where the group has more than one content signal, as determined by line 4, the corresponding input content information is updated. In the above code, lines 6 to 11 determine the reference group identification value for the content signals. Furthermore, lines 13 to 16 determine the start time for the corresponding group. In such embodiments lines 18 to 26 update the start times of the content items within the group to match the relative time differences as defined by the offset time value defined herein. Lines 28 to 38 of the code update the start and stop times and the group identification value information for each content item within the identified group. The variable tServerStatus in line 35 is used to indicate that the timing information for the content item is obtained through the combined signal processing mode.
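A condensed Python sketch of this finalisation step is given below; the tracks are modelled as dictionaries with start_ts, stop_ts and grp_id fields, and all names are illustrative assumptions rather than the claimed implementation.

def finalise_group(tracks, members, lags, max_grp_id):
    # Normalise timings for one group of tracks sharing a continuous
    # timeline (cf. lines 6 to 38 of the pseudo code above).
    ref_grp = -1
    for i in members:                        # reference group id (lines 6-11)
        if tracks[i]['grp_id'] != -1:
            ref_grp = tracks[i]['grp_id']
    anchored = [tracks[i]['start_ts'] for i in members
                if tracks[i]['grp_id'] == ref_grp]
    min_pos = min(anchored) if anchored else 0.0  # group start (lines 13-16)
    for i, lag in zip(members, lags):        # update timings (lines 18-38)
        t = tracks[i]
        duration = t['stop_ts'] - t['start_ts']
        if t['grp_id'] == -1:
            t['grp_id'] = max_grp_id if ref_grp == -1 else ref_grp
            t['start_ts'] = min_pos + lag    # unknown items start from zero
        else:
            t['start_ts'] = t['start_ts'] + lag  # keep relative position
        t['stop_ts'] = t['start_ts'] + duration
        t['t_server_status'] = 2             # timing came from combined mode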
Furthermore, in some embodiments the time stamped and non-time stamped content items can be aligned based on the guidance information as indicated herein. The guidance information can in some embodiments include the variable tServerStatus and the corresponding startTS information for each input content item, where available. In some embodiments the alignment can be carried out according to the following expression,

(startTS, stopTS, grpID) = time_align(startTS, stopTS, grpID, <input_content_items>)

where time_align() defines the function that determines the alignment for the specified input. In such embodiments the function can take as an input the start and end times and the group identification value along with the actual media content. The output of such a function would be the updated timings and group identification values for each of the input content items. The operation of determining the time offset can use any suitable approach, such as shown previously with regard to correlation analysis. When two signals X and Y are about to be aligned, such as described in the expression above, and they satisfy the condition x.tServerStatus == 2 and y.tServerStatus == 2, the alignment parameters can be adjusted for an efficient time offset search by defining the start and stop times, and the time offset window, for the signal pair. This adjustment is in some embodiments performed according to the following code:

Pseudo-code 3:
1  toWindow = 1s
2  x.duration = x.stopTS - x.startTS
3  y.duration = y.stopTS - y.startTS
4
5  if x.startTS < y.startTS
6      x.startTSNew = y.startTS - x.startTS - toWindow
7      if x.startTSNew < 0
8          x.startTSNew = 0
9      x.stopTS = x.startTSNew + (x.duration - (x.startTSNew - x.startTS))
10     x.startTS = x.startTSNew
11 else
12     y.startTSNew = x.startTS - y.startTS - toWindow
13     if y.startTSNew < 0
14         y.startTSNew = 0
15     y.stopTS = y.startTSNew + (y.duration - (y.startTSNew - y.startTS))
16     y.startTS = y.startTSNew
17 End
18 toWindow = toWindow * 2
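As a rough executable illustration of pseudo-code 3 above, the same window adjustment can be written in Python as follows; the Clip fields and the helper name are assumptions for the sketch only, not the claimed implementation.

from dataclasses import dataclass

@dataclass
class Clip:
    start_ts: float   # position on the common timeline, in seconds
    stop_ts: float
    t_server_status: int = 0

def narrow_search_window(x: Clip, y: Clip, to_window: float = 1.0) -> float:
    # Trim the earlier-starting clip so that the subsequent time offset
    # search only needs to cover roughly +/- to_window seconds.
    x_dur = x.stop_ts - x.start_ts
    y_dur = y.stop_ts - y.start_ts
    if x.start_ts < y.start_ts:
        new_start = max(y.start_ts - x.start_ts - to_window, 0.0)
        x.stop_ts = new_start + (x_dur - (new_start - x.start_ts))
        x.start_ts = new_start
    else:
        new_start = max(x.start_ts - y.start_ts - to_window, 0.0)
        y.stop_ts = new_start + (y_dur - (new_start - y.start_ts))
        y.start_ts = new_start
    return to_window * 2.0   # total width of the offset search window

x = Clip(start_ts=0.0, stop_ts=120.0, t_server_status=2)
y = Clip(start_ts=90.0, stop_ts=200.0, t_server_status=2)
if x.t_server_status == 2 and y.t_server_status == 2:
    window = narrow_search_window(x, y)   # offset search limited to ~2 s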
In such embodiments it can therefore be seen that the start and stop times of the signal pair are updated such that the time offset window (toWindow) for the pair can be limited to a small value. In some embodiments the value can be set to 1 second, indicating that the signals in the pair are aligned within plus or minus 1 second with respect to each other. This provides a major gain in operational efficiency, as the final time offset needs to be searched only within a very limited time period. Furthermore, in such embodiments the combined signal represents the entire event scene, and it is therefore highly unlikely that the time offsets returned by the expressions herein are not optimal, even though the geographical area of the event scene may cover a large area where different content signals can be located quite far from each other. The processing steps described herein are designed to overcome the challenges the size of the event scene may bring to determining the time offsets between the various content items in a robust and reliable manner.

In some embodiments the combined signal may be computed more than once. This can for example be carried out where the number of content items to be covered is large. In such examples a suitable combination, for example a random or fixed combination, of the already time stamped content can be used to create the combined signal and then, following the steps as described herein, the final time stamps in the common time data can be determined. The combined signal can in some embodiments be recreated with differences in composition, including some new content items compared to the previously created combined signal, or replacing some content items with new items in the combined signal, and the processing steps for alignment then repeated. The output values from each of these combined signal iterations can then be saved and the final time differences for the content items determined from the saved values. In some examples data analysis can be performed to determine the time offset value towards each content item in a converging form. This converging form can be extracted using mean and standard deviation calculations, where the final output is a mean value which excludes any outlier values identified by the standard deviation.
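One way to realise this converging estimate is sketched below in Python; offsets gathered from several combined signal iterations are pooled for a content item, values further than one standard deviation from the mean are discarded, and the mean of the remainder is returned. The function name and the rejection threshold are illustrative assumptions.

from statistics import mean, stdev

def converged_offset(offsets, n_sigma=1.0):
    # Combine per-iteration time offsets for one content item,
    # discarding outliers with a mean/standard deviation test.
    if len(offsets) < 2:
        return offsets[0] if offsets else None
    m, s = mean(offsets), stdev(offsets)
    kept = [o for o in offsets if abs(o - m) <= n_sigma * s]
    return mean(kept) if kept else m

# Offsets from four combined signal iterations, in seconds; the
# 9.10 s estimate is rejected as an outlier.
print(converged_offset([2.40, 2.43, 9.10, 2.38]))   # ~2.40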
Although the above has been described with regard to audio signals, or audio-visual signals, it would be appreciated that embodiments may also be applied to audio-video signals where the audio signal components of the recorded data are processed in terms of determining the base signal and the time alignment factors for the remaining signals, and the video signal components may be synchronised using the above embodiments of the invention. In other words the video parts may be synchronised using the audio synchronisation information.
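As a brief hypothetical illustration of this, the offset found for a clip's audio track can simply be applied to the timestamps of its video frames:

def align_video_to_audio(video_frame_ts, audio_offset):
    # Shift video frame timestamps by the clip's audio alignment
    # offset so that audio and video remain in sync.
    return [t + audio_offset for t in video_frame_ts]

aligned = align_video_to_audio([0.00, 0.04, 0.08], 2.40)   # 25 fps frames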
It shall be appreciated that the term user equipment is intended to cover any suitable type of wireless user equipment, such as mobile telephones, portable data processing devices or portable web browsers.
Furthermore elements of a public land mobile network (PLMN) may also comprise apparatus as described above. In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof. The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVD and the data variants thereof, and CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate. Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

CLAIMS:
1. Apparatus comprising at least one processor and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured to, with the at least one processor, cause the apparatus to at least perform:
generating a combined audio signal from at least two first audio signals; comparing the combined audio signal to at least one second audio signal; determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and
associating the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal.
2. The apparatus as claimed in claim 1, further caused to perform:
receiving the at least two first audio signals from an audio scene; and receiving the at least one second audio signal from the audio scene.
3. The apparatus as claimed in claims 1 and 2, further caused to perform aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
4. The apparatus as claimed in claim 3, further caused to render the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
5. The apparatus as claimed in claims 1 to 4, wherein generating a combined audio signal from at least two first audio signals causes the apparatus to perform: generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and appending to the combined audio signal an available at least two first audio signal part otherwise.
6. The apparatus as claimed in claims 1 to 5, wherein comparing the combined audio signal to at least one second audio signal causes the apparatus to perform a cross-correlation between the combined audio signal and the at least one second audio signal.
7. The apparatus as claimed in claim 6, wherein determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal causes the apparatus to perform determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
8. The apparatus as claimed in claims 1 to 7, wherein associating the alignment value to the at least two first audio signals causes the apparatus to perform:
assigning the at least two first audio signals to a first group of audio signals;
assigning the at least one second audio signal to a second group of audio signals;
assigning a null alignment value to the first group of audio signals; and assigning the alignment value to the second group of audio signals.
9. The apparatus as claimed in claims 1 to 8, further caused to perform:
generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value;
comparing the further combined audio signal to at least one further second audio signal;
determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal; and associating the alignment value to the at least one further second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
10. The apparatus as claimed in claims 1 to 9, wherein the first audio signals comprise timestamp information and the second audio signals lack the timestamp information.
11. A method comprising:
generating a combined audio signal from at least two first audio signals; comparing the combined audio signal to at least one second audio signal; determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and
associating the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal.
12. The method as claimed in claim 11, further comprising:
receiving the at least two first audio signals from an audio scene; and receiving the at least one second audio signal from the audio scene.
13. The method as claimed in claims 11 and 12, further comprising aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
14. The method as claimed in claim 13, further comprising rendering the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
15. The method as claimed in claims 11 to 14, wherein generating a combined audio signal from at least two first audio signals comprises:
generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and appending to the combined audio signal an available at least two first audio signal part otherwise.
16. The method as claimed in claims 11 to 15, wherein comparing the combined audio signal to at least one second audio signal comprises a cross-correlation between the combined audio signal and the at least one second audio signal.
17. The method as claimed in claim 16, wherein determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal comprises determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
18. The method as claimed in claims 11 to 17, wherein associating the alignment value to the at least two first audio signals comprises:
assigning the at least two first audio signals to a first group of audio signals;
assigning the at least one second audio signal to a second group of audio signals;
assigning a null alignment value to the first group of audio signals; and assigning the alignment value to the second group of audio signals.
19. The method as claimed in claims 11 to 18, further comprising:
generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value;
comparing the further combined audio signal to at least one further second audio signal;
determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal; and associating the alignment value to the at least one further second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
20. The method as claimed in claims 11 to 19, wherein the first audio signals comprise timestamp information and the second audio signals lack the timestamp information.
21. An apparatus comprising:
a signal combiner configured to generate a combined audio signal from at least two first audio signals;
a combined signal comparator configured to compare the combined audio signal to at least one second audio signal;
a signal aligner configured to determine an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and
a full signal aligner configured to associate the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal.
22. The apparatus as claimed in claim 21, further comprising:
a first signal receiver configured to receive the at least two first audio signals from an audio scene; and
a second signal receiver configured to receive the at least one second audio signal from the audio scene.
23. The apparatus as claimed in claims 21 and 22, further comprising a delay configured to align at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
24. The apparatus as claimed in claim 23, further comprising a signal renderer configured to render the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
25. The apparatus as claimed in claims 21 to 24, wherein the signal combiner comprises: an averager configured to generate a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and
an appender configured to append to the combined audio signal an available at least two first audio signal part otherwise.
26. The apparatus as claimed in claims 21 to 25, wherein the signal comparator comprises a cross-correlator configured to generate a cross-correlation product between the combined audio signal and the at least one second audio signal.
27. The apparatus as claimed in claim 26, wherein the alignment determiner comprises an offset determiner configured to determine a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
28. The apparatus as claimed in claims 21 to 27, wherein the full signal aligner comprises:
a first assigner configured to assign the at least two first audio signals to a first group of audio signals;
a second assigner configured to assign the at least one second audio signal to a second group of audio signals;
a null assigner configured to assign a null alignment value to the first group of audio signals; and
a delay assigner configured to assign the alignment value to the second group of audio signals.
29. The apparatus as claimed in claims 21 to 28, wherein:
the signal combiner is configured to further combine an audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value;
the comparator is configured to further compare the further combined audio signal to at least one further second audio signal; the aligner is configured to further determine a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal; and
the full signal aligner is further configured to associate the alignment value to the at least one further second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
30. The apparatus as claimed in claims 21 to 29, wherein the first audio signals comprise timestamp information and the second audio signals lack the timestamp information.
31. An apparatus comprising:
means for generating a combined audio signal from at least two first audio signals;
means for comparing the combined audio signal to at least one second audio signal;
means for determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal; and
means for associating the alignment value to the at least one second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal.
32. The apparatus as claimed in claim 31, further comprising:
means for receiving the at least two first audio signals from an audio scene; and
means for receiving the at least one second audio signal from the audio scene.
33. The apparatus as claimed in claims 31 and 32, further comprising means for aligning at least one of the at least two first audio signals and the at least one second audio signal dependent on the alignment value.
34. The apparatus as claimed in claim 33, further comprising means for rendering the aligned at least one of the at least two first audio signals and the at least one second audio signal for outputting.
35. The apparatus as claimed in claims 31 to 34, wherein the means for generating a combined audio signal from at least two first audio signals comprises:
means for generating a combined audio signal from an average of the at least two first audio signals when the at least two first audio signals are concurrent; and
means for appending to the combined audio signal an available at least two first audio signal part otherwise.
36. The apparatus as claimed in claims 31 to 35, wherein the means for comparing the combined audio signal to at least one second audio signal comprises means for generating a cross-correlation product between the combined audio signal and the at least one second audio signal.
37. The apparatus as claimed in claim 36, wherein the means for determining an alignment value configured to temporally align the combined audio signal to the at least one second audio signal comprises means for determining a time offset which maximises the cross-correlation product between the combined audio signal and the at least one second audio signal.
38. The apparatus as claimed in claims 31 to 37, wherein the means for associating the alignment value to the at least two first audio signals comprises: means for assigning the at least two first audio signals to a first group of audio signals;
means for assigning the at least one second audio signal to a second group of audio signals;
means for assigning a null alignment value to the first group of audio signals; and means for assigning the alignment value to the second group of audio signals.
39. The apparatus as claimed in claims 31 to 38, further comprising:
the means for combining further comprising means for generating a further combined audio signal from the at least two first audio signals and the at least one second audio signal associated with the alignment value;
the means for comparing further comprising means for comparing the further combined audio signal to at least one further second audio signal;
the means for determining an alignment value comprising means for determining a further alignment value configured to temporally align the further combined audio signal to the at least one further second audio signal; and
the means for associating the alignment value comprising means for associating the alignment value to the at least one further second audio signal so to temporally align the at least two first audio signals with the at least one second audio signal and the at least one further second audio signal.
40. The apparatus as claimed in claims 31 to 39, wherein the first audio signals comprise timestamp information and the second audio signals lack the timestamp information.
41. A computer program product stored on a medium for causing an apparatus to perform the method of any of claims 11 to 20.
42. An electronic device comprising apparatus as claimed in claims 1 to 10 and 21 to 40.
43. A chipset comprising apparatus as claimed in claims 1 to 10 and 21 to 40.