WO2015086895A1

WO2015086895A1 - Spatial audio processing apparatus

Info

Publication number: WO2015086895A1
Application number: PCT/FI2014/050953
Authority: WO
Inventors: Toni Mäkinen; Roope JÄRVINEN; Lasse Laaksonen; Kari Järvinen
Original assignee: Nokia Technologies Oy
Current assignee: Nokia Technologies Oy
Priority date: 2013-12-11
Filing date: 2014-12-04
Publication date: 2015-06-18
Anticipated expiration: 2016-06-11
Also published as: GB201321928D0; GB2521175A

Abstract

An apparatus comprising: an input configured to receive at least two audio signals from at least two microphones located within an acoustic space; a classifier configured to determine the at least two audio signals comprising at least one speech component; a source estimator configured to determine a location associated with the at least one speech component; a processor configured to process the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal; an output configured to output the at least one output audio signal within the acoustic space, such that the at least one speech component is enhanced for a location within the acoustic space other than the location associated with the at least one speech component.

Description

SPATIAL AUDIO PROCESSING APPARATUS

Field The present application relates to apparatus for spatial audio processing. The application further relates to, but is not limited to, automotive or vehicular apparatus for spatial audio processing.

Background

Audio capture or recording on electronic apparatus is now common. Capturing or recording audio signals within a vehicle has become a standard feature and used for example to provide hands-free communication using a suitably configured communications apparatus installed or located within the vehicle.

Similarly multichannel playback systems installed or located within the vehicle, such as an in-car entertainment 5.1 channel reproduction system, can be used for presenting audio signals such as the received communication from another location via the suitably configured communications apparatus.

The noise levels inside a moving vehicle, for example a car, can become high, These noise levels, which can be separated into 'road noise' produced mainly by tyres, 'wind noise' produced by aerodynamic inefficiencies in the design such as windscreen noise and rear-view mirror noise, and 'engine noise' produced by the internal combustion engine. These noise sources disturb conversation inside the car, and especially between persons sitting on the back and front seats. Furthermore whereas in a typical conversation visual cues, such as lip or facial reading can be relied on to assist the listener the positioning of the persons in a forward facing direction prevents these from being used. Similarly the whenever an external audio source, such as a 'radio' or other voice or music source is being listened to in the car then conversation can be made difficult to understand. Being able to hold a conversation and talk and discuss conveniently during driving is often desirable and also important in cases where guidance or alerting is needed. For example when a child is experiencing motion sickness in the rear of the car and needs to alert the driver to stop the vehicle to be ill.

Summary of the Application

Aspects of this application thus provide a spatial audio processing capability to assist the usability of vehicles such as cars.

There is provided an apparatus comprising at least one processor and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured to with the at least one processor cause the apparatus to: receive at least two audio signals from at least two microphones located within an acoustic space; determine the at least two audio signals comprise at least one speech component; determine a location associated with the at least one speech component; process the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal; output the at least one output audio signal within the acoustic space, such that the at least one speech component is enhanced for a location within the acoustic space other than the location associated with the at least one speech component.

Determining the at least two audio signals comprise at least one speech component may cause the apparatus to: classify the at least two audio signals according to whether the at least two audio signals comprise a speech or non- speech component; determine at least one speech component common to the at least two audio signals.

Determining the at least two audio signals comprise at least one speech component may cause the apparatus to: receive at least two audio signals comprising a speech component originating from a speech source; receive an indication that the at least two audio signals comprise the speech component; train a classifier based on the indication with the at least two audio signals comprising a speech component originating from a speech source to classify successive at least two audio signals comprising a similar speech component as a speech component.

Determining a location associated with the at least one speech component may cause the apparatus to: analyse the at least two audio signals to determine at least one time delay for a common element of the at least one speech component; determine a location direction relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element.

Determining a location associated with the at least one speech component may cause the apparatus to: analyse a further pair of the at least two audio signals to determine at least one further time delay for a common element of the at least one speech component; determine a location distance relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element and the at least one further time delay for a common element of the at least one speech component.

Processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal may cause the apparatus to: generate a combined/selected audio signal based on the at least two audio signals and at least one speech component; amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location based on the location associated with the at least one speech component.

Amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker located based on the location associated with the at least one speech component may cause the apparatus to perform at least one of: amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location different to the location associated with the at least one speech component; amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location opposite to the location associated with the at least one speech component; amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location longitudinally opposite to the location associated with the at least one speech component; amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location laterally opposite to the location associated with the at least one speech component.

Processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal may cause the apparatus to: determine a location parameter based on the at least two audio signals at least one speech component; process the location parameter based on the location associated with the at least one speech component to generate an output location parameter; render at least one output audio signal based on the output location parameter.

Processing the location parameter based on the location associated with the at least one speech component to generate an output location parameter may cause the apparatus to perform at least one of: modify the location parameter to generate the output location parameter at a location different to the location associated with the at least one speech component; modify the location parameter to generate the output location parameter at a location opposite to the location associated with the at least one speech component; modify the location parameter to generate the output location parameter at a location longitudinally opposite to the location associated with the at least one speech component; modify the location parameter to generate the output location parameter at a location laterally opposite to the location associated with the at least one speech component.

The apparatus may be caused to determine a parameter associated with the situation of the acoustic space and wherein processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal may further cause the apparatus to process the at least two audio signals based on the parameter associated with the situation of the acoustic space.

According to a second aspect there is provided an apparatus comprising: means for receiving at least two audio signals from at least two microphones located within an acoustic space; means for determining the at least two audio signals comprise at least one speech component; means for determining a location associated with the at least one speech component; means for processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal; means for outputting the at least one output audio signal within the acoustic space, such that the at least one speech component is enhanced for a location within the acoustic space other than the location associated with the at least one speech component. The means for determining the at least two audio signals comprising at least one speech component may comprise: means for classifying the at least two audio signals according to whether the at least two audio signals comprise a speech or non-speech component; means for determining at least one speech component common to the at least two audio signals.

The means for determining the at least two audio signals comprise at least one speech component may comprise: means for receiving at least two audio signals comprising a speech component originating from a speech source; means for receiving an indication that the at least two audio signals comprise the speech component; means for training a classifier based on the indication with the at least two audio signals comprising a speech component originating from a speech source to classify successive at least two audio signals comprising a similar speech component as a speech component.

The means for determining a location associated with the at least one speech component may comprise: means for analysing the at least two audio signals to determine at least one time delay for a common element of the at least one speech component; means for determining a location direction relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element.

The means for determining a location associated with the at least one speech component may comprise: means for analysing a further pair of the at least two audio signals to determine at least one further time delay for a common element of the at least one speech component; means for determining a location distance relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element and the at least one further time delay for a common element of the at least one speech component.

The means for processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal may comprise: means for generating a combined/selected audio signal based on the at least two audio signals and at least one speech component; means for amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location based on the location associated with the at least one speech component. The means for amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker located based on the location associated with the at least one speech component may comprise at least one of: means for amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location different to the location associated with the at least one speech component; means for amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location opposite to the location associated with the at least one speech component; means for amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location longitudinally opposite to the location associated with the at least one speech component; means for amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location laterally opposite to the location associated with the at least one speech component.

The means for processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal may comprise: means for determining a location parameter based on the at least two audio signals at least one speech component; means for processing the location parameter based on the location associated with the at least one speech component to generate an output location parameter; means for rendering at least one output audio signal based on the output location parameter.

The means for processing the location parameter based on the location associated with the at least one speech component to generate an output location parameter may comprise at least one of: means for modifying the location parameter to generate the output location parameter at a location different to the location associated with the at least one speech component; means for modifying the location parameter to generate the output location parameter at a location opposite to the location associated with the at least one speech component; means for modifying the location parameter to generate the output location parameter at a location longitudinally opposite to the location associated with the at least one speech component; means for modifying the location parameter to generate the output location parameter at a location laterally opposite to the location associated with the at least one speech component. The apparatus may further comprise means for determining a parameter associated with the situation of the acoustic space and wherein the means for processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal may further comprise means for processing the at least two audio signals based on the parameter associated with the situation of the acoustic space.

According to a third aspect there is provided apparatus comprising: an input configured to receive at least two audio signals from at least two microphones located within an acoustic space; a classifier configured to deternnine the at least two audio signals comprise at least one speech component; a source estimator configured to determine a location associated with the at least one speech component; a processor configured to process the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal; an output configured to output the at least one output audio signal within the acoustic space, such that the at least one speech component is enhanced for a location within the acoustic space other than the location associated with the at least one speech component.

The classifier may be configured to: classify the at least two audio signals according to whether the at least two audio signals comprise a speech or non- speech component; determine at least one speech component common to the at least two audio signals.

The classifier may be configured to: receive at least two audio signals comprising a speech component originating from a speech source; receive an indication that the at least two audio signals comprise the speech component; train the classifier based on the indication with the at least two audio signals comprising a speech component originating from a speech source to classify successive at least two audio signals comprising a similar speech component as a speech component.

The source estimator may comprise: a correlator configured to analyse the at least two audio signals to determine at least one time delay for a common element of the at least one speech component; a location determiner configured to determine a location direction relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element.

The source estimator may comprise: the correlator configured to analyse a further pair of the at least two audio signals to determine at least one further time delay for a common element of the at least one speech component; the location determiner configured to determine a location distance relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element and the at least one further time delay for a common element of the at least one speech component.

The processor may comprise: a signal combiner configured to generate a combined/selected audio signal based on the at least two audio signals and at least one speech component; a selective multichannel amplifier configured to amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location based on the location associated with the at least one speech component.

The selective multichannel amplifier may be configured to perform at least one of: amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location different to the location associated with the at least one speech component; amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location opposite to the location associated with the at least one speech component; amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location longitudinally opposite to the location associated with the at least one speech component; amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location laterally opposite to the location associated with the at least one speech component.

The processor may comprise: a location parameter determiner configured to determine a location parameter based on the at least two audio signals at least one speech component; a spatial processor configured to process the location parameter based on the location associated with the at least one speech component to generate an output location parameter; a signal renderer configured to render at least one output audio signal based on the output location parameter.

The spatial processor may be configured to perform at least one of: modify the location parameter to generate the output location parameter at a location different to the location associated with the at least one speech component; modify the location parameter to generate the output location parameter at a location opposite to the location associated with the at least one speech component; modify the location parameter to generate the output location parameter at a location longitudinally opposite to the location associated with the at least one speech component; modify the location parameter to generate the output location parameter at a location laterally opposite to the location associated with the at least one speech component.

The apparatus may comprise an acoustic space determiner configured to determine a parameter associated with the situation of the acoustic space and wherein the processor may be configured to process the at least two audio signals based on the parameter associated with the situation of the acoustic space.

According to a fourth aspect there is provided a method comprising: receiving at least two audio signals from at least two microphones located within an acoustic space; determining the at least two audio signals comprise at least one speech component; determining a location associated with the at least one speech component; processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal; outputting the at least one output audio signal within the acoustic space, such that the at least one speech component is enhanced for a location within the acoustic space other than the location associated with the at least one speech component. Determining the at least two audio signals comprise at least one speech component may comprise: classifying the at least two audio signals according to whether the at least two audio signals comprise a speech or non-speech component; determining at least one speech component common to the at least two audio signals.

Determining the at least two audio signals comprise at least one speech component may comprise: receiving at least two audio signals comprising a speech component originating from a speech source; receiving an indication that the at least two audio signals comprise the speech component; training a classifier based on the indication with the at least two audio signals comprising a speech component originating from a speech source to classify successive at least two audio signals comprising a similar speech component as a speech component.

Determining a location associated with the at least one speech component may comprise: analysing the at least two audio signals to determine at least one time delay for a common element of the at least one speech component; determining a location direction relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element.

Determining a location associated with the at least one speech component may comprise: analysing a further pair of the at least two audio signals to determine at least one further time delay for a common element of the at least one speech component; determining a location distance relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element and the at least one further time delay for a common element of the at least one speech component. Processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal may comprise: generating a combined/selected audio signal based on the at least two audio signals and at least one speech component; amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location based on the location associated with the at least one speech component.

Amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker located based on the location associated with the at least one speech component may comprise at least one of: amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location different to the location associated with the at least one speech component; amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location opposite to the location associated with the at least one speech component; amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location longitudinally opposite to the location associated with the at least one speech component; amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location laterally opposite to the location associated with the at least one speech component. Processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal may comprise: determining a location parameter based on the at least two audio signals at least one speech component; processing the location parameter based on the location associated with the at least one speech component to generate an output location parameter; rendering at least one output audio signal based on the output location parameter.

Processing the location parameter based on the location associated with the at least one speech component to generate an output location parameter may comprise at least one of: modifying the location parameter to generate the output location parameter at a location different to the location associated with the at least one speech component; modifying the location parameter to generate the output location parameter at a location opposite to the location associated with the at least one speech component; modifying the location parameter to generate the output location parameter at a location longitudinally opposite to the location associated with the at least one speech component; modifying the location parameter to generate the output location parameter at a location laterally opposite to the location associated with the at least one speech component. The method may further comprise determining a parameter associated with the situation of the acoustic space and wherein processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal may further comprise processing the at least two audio signals based on the parameter associated with the situation of the acoustic space.

A computer program product stored on a medium may cause an apparatus to perform the method as described herein.

An electronic device may comprise apparatus as described herein.

A chipset may comprise apparatus as described herein.

Embodiments of the present application aim to address problems associated with the state of the art.

Summary of the Figures

For better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:

Figure 1 shows a schematic view of an apparatus suitable for implementing embodiments;

Figure 2 shows schematically the apparatus suitable for implementing embodiments as located within a suitable vehicle;

Figure 3 shows schematically apparatus suitable for implementing embodiments in further detail;

Figure 4 shows a flow diagram of the operation of the apparatus shown in

Figure 3 according to some embodiments;

Figure 5 shows a flow diagram of an example operation of the apparatus in further detail;

Figure 6 shows a flow diagram of the operation of the classifier and the source estimator in further detail according to some embodiments; and

Figure 7 shows schematically a feedback filter implemented within the spatial processor according to some embodiments. Embodiments of the Application

The following describes in further detail suitable apparatus and possible mechanisms for the provision of effective spatial audio processing.

The concept of the application is related to estimating the sound direction of arrival (DOA) and recognizing the audio content such that conversations or speech within a vehicle or acoustic space is able to be heard by others. In such embodiments a microphone array is employed (or installed) inside the vehicle (for example a car), which further comprises a suitable processor apparatus configured to perform a suitable audio classifier algorithm (or voice activity detector - VAD) implementation. The embodiments as described herein can further comprise suitable audio analysis and processing algorithms or apparatus configured such that whenever a person is speaking (for example in the car back seats), the speech is captured by the installed microphone array, processed and amplified and output (from the car front speakers). Correspondingly when for example the apparatus captures audio signals comprising the voice of a person sitting in front seats, then these audio signals can be amplified and played from the rear speakers. Whilst in the embodiments described herein the audio signals are passed between the front and rear areas of the vehicle it would be understood that a similar method can be performed to pass the voice audio signals between the left and right sides of the vehicle.

Moreover in the embodiments described herein the apparatus can be configured to process any other audio signals to relatively emphasise the voice audio signals. For example where in-car entertainment (ICE) audio or audio-video system is fitted, and capable of outputting music or speech from a suitable audio source for example a radio station, compact disc player, hard disk, or input audio, the playback level from the audio source is automatically reduced during determination and processing of the live speech audio signal.

In some embodiments the processing is triggered by the audio classifier/VAD. In this regard reference is first made to Figure 1 which shows a schematic block diagram of an exemplary apparatus or electronic device 10, which may be used to capture, analyse, process and output the audio signals. Although in the following examples the apparatus 10 is described as being fixed or integrated within the vehicle it would be understood that in some embodiments at least part of the apparatus described herein may be implemented within a mobile terminal or user equipment of a wireless communication system. The apparatus 10 can in some embodiments comprise an audio subsystem. The audio subsystem for example can include in some embodiments an array of microphones 1 1 for audio signal capture. In some embodiments the array of microphones are solid state microphones, in other words capable of capturing acoustic signals and outputting a suitable digital format audio signal. In some other embodiments the array of microphones 1 1 can comprise any suitable microphone or audio capture means, for example a condenser microphone, capacitor microphone, electrostatic microphone, Electret condenser microphone, dynamic microphone, ribbon microphone, carbon microphone, piezoelectric microphone, or microelectrical-mechanical system (MEMS) microphone. The microphone array 1 1 can in some embodiments output the generated audio signal to an analogue-to-digital converter (ADC) 14.

In some embodiments the apparatus and audio subsystem includes an analogue- to-digital converter (ADC) 14 configured to receive the analogue captured audio signal from the microphones and output the audio captured signal in a suitable digital form. The analogue-to-digital converter 14 can be any suitable analogue-to- digital conversion or processing means.

In some embodiments the apparatus 10 and audio subsystem further includes a digital-to-analogue converter (DAC) 32 for converting digital audio signals from a processor 21 to a suitable analogue format. The digital-to-analogue converter (DAC) or signal processing means 32 can in some embodiments be any suitable DAC technology. Furthermore the audio subsystem can include in some embodiments at least one speaker 33. The speaker 33 can in some embodiments receive the output from the digital-to-analogue converter 32 and present the analogue audio signal to the user. In some embodiments the speaker 33 can be representative of the in car or vehicle speaker system or at least one set of headphones, or at least one set of cordless headphones.

Although the apparatus 10 is shown having both audio capture and audio presentation components, it would be understood that in some embodiments the apparatus 10 can comprise the audio capture only such that in some embodiments of the apparatus the microphones (for audio capture) and the analogue-to-digital converter are present. In some embodiments the apparatus 10 comprises a processor 21 . The processor 21 is coupled to the audio subsystem and specifically in some examples the analogue-to-digital converter 14 for receiving digital signals representing audio signals from the microphone 1 1 , and the digital-to-analogue converter (DAC) 32 configured to output processed digital audio signals.

The processor 21 can be configured to execute various program codes. The implemented program codes can comprise for example source determination, audio source direction estimation, and audio source motion to user interface gesture mapping code routines.

In some embodiments the apparatus further comprises a memory 22. In some embodiments the processor 21 is coupled to memory 22. The memory 22 can be any suitable storage means. In some embodiments the memory 22 comprises a program code section 23 for storing program codes implementable upon the processor 21 such as those code routines described herein. Furthermore in some embodiments the memory 22 can further comprise a stored data section 24 for storing data, for example audio data that has been captured in accordance with the application or audio data to be processed with respect to the embodiments described herein. The implemented program code stored within the program code section 23, and the data stored within the stored data section 24 can be retrieved by the processor 21 whenever needed via a memory-processor coupling. In some further embodiments the apparatus 10 can comprise a user interface (Ul) 15. The user interface 15 can be coupled in some embodiments to the processor 21 . In some embodiments the processor can control the operation of the user interface and receive inputs from the user interface 15. In some embodiments the user interface 15 can enable a user to input commands to the electronic device or apparatus 10, for example via a keypad, and/or to obtain information from the apparatus 10, for example via a display which is part of the user interface 15. The user interface 15 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the apparatus 10 and further displaying information to the user of the apparatus 10.

In some embodiments the apparatus further comprises a transceiver 13, the transceiver in such embodiments can be coupled to the processor and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver 13 or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.

The transceiver 13 can communicate with further devices by any suitable known communications protocol, for example in some embodiments the transceiver 13 or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).

In some embodiments the transceiver is configured to transmit and/or receive the audio signals for processing according to some embodiments as discussed herein. It is to be understood again that the structure of the apparatus 10 could be supplemented and varied in many ways.

With respect to Figure 2 an example apparatus is shown implemented within a vehicle. The vehicle shown in Figure 2 is a sectioned idealised car 101 and the apparatus comprises the microphone array 1 1 , which in this example comprises three microphones mounted approximately centrally with respect to the passenger section in the roof headlining of the car 101 . It would be understood that in some embodiments the microphone array can be any suitable arrangement or configuration of microphones, for example a distributed array arranged throughout the car or the array can be formed from microphones installed in devices carried by or located within the car or vehicle. The microphone array 1 1 can be configured to be coupled to the vehicle processing unit 103. The coupling can in some embodiments be wired or wireless (for example using a Bluetooth wireless connection). In some embodiments at least one of the microphones are temporarily or otherwise mounted in the vehicle. For example in some embodiments at least one of the microphones is a microphone from a mobile communications device (such as a mobile phone located on a holder in the vehicle) which is coupled to the vehicle processing unit.

In some embodiments the apparatus comprises a vehicle or vehicle mounted processing unit 103 which in turn comprises the processor 121 , coupled to a memory 122, and further coupled to a transceiver 1 13. The transceiver can in some embodiments be configured, as described herein, to communicate with the microphone array 1 1 and the speaker array 33. In some embodiments the vehicle mounted processing unit 103 is configured to be removable or temporarily or otherwise mounted in the vehicle. For example in some embodiments the processing unit is a mobile communications device (such as a mobile phone located on a holder in the vehicle) which is coupled to the microphones and the speaker array.

Furthermore in some embodiments the apparatus comprises a speaker array 33 which is shown within the vehicle 101 comprising a pair of front speakers, a front right 33i and front left 332 speaker and a pair of rear speakers, a rear right 333 and rear left 33₄ speaker. However it would be understood that the speaker array or system can comprise any number and arrangement of speakers. The speaker array 33 can be configured to be coupled to the vehicle processing unit 103. The coupling can in some embodiments be wired or wireless (for example using a Bluetooth wireless connection). In some embodiments the speakers can be carried either temporarily or otherwise in the vehicle. For example in some embodiments at least one of the speakers is a speaker from a mobile communications device (such as a mobile phone located on a holder in the vehicle) which is coupled to the vehicle processing unit.

In some embodiments the vehicle 101 comprises a user interface or touchscreen interface 1 15 further coupled to the vehicle processing unit 103. The user interface 1 15 can for example control the operation of the vehicle processing unit 103 and furthermore control the playback or other of any other suitable audio source such as compact disc, Bluetooth audio input, radio receiver. In some embodiments the user interface is temporarily or otherwise mounted in the vehicle. For example in some embodiments the user interface is that provided by a mobile communications device or other suitable electronic device, such as computer, tablet. For example in some embodiments the user interface 1 15 is provided by a mobile phone located on a holder in the vehicle coupled to the vehicle processing unit 103.

With respect to Figure 3 the processing apparatus, such as the processing apparatus shown in Figure 1 or the vehicle processing unit 103 shown in Figure 2 is shown in further detail. Furthermore with respect to Figure 4 the operation of the processing apparatus, such as the processing apparatus shown in Figure 1 or the vehicle processing unit 103 shown in Figure 2 according to some embodiments is further described.

In some embodiments the processing apparatus comprises an input configured to receive the audio signals from the microphone array 1 1 . It would be understood that the audio signals can be received in any suitable format. The operation of receiving the audio signals from the microphones is shown in Figure 4 by step 301 . In some embodiments the processing apparatus comprises a classifier 201 . The classifier 201 can be configured to receive the audio signals from the microphone array 1 1 or via the input.

In some embodiments the classifier 201 can be configured to determine whether the audio signals captured by the microphone comprise at least one speech component. In other words whether the audio signals originate from a speech source within the vehicle. The classifier 201 can use any suitable classification or voice activity determination (VAD) method. For example in some embodiments the classifier 201 can be configured to perform an unsupervised classification on the audio signals determining whether or not the audio signals are speech or non- speech. In some embodiments the classifier 201 is a supervised classifier. For example in some embodiments the classifier 201 comprises a support vector machine (SVM) which can be applied to provide better adaptation to specific speakers through a manually performed training operation or step. The training operation or step can be performed by the classifier 201 providing a suitable prompt (either visual or audible) such that the passengers or persons within the vehicle are encouraged to speak a small segment or phrase which can be used to determine whether or not speech has been received at a later time. The trainer or training operation can then be considered to be in some embodiments where the classifier receives at least two audio signals comprising a speech component originating from a speech source and furthermore receives an indication that the at least two audio signals comprise the speech component (in other words that the audio is training audio). The classifier operating in a training mode can therefore be considered to be using the audio signals and the indication to train a classifier to classify successive audio signals comprising a similar speech component as a speech component. In other words in some embodiments the classifier or suitable means for classifying that receive audio signals can be configured to receive at least two audio signals comprising a speech component originating from a speech source and an indication that the at least two audio signals comprise the speech component. The classifier can then in some embodiments be configured to be trained based on the indication and the audio signals such that later or successive audio signals comprising a similar speech component is identified or classified as comprising a speech component. In some embodiments the classifier can be configured to determine whether the audio signals captured by the microphones comprise at least one speech component common to at least two of the audio signals.

The classifier 201 can be configured to output the classification results to the source estimator 205 and furthermore in some embodiments to a spatial processor 207.

The operation of classifying the audio and determining whether it contains speech or not is shown in Figure 4 by step 303. In some embodiments where there is no speech then the operation passes back to receive further audio signals from the microphones. In other words the operation passes back to step 301 .

Where the audio has been determined to contain speech signals then the operation can pass onwards to step 305.

In some embodiments the apparatus comprises a source estimator 205. The source estimator 205 can be configured in some embodiments to receive the audio signals from the microphone array 1 1 and furthermore receive the output of the classifier 201 . The source estimator 205 can, in some embodiments when the audio signal has been determined to comprise speech, determine the location of the at least one speech sources. This can in some embodiments be considered to be where the source estimator is configured to analyse the at least two audio signals to determine at least one time delay for a common element of the at least one speech component and determine a location direction relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element. In some embodiments where the information or audio signals are available the source estimator or suitable means for determining a speech component origin or source can be configured to analyse a further pair of the at least two audio signals to determine at least one further time delay for a common element of the at least one speech component and then determine a location distance relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element and the at least one further time delay for a common element of the at least one speech component.

In other words the source estimator 205 can be configured to determine direction of arrival information associated with suitably determined or detected speech acoustic sources (the passengers within the vehicle as they talk). The estimate of the at least one source speech location can be passed to the spatial processor 207. The operation of determining the speech source location is shown in Figure 4 by step 305.

An example microphone array arrangement such as shown in Figure 2 shows a first microphone, a second microphone and a third microphone. In this example the microphones are arranged at the vertices of an equilateral triangle. However the microphones can be arranged in any suitable shape or arrangement. In this example each microphone is separated by a dimension or distance d from each other and each pair of microphones can be considered to be orientated by an angle of 120° from the other two pairs of microphone forming the array. The separation between each microphone is such that the audio signal received from a signal source can arrive at a first microphone, for example microphone 3 earlier than one of the other microphones, such as microphone 2. This can for example be a time domain audio signal fi(t) occurring at the first time instance and the same audio signal being received at the third microphone f2(t) at a time delayed with respect to the second microphone signal.

In the following examples the processing of the audio signals with respect to a single microphone array pair is described. However it would be understood that any suitable microphone array configuration can be scaled up from pairs of microphones where the pairs define lines or planes which are offset from each other in order to monitor audio sources with respect to a single dimension, for example azimuth or elevation, two dimensions, such as azimuth and elevation and furthermore three dimensions, such as defined by azimuth, elevation and range.

An example of estimating the direction of arrival of an audio source is described in further detail hereafter. In some embodiments the source estimator 205 comprises a framer. The framer can be configured to receive the audio signals from the microphones and divide the digital format signals into frames or groups of audio sample data. In some embodiments the framer can furthermore be configured to window the data using any suitable windowing function. The framer can be configured to generate frames of audio signal data for each microphone input wherein the length of each frame and a degree of overlap of each frame can be any suitable value. For example in some embodiments each audio frame is 20 milliseconds long and has an overlap of 10 milliseconds between frames. The framer can be configured to output the frame audio data to a Time-to-Frequency Domain Transformer. In some embodiments the source estimator 205 comprises a Time-to-Frequency Domain Transformer. The Time-to-Frequency Domain Transformer can be configured to perform any suitable time-to-frequency domain transformation on the frame audio data. In some embodiments the Time-to-Frequency Domain Transformer can be a Discrete Fourier Transformer (DTF). However the Transformer can be any suitable Transformer such as a Discrete Cosine Transformer (DCT), a Modified Discrete Cosine Transformer (MDCT), or a quadrature mirror filter (QMF). The Time-to-Frequency Domain Transformer can be configured to output a frequency domain signal for each microphone input to a sub-band filter.

In some embodiments the source estimator 205 comprises a sub-band filter. The sub-band filter can be configured to receive the frequency domain signals from the Time-to-Frequency Domain Transformer for each microphone and divide each microphone audio signal frequency domain signal into a number of sub-bands.

The sub-band division can be any suitable sub-band division. For example in some embodiments the sub-band filter can be configured to operate using psycho- acoustic filtering bands. The sub-band filter can then be configured to output each domain range sub-band to a direction analyser.

In some embodiments the source estimator 205 can comprise a direction analyser. The direction analyser can in some embodiments be configured to select a sub- band and the associated frequency domain signals for each microphone of the sub-band.

The direction analyser can then be configured to perform directional analysis on the signals in the sub-band. The directional analyser can be configured in some embodiments to perform a cross correlation between the microphone pair sub- band frequency domain signals.

In the direction analyser the delay value of the cross correlation is found which maximises the cross correlation product of the frequency domain sub-band signals. This delay time value can in some embodiments be used to estimate the angle or represent the angle from the dominant audio signal source for the sub- band. This angle can be defined as a. It would be understood that whilst a pair or two microphones can provide a first angle, an improved directional estimate can be produced by using more than two microphones and preferably in some embodiments more than two microphones on two or more axes. Specifically in some embodiments this direction analysis can be defined as receiving the audio sub-band data. The direction analyser received the sub-band data;

Xt^' (n) = X_k(n_b + n), n = 0₍ ... , - ¾ - 1, b = 0, ..... _> i? - i where _¾ is the first index of 6th subband. In some embodiments for every subband the directional analysis is performed as described herein. First the direction is estimated with two channels (For example using microphones 2 and 3). The direction analyser finds delay _¾ that maximizes the correlation between the two channels for subband b. DFT domain representation of e.g. f GO can be shifted r_b time domain samples using

The optimal delay in some embodiments can be obtained from

where Re indicates the real part of the result and ^* denotes complex conjugate .J5f| and are considered vectors with length of samples.

The direction analyser can in some embodiments implement a resolution of one time domain sample for the search of the delay. In some embodiments the direction analyser with the delay information generates a sum signal. The sum signal can be mathematically defined as.

In other words the direction analyser is configured to generate a sum signal where the content of the channel in which an event occurs first is added with no modification, whereas the channel in which the event occurs later is shifted to obtain best match to the first channel.

It would be understood that the delay or shift ^■ _¾ indicates how much closer the sound source is to the microphone 2 than microphone 3 (when _¾ is positive sound source is closer to microphone 2 than microphone 3). The direction analyser can be configured to determine actual difference in distance as

where Fs is the sampling rate of the signal and v is the speed of the signal in air. The angle of the arriving sound is determined by the direction analyser as,

^A23 + ²^_m ^A23

d_b = + cos -<

²d_md_sm where dm is the distance between the pair of microphones and d_sm is the estimated distance between sound sources and nearest microphone. In some embodiments the direction analyser can be configured to set the value of d_sm to a fixed value. For example dsm = 2 meters has been found to provide stable results. It would be understood that the determination described herein provides two alternatives for the direction of the arriving sound as the exact direction cannot be determined with only two microphones.

In some embodiments the directional analyser can be configured to use audio signals from a third channel or the third microphone to define which of the signs in the determination is correct. The distances between the third channel or microphone and the two estimated sound sources are:

S_b = J(h - d_sm sin (a ))² + [ ^d _i + d_sm cos(d )

where h is the height of the equilateral triangle, i.e.

The distances in the above determination can be considered to be equal to delays (in samples) of;

Where v is the speed of sound. Out of these two delays the direction analyser in some embodiments is configured to select the one which provides better correlation with the sum signal. The correlations can for example be represented as

= R V The directional analyser can then in some embodiments then determine the direction of the dominant sound source for subband b as: f £¾ C_b > C_h

⁽X- = ϊ - .

{-¾ t < c In some embodiments the source estimator 205 further comprises a mid/side signal generator. Following the directional analysis, the mid/side signal generator can be configured to determine the mid and side signals for each sub-band. The main content in the mid signal is the dominant sound source found from the directional analysis. Similarly the side signal contains the other parts or ambient audio from the generated audio signals. In some embodiments the mid/side signal generator can determine the mid M and side S signals for the sub-band according to the following equations:

It is noted that the mid signal M is the same signal that was already determined previously and in some embodiments the mid signal can be obtained as part of the direction analysis. The mid and side signals can be constructed in a perceptually safe manner such that the signal in which an event occurs first is not shifted in the delay alignment. The mid and side signals can be determined in such a manner in some embodiments that it is suitable where the microphones are relatively close to each other. Where the distance between the microphones is significant in relation to the distance to the sound source then the mid/side signal generator can be configured to perform a modified mid and side signal determination where the channel is always modified to provide a best match with the main channel. In some embodiments the processing apparatus comprises a spatial processor 207. The spatial processor 207 can in some embodiments receive the audio signals from the microphone array 1 1 , the classifier information from the classifier 201 and the estimated source location from the source estimator 205. The spatial processor 207 can be configured to process the audio signals from the microphones 1 1 based on the source location. For example in some embodiments the spatial processor 207 can be configured to process the audio signals such that the direction of the audio signals is reversed. In such embodiments there can be a simple front-rear reversal, in other words where the audio signals arrive from the rear of the vehicle or car they are repositioned to be at the front of the vehicle or car and vice versa. In some embodiments there can be left-right reversal, in other words where the voice signals appear to originate from a source on the left of the car they are processed to be on the right side of the car and vice versa. In some embodiments there can be a sectorized reversal, in other words where the source is determined to be from a single sector (for example the front left of the vehicle or car) then the source is processed to be output from all of the sectors other than the original sector (for example all of the directions other than front left - such as front right, rear left and rear right). Although this has been discussed as a reversal operation it would be understood that any suitable directional displacement of the voice source can be performed.

For example in some embodiments the spatial processor 207, having received the mid, side signals and the direction of arrival information can be configured to change the direction of arrival to the opposite direction.

Furthermore in some embodiments the mid, side and direction of arrival information for the speech audio signals can then be rendered into the number of channels for output by the speaker system and these rendered audio channels passed to an amplifier.

The processed audio signals can then be output to a multichannel amplifier 209. In some embodiments the spatial processor generates the suitable multichannel audio signals to be output to the speaker array 33 directly. In other words in some embodiments the spatial processor can be configured to resolve the directional components of the audio signals to suitable speaker channel audio signal forms.

In such embodiments once the DOA estimate for the received speech signal is obtained, the signal processor and amplifiers effectively apply a decision to which audio speakers are used for outputting the signal. In some such embodiments the rear speakers are used for the speech spoken in the front seats, and vice versa. This means that in some embodiments only a rough DOA estimate is required for making the decision.

The operation of processing the audio signals based on the source location is shown in Figure 4 by step 307.

In some embodiments the processing apparatus comprises a multichannel amplifier 209. The multichannel amplifier 209 is configured to receive the output of the spatial processor 207, either in processed audio source or suitable resolved multichannel audio format and generate suitably amplified channel outputs which can be passed to the speaker array 33. In some embodiments the speech/processed audio signal is amplified using the regular car playback system, to which the microphone array/processor apparatus is integrated to. The amount of amplification in some embodiments is automatically set to a default value based on the level of background noise (in some embodiments this background noise level can be determined as the level of microphone signals before the obtained speech input). In some embodiments the amplification or processing level can be adjusted manually during the conversation (for example in the same manner as if the volume of the car radio is adjusted).

The operation of amplifying the audio signals and outputting them to the speakers is shown in Figure 4 by step 309.

In some embodiments the spatial processor or suitable signal processing means processes the audio signals based on the location associated with the at least one speech component to generate at least one output audio signal by being configured to generate a combined/selected audio signal based on the at least two audio signals and at least one speech component. Furthermore in some embodiments the spatial processor itself can be configured to amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location based on the location associated with the at least one speech component. In a manner as described above the amplification of the combined/selected audio signal can in some embodiments amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location different to the location associated with the at least one speech component (for example if the speech is from the rear of the car the amplified signal is passed to the loudspeakers elsewhere in the car). In some embodiments amplifying the combined selected audio signal can in some embodiments amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location opposite to the location associated with the at least one speech component (for example if the speech is from the rear of the car the amplified signal is passed to the loudspeakers in the front of the car). In some embodiments amplifying the combined/selected audio signal can in some embodiments amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location longitudinally opposite to the location associated with the at least one speech component (for example if the speech is from the rear of the car the amplified signal is passed to the loudspeakers in the front of the car and vice versa). In some embodiments amplifying the combined/selected audio signal can in some embodiments amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location laterally opposite to the location associated with the at least one speech component (for example if the speech is from the left of the car the amplified signal is passed to the loudspeakers to the right of the car and vice versa).

It would be understood that in some embodiments the processing of the audio signals generated by the microphones based on the location of the at least one speech component can in some embodiments cause the spatial processor to determine a location parameter based on the at least two audio signals and at least one speech component (for example by receiving it from the source estimator or other suitable analyser). Furthermore, the spatial processor can in some embodiments be configured to process the location parameter based on the location associated with the at least one speech component to generate an output location parameter. Furthermore in such embodiments the spatial processor can be configured to render at least one output audio signal based on the output location parameter. The modification of the location parameter can as described herein be any suitable modification and in some embodiments be a different location or sector, an opposite location or sector, a laterally opposite or different location or sector, or a longitudinally opposite or different location or sector.

In some embodiments the apparatus comprises an output such as the speaker array 33 shown in Figure 3. The speaker array 33 can be configured to receive the audio signals from the multichannel amplifier 209 and output the processed audio signals.

Outputting the processed audio signals via speakers is shown in Figure 4 by step 31 1 .

In some embodiments the spatial processor 207 can further be configured to receive further inputs. For example in some embodiments the spatial processor 207 can be configured to receive an audio signal or audio signals from an audio source such as a radio receiver, CD input, MP3 or other suitable audio input. The audio source 203 can be passed to the spatial processor 207 to be processed and mixed into the audio signals being output. For example in some embodiments the spatial processor can be configured to suppress the audio source audio signals for directions where the spatial processor is configured to emphasise or amplify the speech audio signals. In other words in some embodiments where the apparatus determines that there are voice signals from the rear of the vehicle or car the spatial processor can suppress the audio source audio signals being output to the front of the car to effectively further emphasise the voice audio signals being passed to the front of the car.

In some embodiments the spatial processor 207 can further receive a user input from a user input device 15/1 15. The user input can effectively control the level of emphasis provided by the spatial processor 207 and further enable or disable the spatial processing of the voice signals.

In some embodiments the complexity of the processing is chosen such that the delay between the actual speech and the signal output from the speakers does not become disturbingly notable. It would be understood that as found in any digital audio system, some amount of latency is caused due to the analogue-to-digital (AD) (and possible digital-to-analogue (DA)) conversions performed for analyzing the captured audio. However, the amount of delay caused by the converters is small. Typically, AD and DA converters together delay the signal by considerably less than 5 ms, and it is generally agreed that most people's perceptions are not sharp enough to notice a delay of this order.

After the conversion, the DOA estimation, which is a real-time process is performed. In combination with the classifier such as a SVM-based speech signal recognition the processing delay may increase. In some embodiments where the delay should become too long for a convenient use, a low-complexity and low- delay VAD algorithm can be applied for the speech detection. For example, the G.729 Annex B algorithm makes VAD decisions frame by frame, allowing a moderate algorithmic delay of 10ms.

The actual playback of the speech can be performed in any suitable manner for example by sending the acoustic analogue signal directly to the speakers. This causes no additional delay to the signal path. For comparison, an audio delay of about 25ms or more is heard as an echo or reverb, meaning that the proposed system is able to avoid causing such an annoying effect inside the vehicle or car cabin. With respect to Figure 5, a flow diagram of an example operation of the processing apparatus performing speech enhancement is shown.

The processing apparatus, for example the classifier 201 can be configured to receive the input signals captured by the microphone array and analyse the audio content of the audio signals. The analysis can as discussed herein be in the form of determining whether the audio signals contain speech or not.

The operation of analysing the audio content to determine whether the audio content contains speech is shown in Figure 5 by step 401 .

Where the audio content does not contain speech then the operation is shown in Figure 5 as looping back on itself to further analyse the audio content. Where the audio content is analysed and determined to contain speech then the apparatus, for example the source estimator, can be configured to estimate the direction of arrival of the voice audio sources.

The operation of estimating the direction of arrival of the speech source is shown in Figure 5 by step 403.

In the example shown herein where the speech audio source is determined as coming from the front seats then the spatial processor and multichannel amplifier can be configured to amplify the input signal such that it is output to the rear speakers.

The operation of determining that the direction of arrival is from the front seats and therefore outputting an amplified audio signal using the rear speakers is shown in Figure 5 by step 405.

Where the speech audio source direction of arrival is determined to come from the back seats then the spatial processor and multichannel amplifier can be configured to amplify the input signal and output the amplified audio signal to the front speakers.

The operation of determining that the direction of arrival is from the back seats and outputting an amplified audio signal using the front speakers is shown in Figure 5 by step 407.

Furthermore the spatial processor can further determine whether or not the radio (or other suitable audio source) is on and configured to output audio signal.

The operation of determining whether the radio (or other audio source) is on is shown in Figure 5 by step 409.

Where the radio (or other audio source) is not on then the operation can pass back to the initial operation of analysing the audio content.

Where the radio (or other audio source) is determined to be on then the processor can be configured to attenuate the radio (or other audio source) signal. The operation of attenuating the radio or audio source signal is shown in Figure 5 by step 41 1 .

With respect to Figure 6 the operations of the classifier 201 and source estimator 205 are shown in further detail. The classifier 201 shown with respect to Figure 6 is a supervised classifier and are defined by a pre-calibration or training phase and an actual or operational phase.

The classifier 201 can for example be configured to receive training utterances provided by the users.

The operation of receiving training utterances provided by the users is shown in Figure 6 by step 597. The classifier 201 can in some embodiments extract features from these training utterances. These features may be any suitable (audio) feature or features.

The operation of extracting features from the training utterances is shown in Figure 6 by step 598.

The classifier 201 can then be configured to train a suitable SVM using the extracted features. This trained SVM values can then be passed to a classification operation for actual use.

The operation of training the SVM (and optionally fitting a hyperplane to separate speech from the other audio signals) is shown in Figure 6 by step 599.

In the actual operation phase the classifier can be configured to receive the microphone channel inputs.

The operation of receiving the microphone channel inputs is shown in Figure 6 by step 501 . In order to classify the received audio signals with the classifier 201 , it is then needed to extract features from the microphone channel inputs.

The operational extraction of features from the microphone channel inputs is shown in Figure 6 by step 503.

These extracted features can then be employed to classify the audio signals based on or with respect to the trained SVM data and therefore to generate information of whether or not the audio signals contain speech. The operation of classifying is shown in Figure 6 by step 505.

The source estimator can then be triggered in some embodiments by the classification determination indicating that the audio signals comprise speech to perform a cross-correlation between microphone channels to determine time delays between the microphones audio signals.

The operation of performing cross-correlation between microphone channels is shown in Figure 6 by step 507.

Furthermore the audio signals can be divided into a number of frequency sub bands (b). The division of the audio signal into (b) frequency subbands is shown in Figure 6 by step 509.

Then source estimator can then be configured to perform a direction of arrival estimation for each of the (b) frequency subbands based on the determined time delays.

The operation of determining the direction of arrival estimate for each (b) frequency subband based on the time delays is shown in Figure 6 by step 51 1 . It would be understood that in some embodiments the spatial processor further comprises an adaptive echo cancellation processor or means for adaptive echo cancellation employed such that whenever an amplified speech signal is played using the car speakers, the adaptive echo cancellation is applied to prevent the replayed audio from being re-captured and re-amplified. An example of an adaptive echo cancellation processor is shown in Figure 7 should be applied. In such embodiments a filtered version of the originally spoken speech signal, r(t), is subtracted from the signal captured by the microphones 1 1 , y(t). This is performed with a dedicated adaptive filter h 603, which models the car speakers 33, the microphones 1 1 and the acoustical attributes of the car cabin. The filter coefficients are trained using the error signal e(t) = y(t) - r(t), such that the adaptive echo cancellation processor adapts to the environment over time. In practice, however, with a static car environment these coefficients can be pre- adjusted to match with the microphones, speakers and the car cabin dimensions, such that only fine tuning needs to be performed while the system is in use.

In some embodiments multiple voice sources can be determined and processed. These "double talk" situations can occur where there are persons speaking simultaneously in the front and back seats. In such embodiments the subband- based DOA estimation approach can be configured to detecting whenever more than one voice source (speaker) exists in the car. This can for example be determined by estimating a separate DOA for each frequency subband, enabling direction estimates to be obtained for both from the front and back seat voice sources simultaneously. In such embodiments the DOA estimate of a particular subband is obtained based on which one of the sources possesses higher energy at that frequency band. In such embodiments whenever there are DOA estimates clearly obtained from two substantially different directions at the same time, a double talk mode can be activated. In some embodiments the determination of two or more substantially different directions at the same time requires the definition of threshold values. For example, where there are 32 subbands being analysed, a double talk mode could be activated whenever there are 5 or more subbands where DOA estimates substantially differ from the other or majority of subband DOA estimates.

Following the activation of the "double talk" mode there can be a determination of whether one or more than one sets of speakers should be used to output the voice audio signals (in other words amplified). For example in some embodiments all conversations (voice sources) that are received by the microphones are processed and output using all of the speakers in the car. This 'all' speaker output would ensure that everything that is said inside the car is audible for everyone. In some embodiments the processor does not enhance any of the voice sources or in other words is prevented from amplifying the signal at all in case of multiple persons speaking at the same time, as there might be two separate conversations ongoing in the front and back seats. In some embodiments the processing can be user defined to enable the users to select the conversation they wish to listen to or to select the seat to talk to. In other words the processing of the voice sources can be manually selected by the users depending on the situation.

In some embodiments where at least one of the microphones and/or speakers employed are mobile devices the positions of the devices are determined. I some embodiments this can be done by mechanical means - such as inserting the mobile device into a suitable docking station or by any suitable locating means such as distance, direction determination means from a known locus.

In some embodiments where the vehicle is particularly large, for example in large cars, minivans, buses, then the source estimator can further determine the (voice) sound source distance estimate which can then be used by the spatial processor.

In some embodiments the user input from the user interface or other input or notification can further control the spatial processing of the voice audio signals. For example the processing and amplification scheme can in some embodiments be controlled and therefore modified based on the 'situation' within the car. For example, speech from specific direction can be discarded where the person is having a private phone call. Similarly one or more of the car speakers can be muted whenever there is a person sleeping or otherwise not participating to the on-going conversation. Although the embodiments described herein have been based on vehicles and specifically car based voice emphasis processing it would be understood that some other embodiments are applicable to other acoustic spaces where background noise or other audio disturbances cause difficulties for people to follow discussions and participate in the discussion. For example the acoustic space could be any 'noisy' room. Examples of such 'noisy' rooms are kitchens, restaurants, factory halls. In such embodiments the microphones, speakers and processing unit can be integrated within the acoustic space infrastructure or be deployable or temporary elements located within the acoustic space. In the description herein the components can be considered to be implementable in some embodiments at least partially as code or routines operating within at least one processor and stored in at least one memory.

It shall be appreciated that the term user equipment is intended to cover any suitable type of wireless user equipment, such as mobile telephones, portable data processing devices or portable web browsers.

Furthermore elements of a public land mobile network (PLMN) may also comprise apparatus as described above.

In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.

The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD. The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples. Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.

Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.

The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

Claims

CLAIMS:

1 . Apparatus comprising at least one processor and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured to with the at least one processor cause the apparatus to:

receive at least two audio signals from at least two microphones located within an acoustic space;

determine the at least two audio signals comprise at least one speech component;

determine a location associated with the at least one speech component; process the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal; output the at least one output audio signal within the acoustic space, such that the at least one speech component is enhanced for a location within the acoustic space other than the location associated with the at least one speech component.

2. The apparatus as claimed in claim 1 , wherein determining the at least two audio signals comprise at least one speech component causes the apparatus to: classify the at least two audio signals according to whether the at least two audio signals comprise a speech or non-speech component;

determine at least one speech component common to the at least two audio signals.

3. The apparatus as claimed in any of claims 1 and 2, wherein determining the at least two audio signals comprise at least one speech component causes the apparatus to:

receive at least two audio signals comprising a speech component originating from a speech source;

receive an indication that the at least two audio signals comprise the speech component; train a classifier based on the indication with the at least two audio signals comprising a speech component originating from a speech source to classify successive at least two audio signals comprising a similar speech component as a speech component.

4. The apparatus as claimed in any of claims 1 to 3, wherein determining a location associated with the at least one speech component causes the apparatus to:

analyse the at least two audio signals to determine at least one time delay for a common element of the at least one speech component;

determine a location direction relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element.

5. The apparatus as claimed in claim 4, wherein determining a location associated with the at least one speech component causes the apparatus to:

analyse a further pair of the at least two audio signals to determine at least one further time delay for a common element of the at least one speech component;

determine a location distance relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element and the at least one further time delay for a common element of the at least one speech component.

6. The apparatus as claimed in any of claims 1 to 5, wherein processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal causes the apparatus to:

generate a combined/selected audio signal based on the at least two audio signals and at least one speech component;

amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location based on the location associated with the at least one speech component.

7. The apparatus as claimed in claim 6, wherein amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker located based on the location associated with the at least one speech component causes the apparatus to perform at least one of:

amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location different to the location associated with the at least one speech component;

amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location opposite to the location associated with the at least one speech component;

amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location longitudinally opposite to the location associated with the at least one speech component;

amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location laterally opposite to the location associated with the at least one speech component.

8. The apparatus as claimed in any of claims 1 to 5, wherein processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal causes the apparatus to:

determine a location parameter based on the at least two audio signals at least one speech component;

process the location parameter based on the location associated with the at least one speech component to generate an output location parameter;

render at least one output audio signal based on the output location parameter.

9. The apparatus as claimed in claim 8, wherein processing the location parameter based on the location associated with the at least one speech component to generate an output location parameter causes the apparatus to perform at least one of: modify the location parameter to generate the output location parameter at a location different to the location associated with the at least one speech component;

modify the location parameter to generate the output location parameter at a location opposite to the location associated with the at least one speech component;

modify the location parameter to generate the output location parameter at a location longitudinally opposite to the location associated with the at least one speech component;

modify the location parameter to generate the output location parameter at a location laterally opposite to the location associated with the at least one speech component.

10. The apparatus as claimed in any of claims 1 to 9, wherein the apparatus is caused to determine a parameter associated with the situation of the acoustic space and wherein processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal further causes the apparatus to process the at least two audio signals based on the parameter associated with the situation of the acoustic space.

1 1 . An apparatus comprising:

means for receiving at least two audio signals from at least two microphones located within an acoustic space;

means for determining the at least two audio signals comprise at least one speech component;

means for determining a location associated with the at least one speech component;

means for processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal;

means for outputting the at least one output audio signal within the acoustic space, such that the at least one speech component is enhanced for a location within the acoustic space other than the location associated with the at least one speech component.

12. The apparatus as claimed in claim 1 1 , wherein the means for determining the at least two audio signals comprise at least one speech component comprises: means for classifying the at least two audio signals according to whether the at least two audio signals comprise a speech or non-speech component;

means for determining at least one speech component common to the at least two audio signals.

13. The apparatus as claimed in any of claims 1 1 and 12, wherein the means for determining the at least two audio signals comprise at least one speech component comprises:

means for receiving at least two audio signals comprising a speech component originating from a speech source;

means for receiving an indication that the at least two audio signals comprise the speech component;

means for training a classifier based on the indication with the at least two audio signals comprising a speech component originating from a speech source to classify successive at least two audio signals comprising a similar speech component as a speech component.

14. The apparatus as claimed in any of claims 1 1 to 13, wherein the means for determining a location associated with the at least one speech component comprises:

means for analysing the at least two audio signals to determine at least one time delay for a common element of the at least one speech component;

means for determining a location direction relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element.

15. The apparatus as claimed in claim 14, wherein the means for determining a location associated with the at least one speech component comprises: means for analysing a further pair of the at least two audio signals to determine at least one further time delay for a common element of the at least one speech component;

means for determining a location distance relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element and the at least one further time delay for a common element of the at least one speech component.

16. The apparatus as claimed in any of claims 1 1 to 15, wherein the means for processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal comprises:

means for generating a combined/selected audio signal based on the at least two audio signals and at least one speech component;

means for amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location based on the location associated with the at least one speech component.

17. The apparatus as claimed in claim 16, wherein the means for amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker located based on the location associated with the at least one speech component comprises at least one of:

means for amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location different to the location associated with the at least one speech component;

means for amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location opposite to the location associated with the at least one speech component;

means for amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location longitudinally opposite to the location associated with the at least one speech component; means for amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location laterally opposite to the location associated with the at least one speech component.

18. The apparatus as claimed in any of claims 1 1 to 15, wherein the means for processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal comprises:

means for determining a location parameter based on the at least two audio signals at least one speech component;

means for processing the location parameter based on the location associated with the at least one speech component to generate an output location parameter;

means for rendering at least one output audio signal based on the output location parameter.

19. The apparatus as claimed in claim 18, wherein the means for processing the location parameter based on the location associated with the at least one speech component to generate an output location parameter comprises at least one of:

means for modifying the location parameter to generate the output location parameter at a location different to the location associated with the at least one speech component;

means for modifying the location parameter to generate the output location parameter at a location opposite to the location associated with the at least one speech component;

means for modifying the location parameter to generate the output location parameter at a location longitudinally opposite to the location associated with the at least one speech component;

means for modifying the location parameter to generate the output location parameter at a location laterally opposite to the location associated with the at least one speech component.

20. The apparatus as claimed in any of claims 1 1 to 19, further comprising means for determining a parameter associated with the situation of the acoustic space and wherein the means for processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal further comprises means for processing the at least two audio signals based on the parameter associated with the situation of the acoustic space.

21 . Apparatus comprising:

an input configured to receive at least two audio signals from at least two microphones located within an acoustic space;

a classifier configured to determine the at least two audio signals comprise at least one speech component;

a source estimator configured to determine a location associated with the at least one speech component;

a processor configured to process the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal;

an output configured to output the at least one output audio signal within the acoustic space, such that the at least one speech component is enhanced for a location within the acoustic space other than the location associated with the at least one speech component.

22. The apparatus as claimed in claim 21 , wherein the classifier is configured to:

classify the at least two audio signals according to whether the at least two audio signals comprise a speech or non-speech component;

23. The apparatus as claimed in any of claims 21 and 22, wherein the classifier is configured to: receive at least two audio signals comprising a speech component originating from a speech source;

receive an indication that the at least two audio signals comprise the speech component;

train the classifier based on the indication with the at least two audio signals comprising a speech component originating from a speech source to classify successive at least two audio signals comprising a similar speech component as a speech component.

24. The apparatus as claimed in any of claims 21 to 23, wherein the source estimator comprises:

a correlator configured to analyse the at least two audio signals to determine at least one time delay for a common element of the at least one speech component;

a location determiner configured to determine a location direction relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element.

25. The apparatus as claimed in claim 24, wherein the source estimator comprises:

the correlator configured to analyse a further pair of the at least two audio signals to determine at least one further time delay for a common element of the at least one speech component;

the location determiner configured to determine a location distance relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element and the at least one further time delay for a common element of the at least one speech component.

26. The apparatus as claimed in any of claims 21 to 25, wherein the processor comprises:

a signal combiner configured to generate a combined/selected audio signal based on the at least two audio signals and at least one speech component; an selective multichannel amplifier configured to amplify the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location based on the location associated with the at least one speech component.

27. The apparatus as claimed in claim 26, wherein the selective multichannel amplifier is configured to perform at least one of:

28. The apparatus as claimed in any of claims 21 to 25, wherein the processor comprises:

a location parameter determiner configured to determine a location parameter based on the at least two audio signals at least one speech component; a spatial processor configured to process the location parameter based on the location associated with the at least one speech component to generate an output location parameter;

a signal renderer configured to render at least one output audio signal based on the output location parameter.

29. The apparatus as claimed in claim 28, wherein the spatial processor is configured to perform at least one of: modify the location parameter to generate the output location parameter at a location different to the location associated with the at least one speech component;

30. The apparatus as claimed in any of claims 21 to 29, wherein the apparatus comprises an acoustic space determiner configured to determine a parameter associated with the situation of the acoustic space and wherein the processor is configured to process the at least two audio signals based on the parameter associated with the situation of the acoustic space.

31 . A method comprising:

receiving at least two audio signals from at least two microphones located within an acoustic space;

determining the at least two audio signals comprise at least one speech component;

determining a location associated with the at least one speech component; processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal;

outputting the at least one output audio signal within the acoustic space, such that the at least one speech component is enhanced for a location within the acoustic space other than the location associated with the at least one speech component.

32. The method as claimed in claim 31 , wherein determining the at least two audio signals comprise at least one speech component comprises:

classifying the at least two audio signals according to whether the at least two audio signals comprise a speech or non-speech component;

determining at least one speech component common to the at least two audio signals.

33. The method as claimed in any of claims 31 and 32, wherein determining the at least two audio signals comprise at least one speech component comprises: receiving at least two audio signals comprising a speech component originating from a speech source;

receiving an indication that the at least two audio signals comprise the speech component;

training a classifier based on the indication with the at least two audio signals comprising a speech component originating from a speech source to classify successive at least two audio signals comprising a similar speech component as a speech component.

34. The method as claimed in any of claims 31 to 33, wherein determining a location associated with the at least one speech component comprises:

analysing the at least two audio signals to determine at least one time delay for a common element of the at least one speech component;

determining a location direction relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element.

35. The method as claimed in claim 34, wherein determining a location associated with the at least one speech component comprises:

analysing a further pair of the at least two audio signals to determine at least one further time delay for a common element of the at least one speech component;

determining a location distance relative to the at least two microphones based on the at least one time delay for a common element of the at least one speech element and the at least one further time delay for a common element of the at least one speech component.

36. The method as claimed in any of claims 31 to 35, wherein processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal comprises:

generating a combined/selected audio signal based on the at least two audio signals and at least one speech component;

amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location based on the location associated with the at least one speech component.

37. The method as claimed in claim 36, wherein amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker located based on the location associated with the at least one speech component comprises at least one of:

amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location different to the location associated with the at least one speech component;

amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location opposite to the location associated with the at least one speech component;

amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location longitudinally opposite to the location associated with the at least one speech component;

amplifying the combined/selected audio signal to generate the at least one output audio signal to be output by a speaker at a location laterally opposite to the location associated with the at least one speech component.

38. The method as claimed in any of claims 31 to 35, wherein processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal comprises: determining a location parameter based on the at least two audio signals at least one speech component;

processing the location parameter based on the location associated with the at least one speech component to generate an output location parameter; rendering at least one output audio signal based on the output location parameter.

39. The method as claimed in claim 38, wherein processing the location parameter based on the location associated with the at least one speech component to generate an output location parameter comprises at least one of: modifying the location parameter to generate the output location parameter at a location different to the location associated with the at least one speech component;

modifying the location parameter to generate the output location parameter at a location opposite to the location associated with the at least one speech component;

modifying the location parameter to generate the output location parameter at a location longitudinally opposite to the location associated with the at least one speech component;

modifying the location parameter to generate the output location parameter at a location laterally opposite to the location associated with the at least one speech component.

40. The method as claimed in any of claims 31 to 39, further comprising determining a parameter associated with the situation of the acoustic space and wherein processing the at least two audio signals based on the location associated with the at least one speech component to generate at least one output audio signal further comprises processing the at least two audio signals based on the parameter associated with the situation of the acoustic space.