
GB2487399A - Audio signal synthesis - Google Patents


Info

Publication number
GB2487399A
Authority
GB
United Kingdom
Prior art keywords
waveforms
sound
waveform
output
sound reproduction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB201100983A
Other versions
GB201100983D0 (en)
GB2487399B (en)
Inventor
Lionel Le Scolan
Sebastien Lasserre
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to GB201100983A priority Critical patent/GB2487399B/en
Publication of GB201100983D0 publication Critical patent/GB201100983D0/en
Publication of GB2487399A publication Critical patent/GB2487399A/en
Application granted granted Critical
Publication of GB2487399B publication Critical patent/GB2487399B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/12 Circuits for transducers, loudspeakers or microphones for distributing signals to two or more loudspeakers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 Speech or audio signals analysis-synthesis techniques for redundancy reduction using spectral analysis using orthogonal transformation
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • H04R1/00 Details of transducers, loudspeakers or microphones
    • H04R1/20 Arrangements for obtaining desired frequency or directional characteristics
    • H04R1/22 Arrangements for obtaining desired frequency characteristic only
    • H04R1/26 Spatial arrangements of separate transducers responsive to two or more frequency ranges
    • H04R1/32 Arrangements for obtaining desired directional characteristic only
    • H04R1/40 Arrangements for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R1/403 Arrangements for obtaining desired directional characteristic only by combining a number of identical transducers, the transducers being loud-speakers

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Otolaryngology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Stereophonic System (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present invention discloses a method and apparatus for producing or synthesizing a sound/acoustic output from an input digital signal, comprising the steps of or means for: decomposing the digital signal into at least one waveform and at least one scale factor, the at least one scale factor being generated such that, when combined with a respective waveform from the at least one waveform, it produces a sound component of the sound output; and reproducing a respective one of said at least one waveform scaled by a respective scale factor. The reproduction is performed for a plurality of waveforms scaled by a plurality of respective scale factors using a plurality of sound reproduction units, such as loudspeakers, to output a superimposed sound field. The waveforms may be time and frequency localized and/or they may be predetermined.

Description

Acoustical Synthesis

The present invention relates to a method and apparatus for synthesizing a sound signal from a digital audio signal.
In the development of loudspeakers for playing sounds such as music, an aim of loudspeaker manufacturers has always been to improve the quality of the sound. Specifically, the aim has been to reproduce faithfully an original sound created by instruments or voices, for example.
Loudspeakers have classically been in the form of electrodynamic transducers as drivers that convert electrical signals into sound waves. There are two main areas in which development of the sound reproduction by loudspeakers has been concentrated. The first area is the frequency domain and the second area is the time domain.
With respect to the frequency domain, in order to reproduce a wide range of frequencies, most loudspeaker systems employ more than one driver. A large range of drivers is particularly useful when the sound pressure level is high (i.e. when the sound is louder) or when the reproduction of the range of frequencies is to be more accurate. In this case, separate drivers are used to reproduce different frequency ranges. The performance (i.e. the faithful reproduction of sound) of an analogue loudspeaker is typically constrained by the physical limitations of the speaker, particularly with respect to frequency response and distortion.
A large frequency response is the ability of a speaker system to reproduce a large range of frequencies (i.e. a large bandwidth). Ideally, a driver outputs a signal with a flat (i.e. smooth) frequency spectrum, so that all frequencies reproduced by the driver are reproduced at the same intensity level. However, in reality, there is often a drop-off at high and low frequencies, which means that the driver reproduces frequencies at the high and low ends of its frequency response bandwidth less well and risks losing those high and low frequencies. To overcome this, drivers are preferably used that overlap in frequency response. As it is most practical to have as few drivers as possible (to occupy as little space as possible and to have as low a cost as possible), the frequency range allocated to each driver tends to be large; i.e. individual drivers also tend to have a large frequency response. However, large-bandwidth (i.e. large frequency response) drivers are difficult to manufacture, and in particular are difficult to manufacture well so that the entire frequency spectrum (i.e. bandwidth) is reproduced with the same quality. A balance must thus be found between a practical (i.e. small) number of drivers and small enough frequency ranges per driver to enable faithful reproduction of a range of sounds.
Distortion is a sound artefact caused by a driver attempting to recreate frequencies outside of its preferred bandwidth; i.e. the driver is mechanically less able to create those frequencies and so the output sound is not faithfully reproduced. Either the driver cannot vibrate fast enough to create high frequencies or cannot physically push the air required for low frequencies. These physical limitations cause an audible distortion of the output frequency. As above, a balance between a number of drivers and allocated frequency ranges (that the drivers are capable of reproducing) must be found in known systems to obtain an optimal output sound.
Sound reproduction systems (including loudspeakers) are typically evaluated in the frequency domain and improvement in the design of sound reproduction systems is usually concentrated on obtaining a better frequency response (which is relevant for wind and string instruments, as well as voices) and reducing distortion as described above.
An alternative to the frequency domain is the time domain. With respect to the time domain, it is the superposition of sound elements in a unit of time that is taken into account, especially in the case of transients. Transients are typically short bursts of sound (i.e. lasting a short time) that are usually non-periodic and have a large number of high frequencies and few harmonics, such as those caused by percussion instruments. It is difficult to recreate such an accurate superposition of frequencies in a short time. In order to improve the reproduced sound overall, it is desirable to attempt to reproduce the intricacies of these transients of the source signal. Efforts to improve this are discussed in an example below.
Natural sound is rarely composed of pure sine waves or fixed tones, but is often made up of impulses, with various dynamic and transient effects. Both the frequency domain and the time domain are therefore useful domains in which to improve a sound reproduction system. Below will be described two specific examples of improvements in sound reproduction by observing either the frequency domain or the time domain.
The frequency domain is discussed first. Typical audio signal reproduction systems use the principle of direct transduction using electric signals as the excitation means to cause a membrane or plate to vibrate and radiate the sound.
More elaborate systems such as multi-way loudspeakers, as shown in Figure 1A, split the electrical signal 100 into several frequency bands 111, 112, 113 using a crossover network 110 to drive specialized drivers 130 (woofer 131, mid-range driver 132 and tweeter 133) that are adapted to reproduce an audio signal in specific respective frequency bands. In the example shown, the woofer 131 reproduces frequencies 111 in the range 20Hz to 700Hz, the mid-range driver 132 reproduces frequencies 112 in the range 700Hz to 5000Hz and the tweeter 133 reproduces frequencies 113 in the range 5000Hz to 20000Hz. The sound waves radiated by this limited number of drivers are recombined in a listening area 130 to reproduce the whole audio frequency band of the original signal.
Figure 1B shows a frequency-time graph for the three drivers. The top frequency band represents that of the tweeter 133. The tweeter will thus emit sound within the illustrated bandwidth constantly over time, as long as sounds in that frequency range are required. The same applies to the middle layer, which represents the frequency bandwidth of the mid-range driver 132, and the bottom layer, which represents the frequency bandwidth of the woofer. Here, the higher in pitch the frequency band, the broader that band. Each driver will not depart from its frequency bandwidth for the entire time that the sound is being played, though it will move around within that bandwidth as required by the sound to be produced. Thus, the drivers are somewhat localised in the frequency domain, but play continuously and so are not localised in the time domain.
This principle is well known and mastered by most loudspeaker makers. It globally improves the quality of the sound reproduction but it suffers from various mismatches between drivers. For example, the overlap of frequency bandwidths of neighbouring drivers might be at a set of frequencies that is not optimally reproduced by either. Alternatively or additionally, the radiation pattern of the sound from different drivers might not be the same so that different listening positions will have different intensities of sound from the different drivers.
Currently, the use of digital technology and digital signal processing (DSP), in particular the application of the fast Fourier transform (FFT) to signals, gives the ability of real-time digital filtering, digital/analogue conversion and active amplification 120, the latter using amplifiers 121. These improvements raise the quality of the signal processing, but such systems still use drivers that are arranged linearly and that each output large frequency bandwidths, which is today the main source of difficulties, as discussed above.
A time-domain approach is the generation of sound by superposition of discrete digital sound pulses. This approach is described in US Patent No. 7,089,069 from Carnegie Mellon University and is summarised in Figure 2A. This technology is called Digital Sound Reconstruction (DSR), where the sound waveform is generated from the summation of discrete pulses of acoustic energy produced by an array of speakers (or speaklets) 230. The principle is to quantify 210 (by analogue to digital conversion) the source signal 200 at regular time intervals t0, t1, t2... and to generate sound pulses (clicks) from one or more of the elementary speaklets 230. In DSR, each speaklet produces a stream of clicks.
Clicks are discrete pulses of acoustic energy and they are summed to generate the desired sound waveform. The frequency of sound is changed by having a different number of clicks (i.e. a different number of activated speaklets) per unit time, as will be discussed below with reference to Figures 2A and 2B. By having different numbers of speaklets clicking at different times, overlapping frequencies may be created to build up waveforms.
Furthermore, having plural speaklets clicking at the same frequency will increase the pressure (i.e. the sound volume) of that frequency. With DSR, louder sound is not generated by greater motion of a diaphragm (as in traditional loudspeakers), but rather by a greater number of speaklets emitting clicks simultaneously. Thus, having 50 then 20 then 0 then 45 then 35 speaklets clicking in five consecutive instances will be louder than having 10 then 4 then 0 then 9 then 7 speaklets clicking in five consecutive instances, but the pitch will be the same because the shape of the wave created by the speaklets will be the same (just with 1/5th of the amplitude).
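As a rough numerical illustration of this counting principle, consider the following sketch (the quantiser shown is an assumption for illustration, not the patented DSR algorithm):

```python
import numpy as np

def speaklet_counts(samples: np.ndarray, n_speaklets: int) -> np.ndarray:
    # Map each signed sample to a signed count of simultaneously
    # clicking speaklets (the sign would select click polarity).
    peak = np.max(np.abs(samples))
    return np.round(samples / peak * n_speaklets).astype(int)

counts = speaklet_counts(np.sin(np.linspace(0, 2 * np.pi, 8)), 50)

# The worked numbers above: 50-20-0-45-35 clicks trace the same wave
# shape as 10-4-0-9-7 clicks, at five times the amplitude (volume).
loud = np.array([50, 20, 0, 45, 35])
quiet = np.array([10, 4, 0, 9, 7])
assert np.array_equal(loud, 5 * quiet)
```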
In Figure 2A, a single speaklet 232 will click with an output 242 that is shown in graph 210 at times t0 and t2. At time t0, output 242 of speaklet 232 is superimposed with clicks 244, 246 and 248 of other speaklets 234, 236 and 238.
Thus, by varying the number of speaklets clicking at the same time, an output waveform of varying frequency and volume may be generated as shown.
Figure 2B shows a frequency-time graph representing the action of the speaklets. Each speaklet emits a click or pulse of the same frequency, and superposition of the speaklets can (in theory at least) produce any frequency desired, but each click lasts only a very short time and so the localisation of this method is in the time domain. Thus with DSR, the time-varying sound level is not generated by a time-varying diaphragm motion as in classic transducers, but rather by time-varying numbers of speaklets emitting clicks. The speaklets effectively act as the individual building blocks of sound by being able to vary the frequency of their own clicking and the relationship of their clicking with the clicking occurring around them to build up a complete sound signal with varying pitch and volume.
As the number of clicks varies for each unit of time, this method is considered to be an improvement in the time domain. The main problem with this type of system is that the superposition of the clicks occurs at a limited listening position, which is very close to the speaklets. DSR is therefore more suitable for speakers to be held next to a person's ear (such as in headphones or hearing aids), rather than for loudspeakers that emit sound into a room or even larger spaces.
What has not been considered in the art is whether both the frequency and time domains may be improved simultaneously, especially in loudspeakers.
It is desirable to provide a sound reproduction solution that works both in frequency and time domains and that improves the reproduction of natural sound including transients.
According to a first aspect of the invention, there is provided an acoustical synthesis apparatus for producing a sound output from an input digital signal, wherein the apparatus comprises: decomposing means for decomposing the digital signal into a plurality of waveforms that are each localised in frequency and time domains, the plurality of localised waveforms being distributed within at least said frequency domain, and a plurality of scale factors; and sound reproduction means for reproducing, as output sound components, each of said waveforms scaled using a respective scale factor.
Preferably, the decomposing means is arranged to supply the waveforms or an indication of the waveforms generated to at least one of the sound reproduction means and a waveform database. The decomposing means preferably decomposes the digital signal into a plurality of basic or elementary waveforms and a plurality of corresponding scale factors, the scale factors being determined so that when combined with the plurality of basic waveforms in a plurality of sound reproduction units, they reproduce the digital signal as faithfully as possible.
According to a second aspect of the invention, there is provided an acoustical synthesis apparatus for producing a sound output from an input digital signal, wherein the apparatus comprises: decomposing means for decomposing the digital signal into a plurality of predetermined waveforms and a plurality of scale factors; and sound reproduction means for reproducing, as output sound components, each of said predetermined waveforms scaled using a respective scale factor received from the decomposing means.
Further within the second aspect of the invention, there may be provided means for extracting, from a database of waveforms, waveforms equivalent to each predetermined waveform output by the decomposing means, wherein the sound reproduction means are configured to reproduce, as output sound components, each of said extracted waveforms scaled using the respective scale factor output by the decomposing means. The decomposing means in this aspect is preferably configured to derive the at least one scale factor by performing an analysis on sample blocks of the input digital signal taking into account waveforms available in the waveform database. This analysis may include applying a transform to the input digital signal; an MDCT; a 32-sample transform such that 32 different waveforms and 32 corresponding scale factors are output; and/or a 32-sample transform such that 64 waveforms and 64 scale factors are output, the 64 waveforms including 32 different waveforms, a first of each different type of waveform being offset with respect to a second of each type of waveform by half a waveform length.
Within either the first or second aspect of the invention, there may be provided, in the decomposing means, means arranged to forward information to the sound reproduction means indicating which waveforms are to be reproduced.
The sound reproduction means preferably comprises a plurality of elementary sound reproduction units, each elementary sound reproduction unit being arranged to output a different sound component, wherein the plurality of sound components, when superimposed, produces an output sound corresponding to the input digital signal. In this case, each elementary sound reproduction unit is arranged to output a different sound component with a different, limited frequency band. Furthermore, each elementary sound reproduction unit is preferably optimised to reproduce a waveform limited in frequency and time domains. The sound reproduction means advantageously comprises a plurality of gain-controlled amplifiers, each for receiving a scale factor and applying the scale factor to a respective, associated waveform. According to any embodiment described herein, the scale factor may include, if necessary, a phase difference factor to enable control of superposition positions of sound outputs of the sound reproduction means, and thus enable various arrangements of elementary sound reproduction units or a variety of optimum listening positions.
According to a third aspect of the present invention, there is provided an acoustical decomposition apparatus for decomposing an input digital signal into a plurality of waveforms that are each localised in frequency and time domains, the plurality of localised waveforms being distributed within at least said frequency domain; and a plurality of associated scale factors.
According to a fourth aspect of the present invention, there is provided an acoustical decomposition apparatus for decomposing an input digital signal into a plurality of predetermined waveforms and a plurality of associated scale factors.
In either the third or fourth aspect, the acoustical decomposition apparatus is preferably configured to forward, to a sound reproduction means comprising elementary sound reproduction units, an indication of which waveforms are to be scaled by the respective scale factors and reproduced by appropriate respective elementary sound reproduction units.
According to the fourth aspect in particular, the acoustical decomposition apparatus may comprise: reception means for receiving the input digital signal; determination means for determining a plurality of sound components making up the input digital signal; extraction means for extracting, from a waveform database, information regarding an available waveform; and generation means for generating a scale factor corresponding to said available waveform. The generation means may be configured to generate the scale factor such that, when the scale factor is output to an elementary sound reproduction unit, it causes the elementary sound reproduction unit to reproduce the available waveform according to the information extracted by the extraction means and to scale the available waveform by the generated scale factor such that the reproduced waveform corresponds to one of the plurality of sound components determined by the determination means.
According to a fifth aspect of the present invention, there is provided a sound reproduction means comprising a plurality of elementary sound reproduction units, each elementary sound reproduction unit being configured to output, as a sound component, a single scaled waveform localised in time and frequency domains. Preferably, the elementary sound reproduction units are arranged in an array such that the sound components output from all of the elementary sound reproduction units superimpose at a predetermined position in the air and/or each elementary sound reproduction unit is arranged to receive information indicating two waveforms that are the same shape but offset in time and to superimpose said two waveforms to output said two superimposed waveforms as a single output sound component.
According to a sixth aspect of the present invention, there is provided a signal containing a waveform localised in time and frequency domains. This signal may be the output of the decomposing means. This signal may contain two superimposed overlapping waveforms, the waveforms being the same but offset by half a waveform length and each being scaled with a respective scale factor.
There may be, according to a seventh aspect of the present invention, a waveform database containing at least one waveform to be output as such a signal.
According to an eighth aspect of the invention, there may be provided a signal containing a scale factor for modulating a waveform localised in time and frequency domains to produce a sound component to be superimposed in air by an elementary sound reproduction unit, the scale factor being determined by the decomposition of a digital signal into a plurality of component waveforms and a plurality of component scaling factors, the plurality of component waveforms, when recombined each with a respective scaling factor, generating a sound output corresponding to the digital signal. The signal may further contain information indicating with which waveform the scale factor is to be combined.
This signal may be output from the decomposing means.
According to a ninth aspect of the present invention, there may be provided a signal containing a waveform scaled by a scale factor to be output by a sound reproduction means as a sound component. This signal may be output from a gain-controlled amplifier to a loudspeaker.
According to a tenth aspect of the present invention, there is provided an acoustic synthesis method of producing a sound output from an input digital signal, the method comprising: decomposing the digital signal into a plurality of waveforms, each localised in time and frequency domains, the plurality of waveforms being distributed in at least the frequency domain, and at least one scale factor, the at least one scale factor being generated such that, when used to scale a respective waveform, it produces a sound component of the sound output; the method also including reproducing a respective one of said at least one waveform scaled using a respective scale factor.
Alternatively or additionally, according to an eleventh aspect of the invention, there is provided an acoustic synthesis method of producing a sound output from an input digital signal, the method comprising: decomposing the digital signal into a plurality of predetermined waveforms and a plurality of corresponding scale factors, each scale factor being generated such that, when used to scale a respective waveform, it produces a sound component of the sound output; and reproducing a respective one of said waveforms scaled using a respective scale factor. In the methods of the invention, there may further be provided the step of superimposing, in the air using a plurality of loudspeakers, a plurality of reproduced waveforms scaled according to a plurality of respective scale factors to produce the sound output corresponding to the input digital signal.
The main advantage of the embodiments is that a loudspeaker system may be created that separates frequencies of an audio signal into much smaller bands (i.e. in the form of elementary waveforms), each emitted by a specialised elementary speaker such that a more faithful sound may be produced than previously possible.
The invention will hereinbelow be described, purely by way of example, and with reference to the attached figures, in which:
Figures 1A and 1B, discussed hereinbefore, depict an overview of an audio reproduction system using a crossover filter and electrodynamic loudspeakers;
Figures 2A and 2B, discussed hereinbefore, depict an overview of a Digital Sound Reconstruction system;
Figures 3A and 3B depict a sound reproduction system according to a first embodiment of the present invention;
Figures 4A, 4B and 4C depict a first decomposition process of the first embodiment of the invention;
Figure 5 depicts a second decomposition step of the first embodiment of the invention;
Figure 6 depicts a mathematical representation of a reconstruction step of the first embodiment of the invention;
Figures 7A, 7B, 7C and 7D depict a full waveform database for use in sound reproduction;
Figure 8 depicts an elementary waveform spectrum;
Figure 9 depicts a first embodiment of the reproduction step;
Figure 10 depicts a second embodiment of the reproduction step; and
Figures 11, 12A and 12B depict arrangements of elementary sound reproduction units according to embodiments of the invention.
In order to convert an input digital audio signal into an output complex sound wave, the present invention involves two fundamental steps, which are outlined as follows. The first is a decomposition step and the second is a reconstruction or sound reproduction step.
In practical embodiments of the invention, the decomposition is performed by a processor (referred to as a decomposition unit or decomposition means) that mathematically divides an audio signal into a set of waveforms (yn,k) and a set of scale (or modulation) factors (Xk). The decomposition step is mathematically represented in Figures 4A, 4B, 4C and 5 and will be described below.
On the other hand, the reproduction step is in fact performed physically (rather than mathematically) in the air by a set of elementary sound reproduction units comprising gain-controlled amplifiers (to recombine the waveforms and scale factors) and loudspeakers (to output each scaled waveform so that the scaled waveforms combine in the air to give the reconstructed audio signal as an audible output sound). The recombined waveform is thus generated from a basic waveform according to "instructions" defined by the scale factor, the instructions including a "sound pressure" factor to adjust the volume of the output sound component or a gain control coefficient to control the gain of each waveform signal. This may be understood as a waveform (e.g. yn,k) and a scale factor (e.g. Xk) being multiplied together by an elementary gain-controlled amplifier 51 (or a set 50 of amplifiers) and emitted as a single sound component (i.e. a scaled waveform) by an elementary loudspeaker 62. The elementary loudspeaker 62 is in an array 60 of elementary loudspeakers. Each elementary loudspeaker of the array outputs a sound component that is derived from a single, scaled waveform.
The array of differently-tuned elementary loudspeakers thus emits sound components corresponding to all of the scaled waveforms and these sound components are superimposed (i.e. summed) in the air once emitted from the array of elementary loudspeakers to make up the complete sound within a "listening area" or "listening position". The reproduction of the acoustic sound therefore comprises scaling and adding of waveforms, as will be shown mathematically below.
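In signal terms, the reproduction step is a scale-and-sum operation. The following minimal sketch models it numerically (array shapes and names are assumptions for illustration; physically, the summation happens in the air, not in code):

```python
import numpy as np

def reproduce(waveforms: np.ndarray, scale_factors: np.ndarray) -> np.ndarray:
    # waveforms: (N, L) array, one elementary waveform per loudspeaker;
    # scale_factors: (N,) gains issued by the decomposition unit.
    scaled = scale_factors[:, None] * waveforms  # gain-controlled amplifiers
    return scaled.sum(axis=0)                    # superposition in the air
```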
Figure 3A illustrates the basic constituents of an embodiment of the invention. An audio signal 10 to be reproduced is sampled 14 using an analogue to digital converter (ADC) (if it is an analogue signal) to give a pulse-code modulated (PCM) signal and is routed to a decomposition unit 20. The decomposition unit 20 decomposes the sampled source audio signal 10 into two sets of data 30, 40 by applying a transform to the sample points. The two sets of data 30, 40 respectively contain elementary waveforms 31 (yn,k) and scale factors 41 (Xk). The scale factors 41 (Xk) are generated by the decomposition unit 20 such that when an elementary waveform 31 (yn,k) is recombined with its associated scale factor 41, a resultant scaled waveform 52 is generated.
In a preferred embodiment, the waveforms 31 are not then transmitted from the decomposition unit 20 to the elementary sound reproduction units 51, 62.
Rather, it is the scale factors 40 that are transmitted from the decomposition unit to the array of elementary sound reproduction units, where each elementary loudspeaker 61, 62 is built or arranged to reproduce one of the scaled waveforms 52. In order to accomplish this, the waveforms are effectively "predetermined".
This predetermination may be performed in several ways. A first embodiment has the available waveforms dictated by the capability of each elementary loudspeaker 61, 62. In this case, the main task of the decomposition unit 20 is to generate the set 40 of scale factors that will recombine with the available elementary waveforms 31 of the database 30. The loudspeakers (e.g. 62) will then obtain the scale factors (e.g. 41) from the decomposition unit 20 and extract the appropriate waveform (e.g. 31) from the database 30 to combine them to create a scaled waveform (e.g. 52). Preferably, the loudspeakers are optimised by taking the analysis method of the decomposition unit into consideration (and/or vice versa), such that the waveforms 31 that are available can be combined with scale factors 41 to give rise to the sound components 72 that most closely match the desired sound. In this way, the decomposition process associates a scale factor with a predetermined waveform and the elementary sound reproduction units each receive the appropriate scale factor that, when combined with a respective waveform (that the elementary loudspeaker is able to produce), causes the elementary sound reproduction unit to output a sound component 71,72 of the reproduced sound 70. A database 30 preferably exists that contains the predetermined waveforms that the loudspeakers are able to reproduce and the decomposition unit may have access to this database in order to "know" what waveforms are available and thus to calculate what scale factors are appropriate for the available waveforms. Alternatively, the decomposition unit may supply waveforms to this database that have been determined by the transform and the elementary sound reproduction units refer to the database to determine what waveforms are to be reproduced.
The decomposition process may alternatively or additionally dictate, according to the transforms used, what sort of waveforms and scale factors will be generated. In this case, then the decomposition unit 20 may supply 21 the waveforms or an indication of what waveforms are to be used either directly to the gain-controlled amplifier 51, to the loudspeaker 62 or to the waveform database 30. The range of waveforms that will be produced by the decomposition process can thus be stored in the waveform database 30 for access by the respective elementary sound reproduction unit.
Yet alternatively, when a waveform is not required, the scale factor for the respective sound reproduction unit may be set to zero so that all waveforms are always input into each sound reproduction unit, but are scaled down when not required.
In fact, there may be no database at all, but a program in the decomposition unit that indicates what waveforms the elementary loudspeakers are designed to reproduce. Yet alternatively, if the elementary loudspeakers are capable of reproducing a small range of waveforms, the decomposition unit could be programmed to indicate to the sound reproduction units which waveforms are to be reproduced, along with which scale factors are to be applied.
The waveforms that are the result of the decomposition by the decomposition unit (and/or that are reproduced by the elementary sound reproduction units) are preferably limited or "localised" in both the time and the frequency domains. What this means is that the waveforms all have a very small frequency range (such that the elementary sound reproduction devices will have a very small frequency response) and the waveforms will last for only a limited duration. The frequency range and time duration will both be limited by a transform and a sampling frequency that are applied to the input audio signal in the decomposition process and which will be discussed further below. In the specific embodiment described below, the waveforms are created (and thus localised) according to a Modified Discrete Cosine Transform (MDCT), in a similar way to the MP3 (MPEG Audio Layer III) decomposition that is used for the compression of audio signals.
This time and frequency localisation is illustrated in Figure 3B, which shows a frequency-time graph. As can be seen, the sound components have a defined, limited frequency bandwidth on the vertical axis and a defined, limited duration on the horizontal axis. In an embodiment where there are 32 different waveforms with different frequency ranges (as a result of the MDCT process which will be explained later), the y-axis showing frequency would be divided into 32 segments. The figure is schematically drawn and in reality there would not be such distinguished segments. Rather, there would be some overlapping between rows (i.e. between frequency bands). This is compared with Figure 1B, in which the three loudspeakers are each at the limit of their frequency response capability and there is no real overlap.
With respect to frequency localisation, what this means is that each waveform is limited in frequency as mentioned above: an ideal system would have each elementary waveform containing only a single frequency, all elementary waveforms being different from each other. However, this is impractical because the number of loudspeakers that would be needed to play all audible frequencies would be far too great. Furthermore, it is not necessary to break down the sound into thousands of frequency bands, as the human ear would not distinguish a difference in quality. Thus, the number of waveforms is chosen as an optimum between good quality sound and a practical number of loudspeakers, and the number of waveforms in turn determines the frequency range that each waveform will have, and thus the frequency range that each loudspeaker will need to be able to reproduce. For example, if the total range of frequencies is similar to the classical three-driver system described above, namely about 1 to 20000Hz, each of 32 elementary sound reproduction devices may have a frequency response of 625Hz. Thus, a first elementary sound reproduction device may have a frequency response range from 1 to 625Hz; a second may have a frequency response range from 625 to 1250Hz; a third, from 1250 to 1875Hz; a fourth from 1875 to 2500Hz and so on. This division of frequency band is shown in Figure 8. On the x-axis, 0.5 represents a frequency of 5kHz; 1.0 = 10kHz, and so on. For a frequency range between 5kHz and 10kHz, there are six or seven maxima and so there are six or seven waveforms available for this frequency range.
In terms of the time axis, the length of each division shown in Figure 3B is the length of one waveform, namely 64 samples for a 64-sample MDCT decomposition process. The duration of each time division is dependent on the sampling frequency. For audio signals (e.g. as stored on a compact disc (CD)), the sampling frequency is typically 44.1kHz. For an audiovisual signal as stored on a digital versatile disc (DVD), the sampling frequency is typically 48kHz. Thus, with a sampling rate of 48kHz, a 64-sample waveform will have a duration of 64 × 1/48000 seconds, which is approximately 1.3ms.
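The band and duration arithmetic of the preceding two paragraphs can be checked directly; a small sketch using the values given in the text:

```python
# Worked numbers from the text (illustrative only).
fs = 48_000                                    # DVD sampling rate, Hz
edges = [1] + [625 * i for i in range(1, 33)]  # 1, 625, 1250, ..., 20000
bands = list(zip(edges[:-1], edges[1:]))       # (1, 625), (625, 1250), ...
duration_ms = 64 / fs * 1e3                    # one 64-sample waveform
print(bands[:3], f"{duration_ms:.2f} ms")      # ~1.33 ms per waveform
```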
In Figure 3B, the frequency and time divisions are shown to be synchronised and are preferably synchronised in this way so that waveforms begin at the same moment for all frequencies. However, there are transforms that will produce different frequency/time graphs. For example, some transforms will create longer times at lower frequencies and shorter times at higher frequencies (this sort of transform is usually used in image processing rather than sound processing). This gives rise to a more refined acoustic response at higher frequencies.
The scale factors 41 are thus the result of the decomposition of the source signal 10, preferably according to the available elementary waveforms 31 in the database 30. Thus, each elementary waveform 31 from the waveform database may be sampled and routed 32 to an amplifier 51 to be scaled by the associated scale factor 41 issued from the decomposition unit 20, and is then routed as a scaled waveform 52 to a dedicated elementary loudspeaker 62. At each transmission (of a scale factor, waveform or waveform indication from the decomposition unit 20 to the gain-controlled amplifier 51; of a waveform or waveform indication from the database to the gain-controlled amplifier; and of a scaled waveform from the amplifier 51 to the loudspeaker 61), a signal is formed, sent and received to perform this transmission. A set of frequency signals (i.e. scaled waveforms) is routed to a set 60 (e.g. an array) of elementary loudspeakers including loudspeakers 61, 62. The resulting sound components or sound waves 70 (including a sound component 71 from loudspeaker 61 and a sound component 72 from loudspeaker 62) radiated by the set 60 of elementary loudspeakers are then summed (in the air) by direct acoustic synthesis and produce the desired sound field.
The decomposition unit 20 and its decomposing process are explained in more detail with reference to Figures 4, 5 and 6. The main task of the decomposition unit 20, as mentioned above, is to derive appropriate scale factors for the available waveforms. The mathematical generation of the scale factors will now be described in more detail.
In an embodiment of the invention, the decomposition unit 20 uses a Modified Discrete Cosine Transform (MDCT) as the means to generate the set of scale factors Xk. Other embodiments may use other transforms or other analysis methods to obtain scale factors. For example, a Modulated Lapped Transform as devised by H. Malvar (cf. Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '99), Volume 3) may be used. Desirable characteristics for such analysis methods are that the recombined form of the sound is time-limited; that the spectral contents of the audio signal are frequency-limited; and that there is preferably adjustability in the time- and frequency-measured windows. In summary, the signal is preferably decomposed on a time- and frequency-localised basis, so more than mere filtering is occurring.
The MDCT is a Fourier-type transform that is based on a discrete cosine transform (DCT), which expresses a sequence of sample points (n) as a sum of cosine functions oscillating at different frequencies. MDCT is linear in both time and frequency domains, but time-frequency transforms can be variable in either domain. Other transforms may be used instead that are not linear in both frequency and time, for instance.
Figures 4A, 4B and 4C show the process of sampling of a source signal, the sampled points later undergoing the transform.
Figure 4A shows the source signal Sm(n) 300 on a graph against sample number n, where in the present example, n goes from 0 to 4N-1 samples. Two steps are performed on the source signal to sample it. The first step is the discrete sampling of a continuous signal with n sample points. The second step is the grouping of the n discrete sample points into data units of 2N sample points (e.g. 0 to 2N-1, N-1 to 3N-1 and 2N-1 to 4N-1). The groupings are performed such that the data units 310 of 2N sample points overlap. The extent of the overlapping of the data units 310 in the illustrated embodiment is by N sample points. In other words, each data unit m overlaps with each of its neighbours m-1 and m+1 by half a data unit length. The number of a current data unit is labelled m (with m-1 as a preceding data unit and m+1 as a succeeding data unit) such that the label Sm(n) represents the source signal for the mth data unit and nth sample point.
A third optional step is that of applying a window function to the data units, as will be discussed below.
As mentioned above, MDCT expresses a sequence of sample points as a sum of cosine functions and in the example shown in Figures 4A, 4B and 4C, the sample points n are shown on the x-axis of the graph in Figure 4A. The sequence of sample points n are grouped into a series of overlapping data units 310 as shown in Figure 4B. Each of the overlapping data units 310 will be represented as a cosine function in the MDCT process shown in Figure 5. Thus, the sequence of 4N data points shown in Figure 4A will be represented as a sum of cosine functions when a transform has been applied to them. As will be described later, each cosine function is a mathematical representation of a waveform that will be reproducible or "playable" by an individual loudspeaker. The transform properties are optionally further improved (i.e. made more robust) by using a window function w. The window function w may be applied according to the following equation (1):

$$w_n = \sin\left[\frac{\pi}{2N}\left(n + \frac{1}{2}\right)\right] \qquad (1)$$

In Figure 4A, the source signal is labelled 300 and is shown as a solid line.
Dotted curves 320, 322 and 324 are window functions and are simply sine functions, as shown in equation (1), that are multiplied with the source signal to create the windowed functions shown in Figure 4C. The left-most graph 330 in Figure 4C shows curve 322 multiplied with the portion of the source signal 300 on which it is superimposed. Similarly, graph 332 shows curve 324 multiplied with the source signal 300. Finally, the right-most graph 334 of Figure 4C shows the curve 320 multiplied with the portion of the source signal 300 on which it is superimposed. Each window function has a value of zero outside of its curve and so the windowed curves in Figure 4C are also forced to zero outside of the 2N-1 sample points allocated to each window. Because the original signal 300 has a zero value until the sample point at N-1, the data unit windowed by window function 322 is represented as a signal that is 0 until N-1. Because the window function goes to 0 at 2N-1, the same left-most data unit 330 shown in Figure 4C returns to 0 at 2N-1.
The window function in this case multiplies the decomposed audio source signal with a sine window in order to avoid discontinuities at the data unit boundaries. In other words, it is the window function that ensures that in practice, the function goes smoothly to zero at those points (at points N-1 and 2N-1 in the case of data unit 330 windowed by window function 322). This gives a defined length of waveform without edge artefacts (and means that the waveforms overlap by a controlled number of sample points: 32 in the illustrated example).
Windowing is also known as time domain aliasing cancellation (TDAC) because the values outside of the predefined boundaries are aliased with lower values, the aliasing being performed in the time domain.
Each data unit 310 is windowed by a sine window function 322, 324, 320 of the form described above (using equation (1)) to give rise to windowed data units 330, 332, 334. If these three windowed data units 330, 332, 334 were added together and inverse windowing performed, they would give rise to the source waveform 300.
As can be seen from Figures 4A and 4B, the data units 310 and therefore the windows 320, 322 and 324 overlap by half. What this means is that there are two sampled points for each value of n for the overlapped portions of the data units (which is every value of n when the data units overlap by 50% or more). In an embodiment of the present invention, if the sampling rate is 64 samples per data unit, the central windowed data unit 332 shown in Figure 4C will have been sampled with 64 sample points, but both halves of the information for the window 324 will also be available in the preceding 322 and succeeding 320 windows. Thus the information in the central windowed data unit 332 is represented by 128 sample points over the three windowed data units in Figure 4C.
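A sketch of the sampling, 50% overlap and sine windowing just described (the helper names and array layout are assumptions for illustration):

```python
import numpy as np

def sine_window(two_n: int) -> np.ndarray:
    # Equation (1): w_n = sin[(pi / 2N)(n + 1/2)] for a 2N-point unit.
    n = np.arange(two_n)
    return np.sin(np.pi / two_n * (n + 0.5))

def windowed_units(signal: np.ndarray, two_n: int) -> np.ndarray:
    # Group the samples into 2N-point data units overlapping by N
    # (half a unit length), then apply the sine window to each unit.
    hop = two_n // 2
    w = sine_window(two_n)
    starts = range(0, len(signal) - two_n + 1, hop)
    return np.stack([w * signal[s:s + two_n] for s in starts])
```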
Following on from Figures 4A, 4B and 4C, which show the sampling, Figure 5 shows the decomposition step applied, for example, to the central windowed data unit 332 from Figure 4C. It is a peculiarity of the MDCT that the output of the transform contains half as many real values as the input.
This peculiarity is a result of the twice-sampled data points. The principle is discussed in Chapter 3 of "Applying the MDCT to Image Compression: Dissertation presented for the degree of Doctor of Science at Stellenbosch University" by Rikus Muller, for example. Thus, 2N windowed sample points are processed (i.e. decomposed) by the MDCT to produce N real values of the scale factor Xk and N cosine functions representing N waveforms.
The scale factor Xk in terms of the windowed, transformed signal is given by the mathematical equation below (equation (2)) and is shown in Figure 5:

$$X_k = \sum_{n=0}^{2N-1} S_m(n)\, w_n \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2} + \frac{N}{2}\right)\left(k + \frac{1}{2}\right)\right] \qquad (2)$$

where k = 0 to N-1. This means that there are N values of k and thus N instances of Xk.
The cosine function gives rise to the different waveforms. As k goes from 0 to N-1, there are N waveforms; the window function w windows the cosine waveforms. The scale factors Xk are thus associated with windowed cosine waveforms according to equation (2). Specifically, a scale factor is equated to the sum over all sample points n of the product of the source signal Sm(n) and a windowed (by wn) waveform (the cosine function).
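Equation (2) can be evaluated directly as a matrix product. A sketch, assuming the data unit has already been windowed as in Figure 4C:

```python
import numpy as np

def mdct_scale_factors(windowed_unit: np.ndarray) -> np.ndarray:
    # Equation (2): X_k = sum_n S_m(n) w_n cos[(pi/N)(n+1/2+N/2)(k+1/2)],
    # k = 0..N-1; the input is assumed to contain S_m(n) * w_n already.
    two_n = len(windowed_unit)
    N = two_n // 2
    n = np.arange(two_n)
    k = np.arange(N)
    basis = np.cos(np.pi / N * np.outer(k + 0.5, n + 0.5 + N / 2))
    return basis @ windowed_unit  # the N scale factors X_k
```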
The waveforms produced by this MDCT can be seen in equation (2) above as being cosine functions with smooth sine envelopes (the sine envelopes arising from the sine window function wn). These waveforms can be seen as graphical representations of the cosine functions in Figures 7A, 7B, 7C and 7D.
As can be seen more clearly in the graphical representations shown in Figures 7A, 7B, 7C and 7D, the waveforms have the properties of being of fixed frequency content (each a separate frequency or number of maxima per unit time) and of finite duration (64 samples). In other words, the waveforms are frequency-and time-localised. This gives waveforms that may be more accurately and easily reproduced by dedicated or optimised loudspeakers 62 as aural representations of the waveforms of equation (2) and the graphical representations of Figures 7.
In an embodiment, the source signal is processed using blocks of 64 sampling points (three blocks being shown in Figures 4A, 4B and 4C, so here 2N = 64). Because of the way the 32-point MDCT (i.e. 64-point data unit) works on the sampling points (i.e. using overlapping blocks), 32 scale factors Xk are available that will be used during the reconstruction process to scale 32 precompiled waveforms. Figures 7A to D show the 32 waveforms (in groups of 8 for visibility) in a database according to an embodiment of the invention. It can be seen that the waveforms are cosine waves in sine envelopes. The preferred embodiment uses a transform that inputs a 64-sample block because an output of 32 different waveforms is a good balance between having sufficiently few waveforms that a reasonable number of elementary loudspeakers (i.e. 64 or 32 loudspeakers) can be used, and having a sufficiently large number of waveforms that almost all distinctly audible sounds in the input digital signal are reproducible.
Each waveform can be seen to have a duration of 64 samples (from about 240 to about 304 samples on the scale shown). The elementary waveforms are shown in Figures 7A, 7B, 7C and 7D as being localised in both frequency and time domains. Specifically, Figures 7A, B, C and D show the resulting waveforms of a simulation of MDCT and IMDCT (inverse MDCT) processes performed on an audio signal. The waveforms each last for 64 samples, from n = approximately 240 to n = approximately 304. The amplitude values are based on the IMDCT output shown in Figure 6 and discussed below. As n is the sample index and k is the number of the current waveform (1 to 32), if k is fixed and n is varied, the solutions for the output signal yn,k of equation (4) below include a normalised gain coefficient, which is shown as the amplitude in Figures 7A, B, C and D. This gain/amplitude is of course scalable to create different volumes of the sound.
It can also be seen from Figures 7A, B, C and D that the envelope of the waveform seems to be different for the different waveforms. The reason for this is that in this simulation, 64 samples are allocated for each waveform, no matter how many changes of direction occur (i.e. no matter how high the frequency).
Thus, for higher frequencies, fewer sample points are available per waveform feature and the accuracy of the waveform decreases. This is not typically problematic, as mathematical synthesis of the waveforms will not have this limitation and in acoustic synthesis, good synchronisation of the waveforms by the sound output units decreases any audible inaccuracy. This is a characteristic that is balanced with the number of sound output units that it is practical to have.
If more sample points were used, more accurate waveforms would be available, but then a greater number of waveforms would also be produced, which would need to be output by a greater number of sound output units in order for no sound features to be lost. Thus, 32 sound output units outputting 32 different waveforms is the preferred balance.
Waveforms may be of a different shape from those illustrated. Specifically, waveforms are not limited to cosine waveforms in sine envelopes as produced by the described windowing and MDCT. For example, the database may contain Malvar-Wilson waveforms using a Malvar transform (i.e. a Modulated Complex Lapped Transform or MCLT) where the transform creates a cosine and sine waveform with a sine envelope when the windowing function is a sine function. In the latter example, both real and imaginary results are considered, as compared with just the real results of MDCT.
These 32 waveforms may be superimposed in many different ways to give rise to more complex complete sound waves, thus potentially reproducing sound more faithfully than signals that have not been decomposed to such an extent.
As mentioned above, the nature of the MDCT is such that it produces N different real values of Xk from 2N real values (of the windowed Sm(n)).
Practically, what this means is that the output of the MDCT process comprises N different elementary waveforms. An example where N=32 is shown in Figures 7 as discussed above. However, each of the elementary waveforms is in fact created twice, but the two instances of an elementary waveform overlap by half a waveform length (because of the overlap of the sample points: the same waveform is produced twice because each sample point is sampled twice, but for different data units that are shifted with respect to each other in time). Thus, in this case, 2N gain-controlled amplifiers 51 are in fact required as shown in Figures 9 and 10. Figure 9 shows the identical overlapping elementary waveforms 1000 each being output by a respective loudspeaker. This means that 64 loudspeakers are used for a 64-sample decomposition. However, this is not very practical because it is difficult with a large number of loudspeakers to fit them all into a space that will create a resultant sound in a suitable listening position. Simply put, 64 is an impracticably large number of speakers. Thus the set-up of Figure 10 is envisaged. As each loudspeaker is made or optimised to output one type of waveform, the two identical waveforms that overlap are modulated using their respective scale factors and then combined to give a resultant waveform that is then emitted by a single loudspeaker. In this way, half the number of loudspeakers may be used, namely 32 in this embodiment. Of course, a loudspeaker may be arranged to output superimposed waveforms that are not identical, but only similar, or even that are different from each other.
Returning to Figure 5, the resultant scale factors Xk that are output from the MDCT process are combined with the synthesised or extracted waveforms (shown in Figures 7A to 7D) to create the waveform components for reproduction by the array 60 of loudspeakers. This is the reproduction step. The MDCT is a lossless transform, so applying the Inverse Modified Discrete Cosine Transform (IMDCT) to the result of the MDCT gives back the original data unit. The present embodiment uses the MDCT as a tool to decompose the audio signal and takes advantage of this lossless principle when no compression is performed.
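For concreteness, the decomposition that produces the scale factors Xk from one 2N-sample data unit can be sketched as follows. This is a minimal numpy illustration assuming the usual sine window and an unnormalised MDCT; it is not the patent's own implementation and the function names are invented.

```python
import numpy as np

def sine_window(two_N):
    """Sine windowing function w(n) over one 2N-sample data unit."""
    n = np.arange(two_N)
    return np.sin(np.pi / two_N * (n + 0.5))

def mdct_scale_factors(data_unit):
    """Decompose one 2N-sample data unit into N scale factors Xk.

    data_unit : (2N,) samples of the source signal for one data unit
    returns   : (N,) scale factors, one per elementary waveform / loudspeaker
    """
    two_N = len(data_unit)
    N = two_N // 2
    n = np.arange(two_N)[None, :]
    k = np.arange(N)[:, None]
    windowed = sine_window(two_N)[None, :] * data_unit[None, :]
    basis = np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
    return np.sum(windowed * basis, axis=1)          # Xk for k = 0..N-1
```

With N = 32, each 64-sample data unit therefore yields the 32 scale factors fed to the 32 gain-controlled amplifiers.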
Figure 6 shows a theoretical inverse MDCT (IMDCT) process. IMDCT of the waveforms and scale factors outputs exactly what was input, namely the source audio signal. The mathematical theory shown in Figure 6 is in fact performed acoustically in the reproduction step. In reproducing the output acoustic signal, a process similar to IMDCT is performed. Specifically, the individual combination of the waveforms with scaling factors is equivalent to the multiplication of Xk with the cosine function. The reproduction of the scaled waveforms by elementary loudspeakers, and their combination in the air is equivalent to the summing function to build an output physical sound. The difference between the IMDCT and the present recombination is thus the format of the output. The principle of IMDCT may nonetheless be used to explain the reproduction step below. Furthermore, calculating the IMDCT can be used to predetermine the waveforms for the database.
\[
y_n = \frac{1}{N}\sum_{k=0}^{N-1} X_k \cos\!\left[\frac{\pi}{N}\left(n + \frac{1}{2} + \frac{N}{2}\right)\left(k + \frac{1}{2}\right)\right] \times w(n) \qquad (3)
\]
By applying the IMDCT shown in equation (3) above and then the windowing function (w) again, the original data units y = Sm can be recovered.
The summation shown in the IMDCT box is in fact performed acoustically in the air by the loudspeakers: the loudspeakers sum the already-multiplied products of waveforms and scaling factors (the multiplication having been performed by the gain-controlled amplifiers) to create the complete sound that is heard by a listener. Thus, looking at equation (3), the multiplication of Xk by the cosine term is performed by the amplifiers for each of the values of k (0 to N−1), and the summation over those same values of k is performed in the air. Thus, to sum all N waveforms, there are N amplifiers and N elementary loudspeakers, each producing one of the waveforms.
By evaluating this mathematical expression (equation (3)) of the IMDCT, the resulting data unit value can be expressed as the sum of N elementary waveforms, each scaled by the scale factor Xk that was output from the decomposition unit. The elementary waveforms are defined by the following mathematical expression (equation (4)), which is also shown in Figure 6 and labelled "waveform database definition".
\[
y_{n,k} = \frac{1}{N}\cos\!\left[\frac{\pi}{N}\left(n + \frac{1}{2} + \frac{N}{2}\right)\left(k + \frac{1}{2}\right)\right] \times w(n) \qquad (4)
\]
The set of elementary waveforms y(n,k) forms the waveform database 30 for the reproduction step. Thus, the waveform database can be precompiled and used both by the decomposing process to determine the scale factors and subsequently as a waveform component signal 32 for driving the sound reproduction devices 60, as already presented in Figure 3A.
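The waveform database of equation (4) can be precompiled offline. The following sketch uses the same assumptions (sine window, N = 32) as the analysis sketch above; the array and function names are illustrative, not taken from the patent.

```python
import numpy as np

def waveform_database(N=32):
    """Precompute the N elementary waveforms of equation (4).

    returns : (N, 2N) array; row k is a cosine at index k inside a
              sine envelope, scaled by 1/N.
    """
    two_N = 2 * N
    n = np.arange(two_N)[None, :]
    k = np.arange(N)[:, None]
    w = np.sin(np.pi / two_N * (n + 0.5))                       # sine envelope w(n)
    return (1.0 / N) * np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5)) * w

# Reproduction step of equation (3): each elementary waveform is scaled by its
# Xk and the scaled waveforms are summed -- acoustically, this summation is
# performed in the air by the array of loudspeakers.
# reconstructed = np.sum(Xk[:, None] * waveform_database(32), axis=0)
```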
The elementary loudspeakers 61, 62 can be built or adapted so as to reproduce a particular elementary waveform y(n,k) faithfully. Thus, each sound reproduction device in the array 60 is preferably specialised for a specific elementary waveform. The sound emitted in this case is more faithful to the original audio signal, while the possibility of overlap of the frequencies is decreased. As each driver is responsible for one waveform, the frequency range that each driver has to cover is reduced because of the larger number of specialised drivers, and the risk of distortion is decreased.
The frequency spectra of individual waveforms (three waveform spectra are in fact shown in the graph) are presented in Figure 8, where the frequency band-limited property of each waveform is apparent, as the side-lobes drop off quickly. The main lobe of the spectrum shows where the elementary loudspeaker is specialised. The shaping of the side lobes (in other words, the limiting of high and low frequency levels) is achieved by the shape of the windowing function described earlier. The dot-dashed lines represent the main lobes of other waveform spectra. This shows the frequency-limited nature of each waveform emitted by each elementary loudspeaker.
The waveforms from the database are, by construction, localised in a time domain (i.e. they are of a fixed duration) and localised in a frequency domain (i.e. they have a band-limited spectrum) and enable easier and more accurate reproduction by sound reproduction devices.
Referring now to Figure 11, for the reproduction process to be effective, the acoustic summation in the listening area or at the listening position P is also preferably synchronised in time (so that a listener hears the sound from all elementary loudspeakers at the same time). This means that all the radiated sound waves 530, 531, ... output from the sound reproduction devices 60 (including sound reproduction device 61) are correctly superimposed in the air in order to reconstruct the global sound wave at a position from which the listener is likely to be listening. This is optimal if the distances (or propagation delays) from all the sound reproduction devices to the listening position are equal. As it is difficult to put all the sound reproduction devices at the same physical location, in a practical implementation these distances are not exactly the same. However, this can be controlled by positioning the sound reproduction devices equidistant from the listening area (e.g. the devices 60 may be arranged in an arc centred on the listening position) as shown in Figure 11. Alternatively or additionally, the scale factor Xk may be made complex so as to include a phase-difference component in the scaled waveforms, enabling waveforms travelling different distances to nonetheless be superimposed at the correct point.
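As an illustration of these two options, the per-loudspeaker correction can be derived from the speaker-to-listener distances: either as a time delay applied to the nearer devices, or equivalently as a phase term folded into a complex Xk. The sketch below is not from the patent; the speed of sound, sample rate and variable names are assumptions.

```python
import numpy as np

def delay_compensation_samples(distances_m, sample_rate_hz=48000, c=343.0):
    """Delays (in samples) to apply to each sound reproduction device so that
    all radiated waveforms arrive at the listening position P simultaneously.

    distances_m : (N,) distance from each device to the listening position
    returns     : (N,) integer sample delays (nearer devices wait longer)
    """
    d = np.asarray(distances_m, dtype=float)
    extra_time = (d.max() - d) / c                 # head start of the nearer devices
    return np.round(extra_time * sample_rate_hz).astype(int)
```

The same correction can instead be expressed as a per-waveform phase shift, which is what making the scale factor Xk complex amounts to.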
As mentioned above, ideally there would be 64 loudspeakers: one for each of the 32 waveforms produced during a first time period and one for each of the 32 waveforms produced during a second time period that starts before the first time period has finished. 64 loudspeakers would thus recreate all of the sampled sound features of the audio signal to be reproduced. However, 64 is an impractical number of loudspeakers: the cost and complexity of arranging this number of loudspeakers may be significant. Thus, this number can be reduced to 32 loudspeakers as described above with respect to Figure 10, namely the two overlapping identical waveforms can be output using the same loudspeaker.
Preferred arrangements of the elementary loudspeakers will now be described. Smaller loudspeakers may, of course, be more closely packed together and reduce difficulties with synchronisation. However, the sound may not be as powerful or have as large a radiation pattern as larger elementary loudspeakers have. In particular, lower-frequency loudspeakers tend to be larger in order to have the correct frequency response.
A further consideration when arranging the loudspeakers is their radiation patterns. At lower frequencies, larger loudspeakers have a broader radiation pattern. As the frequency goes up, the radiation pattern narrows to a long, narrow lobe in front of the loudspeaker. It is thus preferable to optimise the loudspeaker size so that it has an optimum radiation pattern. An optimum radiation pattern is not necessarily the widest radiation pattern, but rather one that best matches the radiation patterns of the other loudspeakers in the array, so that the sound is the same wherever the listener is in front of the loudspeaker array.
It is possible to time-delay each signal (including a sound component made up of a scaled waveform) feeding the loudspeakers individually in order to synchronise the sound synthesis. The influence of this time-delay difference is more noticeable at high frequencies, as it is equivalent to an increased phase shift of the waveform. Loudspeakers that will be used for reproducing the waveforms with the higher frequencies may therefore optimally be set as close as possible to each other. For the loudspeakers intended to reproduce the waveforms with the lower frequencies, this constraint may be relaxed, as the radiation pattern of lower-frequency loudspeakers is broader, enabling the lower-frequency loudspeakers to be further apart from each other while still having overlapping radiation patterns. This leads to a positioning of the sound reproduction devices as presented in Figures 12A and 12B. In Figure 12A, the reproduction system 400 is composed of (smaller) inner sound reproduction devices 420 that reproduce the higher-frequency waveforms and outer sound reproduction devices 410 for the lower-frequency waveforms. The loudspeakers may be arranged in concentric rings.
Alternatively, the configuration may be a spiral as shown in Figure 12B, with smaller loudspeakers in the middle of the spiral and larger loudspeakers on the outer portions of the spiral. This arrangement limits the effect of the propagation delay differences and enables the reproduction of the complete sound wave. If there are fewer than a threshold number of loudspeakers, the loudspeakers in the centre are preferably spread out. If there are more than a threshold number of devices, the devices in the centre of the array are placed closer together such that the superimposed sound signal has an optimal listening area where all waveforms meet. Other arrangements of the loudspeakers will be determinable by the skilled person based on the explanation above.
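As an illustration of the Figure 12B arrangement, candidate positions can be generated along an Archimedean spiral, with the first (higher-frequency, smaller) loudspeakers packed near the centre and the later (lower-frequency, larger) ones further out. The spiral parameters and the mapping from waveform index to position are assumptions for illustration only, not values from the patent.

```python
import numpy as np

def spiral_positions(num_speakers=32, turn_spacing_m=0.15, angle_step_rad=0.6):
    """Loudspeaker (x, y) positions on an Archimedean spiral (Figure 12B style).

    Index 0 (highest-frequency waveform) sits at the centre; the radius grows
    with the index, giving the larger low-frequency drivers more room.
    returns : (num_speakers, 2) array of positions in metres.
    """
    i = np.arange(num_speakers)
    theta = i * angle_step_rad
    r = turn_spacing_m * theta / (2 * np.pi)       # Archimedean spiral: r grows with angle
    return np.column_stack((r * np.cos(theta), r * np.sin(theta)))
```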

Claims (31)

CLAIMS:
1. An acoustical synthesis apparatus for producing a sound output from an input digital signal, wherein the apparatus comprises: decomposing means for decomposing the digital signal into a plurality of waveforms that are each localised in frequency and time domains, the plurality of localised waveforms being distributed within at least said frequency domain, and into a plurality of scale factors; and sound reproduction means for reproducing, as output sound components, each of said waveforms scaled using a respective scale factor.
2. An acoustical synthesis apparatus according to claim 1, wherein the decomposing means is arranged to supply the waveforms or an indication of the waveforms to at least one of the sound reproduction means and a waveform database.
3. An acoustical synthesis apparatus for producing a sound output from an input digital signal, wherein the apparatus comprises: decomposing means for decomposing the digital signal into a plurality of predetermined waveforms and a plurality of scale factors; and sound reproduction means for reproducing, as output sound components, each of said predetermined waveforms scaled using a respective scale factor.
4. An acoustical synthesis apparatus according to claim 3, further comprising: means for extracting, from a waveform database, waveforms equivalent to each predetermined waveform, wherein the sound reproduction means are configured to reproduce, as output sound components, each of said extracted waveforms scaled using a respective scale factor.
5. An acoustical synthesis apparatus according to claim 4, wherein the decomposing means is configured to derive the at least one scale factor by performing an analysis on sample points of the input digital signal taking into account waveforms available in the waveform database.
6. An acoustical synthesis apparatus according to claim 5, wherein the analysis includes applying a transform to the sample points of the input digital signal.
7. An acoustical synthesis apparatus according to claim 5 or 6, wherein the analysis comprises an MDCT.
8. An acoustical synthesis apparatus according to any one of claims 5 to 7, wherein the analysis comprises a 32-sample transform such that 32 different waveforms and 32 corresponding scale factors are output.
9. An acoustical synthesis apparatus according to any one of claims 5 to 8, wherein the analysis comprises a 32-sample transform such that 64 waveforms and 64 scale factors are output, the 64 waveforms including 32 different waveforms, a first of each different type of waveform being offset with respect to a second of each type of waveform by half a waveform length.
10. An acoustical synthesis apparatus according to any preceding claim, wherein the decomposing means is arranged to forward information to the sound reproduction means indicating which waveforms are to be reproduced.
11. An acoustical synthesis apparatus according to any preceding claim, wherein the sound reproduction means comprises a plurality of elementary sound reproduction units, each elementary sound reproduction unit being arranged to output a different sound component, wherein the plurality of sound components, when superimposed, produce an output sound corresponding to the input digital signal.
12. An acoustical synthesis apparatus according to claim 11, wherein each elementary sound reproduction unit is arranged to output a different sound component with a different, limited frequency band from the other elementary sound reproduction units.
13. An acoustical synthesis apparatus according to claim 12, wherein each elementary sound reproduction unit is optimised to reproduce a waveform limited in frequency and time domains.
14. An acoustical synthesis apparatus according to any preceding claim, wherein the sound reproduction means comprises a plurality of gain-controlled amplifiers for receiving each a scale factor and applying the scale factor to a respective, associated waveform.
15. An acoustical synthesis apparatus according to any preceding claim, wherein the scale factor includes a phase difference factor to enable control of superposition positions of sound outputs of the sound reproduction means.
16. An acoustical decomposition apparatus for decomposing an input digital signal into a plurality of waveforms that are each localised in frequency and time domains, the plurality of localised waveforms being distributed within at least said frequency domain; and into a plurality of associated scale factors.
17. An acoustical decomposition apparatus for decomposing an input digital signal into a plurality of predetermined waveforms and a plurality of associated scale factors.
18. An acoustical decomposition apparatus according to claim 16 or 17, configured to forward, to a sound reproduction means comprising elementary sound reproduction units, an indication of which waveforms are to be scaled using the respective scale factors and reproduced by appropriate respective elementary sound reproduction units.
19. An acoustical decomposition apparatus according to claim 17, comprising: reception means for receiving the input digital signal; determination means for determining a plurality of sound components making up the input digital signal; extraction means for extracting, from a waveform database, information regarding an available waveform; and generation means for generating a scale factor corresponding to said available waveform, wherein the generation means is configured to generate the scale factor such that, when the scale factor is output to an elementary sound reproduction unit, it causes the elementary sound reproduction unit to reproduce the available waveform according to the information extracted by the extraction means and to scale the available waveform by the generated scale factor such that the reproduced waveform corresponds to one of the plurality of sound components determined by the determination means.
20. A sound reproduction means comprising a plurality of elementary sound reproduction units, each elementary sound reproduction unit being configured to output, as a sound component, a single scaled waveform localised in time and frequency domains.
21. A sound reproduction means according to claim 20, wherein the elementary sound reproduction units are arranged in an array such that the sound components output from all of the elementary sound reproduction units superimpose at a predetermined position in the air.
22. A sound reproduction means according to claim 20 or 21, wherein each elementary sound reproduction unit is arranged to receive information indicating two waveforms that are offset in time and to superimpose said two waveforms to output said two superimposed waveforms as a single output sound component.
23. A signal containing a waveform localised in time and frequency domains.
24. A signal according to claim 23 containing two superimposed overlapping waveforms, the waveforms being the same but offset by half a waveform length and each being scaled with a respective scale factor.
25. A waveform database containing at least one waveform to be output in a signal according to claim 23 or 24.
26. A signal containing a scale factor for modulating a waveform localised in time and frequency domains to produce a sound component to be superimposed in air by an elementary sound reproduction unit, the scale factor being determined by the decomposition of a digital signal into a plurality of component waveforms and a plurality of component scaling factors, the plurality of component waveforms, when recombined each with a respective scaling factor, generating a sound output corresponding to the digital signal.
27. A signal according to claim 26, further comprising information indicating with which waveform the scale factor is to be combined.
28. A signal containing a waveform scaled by a scale factor to be output by a sound reproduction means as a sound component.
29. An acoustic synthesis method of producing a sound output from an input digital signal, the method comprising: decomposing the digital signal into a plurality of waveforms each localised in time and frequency domains, the plurality of waveforms being distributed in at least the frequency domain; and into at least one scale factor, the at least one scale factor being generated such that when used to scale a respective waveform, produces a sound component of the sound output; and reproducing a respective one of said at least one waveform scaled using a respective scale factor.
30. An acoustic synthesis method of producing a sound output from an input digital signal, the method comprising: decomposing the digital signal into a plurality of predetermined waveforms and a plurality of corresponding scale factors, the scale factors being generated such that when used to scale a respective waveform, produces a sound component of the sound output; and reproducing a respective one of said at least one waveform scaled using a respective scale factor.
31. An acoustic synthesis method according to claim 29 or 30, further comprising superimposing, in the air using a plurality of loudspeakers, a plurality of reproduced waveforms according to a plurality of respective scale factors to produce the sound output corresponding to the input digital signal.
GB201100983A 2011-01-20 2011-01-20 Acoustical synthesis Expired - Fee Related GB2487399B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB201100983A GB2487399B (en) 2011-01-20 2011-01-20 Acoustical synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB201100983A GB2487399B (en) 2011-01-20 2011-01-20 Acoustical synthesis

Publications (3)

Publication Number Publication Date
GB201100983D0 GB201100983D0 (en) 2011-03-09
GB2487399A true GB2487399A (en) 2012-07-25
GB2487399B GB2487399B (en) 2014-06-11

Family

ID=43769344

Family Applications (1)

Application Number Title Priority Date Filing Date
GB201100983A Expired - Fee Related GB2487399B (en) 2011-01-20 2011-01-20 Acoustical synthesis

Country Status (1)

Country Link
GB (1) GB2487399B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070016427A1 (en) * 2005-07-15 2007-01-18 Microsoft Corporation Coding and decoding scale factor information
WO2009029035A1 (en) * 2007-08-27 2009-03-05 Telefonaktiebolaget Lm Ericsson (Publ) Improved transform coding of speech and audio signals
US20110066440A1 (en) * 2009-09-11 2011-03-17 Sling Media Pvt Ltd Audio signal encoding employing interchannel and temporal redundancy reduction
WO2011042464A1 (en) * 2009-10-08 2011-04-14 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Multi-mode audio signal decoder, multi-mode audio signal encoder, methods and computer program using a linear-prediction-coding based noise shaping

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2362382A1 (en) * 2010-02-26 2011-08-31 Fraunhofer-Gesellschaft zur Förderung der Angewandten Forschung e.V. Watermark signal provider and method for providing a watermark signal

Also Published As

Publication number Publication date
GB201100983D0 (en) 2011-03-09
GB2487399B (en) 2014-06-11

Similar Documents

Publication Publication Date Title
CN110634501B (en) Audio extraction device, machine training device, karaoke device
US9716948B2 (en) Audio mixing method and system
JP6485711B2 (en) Sound field reproduction apparatus and method, and program
EP1741313B1 (en) A method and system for sound source separation
Ziemer Psychoacoustic Sound Field Synthesis
JP2012509632A (en) Converter and method for converting audio signals.
WO2018193162A2 (en) Audio signal generation for spatial audio mixing
US20130142338A1 (en) Virtual Reality Sound Source Localization Apparatus
Ziemer et al. Complex point source model to calculate the sound field radiated from musical instruments
CA3235626A1 (en) Generating tonally compatible, synchronized neural beats for digital audio files
JP2006251375A (en) Voice processor and program
Ziemer Implementation of the radiation characteristics of musical instruments in wave field synthesis applications
JP6428256B2 (en) Audio processing device
EP3613043B1 (en) Ambience generation for spatial audio mixing featuring use of original and extended signal
US20230085013A1 (en) Multi-channel decomposition and harmonic synthesis
GB2487399A (en) Audio signal synthesis
CN113348508A (en) Electronic device, method, and computer program
CN101213592B (en) Device and method for parametric multi-channel decoding
Kyriakakis et al. Virtual microphones for multichannel audio applications
Ahrens et al. Authentic auralization of acoustic spaces based on spherical microphone array recordings
Varghese et al. Comparative Study and Detailed Graphical Analysis of Equalizer, Chorus, Panning and Reverb Digital Audio Effects Using MATLAB
JP2019016871A (en) Sound image generator
Mozaffari Blind Stereo to Wave Field Synthesis Upmixing Based on a Virtual Acoustic Model
Gottinger Rethinking distortion: Towards a theory of ‘sonic signatures’
Ziemer Spatial sound impression and precise localization by psychoacoustic sound field synthesis

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20240120