
HK1064788B - Noise cancellator - Google Patents


Info

Publication number
HK1064788B
HK1064788B (application HK04107704.8A)
Authority
HK
Hong Kong
Prior art keywords
noise
vector
spectrum
section
power
Prior art date
Application number
HK04107704.8A
Other languages
Chinese (zh)
Other versions
HK1064788A1 (en)
Inventor
安永和敏
森井利幸
渡边泰助
江原宏幸
Original Assignee
Godo Kaisha Ip Bridge 1
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from JP29473896A (patent JP4003240B2)
Priority claimed from JP31032496A (patent JP4006770B2)
Priority claimed from JP03458397A (patent JP3700310B2)
Priority claimed from JP03458297A (patent JP3174742B2)
Application filed by Godo Kaisha Ip Bridge 1
Publication of HK1064788A1
Publication of HK1064788B

Description

Noise canceller
The present application is a divisional application of application No. 97191558.X, filed on November 6, 1997, entitled "Audio encoding device, audio decoding device, and audio vector generation device".
Technical Field
The present invention relates to an acoustic vector generator capable of obtaining a high-quality synthesized speech, and to an acoustic encoding device and an acoustic decoding device capable of encoding and decoding a high-quality speech signal at a low bit rate.
Background
A CELP (Code Excited Linear Prediction) type audio coding apparatus divides speech into frames of fixed duration, performs linear prediction on each frame, and codes the prediction residual (excitation signal) of each frame by using an adaptive codebook storing past driving sound sources and a noise codebook storing a plurality of noise vectors. For example, a CELP type speech coder is disclosed in "Code-Excited Linear Prediction (CELP): High-Quality Speech at Very Low Bit Rates" (M. R. Schroeder and B. S. Atal, Proc. ICASSP '85, pp. 937-940).
Fig. 1 shows the schematic configuration of a CELP type speech encoding apparatus. A CELP type speech encoding apparatus separates speech information into sound source information and vocal tract information and encodes them. As for the vocal tract information, an input speech signal 10 is fed to a filter coefficient analysis unit 11 and subjected to linear prediction, and the resulting linear prediction coefficients (LPC) are encoded in a filter coefficient quantization unit 12. By supplying the linear prediction coefficients to the synthesis filter 13, the vocal tract information can be incorporated into the sound source information at the synthesis filter 13. For the sound source information, the adaptive codebook 14 and the noise codebook 15 are searched for each section (called a subframe) into which a frame is further subdivided. The search of the adaptive codebook 14 and the search of the noise codebook 15 are processes for determining the code number and gain (pitch gain) of the adaptive code vector and the code number and gain (noise code gain) of the noise code vector that minimize the coding distortion of expression (1).
‖v − (ga·Hp + gc·Hc)‖²  (1)
v: speech signal (vector)
H: impulse response convolution matrix for synthesis filter
where h: impulse response (vector) of the synthesis filter
L: frame length
p: adaptive codevector
c: noise code vector
ga: adaptive code gain (pitch gain)
gc: noise code gain
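As a concrete illustration of expression (1), the following sketch evaluates the coding distortion for candidate code vectors. Function names are my own, not from the patent; H is assumed to be the lower-triangular Toeplitz convolution matrix built from the impulse response h:

```python
import numpy as np

def impulse_response_matrix(h, L):
    """Lower-triangular Toeplitz matrix so that H @ x convolves x with h."""
    H = np.zeros((L, L))
    for i in range(L):
        H[i, : i + 1] = h[: i + 1][::-1]
    return H

def celp_distortion(v, H, p, c, ga, gc):
    """Coding distortion of expression (1): ||v - (ga*H*p + gc*H*c)||^2.

    v: target speech vector, p: adaptive code vector, c: noise code vector,
    ga/gc: adaptive (pitch) gain and noise code gain.
    """
    return float(np.sum((v - (ga * H @ p + gc * H @ c)) ** 2))
```

If the synthesized excitation reproduces v exactly, the distortion is zero; the search described below looks for the codes and gains that bring it as close to zero as possible.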
However, a closed-loop search that simultaneously determines all the codes minimizing expression (1) would require an enormous amount of computation, so a general CELP type speech coding apparatus first performs the adaptive codebook search to specify the code number of the adaptive code vector, and then, using that result, performs the noise codebook search to specify the code number of the noise code vector.
Here, a noise codebook search of the CELP type speech coding apparatus will be described with reference to fig. 2A to 2C.
In the figures, x is the target vector for the noise codebook search, obtained from expression (2) under the assumption that the adaptive codebook search has already been completed.
x=v-gaHp (2)
x: noise codebook search target (vector)
v: sound signal (vector)
H: impulse response convolution matrix for synthesis filter
p: adaptive codevector
ga: adaptive code gain (pitch gain)
As shown in fig. 2A, the noise codebook search is a process in which distortion calculation section 16 specifies the noise code vector c that minimizes the coding distortion defined by expression (3).
‖x − gc·Hc‖²  (3)
x: noise codebook search target (vector)
H: impulse response convolution matrix for synthesis filter
c: noise code vector
gc: noise code gain
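A direct closed-loop search over expression (3) can be sketched as follows. The closed-form optimal gain gc = (xᵗHc)/‖Hc‖² used here is the standard least-squares choice, an assumption on my part rather than a detail stated in this passage:

```python
import numpy as np

def search_noise_codebook(x, H, codebook):
    """Pick the noise code vector c and gain gc minimizing ||x - gc*H*c||^2
    (expression (3)) over all codebook entries."""
    best = (None, None, np.inf)          # (code number, gain, distortion)
    for k, c in enumerate(codebook):
        hc = H @ c
        e = hc @ hc
        if e == 0.0:
            continue                     # all-zero synthesized vector
        gc = (x @ hc) / e                # least-squares optimal gain
        d = x @ x - (x @ hc) ** 2 / e    # distortion at that gain
        if d < best[2]:
            best = (k, gc, d)
    return best
```

Switching control switch 21 through every codebook address corresponds to this loop over `codebook`.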
Distortion calculating section 16 controls control switch 21 to switch the noise code vector read from noise codebook 15 until noise code vector c is determined.
To reduce the calculation cost, an actual CELP type speech coding apparatus has the configuration of fig. 2B, in which distortion calculating section 16' specifies the code number that maximizes the distortion estimate of expression (4).

(x′c)²/‖Hc‖²  (4)
x: noise codebook search target (vector)
H: impulse response convolution matrix for synthesis filter
H': transposed matrix of H
x′: vector obtained by time-reversing x, passing it through the synthesis filter, and time-reversing the result again (x′ = H′x)
c: noise code vector
Specifically, noise codebook control switch 21 is connected to one terminal of noise codebook 15, and the noise code vector c is read from the address corresponding to that terminal. The read noise code vector c is synthesized with the vocal tract information by the synthesis filter 13, generating the synthesis vector Hc. Next, using the vector x′ obtained by time-reversing, synthesizing, and again time-reversing the target x, and the vector Hc obtained by passing the noise code vector c through the synthesis filter, the distortion calculating section 16' calculates the distortion estimate of expression (4). Then, the noise codebook control switch 21 is switched so that this distortion estimate is calculated for every noise vector in the noise codebook.
Finally, the switch position of noise codebook control switch 21 at which the distortion estimate of expression (4) is maximum is output to code output unit 17 as the code number of the noise code vector.
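The estimate of expression (4) allows x′ to be computed once and reused for every candidate, which is the point of the fig. 2B configuration. A minimal sketch (names illustrative) showing that maximizing the estimate picks the same code as minimizing (3) at the optimal gain:

```python
import numpy as np

def backward_filter(x, H):
    """x' = H^T x: time-reverse x, filter through H, time-reverse again."""
    return H.T @ x

def search_by_estimate(x, H, codebook):
    """Code number maximizing the estimate (x'c)^2 / ||Hc||^2 of expression (4)."""
    xp = backward_filter(x, H)           # computed once, reused for every c
    best_k, best_val = None, -np.inf
    for k, c in enumerate(codebook):
        hc = H @ c
        val = (xp @ c) ** 2 / (hc @ hc)  # distortion estimate
        if val > best_val:
            best_k, best_val = k, val
    return best_k
```

Since ‖x − gc·Hc‖² at the optimal gain equals ‖x‖² minus this estimate, the maximizing code number is exactly the minimizing one.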
Fig. 2C shows a partial structure of the speech decoding apparatus. Noise codebook control switch 21 is switched so as to read out the noise code vector of the transmitted code number. After the transmitted noise code gain gc and filter coefficients are set in the amplifier circuit 23 and the synthesis filter 24, the noise code vector is read out to restore the synthesized speech.
In the above-described sound encoding apparatus and decoding apparatus, the more noise code vectors stored as sound source information in the noise codebook 15, the more noise code vectors of sound sources close to actual sound can be searched. However, since the capacity of the noise codebook (ROM) is limited, it is impossible to store numerous noise code vectors corresponding to all sound sources in the noise codebook. Therefore, there is a limit to improving the sound quality.
Furthermore, an algebraically structured sound source has been proposed that can significantly reduce the coding-distortion calculation in the distortion calculating unit and can eliminate the noise codebook (ROM) (described in "8 kbit/s ACELP Coding of Speech with 10 ms Speech-Frame: a Candidate for CCITT Standardization": R. Salami, C. Laflamme, J-P. Adoul, ICASSP '94, pp. II-97 to II-100, 1994).
The algebraic sound source computes in advance the convolution of the time-reversed target with the impulse response of the synthesis filter, as well as the autocorrelation of the synthesis filter, and expands them in memory, thereby greatly reducing the cost of the coding-distortion calculation. Because the noise code vectors are generated algebraically, the ROM that would otherwise store them can be eliminated. CS-ACELP and ACELP, which use such algebraic structure sound sources in the noise codebook, have been adopted by the ITU-T as Recommendations G.729 and G.723.1, respectively.
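The idea behind generating code vectors algebraically rather than from a ROM can be sketched as a few signed unit pulses placed at coded positions. The layout below is purely illustrative, not the exact track interleaving of G.729 or G.723.1:

```python
import numpy as np

def algebraic_code_vector(positions, signs, L):
    """Algebraically generated noise code vector: a few signed unit pulses
    at coded positions, so no ROM table of vectors is needed."""
    c = np.zeros(L)
    for p, s in zip(positions, signs):
        c[p] = s
    return c
```

Because each vector is fully described by a handful of position and sign indices, the search can also precompute correlations once per subframe instead of filtering every candidate.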
However, in a CELP type speech coding apparatus and speech decoding apparatus that use an algebraic structure sound source as the noise codebook, the target for the noise codebook search is always encoded with a pulse train vector, so there is a limit to improving the sound quality.
Summary of The Invention
In view of the above circumstances, a 1st object of the present invention is to provide a sound source vector generation device, a sound encoding device, and a sound decoding device that can significantly reduce the memory capacity and improve the sound quality compared with the case where the noise code vectors are stored in the noise codebook as they are.
A 2nd object of the present invention is to provide a sound source vector generation device, a sound encoding device, and a sound decoding device that, while retaining an algebraic structure sound source in the noise codebook, can generate more complex noise code vectors and can improve the sound quality compared with the case where the target for the noise codebook search is encoded with a pulse train vector.
The present invention replaces the fixed vector reading means and the fixed codebook of a conventional CELP type audio coding/decoding apparatus with an oscillator that outputs a different vector series for each input seed value and a seed storage means that stores a plurality of seeds (values that seed the oscillation). This eliminates the need to store the fixed vectors in a fixed codebook (ROM) as they are, so the memory capacity can be significantly reduced.
The present invention likewise replaces the noise vector reading means and the noise codebook of the conventional CELP type audio coding/decoding apparatus with the oscillator and the seed storage means. This eliminates the need to store the noise vectors in a noise codebook (ROM) as they are, so the memory capacity can be significantly reduced.
The sound source vector generator of the present invention is structured so that a plurality of fixed waveforms are stored, each fixed waveform is arranged at a start position based on its start candidate position information, and the arranged fixed waveforms are added to generate a sound source vector. In this way, a sound source vector close to actual speech can be generated.
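The arrange-and-add operation described above can be sketched as follows. The function name, the per-waveform sign, and the end-of-frame clipping are illustrative assumptions, not details stated in the patent:

```python
import numpy as np

def make_excitation(fixed_waveforms, start_positions, signs, L):
    """Sound source vector built by placing each stored fixed waveform at its
    start candidate position and summing the results."""
    e = np.zeros(L)
    for w, pos, s in zip(fixed_waveforms, start_positions, signs):
        n = min(len(w), L - pos)         # clip a waveform running past the end
        e[pos : pos + n] += s * np.asarray(w[:n])
    return e
```

Only the short fixed waveforms and the position information need to be stored, rather than every possible combined vector.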
The present invention is a CELP type speech coder/decoder in which a noise codebook is constructed using the above-described excitation vector generator. The fixed waveform arranging means may algebraically generate start candidate position information of the fixed waveform.
A CELP type speech encoder/decoder stores a plurality of fixed waveforms, generates pulses corresponding to candidate position information for the start of each fixed waveform, convolves the impulse response of a synthesis filter with each fixed waveform to generate waveform-based impulse responses, calculates the autocorrelation and cross-correlation of the waveform-based impulse responses, and develops the autocorrelation and cross-correlation in a correlation matrix memory. Thus, it is possible to obtain a speech encoding/decoding device capable of improving the quality of synthesized speech while obtaining the same computational cost as that in the case where an algebraic structure sound source is used as a noise codebook.
The CELP type speech coder/decoder according to the present invention may include a plurality of noise codebooks and a switching means for selecting one of them. At least one noise codebook may be the above sound source vector generator; at least one may be a vector storage means storing a plurality of random number sequences or a pulse train storage means storing a plurality of pulse trains; or at least two noise codebooks each having the sound source vector generator may be provided, with a plurality of fixed waveforms stored in each. The switching means may select whichever noise codebook minimizes the coding distortion during the noise codebook search, or may select one adaptively based on the result of analyzing the speech section.
According to the present invention, there is provided a noise canceller removing a noise component from an input sound signal, the noise canceller including:
a/D conversion means for converting the input sound signal into a digital signal;
a noise reduction coefficient adjusting device for adjusting a noise reduction coefficient for determining a noise reduction amount;
a noise reduction coefficient storage device that stores the noise reduction coefficient;
an input waveform setting device for setting the digital signal obtained by the A/D converter as a digital signal of a predetermined time length suitable for Fourier transform;
an LPC analyzing device for performing linear predictive analysis on the digital signal of a predetermined time length obtained by the input waveform setting device;
a fourier transform device for performing a discrete fourier transform on the digital signal of a predetermined time length obtained by the input waveform setting device to obtain an input spectrum and a complex spectrum;
a noise estimating device for updating the maximum power stored in the maximum power storage device with the power of the input audio signal when the stored maximum power is smaller than that power; checking whether the power of the input audio signal and the noise reduction coefficient satisfy predetermined conditions to determine whether the current frame is a noise section; in a noise section, comparing the input spectrum obtained by the fourier transform device with a plurality of compensation noise spectrum candidates of different magnitudes in ascending order of magnitude, and, when the input spectrum is smaller than one of the candidates, adopting the input spectrum as the compensation noise spectrum; calculating an average noise spectrum to which the input spectrum is added at a constant ratio; and storing the compensation noise spectrum and the average noise spectrum in a noise spectrum storage device;
a noise spectrum storage device that stores the compensation noise spectrum and the average noise spectrum;
a noise reduction/spectrum compensation device for generating a difference spectrum by multiplying the average noise spectrum stored in the noise spectrum storage device by the noise reduction coefficient obtained by the noise reduction coefficient adjustment device and subtracting the resultant from the input spectrum obtained by the fourier transform device, and for compensating a negative difference spectrum by substituting the product of the level 1 candidate of the noise spectrum for compensation stored in the noise spectrum storage device and the compensation coefficient obtained by the noise reduction coefficient adjustment device;
a spectrum stabilizing device for checking the full-band power over all frequencies of the spectrum subjected to noise reduction and spectrum compensation by the noise reduction/spectrum compensation device and the mid-band power of a partial frequency band important to the auditory sense, judging whether the input signal is in a silence interval according to whether the full-band power or the mid-band power of the current frame satisfies preset conditions, performing, in a silence interval, stabilization processing that adjusts the full-band power and the mid-band power of the current frame using the power of the previous frame, and, outside a silence interval, rotating the phase of the complex spectrum obtained by the fourier transform device by random phase data in accordance with information on whether spectrum compensation was performed by the noise reduction/spectrum compensation device;
a previous spectrum storage device for storing information of previous frame power;
a random phase storage device that stores the random phase data;
an inverse Fourier transform means for performing inverse Fourier transform on the frequency spectrum subjected to the stabilization processing by the frequency spectrum stabilization means and the adjusted phase frequency spectrum;
a spectrum enhancement device for preparing a plurality of sets of weighting coefficients for enhancing a spectrum, determining a state of the input audio signal based on a power level of the difference spectrum at the time of noise reduction, selecting a set of weighting coefficients corresponding to the determined state, and performing spectrum enhancement on the signal obtained by the inverse fourier transform device using the selected weighting coefficients;
a waveform matching means for matching the signal obtained by the spectrum enhancing means with a previous frame signal;
front waveform storage means for storing a waveform of a previous frame used by the waveform matching means; and
maximum power storage means for storing the maximum power updated by the noise estimation means.
Brief description of the drawings
Fig. 1 shows a schematic diagram of a conventional CELP type speech encoding apparatus.
Fig. 2A is a block diagram of a sound source vector generation unit of the sound encoding device of fig. 1.
Fig. 2B is a block diagram of a modified acoustic vector generation unit for reducing calculation cost.
Fig. 2C is a block diagram of a sound source vector generation unit in the sound decoding apparatus used in a pair with the sound encoding apparatus of fig. 1.
Fig. 3 is a block diagram of a main part of the audio encoding device according to embodiment 1.
Fig. 4 is a block diagram of an acoustic source vector generator included in the speech encoding device according to embodiment 1.
Fig. 5 is a block diagram of a main part of the speech encoding apparatus according to embodiment 2.
Fig. 6 is a block diagram of an acoustic source vector generator included in the speech encoding device according to embodiment 2.
Fig. 7 is a block diagram showing a main part of the audio encoding apparatus according to embodiments 3 and 4.
Fig. 8 is a block diagram of an acoustic source vector generator included in the speech encoding device according to embodiment 3.
Fig. 9 is a block diagram of a nonlinear digital filter included in the speech encoding device according to embodiment 4.
Fig. 10 shows an addition characteristic diagram of the nonlinear digital filter shown in fig. 9.
Fig. 11 is a block diagram showing a main part of the audio encoding device according to embodiment 5.
Fig. 12 is a block diagram showing a main part of the audio encoding device according to embodiment 6.
Fig. 13A is a block diagram showing a main part of the audio encoding device according to embodiment 7.
Fig. 13B is a block diagram of a main part of the audio encoding device according to embodiment 7.
Fig. 14 is a block diagram showing a main part of the audio decoding apparatus according to embodiment 8.
Fig. 15 is a block diagram of a main part of the audio encoding device according to embodiment 9.
Fig. 16 is a block diagram showing a quantization target LSP addition section included in the audio encoding device according to embodiment 9.
Fig. 17 is a block diagram of an LSP quantization/decoding unit included in the audio encoding device according to embodiment 9.
Fig. 18 is a block diagram showing a main part of the audio encoding device according to embodiment 10.
Fig. 19A is a block diagram showing a main part of the audio encoding device according to embodiment 11.
Fig. 19B is a block diagram of a main part of the audio decoding apparatus according to embodiment 11.
Fig. 20 is a block diagram showing a main part of the audio encoding device according to embodiment 12.
Fig. 21 is a block diagram showing a main part of the audio encoding device according to embodiment 13.
Fig. 22 is a block diagram showing a main part of the audio encoding device according to embodiment 14.
Fig. 23 is a block diagram showing a main part of the audio encoding device according to embodiment 15.
Fig. 24 is a block diagram showing a main part of the speech coding apparatus according to embodiment 16.
Fig. 25 is a block diagram of a vector quantization section according to embodiment 16.
Fig. 26 is a block diagram of a parameter encoding section of the audio encoding device according to embodiment 17.
Fig. 27 is a block diagram of a noise reducing device according to embodiment 18.
Best Mode for Carrying Out The Invention
The embodiments of the present invention will be specifically described below with reference to the drawings.
Embodiment 1
Fig. 3 is a block diagram of the main part of the audio encoding device according to embodiment 1. This speech encoding apparatus includes a sound source vector generation apparatus 30 having a seed storage unit 31 and an oscillator 32, and an LPC synthesis filter unit 33.
A seed (a value that seeds the oscillation) 34 output from the seed storage unit 31 is input to the oscillator 32. The oscillator 32 oscillates according to the value of the input seed 34 and outputs a sound source vector 35 as a vector series that differs for each seed value. The LPC synthesis filter unit 33 holds the vocal tract information in the form of the impulse response convolution matrix of the synthesis filter, convolves the sound source vector 35 with the impulse response, and outputs a synthesized speech 36. Convolving the sound source vector 35 with the impulse response is called LPC synthesis.
Fig. 4 shows a specific configuration of the sound source vector generator 30. The seed storage unit control switch 41 switches the seed read out from the seed storage unit 31 in accordance with a control signal supplied from a distortion calculation unit.
In this way, only a plurality of seeds, from which the oscillator 32 produces different vector series, need to be stored in the seed storage unit 31 in advance, so more noise code vectors can be generated with a smaller capacity than when complex noise code vectors are stored in the noise codebook as they are.
Although the present embodiment describes the audio encoding device, the sound source vector generation device 30 may also be used in an audio decoding device. In this case, the audio decoding apparatus has a seed storage unit with the same contents as the seed storage unit 31 of the audio encoding apparatus, and the seed number selected at the time of encoding is supplied to the seed storage unit control switch 41.
Embodiment 2
Fig. 5 is a block diagram showing a main part of the audio encoding device according to the present embodiment. This speech encoding apparatus includes an acoustic source vector generating apparatus 50 having a seed storage unit 51 and a nonlinear oscillator 52, and an LPC synthesis filter unit 53.
A seed (a value that seeds the oscillation) 54 output from the seed storage unit 51 is input to the nonlinear oscillator 52. The nonlinear oscillator 52 outputs a different vector series for each value of the input seed 54; the resulting sound source vector 55 is input to the LPC synthesis filter section 53, which performs LPC synthesis and outputs a synthesized speech 56.
Fig. 6 is a block diagram showing the functions of the sound source vector generation device 50. The seed storage unit control switch 41 switches the seed read out from the seed storage unit 51 in accordance with a control signal supplied from a distortion calculation unit.
By using the nonlinear oscillator 52 as the oscillator of the sound source vector generator 50, divergence of the output can be suppressed by oscillation that follows the nonlinear characteristic, and sound source vectors usable in practice can be obtained.
Although the present embodiment describes the audio encoding device, the sound source vector generation device 50 may also be used in an audio decoding device. In this case, the audio decoding apparatus includes a seed storage unit with the same contents as the seed storage unit 51 of the audio encoding apparatus, and the seed number selected at the time of encoding is supplied to the seed storage unit control switch 41.
Embodiment 3
Fig. 7 is a block diagram showing the main part of the audio encoding device according to the present embodiment. This speech encoding apparatus includes a sound source vector generation apparatus 70 having a seed storage section 71 and a nonlinear digital filter 72, and an LPC synthesis filter section 73. Reference numeral 74 denotes a seed (a value that seeds the oscillation) output from the seed storage unit 71 and input to the nonlinear digital filter 72, reference numeral 75 denotes a sound source vector output from the nonlinear digital filter 72 as a vector series, and reference numeral 76 denotes a synthesized speech output from the LPC synthesis filter 73.
As shown in fig. 8, the sound source vector generator 70 has a seed storage unit control switch 41 for switching the seed 74 read from the seed storage unit 71 in accordance with a control signal supplied from the distortion calculation unit.
The nonlinear digital filter 72 outputs a different vector series for each value of the input seed, and the LPC synthesis filter unit 73 LPC-synthesizes the input sound source vector 75 and outputs a synthesized speech 76.
In this way, by using the nonlinear digital filter 72 as the oscillator of the sound source vector generator 70, divergence of the output can be suppressed by oscillation that follows the nonlinear characteristic, and sound source vectors usable in practice can be obtained.
Although the present embodiment describes the audio encoding device, the sound source vector generation device 70 may also be used in an audio decoding device. In this case, the audio decoding apparatus includes a seed storage unit with the same contents as the seed storage unit 71 of the audio encoding apparatus, and the seed number selected at the time of encoding is supplied to the seed storage unit control switch 41.
Embodiment 4
As shown in fig. 7, the speech encoding apparatus according to the present embodiment includes a sound source vector generation apparatus 70 having a seed storage section 71 and a nonlinear digital filter 72, and an LPC synthesis filter section 73.
Of particular note is the structure of the nonlinear digital filter 72, shown in fig. 9. The nonlinear digital filter 72 includes an adder 91 having the nonlinear addition characteristic shown in FIG. 10, state variable holding units 92 to 93 that hold the states of the digital filter (the values y(k-1) to y(k-N)), and multipliers 94 to 95 connected in parallel to the outputs of the state variable holding units 92 to 93, which multiply the state variables by gains and output the results to the adder 91. The state variable holding units 92 to 93 set the initial values of the state variables according to the seed read from the seed storage unit 71. The gains of the multipliers 94 to 95 are fixed so that the poles of the digital filter lie outside the unit circle of the Z plane.
Fig. 10 is a conceptual diagram of the nonlinear addition characteristic of the adder 91 in the nonlinear digital filter 72, showing the input-output relationship of an adder 91 with a two's-complement characteristic. The adder 91 first obtains the adder input sum, i.e. the sum of the input values to the adder 91, and then applies the nonlinear characteristic shown in fig. 10 to that sum to produce the adder output.
In particular, since the nonlinear digital filter 72 here has a second-order all-pole structure, the 2 state variable holding units 92 and 93 are connected in series, and the multipliers 94 and 95 are connected to the state variable holding units 92 and 93. A digital filter whose adder 91 has a two's-complement nonlinear addition characteristic is used. Note that the seed storage unit 71 stores the 32-word seed vectors shown in Table 1 (16 seeds, each consisting of two state-variable initial values).
Table 1: seed vibration vector for generating noise vector
i Sy(n-1)[i] Sy(n-2)[i] i Sy(n-1)[i] Sy(n-2)[i]
1 0.250000 0.250000 9 0.109521 -0.761210
2 -0.564643 -0.104927 10 -0.202115 0.198718
3 0.173879 -0.978792 11 -0.095041 0.863849
4 0.632652 0.951133 12 -0.634213 0.424549
5 0.920360 -0.113881 13 0.948225 -0.184861
6 0.864873 -0.860368 14 -0.958269 0.969458
7 0.732227 0.497037 15 0.233709 -0.057248
8 0.917543 -0.035103 16 -0.852085 -0.564948
In the audio encoding device having the above configuration, the seed vector read from the seed storage unit 71 is supplied to the state variable holding units 92 and 93 of the nonlinear digital filter 72 as initial values. The nonlinear digital filter 72 outputs 1 sample (y(k)) each time a 0 is input from the input vector (a zero series) to the adder 91, and the sample is transferred in turn to the state variable holding units 92, 93 as a state variable. At this time, the state variables output from the state variable holding units 92 and 93 are multiplied by the gains a1 and a2 by the multipliers 94 and 95, respectively. The outputs of the multipliers 94 and 95 are added by the adder 91 to obtain the adder input sum, and an adder output limited to between +1 and -1 is generated according to the characteristic of fig. 10. This adder output (y(k+1)) is output as the sound source vector and, in turn, transferred to the state variable holding units 92, 93, after which a new sample (y(k+2)) is generated.
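The filter operation above can be sketched numerically. The wraparound below models a two's-complement accumulator overflow into [-1, 1), and the gain values a1, a2 are illustrative choices that place the poles outside the unit circle (the patent excerpt does not list the actual gains):

```python
import numpy as np

def twos_complement(x):
    """Wrap x into [-1, 1) the way a two's-complement accumulator overflows
    (the nonlinear addition characteristic of Fig. 10)."""
    return ((x + 1.0) % 2.0) - 1.0

def nonlinear_filter_excitation(seed, a1, a2, n_samples):
    """Second-order all-pole filter driven by a zero input; the seed sets the
    initial state variables y(k-1), y(k-2).  Even when a1, a2 place the poles
    outside the unit circle, the wraparound keeps every output in [-1, 1)."""
    y1, y2 = seed                        # state variables from the seed table
    out = []
    for _ in range(n_samples):
        y = twos_complement(a1 * y1 + a2 * y2)   # adder with overflow
        out.append(y)
        y1, y2 = y, y1                   # shift the state variables
    return np.array(out)
```

With a1 = 2.1 and a2 = -1.3, the linear filter's poles have modulus about 1.14, so the linear part alone would diverge; the overflow nonlinearity is what keeps the series bounded and noise-like.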
In the present embodiment, since the coefficients 1 to N of the multipliers 94 to 95 are fixed so that the poles of the nonlinear digital filter lie outside the unit circle of the Z plane, and the adder 91 has the nonlinear addition characteristic, divergence of the output can be suppressed even when the input of the nonlinear digital filter 72 grows large, and sound source vectors usable in practice can be generated continuously. Randomness of the generated sound source vectors can also be ensured.
Although the present embodiment describes the audio encoding device, the sound source vector generation device 70 may also be used in an audio decoding device. In this case, the audio decoding apparatus includes a seed storage unit with the same contents as the seed storage unit 71 of the audio encoding apparatus, and the seed number selected at the time of encoding is supplied to the seed storage unit control switch 41.
Embodiment 5
Fig. 11 is a block diagram showing the main part of the audio encoding device according to the present embodiment. This speech encoding apparatus includes a sound source vector generation apparatus 110 having a sound source storage section 111 and a sound source addition vector generation section 112, and an LPC synthesis filter section 113.
The sound source storage section 111 stores past sound source vectors, and reads the sound source vectors using a control switch that receives a control signal from a distortion calculation section, not shown.
Acoustic source addition vector generation section 112 generates a new acoustic source vector by performing predetermined processing indicated by a generated vector identification number on the past acoustic source vector read from acoustic source storage section 111. Acoustic source addition vector generation section 112 has a function of switching the processing contents of the past acoustic source vectors in accordance with the generated vector identification number.
In the sound encoding device configured as described above, the generated vector-specifying number is supplied from, for example, a distortion calculation unit that performs sound source retrieval. Acoustic source addition vector generation section 112 generates different acoustic source addition vectors by performing different processing on past acoustic source vectors based on the value of the input generated vector identification number, and LPC synthesis filter section 113 performs LPC synthesis on the input acoustic source vector and outputs a synthesized voice.
In the present embodiment, random excitation vectors can be generated by storing only a small number of excitation vectors in the past in excitation storage section 111 and switching the processing contents in excitation addition vector generation section 112, and since it is not necessary to store noise vectors in the noise codebook (ROM) as they are, the memory capacity can be significantly reduced.
Although the present embodiment describes the audio encoding device, the acoustic source vector generation device 110 may be used in an audio decoding device. In this case, the audio decoding apparatus includes a sound source storage unit having the same contents as the sound source storage unit 111 of the audio encoding apparatus, and supplies the generated vector identification number selected at the time of encoding to the sound source addition vector generation unit 112.
Embodiment 6
Fig. 12 is a block diagram showing functions of the acoustic source vector generator according to the present embodiment. This sound source vector generation device includes a sound source addition vector generation unit 120 and a sound source storage unit 121 that stores a plurality of element vectors 1 to N.
Sound source addition vector generation section 120 comprises: readout processing section 122, which reads plural element vectors of different lengths from different positions in sound source storage section 121; reversal processing section 123, which reverses the order of the read element vectors; multiplication processing section 124, which multiplies the reversed vectors by different gains; decimation processing section 125, which shortens the multiplied vectors; interpolation processing section 126, which extends the decimated vectors; addition processing section 127, which adds the interpolated vectors; and processing determination/instruction section 128, which determines the specific processing method corresponding to the value of the input generated vector specification number, instructs each processing section accordingly, and holds the number conversion correspondence map (Table 2) referred to when determining the specific processing contents.
Table 2: number translation correspondence mapping
Bit string (MS... LSB) 6 5 4 3 2 1 0
V1 reading position (16 kinds) 3 2 1 0
V2 reading position (32 kinds) 2 1 0
V3 reading position (32 kinds) 4 3 2 1 0
Reverse treatment (2 species) 0
Multiplication process (4 kinds) 1 0
Intermittent extraction treatment (4 kinds) 1 0
Interpolation process (2 kinds) 0
Here, sound source addition vector generation section 120 will be described in more detail. Sound source addition vector generation section 120 compares the input generated vector specification number (a 7-bit string representing an integer from 0 to 127) with the number conversion correspondence map (Table 2) to determine the specific processing methods of readout processing section 122, reversal processing section 123, multiplication processing section 124, decimation processing section 125, interpolation processing section 126, and addition processing section 127, and outputs the determined methods to the respective processing sections.
First, focusing on the low-end 4-bit string of the generated vector specification number (n1: an integer from 0 to 15), element vector 1 (V1) of length 100 is read out from sound source storage section 121 starting at position n1. Next, focusing on the combination of the low-end 2-bit string and the high-end 5-bit string of the generated vector specification number (n2: an integer from 0 to 31), element vector 2 (V2) of length 78 is read out starting at position n2+14 (an integer from 14 to 45). Further, focusing on the high-end 5-bit string (n3: an integer from 0 to 31), element vector 3 (V3) of length Ns (=52) is read out starting at position n3+46 (an integer from 46 to 77). Readout processing section 122 outputs V1, V2, and V3 to reversal processing section 123.
If the lowest bit of the generated vector specification number is '0', reversal processing section 123 reverses the sample order of V1, V2, and V3 and outputs the reversed vectors to multiplication processing section 124 as the new V1, V2, and V3; if the lowest bit is '1', it outputs V1, V2, and V3 to multiplication processing section 124 unchanged.
Focusing on the 2-bit string formed by the high-end 7th and 6th bits of the generated vector specification number, multiplication processing section 124 multiplies the amplitude of V2 by -2 if the bit string is '00', multiplies the amplitude of V3 by -2 if it is '01', multiplies the amplitude of V1 by -2 if it is '10', and multiplies the amplitude of V2 by 2 if it is '11', and outputs the resulting vectors to decimation processing section 125 as the new V1, V2, and V3.
Focusing on the 2-bit string formed by the high-end 4th and 3rd bits of the generated vector specification number, decimation processing section 125 shortens V1, V2, and V3 to vectors of 26 samples and outputs them to interpolation processing section 126 as the new V1, V2, and V3: (a) if the bit string is '00', samples are taken at 1-sample intervals from V1, V2, and V3; (b) if '01', samples are taken at 3-sample intervals from V1 and V3 and at 1-sample intervals from V2; (c) if '10', samples are taken at 2-sample intervals from V1 and at 1-sample intervals from V2 and V3; (d) if '11', samples are taken at 3-sample intervals from V1, at 2-sample intervals from V2, and at 1-sample intervals from V3.
Focusing on the high-end 3rd bit of the generated vector specification number, interpolation processing section 126 substitutes V1, V2, and V3 into the even-numbered sample positions of a zero vector of length Ns (=52) if the bit is (a) '0', or into the odd-numbered sample positions if it is (b) '1', and outputs the results to addition processing section 127 as the new V1, V2, and V3.
The addition processing unit 127 adds the 3 vectors (V1, V2, V3) generated by the interpolation processing unit 126, generates and outputs a sound source addition vector.
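An end-to-end sketch of this six-step pipeline (readout, reversal, multiplication, decimation, interpolation, addition) follows; the decoding of the 7-bit generated vector specification number into the individual parameters and the stride interpretation of "sample interval" are assumptions made for illustration.

```python
NS = 52  # subframe length used in embodiment 6

def make_excitation(storage, n1, n2, n3, reverse, gains, strides, odd):
    """Illustrative sketch of the embodiment-6 pipeline (parameters stand in
    for the fields decoded from the generated vector specification number)."""
    # 1. readout: element vectors of different lengths from different positions
    v = [storage[n1:n1 + 100],
         storage[n2 + 14:n2 + 14 + 78],
         storage[n3 + 46:n3 + 46 + NS]]
    # 2. reversal processing (applied to all three vectors here)
    if reverse:
        v = [x[::-1] for x in v]
    # 3. multiplication: per-vector gains
    v = [[g * s for s in x] for g, x in zip(gains, v)]
    # 4. decimation: shorten each vector to 26 samples
    v = [x[::st][:26] for st, x in zip(strides, v)]
    # 5. interpolation + 6. addition: spread each 26-sample vector over the
    #    even or odd positions of a length-NS zero vector and sum the three
    out = [0.0] * NS
    start = 1 if odd else 0
    for x in v:
        for j, s in enumerate(x):
            out[start + 2 * j] += s
    return out
```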
In this way, in the present embodiment, since a plurality of processes are randomly combined in accordance with the generated vector identification number to generate a random acoustic vector, it is not necessary to store the noise vector in advance in the noise codebook (ROM), and the memory capacity can be significantly reduced.
Further, by using the acoustic vector generator according to the present embodiment in the audio encoding device according to embodiment 5, it is possible to generate complex random acoustic vectors without having to hold a large-capacity noise codebook.
Embodiment 7
Next, an example in which the acoustic source vector generation device shown in any of embodiments 1 to 6 described above is used in a CELP type audio coding device based on PSI-CELP, which is an audio coding/decoding standard system of PDC digital cellular phones in japan, will be described as embodiment 7.
Fig. 13A and 13B are block diagrams of the audio encoding device according to embodiment 7. In this encoding device, digitized input sound data 1300 is supplied to buffer 1301 in units of frames (frame length Nf = 104). At this time, the oldest data in buffer 1301 is updated with the newly supplied data. First, frame power quantization/decoding section 1302 reads processing frame s(i) (0 ≤ i ≤ Nf-1) of length Nf (=104) from buffer 1301 and obtains average power amp of the samples in the processing frame from equation (5).
amp: processing average power of samples within a frame
i: processing element number in frame (i is more than or equal to 0 and less than or equal to Nf-1)
s (i): processing intra-frame samples
Nf: processing frame length (=104)
The average power amp of the samples in the processing frame thus obtained is converted into logarithmic value amplog by equation (6).
amplog: logarithmic value of the average power of the samples in the processing frame
amp: processing average power of samples within a frame
The obtained amplog is scalar-quantized using the 16-word scalar quantization table Cpow shown in Table 3, stored in power quantization table storage section 1303, to obtain the 4-bit power index Ipow; the decoded frame power is obtained from the obtained 4-bit power index Ipow, and power index Ipow and the decoded frame power are output to parameter encoding section 1331. Power quantization table storage section 1303 stores the 16-word power scalar quantization table (Table 3), which is referred to when frame power quantization/decoding section 1302 scalar-quantizes the logarithmic value of the average power of the samples in the processing frame.
Table 3: power scalar quantization table
i Cpow(i) i Cpow(i)
1 0.00675 9 0.39247
2 0.06217 10 0.42920
3 0.10877 11 0.46252
4 0.16637 12 0.49503
5 0.21876 13 0.52784
6 0.26123 14 0.56484
7 0.30799 15 0.61125
8 0.35228 16 0.67498
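The scalar quantization step against Table 3 can be sketched as a nearest-neighbour search; the absolute-difference distance criterion and the 0-based indexing of the 4-bit index are assumptions made for illustration.

```python
# 16-word power scalar quantization table (Table 3), indexed 0..15 here.
CPOW = [0.00675, 0.06217, 0.10877, 0.16637, 0.21876, 0.26123, 0.30799,
        0.35228, 0.39247, 0.42920, 0.46252, 0.49503, 0.52784, 0.56484,
        0.61125, 0.67498]

def scalar_quantize(amplog):
    """Nearest-neighbour search over Table 3: returns the 4-bit power index
    Ipow and the corresponding decoded value."""
    ipow = min(range(len(CPOW)), key=lambda i: abs(CPOW[i] - amplog))
    return ipow, CPOW[ipow]
```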
LPC analysis section 1304 reads analysis segment data of analysis segment length Nw (=256) from buffer 1301, multiplies the read data by a Hamming window Wh of window length Nw (=256), and computes the autocorrelation function of the windowed data for lags up to the prediction order Np (=10). The obtained autocorrelation function is multiplied by the 10-word lag window table (Table 4) stored in lag window storage section 1305 to obtain the lag-windowed autocorrelation function; linear prediction analysis is performed on the lag-windowed autocorrelation function to calculate LPC parameters α(i) (1 ≤ i ≤ Np), which are output to pitch preselection section 1308.
Table 4: hysteresis window watch
i Wlag(i) i Wlag(i)
0 0.9994438 5 0.9801714
1 0.9977772 6 0.9731081
2 0.9950056 7 0.9650213
3 0.9911382 8 0.9559375
4 0.9861880 9 0.9458861
Subsequently, the obtained LPC parameters α(i) are converted into LSPs (line spectral pairs) ω(i) (1 ≤ i ≤ Np) and output to LSP quantization/decoding section 1306. Lag window storage section 1305 stores the lag window referred to by the LPC analysis section.
LSP quantization/decoding section 1306 first refers to the LSP vector quantization table stored in LSP quantization table storage section 1307, vector-quantizes the LSPs received from LPC analysis section 1304, selects the optimal index, and outputs the selected index to parameter encoding section 1331 as LSP code Ilsp. Next, the centroid corresponding to the LSP code is read out from LSP quantization table storage section 1307 as decoded LSP ωq(i) (1 ≤ i ≤ Np), and the read decoded LSP is output to LSP interpolation section 1311. Further, the decoded LSP is converted into LPC to obtain decoded LPC αq(i) (1 ≤ i ≤ Np), and the obtained decoded LPC is output to spectral weighting filter coefficient calculation section 1312 and perceptual weighting LPC synthesis filter coefficient calculation section 1314.
LSP quantization table storage section 1307 stores an LSP vector quantization table to be referred to when LSP quantization/decoding section 1306 performs vector quantization on an LSP.
Pitch preselection section 1308 first applies linear prediction inverse filtering based on the LPC parameters α(i) (1 ≤ i ≤ Np) received from LPC analysis section 1304 to processing frame data s(i) (0 ≤ i ≤ Nf-1) read from buffer 1301 to obtain linear prediction residual signal res(i) (0 ≤ i ≤ Nf-1), calculates the power of res(i), calculates the normalized prediction residual power obtained by normalizing the residual signal power by the power of the speech samples in the processing frame, and outputs it to parameter encoding section 1331. Next, res(i) is multiplied by a Hamming window of length Nw (=256) to generate windowed linear prediction residual signal resw(i) (0 ≤ i ≤ Nw-1), and the autocorrelation function φint(i) of the generated resw(i) is obtained over the range Lmin-2 ≤ i ≤ Lmax+2 (where Lmin = 16 is the shortest and Lmax = 128 the longest analysis interval of the long-term prediction coefficient). The 28-word polyphase filter coefficients Cppf (Table 5) stored in polyphase coefficient storage section 1309 are convolved with the obtained autocorrelation function, yielding the autocorrelation function φint(i) at the integer lag int, the autocorrelation function φdq(i) at the fractional position int-1/4, the autocorrelation function φaq(i) at the fractional position int+1/4, and the autocorrelation function φah(i) at the fractional position int+1/2.
Table 5: polyphase filter coefficients Cppf
i Cppf(i) i Cppf(i) i Cppf(i) i Cppf(i)
0 0.100035 7 0.000000 14 -0.128617 21 -0.212207
1 -0.180063 8 0.000000 15 0.300105 22 0.636620
2 0.900316 9 1.000000 16 0.900316 23 0.636620
3 0.300105 10 0.000000 17 -0.180063 24 -0.212207
4 -0.128617 11 0.000000 18 0.100035 25 0.127324
5 0.081847 12 0.000000 19 -0.069255 26 -0.090946
6 -0.060021 13 0.000000 20 0.052960 27 0.070736
Further, for each argument i in the range Lmin ≤ i ≤ Lmax, the maximum of φint(i), φdq(i), φaq(i), and φah(i) is substituted into φmax(i) by the processing of equation (7), yielding Lmax-Lmin+1 values φmax(i).
φmax(i)=MAX(φint(i)、φdq(i)、φaq(i)、φah(i)) (7)
φmax(i): maximum value of φint(i), φdq(i), φaq(i), and φah(i)
i: analysis interval of the long-term prediction coefficient (Lmin ≤ i ≤ Lmax)
Lmin: shortest analysis interval of the long-term prediction coefficient (=16)
Lmax: longest analysis interval of the long-term prediction coefficient (=128)
φint(i): autocorrelation function of the prediction residual signal at integer lag int
φdq(i): autocorrelation function of the prediction residual signal at fractional lag int-1/4
φaq(i): autocorrelation function of the prediction residual signal at fractional lag int+1/4
φah(i): autocorrelation function of the prediction residual signal at fractional lag int+1/2
From the obtained (Lmax-Lmin+1) values φmax(i), the 6 with the largest values are selected in order and stored as pitch candidates psel(i) (0 ≤ i ≤ 5); linear prediction residual signal res(i) and first pitch candidate psel(0) are output to pitch enhancement filter coefficient calculation section 1310, and psel(i) (0 ≤ i ≤ 5) are output to adaptive vector generation section 1319.
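The candidate selection just described — keeping the 6 lags with the largest φmax values — can be sketched as follows (the function name is illustrative):

```python
def preselect_pitch(phi_max, lmin=16, n_cand=6):
    """Return the n_cand lags with the largest phi_max values as pitch
    candidates psel; phi_max[j] holds the value for lag lmin + j."""
    order = sorted(range(len(phi_max)), key=lambda j: phi_max[j], reverse=True)
    return [lmin + j for j in order[:n_cand]]
```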
Polyphase coefficient storage section 1309 stores coefficients of a polyphase filter to be referred to by pitch preselection section 1308 for obtaining an autocorrelation function of a linear prediction residual signal with fractional lag accuracy and for generating an adaptive vector with fractional accuracy by adaptive vector generation section 1319.
Pitch enhancement filter coefficient calculation section 1310 obtains 3 pitch prediction coefficients cov(i) (0 ≤ i ≤ 2) from the linear prediction residual signal res(i) and first pitch candidate psel(0) obtained in pitch preselection section 1308. The impulse response of pitch enhancement filter Q(z) is obtained by equation (8) using the obtained pitch prediction coefficients cov(i) (0 ≤ i ≤ 2) and output to spectral weighting filter coefficient calculation section 1312 and auditory weighting filter coefficient calculation section 1313.
Q (z): transfer function of a pitch enhancement filter
cov (i): pitch prediction coefficient (i is more than or equal to 0 and less than or equal to 2)
λ pi: pitch enhancement constant (═ 0.4)
psel (0): pitch 1 st candidate
LSP interpolation section 1311 first obtains interpolated LSP ωintp(n,i) (1 ≤ i ≤ Np) for each subframe by equation (9), using decoded LSP ωq(i) of the current processing frame obtained in LSP quantization/decoding section 1306 and decoded LSP ωqp(i) of the previous processing frame obtained and held earlier.
ω intp (n, i): interpolating LSP for nth sub-frame
n: subframe number (═ 1, 2)
ωq(i): decoded LSP of the current processing frame
ωqp(i): decoded LSP of the previous processing frame
Then, the obtained ωintp(n,i) is converted into LPC to obtain decoded interpolated LPC αq(n,i) (1 ≤ i ≤ Np), which is output to spectral weighting filter coefficient calculation section 1312 and perceptual weighting LPC synthesis filter coefficient calculation section 1314.
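Equation (9) itself is not reproduced in this text; as a hedged sketch, the following assumes a simple linear cross-fade between the previous and current frames' decoded LSPs, which is only one plausible form of the interpolation:

```python
def interpolate_lsp(lsp_prev, lsp_cur, n_sub=2):
    """Assumed per-subframe LSP interpolation: subframe n uses weight
    n/n_sub on the current frame's decoded LSP (hypothetical weights)."""
    out = []
    for n in range(1, n_sub + 1):
        w = n / n_sub
        out.append([(1 - w) * p + w * c for p, c in zip(lsp_prev, lsp_cur)])
    return out
```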
Spectral weighting filter coefficient calculation section 1312 constructs the MA-type spectral weighting filter I(z) of equation (10) and outputs its impulse response to auditory weighting filter coefficient calculation section 1313.
I(z): transfer function of the MA-type spectral weighting filter
Nfir: filter order of I(z) (=11)
αfir(i): impulse response of I(z) (1 ≤ i ≤ Nfir)
The impulse response αfir(i) (1 ≤ i ≤ Nfir) in equation (10) is the impulse response of the ARMA-type spectral enhancement filter G(z) given by equation (11), truncated at term Nfir (=11).
G (z): transfer function of spectral weighting filter
n: subframe number (═ 1, 2)
Np: number of LPC analysis (10)
α(n,i): decoded interpolated LPC of the nth subframe
λma: numerator constant of G(z) (=0.9)
λar: denominator constant of G(z) (=0.4)
Auditory weighting filter coefficient calculation section 1313 first constructs auditory weighting filter W(z) by convolving the impulse response of spectral weighting filter I(z) received from spectral weighting filter coefficient calculation section 1312 with the impulse response of pitch enhancement filter Q(z) received from pitch enhancement filter coefficient calculation section 1310, and outputs the impulse response of the constructed auditory weighting filter W(z) to auditory weighting LPC synthesis filter coefficient calculation section 1314 and auditory weighting section 1315.
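Cascading I(z) and Q(z) into W(z) amounts to convolving their impulse responses, since cascading filters multiplies their transfer functions. A minimal sketch:

```python
def convolve(a, b):
    """Convolve two impulse responses; the result is the impulse response
    of the cascade of the two filters."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out
```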
Perceptual weighting LPC synthesis filter coefficient calculation section 1314 constructs perceptual weighting LPC synthesis filter H(z) by equation (12), using decoded interpolated LPC αq(n,i) received from LSP interpolation section 1311 and perceptual weighting filter W(z) received from perceptual weighting filter coefficient calculation section 1313.
H (z): transfer function for an auditory weighted synthesis filter
Np: number of LPC analysis
αq(n,i): decoded interpolated LPC of the nth subframe
n: subframe number (═ 1, 2)
W (z): transfer function of auditory weighting filter (cascade of I (z) and Q (z))
The coefficients of the constructed perceptual weighting LPC synthesis filter h (z) are output to the target generation section a1316, perceptual weighting LPC inverse synthesis section a1317, perceptual weighting LPC synthesis section a1321, perceptual weighting LPC inverse synthesis section B1326, and perceptual weighting LPC synthesis section B1329.
Perceptual weighting section 1315 passes the subframe signal read from buffer 1301 through perceptual weighting LPC synthesis filter H(z) with zero initial state, and outputs the result as perceptual weighting residual spw(i) (0 ≤ i ≤ Ns-1) to target generation section A1316.
Target generation section A1316 subtracts from the perceptual weighting residual spw(i) (0 ≤ i ≤ Ns-1) obtained in perceptual weighting section 1315 the zero-input response zres(i) (0 ≤ i ≤ Ns-1), i.e., the output of perceptual weighting LPC synthesis filter H(z) determined in perceptual weighting LPC synthesis filter coefficient calculation section 1314 when a zero sequence is input, and outputs the result as target vector r(i) (0 ≤ i ≤ Ns-1) for sound source selection to perceptual weighting LPC inverse synthesis section A1317 and target generation section B1325.
Perceptual weighting LPC inverse synthesis section A1317 time-reverses the target vector r(i) (0 ≤ i ≤ Ns-1) received from target generation section A1316, passes the reversed vector through perceptual weighting LPC synthesis filter H(z) with zero initial state, time-reverses the output again to obtain the time-reversed synthesized vector rh(k) (0 ≤ k ≤ Ns-1) of the target vector, and outputs it to comparison section A1322.
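The time-reversed synthesis (backward filtering) step can be sketched with an FIR approximation h of the weighting synthesis filter; using a truncated impulse response in place of the actual recursive filter is an assumption made here for simplicity.

```python
def time_reversed_synthesis(r, h):
    """Time-reverse the target r, filter it with impulse response h starting
    from zero state, then time-reverse the output again, yielding rh."""
    rev = r[::-1]
    out = []
    for n in range(len(rev)):
        acc = 0.0
        for k in range(min(n + 1, len(h))):
            acc += h[k] * rev[n - k]   # zero-state FIR filtering
        out.append(acc)
    return out[::-1]
```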
Adaptive codebook 1318 stores the past driving sound source referred to by adaptive vector generation section 1319 when generating adaptive vectors. Adaptive vector generation section 1319 generates Nac candidate adaptive vectors Pacb(i,k) (0 ≤ i ≤ Nac-1, 0 ≤ k ≤ Ns-1, 6 ≤ Nac ≤ 24) based on the 6 pitch candidates psel(j) (0 ≤ j ≤ 5) received from pitch preselection section 1308, and outputs them to adaptive/fixed selection section 1320. Specifically, as shown in Table 6, adaptive vectors are generated for 4 fractional lag positions per integer lag position when 16 ≤ psel(j) ≤ 44, for 2 fractional lag positions per integer lag position when 45 ≤ psel(j) ≤ 64, and for the integer lag positions alone when 65 ≤ psel(j) ≤ 128. Accordingly, the number of adaptive vector candidates Nac is at least 6 and at most 24, depending on the values of psel(j) (0 ≤ j ≤ 5).
Table 6: total number of adaptive vectors and fixed vectors
Total number of vectors: 255
Adaptive vectors: 222
  16 ≤ i ≤ 44: 116 (29 integer lags × 4 fractional lags)
  45 ≤ i ≤ 64: 42 (21 integer lags × 2 fractional lags)
  65 ≤ i ≤ 128: 64 (64 integer lags × 1 lag)
Fixed vectors: 32 (16 vectors × 2 signs)
When generating an adaptive vector with fractional precision, interpolation is performed by convolving the past sound source vector, read from adaptive codebook 1318 with integer precision, with the polyphase filter coefficients stored in polyphase coefficient storage section 1309.
Here, interpolation is performed at the position corresponding to the value of lagf(i): the integer lag position when lagf(i) is 0, the fractional position shifted by -1/2 from the integer lag when lagf(i) is 1, the fractional position shifted by +1/4 when lagf(i) is 2, and the fractional position shifted by -1/4 when lagf(i) is 3.
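The lagf(i)-to-offset mapping above can be written directly as a small lookup (the function name is illustrative):

```python
# Fractional lag offset selected by lagf(i), as described above.
FRAC_OFFSET = {0: 0.0, 1: -0.5, 2: 0.25, 3: -0.25}

def effective_lag(int_lag, lagf):
    """Effective (possibly fractional) pitch lag used when generating an
    adaptive vector by polyphase interpolation."""
    return int_lag + FRAC_OFFSET[lagf]
```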
Adaptive/fixed selection section 1320 receives the Nac (6 to 24) candidate adaptive vectors generated by adaptive vector generation section 1319 and outputs them to perceptual weighting LPC synthesis section A1321 and comparison section A1322.
First, in order to preselect Nacb (=4) candidates from among the adaptive vectors Pacb(i,k) (0 ≤ i ≤ Nac-1, 0 ≤ k ≤ Ns-1, 6 ≤ Nac ≤ 24) generated by adaptive vector generation section 1319, comparison section A1322 calculates the inner product prac(i) of the time-reversed synthesized vector rh(k) (0 ≤ k ≤ Ns-1) received from perceptual weighting LPC inverse synthesis section A1317 and each adaptive vector Pacb(i,k) by equation (13).
prac (i): adaptive vector preselection reference value
Nac (i): preselection self-adaptive vector candidate number (6 to 24)
i: number of adaptive vector (i is more than or equal to 0 and less than or equal to Nac-1)
Pacb (i, k): adaptive vector
rh (k): time-reversed composite vector of target vector r (k)
The obtained inner products prac(i) are compared, and the Nacb (=4) indices giving the largest values, together with the inner products at those indices, are selected and stored as post-preselection adaptive vector indices apsel(j) (0 ≤ j ≤ Nacb-1) and post-preselection adaptive vector reference values prac(apsel(j)); the post-preselection adaptive vector indices apsel(j) (0 ≤ j ≤ Nacb-1) are output to adaptive/fixed selection section 1320.
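The preselection criterion of equation (13) — keep the candidates whose inner product with rh(k) is largest — can be sketched generically (the same routine applies to the fixed-vector preselection below if the scores are taken in absolute value):

```python
def preselect(candidates, rh, n_keep):
    """Keep the n_keep candidate vectors whose inner product with the
    time-reversed synthesized target rh is largest."""
    scores = [sum(p * r for p, r in zip(c, rh)) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    kept = order[:n_keep]
    return kept, [scores[i] for i in kept]
```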
Perceptual weighting LPC synthesis section A1321 applies perceptual weighting LPC synthesis to the preselected adaptive vectors Pacb(apsel(j),k), generated in adaptive vector generation section 1319 and obtained through adaptive/fixed selection section 1320, to generate synthesized adaptive vectors SYNacb(apsel(j),k), and outputs them to comparison section A1322. Next, comparison section A1322 obtains the adaptive vector formal selection reference value sacbr(j) from equation (14) in order to formally select one of the Nacb (=4) preselected adaptive vectors Pacb(apsel(j),k).
sacbr(j): adaptive vector formal selection reference value
prac(apsel(j)): post-preselection adaptive vector reference value
apsel(j): post-preselection adaptive vector index
k: element number of vector (0 ≤ k ≤ Ns-1)
j: index of preselected adaptive vector (0 ≤ j ≤ Nacb-1)
Ns: subframe length (=52)
Nacb: number of preselected adaptive vectors (=4)
SYNacb(j,k): synthesized adaptive vector
The index that maximizes the value of equation (14), and the value of equation (14) at that index, are output to adaptive/fixed selection section 1320 as the post-formal-selection adaptive vector index ASEL and the adaptive vector formal selection reference value sacbr(ASEL), respectively.
Fixed codebook 1323 stores the Nfc (=16) candidate vectors read by fixed vector reading section 1324. Here, in order to preselect Nfcb (=2) candidates from the Nfc (=16) fixed vectors Pfcb(i,k) (0 ≤ i ≤ Nfc-1, 0 ≤ k ≤ Ns-1) read by fixed vector reading section 1324, comparison section A1322 obtains the absolute value |prfc(i)| of the inner product of the time-reversed synthesized vector rh(k) (0 ≤ k ≤ Ns-1) received from perceptual weighting LPC inverse synthesis section A1317 and each fixed vector Pfcb(i,k) by equation (15).
L prfc (i) |: fixed vector preselection reference value
k: element number of vector (k is more than or equal to 0 and less than or equal to Ns-1)
I: number of fixed vector (i is more than or equal to 0 and less than or equal to Nfc-1)
Nfc: fixed vector number (═ 16)
Pfcb (i, k): fixed vector
rh (k): time-reversed composite vector of target vector r (k)
The obtained values |prfc(i)| are compared, and the Nfcb (=2) indices giving the largest values, together with the absolute inner products at those indices, are selected and stored as post-preselection fixed vector indices fpsel(j) (0 ≤ j ≤ Nfcb-1) and post-preselection fixed vector reference values |prfc(fpsel(j))|; the post-preselection fixed vector indices fpsel(j) (0 ≤ j ≤ Nfcb-1) are output to adaptive/fixed selection section 1320.
Perceptual weighting LPC synthesis section A1321 applies perceptual weighting LPC synthesis to the preselected fixed vectors Pfcb(fpsel(j),k), read by fixed vector reading section 1324 and obtained through adaptive/fixed selection section 1320, to generate synthesized fixed vectors SYNfcb(fpsel(j),k), and outputs them to comparison section A1322.
Next, comparison section A1322 obtains the fixed vector formal selection reference value sfcbr(j) from equation (16) in order to formally select the optimal fixed vector from the Nfcb (=2) preselected fixed vectors Pfcb(fpsel(j),k).
sfcbr(j): fixed vector formal selection reference value
|prfc()|: post-preselection fixed vector reference value
fpsel(j): post-preselection fixed vector index (0 ≤ j ≤ Nfcb-1)
k: element number of vector (0 ≤ k ≤ Ns-1)
j: index of preselected fixed vector (0 ≤ j ≤ Nfcb-1)
Ns: subframe length (=52)
Nfcb: number of preselected fixed vectors (=2)
SYNfcb(j,k): synthesized fixed vector
The index that maximizes the value of equation (16), and the value of equation (16) at that index, are output to adaptive/fixed selection section 1320 as the post-formal-selection fixed vector index FSEL and the fixed vector formal selection reference value sfcbr(FSEL), respectively.
Adaptive/fixed selection section 1320 selects either the formally selected adaptive vector or the formally selected fixed vector as adaptive/fixed vector AF(k) (0 ≤ k ≤ Ns-1), using the magnitude and sign relationships among prac(ASEL), sacbr(ASEL), sfcbr(FSEL), and |prfc(FSEL)| received from comparison section A1322, as described by equation (17).
AF(k): adaptive/fixed vector
ASEL: post-formal-selection adaptive vector index
FSEL: post-formal-selection fixed vector index
k: element number of vector
Pacb(ASEL, k): formally selected adaptive vector
Pfcb(FSEL, k): formally selected fixed vector
sacbr(ASEL): adaptive vector formal selection reference value
sfcbr(FSEL): fixed vector formal selection reference value
prac(ASEL): post-preselection adaptive vector reference value
|prfc(FSEL)|: post-preselection fixed vector reference value
The selected adaptive/fixed vector AF(k) is output to perceptual weighting LPC synthesis filter section A1321, and the index indicating the number used to generate the selected adaptive/fixed vector AF(k) is output to parameter coding section 1331 as adaptive/fixed index AFSEL. Since the total number of vectors designed as adaptive vectors and fixed vectors is 255 (see Table 6), the adaptive/fixed index AFSEL is an 8-bit code.
Auditory sense weighted LPC synthesis filter section a1321 applies auditory sense weighted LPC synthesis filtering to adaptive/fixed vector af (k) selected in adaptive/fixed selection section 1320, generates synthesized adaptive/fixed vector synaf (k) (0 ≦ k ≦ Ns-1), and outputs the result to comparison section 1322.
First, comparing section A1322 obtains the power powp of synthesized adaptive/fixed vector SYNaf(k) (0 ≦ k ≦ Ns-1) received from perceptual weighting LPC synthesis filter section A1321 using equation (18).
powp: power of synthesized adaptive/fixed vector SYNaf(k)
k: element number of vector (0 ≦ k ≦ Ns-1)
Ns: subframe length (=52)
SYNaf(k): synthesized adaptive/fixed vector
Next, the inner product pr of target vector r(k) received from target generating section A1316 and synthesized adaptive/fixed vector SYNaf(k) is obtained by equation (19).
pr: inner product of SYNaf(k) and r(k)
Ns: subframe length (=52)
SYNaf(k): synthesized adaptive/fixed vector
r(k): target vector
k: element number of vector (0 ≦ k ≦ Ns-1)
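The two quantities above can be sketched as follows. This is a minimal illustration, assuming equation (18) is the sum of squared elements and equation (19) the plain inner product, consistent with the variable definitions; the function names are hypothetical.

```python
def vector_power(syn_af):
    # powp: sum of squared elements of the synthesized adaptive/fixed vector
    return sum(x * x for x in syn_af)

def inner_product(syn_af, target):
    # pr: inner product of SYNaf(k) and the target vector r(k)
    return sum(a * b for a, b in zip(syn_af, target))
```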
Further, adaptive/fixed vector AF(k) received from adaptive/fixed selection section 1320 is output to adaptive codebook updating section 1333, its power POWaf is calculated and output, together with synthesized adaptive/fixed vector SYNaf(k), to parameter coding section 1331, and powp, pr, and rh(k) are output to comparing section B1330.
Target generating section B1325 subtracts synthesized adaptive/fixed vector SYNaf(k) (0 ≦ k ≦ Ns-1) received from comparing section A1322 from target vector r(k) (0 ≦ k ≦ Ns-1) for sound source selection obtained by target generating section A1316, generates a new target vector, and outputs the generated new target vector to perceptual weighting LPC inverse synthesis section B1326.
Perceptual weighting LPC inverse synthesis section B1326 time-reverses the new target vector generated in target generating section B1325, inputs the reversed vector to the perceptual weighting LPC synthesis filter in the zero state, time-reverses the output vector again to generate time-reversed synthesis vector ph(k) (0 ≦ k ≦ Ns-1) of the new target vector, and outputs it to comparing section B1330.
Sound source vector generation device 1337 is, for example, the same as sound source vector generation device 70 described in embodiment 3. Sound source vector generation device 70 reads the 1st seed from seed storage section 71, inputs it to nonlinear digital filter 72, and generates a noise vector, which is output to perceptual weighting LPC synthesizing section B1329 and comparing section B1330. Next, the 2nd seed is read from seed storage section 71 and input to nonlinear digital filter 72, and the noise vector thus generated is likewise output to perceptual weighting LPC synthesizing section B1329 and comparing section B1330.
In order to preselect Nstb (=6) candidates from among the Nst (=64) noise vectors generated from the 1st seed, comparing section B1330 obtains the 1st noise vector preselection reference value cr(i1) (0 ≦ i1 ≦ Nst-1) from expression (20).
cr (i 1): 1 st noise vector preselection reference value
Ns: subframe length (52)
rh(j): time-reversed synthesis vector of the target vector
powp: power of adaptive/fixed vector (SYNaf (k))
pr: inner product of SYNaf (k) and r (k)
Pstb1(i1, j): 1 st noise vector
ph (j): time-reversed composite vector of SYNaf (k)
i 1: number of 1 st noise vector (0 ≦ i1 ≦ Nst-1)
j: element number of vector
The obtained Nst (=64) values of cr(i1) are compared, and the indices giving the largest values, down to the Nstb (=6)-th largest, together with the corresponding vectors, are selected and stored as the 1st noise vector preselection indices s1pse1(j1) (0 ≦ j1 ≦ Nstb-1) and the preselected 1st noise vectors Pstb1(s1pse1(j1), k) (0 ≦ j1 ≦ Nstb-1, 0 ≦ k ≦ Ns-1), respectively. Subsequently, the same processing as for the 1st noise vectors is performed for the 2nd noise vectors, and the 2nd noise vector preselection indices s2pse1(j2) (0 ≦ j2 ≦ Nstb-1) and the preselected 2nd noise vectors Pstb2(s2pse1(j2), k) (0 ≦ j2 ≦ Nstb-1, 0 ≦ k ≦ Ns-1) are stored.
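The preselection step described above (keeping the Nstb candidates with the largest reference values out of Nst) can be sketched as follows; `preselect` is a hypothetical helper, not the patent's implementation.

```python
def preselect(reference_values, n_keep):
    # Return the indices of the n_keep largest preselection reference values,
    # mirroring the selection of Nstb (=6) candidates out of Nst (=64).
    order = sorted(range(len(reference_values)),
                   key=lambda i: reference_values[i], reverse=True)
    return order[:n_keep]
```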
The perceptual weighting LPC synthesizing section B1329 performs perceptual weighting LPC synthesis on the 1 st noise vector Pstb1(s1pse1(j1), k) after preselection to generate a synthesized 1 st noise vector SYNstb1(s1pse1(j1), k), and outputs the synthesized 1 st noise vector SYNstb1(s1pse1(j1), k) to the comparing section B1330. Next, auditory weighted LPC synthesis is performed on the preselected 2 nd noise vector Pstb2(s2pse1(j2), k), and a synthesized 2 nd noise vector SYNstb2(s2pse1(j2), k) is generated and output to comparing section B1330.
In order to formally select one of its preselected 1st noise vectors and one of its preselected 2nd noise vectors, comparing section B1330 first computes expression (21) for each synthesized 1st noise vector SYNstb1(s1pse1(j1), k) calculated in perceptual weighting LPC synthesizing section B1329.
SYNOstb1(s1pse1(j1), k) = … (21)
SYNOstb1(s1pse1(j1), k): orthogonalized synthesized 1st noise vector
SYNstb1(s1pse1(j1), k): synthesized 1st noise vector
Pstb1(s1pse1(j1), k): preselected 1st noise vector
SYNaf(j): synthesized adaptive/fixed vector
powp: power of synthesized adaptive/fixed vector SYNaf(j)
Ns: subframe length (=52)
ph(k): time-reversed synthesis vector of SYNaf(j)
j1: number of the preselected 1st noise vector
k: element number of vector (0 ≦ k ≦ Ns-1)
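A sketch of the orthogonalization step, assuming equation (21) is a standard Gram–Schmidt projection that removes the SYNaf component from the synthesized noise vector, with the cross inner product evaluated through the time-reversed vector ph(k) as the variable list suggests; the exact form of equation (21) is not reproduced in this text.

```python
def orthogonalize(syn_stb, p_stb, ph, syn_af, powp):
    # Remove the component of the synthesized noise vector that lies along
    # the synthesized adaptive/fixed vector SYNaf.  The cross inner product
    # <SYNstb, SYNaf> is assumed to be evaluated as <Pstb, ph> using the
    # time-reversed synthesis vector ph(k).
    cross = sum(p * h for p, h in zip(p_stb, ph))
    return [s - (cross / powp) * a for s, a in zip(syn_stb, syn_af)]
```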
After the orthogonalized synthesized 1st noise vectors SYNOstb1(s1pse1(j1), k) are obtained, the same calculation is performed for the synthesized 2nd noise vectors to obtain the orthogonalized synthesized 2nd noise vectors SYNOstb2(s2pse1(j2), k), and the 1st noise vector formal selection reference value scr1 and the 2nd noise vector formal selection reference value scr2 are calculated in a closed loop for all 36 combinations of (s1pse1(j1), s2pse1(j2)), using expressions (22) and (23), respectively.
scr1: 1st noise vector formal selection reference value
cscr1: constant calculated in advance from equation (24)
SYNOstb1(s1pse1(j1), k): orthogonalized synthesized 1st noise vector
SYNOstb2(s2pse1(j2), k): orthogonalized synthesized 2nd noise vector
r(k): target vector
s1pse1(j1): 1st noise vector preselection index
s2pse1(j2): 2nd noise vector preselection index
Ns: subframe length (=52)
k: element number of vector
scr2: 2nd noise vector formal selection reference value
cscr2: constant calculated in advance from equation (25)
SYNOstb1(s1pse1(j1), k): orthogonalized synthesized 1st noise vector
SYNOstb2(s2pse1(j2), k): orthogonalized synthesized 2nd noise vector
r(k): target vector
s1pse1(j1): 1st noise vector preselection index
s2pse1(j2): 2nd noise vector preselection index
Ns: subframe length (=52)
k: element number of vector
Here, cscr1 in expression (22) and cscr2 in expression (23) are constants calculated in advance from expressions (24) and (25), respectively.
cscr1: constant for expression (22)
SYNOstb1(s1pse1(j1), k): orthogonalized synthesized 1st noise vector
SYNOstb2(s2pse1(j2), k): orthogonalized synthesized 2nd noise vector
r(k): target vector
s1pse1(j1): 1st noise vector preselection index
s2pse1(j2): 2nd noise vector preselection index
Ns: subframe length (=52)
k: element number of vector
cscr2: constant for expression (23)
SYNOstb1(s1pse1(j1), k): orthogonalized synthesized 1st noise vector
SYNOstb2(s2pse1(j2), k): orthogonalized synthesized 2nd noise vector
r(k): target vector
s1pse1(j1): 1st noise vector preselection index
s2pse1(j2): 2nd noise vector preselection index
Ns: subframe length (=52)
k: element number of vector
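Since the bodies of equations (22)–(25) are not reproduced here, the following sketch uses the common CELP formal-selection criterion (squared correlation with the target, normalized by the energy of the orthogonalized synthesis vector) as a stand-in assumption.

```python
def selection_reference(target, syno):
    # Assumed criterion: squared correlation of the orthogonalized synthesis
    # vector with the target r(k), normalized by the vector's own energy.
    # The candidate maximizing this value would be formally selected.
    corr = sum(r * s for r, s in zip(target, syno))
    energy = sum(s * s for s in syno)
    return corr * corr / energy
```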
Comparing section B1330 further substitutes the maximum value of scr1 into MAXscr1 and the maximum value of scr2 into MAXscr2, takes the larger of MAXscr1 and MAXscr2 as scr, and outputs the value of s1pse1(j1) referred to when scr was obtained to parameter coding section 1331 as the 1st noise vector formal selection index SSEL1. The noise vector corresponding to SSEL1 is stored as the formally selected 1st noise vector Pstb1(SSEL1, k), and the formally selected synthesized 1st noise vector SYNstb1(SSEL1, k) (0 ≦ k ≦ Ns-1) corresponding to Pstb1(SSEL1, k) is obtained and output to parameter coding section 1331.
Similarly, the value of s2pse1(j2) referred to when scr was obtained is output to parameter coding section 1331 as the 2nd noise vector formal selection index SSEL2, the noise vector corresponding to SSEL2 is stored as the formally selected 2nd noise vector Pstb2(SSEL2, k), and the formally selected synthesized 2nd noise vector SYNstb2(SSEL2, k) (0 ≦ k ≦ Ns-1) corresponding to Pstb2(SSEL2, k) is obtained and output to parameter coding section 1331.
Comparing section B1330 further obtains the signs S1 and S2 by which Pstb1(SSEL1, k) and Pstb2(SSEL2, k) are respectively multiplied, and outputs the obtained sign information of S1 and S2 to parameter encoding section 1331 as gain sign index Is1s2 (2-bit information).
S1: sign of the formally selected 1st noise vector
S2: sign of the formally selected 2nd noise vector
scr1: output of expression (22)
scr2: output of expression (23)
cscr1: output of expression (24)
cscr2: output of expression (25)
Noise vector ST(k) (0 ≦ k ≦ Ns-1) is generated according to equation (27) and output to adaptive codebook updating section 1333, and at the same time its power POWst is obtained and output to parameter encoding section 1331.
ST(k) = S1 × Pstb1(SSEL1, k) + S2 × Pstb2(SSEL2, k) (27)
ST(k): noise vector
S1: sign of the formally selected 1st noise vector
S2: sign of the formally selected 2nd noise vector
Pstb1(SSEL1, k): formally selected 1st noise vector
Pstb2(SSEL2, k): formally selected 2nd noise vector
SSEL1: 1st noise vector formal selection index
SSEL2: 2nd noise vector formal selection index
k: element number of vector (0 ≦ k ≦ Ns-1)
Similarly, synthesized noise vector SYNst(k) (0 ≦ k ≦ Ns-1) is generated from equation (28) and output to parameter encoding section 1331.
SYNst(k) = S1 × SYNstb1(SSEL1, k) + S2 × SYNstb2(SSEL2, k) (28)
SYNst(k): synthesized noise vector
S1: sign of the formally selected 1st noise vector
S2: sign of the formally selected 2nd noise vector
SYNstb1(SSEL1, k): formally selected synthesized 1st noise vector
SYNstb2(SSEL2, k): formally selected synthesized 2nd noise vector
k: element number of vector (k is more than or equal to 0 and less than or equal to Ns-1)
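Equations (27) and (28) are the same signed combination of two vectors; a direct sketch, with a hypothetical helper name:

```python
def combine(sign1, vec1, sign2, vec2):
    # ST(k) = S1 * Pstb1(SSEL1, k) + S2 * Pstb2(SSEL2, k)   (eq. 27)
    # The same form also yields SYNst(k) from the synthesized vectors (eq. 28).
    return [sign1 * a + sign2 * b for a, b in zip(vec1, vec2)]
```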
Parameter encoding section 1331 first obtains the subframe estimated residual power rs from equation (29), using the decoded frame power spow obtained in frame power quantizing/decoding section 1302 and the normalized predicted residual power resid obtained in pitch preselecting section 1308.
rs=Ns×spow×resid (29)
rs: subframe-estimated residual power
Ns: subframe length (52)
And (4) slow: decoded frame power
And (4) resid: normalized predicted residual power
The quantization gain selection reference value STDg is obtained from equation (30) using the obtained subframe estimated residual power rs, the adaptive/fixed vector power POWaf calculated in comparing section A1322, the noise vector power POWst obtained in comparing section B1330, and the 128-word gain quantization table (CGaf[i], CGst[i]) (0 ≦ i ≦ 127) shown in Table 7 and stored in gain quantization table storage section 1332.
Table 7: gain quantization table

i    CGaf(i)   CGst(i)
1    0.38590   0.23477
2    0.42380   0.50453
3    0.23416   0.24761
…    …         …
126  0.35382   1.68987
127  0.10689   1.02035
128  3.09711   1.75430
STDg: quantized gain selection reference value
rs: subframe-estimated residual power
POWaf: adaptive/fixed vector power
POWst: power of noise vector
i: number of gain quantization table (i is more than or equal to 0 and less than or equal to 127)
CGaf(i): component of the adaptive/fixed vector field in the gain quantization table
CGst(i): component of the noise vector field in the gain quantization table
SYNaf(k): synthesized adaptive/fixed vector
SYNst(k): synthesized noise vector
r (k): target vector
Ns: subframe length (52)
k: element number of vector (k is more than or equal to 0 and less than or equal to Ns-1)
The index for which the obtained quantization gain selection reference value STDg is smallest is selected as gain quantization index Ig. Then, using the adaptive/fixed vector side gain CGaf(Ig) and the noise vector side gain CGst(Ig) read out from the gain quantization table with the selected gain quantization index Ig, the formal gain Gaf actually applied to AF(k) and the formal gain Gst actually applied to ST(k) are obtained from expression (31) and output to adaptive codebook updating section 1333.
Gaf: adaptive/fixed vector formal gain
Gst: noise vector formal gain
rs: subframe estimated residual power
POWaf: adaptive/fixed vector power
POWst: noise vector power
CGaf(Ig): adaptive/fixed vector side component of the gain quantization table
CGst(Ig): noise vector side component of the gain quantization table
Ig: gain quantization index
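The exact form of equation (31) is not reproduced here; a common construction, assumed in this sketch, scales the table gains by the square root of the ratio of the subframe estimated residual power to each vector's power.

```python
import math

def formal_gains(cgaf, cgst, rs, pow_af, pow_st):
    # Assumed form of eq. (31): table entries scaled so that the excitation
    # power matches the subframe estimated residual power rs.
    gaf = cgaf * math.sqrt(rs / pow_af)
    gst = cgst * math.sqrt(rs / pow_st)
    return gaf, gst
```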
Parameter encoding section 1331 collects the power index Ipow obtained in frame power quantizing/decoding section 1302, the LSP code Ilsp obtained in LSP quantizing/decoding section 1306, the adaptive/fixed index AFSEL obtained in adaptive/fixed selection section 1320, the 1st noise vector formal selection index SSEL1, the 2nd noise vector formal selection index SSEL2, and the gain sign index Is1s2 obtained in comparing section B1330, and the gain quantization index Ig obtained in parameter encoding section 1331 itself, into an audio code, and outputs it to transmitting section 1334.
Adaptive codebook updating section 1333 multiplies adaptive/fixed vector AF(k) obtained in comparing section A1322 and noise vector ST(k) obtained in comparing section B1330 by the adaptive/fixed vector formal gain Gaf and the noise vector formal gain Gst obtained in parameter coding section 1331, respectively, and adds them according to equation (32) to generate driving sound source ex(k) (0 ≦ k ≦ Ns-1), which is output to adaptive codebook 1318.
ex(k) = Gaf × AF(k) + Gst × ST(k) (32)
ex(k): driving sound source
AF(k): adaptive/fixed vector
ST(k): noise vector
k: element number of vector (0 ≦ k ≦ Ns-1)
At this time, the old driving sound source in adaptive codebook 1318 is updated with the new driving sound source ex(k) generated by adaptive codebook updating section 1333.
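The update can be sketched as follows, with equation (32) for the driving sound source and a shift-in update of the adaptive codebook buffer; the helper name and the list-based buffer are illustrative.

```python
def update_adaptive_codebook(codebook, af, st, gaf, gst):
    # ex(k) = Gaf * AF(k) + Gst * ST(k)   (eq. 32)
    ex = [gaf * a + gst * s for a, s in zip(af, st)]
    # Shift out the oldest samples and append the new driving sound source.
    return codebook[len(ex):] + ex
```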
Embodiment 8
Next, an embodiment in which the sound source vector generation device described in embodiments 1 to 6 above is applied to a sound decoding device based on PSI-CELP, the speech encoding/decoding standard for digital mobile phones, will be described. This decoding apparatus is the counterpart of the encoding apparatus of embodiment 7 described above.
Fig. 14 is a functional block diagram of the audio decoding device according to embodiment 8. Parameter decoding section 1402 obtains the audio code (power index Ipow, LSP code Ilsp, adaptive/fixed index AFSEL, 1st noise vector formal selection index SSEL1, 2nd noise vector formal selection index SSEL2, gain quantization index Ig, gain sign index Is1s2) transmitted from the CELP type audio coding apparatus shown in fig. 13 through transmitting section 1401.
Next, the scalar value indicated by power index Ipow is read from the power quantization table (see Table 3) stored in power quantization table storage section 1405 and output to power restoration section 1417 as decoded frame power spow, and the vector indicated by LSP code Ilsp is read from the LSP quantization table stored in LSP quantization table storage section 1404 and output to LSP interpolation section 1406 as the decoded LSP. Adaptive/fixed index AFSEL is output to adaptive vector generation section 1408, fixed vector reading section 1411, and adaptive/fixed selection section 1412, and the 1st noise vector formal selection index SSEL1 and the 2nd noise vector formal selection index SSEL2 are output to sound source vector generation device 1414. The vector (CGaf(Ig), CGst(Ig)) indicated by gain quantization index Ig is read out from the gain quantization table (see Table 7) stored in gain quantization table storage section 1403, and, as on the encoding apparatus side, the adaptive/fixed vector formal gain Gaf actually used for AF(k) and the noise vector formal gain Gst actually used for ST(k) are obtained from equation (31) and output, together with gain sign index Is1s2, to driving sound source generating section 1413.
LSP interpolating section 1406 obtains decoded interpolated LSP ωintp(n, i) (0 ≦ i ≦ Np) for each subframe from the decoded LSP received from parameter decoding section 1402 in the same manner as in the encoding apparatus, converts it into LPC to obtain the decoded interpolated LPC, and outputs the obtained decoded interpolated LPC to LPC synthesis filter section 1416.
Adaptive vector generation section 1408 superimposes some of the polyphase coefficients (see Table 5) stored in polyphase coefficient storage section 1409 on the vector read from adaptive codebook 1407 based on adaptive/fixed index AFSEL received from parameter decoding section 1402, generates an adaptive vector with fractional lag precision, and outputs it to adaptive/fixed selection section 1412. Fixed vector reading section 1411 reads a fixed vector from fixed codebook 1410 based on adaptive/fixed index AFSEL received from parameter decoding section 1402 and outputs the read fixed vector to adaptive/fixed selection section 1412.
Adaptive/fixed selection section 1412 selects either the adaptive vector input from adaptive vector generation section 1408 or the fixed vector input from fixed vector reading section 1411 as adaptive/fixed vector AF(k), based on adaptive/fixed index AFSEL received from parameter decoding section 1402, and outputs the selected adaptive/fixed vector AF(k) to driving sound source generating section 1413. Sound source vector generation device 1414 reads the 1st and 2nd seeds from seed storage section 71 based on the 1st noise vector formal selection index SSEL1 and the 2nd noise vector formal selection index SSEL2 received from parameter decoding section 1402, inputs the read seeds to nonlinear digital filter 72, and regenerates the 1st and 2nd noise vectors, respectively. The regenerated 1st and 2nd noise vectors are then multiplied by the sign information S1 and S2 decoded from gain sign index Is1s2, respectively, and added to generate sound source vector ST(k), which is output to driving sound source generating section 1413.
Driving sound source generating section 1413 multiplies adaptive/fixed vector AF(k) received from adaptive/fixed selection section 1412 and sound source vector ST(k) received from sound source vector generation device 1414 by the adaptive/fixed vector formal gain Gaf and the noise vector formal gain Gst obtained in parameter decoding section 1402, respectively, adds the results in accordance with gain sign index Is1s2 to obtain driving sound source ex(k), and outputs the obtained driving sound source to LPC synthesis filter section 1416 and adaptive codebook 1407. Here, the old driving sound source in adaptive codebook 1407 is updated with the new driving sound source input from driving sound source generating section 1413.
LPC synthesis filter section 1416 performs LPC synthesis on the driving sound source generated by driving sound source generating section 1413, using a synthesis filter built from the decoded interpolated LPC received from LSP interpolation section 1406, and sends the filter output to power restoration section 1417. Power restoration section 1417 first obtains the average power of the synthesized vector of the driving sound source obtained in LPC synthesis filter section 1416, then divides the decoded frame power spow received from parameter decoding section 1402 by the obtained average power, and multiplies the synthesized vector of the driving sound source by the result to generate the synthesized speech.
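A sketch of the power restoration step. The text's wording leaves the exact scale factor ambiguous, so this sketch assumes the factor is the square root of the ratio of the decoded frame power to the measured average power, which makes the output's average power equal spow.

```python
import math

def restore_power(synth, spow):
    # Scale the synthesized vector so its average power equals the decoded
    # frame power spow (sketch of power restoration section 1417).
    avg = sum(x * x for x in synth) / len(synth)
    g = math.sqrt(spow / avg)
    return [g * x for x in synth]
```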
Embodiment 9
Fig. 15 is a block diagram of the main part of the audio encoding device according to embodiment 9. In this audio encoding apparatus, a quantization target LSP adding section 151, an LSP quantization/decoding section 152, and an LSP quantization error comparing section 153 are added to, or partially replace corresponding blocks of, the audio encoding apparatus shown in fig. 13.
LPC analyzing section 1304 performs linear predictive analysis on the processing frame in buffer 1301 to obtain an LPC, converts the obtained LPC to generate a quantization target LSP, and outputs the generated quantization target LSP to quantization target LSP adding section 151. It also has the function of performing linear prediction analysis on the first read section in the buffer to obtain an LPC for the first read section, converting the obtained LPC to generate an LSP for the first read section, and outputting it to quantization target LSP adding section 151.
Quantization target LSP adding section 151 generates a plurality of additional quantization target LSPs besides the quantization target LSP obtained directly by converting the LPC of the processing frame in LPC analyzing section 1304.
LSP quantization table storage section 1307 stores the quantization table referred to by LSP quantization/decoding section 152, and LSP quantization/decoding section 152 quantizes and decodes each generated quantization target LSP to generate the corresponding decoded LSP.
LSP quantization error comparing section 153 compares the plurality of generated decoded LSPs, selects in a closed loop the 1 decoded LSP that produces the least abnormal noise, and uses the selected decoded LSP as the decoded LSP for the processing frame.
Fig. 16 shows a block diagram of the quantized object LSP-adding section 151.
Quantization target LSP adding section 151 includes a current frame LSP storage section 161 that stores the quantization target LSP of the processing frame obtained by LPC analyzing section 1304, a first read section LSP storage section 162 that stores the LSP of the first read section obtained by LPC analyzing section 1304, a previous frame LSP storage section 163 that stores the decoded LSP of the preceding frame, and a linear interpolation section 164 that performs linear interpolation on the LSPs read from these 3 storage sections to generate a plurality of additional quantization target LSPs.
The quantization target LSP of the processing frame, the LSP of the first read section, and the decoded LSP of the preceding frame are linearly interpolated to add a plurality of quantization target LSPs, all of which are output to LSP quantization/decoding section 152.
Here, quantization target LSP adding section 151 will be described in more detail. LPC analyzing section 1304 performs linear prediction analysis on the processing frame in the buffer to obtain LPC α(i) (0 ≦ i ≦ Np) with prediction order Np (=10), converts the obtained LPC to generate quantization target LSP ω(i) (0 ≦ i ≦ Np), and stores it in current frame LSP storage section 161 in quantization target LSP adding section 151. It further performs linear prediction analysis on the first read section in the buffer to obtain the LPC for the first read section, converts it to generate the LSP ωf(i) (0 ≦ i ≦ Np) for the first read section, and stores the generated LSP in first read section LSP storage section 162 in quantization target LSP adding section 151.
Next, linear interpolation section 164 reads quantization target LSP ω(i) (0 ≦ i ≦ Np) for the processing frame from current frame LSP storage section 161, LSP ωf(i) (0 ≦ i ≦ Np) for the first read section from first read section LSP storage section 162, and decoded LSP ωqp(i) (0 ≦ i ≦ Np) for the preceding frame from previous frame LSP storage section 163, and generates quantization target added 1st LSP ω1(i) (0 ≦ i ≦ Np), quantization target added 2nd LSP ω2(i) (0 ≦ i ≦ Np), and quantization target added 3rd LSP ω3(i) (0 ≦ i ≦ Np) by the transformation shown in equation (33).
ω1(i): quantization target added 1st LSP
ω2(i): quantization target added 2nd LSP
ω3(i): quantization target added 3rd LSP
i: LPC number (0 ≦ i ≦ Np)
Np: LPC analysis order (=10)
ω(i): quantization target LSP of the processing frame
ωqp(i): decoded LSP corresponding to the preceding frame
ωf(i): LSP corresponding to the first read section
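Equation (33)'s interpolation coefficients are not reproduced in this text; the sketch below uses hypothetical weights purely to illustrate how past-leaning, future-leaning, and centered candidates could be generated from the three stored LSPs.

```python
def add_quantization_targets(omega, omega_f, omega_qp):
    # omega:    quantization target LSP of the processing frame
    # omega_f:  LSP of the first read section (look-ahead)
    # omega_qp: decoded LSP of the preceding frame
    # The weights below are hypothetical, not the patent's eq. (33).
    w1 = [0.5 * q + 0.5 * c for q, c in zip(omega_qp, omega)]   # lean to past
    w2 = [0.5 * c + 0.5 * f for c, f in zip(omega, omega_f)]    # lean to future
    w3 = [0.25 * q + 0.5 * c + 0.25 * f
          for q, c, f in zip(omega_qp, omega, omega_f)]          # centered
    return w1, w2, w3
```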
The generated ω1(i), ω2(i), and ω3(i) are output to LSP quantization/decoding section 152. After vector quantizing and decoding all 4 quantization target LSPs ω(i), ω1(i), ω2(i), and ω3(i), LSP quantization/decoding section 152 obtains the power Epow(ω) of the quantization error of ω(i), the power Epow(ω1) of the quantization error of ω1(i), the power Epow(ω2) of the quantization error of ω2(i), and the power Epow(ω3) of the quantization error of ω3(i), and applies the transformation of equation (34) to each of the obtained quantization error powers to obtain decoded LSP selection reference values STDlsp(ω), STDlsp(ω1), STDlsp(ω2), and STDlsp(ω3).
STDlsp(ω): decoded LSP selection reference value corresponding to ω(i)
STDlsp(ω1): decoded LSP selection reference value corresponding to ω1(i)
STDlsp(ω2): decoded LSP selection reference value corresponding to ω2(i)
STDlsp(ω3): decoded LSP selection reference value corresponding to ω3(i)
Epow(ω): power of the quantization error of ω(i)
Epow(ω1): power of the quantization error of ω1(i)
Epow(ω2): power of the quantization error of ω2(i)
Epow(ω3): power of the quantization error of ω3(i)
The obtained decoded LSP selection reference values are compared, the decoded LSP corresponding to the quantization target LSP with the smallest reference value is selected and output as decoded LSP ωq(i) (0 ≦ i ≦ Np) for the processing frame, and it is stored in previous frame LSP storage section 163 so that it can be referred to when the quantization target LSP of the next frame is vector quantized.
This embodiment makes effective use of the good interpolation characteristics of LSPs (no abnormal noise arises even when synthesis uses interpolated LSPs): since interpolated LSPs can themselves serve as vector quantization candidates, abnormal noise in the synthesized speech, which can occur when the quantization characteristics of the LSPs are insufficient, can be reduced even in sections with large spectral fluctuation such as the beginning of speech.
Fig. 17 is a block diagram of LSP quantization/decoding section 152 according to the present embodiment. LSP quantization/decoding section 152 includes gain information storage section 171, adaptive gain selection section 172, multiplication gain operation section 173, LSP quantization section 174, and LSP decoding section 175.
Gain information storage section 171 stores a plurality of gain candidates referred to when adaptive gain selection section 172 selects the adaptive gain. Multiplication gain operation section 173 multiplies the code vector read from LSP quantization table storage section 1307 by the adaptive gain selected by adaptive gain selection section 172. LSP quantization section 174 vector quantizes the quantization target LSP using the code vector multiplied by the adaptive gain. LSP decoding section 175 has a function of decoding the vector quantized LSP to generate and output a decoded LSP, and a function of obtaining the LSP quantization error, that is, the difference between the quantization target LSP and the decoded LSP, and outputting it to adaptive gain selection section 172. Based on the gain generation information stored in gain information storage section 171, the adaptive gain by which the code vector was multiplied when the quantization target LSP of the previous frame was vector quantized, and the magnitude of the LSP quantization error for the previous frame, adaptive gain selection section 172 calculates the adaptive gain to be multiplied onto the code vector when vector quantizing the quantization target LSP of the processing frame, and outputs the calculated adaptive gain to multiplication gain operation section 173.
In this way, LSP quantization/decoding section 152 vector quantizes and decodes the quantization target LSP while adaptively varying the gain multiplied onto the code vector.
Here, LSP quantization/decoding section 152 will be described in more detail. Gain information storage section 171 stores 4 gain candidates (0.9, 1.0, 1.1, 1.2) referred to by adaptive gain selection section 172. Adaptive gain selection section 172 obtains adaptive gain selection reference value Slsp from equation (35), which divides the power ERpow of the quantization error generated when the quantization target LSP of the previous frame was quantized by the square of the adaptive gain Gqlsp selected when the quantization target LSP of the previous frame was vector quantized.
Slsp: adaptive gain selection reference value
ERpow: power of quantization error generated when quantizing LSP of previous frame
Gqlsp: adaptive gain selected when quantizing LSP of previous frame
Using the obtained adaptive gain selection reference value Slsp, 1 gain is selected by equation (36) from the 4 gain candidates (0.9, 1.0, 1.1, 1.2) read from gain information storage section 171. The value of the selected adaptive gain Glsp is output to multiplication gain operation section 173, and information specifying which of the 4 candidates was selected (2-bit information) is output to the parameter encoding section.
Glsp: adaptive gain multiplied on code vector for LSP quantization
Slsp: adaptive gain selection reference value
The selected adaptive gain Glsp and the power of the error accompanying quantization are held in the variables Gqlsp and ERpow, respectively, until the quantization target LSP of the next frame is vector-quantized.
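The selection logic of expressions (35) and (36) can be sketched as follows. The threshold values in the rule below are purely illustrative assumptions, since the actual decision boundaries of expression (36) are not reproduced in this text; only the reference value of expression (35) and the 4 candidates are given.

```python
GAIN_CANDIDATES = (0.9, 1.0, 1.1, 1.2)

def select_adaptive_gain(er_pow, prev_gain):
    """Pick one of the 4 stored gain candidates for the current frame.

    er_pow:    power ERpow of the previous frame's LSP quantization error
    prev_gain: adaptive gain Gqlsp selected for the previous frame
    """
    # Expression (35): reference value derived from the previous frame
    slsp = er_pow / (prev_gain ** 2)
    # Expression (36) maps slsp to a candidate; these thresholds are
    # illustrative assumptions, not values from the specification.
    if slsp < 1e-4:
        return GAIN_CANDIDATES[0]   # quantization was easy: shrink the gain
    elif slsp < 5e-4:
        return GAIN_CANDIDATES[1]
    elif slsp < 1e-3:
        return GAIN_CANDIDATES[2]
    else:
        return GAIN_CANDIDATES[3]   # large error: widen the search range
```

The larger the previous frame's quantization error (relative to the gain that was used), the larger the gain chosen for the current frame, which widens the effective spread of the codebook.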
Multiplication gain operation section 173 multiplies the code vector read out from LSP quantization table storage section 1307 by the adaptive gain Glsp selected by adaptive gain selection section 172, and outputs the result to LSP quantization section 174. LSP quantization section 174 vector-quantizes the quantization target LSP using the code vector multiplied by the adaptive gain, and outputs the index to the parameter coding section. LSP decoding section 175 decodes the LSP quantized by LSP quantization section 174 and outputs the obtained decoded LSP, subtracts the decoded LSP from the quantization target LSP to obtain the LSP quantization error, calculates the power ERpow of that quantization error, and outputs it to adaptive gain selection section 172.
This embodiment can reduce abnormal noise in synthesized speech which may occur when the quantization characteristic of LSP is insufficient.
Embodiment 10
Fig. 18 is a block diagram showing the configuration of the sound source vector generator according to the present embodiment. This sound source vector generator includes a fixed waveform storage unit 181 that stores 3 fixed waveforms (V1 (length: L1), V2 (length: L2), and V3 (length: L3)) for channels CH1, CH2, and CH3, a fixed waveform placement unit 182 that holds fixed waveform start candidate position information for each channel and places the fixed waveforms (V1, V2, and V3) read out from fixed waveform storage unit 181 at positions P1, P2, and P3, respectively, and an addition unit 183 that adds the fixed waveforms placed by fixed waveform placement unit 182 and outputs a sound source vector.
Next, the operation of the sound source vector generator configured as described above will be described.
3 fixed waveforms V1, V2, and V3 are stored in advance in fixed waveform storage unit 181. Based on its own fixed waveform start candidate position information shown in table 8, fixed waveform placement unit 182 places (shifts) the fixed waveform V1 read out from fixed waveform storage unit 181 at a position P1 selected from the start candidate positions for CH1, and similarly places fixed waveforms V2 and V3 at positions P2 and P3 selected from the start candidate positions for CH2 and CH3, respectively.
Table 8: fixed waveform start candidate position information

Channel  Sign  Fixed waveform start candidate positions
CH1      +/-1  P1: 0, 10, 20, 30, ..., 60, 70
CH2      +/-1  P2: 2, 12, 22, 32, ..., 62, 72
               and 6, 16, 26, 36, ..., 66, 76
CH3      +/-1  P3: 4, 14, 24, 34, ..., 64, 74
               and 8, 18, 28, 38, ..., 68, 78
The addition unit 183 adds the fixed waveforms arranged by the fixed waveform arrangement unit 182 and generates a sound source vector.
Here, a code number is assigned to the fixed waveform start candidate position information held by fixed waveform placement unit 182, in one-to-one correspondence with each selectable combination of start candidate positions of the fixed waveforms (information indicating which position is selected as P1, which as P2, and which as P3).
With the sound source vector generator configured in this way, voice information can be transmitted by transmitting the code number corresponding one-to-one to the fixed waveform start candidate position information held by fixed waveform placement unit 182. Since only as many code numbers exist as the product of the numbers of start candidate positions, a sound source vector close to actual voice can be generated without increasing the memory or amount of computation required.
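As a concrete illustration, the placement-and-add operation of fixed waveform placement unit 182 and addition unit 183 might look like the sketch below. The sub-frame length of 80 (matching the position ranges of table 8) and the truncation of waveforms at the frame end are assumptions for the sake of the example.

```python
import numpy as np

def make_source_vector(waveforms, positions, signs, length=80):
    """Place each channel's fixed waveform at its selected start position,
    apply the channel sign (the +/-1 column of table 8), and sum the
    channels, as addition unit 183 does."""
    c = np.zeros(length)
    for v, p, s in zip(waveforms, positions, signs):
        n = min(len(v), length - p)   # assumed: truncate at the frame end
        c[p:p + n] += s * np.asarray(v[:n], dtype=float)
    return c

# Example: V1 on CH1 at P1=10, V2 on CH2 at P2=2 (sign -1), V3 on CH3 at P3=78
v1, v2, v3 = [1.0, -0.5, 0.25], [0.8, 0.4], [0.6, 0.3]
c = make_source_vector([v1, v2, v3], [10, 2, 78], [+1, -1, +1])
```

Only the combination of (P1, P2, P3) and the signs needs to be encoded; the waveforms themselves are fixed and shared by encoder and decoder.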
To transmit voice information through transmission of the code number, the above sound source vector generator can be used as the noise codebook of a speech encoding/decoding apparatus.
In the present embodiment, the case of using the 3 fixed waveforms shown in fig. 18 has been described, but the same operation and effect can be obtained even when the number of fixed waveforms (equal to the number of channels in fig. 18 and table 8) is some other number.
In the present embodiment, although the fixed waveform arranging means 182 has been described as having the fixed waveform start point candidate position information shown in table 8, the same operation and effect can be obtained also in the case of having the fixed waveform start point candidate position information other than table 8.
Embodiment 11
Fig. 19A is a block diagram showing the configuration of a CELP type speech encoding apparatus according to the present embodiment. Fig. 19B is a block diagram showing a configuration of a CELP type speech decoding apparatus to be mated with the CELP type speech encoding apparatus.
The CELP type speech coding apparatus according to the present embodiment includes a sound source vector generator consisting of fixed waveform storage unit 181A, fixed waveform placement unit 182A, and addition unit 183A. Fixed waveform storage unit 181A stores a plurality of fixed waveforms; fixed waveform placement unit 182A places (shifts) the fixed waveforms read out from fixed waveform storage unit 181A at selected positions based on its own fixed waveform start candidate position information; and addition unit 183A adds the placed fixed waveforms to generate sound source vector C.
This CELP type speech coding apparatus further includes time inverting section 191 that time-reverses the input noise codebook search target X, synthesis filter 192 that synthesizes the output of time inverting section 191, time inverting section 193 that time-reverses the output of synthesis filter 192 again and outputs the result as time-reversed synthesis target X', synthesis filter 194 that synthesizes the sound source vector C multiplied by noise code vector gain gc and outputs synthesized sound source vector S, distortion calculating section 205 that receives X', C, and S and calculates the coding distortion, and transmission section 196.
In the present embodiment, since the fixed waveform storage unit 181A, the fixed waveform allocation unit 182A, and the addition unit 183A correspond to the fixed waveform storage unit 181, the fixed waveform allocation unit 182, and the addition unit 183 shown in fig. 18, and the fixed waveform start point candidate positions of the respective channels correspond to table 8, the symbols indicating the channel numbers, the fixed waveform numbers, and the lengths and positions thereof are used as symbols shown in fig. 18 and table 8.
On the other hand, the CELP type speech decoding apparatus of fig. 19B includes fixed waveform storage unit 181B that stores a plurality of fixed waveforms, fixed waveform placement unit 182B that places (shifts) the fixed waveforms read out from fixed waveform storage unit 181B at selected positions based on its own fixed waveform start candidate position information, addition unit 183B that adds the fixed waveforms placed by fixed waveform placement unit 182B to generate sound source vector C, gain multiplication section 197 that multiplies by noise code vector gain gc, and synthesis filter 198 that synthesizes sound source vector C and outputs synthesized sound source vector S.
Fixed waveform storage unit 181B and fixed waveform placement unit 182B of the speech decoding apparatus have the same configurations as fixed waveform storage unit 181A and fixed waveform placement unit 182A of the speech encoding apparatus, and the fixed waveforms stored in fixed waveform storage units 181A and 181B are waveforms trained in advance so as to statistically minimize the cost function, where the cost function is the coding distortion calculation expression (3) applied to the noise codebook search target.
Next, the operation of the audio encoding device configured as described above will be described.
Noise codebook search target X is time-reversed by time inverting section 191, synthesized by synthesis filter 192, time-reversed again by time inverting section 193, and then output to distortion calculating section 205 as the time-reversed synthesis target X' for the noise codebook search.
Next, fixed waveform placement unit 182A places (shifts) the fixed waveform V1 read out from fixed waveform storage unit 181A at the position P1 selected from the start candidate positions for CH1, based on its own fixed waveform start candidate position information shown in table 8, and similarly places fixed waveforms V2 and V3 at positions P2 and P3 selected from the start candidate positions for CH2 and CH3. The placed fixed waveforms are output to addition unit 183A, added together, and input to synthesis filter 194 as sound source vector C. Synthesis filter 194 synthesizes sound source vector C to generate synthesized sound source vector S, and outputs it to distortion calculating section 205.
Distortion calculating section 205 receives time reversal synthesis target X', excitation vector C, and synthesized excitation vector S as input, and calculates coding distortion of expression (4).
After the distortion is calculated, distortion calculating section 205 sends a signal to fixed waveform placement section 182A, and the above processing, from the selection of start candidate positions for the 3 channels by fixed waveform placement section 182A through the calculation of distortion by distortion calculating section 205, is repeated for all combinations of start candidate positions selectable by fixed waveform placement section 182A.
Then, the combination of start candidate positions that minimizes the coding distortion is selected, and the code number corresponding one-to-one to that combination and the optimal noise code vector gain gc at that time are transmitted to transmission section 196 as the code of the noise codebook.
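The closed-loop search just described, trying every selectable combination of start candidate positions and keeping the one with least distortion, can be sketched as follows. Here `place` and `synthesize` are assumed stand-ins for the placement/addition stage and synthesis filter 194; minimizing the coding distortion of expression (4) at the optimal gain is equivalent to maximizing the criterion (x.s)^2 / (s.s), which is the form used below.

```python
import itertools
import numpy as np

def search_positions(x, place, synthesize, pos_candidates):
    """Exhaustively try every combination of start candidate positions and
    return the one minimizing the coding distortion, i.e. maximizing the
    matched-filter criterion (x.s)^2 / (s.s)."""
    best_val, best_combo = -1.0, None
    for combo in itertools.product(*pos_candidates):
        s = synthesize(place(combo))       # synthesized source vector S
        den = float(np.dot(s, s))
        if den <= 0.0:
            continue
        val = float(np.dot(x, s)) ** 2 / den
        if val > best_val:
            best_val, best_combo = val, combo
    return best_combo
```

The number of iterations is the product of the channels' candidate counts, which is exactly the number of assignable code numbers.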
Next, the operation of the audio decoding apparatus in fig. 19B will be described.
Fixed waveform placement unit 182B selects the position of the fixed waveform of each channel from its own fixed waveform start candidate position information shown in table 8, based on the information sent from transmission section 196; it places (shifts) the fixed waveform V1 read out from fixed waveform storage unit 181B at the position P1 selected from the start candidate positions for CH1, and similarly places fixed waveforms V2 and V3 at positions P2 and P3 selected from the start candidate positions for CH2 and CH3. The placed fixed waveforms are output to addition unit 183B and added to become sound source vector C, which is multiplied by the noise code vector gain gc selected from the information from transmission section 196 and output to synthesis filter 198. Synthesis filter 198 synthesizes the sound source vector C multiplied by gc, and generates and outputs synthesized sound source vector S.
In the speech encoding/decoding apparatus configured in this way, the sound source vector is generated by a sound source vector generator consisting of fixed waveform storage means, fixed waveform placement means, and an adder, so in addition to the effects of embodiment 10, the synthesized sound source vector obtained by passing such a sound source vector through the synthesis filter has characteristics statistically close to the actual target, and a high-quality synthesized speech can be obtained.
In the present embodiment, the case where learned fixed waveforms are stored in fixed waveform storage units 181A and 181B has been described, but a high-quality synthesized speech can be obtained similarly when fixed waveforms created from the results of statistically analyzing the noise codebook search target X are used, and also when fixed waveforms created based on knowledge of actual speech are used.
In the present embodiment, the case where the fixed waveform storage means stores 3 fixed waveforms has been described, but the same operation and effect can be obtained even when the number of fixed waveforms is other.
In the present embodiment, although the case where the fixed waveform placement means has the fixed waveform start point candidate position information shown in table 8 has been described, the same operation and effect can be obtained even when the fixed waveform start point candidate position information other than table 8 is provided.
Embodiment 12
Fig. 20 is a block diagram showing a configuration of a CELP type speech encoding apparatus according to the present embodiment.
The CELP type speech coding apparatus includes fixed waveform memory 200 storing a plurality of fixed waveforms (in the present embodiment, CH1: W1, CH2: W2, CH3: W3), and fixed waveform placement unit 201 that holds fixed waveform start candidate position information as information for generating, by algebraic rules, the start positions of the fixed waveforms stored in fixed waveform memory 200. The CELP type speech coding apparatus further includes waveform-based impulse response calculation section 202, pulse generator 203, correlation matrix calculation section 204, time inverting section 193, and distortion calculating section 205.
The waveform-based impulse response calculation unit 202 has a function of calculating 3 types of waveform-based impulse responses (CH 1: h1, CH 2: h2, CH 3: h3, length L being a sub-frame length) by convolving 3 fixed waveforms from the fixed waveform memory 200 with an impulse response h (length L being a sub-frame length) of a synthesis filter.
The waveform synthesis filter 192' has a function of convolving the output of the time inverting unit 191 which time-inverts the input noise code search target X with the waveform-based impulse responses h1, h2, and h3 from the waveform-based impulse response computing unit 202, respectively.
Pulse generator 203 places a pulse of amplitude 1 (with sign) only at the start candidate positions P1, P2, and P3 selected by fixed waveform placement unit 201, generating pulses for the respective channels (CH1: d1, CH2: d2, CH3: d3).
Correlation matrix calculator 204 calculates the autocorrelations of the waveform-based impulse responses h1, h2, and h3 from waveform-based impulse response calculation section 202 and the cross-correlations between h1 and h2, h1 and h3, and h2 and h3, and expands the obtained correlation values in correlation matrix memory RR.
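This precomputation can be illustrated with the convolution-matrix formulation used in expression (37): build the lower-triangular matrices Hi from the waveform-based impulse responses hi and store every product Hi^t Hj. The code below is an assumed sketch of that idea, not the patent's implementation.

```python
import numpy as np

def conv_matrix(h, L):
    """Lower-triangular Toeplitz convolution matrix of impulse response h,
    so that conv_matrix(h, L) @ d convolves d (length L) with h."""
    H = np.zeros((L, L))
    for k, hk in enumerate(h[:L]):
        H += hk * np.eye(L, k=-k)   # place hk on the k-th sub-diagonal
    return H

def correlation_matrices(h_list, L):
    """All auto- and cross-correlations Hi^t Hj of the waveform-based
    impulse responses (the contents of correlation matrix memory RR)."""
    Hs = [conv_matrix(h, L) for h in h_list]
    return [[Hi.T @ Hj for Hj in Hs] for Hi in Hs]
```

Because these matrices depend only on the synthesis filter and the fixed waveforms, they are computed once per sub-frame, before the position search begins.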
Distortion calculating section 205 identifies the noise code vector that minimizes the coding distortion by using expression (37), a modification of expression (4), with the 3 waveform-based time-reversed synthesis targets (x'1, x'2, x'3), correlation matrix memory RR, and the 3 channel pulses (d1, d2, d3).
di: channel pulse (vector)
di=±1×δ(k-pi),k=0~L-1,pi: n-th channel fixed waveform start candidate position
HiWaveform-specific impulse response convolution matrix (H)i=HWi)
WiFixed waveform convolution matrix
Wherein WiIs a fixed waveform (length: L) of the ith channeli)
x’i: at HiTime-inverting x to synthesize inverted vector (x't i=Hi)
Here, the transformation from expression (4) to expression (37) is shown by expressions (38) and (39), which represent the transformation of the denominator term and the numerator term, respectively.
x: noise codebook search target (vector)
x^t: transposed vector of x
H: impulse response convolution matrix of the synthesis filter
c: noise code vector (c = W1 d1 + W2 d2 + W3 d3)
Wi: fixed waveform convolution matrix
di: pulse of channel i (vector)
Hi: waveform-based impulse response convolution matrix (Hi = H Wi)
x'i: vector obtained by time-reversing x, synthesizing it with Hi, and time-reversing again (x'i^t = x^t Hi)
H: impulse response convolution matrix of the synthesis filter
c: noise code vector (c = W1 d1 + W2 d2 + W3 d3)
Wi: fixed waveform convolution matrix
di: pulse of channel i (vector)
Hi: waveform-based impulse response convolution matrix (Hi = H Wi)
Next, the operation of the CELP type speech encoding apparatus having the above-described configuration will be described.
First, waveform-based impulse response calculation section 202 convolves the impulse response h of the synthesis filter with each of the 3 fixed waveforms W1, W2, and W3 to calculate the 3 waveform-based impulse responses h1, h2, and h3, and outputs them to waveform-based synthesis filter 192' and correlation matrix calculator 204.
Next, waveform-based synthesis filter 192' convolves the noise codebook search target X time-reversed by time inverting section 191 with each of the 3 input waveform-based impulse responses h1, h2, and h3; time inverting section 193 time-reverses the 3 output vectors of waveform-based synthesis filter 192' again to generate the 3 waveform-based time-reversed synthesis targets x'1, x'2, and x'3, and outputs them to distortion calculating section 205.
Next, correlation matrix operation section 204 calculates the autocorrelation of each of input 3 waveform impulse responses h1, h2, and h3 and the cross-correlation between h1 and h2, h1 and h3, and h2 and h3, develops the obtained correlation values in correlation matrix memory RR, and outputs the developed correlation values to distortion operation section 205.
After the above-described processing is performed as preprocessing, the fixed waveform placement unit 201 selects a start candidate position of a fixed waveform for each channel, and outputs the position information to the pulse generator 203.
Pulse generator 203 places a pulse of amplitude 1 (with sign) at each selected position obtained from fixed waveform placement unit 201, generates the channel pulses d1, d2, and d3, and outputs them to distortion calculating section 205.
Then, distortion calculating section 205 calculates the coding distortion search reference value of expression (37) using the 3 waveform-based time-reversed synthesis targets x'1, x'2, and x'3, correlation matrix memory RR, and the 3 channel pulses d1, d2, and d3.
Fixed waveform placement section 201 repeats the above processing, from the selection of start candidate positions for the 3 channels to the calculation of distortion by distortion calculating section 205, for all combinations of start candidate positions that it can select. Then, the code number corresponding to the combination of start candidate positions that optimizes the coding distortion search reference value of expression (37), and the optimal noise code vector gain gc at that time, are transmitted to the transmission section as the code of the noise codebook.
The configuration of the speech decoding apparatus of the present embodiment is the same as that of fig. 19B of embodiment 11, and the fixed waveform storage means and fixed waveform placement means of the speech encoding apparatus have the same configurations as those of the speech decoding apparatus. The fixed waveforms stored in the fixed waveform storage means are waveforms trained so as to statistically minimize the cost function, where the cost function is the coding distortion calculation expression (3) applied to the noise codebook search target.
When the fixed waveform start candidate positions in the fixed waveform placement unit can be generated by algebraic rules, the speech encoding/decoding apparatus configured as described above can calculate the numerator term of expression (37) by adding 3 terms picked from the waveform-based time-reversed synthesis targets and squaring the result, and the denominator term of expression (37) by adding 9 terms picked from the correlation matrices of the waveform-based impulse responses obtained in the preprocessing stage. Therefore, the search can be completed with the same amount of computation as when a conventional algebraic sound source (a sound source vector consisting of several pulses of amplitude 1) is used as the noise codebook.
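One evaluation of the search reference value of expression (37) therefore needs only 3 picks from the time-reversed targets for the numerator and 9 picks from the precomputed correlation matrices for the denominator. A sketch, with variable names assumed:

```python
def search_reference_value(x_rev, RR, positions, signs):
    """Search reference value of expression (37) for one position combination.

    x_rev:     list of the 3 waveform-based time-reversed targets x'1..x'3
    RR:        RR[i][j] is the precomputed correlation matrix Hi^t Hj
    positions: selected start candidate positions (p1, p2, p3)
    signs:     pulse polarities (+1 or -1) per channel
    """
    chans = list(zip(positions, signs))
    # Numerator: square of the sum of 3 picked target values
    num = sum(s * x_rev[i][p] for i, (p, s) in enumerate(chans)) ** 2
    # Denominator: sum of 9 picked correlation values
    den = sum(si * sj * RR[i][j][pi][pj]
              for i, (pi, si) in enumerate(chans)
              for j, (pj, sj) in enumerate(chans))
    return num / den
```

No convolution or synthesis is performed inside the search loop; all filtering cost is paid once in the preprocessing stage, which is what makes the complexity comparable to an algebraic codebook search.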
Further, the synthesized acoustic source vector synthesized by the synthesis filter has a characteristic statistically close to the actual target, so that a high-quality synthesized voice can be obtained.
In addition, the present embodiment shows the case where learned fixed waveforms are stored in the fixed waveform storage means, but a high-quality synthesized speech can be obtained similarly when fixed waveforms created from the results of statistically analyzing the noise codebook search target X are used, and also when fixed waveforms created based on knowledge of actual speech are used.
Although the present embodiment has been described with respect to the case where the fixed waveform storage means stores 3 fixed waveforms, the same operation and effect can be obtained even when the number of fixed waveforms is other values.
In addition, although the present embodiment has been described with respect to the case where the fixed waveform allocation means has the fixed waveform start point candidate position information shown in table 8, the same operation and effect can be obtained even in the case where the fixed waveform start point candidate position information other than table 8 is provided if the fixed waveform allocation means can be generated by an algebraic method.
Embodiment 13
Fig. 21 is a block diagram showing the configuration of a CELP type speech encoding apparatus according to the present embodiment. The encoding apparatus of the present embodiment includes 2 types of noise codebooks A211 and B212, a switch 213 for switching between the two noise codebooks, a multiplier 214 for multiplying a noise code vector by a gain, a synthesis filter 215 for synthesizing the noise code vector output from the noise codebook connected to switch 213, and a distortion calculating section 216 for calculating the coding distortion of expression (2).

Noise codebook A211 has the configuration of the sound source vector generator of embodiment 10, and the other noise codebook B212 consists of a random number sequence storage section 217 that stores a plurality of random vectors generated from random number sequences. The switching of the noise codebook is performed in closed loop. X is the target for the noise codebook search.
The following describes the operation of the CELP type speech encoding apparatus having the above-described structure.
Initially, switch 213 is connected to the noise codebook A211 side, and fixed waveform placement section 182 places (shifts) the fixed waveforms read out from fixed waveform storage section 181 at positions selected from the start candidate positions, based on its own fixed waveform start candidate position information shown in table 8. The placed fixed waveforms are added by adder 183 to become a noise code vector, which is multiplied by the noise code vector gain and input to synthesis filter 215. Synthesis filter 215 synthesizes the input noise code vector and outputs the result to distortion calculating section 216.
Distortion calculating section 216 calculates the coding distortion of expression (2) using the synthesized vector obtained from synthesis filter 215 and the noise codebook search target X.
After calculating the distortion, the distortion calculating unit 216 transmits a signal to the fixed waveform arranging unit 182, and repeats the above-described processing from the selection of the start candidate position by the fixed waveform arranging unit 182 to the calculation of the distortion by the distortion calculating unit 216 for all combinations of start candidate positions that can be selected by the fixed waveform arranging unit 182.
Then, the combination of start candidate positions with the minimum coding distortion is selected, and the code number of the noise code vector corresponding one-to-one to that combination, the noise code vector gain gc at that time, and the minimum coding distortion value are stored.
Next, the switch 213 is connected to the noise codebook B212 side, and the random number sequence read out from the random number sequence storage unit 217 becomes a noise code vector, and is multiplied by a noise code vector gain, and then output to the synthesis filter 215. Synthesis filter 215 synthesizes the input noise code vectors and outputs the synthesized noise code vectors to distortion calculating section 216.
Distortion calculating section 216 calculates coding distortion of expression (2) using target X for noise codebook search and the synthesis vector obtained from synthesis filter 215.
After calculating the distortion, distortion calculating section 216 transfers a signal to random number sequence storing section 217, and repeats the above-described processing from when random number sequence storing section 217 selects a noise code vector to when distortion calculating section 216 calculates the distortion, for all noise code vectors that random number sequence storing section 217 can select.
Then, a noise code vector having the minimum coding distortion is selected, and the code number of the noise code vector, the noise code vector gain gc at that time, and the minimum coding distortion are stored.
Next, distortion calculating section 216 compares the minimum coding distortion obtained when switch 213 is connected to noise codebook A211 with the minimum coding distortion obtained when switch 213 is connected to noise codebook B212, selects whichever is smaller, and transmits the corresponding switch connection information together with the code number and noise code vector gain at that time to a transmission section (not shown) as the speech code.
The speech decoding apparatus paired with the speech encoding apparatus of the present embodiment is configured with the noise codebook A, noise codebook B, switch, noise code vector gain, and synthesis filter arranged as in fig. 21. From the speech code received from the transmission section, it determines the noise codebook to be used, the noise code vector, and the noise code vector gain, and obtains the synthesized sound source vector as the output of the synthesis filter.
With the speech encoding/decoding apparatus configured in this way, whichever of the noise code vector generated from noise codebook A and the noise code vector generated from noise codebook B minimizes the coding distortion of expression (2) can be selected in closed loop, so a sound source vector closer to the actual sound is generated and a high-quality synthesized speech can be obtained.
Although the present embodiment shows the audio encoding/decoding apparatus based on the configuration shown in fig. 2 as the conventional CELP-type audio encoding apparatus, the same operation and effect can be obtained by using the present embodiment in the CELP-type audio encoding apparatus/decoding apparatus based on the configuration shown in fig. 19A, B or fig. 20.
Although the noise codebook a211 has the configuration shown in fig. 18 in the present embodiment, the same operation and effect can be obtained even when the fixed waveform storage unit 181 has another configuration (for example, 4 kinds of fixed waveforms or the like).
In the present embodiment, the description has been given of the case where the fixed waveform allocation unit 182 of the noise codebook a211 has the fixed waveform start point candidate position information shown in table 8, but the same operation and effect can be obtained even when other fixed waveform start point candidate position information is provided.
Although the present embodiment has been described with respect to the case where the noise codebook B212 is configured by the random number sequence storage unit 217 which directly stores a plurality of random number sequences in a memory, the same operation and effect can be obtained even when the noise codebook B212 has another sound source structure (for example, when it is configured by algebraic structure sound source generation information).
Although the present embodiment has described the CELP type speech coder/decoder having 2 types of noise codebooks, the same operation and effect can be obtained even when the CELP type speech coder/decoder having 3 or more types of noise codebooks is used.
Embodiment 14
Fig. 22 shows the configuration of a CELP type speech encoding apparatus according to the present embodiment. The speech encoding apparatus of the present embodiment has two types of noise codebooks: one has the configuration of the sound source vector generator shown in fig. 18 of embodiment 10, and the other consists of a pulse train storage section that stores a plurality of pulse trains. The noise codebook is switched adaptively using the quantized pitch gain that has already been obtained before the noise codebook search.
Noise codebook A211 consists of fixed waveform storage section 181, fixed waveform placement section 182, and adder 183, and corresponds to the sound source vector generator of fig. 18. Noise codebook B221 consists of pulse train storage section 222 that stores a plurality of pulse trains. Switch 213' switches between noise codebook A211 and noise codebook B221. Multiplier 224 outputs an adaptive code vector obtained by multiplying the output of adaptive codebook 223 by the pitch gain obtained before the noise codebook search. The output of pitch gain quantization section 225 is passed to switch 213'.
Next, the operation of the CELP type speech encoding apparatus having the above-described configuration will be described.
The conventional CELP type speech encoding apparatus first searches adaptive codebook 223 and then, receiving that result, performs the noise codebook search. The adaptive codebook search is a process of selecting the optimum adaptive code vector from the plurality of candidates stored in adaptive codebook 223 (the sound source vector is obtained by multiplying the adaptive code vector and the noise code vector by their respective gains and adding them), and it produces the code number and pitch gain of the adaptive code vector.
The CELP type speech coding apparatus according to the present embodiment quantizes the pitch gain in pitch gain quantizing section 225, generates a quantized pitch gain, and then performs a noise codebook search. The quantized pitch gain obtained by the pitch gain quantization section 225 is sent to a switch 213' for switching the noise codebook.
Switch 213' judges the input speech to be strongly unvoiced when the value of the quantized pitch gain is small and connects to noise codebook A211, and judges it to be strongly voiced when the value of the quantized pitch gain is large and connects to noise codebook B221.
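This voiced/unvoiced switching rule is simple to state in code; the threshold below is an assumption for illustration, since the text does not give the actual decision boundary.

```python
def choose_noise_codebook(quantized_pitch_gain, threshold=0.5):
    """Switch 213': a small quantized pitch gain means the input is judged
    strongly unvoiced, so the fixed-waveform codebook A is used; a large
    gain means strongly voiced, so the pulse-train codebook B is used.
    The 0.5 threshold is illustrative, not from the specification."""
    return 'A' if quantized_pitch_gain < threshold else 'B'
```

Because the quantized pitch gain is transmitted anyway, the decoder can reproduce the same decision without any extra bits for the codebook selection.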
When switch 213' is connected to the noise codebook A211 side, fixed waveform placement section 182 places (shifts) the fixed waveforms read out from fixed waveform storage section 181 at positions selected from the start candidate positions, based on its own fixed waveform start candidate position information shown in table 8. The placed fixed waveforms are output to adder 183 and added to become a noise code vector, which is multiplied by the noise code vector gain and input to synthesis filter 215. Synthesis filter 215 synthesizes the input noise code vector and outputs the result to distortion calculating section 216.
Distortion calculating section 216 calculates coding distortion of expression (2) using target X for noise codebook search and the vector obtained from synthesis filter 215.
After calculating the distortion, the distortion calculating unit 216 transmits a signal to the fixed waveform arranging unit 182, and repeats the above-described processing, from the selection of the start candidate positions by the fixed waveform arranging unit 182 to the calculation of the distortion by the distortion calculating unit 216, for all combinations of start candidate positions that the fixed waveform arranging unit 182 can select.
Then, a combination of start candidate positions with the smallest coding distortion is selected, and the code number of the noise code vector corresponding to the combination of the start candidate positions, the noise code vector gain gc at that time, and the quantized pitch gain are transmitted to the transmission unit as the audio code. In the present embodiment, the fixed waveform map stored in the fixed waveform storage unit 181 is made to exhibit unvoiced sound properties before the sound encoding is performed.
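The search loop described above, which tries every combination of start candidate positions and keeps the one with minimum coding distortion, can be sketched as below. The candidate position grids stand in for table 8 (not reproduced here), the synthesis filter is abstracted as a callable, and the per-candidate optimal gain is used in place of expression (2); all of these are assumptions for illustration:

```python
import itertools
import numpy as np

# Hypothetical start-candidate positions for 3 fixed waveforms (stand-ins
# for the tracks of table 8).
START_CANDIDATES = [range(0, 80, 8), range(2, 80, 8), range(4, 80, 8)]
SUBFRAME = 80

def place(waveform, start):
    """Shift one fixed waveform to its start position, truncating at the
    subframe boundary."""
    v = np.zeros(SUBFRAME)
    end = min(SUBFRAME, start + len(waveform))
    v[start:end] = waveform[:end - start]
    return v

def search(fixed_waveforms, target, synthesize):
    """Try every combination of start positions and keep the one whose
    synthesized noise code vector is closest to the search target X."""
    best = (None, np.inf)
    for starts in itertools.product(*START_CANDIDATES):
        code = sum(place(w, s) for w, s in zip(fixed_waveforms, starts))
        synth = synthesize(code)
        gain = float(target @ synth) / float(synth @ synth)  # optimal gain
        distortion = float(np.sum((target - gain * synth) ** 2))
        if distortion < best[1]:
            best = (starts, distortion)
    return best
```

With an identity "filter", a target built from known positions is recovered exactly, which illustrates why the code number only has to index the position combination.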
On the other hand, when the switch 213' is connected to the noise codebook B221 side, the pulse train read out from the pulse train storage section 222 becomes the noise code vector, which is multiplied by the noise code vector gain and then input to the synthesis filter 215. The synthesis filter 215 synthesizes the input noise code vector and outputs the synthesized vector to the distortion calculation unit 216.
Distortion calculating section 216 calculates the coding distortion of expression (2) using the noise codebook search target X and the synthesis vector obtained from synthesis filter 215.
After calculating distortion, distortion calculating section 216 transfers a signal to burst storage section 222, and repeats the above-described processing from when burst storage section 222 selects a noise code vector to when distortion calculating section 216 calculates distortion for all noise code vectors that burst storage section 222 can select.
Then, a noise code vector having the smallest coding distortion is selected, and the code number of the noise code vector, the noise code vector gain gc at that time, and the quantized pitch gain are transmitted as a sound code to a transmission unit.
The audio decoding apparatus paired with the audio encoding apparatus of the present embodiment has the noise codebook A, noise codebook B, switch, noise code vector gain, and synthesis filter arranged in the same configuration as in fig. 22. It first receives the transmitted quantized pitch gain and determines from its magnitude whether the encoder-side switch 213' was connected to the noise codebook a211 side or the noise codebook B221 side. Then, from the code number and the code of the noise code vector gain, it obtains the synthesized sound source vector as the output of the synthesis filter.
With the sound source encoding/decoding device having such a configuration, the 2 noise codebooks can be switched adaptively in accordance with the characteristics of the input sound (in the present embodiment, the magnitude of the quantized pitch gain is used as the voiced/unvoiced determination data): a pulse train is selected as the noise code vector when the voiced character of the input sound is strong, and a noise code vector exhibiting unvoiced character is selected when the unvoiced character is strong, so that a sound source vector closer to the original sound can be generated and the quality of the synthesized speech improved. While the present embodiment performs this switching in an open loop, the switching may also be performed in a closed loop at the cost of increasing the information to be transmitted, which can further improve the operation and effect.
Although the present embodiment shows an audio encoding/decoding apparatus based on the configuration shown in fig. 2 as an existing CELP type audio encoding apparatus, the same effects can be obtained by using the present embodiment in a CELP type audio encoding/decoding apparatus based on the configuration shown in fig. 19A, B or fig. 20.
In the present embodiment, the pitch gain obtained by quantizing the pitch gain of the adaptive code vector in the pitch gain quantizer 225 is used as the parameter for switching the switch 213', but a pitch period calculated from the adaptive code vector by providing a pitch period calculator may be used instead.
Although the noise codebook a211 has the configuration shown in fig. 18 in the present embodiment, the same operation and effect can be obtained even when the fixed waveform storage unit 181 has another configuration (for example, when there are 4 kinds of fixed waveforms).
In the present embodiment, the description has been given of the case where the fixed waveform arranging unit 182 of the noise codebook a211 has the fixed waveform start candidate position information shown in table 8, but the same operation and effect can be obtained even when other fixed waveform start candidate position information is used.
In the present embodiment, although the case where the noise codebook B221 is configured by the pulse train storage section 222 that stores pulse trains directly in memory has been described, the same operation and effect can be obtained even when the noise codebook B221 has another sound source structure (for example, when it is configured by algebraic-structure sound source generation information).
Although the CELP type speech coder/decoder having 2 types of noise codebooks has been described in the present embodiment, the same operation and effects can be obtained by using a CELP type speech coder/decoder having 3 or more types of noise codebooks.
Embodiment 15
Fig. 23 is a block diagram showing the configuration of a CELP type speech encoding apparatus according to the present embodiment. The audio encoding device of the present embodiment has two types of noise codebooks, one type of noise codebook is the configuration of the acoustic vector generation device shown in fig. 18 of embodiment 10, 3 fixed waveforms are stored in the fixed waveform storage means, and the other type of noise codebook is the same configuration as the acoustic vector generation device shown in fig. 18, but the fixed waveforms stored in the fixed waveform storage means are 2, and the switching between the two types of noise codebooks is performed in a closed loop.
The noise codebook a211 is composed of a fixed waveform storage unit a181 that stores 3 fixed waveforms, a fixed waveform allocation unit a182, and an adder 183, and corresponds to a case where 3 fixed waveforms are stored in the fixed waveform storage unit with the configuration of the acoustic vector generation device in fig. 18.
The noise codebook B230 is configured by a fixed waveform storage unit B231 that stores 2 fixed waveforms, a fixed waveform allocation unit B232 that includes the fixed waveform start point candidate position information shown in table 9, and an adder 233 that adds the 2 fixed waveforms allocated by the fixed waveform allocation unit B232 to generate a noise code vector, and corresponds to a case where 2 fixed waveforms are stored in the fixed waveform storage unit with the configuration of the acoustic source vector generating device of fig. 18.
TABLE 9
Channel number  Symbol  Fixed waveform start candidate positions
CH1  ±  P1: 0, 4, 8, 12, 16, ..., 72, 76
        2, 6, 10, 14, 18, ..., 74, 78
CH2  ±  P2: 1, 5, 9, 13, 17, ..., 73, 77
        3, 7, 11, 15, 19, ..., 75, 79
The other structure is also the same as embodiment 13.
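A minimal sketch of how noise codebook B230 builds a noise code vector from table 9: each of the 2 channels contributes one fixed waveform, shifted to a chosen start position and given a +/- sign, and the two are summed by adder 233. The step-2 position grids per channel and the subframe length of 80 are assumptions read from the garbled table:

```python
import numpy as np

SUBFRAME = 80
# Assumed start candidate positions: the two interleaved step-4 grids of
# each channel in table 9 merge into one step-2 grid.
CH1_POS = list(range(0, 80, 2))   # 0, 2, 4, ..., 78
CH2_POS = list(range(1, 80, 2))   # 1, 3, 5, ..., 79

def noise_code_vector(w1, w2, s1, s2, sign1, sign2):
    """Shift each fixed waveform to its start position, apply its +/-
    sign, and add the two channels (the role of adder 233)."""
    v = np.zeros(SUBFRAME)
    for w, s, sg in ((w1, s1, sign1), (w2, s2, sign2)):
        end = min(SUBFRAME, s + len(w))
        v[s:end] += sg * w[:end - s]
    return v

# Size of the search space: 40 positions x 2 signs per channel.
combinations = (len(CH1_POS) * 2) * (len(CH2_POS) * 2)
print(combinations)  # 6400
```

The count shows why such a codebook can be searched exhaustively: the code number only has to enumerate position/sign combinations, not stored vectors.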
Next, the operation of the CELP type speech encoding apparatus having the above-described configuration will be described.
Initially, the switch 213 is connected to the noise codebook a211 side, and the fixed waveform arranging unit a182 arranges (shifts) the 3 fixed waveforms read out from the fixed waveform storage unit a181 to positions selected from the start candidate positions, based on the fixed waveform start candidate position information shown in table 8 that it holds itself. The 3 arranged fixed waveforms are output to the adder 183, added together to form a noise code vector, and input to the synthesis filter 215 via the switch 213 and the multiplier 214, which multiplies the noise code vector gain. The synthesis filter 215 synthesizes the input noise code vector and outputs the synthesized vector to the distortion calculation unit 216.
Distortion calculating section 216 calculates the coding distortion of expression (2) using the noise codebook search target X and the synthesis vector obtained from synthesis filter 215.
After calculating the distortion, the distortion calculating unit 216 transmits a signal to the fixed waveform arranging unit a182, and repeats the above-described processing from the selection of the start candidate position by the fixed waveform arranging unit a182 to the calculation of the distortion by the distortion calculating unit 216 for all combinations of start candidate positions that can be selected by the fixed waveform arranging unit a 182.
Then, a combination of start candidate positions at which coding distortion is minimum is selected, and the code number of a noise code vector corresponding one-to-one to the combination of the start candidate positions, the noise code vector gain gc at that time, and the coding distortion minimum value are stored.
In the present embodiment, the fixed waveform map stored in the fixed waveform storage unit a181 is obtained by learning to minimize distortion under the condition that there are 3 fixed waveforms before performing audio encoding.
Next, the switch 213 is connected to the noise codebook B230 side, and the fixed waveform arranging unit B232 arranges (shifts) the 2 fixed waveforms read out from the fixed waveform storage unit B231 to positions selected from the start candidate positions, based on the fixed waveform start candidate position information shown in table 9 that it holds itself. The 2 arranged fixed waveforms are output to the adder 233, added together to form a noise code vector, and input to the synthesis filter 215 via the switch 213 and the multiplier 214, which multiplies the noise code vector gain. The synthesis filter 215 synthesizes the input noise code vector and outputs the synthesized vector to the distortion calculation unit 216.
Distortion calculating section 216 calculates coding distortion of expression (2) using target X for noise codebook search and the synthesis vector obtained from synthesis filter 215.
After calculating the distortion, the distortion calculating unit 216 transmits a signal to the fixed waveform arranging unit B232, and repeats the above-described processing from the selection of the start candidate position by the fixed waveform arranging unit B232 to the calculation of the distortion by the distortion calculating unit 216 for all combinations of start candidate positions that can be selected by the fixed waveform arranging unit B232.
Then, a combination of start candidate positions at which coding distortion is minimum is selected, and the code number of a noise code vector corresponding one-to-one to the combination of the start candidate positions, the noise code vector gain gc at that time, and the coding distortion minimum value are stored. In the present embodiment, the fixed waveform map stored in the fixed waveform storage section B231 is obtained by learning to minimize distortion under the condition that 2 fixed waveforms exist, before performing audio encoding.
Next, distortion calculating section 216 compares the minimum coding distortion obtained when switch 213 is connected to the noise codebook a211 side with that obtained when it is connected to the noise codebook B230 side, adopts the connection that gave the smaller coding distortion, and transmits the switch connection information, the code number at that time, and the noise code vector gain as the speech code to the transmitting section.
The audio decoding device according to the present embodiment has a configuration in which the noise codebook a, the noise codebook B, the switch, the noise codevector gain, and the synthesis filter are arranged in the same manner as in fig. 23, and determines the noise codebook, the noise codevector, and the noise codevector gain to be used, based on the audio code input from the transmission means, and obtains the synthesized excitation vector as the output of the synthesis filter.
With the sound encoding/decoding device configured as described above, the noise code vector that minimizes the encoding distortion of expression (2) can be selected from the noise code vector generated by noise codebook a and the noise code vector generated by noise codebook B in a closed loop, so that it is possible to generate a sound source vector closer to the original sound and to obtain a synthesized voice with high sound quality.
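The closed-loop selection between the two codebooks can be sketched as below: each codebook runs its own search on the same target, the smaller minimum distortion wins, and the winner's identity acts as the switch connection information sent with the code. The searcher interface is hypothetical:

```python
def closed_loop_select(searchers, target):
    """Run each codebook's search on the same target and keep the one
    whose minimum coding distortion is smallest.

    `searchers` maps a codebook name (the switch information) to a
    function returning (code_number, gain, min_distortion).
    """
    best = None
    for name, search in searchers.items():
        code_number, gain, distortion = search(target)
        if best is None or distortion < best[3]:
            best = (name, code_number, gain, distortion)
    return best  # (switch info, code number, gain, distortion)
```

Unlike the open-loop switch of the previous embodiment, the winning codebook here must be signalled explicitly, which is why the switch connection information forms part of the transmitted speech code.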
Although the present embodiment shows an audio encoding/decoding apparatus based on the configuration shown in fig. 2 as an existing CELP type audio encoding apparatus, the same effects can be obtained by using the present embodiment also in a CELP type audio encoding/decoding apparatus based on the configuration shown in fig. 19A, B or fig. 20.
In the present embodiment, the case where the fixed waveform storage unit a181 of the noise codebook a211 stores 3 fixed waveforms has been described, but the same operation and effect can be obtained even when the fixed waveform storage unit a181 has another number of fixed waveforms (for example, when there are 4 fixed waveforms). The same applies to noise codebook B230.
In the present embodiment, the description has been given of the case where the fixed waveform allocation unit a182 of the noise codebook a211 has the fixed waveform start point candidate position information shown in table 8, but the same operation and effect can be obtained even when other fixed waveform start point candidate position information is provided. The same applies to noise codebook B230.
Although the present embodiment has described the CELP type speech coder/decoder having 2 types of noise codebooks, the same operation and effect can be obtained even when the CELP type speech coder/decoder having 3 or more types of noise codebooks is used.
Embodiment 16
Fig. 24 is a functional block diagram of a CELP type speech encoding apparatus according to the present embodiment. In this speech encoding apparatus, LPC analyzing section 242 performs autocorrelation analysis and LPC analysis on input speech data 241 to obtain LPC coefficients, encodes the obtained LPC coefficients to obtain LPC codes, and decodes the obtained LPC codes to obtain decoded LPC coefficients.
Next, excitation generating section 245 extracts an adaptive code vector and a noise code vector from adaptive codebook 243 and excitation vector generating apparatus 244, and sends them to LPC synthesizing section 246. The acoustic vector generator 244 is any one of the acoustic vector generators of embodiments 1 to 4 and 10 described above. Then, LPC synthesizing section 246 filters the 2 sound sources obtained by sound source generating section 245 based on the decoded LPC coefficients obtained by LPC analyzing section 242, thereby obtaining two synthesized voices.
Also, the comparison unit 247 analyzes the relationship between the 2 kinds of synthesized voices obtained in the LPC synthesizing unit 246 and the input sound, finds the optimum values (optimum gains) of the two kinds of synthesized voices, adds up the synthesized voices power-adjusted according to the optimum gains to obtain a total synthesized voice, and calculates the distance between the total synthesized voice and the input sound.
Further, for all the sound source samples generated by adaptive codebook 243 and sound source vector generating apparatus 244, the distances between the input sound and the synthesized voices obtained by operating sound source generating section 245 and LPC synthesizing section 246 are calculated, the indices of the sound source samples giving the smallest distance are determined, and the two sound sources corresponding to those indices are sent to parameter encoding section 248.
Parameter encoding section 248 performs encoding of the optimal gain to obtain a gain code, and then, the LPC code and the sound source sample number are collected and transmitted to transmission channel 249. The actual sound source signal is generated from the gain code and the two sound sources corresponding to the labels, and is stored in the adaptive codebook 243 while discarding the old sound source samples.
Fig. 25 is a functional block diagram of the portion of parameter encoding section 248 related to gain vector quantization.
Parameter encoding section 248 includes: parameter conversion section 2502, which converts the input optimum gain 2501 into the sum of its components and the ratio to that sum to obtain the quantization target vector; target extraction section 2503, which calculates the target vector from the past decoded vectors stored in the decoded vector storage section and the prediction coefficients stored in the prediction coefficient storage section; decoded vector storage section 2504, which stores past decoded vectors; prediction coefficient storage section 2505, which stores the prediction coefficients; distance calculation section 2506, which uses the prediction coefficients stored in the prediction coefficient storage section to calculate the distances between the target vector obtained by the target extraction section and the plurality of code vectors stored in the vector codebook; vector codebook 2507, which stores the plurality of code vectors; and comparison section 2508, which controls the vector codebook and the distance calculation section to obtain the number of the optimum code vector from a comparison of the distances obtained by the distance calculation section, retrieves the corresponding code vector from the vector codebook, and updates the contents of the decoded vector storage section with that code vector.
The operation of the parameter encoding unit 248 having the above-described structure will be described in detail below. A vector codebook 2507 storing representative samples (code vectors) of the quantization target vectors is generated in advance; it is generally generated by the LBG algorithm (IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. COM-28, NO. 1, PP. 84-95, JANUARY 1980) from a large number of vectors obtained by analyzing a large amount of speech data.
In prediction coefficient storage section 2505, coefficients for performing predictive encoding are stored. The algorithm for obtaining these prediction coefficients will be described later. Also, a numerical value indicating an unvoiced state, such as the code vector with the smallest power, is stored in advance in decoded vector storage section 2504 as the initial value.
First, parameter conversion section 2502 converts the input optimum gain 2501 (the gain of the adaptive sound source and the gain of the noise sound source) into a vector whose elements are a sum and a ratio (the input vector). The transformation is shown in formula (40):

P = log(Ga + Gs)
R = Ga / (Ga + Gs) ……(40)

(Ga, Gs): optimum gain
Ga: adaptive sound source gain
Gs: noise sound source gain
(P, R): input vector
P: sum
R: ratio
In the above, Ga is not necessarily a positive value, so R may take a negative value. When Ga + Gs is a negative value, a fixed value prepared in advance is substituted for it.
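A sketch of the transformation of equation (40) and its inverse (the form used on the decoding side, cf. equation (46)). The fallback value used when Ga + Gs is not positive is an assumption, since the text only says that a fixed value prepared in advance is substituted:

```python
import math

FALLBACK_SUM = 0.1  # assumed stand-in for the "fixed value prepared in advance"

def gains_to_vector(ga, gs):
    """Equation (40): map (adaptive gain, noise gain) to (P, R),
    a log-power-like sum and a mixing ratio."""
    s = ga + gs
    if s <= 0.0:          # Ga may be negative, so guard the logarithm
        s = FALLBACK_SUM
    return math.log(s), ga / s

def vector_to_gains(p, r):
    """Inverse mapping: recover (Ga, Gs) from (P, R)."""
    s = math.exp(p)
    return r * s, (1.0 - r) * s
```

The round trip is exact whenever Ga + Gs is positive, which is what lets the decoder reconstruct both gains from the two quantized elements.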
Next, target extraction section 2503 obtains a target vector from the vector obtained by parameter conversion section 2502, using the past decoded vectors stored in decoded vector storage section 2504 and the prediction coefficients stored in prediction coefficient storage section 2505. The formula for calculating the target vector is shown in formula (41):

Tp = P - Σ(i=1~l) (Upi × pi + Vpi × ri)
Tr = R - Σ(i=1~l) (Uri × pi + Vri × ri) ……(41)

(Tp, Tr): target vector
(P, R): input vector
(pi, ri): the i-th past decoded vector
Upi, Vpi, Uri, Vri: prediction coefficients (fixed values)
i: index of the past decoded vector
l: prediction order
Next, distance calculating section 2506 calculates the distance between the target vector obtained at target extracting section 2503 and the code vector stored in vector codebook 2507 using the prediction coefficient stored in prediction coefficient storage section 2505.
The calculation formula of the distance is shown in formula (42):

Dn = Wp × (Tp - Up0 × Cpn - Vp0 × Crn)² + Wr × (Tr - Ur0 × Cpn - Vr0 × Crn)² ……(42)

Dn: distance between the target vector and code vector n
(Tp, Tr): target vector
Up0, Vp0, Ur0, Vr0: prediction coefficients (fixed values)
(Cpn, Crn): code vector
n: number of the code vector
Wp, Wr: weighting coefficients (fixed) for adjusting the sensitivity to distortion
Next, comparing section 2508 controls vector codebook 2507 and distance calculating section 2506, and obtains, as the gain code 2509, the number of the code vector with the smallest distance calculated by distance calculating section 2506 among the plurality of code vectors stored in vector codebook 2507. A decoded vector is then obtained from the obtained gain code 2509, and the contents of decoded vector storage section 2504 are updated using it. The decoded vector is obtained by equation (43):

p = Up0 × CpN + Vp0 × CrN + Σ(i=1~l) (Upi × pi + Vpi × ri)
r = Ur0 × CpN + Vr0 × CrN + Σ(i=1~l) (Uri × pi + Vri × ri) ……(43)

(CpN, CrN): code vector
(p, r): decoded vector
(pi, ri): the i-th past decoded vector
Upi, Vpi, Uri, Vri: prediction coefficients (fixed values)
i: index of the past decoded vector
l: prediction order
N: number of the selected code vector
The method of updating is shown in equation (44), where the processing is performed in the following order:

pi = p(i-1) (i = l ~ 1)
ri = r(i-1) (i = l ~ 1)
p0 = CpN
r0 = CrN ……(44)

N: gain code
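The whole quantization step of equations (41) through (44) can be sketched as follows. The codebook, prediction coefficients, weights, and initial state are toy values (the real ones are trained as described for equation (45)):

```python
import numpy as np

# Toy predictive gain VQ of equations (41)-(44); all constants assumed.
CODEBOOK = np.array([[0.0, 0.5], [0.5, 0.6], [1.0, 0.7], [1.5, 0.8]])  # (Cp, Cr)
UP = np.array([0.6, 0.3, 0.1])    # Up0..Up2 (prediction order l = 2)
VP = np.array([0.1, 0.05, 0.0])
UR = np.array([0.05, 0.0, 0.0])
VR = np.array([0.5, 0.3, 0.1])
WP, WR = 1.0, 1.0                 # distortion weights

state_p = [0.0, 0.0]              # past stored (p, r) code vectors
state_r = [0.5, 0.5]

def quantize(P, R):
    # (41): subtract the predicted part to form the target vector
    tp = P - sum(UP[i + 1] * state_p[i] + VP[i + 1] * state_r[i] for i in range(2))
    tr = R - sum(UR[i + 1] * state_p[i] + VR[i + 1] * state_r[i] for i in range(2))
    # (42): weighted distance to every code vector at once
    d = (WP * (tp - UP[0] * CODEBOOK[:, 0] - VP[0] * CODEBOOK[:, 1]) ** 2
         + WR * (tr - UR[0] * CODEBOOK[:, 0] - VR[0] * CODEBOOK[:, 1]) ** 2)
    n = int(np.argmin(d))
    # (44): shift the state and store the chosen code vector
    state_p.insert(0, CODEBOOK[n, 0]); state_p.pop()
    state_r.insert(0, CODEBOOK[n, 1]); state_r.pop()
    return n
```

The decoder keeps identical state and coefficients, so the transmitted index n alone reproduces the same decoded vector via equation (43).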
On the other hand, the decoding apparatus (decoder) is provided with the same vector codebook, prediction coefficient storage section, and decoded vector storage section as the encoding apparatus, and performs decoding, from the gain code transmitted by the encoder, using the decoded vector generation function of the comparison section and the update function of the decoded vector storage section.
Here, a method of setting a prediction coefficient stored in prediction coefficient storage section 2505 will be described.
First, a large number of training audio data are quantized, and the input vectors obtained from their optimum gains together with the decoded vectors at the time of quantization are collected as a set. The prediction coefficients are then obtained by minimizing the total distortion of this set shown in equation (45). Specifically, the total distortion expression is partially differentiated with respect to each Upi and Uri, and the resulting simultaneous equations are solved to obtain the values of Upi and Uri.
Total = Σ(t=0~T) {Wp × (Pt - Σ(i=0~l) (Upi × pt,i + Vpi × rt,i))² + Wr × (Rt - Σ(i=0~l) (Uri × pt,i + Vri × rt,i))²}
pt,0 = Cpn(t)
rt,0 = Crn(t) ……(45)

Total: total distortion
t: time (frame number)
T: number of data in the vector set
(Pt, Rt): optimum gain at time t
(pt,i, rt,i): the i-th decoded vector at time t
Upi, Vpi, Uri, Vri: prediction coefficients (fixed values)
i: index of the past decoded vector
l: prediction order
(Cpn(t), Crn(t)): code vector at time t
Wp, Wr: weighting coefficients (fixed) for adjusting the sensitivity to distortion
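Minimizing the total distortion of equation (45) with respect to the prediction coefficients is a linear least-squares problem, so the partial-derivative conditions reduce to solvable normal equations. A sketch, in which the flattened regressor layout is an assumption of this example:

```python
import numpy as np

def train_p_coefficients(targets, histories):
    """Least-squares fit of the P-side prediction coefficients of
    equation (45).  `targets` holds the optimum-gain sums Pt, and each
    row of `histories` is the regressor observed at that frame, e.g.
    [Cp, Cr, p1, r1, ..., pl, rl].  The R-side coefficients are obtained
    the same way from Rt."""
    A = np.asarray(histories, dtype=float)
    b = np.asarray(targets, dtype=float)
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coeffs  # [Up0, Vp0, Up1, Vp1, ...]
```

On synthetic data generated from known coefficients the fit recovers them exactly, which is the sanity check one would run before freezing the coefficients in the storage section.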
By adopting such a vector quantization method, the optimum gain can be vector-quantized directly. The parameter conversion section exploits the correlation between the power and the relative magnitude of the two gains, and the decoded vector storage section, prediction coefficient storage section, target extraction section, and distance calculation section realize predictive coding of the gains across frames, so that the correlation between the parameters can be fully utilized.
Embodiment 17
Fig. 26 is a block diagram showing the function of the parameter encoding unit of the audio encoding device according to the present embodiment. In the present embodiment, vector quantization is performed while estimating distortion due to gain quantization from two synthesized voices corresponding to the labels of the sound sources and the perceptually weighted input sound.
As shown in fig. 26, the parameter encoding unit includes: parameter calculation section 2602, which calculates the parameters necessary for the distance calculation from the input data 2601 (the perceptually weighted input speech, the perceptually weighted LPC-synthesized adaptive sound source, and the perceptually weighted LPC-synthesized noise sound source), the decoded vectors stored in the decoded vector storage section, and the prediction coefficients stored in the prediction coefficient storage section; decoded vector storage section 2603, which stores past decoded vectors; prediction coefficient storage section 2604, which stores the prediction coefficients; distance calculation section 2605, which uses the prediction coefficients stored in the prediction coefficient storage section to calculate the coding distortion that results from decoding with each of the plurality of code vectors stored in the vector codebook; vector codebook 2606, which stores the plurality of code vectors; and comparison section 2607, which controls the vector codebook and the distance calculation section to obtain the number of the optimum code vector from a comparison of the coding distortions obtained by the distance calculation section, retrieves the corresponding code vector from the vector codebook, and updates the contents of the decoded vector storage section with that code vector.
The vector quantization operation of the parameter coding unit having the above-described structure will be described below. A vector codebook 2606 storing representative samples (code vectors) of the quantization target vectors is generated in advance, usually by the LBG algorithm (IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. COM-28, NO. 1, PP. 84-95, JANUARY 1980) or the like. Prediction coefficient storage unit 2604 stores coefficients for predictive encoding in advance; these are the same prediction coefficients as those stored in prediction coefficient storage section 2505 described in embodiment 16. Decoded vector storage unit 2603 likewise stores a numerical value indicating an unvoiced state as its initial value.
First, parameter calculation section 2602 calculates the parameters necessary for the distance calculation from the perceptually weighted input speech, the perceptually weighted LPC-synthesized adaptive sound source, the perceptually weighted LPC-synthesized noise sound source 2601, the decoded vectors stored in decoded vector storage section 2603, and the prediction coefficients stored in prediction coefficient storage section 2604. The distance in the distance calculation section is calculated according to equation (46):

En = Σ(i=0~I-1) (Xi - Gan × Ai - Gsn × Si)²
Gan = Orn × exp(Opn)
Gsn = (1 - Orn) × exp(Opn)
Opn = Yp + Up0 × Cpn + Vp0 × Crn
Orn = Yr + Ur0 × Cpn + Vr0 × Crn ……(46)

En: coding distortion when gain code vector n is used
Gan, Gsn: decoded gains
(Opn, Orn): decoded vector
(Yp, Yr): prediction vector
Xi: perceptually weighted input speech
Ai: perceptually weighted LPC-synthesized adaptive sound source
Si: perceptually weighted LPC-synthesized noise sound source
n: number of the code vector
i: index of the sound source data
I: subframe length (coding unit of the input speech)
(Cpn, Crn): code vector
(pj, rj): the j-th past decoded vector
Upj, Vpj, Urj, Vrj: prediction coefficients (fixed values)
j: index of the past decoded vector
J: prediction order
In this way, the portion that does not depend on the number of the code vector is calculated in advance by parameter calculation section 2602: the prediction vector and the correlations and powers among the three weighted signals. The calculation is shown in formula (47):

Yp = Σ(j=1~J) (Upj × pj + Vpj × rj)
Yr = Σ(j=1~J) (Urj × pj + Vrj × rj)
Dxx = Σ(i=0~I-1) Xi × Xi
Dxa = Σ(i=0~I-1) Xi × Ai
Dxs = Σ(i=0~I-1) Xi × Si
Daa = Σ(i=0~I-1) Ai × Ai
Das = Σ(i=0~I-1) Ai × Si
Dss = Σ(i=0~I-1) Si × Si ……(47)

(Yp, Yr): prediction vector
Dxx, Dxa, Dxs, Daa, Das, Dss: correlations and powers among the weighted signals
Xi: perceptually weighted input speech
Ai: perceptually weighted LPC-synthesized adaptive sound source
Si: perceptually weighted LPC-synthesized noise sound source
i: index of the sound source data
I: subframe length (coding unit of the input speech)
(pj, rj): the j-th past decoded vector
Upj, Vpj, Urj, Vrj: prediction coefficients (fixed values)
j: index of the past decoded vector
J: prediction order
Next, distance calculation section 2605 calculates the coding distortion from the parameters calculated by parameter calculation section 2602, the prediction coefficients stored in prediction coefficient storage section 2604, and the code vectors stored in vector codebook 2606. The calculation is shown in formula (48):

En = Dxx + (Gan)² × Daa + (Gsn)² × Dss - 2 × Gan × Dxa - 2 × Gsn × Dxs + 2 × Gan × Gsn × Das
Gan = Orn × exp(Opn)
Gsn = (1 - Orn) × exp(Opn)
Opn = Yp + Up0 × Cpn + Vp0 × Crn
Orn = Yr + Ur0 × Cpn + Vr0 × Crn ……(48)

En: coding distortion when gain code vector n is used
Dxx, Dxa, Dxs, Daa, Das, Dss: correlations and powers among the weighted signals
Gan, Gsn: decoded gains
(Opn, Orn): decoded vector
(Yp, Yr): prediction vector
Up0, Vp0, Ur0, Vr0: prediction coefficients (fixed values)
(Cpn, Crn): code vector
n: number of the code vector
In practice, Dxx does not depend on the code vector number n, so its addition can be omitted.
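The identity behind formulas (47) and (48) can be checked numerically: expanding the squared error of equation (46) yields the correlation form, with factors of 2 on the cross terms. A sketch with random signals standing in for the weighted speech and sound sources:

```python
import numpy as np

rng = np.random.default_rng(0)
I = 40
x = rng.standard_normal(I)   # perceptually weighted input speech
a = rng.standard_normal(I)   # weighted LPC-synthesized adaptive source
s = rng.standard_normal(I)   # weighted LPC-synthesized noise source

# (47): correlations/powers that do not depend on the code vector,
# computed once per subframe.
Dxx, Dxa, Dxs = x @ x, x @ a, x @ s
Daa, Das, Dss = a @ a, a @ s, s @ s

def distortion_fast(ga, gs):
    """Correlation form of the distortion (cf. equation (48))."""
    return (Dxx + ga**2 * Daa + gs**2 * Dss
            - 2 * ga * Dxa - 2 * gs * Dxs + 2 * ga * gs * Das)

def distortion_direct(ga, gs):
    """Direct waveform-domain distortion of equation (46)."""
    return float(np.sum((x - ga * a - gs * s) ** 2))
```

The fast form turns each candidate evaluation into a handful of multiplications, which is the point of precomputing the six correlations in the parameter calculation section.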
Comparing section 2607 controls vector codebook 2606 and distance calculating section 2605 to determine, as gain code 2608, the code vector number of the code vector having the smallest distance calculated by distance calculating section 2605 among the plurality of code vectors stored in vector codebook 2606. The code vector is solved based on the obtained gain code 2608, and the content of the decoding vector storage unit 2603 is updated with the code vector. The decoded vector is obtained from equation (43).
The update is performed by the method of equation (44).
On the other hand, the audio decoding apparatus is provided in advance with the same vector codebook, prediction coefficient storage section, and decoded vector storage section as the audio encoding apparatus, and performs decoding, from the gain code sent by the encoder, using the decoded vector generation function of the comparison section and the update function of the decoded vector storage section.
With the embodiment configured in this way, vector quantization can be performed while the distortion caused by gain quantization is evaluated from the input sound and the two synthesized speeches corresponding to the sound source indices. The parameter calculation section exploits the correlation between the power and the relative magnitude of the two gains, and the decoded vector storage section, prediction coefficient storage section, and distance calculation section realize predictive coding of the gains, so that the correlation between the parameters can be fully utilized.
Embodiment 18
Fig. 27 is a main functional block diagram of the noise reduction apparatus according to the present embodiment. This noise reduction apparatus is used together with the speech coding apparatus described above; for example, it is placed in the stage preceding buffer 1301 of the speech coding apparatus shown in fig. 13.
The noise reduction device shown in fig. 27 includes: a/D converter 272, noise reduction coefficient storage section 273, noise reduction coefficient adjustment section 274, input waveform setting section 275, LPC analysis section 276, fourier transform section 277, noise reduction/spectrum compensation section 278, spectrum stabilization section 279, inverse fourier transform section 280, spectrum enhancement section 281, waveform matching section 282, noise estimation section 284, noise spectrum storage section 285, pre-spectrum storage section 286, random phase storage section 287, pre-waveform storage section 288, and maximum power storage section 289.
The initial setting will be explained first. Table 10 shows names and setting examples of the fixed parameters.
Table 10
Fixed parameter: Setting example
Frame length: 160 (20 msec at 8 kHz sampling)
First-read data length: 80 (10 msec at the same sampling rate)
FFT order: 256
LPC prediction order: 10
Noise spectrum reference continuation number: 30
Specified minimum power: 20.0
AR enhancement coefficient 0: 0.5
MA enhancement coefficient 0: 0.8
High-frequency enhancement coefficient 0: 0.4
AR enhancement coefficient 1-0: 0.66
MA enhancement coefficient 1-0: 0.64
AR enhancement coefficient 1-1: 0.7
MA enhancement coefficient 1-1: 0.6
High-frequency enhancement coefficient 1: 0.3
Power enhancement coefficient: 1.2
Noise reference power: 20000.0
Silent power reduction coefficient: 0.3
Compensation power rise coefficient: 2.0
Noise reference continuation number: 5
Noise reduction coefficient learning coefficient: 0.8
Silence detection coefficient: 0.05
Specified noise reduction coefficient: 1.5
Also, the random phase storage unit 287 stores phase data for adjusting the phase in advance. These data are used to rotate the phase at the spectral stabilization unit 279. Table 11 shows 8 examples of the phase data.
TABLE 11
Phase data
(-0.51, 0.86), (0.98, -0.17), (0.30, 0.95), (-0.53, -0.84), (-0.94, -0.34), (0.70, 0.71), (-0.22, 0.97), (0.38, -0.92)
A counter (random phase counter) for the purpose of using the phase data is also stored in the random phase storage unit 287. This value is stored in advance initialized to 0.
Then, a static RAM area is set. That is, the noise reduction coefficient storage unit 273, the noise spectrum storage unit 285, the pre-spectrum storage unit 286, the pre-waveform storage unit 288, and the maximum power storage unit 289 are cleared. Next, description and setting examples of each memory cell will be described.
Noise reduction coefficient storage section 273 is a region for storing the noise reduction coefficient, and stores 20.0 as its initial value. Noise spectrum storage section 285 is a region that stores, for each frequency, the average noise power, the average noise spectrum, the compensation noise spectra of the level 1 and level 2 candidates, and, for each candidate, the number of frames elapsed since its spectrum value last changed (the continuation number). As initial values, it stores a sufficiently large value as the average noise power, the specified minimum power as the average noise spectrum, and sufficiently large values as the compensation noise spectra and the continuation numbers.
The previous spectrum storage unit 286 is a region that stores the compensation noise power, the power of the previous frame (full band, intermediate band) (previous frame power), the smoothing power of the previous frame (full band, intermediate band) (previous frame smoothing power), and the noise continuation number, and stores a sufficiently large value as the compensation noise power, 0.0 as both the previous frame power and the full frame smoothing power, and a noise reference continuation number as the noise continuation number.
The front waveform storage unit 288 is a region that stores data of the last first-read data length share of the previous frame output signal for matching the output signal, and stores 0 as an initial value in its entirety. Spectral enhancement section 281 performs ARMA and high-frequency enhancement filtering, and clears 0 the state of each filter for this purpose. The maximum power storage unit 289 is a region that stores the maximum value of the power of the input signal, and stores 0 as the maximum power.
The noise reduction algorithm is explained below in each block diagram with fig. 27.
First, the analog input signal containing sound is A/D converted by A/D converter 272, and one frame length + first-read data length of samples (160 + 80 = 240 points in the above setting example) is input. Noise reduction coefficient adjustment section 274 calculates a noise reduction coefficient and a compensation coefficient by equation (49), based on the noise reduction coefficient stored in noise reduction coefficient storage section 273, the specified noise reduction coefficient, the noise reduction coefficient learning coefficient, and the compensation power rise coefficient. The obtained noise reduction coefficient is stored back in noise reduction coefficient storage section 273, the input signal obtained by A/D converter 272 is sent to input waveform setting section 275, and the compensation coefficient and noise reduction coefficient are sent to noise estimation section 284 and noise reduction/spectrum compensation section 278.
q=q*C+Q*(1-C)
r=Q/q*D ……(49)
q: noise reduction coefficient
Q: specified noise reduction factor
C: learning coefficient of noise reduction coefficient
r: compensation factor
D: compensating power rise factor
The noise reduction coefficient is a coefficient indicating a ratio of noise reduction, the specified noise reduction coefficient is a fixed noise reduction coefficient specified in advance, the noise reduction coefficient learning coefficient is a coefficient indicating a ratio of noise reduction coefficient close to the specified noise reduction coefficient, the compensation coefficient is a coefficient for adjusting compensation power for spectrum compensation, and the compensation power increase coefficient is a coefficient for adjusting the compensation coefficient.
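As an illustration, the update of equation (49) can be sketched in Python. The function name and the default argument values are illustrative assumptions, not part of the patent:

```python
def adjust_noise_reduction_coefficient(q, Q=1.5, C=0.8, D=2.0):
    """Equation (49): move the noise reduction coefficient q toward the
    specified coefficient Q at the rate given by the learning coefficient C,
    and derive the compensation coefficient r via the rise coefficient D."""
    q = q * C + Q * (1.0 - C)   # q = q*C + Q*(1 - C)
    r = Q / q * D               # r = Q/q * D
    return q, r
```

Starting from the initial value 20.0, q decays toward Q frame by frame, so the amount of subtraction is large at first and settles toward the specified level.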
In order to enable FFT (fast Fourier transform) processing, input waveform setting section 275 writes the input signal from A/D converter 272 into a memory array whose length is a power of two, starting from the rear; the front part is filled with zeros. In the above setting example, zeros are written to positions 0 to 15 and the input signal to positions 16 to 255 of an array of length 256. This array is used as the real part of a 256-point (2 to the 8th power) fast Fourier transform. An array of the same length is prepared for the imaginary part and filled entirely with zeros.
LPC analysis section 276 applies a Hamming window to the real-part array set by input waveform setting section 275, performs autocorrelation analysis on the windowed waveform to obtain an autocorrelation function, and performs LPC analysis by the autocorrelation method to obtain linear prediction coefficients. The resulting linear prediction coefficients are passed to spectrum enhancement section 281.
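A minimal sketch of this analysis step, assuming the standard autocorrelation method solved by the Levinson-Durbin recursion (the patent does not spell out the recursion; the function name is illustrative):

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Hamming-window the frame, form the autocorrelation function, and
    solve for linear prediction coefficients by Levinson-Durbin.
    Returns a with a[0] = 1 and a[1..order] the prediction coefficients
    (assumes the frame has nonzero energy)."""
    w = frame * np.hamming(len(frame))
    r = np.array([np.dot(w[:len(w) - k], w[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e  # reflection coeff.
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        e *= (1.0 - k * k)
    return a
```

For a predictable frame such as a sinusoid, filtering the windowed frame through the resulting coefficients leaves only a small residual, which is the property the pole enhancement filter later relies on.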
Fourier transform section 277 performs a discrete Fourier transform, using the fast Fourier transform, on the real-part and imaginary-part memory arrays obtained by input waveform setting section 275. The sum of the absolute values of the real part and imaginary part of the resulting complex spectrum is calculated to obtain a pseudo amplitude spectrum of the input signal (hereinafter referred to as the input spectrum). The sum of the input spectrum values over the frequencies (hereinafter referred to as the input power) is obtained and sent to noise estimation section 284. The complex spectrum itself is passed to spectrum stabilization section 279.
The processing by the noise estimation unit 284 will be described below.
Noise estimation section 284 compares the input power obtained by Fourier transform section 277 with the maximum power value stored in maximum power storage section 289; when the stored maximum power is smaller, the input power value is stored in maximum power storage section 289 as the new maximum power. Noise estimation is then performed when at least one of the following three conditions is satisfied, and is not performed otherwise.
(1) The input power is less than the product of the maximum power times the silence detection factor.
(2) The noise reduction coefficient is larger than the sum of the specified noise reduction coefficient plus 0.2.
(3) The input power is smaller than the product of the average noise power obtained from noise spectrum storage section 285 multiplied by 1.6.
Here, the noise estimation algorithm of the noise estimation unit 284 will be described.
First, the continuation numbers of all frequencies of the level 1 and level 2 candidates stored in noise spectrum storage section 285 are updated (incremented by 1). Then the continuation number of each frequency of the level 1 candidate is examined; when it exceeds the preset noise spectrum reference continuation number, the compensation spectrum and continuation number of the level 2 candidate are promoted to the level 1 candidate, and the continuation number of the level 2 candidate is set to 0. Here, rather than keeping a level 3 candidate to refill the vacated level 2 slot, memory is saved by substituting a slightly amplified copy of the promoted spectrum; in the present embodiment, the compensation spectrum of the level 2 candidate is replaced by 1.4 times its value.
After updating the continuation numbers, the compensation noise spectrum is compared with the input spectrum for each frequency. First, the input spectrum of each frequency is compared with the compensation noise spectrum of the level 1 candidate; if the input spectrum is smaller, the compensation noise spectrum and continuation number of the level 1 candidate are demoted to the level 2 candidate, the input spectrum becomes the compensation spectrum of the level 1 candidate, and the continuation number of the level 1 candidate is set to 0. Otherwise, the input spectrum is compared with the level 2 candidate; if the input spectrum is smaller, it becomes the compensation spectrum of the level 2 candidate and the continuation number of the level 2 candidate is set to 0. The resulting level 1 and level 2 compensation spectra and continuation numbers are then stored in noise spectrum storage section 285. At the same time, the average noise spectrum is updated according to equation (50) below.
si=si*g+Si*(1-g) ……(50)
s: average noise spectrum S: input frequency spectrum
g: 0.9 (in the case where the input power is larger than half the average noise power)
0.5 (in the case where the input power is smaller than half of the average noise power)
i: frequency numbering
The average noise spectrum is thus estimated in an approximate manner, and the coefficient g in equation (50) adjusts its learning speed. That is, when the input power is smaller than half the average noise power, the frame is judged likely to be a noise-only section, and the learning speed is increased.
Then, the sum of the frequency values of the average noise spectrum is obtained as the average noise power. The compensation noise spectrum, average noise spectrum, and average noise power are stored in the noise spectrum storage unit 285.
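The update of equation (50) amounts to first-order smoothing whose coefficient depends on how noise-like the frame looks. A sketch, with illustrative names and spectra assumed to be NumPy arrays:

```python
import numpy as np

def update_average_noise(input_spec, avg_spec, input_power, avg_noise_power):
    """Equation (50): smooth the average noise spectrum toward the input
    spectrum, learning faster (g = 0.5) when the frame looks noise-only."""
    g = 0.9 if input_power > 0.5 * avg_noise_power else 0.5
    avg_spec = avg_spec * g + input_spec * (1.0 - g)
    # the average noise power is the sum of the average spectrum values
    return avg_spec, float(np.sum(avg_spec))
```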
In the above noise estimation processing, if the noise spectrum of one frequency is associated with the input spectra of a plurality of frequencies, the RAM capacity of noise spectrum storage section 285 can be saved. As an example, consider the RAM capacity of noise spectrum storage section 285 when one noise spectrum frequency is estimated from four input spectrum frequencies, using the 256-point FFT of the present embodiment. Since the (pseudo) amplitude spectrum is symmetric about the frequency axis, estimating all frequencies requires storing the spectrum and continuation number for 128 frequencies: 128 (frequencies) × 2 (spectrum and continuation number) × 3 (level 1 and level 2 compensation candidates and average), i.e., a RAM capacity of 768 W in total.
On the contrary, when the noise spectrum of 1 frequency is associated with the input spectrum of 4 frequencies, 32 (frequency) × 2 (spectrum and continuation number) × 3 (1, 2-level candidates for compensation, average) is required, that is, the RAM capacity of 192W in total is sufficient. It was experimentally confirmed that although the resolution of the noise spectrum frequency is lowered in this case, the performance hardly deteriorates in the above-described 1 to 4 cases. Further, since this method does not estimate the noise spectrum with the spectrum of 1 frequency, even when a stationary sound (sine wave, vowel, etc.) continues for a long time, there is an effect of preventing such a spectrum from being erroneously estimated as the noise spectrum.
The processing performed by the noise reduction/spectral compensation unit 278 is explained below.
Noise reduction/spectrum compensation section 278 subtracts from the input spectrum the product of the average noise spectrum stored in noise spectrum storage section 285 and the noise reduction coefficient obtained by noise reduction coefficient adjustment section 274 (the result is hereinafter referred to as the difference spectrum). When the RAM capacity of noise spectrum storage section 285 is saved as described for noise estimation section 284, the product of the noise reduction coefficient and the average noise spectrum of the frequency corresponding to each input spectrum frequency is subtracted. When the difference spectrum is negative at a frequency, it is compensated by substituting the product of the level 1 candidate of the compensation noise spectrum stored in noise spectrum storage section 285 and the compensation coefficient obtained by noise reduction coefficient adjustment section 274. This is done for all frequencies. Flag data is also generated for each frequency to identify the frequencies at which the difference spectrum was compensated; for example, a region is provided per frequency, holding 0 when uncompensated and 1 when compensated. This flag data is sent to spectrum stabilization section 279 together with the difference spectrum. The flag data values are also examined to obtain the total number of compensated frequencies (the compensation count), which is likewise sent to spectrum stabilization section 279.
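The subtraction and compensation can be sketched as one vectorized step (illustrative names; q and r are the noise reduction and compensation coefficients of equation (49)):

```python
import numpy as np

def subtract_and_compensate(input_spec, avg_noise_spec, comp_noise_spec, q, r):
    """Subtract q times the average noise spectrum from the input spectrum;
    where the difference goes negative, substitute r times the level 1
    compensation noise spectrum and flag that frequency as compensated."""
    diff = input_spec - q * avg_noise_spec
    flags = diff < 0.0                      # flag data: True where compensated
    diff[flags] = r * comp_noise_spec[flags]
    return diff, flags, int(np.sum(flags))  # difference spectrum, flags, count
```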
Next, the processing of spectrum stabilization section 279 will be described. This processing mainly serves to reduce the unnatural feeling in sections that contain no sound.
First, the sum of the difference spectra of the frequencies obtained by the noise reduction/spectrum compensation unit 278 is calculated to obtain the power of the current frame. The power of the current frame is calculated for the full frequency band and the middle frequency band. The full band is obtained for all frequencies (so-called full band, which is 0 to 128 in the present embodiment), and the mid band is obtained for a band near the middle where auditory sensation is important (so-called mid band, which is 16 to 79 in the present embodiment).
Similarly, the sum of the level 1 candidates of the compensation noise spectrum stored in noise spectrum storage section 285 is obtained as the current frame noise power (full band, mid band). Then the compensation count obtained by noise reduction/spectrum compensation section 278 is examined; when it is sufficiently large and at least one of the following three conditions is satisfied, the current frame is judged to be a noise-only section, and the spectrum stabilization processing is performed.
(1) The input power is less than the product of the maximum power times the silence detection factor.
(2) The current frame power (mid-band) is smaller than the current frame noise power (mid-band) multiplied by 5.0.
(3) The input power is less than the noise reference power.
When the stabilization processing is not performed, the noise continuation number stored in previous spectrum storage section 286 is decremented (within the range of non-negative values), the current frame noise powers (full band, mid band) are stored in previous spectrum storage section 286 as the previous frame powers (full band, mid band), and the phase adjustment processing is performed.
The spectral stabilization process is explained here. The purpose of this processing is to achieve spectral stabilization and power reduction for silent intervals (intervals where no sound is present but only noise). There are two types of processing, and processing 1 is performed when the noise duration is smaller than the noise reference duration, and processing 2 is performed when the former exceeds the latter. Two processes are explained below.
Process 1
The noise persistence number stored in the previous spectrum storage unit 286 is added with 1, and the current frame noise power (full band, middle band) is stored in the previous spectrum storage unit 286 as the previous frame power (full band, middle band), respectively, and the phase adjustment processing is performed.
Treatment 2
The previous frame power, the previous frame smoothing power, and the silence power reduction coefficient, which is a fixed coefficient, stored in the previous spectrum storage unit 286 are referred to and changed according to equation (51).
Dd80=Dd80*0.8+A80*0.2*P
D80=D80*0.5+Dd80*0.5
Dd129=Dd129*0.8+A129*0.2*P (51)
D129=D129*0.5+Dd129*0.5
Dd80: previous frame smoothing power (mid band)
D80: previous frame power (mid band)
Dd129: previous frame smoothing power (full band)
D129: previous frame power (full band)
A80: current frame noise power (mid band)
A129: current frame noise power (full band)
P: silent power reduction coefficient
These powers are then reflected in the difference spectrum. For this purpose, two coefficients are calculated, such as a coefficient by which the intermediate band is multiplied (hereinafter referred to as coefficient 1) and a coefficient by which the full band is multiplied (hereinafter referred to as coefficient 2). First, a coefficient 1 is calculated by the following equation (52)).
r1 = D80/A80 (when A80 > 0)
r1 = 1.0 (when A80 = 0) ……(52)
r 1: coefficient of 1
D80: front frame power (middle frequency band)
A80: current frame noise power (middle frequency band)
Since the coefficient 2 is affected by the coefficient 1, the calculation means is somewhat complicated. The steps are as follows.
(1) If the previous frame smoothing power (full band) is smaller than the previous frame power (middle band), or if the current frame noise power (full band) is smaller than the current frame noise power (middle band), the procedure proceeds to step (2), otherwise the procedure proceeds to step (3).
(2) Coefficient 2 is set to 0.0, the previous frame power (full band) is set equal to the previous frame power (mid band), and the process proceeds to step (6).
(3) If the current frame noise power (full band) is equal to the current frame noise power (mid band), the process proceeds to step (4); otherwise it proceeds to step (5).
(4) Coefficient 2 is set to 1.0, and the process proceeds to step (6).
(5) The coefficient 2 is calculated by the following formula (53), and the process proceeds to (6).
r2=(D129-D80)/(A129-A80) (53)
r 2: coefficient 2
D129: front frame power (full frequency band)
D80: front frame power (middle frequency band)
A129: current frame noise power (full frequency band)
A80: current frame noise power (middle frequency band)
(6) The coefficient 2 calculation processing ends.
Coefficients 1 and 2 obtained by the above algorithm are both clamped to an upper limit of 1.0 and a lower limit of the silent power reduction coefficient. The difference spectrum of each mid-band frequency (16 to 79 in this example) is then multiplied by coefficient 1, and the difference spectrum of the remaining frequencies of the full band (0 to 15 and 80 to 128 in this example) is multiplied by coefficient 2; the products become the new difference spectrum. At the same time, the previous frame powers (full band, mid band) are converted by equation (54) below.
D80=A80*r1
D129=D80+(A129-A80)*r2 (54)
r 1: coefficient of 1
r 2: coefficient 2
D80: front frame power (middle frequency band)
A80: current frame noise power (middle frequency band)
D129: front frame power (full frequency band)
A129: current frame noise power (full frequency band)
All of the power data thus obtained are stored in previous spectrum storage section 286, and process 2 ends.
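The computation of coefficients 1 and 2 in equations (52) and (53), including the clamping described above, can be sketched as follows. Symbol names follow the patent; the default floor is the silent power reduction coefficient of the setting example, and the rewriting of the full-band previous frame power in step (2) is omitted from this sketch:

```python
def stabilization_coefficients(D80, D129, A80, A129, Dd129, floor=0.3):
    """Coefficients 1 and 2 of equations (52)-(53), clamped to [floor, 1.0]."""
    r1 = D80 / A80 if A80 > 0 else 1.0            # equation (52)
    if Dd129 < D80 or A129 < A80:                 # step (1) -> step (2)
        r2 = 0.0
    elif A129 == A80:                             # step (3) -> step (4)
        r2 = 1.0
    else:                                         # step (5), equation (53)
        r2 = (D129 - D80) / (A129 - A80)
    clamp = lambda x: min(1.0, max(floor, x))
    return clamp(r1), clamp(r2)
```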
Spectral stabilization is achieved in spectral stabilization unit 279 according to the above-described approach.
The phase adjustment processing will be described below. In conventional spectral subtraction the phase is basically left unchanged, but in the present embodiment the phase is randomly altered at each frequency whose spectrum was compensated during the subtraction. This processing increases the randomness of the residual noise and so has the effect of making it less obtrusive to the ear.
First, a random phase counter stored by the random phase storage unit 287 is obtained. Then, referring to the flag data (data indicating whether or not compensation is performed) of all frequencies, when compensation is performed, the phase of the complex spectrum obtained by the fourier transform unit 277 is rotated by the following equation (55).
Bs = Si*R[c] - Ti*R[c+1]
Bt = Si*R[c+1] + Ti*R[c]
Si = Bs (55)
Ti = Bt
Si, Ti: real and imaginary parts of the complex spectrum, i: frequency index
R: random phase data, c: random phase counter
Bs, Bt: calculation registers
In equation (55), two random phase data are used as a pair. Each time the above processing is performed, the random phase counter is therefore incremented by 2, and when it reaches its upper limit (16 in the present embodiment) it is reset to 0. The random phase counter is stored in random phase storage section 287, and the resulting complex spectrum is sent to inverse Fourier transform section 280. The sum of the difference spectrum values (hereinafter referred to as the difference spectrum power) is also obtained and sent to spectrum enhancement section 281.
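A sketch of the phase rotation of equation (55), using the eight phase pairs of Table 11 flattened into one array as in the random phase storage unit (the function name is illustrative):

```python
import numpy as np

# Eight (cos, sin)-like pairs from Table 11, flattened.
PHASE = [-0.51, 0.86, 0.98, -0.17, 0.30, 0.95, -0.53, -0.84,
         -0.94, -0.34, 0.70, 0.71, -0.22, 0.97, 0.38, -0.92]

def rotate_phase(spec, flags, counter):
    """Rotate the complex spectrum of every compensated frequency by the
    pair (R[c], R[c+1]), advancing the counter by 2 modulo 16."""
    spec = spec.copy()
    for i in np.nonzero(flags)[0]:
        rc, rs = PHASE[counter], PHASE[counter + 1]
        s, t = spec[i].real, spec[i].imag
        spec[i] = complex(s * rc - t * rs, s * rs + t * rc)  # equation (55)
        counter = (counter + 2) % len(PHASE)
    return spec, counter
```

Each pair is approximately a unit vector, so the rotation changes the phase while leaving the amplitude essentially unchanged.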
Inverse Fourier transform section 280 forms a new complex spectrum from the amplitude of the difference spectrum obtained by spectrum stabilization section 279 and the phase of the complex spectrum, and performs an inverse Fourier transform by FFT (the resulting signal is referred to as the 1st output signal). The 1st output signal is then passed to spectrum enhancement section 281.
The processing by the spectral enhancement unit 281 is explained below.
First, the MA enhancement coefficient and AR enhancement coefficient are selected with reference to the average noise power stored in noise spectrum storage section 285, the difference spectrum power obtained by spectrum stabilization section 279, and the noise reference power, which is a constant. The selection is made by evaluating the following two conditions.
Condition 1
The difference spectrum power is larger than the product of the average noise power stored in the noise spectrum storage unit 285 multiplied by 0.6, and the average noise power is larger than the noise reference power.
Condition 2
The difference spectral power is greater than the average noise power.
When condition 1 is satisfied, the frame is regarded as a voiced section: the MA enhancement coefficient is set to MA enhancement coefficient 1-1, the AR enhancement coefficient to AR enhancement coefficient 1-1, and the high-frequency enhancement coefficient to high-frequency enhancement coefficient 1. When condition 1 is not satisfied but condition 2 is, the frame is regarded as an unvoiced consonant section: the MA enhancement coefficient is set to MA enhancement coefficient 1-0, the AR enhancement coefficient to AR enhancement coefficient 1-0, and the high-frequency enhancement coefficient to 0. When neither condition is satisfied, the frame is regarded as a silent section (a section containing only noise): the MA enhancement coefficient is set to MA enhancement coefficient 0, the AR enhancement coefficient to AR enhancement coefficient 0, and the high-frequency enhancement coefficient to high-frequency enhancement coefficient 0.
Then, using the linear prediction coefficients obtained by LPC analysis section 276 together with the MA enhancement coefficient and the AR enhancement coefficient, the MA coefficients and AR coefficients of the pole enhancement filter are calculated by equation (56) below.
α(ma)i = αi*β^i
α(ar)i = αi*γ^i ……(56)
α (ma) i: coefficient of MA
α (ar) i: AR coefficient
α i: linear prediction coefficient
Beta: MA enhancement factor
γ: AR enhancement factor
i: numbering
Then, the 1st output signal obtained by inverse Fourier transform section 280 is passed through a pole enhancement filter using the MA coefficients and AR coefficients. The transfer function of this filter is shown in equation (57) below.
Q(z) = (1 + Σj α(ma)j*z^-j) / (1 + Σj α(ar)j*z^-j) ……(57)
α (ma) i: coefficient of MA
α (ar) i: AR coefficient
j: number of times
Further, a high-frequency enhancement filter using the high-frequency enhancement coefficient is applied in order to emphasize the high-frequency components. The transfer function of this filter is shown in equation (58) below.
1 - δz^-1 ……(58)
δ: high-frequency enhancement coefficient
The signal resulting from the above process is referred to as the 2 nd output signal. Also, the state of the filter is kept inside the spectral enhancement unit 281.
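The pole enhancement of equations (56)-(57) followed by the high-frequency emphasis of equation (58) can be sketched as a direct-form filter. The persistent filter state that spectrum enhancement section 281 keeps between frames is omitted here, and the function name is illustrative:

```python
import numpy as np

def enhance(signal, lpc, beta, gamma, delta):
    """Apply H(z) = (1 + sum a_i*beta^i z^-i) / (1 + sum a_i*gamma^i z^-i),
    then the high-frequency emphasis 1 - delta*z^-1 of equation (58).
    lpc = [a1 .. aJ] are the linear prediction coefficients."""
    ma = [a * beta ** (i + 1) for i, a in enumerate(lpc)]    # equation (56)
    ar = [a * gamma ** (i + 1) for i, a in enumerate(lpc)]
    out = np.zeros(len(signal))
    xs = np.zeros(len(ma))   # past inputs x[n-1], x[n-2], ...
    ys = np.zeros(len(ar))   # past outputs y[n-1], y[n-2], ...
    for n, x in enumerate(signal):
        y = x + np.dot(ma, xs) - np.dot(ar, ys)              # equation (57)
        xs = np.roll(xs, 1); xs[0] = x
        ys = np.roll(ys, 1); ys[0] = y
        out[n] = y
    # high-frequency emphasis: y[n] = out[n] - delta * out[n-1]
    return out - delta * np.concatenate(([0.0], out[:-1]))
```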
Finally, waveform matching section 282 uses a triangular window to overlap the signal stored in previous waveform storage section 288 onto the 2nd output signal obtained by spectrum enhancement section 281, thereby obtaining the output signal. The data of the last first-read data length portion of this output signal is also stored in previous waveform storage section 288. The matching method is given by equation (59) below.
Oj = (j*Dj + (L-j)*Zj)/L (j = 0 ~ L-1)
Oj = Dj (j = L ~ L+M-1)
Zj = O(M+j) (j = 0 ~ L-1)
(59)
Oj: output signal
Dj: 2nd output signal
Zj: stored waveform of the previous frame
L: first-read data length
M: frame length
It should be noted here that although data of first-read data length + frame length is output as the output signal, only the section of frame length from the start of the data should be treated as the definitive signal, because the trailing first-read data length portion is rewritten when the next output signal is produced. However, since continuity is preserved over the entire interval of the output signal, it can still be used for frequency analyses such as LPC analysis and filter analysis.
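The triangular-window matching of equation (59) can be sketched as follows, with L and M taken from the setting example (names illustrative):

```python
import numpy as np

L, M = 80, 160   # first-read data length and frame length (setting example)

def waveform_match(d, z):
    """Equation (59): cross-fade the first L samples of the 2nd output
    signal d (length L + M) with the stored previous-frame tail z, and
    return the output signal plus the new tail to store for the next frame."""
    d = np.asarray(d, dtype=float)
    out = d.copy()
    j = np.arange(L)
    out[:L] = (j * d[:L] + (L - j) * z) / L   # O_j for j = 0 .. L-1
    new_tail = out[M:M + L].copy()            # Z_j = O_(M+j), stored next
    return out, new_tail
```

The linear ramp weights j/L and (L-j)/L sum to one at every sample, which is what guarantees the continuity noted above.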
With this embodiment, noise spectrum estimation can be performed both inside and outside speech sections, so the noise spectrum can be estimated even when it is unclear where in the data speech is present.
Further, the feature of the input spectral envelope can be enhanced by the linear prediction coefficient, and deterioration of sound quality can be prevented even in the case where the noise level is high.
The spectrum of the noise can also be estimated from the average and the lowest two directions, and thus more appropriate noise reduction processing can be performed.
Further, the average spectrum of the noise is used for the noise reduction processing, so that the noise spectrum can be cut down to a greater extent, and the compensation spectrum can be estimated separately to perform more appropriate compensation.
Further, the spectrum of sections containing no sound but only noise can be smoothed, preventing the unnatural feeling that extreme spectral variation after noise reduction would otherwise cause in such sections.
It is also possible to randomize the phase of the compensated frequency components, converting the residual noise that escapes subtraction into noise that is less disturbing to the ear.
Further, more appropriate perceptual weighting can be applied to the speech sections, while suppressing the unnatural feeling that such weighting would cause in unvoiced consonant sections and silent sections.
Industrial applicability
As described above, the acoustic vector generator, the audio encoder, and the audio decoder according to the present invention are useful for acoustic vector search, and are suitable for improving sound quality.

Claims (1)

1. A noise canceller that removes a noise component from an input sound signal, the noise canceller comprising:
a/D conversion means for converting the input sound signal into a digital signal;
a noise reduction coefficient adjusting device for adjusting a noise reduction coefficient for determining a noise reduction amount;
a noise reduction coefficient storage device that stores the noise reduction coefficient;
an input waveform setting device for setting the digital signal obtained by the A/D converter as a digital signal of a predetermined time length suitable for Fourier transform;
an LPC analyzing device for performing linear predictive analysis on the digital signal of a predetermined time length obtained by the input waveform setting device;
a fourier transform device for performing a discrete fourier transform on the digital signal of a predetermined time length obtained by the input waveform setting device to obtain an input spectrum and a complex spectrum;
a noise estimating device for, when the maximum power stored in a maximum power storing device is smaller than the power of the input sound signal, updating the maximum power stored in the maximum power storing device with the power of the input sound signal; checking whether the power of the input sound signal or the noise reduction coefficient satisfies a predetermined condition to determine whether the current frame is a noise section; comparing, in a noise section, the input spectrum obtained by the fourier transform device with a plurality of compensation noise spectrum candidates of different magnitudes in ascending order of magnitude, and, when the input spectrum is smaller than any one of the compensation noise spectrum candidates, setting the input spectrum as the compensation noise spectrum; calculating an average noise spectrum obtained by adding the input spectrum at a constant ratio; and storing the compensation noise spectrum and the average noise spectrum in a noise spectrum storage device;
a noise spectrum storage device that stores the compensation noise spectrum and the average noise spectrum;
a noise reduction/spectrum compensation means for generating a difference spectrum by multiplying the average noise spectrum stored in the noise spectrum storage means by the noise reduction coefficient obtained by the noise reduction coefficient adjustment means and subtracting the resultant from the input spectrum obtained by the fourier transform means, and for compensating a negative difference spectrum by substituting the product of the level 1 candidate of the noise spectrum for compensation stored in the noise spectrum storage means and the compensation coefficient obtained by the noise reduction coefficient adjustment means;
a spectrum stabilizing device for checking the full-band power of all the frequencies of the spectrum subjected to noise reduction and spectrum compensation by the noise reduction/spectrum compensation device and the mid-band power of a part of frequency bands important for auditory sense, judging whether an input signal is a silence interval according to whether the input power or the mid-band power of a current frame meets a preset condition, performing stabilization processing of adjusting the power of a previous frame by the power of the current frame on the full-band power and the mid-band power in the silence interval, and performing phase rotation of a complex spectrum by random phase data on the complex spectrum obtained by the Fourier transform device outside the silence interval according to information whether the spectrum is subjected to spectrum compensation by the noise reduction/spectrum compensation device;
a previous spectrum storage device for storing information of previous frame power;
a random phase storage device that stores the random phase data;
an inverse Fourier transform means for performing an inverse Fourier transform using the spectrum subjected to stabilization processing by the spectrum stabilizing means and the phase-adjusted spectrum;
a spectrum enhancement device for preparing a plurality of sets of weighting coefficients for enhancing a spectrum, determining the state of the input audio signal based on the power level of the difference spectrum at the time of noise reduction, selecting the set of weighting coefficients corresponding to the determined state, and performing spectrum enhancement on the signal obtained by the inverse Fourier transform device using the selected weighting coefficients;
a waveform matching means for matching the signal obtained by the spectrum enhancing means with a previous frame signal;
front waveform storage means for storing a waveform of a previous frame used by the waveform matching means; and
maximum power storage means for storing the maximum power updated by the noise estimation means.
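The means enumerated above describe a spectral-subtraction noise canceller: an estimated average noise spectrum, scaled by the noise reduction coefficient, is subtracted from each input amplitude spectrum, and any bin driven negative is replaced by a scaled compensation noise spectrum (a spectral floor). A minimal sketch of that noise reduction / spectrum compensation step follows; the function and parameter names (`spectral_subtraction`, `avg_noise_spec`, `comp_noise_spec`, the default coefficient values) are illustrative assumptions, not the patent's implementation, and the surrounding noise estimation, stabilization, and enhancement stages are omitted:

```python
import numpy as np

def spectral_subtraction(frame, avg_noise_spec, comp_noise_spec,
                         reduction_coef=1.0, comp_coef=0.1):
    """Sketch of one noise reduction / spectrum compensation step.

    frame           : time-domain samples of one analysis frame
    avg_noise_spec  : estimated average noise amplitude spectrum (rfft bins)
    comp_noise_spec : compensation noise spectrum used as a spectral floor
    """
    spec = np.fft.rfft(frame)
    amp = np.abs(spec)
    phase = np.angle(spec)

    # Difference spectrum: input amplitude minus scaled average noise.
    diff = amp - reduction_coef * avg_noise_spec

    # Compensate negative bins with the scaled compensation noise spectrum.
    neg = diff < 0.0
    diff[neg] = comp_coef * comp_noise_spec[neg]

    # Recombine with the original phase and return to the time domain.
    out_spec = diff * np.exp(1j * phase)
    return np.fft.irfft(out_spec, n=len(frame)), neg
```

The `neg` mask corresponds to the "whether the spectrum was compensated" information that the claimed spectrum stabilizing device consults before applying random phase rotation.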
HK04107704.8A 1996-11-07 1999-05-27 Noise cancellator HK1064788B (en)

Applications Claiming Priority (9)

Application Number Priority Date Filing Date Title
JP29473896A JP4003240B2 (en) 1996-11-07 1996-11-07 Speech coding apparatus and speech decoding apparatus
JP294738/1996 1996-11-07
JP310324/1996 1996-11-21
JP31032496A JP4006770B2 (en) 1996-11-21 1996-11-21 Noise estimation device, noise reduction device, noise estimation method, and noise reduction method
JP03458397A JP3700310B2 (en) 1997-02-19 1997-02-19 Vector quantization apparatus and vector quantization method
JP03458297A JP3174742B2 (en) 1997-02-19 1997-02-19 CELP-type speech decoding apparatus and CELP-type speech decoding method
JP34582/1997 1997-02-19
JP34583/1997 1997-02-19
HK99102382.4A HK1017472B (en) 1996-11-07 1997-11-06 Sound source vector generator and method for generating a sound source vector

Related Parent Applications (2)

Application Number Title Priority Date Filing Date
HK99102382.4A Addition HK1017472B (en) 1996-11-07 1997-11-06 Sound source vector generator and method for generating a sound source vector
HK06100488.3A Division HK1080597A (en) 1996-11-07 1999-05-27 Sound source vector generator, voice encoder, and voice decoder

Related Child Applications (2)

Application Number Title Priority Date Filing Date
HK99102382.4A Division HK1017472B (en) 1996-11-07 1997-11-06 Sound source vector generator and method for generating a sound source vector
HK06100488.3A Addition HK1080597A (en) 1996-11-07 1999-05-27 Sound source vector generator, voice encoder, and voice decoder

Publications (2)

Publication Number Publication Date
HK1064788A1 HK1064788A1 (en) 2005-02-04
HK1064788B true HK1064788B (en) 2006-11-03

Family


Similar Documents

Publication Publication Date Title
CN1167047C (en) Sound source vector generating device and method
CN1205603C (en) Indexing pulse positions and signs in algebraic codebooks for coding of wideband signals
CN100346392C (en) Encoding device, decoding device, encoding method and decoding method
CN1160703C (en) Speech coding method and device, and sound signal coding method and device
CN1632864A (en) Diffusion vector generation method and diffusion vector generation device
CN1877698A (en) Excitation vector generator, speech coder and speech decoder
CN1808569A (en) Voice encoding device,orthogonalization search, and celp based speech coding
HK1064788A1 (en) Noise cancellator
HK1064788B (en) Noise cancellator
HK1041969B (en) Sound source vector generator, voice encoder and voice decoder
HK1041966B (en) Sound source vector generator, voice encoder and voice decoder
HK1017472B (en) Sound source vector generator and method for generating a sound source vector
HK1041968B (en) Sound source vector generator, voice encoder and voice decoder
HK1064787B (en) Sound source vector generator, voice encoder, and voice decoder
HK1041967B (en) Speech coder or decoder and speech coding or decoding method
HK1080597A (en) Sound source vector generator, voice encoder, and voice decoder
HK1050262B (en) Method and device for indexing pulse positions and signs in algebraicodebooks of efficient coding of wideband signals