EP1112625B1

EP1112625B1 - Method for coding an information signal

Info

Publication number: EP1112625B1
Application number: EP99943854A
Authority: EP
Inventors: James P. Ashley; Weimin Peng
Original assignee: Motorola Inc
Current assignee: Motorola Solutions Inc
Priority date: 1998-09-11
Filing date: 1999-08-24
Publication date: 2006-05-31
Anticipated expiration: 2019-08-24
Also published as: JP4460165B2; KR100409167B1; EP1112625A1; EP1112625A4; DE69931641D1; JP2002525667A; DE69931641T2; ATE328407T1; KR20010073146A; WO2000016501A1

Abstract

To achieve high quality speech reconstruction at low bit rates, constraints on position combinations among two or more pulses (403) are implemented. By placing constraints on position combinations, certain combinations of pulses are prohibited which allows the most significant pulses to always be coded, thereby improving speech quality. After all valid combinations are considered, a list of pulse pairs (codebook) which can be indexed using a single, predetermined bit length codeword is produced. The codeword is transmitted to a destination where it is used by a decoder to reconstruct the original information signal.

Description

FIELD OF THE INVENTION

The present invention relates, in general, to communication systems and, more particularly, to coding information signals in such communication systems.

BACKGROUND OF THE INVENTION

Code-division multiple access (CDMA) communication systems are well known. One exemplary CDMA communication system is the so-called IS-95 which is defined for use in North America by the Telecommunications Industry Association (TIA). For more information on IS-95, see TIA/EIA/IS-95, Mobile Station-Base-station Compatibility Standard for Dual Mode Wideband Spread Spectrum Cellular System, January 1997, published by the Electronic Industries Association (EIA), 2001 Eye Street, N.W., Washington, D.C. 20006. A variable rate speech codec, and specifically Code Excited Linear Prediction (CELP) codec, for use in communication systems compatible with IS-95 is defined in the document known as IS-127 and titled Enhanced Variable Rate Codec, Speech Service Option 3 for Wideband Spread Spectrum Digital Systems, September 1996. IS-127 is also published by the Electronic Industries Association (EIA), 2001 Eye Street, N.W., Washington, D.C. 20006.
Another example of a speech codec is disclosed in the document by Cheng Deguan, "An 8 kb/s Low Complexity ACELP Speech Codec", in Proceedings of ICSP'96, October 1996, XP 10209596. In this document the linear prediction error signal is encoded using pulses whose positions are preset in so-called pulse tracks.
In modern CELP codecs, there is a problem with maintaining high quality speech reproduction at low bit rates. The problem arises because there are too few bits available to appropriately model the "excitation" sequence or "codevector" which is used as the stimulus to the CELP synthesizer. Thus, a need exists for an improved method and apparatus which overcomes the deficiencies of the prior art.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 generally depicts a CELP decoder as is known in the prior art.
FIG. 2 generally depicts a Code Excited Linear Prediction (CELP) encoder as is known in the prior art.
FIG. 3 generally depicts a joint interleaved pulse permutation matrix in accordance with the invention.
FIG. 4 generally depicts a flow chart describing how the codebook is generated in accordance with the invention.
FIG. 5 generally depicts a joint interleaved pulse permutation matrix for pulses 3 and 4 in accordance with the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The invention is defined by method claim 1.
Stated generally, to achieve high quality speech reconstruction at low bit rates, constraints on position combinations among two or more pulses are implemented. By placing constraints on position combinations, certain combinations of pulses are prohibited which allows the most significant pulses to always be coded, thereby improving speech quality. After all valid combinations are considered, a list of pulse pairs (codebook) which can be indexed using a single, predetermined bit length codeword is produced. The codeword is transmitted to a destination where it is used by a decoder to reconstruct the original information signal.
Stated specifically, a method for coding an information signal comprises the steps of dividing the information signal into blocks and deriving a target signal based on a block of the information signal. The method further includes the steps of coding the target signal using pulse positioning techniques based on an error criteria, wherein the allowable positions of a given pulse are dependent on the positions of one or more other pulses, to produce coded pulse positions and transmitting the coded pulse positions to a destination.
In the preferred embodiment, the information signal further comprises a speech signal or an audio signal and a block of the information signals further comprise a frame or a subframe of the information signals. The error criteria further comprises a perceptually weighted squared error criteria and the allowable pulse positions are determined using an arbitrary closed-form expression F(λ), in which at least one of the conditions within the expression pertain to at least two of the elements within λ.
FIG. 1 generally depicts a Code Excited Linear Prediction (CELP) decoder 100 as is known in the art. In modern CELP decoders, there is a problem with maintaining high quality speech reproduction at low bit rates. The problem arises because there are too few bits available to appropriately model the "excitation" sequence or "codevector" c_k which is used as the stimulus to the CELP decoder 100.
As shown in FIG. 1, the excitation sequence or "codevector" c_k is generated from a fixed codebook 102 (FCB) using the appropriate codebook index k. This signal is scaled using the FCB gain factor γ and combined with a signal E(n) output from an adaptive codebook 104 (ACB) and scaled by a factor β, which is used to model the long term (or periodic) component of a speech signal (with period τ). The signal E_t(n), which represents the total excitation, is used as the input to the LPC synthesis filter 106, which models the coarse short term spectral shape, commonly referred to as "formants". The output of the synthesis filter 106 is then perceptually postfiltered by perceptual postfilter 108, in which the coding distortions are effectively "masked" by amplifying the signal spectra at frequencies that contain high speech energy and attenuating those frequencies that contain less speech energy. Additionally, the total excitation signal E_t(n) is used as the adaptive codebook for the next block of synthesized speech.
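By way of illustration only, the excitation path of FIG. 1 can be sketched in Python as follows. The sketch assumes that "combined" means a sample-by-sample sum, so that E_t(n) = γ c_k(n) + β E(n); the function name `total_excitation` and its argument names are hypothetical and are not taken from IS-127.

```python
def total_excitation(c_k, acb_out, gamma, beta):
    """Return E_t(n) = gamma * c_k(n) + beta * E(n) for one block.

    c_k     -- fixed codebook (FCB) codevector for index k
    acb_out -- adaptive codebook (ACB) output E(n)
    gamma   -- FCB gain factor
    beta    -- ACB gain factor
    Assumption: the two scaled signals are simply added.
    """
    if len(c_k) != len(acb_out):
        raise ValueError("codevector and ACB output must have equal length")
    return [gamma * c + beta * e for c, e in zip(c_k, acb_out)]
```

In a complete decoder the returned block would feed the LPC synthesis filter and would also be stored as adaptive codebook memory for the next block.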
FIG. 2 generally depicts a CELP encoder 200. Within CELP encoder 200, the goal is to code the perceptually weighted target signal x_w(n), which can be represented in general terms by the z-transform: $X_{w}(z) = S(z) W(z) - β E(z) H_{ZS}(z) - H_{ZIR}(z), \qquad (1)$

where W(z) is the transfer function of the perceptual weighting filter 208, and is of the form: $W(z) = \frac{A(z/λ_{1})}{A(z/λ_{2})}, \qquad (2)$
and H(z) is the transfer function of the perceptually weighted synthesis filters 206 and 210, and is of the form: $H(z) = \frac{1}{A_{q}(z)} W(z), \qquad (3)$
and where A(z) are the unquantized direct form LPC coefficients, A_q(z) are the quantized direct form LPC coefficients, and λ₁ and λ₂ are perceptual weighting coefficients. Additionally, H_ZS(z) is the "zero state" response of H(z) from filter 206, in which the initial state of H(z) is all zeroes, and H_ZIR(z) is the "zero input response" of H(z) from filter 210, in which the previous state of H(z) is allowed to evolve with no input excitation. The initial state used for generation of H_ZIR(z) is derived from the total excitation E_t(n) from the previous subframe.
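As an illustrative sketch, the numerator and denominator of W(z) can be formed by scaling the i-th LPC coefficient by λ raised to the power i, which follows from substituting z/λ for z in a polynomial in z⁻¹. The sketch assumes A(z) is supplied as a coefficient list [a_0, a_1, ..., a_p] in ascending powers of z⁻¹; the helper names are hypothetical.

```python
def bandwidth_expand(lpc, lam):
    """Coefficients of A(z/lam), given the coefficients of A(z).

    A(z) = sum_i a_i z^-i  implies  A(z/lam) = sum_i (a_i * lam**i) z^-i.
    """
    return [a * lam ** i for i, a in enumerate(lpc)]


def weighting_filter_coeffs(lpc, lam1, lam2):
    """Return (numerator, denominator) coefficient lists of W(z)."""
    return bandwidth_expand(lpc, lam1), bandwidth_expand(lpc, lam2)
```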
To solve for the parameters necessary to generate x_w(n), a fixed codebook (FCB) closed loop analysis in accordance with the invention is described. Here, the codebook index k is chosen to minimize the mean square error between the perceptually weighted target signal x_w(n) and the perceptually weighted excitation signal x̂_w(n). This can be expressed in time domain form as: $\min_{k} \left\{ \sum_{n=0}^{L-1} \left( x_{w}(n) - γ_{k} c_{k}(n) * h(n) \right)^{2} \right\}, \quad 0 \leq k < M, \qquad (4)$

where c_k(n) is the codevector corresponding to FCB codebook index k, γ_k is the optimal FCB gain associated with codevector c_k(n), h(n) is the impulse response of the perceptually weighted synthesis filter H(z), M is the codebook size, L is the subframe length, * denotes the convolution process and x̂_w(n) = γ_k c_k(n) * h(n). In the preferred embodiment, speech is coded every 20 milliseconds (ms) and each frame includes three subframes of length L.
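Eq. 4 can also be expressed in vector-matrix form as: $\min_{k} \left\{ (x_{w} - γ_{k} H c_{k})^{T} (x_{w} - γ_{k} H c_{k}) \right\}, \quad 0 \leq k < M, \qquad (5)$

where c_k and x_w are length L column vectors, H is the L x L zero-state convolution matrix: $H = \begin{bmatrix} h(0) & 0 & \cdots & 0 \\ h(1) & h(0) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ h(L-1) & h(L-2) & \cdots & h(0) \end{bmatrix}, \qquad (6)$
and T denotes the appropriate vector or matrix transpose. Eq. 5 can be expanded to: $\min_{k} \left\{ x_{w}^{T} x_{w} - 2 γ_{k} x_{w}^{T} H c_{k} + γ_{k}^{2} c_{k}^{T} H^{T} H c_{k} \right\}, \quad 0 \leq k < M, \qquad (7)$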
and the optimal codebook gain γ_k for codevector c_k can be derived by setting the derivative (with respect to γ_k) of the above expression to zero: $\frac{\partial}{\partial γ_{k}} \left( x_{w}^{T} x_{w} - 2 γ_{k} x_{w}^{T} H c_{k} + γ_{k}^{2} c_{k}^{T} H^{T} H c_{k} \right) = 0, \qquad (8)$
and then solving for γ_k to yield: $γ_{k} = \frac{x_{w}^{T} H c_{k}}{c_{k}^{T} H^{T} H c_{k}}. \qquad (9)$
Substituting this quantity into Eq. 7 produces: $\min_{k} \left\{ x_{w}^{T} x_{w} - \frac{(x_{w}^{T} H c_{k})^{2}}{c_{k}^{T} H^{T} H c_{k}} \right\}, \quad 0 \leq k < M. \qquad (10)$
Since the first term in Eq. 10 is constant with respect to k, it can be written as: $\max_{k} \left\{ \frac{(x_{w}^{T} H c_{k})^{2}}{c_{k}^{T} H^{T} H c_{k}} \right\}, \quad 0 \leq k < M. \qquad (11)$
From Eq. 11, it is important to note that much of the computational burden associated with the search can be avoided by precomputing the terms in Eq. 11 which do not depend on k; namely, by letting $d^{T} = x_{w}^{T} H$
and $Θ = H^{T} H$. When this is done, Eq. 11 reduces to: $\max_{k} \left\{ \frac{(d^{T} c_{k})^{2}}{c_{k}^{T} Θ c_{k}} \right\}, \quad 0 \leq k < M, \qquad (12)$
which is equivalent to equation 4.5.7.2-1 of IS-127. The process of precomputing these terms is known as "backward filtering". The result is that the index k corresponding to the codevector c_k that results in the minimum squared error between the perceptually weighted target signal x_w(n) and the perceptually weighted excitation signal x̂_w(n) can be found by maximizing the term in Eq. 12.
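The equivalence between the direct minimization and the backward-filtered maximization can be checked numerically. The following is a minimal pure-Python sketch, not the IS-127 search routine: the toy impulse response, the single-pulse codebook, and all function names are hypothetical choices made only so that the two searches can be compared on a small example.

```python
def convolve_zero_state(c, h):
    """Compute H c: y(n) = sum_{m<=n} h(n-m) c(m), truncated to len(c)."""
    return [sum(h[n - m] * c[m] for m in range(n + 1)) for n in range(len(c))]


def search_direct(x_w, h, codebook):
    """Minimise the squared error, using the optimal gain for each k."""
    best_k, best_err = None, float("inf")
    for k, c in enumerate(codebook):
        y = convolve_zero_state(c, h)
        energy = sum(v * v for v in y)              # c^T H^T H c
        if energy == 0.0:
            continue
        gain = sum(x * v for x, v in zip(x_w, y)) / energy
        err = sum((x - gain * v) ** 2 for x, v in zip(x_w, y))
        if err < best_err:
            best_k, best_err = k, err
    return best_k


def search_backward_filtered(x_w, h, codebook):
    """Maximise (d^T c)^2 / (c^T Theta c) with d and Theta precomputed."""
    L = len(x_w)
    # d(m) = sum_{n>=m} x_w(n) h(n-m), i.e. d^T = x_w^T H
    d = [sum(x_w[n] * h[n - m] for n in range(m, L)) for m in range(L)]
    # Theta(i, j) = sum_{n>=max(i,j)} h(n-i) h(n-j), i.e. Theta = H^T H
    theta = [[sum(h[n - i] * h[n - j] for n in range(max(i, j), L))
              for j in range(L)] for i in range(L)]
    best_k, best_metric = None, -1.0
    for k, c in enumerate(codebook):
        num = sum(di * ci for di, ci in zip(d, c)) ** 2
        den = sum(c[i] * theta[i][j] * c[j] for i in range(L) for j in range(L))
        if den > 0.0 and num / den > best_metric:
            best_k, best_metric = k, num / den
    return best_k


# Toy example (all values hypothetical): 8-sample subframe, decaying impulse
# response, one unit pulse per codevector, target built from codevector 3.
L = 8
h = [0.5 ** n for n in range(L)]
codebook = [[1.0 if n == k else 0.0 for n in range(L)] for k in range(L)]
x_w = [1.7 * v for v in convolve_zero_state(codebook[3], h)]
```

The second routine computes d and Θ once per subframe, so the per-codevector work reduces to a few multiply-adds when c_k is sparse.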
In the IS-127 half rate case (4.0 kbps), the FCB utilizes a multipulse configuration in which the excitation vector c_k contains very few non-zero, unit magnitude values. This configuration is known in the art as Algebraic CELP, or ACELP. Since there are very few non-zero elements within c_k, the computational complexity involved with Eq. 12 is relatively low. For the IS-127 three "pulse" case, there are only 10 bits allocated for the pulse positions and associated signs for each of the three subframes (of lengths L = 53, 53, 54). In this configuration, an associated "track" defines the allowable positions for each of the three pulses within c_k (3 bits per pulse plus 1 bit for composite sign of +, -, + or -, +, -). As shown in Table 4.5.7.4-1 of IS-127, pulse 1 can occupy positions 0, 7, 14, ..., 49, pulse 2 can occupy positions 2, 9, 16, ..., 51, and pulse 3 can occupy positions 4, 11, 18, ..., 53. This is known as "interleaved pulse permutation", which is well known in the art. The positions of the three pulses are optimized jointly so Eq. 12 is executed 8³ = 512 times. The sign bit is then set according to the sign of the gain term γ_k.

Table 1

| Pulse | Positions |
|---|---|
| p0 | 0, 7, 14, 21, 28, 35, 42, 49 |
| p1 | 2, 9, 16, 23, 30, 37, 44, 51 |
| p2 | 4, 11, 18, 25, 32, 39, 46, 53 |

Table 1 generally depicts pulse positions defined for IS-127 Rate 1/2. One problem in the above scenario is that the excitation codevector c_k can contain "holes" in which certain positions are not represented by the vector space. That is, an optimal match to the target vector may require a pulse at position 12, but the definitions of the pulse positions in Table 1 do not allow a pulse to be located at that position. The constraints on positions may cause the pulse to be placed at a location that is merely close to the optimal position or, worse, may cause the energy of the target signal at that position to be missed completely. This can cause distortion, and possibly audible artifacts, in the synthesized speech signal.
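The "holes" can be listed explicitly. The sketch below rebuilds the three tracks of Table 1 from their first position and a spacing of 7, then reports every sample index in a 54-sample subframe that no pulse can occupy. The helper names are hypothetical.

```python
def track_positions(first, step, count):
    """Positions first, first + step, ... (count entries) for one track."""
    return [first + step * i for i in range(count)]


# Table 1: three tracks, eight positions each, spaced seven samples apart.
TABLE_1 = {f"p{n}": track_positions(first, 7, 8)
           for n, first in enumerate((0, 2, 4))}


def unreachable_positions(tracks, subframe_len):
    """Sample indices in [0, subframe_len) that appear on no track."""
    covered = {p for positions in tracks.values() for p in positions}
    return [n for n in range(subframe_len) if n not in covered]


holes = unreachable_positions(TABLE_1, 54)
```

Here `holes` contains position 12, the example given in the text, together with every other index whose remainder modulo 7 is 1, 3, 5 or 6.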
In a similar example, a design requirement may be to have four pulses with one pulse on each of four separate tracks, with subframe sizes of L = [53, 53, 54], and a bit allocation of 16 bits per subframe. In this scenario, the tracks would be configured as 4 pulses x 14 positions = 56 total positions, which could be positioned according to the prior art as in Table 2, which depicts examples of pulse positions as used in the prior art. Here, the bit allocation of 16 bits would be divided between the four tracks equally so that each track would receive four bits. The four bits per track would further be composed of three bits for position (comprising 8 different positions) and one sign bit to indicate the polarity of the pulse.

Table 2

| Pulse | Positions |
|---|---|
| p₀ | 0, 7, 14, 21, 28, 35, 42, 49 |
| p₁ | 2, 9, 16, 23, 30, 37, 44, 51 |
| p₂ | 3, 10, 17, 24, 31, 38, 45, 52 |
| p₃ | 5, 12, 19, 26, 33, 40, 47, 54 |

As can be seen from this example, there are still holes in the vector space since all of the pulse positions cannot be adequately represented. One solution would be to allow all fourteen positions to be valid, e.g., the positions of pulse p₀ would be [0, 4, 8,..., 52], p₁ would be [1, 5, 9,..., 53], etc. The problem with this method is that four bits would be required to encode the position information, thereby violating the 16 bit per subframe requirement (4 tracks x (4 position bits + 1 sign bit) = 20 bits).
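The bit budget arithmetic in this example can be reproduced with a short sketch. The function name is hypothetical, and one sign bit per track is assumed, as in the example above.

```python
def bits_per_subframe(n_tracks, positions_per_track, sign_bits=1):
    """Bits needed when every track is coded independently."""
    # Smallest b such that 2**b >= positions_per_track.
    position_bits = (positions_per_track - 1).bit_length()
    return n_tracks * (position_bits + sign_bits)
```

With 8 positions per track the result meets the 16 bit target, while 14 positions per track raises it to 20 bits.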
Another method for pulse coding that is known in the prior art deals with multiplexing the indices of two pulses into a single codeword. For example, in the IS-127 Rate 1 case (8.5 kbps), there are 11 possible pulse positions spread over five tracks. Rather than using four bits for each pulse position, the positions of two pulses can be coded jointly using only seven bits. This is accomplished by considering that the total number of positions for two pulses is 11 x 11 = 121, which is less than the total number of positions that can be coded with seven bits (2⁷ = 128). Details of the coding can then be expressed as: $\text{Codeword} = 11 \left\lfloor \frac{p_{i}}{5} \right\rfloor + \left\lfloor \frac{p_{j}}{5} \right\rfloor,$

where p_i and p_j are the positions of the i-th and j-th pulses, and ⌊x⌋ represents the largest integer ≤ x.
The pulse positions can then be extracted at the decoder by: $λ_{i} = \left\lfloor \frac{\text{Codeword}}{11} \right\rfloor, \quad λ_{j} = \text{Codeword} - 11 λ_{i},$

where λ_i and λ_j are the decimated positions within the appropriate track, which can be decoded using Table 2, where the value of λ corresponds to the column in the table. The problem with using this method for the 14 position case in Table 2 is that a 14 x 14 = 196 position multiplex would still require 8 bits (2⁸ = 256 possible positions), so there is no savings over simply using four bits per pulse. Clearly, with all of the above prior art methods, all positions are not adequately represented by the vector space which would allow efficient, low rate coding of pulse positions.
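A minimal sketch of this joint multiplexing follows. It assumes the decoder divides by the number of positions per track (11 in the IS-127 Rate 1 example) so that decoding exactly inverts encoding; the function names are hypothetical.

```python
def encode_pair(lam_i, lam_j, n_positions=11):
    """Multiplex two decimated positions into one codeword."""
    if not (0 <= lam_i < n_positions and 0 <= lam_j < n_positions):
        raise ValueError("decimated position out of range")
    return n_positions * lam_i + lam_j


def decode_pair(codeword, n_positions=11):
    """Recover (lam_i, lam_j) from a codeword produced by encode_pair."""
    lam_i = codeword // n_positions
    lam_j = codeword - n_positions * lam_i
    return lam_i, lam_j


def joint_bits(n_positions):
    """Bits needed to index every pair of positions on two tracks."""
    return (n_positions * n_positions - 1).bit_length()
```

For 11 positions per track the pair fits in seven bits, one fewer than two separate four-bit fields; for 14 positions per track the pair needs eight bits, so nothing is saved.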
As previously mentioned, design of an efficient 16 bit, 4 pulse, 56 position codebook (with all positions representable) is not readily achievable in the prior art. In accordance with the present invention, however, a method is presented which allows all pulse positions to be coded, while maintaining the design constraints as presented in the previous example. In addition, the present invention provides a general flexibility which allows efficient solutions to a wide variety of design constraints.
The present invention solves the aforementioned problems by placing constraints on position combinations among two or more pulses. For example, the allowable positions for a given pulse are jointly dependent on the associated positions of one or more other pulses. This can be seen for the 14 position track example in FIG. 3, where a joint interleaved pulse permutation matrix in accordance with the invention is shown. In this embodiment, the matrix depicted in FIG. 3 is for pulses 0 and 1, and the subframe length is L=54. In this figure, the respective positions of pulse 0 are shown along the horizontal axis, and the positions of pulse 1 are shown along the vertical axis. The "forbidden" pulse combinations are designated by the shaded regions while the allowable combinations are unshaded. As one may notice, the number of unshaded regions is exactly the number of combinations that can be represented by the given number of bits, in this case 2⁷ = 128, and the number of shaded regions is exactly the total number of decimated positions of pulse 0 times the total number of decimated positions of pulse 1 minus the number of combinations that can be represented by the given number of bits, i.e., (14 x 14) - 128 = 68.
As the various pulse position codevectors are searched (via Eq. 12), when pulse p₁ is placed at λ₁ = 0 (corresponding to position (0 x 4) + 1 = 1), then the allowable positions for pulse p₀ would be [4, 8, 16, 20, 28, 32, 40, 48, 52]. Likewise, when pulse p₁ is placed at position 5 (λ₁ = 1), the allowable positions for pulse p₀ would be [0, 8, 12, 20, 24, 32, 36, 44, 52], and so on. After considering all valid combinations, a 128 x 2 list of pulse pairs (codebook) that can be indexed using a single 7 bit codeword is produced in accordance with the invention. This codeword is suitable for transmission to a destination for decoding and reconstruction. Furthermore, this codebook can be generated algebraically at run time, stored in volatile memory (RAM), or stored in nonvolatile memory (ROM).
FIG. 4 generally depicts a flow chart describing how the codebook is generated in accordance with the invention. First, the flowchart shows a basic nested loop structure in which all permutations of 0 ≤ i < M and 0 ≤ j < N are generated. In this example, N and M are the total number of allowable positions for each pulse. The decision in the innermost loop simply checks for forbidden combinations [i,j] according to function F(i,j) at step 402, which in the example of FIG. 3 is described as: $F(i, j) = \begin{cases} 1, & |i - j| \in \{0, 3, 6, 9, 11\} \\ 0, & \text{otherwise} \end{cases}$
This function returns a value of 1 for cases when the absolute value of the difference of i and j is an element of the given set; otherwise, a zero is returned. This is shown in step 403. The elements of the given set correspond to the distances between the diagonal shaded elements of FIG. 3, and the expression is therefore sufficient in describing all necessary shaded regions. For allowed pulse combinations, the respective positions are calculated using the following expression: $G (λ, n) = λ \times N_{tracks} + n,$

where λ is the decimated track position, N_tracks is the number of tracks, and n is the track number. Once the codebook entry has been generated at step 403, the codebook index k is incremented at step 404, and the process continues until the entire codebook is filled via steps 400-401 and 405-408. A similar technique would be used for generating position information for pulses p₂ and p₃ of the given example.
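The nested loop of FIG. 4 can be sketched as follows for the FIG. 3 example. The sketch assumes that F(i, j) tests the absolute difference |i - j| against the set {0, 3, 6, 9, 11}, that index i belongs to pulse p₀ on track 0, and that index j belongs to pulse p₁ on track 1; the function and constant names are hypothetical.

```python
FORBIDDEN_DISTANCES = {0, 3, 6, 9, 11}   # distances for the FIG. 3 example
N_TRACKS = 4


def is_forbidden(i, j):
    """F(i, j): 1 for a forbidden combination, 0 otherwise."""
    return 1 if abs(i - j) in FORBIDDEN_DISTANCES else 0


def position(lam, track):
    """G(lam, n) = lam * N_tracks + n."""
    return lam * N_TRACKS + track


def build_codebook(m=14, n=14, track_i=0, track_j=1):
    """List of allowed (p0, p1) position pairs; list index = codeword k."""
    codebook = []
    for i in range(m):                    # decimated positions of pulse p0
        for j in range(n):                # decimated positions of pulse p1
            if is_forbidden(i, j):
                continue                  # skip forbidden combinations
            codebook.append((position(i, track_i), position(j, track_j)))
    return codebook


codebook = build_codebook()
allowed_p0_when_p1_at_1 = [p0 for p0, p1 in codebook if p1 == 1]
allowed_p0_when_p1_at_5 = [p0 for p0, p1 in codebook if p1 == 5]
```

The resulting list has 128 entries, so each pair can be indexed by a 7 bit codeword, and the two filtered lists reproduce the allowable p₀ positions quoted above for p₁ at positions 1 and 5.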
Although the previous example shows the forbidden regions to be strict upper left to lower right diagonal, any pattern utilizing 128 unshaded regions is feasible and assumed to be within the scope of the invention. Another aspect of the preferred embodiment is explained as follows: there are 4 x 14 = 56 total possible pulse positions. The length of a subframe, however, is not greater than 54 samples. Therefore, dedicating positions to locations greater than 53 (or 52 for subframes one and two) results in reduced coding efficiency, and thus, degraded quality. FIG. 5 generally depicts a joint interleaved pulse permutation matrix for pulses p₂ and p₃ in accordance with the present invention. As shown in FIG. 5, the positions 54 and 55 are omitted by the shaded regions, which allows more combinations to be represented in the valid vector space since the total number of unshaded regions is still 128. This can be observed by comparing the relative spacing between the diagonals in FIG. 3 and FIG. 5, where FIG. 3 has generally two spaces between forbidden diagonals while FIG. 5 has three spaces. The closed form expression for the forbidden combinations of FIG. 5 can be expressed as: $F(i, j) = \begin{cases} 1, & |i - j| \in \{0, 4, 8\} \\ 1, & i = M - 1 \text{ or } j = N - 1 \\ 0, & \text{otherwise} \end{cases}$
As one may observe, the example in FIG. 5 is inherently less restrictive and therefore results in higher coding accuracy.
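The FIG. 5 expression can be checked in the same way. The sketch assumes M = N = 14, with index i on track 2 (pulse p₂) and index j on track 3 (pulse p₃); the names are hypothetical.

```python
def is_forbidden_fig5(i, j, m=14, n=14):
    """F(i, j) for the FIG. 5 example: 1 = forbidden, 0 = allowed."""
    if abs(i - j) in {0, 4, 8}:
        return 1
    if i == m - 1 or j == n - 1:          # last column or row: positions 54, 55
        return 1
    return 0


# Allowed (p2, p3) pairs, using position = lam * 4 + track number.
pairs_fig5 = [(i * 4 + 2, j * 4 + 3)
              for i in range(14) for j in range(14)
              if not is_forbidden_fig5(i, j)]

# Number of p2 choices that remain once p3 sits at decimated position 0.
choices_for_first_row = sum(1 for i in range(14) if not is_forbidden_fig5(i, 0))
```

The list again holds 128 pairs and no position beyond 53 is ever used. With p₃ at its first position, ten positions remain for p₂, compared with nine for the corresponding row of the FIG. 3 pattern.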
As one skilled in the art will appreciate, it is possible to form upper right to lower left diagonals and a number of various other patterns that may benefit a specific application using the techniques described herein in accordance with the invention. Furthermore, it is possible to extend the dimension of the number of pulses to beyond two so that any closed-form expression F(λ) is allowed, where λ = [λ₀, λ₁, ..., λ_{n-1}] is the vector of candidate pulse positions, and n is the number of pulses.
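One possible way to generalize the enumeration to n pulses is sketched below. The pairwise-distance rule is only one hypothetical choice of F(λ), used here because it reduces to the two-pulse pattern of FIG. 3; any other predicate over the whole vector λ could be passed instead.

```python
from itertools import product


def enumerate_allowed(sizes, forbidden):
    """All vectors lam with 0 <= lam[p] < sizes[p] for which F(lam) is false."""
    return [lam for lam in product(*(range(s) for s in sizes))
            if not forbidden(lam)]


def pairwise_distance_rule(distances):
    """Build an F(lam) that forbids any two pulses at a listed distance."""
    def forbidden(lam):
        return any(abs(a - b) in distances
                   for idx, a in enumerate(lam)
                   for b in lam[idx + 1:])
    return forbidden


two_pulse = enumerate_allowed((14, 14), pairwise_distance_rule({0, 3, 6, 9, 11}))
three_pulse_toy = enumerate_allowed((3, 3, 3), pairwise_distance_rule({0}))
```

The length of the returned list determines the codeword size, so a designer would adjust the predicate until the count matches the available power of two.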
While the invention has been particularly shown and described with reference to a particular embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention as defined by the claims.

Claims

1. A method for coding a speech or audio signal based on linear prediction comprising the steps of:
a) dividing the speech or audio signal into blocks;

b) deriving a target signal based on a representation of the difference between a weighted version of said speech or audio signal and a weighted synthesized version of said signal derived by linear prediction from a block of the information signal;

c) characterized by coding the target signal using pulse positioning techniques based on an error criteria, wherein the allowable positions of a given pulse are dependent on the positions of one or more other pulses, to produce coded pulse positions; and

d) transmitting the coded pulse positions to a destination.
2. The method in claim 1, wherein a block of the information signals further comprise a frame or a subframe of the information signals.
3. The method in claim 1, wherein the error criteria further comprises a perceptually weighted squared error criteria.
4. The method in claim 1, wherein the allowable pulse positions are determined using an arbitrary closed-form expression F(λ), in which at least one of the conditions within the expression pertain to at least two of the elements within λ.