EP0950238B1

EP0950238B1 - Speech coding and decoding system

Info

Publication number: EP0950238B1
Application number: EP97930643A
Authority: EP
Inventors: Costas Xydeas
Original assignee: Victoria University of Manchester
Current assignee: University of Manchester
Priority date: 1996-07-05
Filing date: 1997-07-07
Publication date: 2003-09-10
Anticipated expiration: 2017-07-07
Also published as: AU3452397A; JP2000514207A; ATE249672T1; WO1998001848A1; EP0950238A1; DE69724819D1; CA2259374A1

Abstract

A speech synthesis system in which a speech signal is divided into a series of frames, and each frame is converted into a coded signal including a voiced/unvoiced classification and a pitch estimate, wherein a low pass filtered speech segment centred about a reference sample is defined in each frame, a correlation value is calculated for each of a series of candidate pitch estimates as the maximum of multiple crosscorrelation values obtained from variable length speech segments centred about the reference sample, the correlation values are used to form a correlation function defining peaks, and the locations of the peaks are determined and used to define a pitch estimate.

Description

The present invention relates to speech synthesis systems, and in particular to speech systems coding and synthesis systems which can be used in speech communication systems operating at low bit rates.
Speech can be represented as a waveform the detailed structure of which represents the characteristics of the vocal tract and vocal excitation of the person producing the speech. If a speech communication system is to be capable of providing an adequate perceived quality, the transmitted information must be capable of representing that detailed structure. Most of the power in voiced speech is at relatively low frequencies, for example below 2kHz. Accordingly good quality speech synthesis can be achieved on the basis of speech waveforms that have been low pass filtered to reject higher frequency components. The perceived speech quality is however adversely effected if the frequency is restricted much below 4kHz.
Many models have been suggested for defining the characteristics of speech. The known models rely upon dividing a speech signal into blocks or frames and deriving parameters to represent the characteristics of the speech within each frame. Those parameters are then quantized and transmitted to a receiver. At the receiver the quantization process is reversed to recover the parameters, and a speech signal is then synthesised on the basis of the recovered parameters.
The common objective of the designers of the known models is to minimise the volume of data which must be transmitted whilst maximising the perceived quality of the speech that can be synthesised from the transmitted data. In some of the models a distinction is made between whether or not a particular frame is "voiced" or "unvoiced". In the case of voiced speech, speech is produced by glottal excitation and as a result has a quasi-periodic structure. Unvoiced speech is produced by turbulent air flow at a constriction and does not have the "periodic" spectral structure characteristic of voiced speech. Most models seek to take advantage of the fact that voiced speech signals evolve relatively slowly in the context of frames the duration of which is typically 10 to 30msecs. Most models also rely upon quantization schemes intended to minimise the amount of information which must be transmitted without significant loss of perceived quality. As a result of the work done to date it is now possible to produce speech synthesis systems capable of operating at bit rate of only a few thousand bits per second.
One model which has been developed is known as "sinusoidal coding" (R.J. McAulay and T.F. Quatieri, "Low Rate Speech Coding Based on Sinusoidal Coding", Advances in Speech Signal Processing, Editors S. Furui and M.M. Sondhi, Chapter 6, pp. 165-208, Markel Dekker, New York, 1992). This approach relies upon an FFT analysis of each input frame to produce a magnitude spectrum, estimating the pitch period of the input frame from that spectrum, and defining the amplitudes at the pitch related harmonics, the harmonics being multiples of the fundamental frequency of the frame. An error measure is calculated in the time domain representing the difference between harmonic and aharmonic speech spectra and that error measure is used to define the degree of voicing of the input frame in terms of a frequency value. Thus the parameters used to represent a frame are the pitch period, the magnitude and phase values for each harmonic, and the frequency value. Proposals have been made to operate this system such that phase information is predicted in a coherent way across successive frames.
In another system known as "multiband excitation coding" (D.W. Griffin and J.S. Lim, "Multiband Excitation Vocoder" IEEE Transaction on Acoustics, Speech and Signal Processing, vol. 36, pp 1223-1235, 1988 and Digital Voice Systems Inc, "INMARSAT M Voice Codec, Version 3.0", Voice Coding System Description, Module 1, Appendix 1, August 1991) the amplitude and phase functions are determined in a different way from that employed in sinusoidal coding. The emphasis in this system is placed on dividing a spectrum into bands, for example up to twelve bands, and evaluating the voiced/unvoiced nature of each of these bands. Bands that are classified as unvoiced are synthesised using random signals. Where the difference between the pitch estimates of successive frames is relatively small, linear interpolation is used to define the required amplitudes. The phase function is also defined using linear frequency interpolation but in addition includes a constant displacement which is a random variable and which depends on the number of unvoiced bands present in the short term spectrum of the input signal. The system works in a way to preserve phase continuity between successive frames. When the pitch estimates of successive frames are significantly different, a weighted summation of signals produced from amplitudes and phases derived for successive frames is formed to produced the synthesised signal.
Thus the common ground between the sinusoidal and multiband systems referred to above is that both schemes directly model the input speech signal which is DFT analysed, and both systems are at least partially based on the same fundamental relationship for representing speech to be synthesised. The systems differ however in terms of the way in which amplitudes and phase are estimated and quantized, the way in which different interpolation methods are used to define the necessary phase relationships, and the way in which "randomness" is introduced in the recovered speech.
Various versions of the multiband excitation coding system have been proposed, for example an enhanced multiband excitation speech coder (A. Das and A. Gersho, Variable-Dimension Spectral Coding of Speech at 2400 bps and below with phonetic classification", IEEE Proc. ICASSP-95, pp. 492-495, May 1995) in which input frames are classified into four types, that is noise, unvoiced, fully voiced and mixed voiced, and a variable dimension vector quantization process for spectral magnitude is introduced, the bi-harmonic spectral modelling system (C. Garcia-Matteo., J. L. Alba-Castro and Eduardo R. Banga, "Speech Coding Using Bi-Harmonic Spectral Modelling", Proc. EUSIPCO-94, Edingburgh, Vol. 2, pp. 391-394, September 1994) in which the short term magnitude spectrum is divided into two bands and a separate pitch frequency is calculated for each band, the spectral excitation coding system (V. Cuperman, P. Lupini and B. Bhattacharya, "Spectral Excitation Coding of Speech at 2.4 kb/s", IEEE Proc. ICASSP-95, pp. 504-507, Detrpot, May 1995) which applies sinusoidal based coding in the linear predictive coding (LPC) residual domain where the synthesised residual signal is the summation of pitch harmonic oscillators with appropriate amplitude and phase functions and amplitudes are quantized using a non-square transformation, the band-widened harmonic vocoder (G. Yang, G Zanellato and H. Leich, "Band Widened Harmonic Vocoder at 2 to 4 kbps", IEEE Proc. ICASSP-95, pp. 504-507, Detroit, May 1995) in which randomness in the signal is introduced by adding jitter to the amplitude information on a per band basis, pitch synchronous multiband coding (H. Yang, S. N. Koh and P. Sivaprakasapilai, "Pitch Synchronous Multi-Band (PSMB) Speech Coding", IEEE Proc. ICASSP-95, pp. 516-519, Detroit, May 1995) in which a CELP (code-excited linear prediction) based coding scheme is used to encode speech period segments, multi band LPC coding (S. Yeldener, M. Kondoz and G. Evans, "High Quality Multiband LPC Coding of Speech at 2.4 kbits/s", Electronic Letters, pp. 1287-1289, Vol. 27, No 14, 4th July 1991) in which a single amplitude value is allocated to each frame to in effect specify a "flat" residual spectrum, and harmonic and noise coding (M. Nishiguchi and J. Matsumoto, "Harmonic and Noise Coding of LPC Residuals with Classified Vector Quantisation", IEEE Proc. ICASSP-95, pp. 484-487, Detroit, May 1995) with classified vector quantization which operates in the LPC residual domain, an input signal being classified as voiced or unvoiced and being full band modelled.
A further type of coding system exists, that is the prototype interpolation coding system. This relies upon the use of pitch period segments or prototypes which are spaced apart in time and reiteration/interpolation techniques to synthesise the signal between two prototypes. Such a system was described as early as 1971 (J.S. Severwight, "Interpolation Reiterations Techniques for Efficient Speech Transmission", Ph.D. Thesis, Loughborough University, Department of Electrical Engineering, 1971). More sophisticated systems of the same general class have been described more recently, for example in the paper by W.B. Kleijn, "Continuous Representations in Linear Predictive Coding, Proc. ICASSP-91, pp201-204, May 1991. The same author has published a series of related papers. The system employs 20msecs coding frames which are classified as voiced or unvoiced. Unvoiced frames are effectively CELP coded. Pitch prototype segments are defined in adjacent voiced frames, in the LPC residual signal, in a way which ensures maximum alignment (correlation) of the prototypes and defines the prototype so that the main pitch excitation pulse is not near to either of the ends of the prototype. A pitch period in a given frame is considered to be a cycle of an artificial periodic signal from which the prototype for the frame is obtained. The prototypes which have been appropriately selected from adjacent frames are Fourier transformed and the resulting coefficients are coded using a differential vector quantization scheme.
With this scheme, during synthesis of voiced frames, the decoded prototype Fourier representations for adjacent frames are used to reconstruct the missing signal waveform between the two prototype segments using linear interpolation. Thus the residual signal is obtained which is then presented to an LPC synthesis filter the output of which provides the synthesised voiced speech signal. An amount of randomness can be introduced into voiced speech by injecting noise at frequencies larger than 2khz, the amplitude of the noise increasing with frequency. In addition, the periodicity of synthesised voiced speech is controlled during the quantization of prototype parameters in accordance with a long term signal to change ratio measure that reflects the similarity which exists between the prototypes of adjacent frames in the residual excitation signal.
The known prototype interpolation coding systems rely upon a Fourier Series synthesis equation which involves a linear-with-time-interpolation process. The assumption is that the pitch estimates for successive frames are linearly interpolated to provide a pitch function and an associated instant fundamental frequency. The instant phase used in the cosine and sine terms of the Fourier series synthesis equation is the integral of the instantaneous harmonic frequencies. This synthesis arrangement allows for the linear evolution of the instantaneous pitch and the non-linear evolution of the instantaneous harmonic frequencies.
A development of this system is described by W.B. Kleijn and J. Haaden, "A Speech Coder Based on Decomposition of Characteristics Waveforms", Proc. ICASSP-95, pp508-511, Detroit, May 1995. In the described system the Fourier series coefficients are low pass filtered over time, with a cut-off frequency of 20Hz, to provide a "slowly evolving" waveform component for the LPC excitation signal. The difference between this low pass component and the original parameters provides the "rapidly evolving" components of the excitation signal. Periodic voice excitation signals are mainly represented by the "slowly evolving" component, whereas random unvoiced excitation signals are represented by the "rapidly evolving" component in this dual decomposition of the Fourier series coefficients. This removes effectively the need for treating voiced and unvoiced frames separately. Furthermore, the rate of quantization and transmission of the two components is different. The "slowly evolving" signal is sampled at relatively long intervals of 25msecs, but the parameters are quantized quite accurately on the basis of spectral magnitude information. In contrast, the spectral magnitude of the "rapidly evolving" signal is sampled frequently, every 4msecs, but is quantized less accurately. Phase information is randomised every 2msecs.
Other developments of the prototype interpolation coding system have been proposed. For example one known system operates on 5msec frames, a pitch period being selected for voiced frames and DFT transformed to yield prototype spectral magnitude values. These values are quantized and the quantized values for adjacent frames are linearly interpolated. Phase information is defined in a manner which does not satisfy any frequency restrictions at the interpolation boundaries. This causes problems of discontinuity at frame boundaries. At the receiver the excitation signal is synthesised using a decoded magnitude and estimated phase values, via an inverse DFT process. The resulting signal is filtered by a following LPC synthesis filter. This model is purely periodic during voiced speech, and this is why a very short duration frame is used. Unvoiced speech is CELP coded.
The wide range of speech synthesis models currently being proposed, only some of which are described above, and the range of alternative approaches proposed to implement those models, indicates the interest in such systems and the lack of any consensus as to which system provides the most advantageous performance.
It is an object of the present invention to provide an improved low bit rate speech synthesis system.
In known systems in which it is necessary to obtain an estimate of the pitch of a frame of a speech signal, it has been thought necessary, if high quality of synthesised speech is to be achieved, to obtain high resolution non-integer pitch period estimates. This requires complex processes, and it would be highly desirable to reduce the complexity of the pitch estimation process in a manner which did not result in degraded quality.
According to an aspect of the present invention, there is provided a speech synthesis system in which a speech signal is divided into a series of frames, and each frame is converted into a coded signal including a voiced/unvoiced classification and a pitch estimate, wherein a low pass filtered speech segment centred about a reference sample is defined in each frame, a correlation value is calculated for each of a series of candidate pitch estimates as the maximum of multiple crosscorrelation values obtained from variable length speech segments centred about the reference sample, the correlation values are used to form a correlation function defining peaks, and the locations of the peaks are determined and used to define a pitch estimate.
The result of the above system is that an integer pitch period value is obtained. The system avoids undue complexity and may be readily implemented.
Preferably the pitch estimate is defined using an iterative process. A single reference sample may be used, for example centred with respect to the respective frame, or alternatively multiple pitch estimates may be derived for each frame using different reference samples, the multiple pitch estimates being combined to define a combined pitch estimate for the frame. The pitch estimate may be modified by reference to a voiced/unvoiced status and/or pitch estimates of adjacent frames to define a final pitch estimate.
The correlation function may be clipped using a threshold value, remaining peaks being rejected if they are adjacent to larger peaks. Peaks are initially selected and can be rejected if they are smaller than a following peak by more than a predetermined factor, for example smaller than 0.9 times the following peak.
Preferably the pitch estimation procedure is based on a least squares error algorithm. Preferably the algorithm defines the pitch as a number whose multiples best fit the correlation function peak locations. Initial possible pitch values may be limited to integral numbers which are not consecutive, the increment between two successive numbers being proportional to a constant multiplied by the lower of those two numbers.
It is well known from the prior art to classify individual frames as voiced or unvoiced and to process those frames in accordance with that classification. Unfortunately such a simple classification process does not accurately reflect the true characteristics of speech. It is often the case that individual frames are made up of both periodic (voiced) and aperiodic (unvoiced) components. Prior attempts to address this problem have not proved particularly effective.
It is an object of the present invention to provide an improved voiced or unvoiced classification system.
According to a second aspect of the present invention there is provided a speech synthesis system in which a speech signal is divided into a series of frames, and each frame is converted into a coded signal including pitch segment magnitude spectral information, a voiced/unvoiced classification, and a mixed voiced classification which classifies harmonics in the magnitude spectrum of voiced frames as strongly voiced or weakly voiced, wherein a series of samples centred on the middle of the frame are windowed to form a data array which is Fourier transformed to produce a magnitude spectrum, a threshold value is calculated and used to clip the magnitude spectrum, the clipped data is searched to define peaks, the locations of peaks are determined, constraints are applied to define dominant peaks, and harmonics not associated with a dominant peak are classified as weakly voiced.
Peaks may be located using a second order polynomial. The samples may be Hamming windowed. The threshold value may be calculated by identifying the maximum and minimum magnitude spectrum values and defining the threshold as a constant multiplied by the difference between the maximum and minimum values. Peaks may be defined as those values which are greater than the two adjacent values. A peak may be rejected from consideration if neighbouring peaks are of a similar magnitude, e.g. more than 80% of the magnitude, or if there are spectral magnitudes in the same range of greater magnitudes. A harmonic may be considered as not being associated with a dominant peak if the difference between two adjacent peaks is greater than a predetermined threshold value.
The spectrum may be divided into bands of fixed width and a strongly/weakly voiced classification assigned for each band. Alternatively the frequency range may be divided into two or more bands of variable width, adjacent bands being separated at a frequency selected by reference to the strongly/weakly voiced classification of harmonics.
Thus, the spectrum may be divided into fixed bands, for example fixed bands each of 500Hz, or variable width bands selected in dependence upon the strongly/weakly voiced status of harmonic components of the excitation signal. A strongly/weakly voiced classification is then assigned to each band. The lowest frequency band, e.g. 0-500Hz, may always be regarded as strongly voiced, whereas the highest frequency band, for example 3500Hz to 4000Hz, may always be regarded as weakly voiced, In the event that a current frame is voiced, and the previous frame is unvoiced, other bands within the current frame, e.g. 3000Hz to 3500Hz may be automatically classified as weakly voiced. Generally the strongly/weakly voiced classification may be determined using a majority decision rule on the strongly/weakly voiced classification of those harmonics which fall within the band in question. If there is no majority, alternate bands may be alternately assigned strongly voiced and weakly voiced classifications.
Given the classification of a voiced frame such that harmonics are classified as either strongly or weakly voiced, it is necessary to generate an excitation signal to recover the speech signal which takes into account this classification. It is an object of the present invention to provide such a system.
According to a third aspect of the present invention, there is provided a speech synthesis system in which a speech signal is divided into a series of frames, each frame is defined as voiced or unvoiced, each frame is converted into a coded signal including a pitch period value, a frame voiced/unvoiced classification and, for each voiced frame, a mixed voiced spectral band classification which classifies harmonics within spectral bands as either strongly or weakly voiced, and the speech signal is reconstructed by generating an excitation signal in respect of each frame and applying the excitation signal to a filter, wherein for each weakly voiced spectral band, an excitation signal is generated which includes a random component in the form of a function which is dependent upon the respective pitch period value.
Thus for each frame which has a spectral band that is classified as weakly voiced, the excitation signal is represented by a function which includes a first harmonic frequency component, the frequency of which is dependant upon the pitch period value appropriate to that frame, and a second random component which is superimposed upon the first component.
The random component may be introduced by reducing the amplitude of harmonic oscillators assigned the weakly voiced classification, for example by reducing the power of the harmonics by 50%, while disturbing the oscillator frequencies, for example by shifting the oscillators randomly in frequency in the range of 0 to 30 Hz such that the frequency is no longer a multiple of the fundamental frequency, and then adding further random signals. The phase of the oscillators producing random signals may be randomised at pitch intervals.
Thus for a weakly voiced band, some periodicity remains but the power of the periodic component is reduced and then combined with a random component.
In a speech synthesis system in which a speech signal is represented in part by spectral information in the form of harmonic magnitude values, it is possible to process an input speech signal to produce a series of spectral magnitude values and then to use all of those magnitude values at harmonic locations in subsequent processing steps. In many circumstances however at least some of the magnitude values contain little information which is useful in the recovery of the input speech signal. Accordingly when magnitude values are quantized for transmission to a receiver it is sensible to discard magnitude values which contain little useful information.
In one known system an input speech signal is processed to produce an LPC residual signal which in turn is processed to provide harmonic magnitude values, but only a fixed number of those magnitude values is vector quantized for transmission to a receiver. The discarded magnitude values arc represented at the receiver as identical constant values. This known system reduces redundancy but is inflexible in that the locations of the fixed number of magnitude values to be quantized are always the same and predetermined on the basis of assumption that may be inappropriate in particular circumstances.
It is an object of the present invention to provide an improved magnitude value quantization system.
According to a fourth aspect of the present invention, there is provided a speech synthesis system in which a speech signal is divided into a series of frames, and each voiced frame is converted into a coded signal including a pitch period value, LPC coefficients, and pitch segment spectral magnitude information, wherein the spectral magnitude information is quantized by sampling the LPC short term magnitude spectrum at harmonic frequencies, the locations of the largest spectral samples are determined to identify which of the magnitudes are relatively more important for accurate quantization, and the magnitudes so identified are selected and vector quantized.
Thus rather than relying upon a simple location selection strategy of a fixed number of magnitude values for quantization and transmission, for example the "low part" of the magnitude spectrum, the invention selects only those values which make a significant contribution according to the subjectively important LPC magnitude spectrum, thereby reducing redundancy without compromising quality.
In one arrangement in accordance with the invention a pitch segment of P_n LPC residual samples is obtained, where P_n is the pitch period value of the nth frame, the pitch segment is DFT transformed, the mean value of the resultant spectral magnitudes is calculated, the mean value is quantized and used as a normalisation factor for the selected magnitudes, and the resulting normalised amplitudes are quantized.
Alternatively, the RMS value of the pitch segment is calculated, the RMS value is quantized and used as a normalisation factor for the selected magnitudes, and the resulting normalised amplitudes are quantized.
At the receiver, the selected magnitudes are recovered, and each of the other magnitude values is reproduced as a constant value.
Interpolation coding systems which employ a pitch-related synthesis formula to recover speech generally encounter the problem of coding a variable length, pitch dependant spectral amplitude vector. The quantization scheme referred to above in which only the magnitudes of relatively greater importance are quantized avoids this problem by quantizing only a fixed number of magnitude values and setting the rest of the magnitude values to a constant value. Thus at the receiver a fixed length vector can be recovered. Such a solution to the problem however may result in a relatively spectrally flat excitation model which has limitations in providing high recovcred speech quality.
In an ideal world output speech quality would be maximised by quantizing the entire shape of the magnitude spectrum, and various approaches have been proposed for coding the entire magnitude spectrum. In one approach, the spectrum is DFT transformed and coded differentially across successive spectra. This and similar coding schemes are rather inefficient however and operate with relatively high bit rates. The introduction of vector quantization allowed for the development of sinusoidal and prototype interpolation systems which operate at lower bit rates, typically around 2.4Kbits/sec.
Two vector quantization methodologies have been reported which quantize a variable size input vector with a fixed size code vector. In a first approach, the input vector is transformed to a fixed size vector which is then conventionally vector quantized. An inverse transform of the quantized fixed size vector yields the recovered quantized vector. Transformation techniques which have been used include linear interpolation, band limited interpolation, all pole modelling and non-square transformation. This approach however produces an overall distortion which is the summation of the vector quantization noise and a component which is introduced by the transformation process. In a second known approach, a variable input vector is directly quantized with a fixed size code vector. This approach is based on selecting only a limited number of elements from each codebook vector to form a distortion measure between a codebook vector and an input vector. Such a quantization approach avoids the transformation distortion of the alternative technique mentioned above and results in an overall distortion that is equal to the vector quantization noise, but this is significant.
It is an object of the present invention to provide an improved variable sized spectral vector quantization scheme.
According to a fifth aspect of the present invention, there is provided a speech synthesis system in which a variable size input vector of coefficients to be transmitted to a receiver for the reconstruction of a speech signal is vector quantized using a codebook defined by vectors of fixed size, the codebook vectors of fixed size are obtained from variable size training vectors and an interpolation technique which is an integral part of the codebook generation process, codebook vectors are compared to the variable sized input vector using the interpolation process, and an index associated with the codebook entry with the smallest difference from the comparison is transmitted, the index being used to address a further codebook at the receiver and thereby derive an associated fixed size codebook vector, and the interpolation process being used to recover from the derived fixed sized codebook vector an approximation of the variable sized input vector.
The invention is applicable in particular to pitch synchronous low bit rate coders of the type described in this document and takes advantage of the underlying principle of such coders which means that the shape of the magnitude spectrum is represented by a relatively small number of equally spaced samples.
Preferably the interpolation process is linear. For an input vector of given dimension, the interpolation process is applied to produce from the codebook vectors a set of vectors of that given dimension. A distortion measure is then derived to compare the interpolated set of vectors and the input vector and the codebook vector which yields the minimum distortion is selected.
Preferably the dimension of the input vectors is reduced by taking into account only the harmonic amplitudes with the input brandwidth range, for example 0 to 3.4kRz. Preferably the remaining amplitudes i.e. in the region of 3.4kHz to 4 kHz are set to a constant value. Preferably, the constant value is equal to the mean value of the quantized amplitudes.
Amplitude vectors obtained from adjacent residual frames exhibit significant amounts of redundancy which can be removed by means of backward prediction. The backward prediction may be performed on a harmonic basis such that the amplitude value of each harmonic of one frame is predicted from the amplitude value of the same harmonic in the previous frame or frames. A fixed linear predictor may be incorporated in the system, together with mean removal and gain shape quantization processes which operate on a resulting error magnitude vector.
Although the above described variable sized vector quantization scheme provides advantageous characteristics, and in particular provides for good perceived signal quality at a bit rate of for example 2.4Kbits/sec, in some environments a lower bit rate would be highly desirable even at the loss of some quality. It would be possible for example to rely upon a single value representation and quantization strategy on the assumption that the magnitude spectrum of the pitch segment in the residual domain has an approximately flat shape. Unfortunately systems based on this assumption have a rather poor decoded speech quality.
It is an object of the present invention to overcome the above limitation in lower bit rate systems.
According to a sixth aspect of the present invention, there is provided a speech synthesis system in which a speech signal is divided into a series of frames, each frame is converted into a coded signal including an estimated pitch period, an estimate of the energy of a speech segment the duration of which is a function of the estimated pitch period, and LPC filter coefficients defining an LPC spectral envelope, and a speech signal of related power to the power of the input speech signal is reconstructed by generating an excitation signal using spectral amplitudes which are defined from a modified LPC spectral envelope sampled at the harmonic frequencies defined by the pitch period.
Thus, although a single value is used to represent the spectral envelope of the excitation signal, the excitation spectral envelope is shaped according to the LPC spectral envelope. The result is a system which is capable of delivering high quality speech at 1.5Kbits/sec. The invention is based on the observation that some of the speech spectrum resonance and anti-resonance information is also present in the residual magnitude spectrum, since LPC inverse filtering cannot produce a residual signal of absolutely flat magnitude spectrum. As a consequence, the LPC residual signal is itself highly intelligible.
The magnitude values may be obtained by spectrally sampling a modified LPC synthesis filter characteristic at the harmonic locations related to the pitch period. The modified LPC synthesis filter may have reduced feed back gain and a frequency response which consists of equalised resonant peaks, the locations of which are close to the LPC synthesis resonant locations. The value of the feed back gain may be controlled by the performance of the LPC model such that it is for example proportional to the normalised LPC prediction error. The energy of the reproduced speech signal may be equal to the energy of the original speech waveform.
It is well known that in prototype interpolation coding speech synthesis systems there are often substantial similarities between the prototypes of adjacent frames in the residual excitation signals. This has been used in various systems to improve perceived speech quality by ensuring that there is a smooth evolution of the speech signal over time.
It is an object of the present invention to provide an improved speech synthesis system in which the excitation and vocal tract dynamics are substantially preserved in the recovered speech signal.
According to a seventh aspect of the present invention, there is provided a speech synthesis system in which a speech signal is divided into a series of frames, each frame is converted into a coded signal including LPC filter coefficients and at least one parameter associated with a pitch segment magnitude, and the speech signal is reconstructed by generating two excitation signals in respect of each frame, each pair of excitation signals comprising a first excitation signal generated on the basis of the pitch segment magnitude parameter or parameters of one frame and a second excitation signal generated on the basis of the pitch segment magnitude parameter or parameters of a second frame which follows and is adjacent to the said one frame, applying the first excitation signal to a first LPC filter the characteristics of which are determined by the LPC filter coefficients of the said one frame and applying the second excitation signal to a second LPC filter the characteristics of which are determined by the LPC filter coefficients of the said second frame, and weighting and combining the outputs of the first and second LPC filters to produce one frame of a synthesised speech signal.
Preferably the first and second excitation signals include the same phase function and different phase contributions from the two LPC filters involved in the above double synthesis process. This reduces the degree of pitch periodicity in the recovered signals. This and the combination of the first and second LPC filter outputs ensures an effective smooth evolution of the speech spectral envelope on a sample by sample basis.
Preferably the outputs of the first and second LPC filters are weighted by half a window function such as a Hamming window such that the magnitude of the output of the first filter is decreasing with time and the magnitude of the output of the second filter is increasing with time.
According to the present invention, there is provided a speech coding system which operates on a frame by frame basis, and in which information is transmitted which represents each frame as either voiced or unvoiced and, for each voiced frame, represents that frame by a pitch period value, quantized magnitude spectral information with associated strong/weak voiced harmonic classification, and LPC filter coefficients, the received pitch period value and magnitude spectral information being used to generate residual signals at the receiver which are applied to LPC speech synthesis filters the characteristics of which are determined by the transmitted filter coefficients, wherein each residual signal is synthesised according to a sinusoidal mixed excitation synthesis process, and a recovered speech signal is derived from the combination of the outputs of the LPC Synthesis filters.
Embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:

Figure 1 is a general block diagram of the encoding process in accordance with the present invention;
Figure 2 illustrates the relationship between coding and matrix quantisation frames;
Figure 3 is a general block diagram of the decoding process;
Figure 4 is a block diagram of the excitation synthesis process;
Figure 5 is a schematic diagram of the overlap and add process;
Figure 6 is a schematic diagram of the calculation of an instantaneous scaling factor;
Figure 7 is a block diagram of the overall voiced/unvoiced classification and pitch estimation process;
Figure 8 is a block diagram of the pitch estimation process;
Figure 9 is a schematic diagram of two speech segments which participate in the calculation of a crosscorrelation function value;
Figure 10 is a schematic diagram of speech segments used in the calculation of the crosscorrelation function value;
Figure 11 represents the value allocated to a parameter used in the calculation of the crosscorrelation function value for different delays;
Figure 12 is a block diagram of the process used for calculated the crosscorrelating function and the selection of its peaks;
Figure 13 is a flow chart of a pitch estimation algorithm;
Figure 14 is a flow chart of a procedure used in the pitch estimation process;
Figure 15 is a flow chart of a further procedure used in the pitch estimation process;
Figure 16 is a flow chart of a further procedure used in the pitch estimation process.
Figure 17 is a flow chart of a threshold value selection procedure;
Figure 18 is a flow chart of the voiced/unvoiced classification process;
Figure 19 is a schematic diagram of the voiced/unvoiced classification process with respect to parameters generated during the pitch estimation process;
Figure 20 is a flow chart of the procedure used to determine offset values;
Figure 21 is a flow chart of the pitch estimation algorithm;
Figure 22 is a flow chart of a procedure used to impose constraints on output pitch estimates to ensure smooth evolution of pitch values with time;
Figures 23, 24 and 25 represent different portions of a flow chart of a pitch post processing procedure;
Figure 26 is a general block diagram of the LPC analysis and LPC quantisation process;
Figure 27 is a general flow chart of a strongly or weakly voiced classification process;
Figure 28 is a flow chart of the procedure responsible for the strongly/weakly voiced classification.
Figure 29 represents a speech waveform obtained from a particular speech utterance;
Figure 30 shows frequency tracks obtained for the speech utterance of Figure 29;
Figure 31 shows to a larger scale a portion of Figure 30 and represents the difference between strongly and weakly voiced classifications;
Figure 32 shows a magnitude spectrum of a particular speech segment and the corresponding LPC spectral envelope and the normalised short term magnitude spectra of the corresponding residual segment, excitation segment obtained using a binary excitation model and an excitation segment obtained using the strongly/weakly voiced model;
Figure 33 is a general block diagram of a system for representing and quantising magnitude information;
Figure 34 is a block diagram of an adaptive quantiser shown in Figure 33;
Figure 35 is a general block diagram of a quantisation process;
Figure 36 is a general block diagram of a differential variable size spectral vector quantiser; and
Figure 37 represents the hierarchical structure of a mean gain shape quantiser.

A system in accordance with the present invention is described below, firstly in general terms and then in greater detail. The system operates on an LPC residual signal on a frame by frame basis.
Speech is synthesised using the following general expression: where i is the sampling instant and A_k(i) represents the amplitude value of the kth cosine term cos(Θ_k(i)) (with Θ_k(i) = ϑ_k(i) + φ_k) as a function of i. In voiced speech K depends on the pitch frequency of the signal.
A voiced/unvoiced classification process allows the coding of voiced and unvoiced frames to be handled in different ways. Unvoiced frames are modelled in terms of an RMS value and a random time series. In voiced frames a pitch period estimate is obtained and used to define a pitch segment which is centred at the middle of the frame. Pitch segments from adjacent frames are DFT transformed and only the resulting pitch segment magnitude information is coded and transmitted. Furthermore, pitch segment magnitude samples are classified as strongly or weakly voiced. Thus in addition to voiced/unvoiced information, the system transmits for every voiced frame the pitch period value, the magnitude spectral information of the pitch segment, the strong/weak voiced classification of the pitch magnitude spectral values, and the LPC coefficient. Thus, the information which is transmitted for every voiced frame is, in addition to voiced/unvoiced information, the pitch period value, the magnitude spectral information of the pitch segment, and the LPC filter coefficients.
At the receiver a synthesis process, that includes interpolation, is used to reconstruct the waveform between the middle points of the current (n+1)th and previous nth frames. The basic synthesis equation for the residual signal is: where MĜ _j are decoded pitch segment magnitude values and phase_j(i) is calculated from the integral of the linearly interpolated instantaneous harmonic frequencies ω_j(i). K is the largest value of j for which ω_j ⁿ(i)≤π.
In the transitions from unvoiced to voiced, the initial phase for each harmonic is set to zero. Phase continuity is preserved across the boundaries of successive interpolation intervals.
The synthesis process is performed twice however, once using the magnitude spectral values MG_j ⁿ⁺¹ of the pitch segment derived from the current (n+1)th frame and again using the magnitude values MG_j ⁿ of the pitch segment derived in the previous nth frame. The phase function phase_j(i) in each case remains the same. The resulting residual signals Res_n(i) and Res_n+1(i) are used as inputs to corresponding LPC synthesis filters calculated for the nth and (n+1)th speech frames. The two LPC synthesised speech waveforms are then weighted by W_n+1(i) and W_n(i) to yield the recovered speech signal.
Thus the overall synthesis process, for successive voiced frames, can be described by: where H ⁿ(ω $\binom{n}{j}$ (i)) is the frequency response of the nth frame LPC synthesis filter calculated, at the ω_j ⁿ(i) harmonic frequency function at the ith instant. ϕ"(ω"_l(i)) is the associated phase response of this filter. ω_j ⁿ(i) and phase_j ⁿ(i) are the frequency and phase functions defined for the sampling instants i, with i covering the middle of the nth frame to the middle of the (n+1)th frame segments. K is the largest value of j for which ω_j ⁿ(i)≤π.
The above speech synthesis process introduces two "phase dispersion" terms i.e. ϕ"(ω $\binom{n}{j}$ (i)) and ϕⁿ ⁺¹(ω $\binom{n}{j}$ (i)) which effectively reduce the degree of pitch periodicity in the recovered signal. In addition, this "double synthesis" arrangement followed by an overlap-add process ensures an effective smooth evolution of the speech spectral envelope (LPC) on a sample by sample basis.
The LPC excitation signal is based on a "mixed" excitation model which allows for the appropriate mixing of periodic and random excitation components in voiced frames on a frequency-band basis. This is achieved by operating the system such that the magnitude spectrum of the residual signal is examined, and applying a peak-picking process, near the ω_j resonant frequencies, to detect possible dominant spectral peaks. A peak associated with a frequency ω_j indicates a high degree of voicing (represented by hv_j=1) for that harmonic. The absence of an adjacent spectral peak, on the other hand, indicates a certain degree of randomness (represented by hv_j=0). When hv_j=1 (to indicate "strong" voicing) the contribution of the jth harmonic to the synthesis process is MĜ _i cos(phase _j(i)) However, when hv_j=0 (to indicate "weak" voicing) the frequency of the jth harmonic is slightly dithered, its magnitude MĜ _i is reduced to (MĜ _i / $\sqrt{2}$ ) and random cosine terms are added symmetrically alongside the jth harmonic ω_j. The terms "strong" and "weak" are used in this sense below. The number NRS of these random terms is where ┌┐ indicates rounding off to the next larger integer value. Furthermore, the NRS random components are spaced at 50 Hz intervals symmetrically about ω_j. ω_j being located in the middle of such a 50 Hz interval. The amplitudes of the NRS random components are set to (MĜ _j / $\sqrt{2 NRS}$ ×) Their initial phases are selected randomly from the [-π, +π] region at pitch period intervals.
The hv_j information must be transmitted to be available at the receiver and, in order to reduce the bit rate allocated to hv_j, the bandwidth of the input signal is divided into a number of fixed size bands BD_k and a "strongly" or "weakly" voiced flag Bhv_k is assigned for each band. In a "strongly" voiced band, a highly periodic signal is reproduced. In a "weakly" voiced band, a signal which combines both periodic and aperiodic components is required. These bands are classified as strongly voiced (Bhv_k=1) or weakly voiced (Bhv_k=0) using a majority decision rule approach on the hv_j classification values of the harmonics ω_j contained within each frequency band.
Further restrictions can be imposed on the strongly/weakly voiced profiles resulting from the classification of bands. For example, the first λ bands may always be strongly voiced i.e. hv_j=1 for BD_k with k=1,2,...,λ, and λ being a variable. The remaining spectral bands can be strongly or weakly voiced.
Figure 1 schematically illustrates processes operated by the system encoder. These processes are referred to in Figure 1 as Processes I to VII and these terms are used throughout this document. Figure 2 represents the relationship between analysis/coding frame sizes employed. These are M samples per coding frame, e.g. 160 samples per frame, and k frames are analysed in a block, for example k=4. This block size is used for matrix quantization. A speech signal is input and processes I, III, IV, VI AND VII produce outputs for transmission.
Assuming that the first Matrix Quantization analysis frame (MQA) of k×M samples is available, each of the k coding frames within the MQA is classified as voiced or unvoiced (V_n) using, Process I. A pitch estimation part of Process I provides a pitch period value P_n only when a coding frame is voiced.
Process II operates in parallel on the input speech samples and estimates p LPC filter coefficients a (for example p=10) every L samples (L is a multiple of M i.e. L=m×M, and m may be equal to for example 2). In addition, k/m is an integer and represents the frame dimension of the matrix quantizer employed in Process III. Thus the LPC filter coefficients are quantized, using Process III and transmitted. The quantized coefficients â are used to derive a residual signal Rⁿ(i).
When an input coding frame is unvoiced, the Energy E_n of the residual obtained for this frame is calculated (Process VII). $\sqrt{E_{n}}$ is then quantized and transmitted.
When the nth coding frame is classified as voiced, a segment of P_n residual samples is obtained (P_n is the pitch period value associated with the nth frame). This segment is centred in the middle of the frame. The selected P_n samples are DFT transformed (Process V) to yield ┌(P _n + 1)/2┐ spectral magnitude values MĜ $\binom{n}{j}$ , 0≤j<┌(P _n + 1)/2┐, and ┌(P _n + 1)/2┐phase values. The phase information is neglected. The magnitude information is coded (using Process VI) and transmitted. In addition a segment of 20 msecs, which is centred in the middle of the nth coding frame, is obtained from the residual signal Rⁿ(i). This is input to Process IV, together with P_n to provide the strongly/weakly voiced classification parameters hv_j ⁿ of the harmonics ω_j ⁿ. Process IV produces quantized Bhv information, which for voiced frames is multiplexed and transmitted to the receiver together with the voiced/unvoiced decision V_n, the pitch period P_n, the quantized LPC coefficients â of the corresponding LPC frame, and the magnitude values MĜ $\binom{n}{j}$ . In unvoiced frames only the $\sqrt{E_{n}}$ quantized value and the quantized LPC filter coefficients â are transmitted.
Figure 3 schematically illustrates processes operated by the system decoder. In general terms, given the received parameters of the nth coding frame and those of the previous (n-1)th coding frame, the decoder synthesises a speech signal S_n(i) that extends from the middle of the (n-1)th frame to the middle of the nth frame. This synthesis process involves the generation in parallel of two excitation signals Res_n(i) and Res_n-1(i) which are used to drive two independent LPC synthesis filters 1/A _n(z) and 1/A _n _-1 (z) the coefficients of which are derived from the transmitted quantized coefficients â. The outputs X_n(i) and X_n-1(i) of these synthesis filters are weighted and added to provide a speech segment which is then post filtered to yield the recovered speech S_n(i). The excitation synthesis process used in both paths of Figure 3 is shown in more detail in Figure 4.
The process commences by considering the voiced/unvoiced status V_k, where k is equal to n or n-1, (see Figure 4). When the frame is unvoiced i.e. V_k=0, a gaussian random number generator RG(0,1) of zero mean and unit variance, provides a time series which is subsequently scaled by the value received for this frame. This is effectively the required: signal which is then presented to the corresponding LPC synthesis filter 1/A _k (z), k=n or n-1. Performance could be increased if the $\sqrt{E_{k}}$ value was calculated, quantized and transmitted every 5msecs. Thus, provided that bits are available when coding unvoiced speech, four $\sqrt{E_{kξ}}$ , ξ=0,..,3, values are transmitted for every unvoiced frame of 20msecs duration (160 samples).
In the case where V_k=1, the Res_k(i) excitation signal is defined as the summation of a "harmonic" Res_k ^h(i) component and a "random" Res_k ^r(i) component. The top path of the V_k=1 part of the synthesis in Figure 4, which provides the harmonic component of this mixed excitation model, calculates always the instantaneous harmonic frequency function ω_j ⁿ(i) which is associated with the interpolation interval that is defined between the middle points of the nth and (n-1)th frames. (i.e. this action is independent of the value of k). Thus, when decoding the nth frame, ω_j ⁿ(i) is calculated using the pitch frequencies f $\binom{1,n}{j}$ , f $\binom{2,n}{j}$ and linear interpolation i.e. $ω_{j}^{n} (i) = 2 π \frac{ƒ_{j}^{1,n} {-ƒ}_{j}^{2,n}}{M} i+ 2 {πƒ}^{2,} _{j}^{n}$ with 0≤j<┌(P _max+1)/2┐, 0≤i≤M and P _max = max[P _n,P _n _-1]
The frequencies, f_j ^1,n and f_j ^2,n are defined as follows:

I) When both the nth and (n-1)th coding frames are voiced i.e V_n=1 and V_n-1=1, then the pitch frequencies are estimated as follows:
- a) $If | P_{n} - P_{n} _{-1} |≤0.2×(P_{n} + P_{n} _{-1})$ which means that the pitch values of the nth and (n-1)th coding frames are rather similar, then: $ƒ_{j}^{1, n} = j \frac{1}{P_{n}} +(1 {-hv}_{j}^{n}) ×RU (-a,+a)$ The ƒ $\binom{1, n -1}{j}$ value is calculated during the decoding process of the previous (n-1)th coding frame. hv_j ⁿ is the strongly/weakly voiced classification (0, or 1) of the jth harmonic ω_j ⁿ. P_n and P_n-1 are the received pitch estimates from the n and n-1 frames. RU(-a,+a) indicates the output of a random number generator with uniform pdf within the -a to +a range. (a=0.00375)
- b) if $| P_{n} - P_{n} _{-1} |>0.2×(P_{n} + P_{n} _{-1})$ then and $ƒ_{j}^{2, n} = ƒ_{j}^{1,n-1} + b × j$
where b is defined as: Notice that in case (b) which applies for significantly different P_n and P_n-1 pitch estimates, equations 11 and 12 ensure that the rate of change of the ω_j ⁿ(i) function is restricted to
II) When one of the two coding frames (i.e. n, n-1) is unvoiced, one of the following two definitions is applicable:
- a) for V_n-1 =0 and V_n=1 $ƒ_{j}^{2, n} = \frac{1}{P_{n}} j$ and f $\binom{1,n}{j}$ is given by Equation (8).
- b) for V_n-1=1 and V_n=0
  f $\binom{2,n}{j}$ is set to the f $\binom{1,n-1}{j}$ value, which has been calculated during the decoding process of the previous (n-1)th coding frame and f $\binom{1,n}{j}$ =f $\binom{2,n}{j}$ .

_j

ⁿ

_j

ⁿ

Furthermore, the "harmonic" component Re s $\binom{h}{k}$ (i) of the residual signal is given by: where k=n or n-1, and
MĜ $\binom{k}{j}$ j = 0,...,└(P _k + 1)/2┘-1 are the received magnitude values of the "kth" coding frame, with k=n or k=n-1.
The second path of the V_k=1 case in Figure 4 provides the random excitation component Re s $\binom{r}{k}$ (i). In particular, given the recovered strongly/weakly voiced classification values hv_j ^k, the system calculates for those harmonics with hv_j ^k=0 the number of random sinusoidal NRS comoonents, which are used to randomise the corresponding harmonic. This is: where fs is the sampling frequency. Notice that the NRS random sinusoidal components are located symmetrically about the corresponding harmonic ω $\binom{k}{j}$ and they are spaced 50 Hz apart.
The instantaneous frequency of the qth random component, q=0,1,...,NRS-1, for the jth harmoric ω $\binom{k}{j}$ is calculated by: and 0≤ i≤ M
The associated phase value is:
and 0≤ i≤ M where ϕ_j _, _q = RU(π,-π). In addition, the Ph $\binom{k}{j,q}$ (i) function is randomised at pitch intervals (i.e. when the phase of the fundamental harmonic component is a multiple of 2π, i.e. mod(phase $\binom{n}{p}$ (i),2π)=0).
Given the Ph $\binom{k}{j,q}$ (i), the random excitation component Res_kr(i) is calculated as follows: where
Thus for V_k=1 voiced coding frames, the mixed excitation residual is formed as: $Re s_{k} (i) = Re s_{k}^{h} (i) + Re s_{k}^{r} (i)$
Notice that when V_k=0, instead of using Equation 5, the random excitation signal Res_k(i) can be generated by the summation of random cosines located 50 Hz apart, where their phase is randomised every λ samples. and λ<M, i.e ξ = 0,1,2,..., and 0≤i<M and
ζ is defined so as to ensure that the phase of the cos terms is randomised every λ samples across frame boundaries. The resulting Res_n(i) and Res_n-1(i) excitation sequences, see Figure 4, are processed by the corresponding 1/A _n (z) and 1/A _n _-1 (z) LPC synthesis filters. When coding the next (n+1)th frame, 1/A _n _-1(z) becomes 1/A_n(z) (including the memory) and 1/A_n(z) becomes 1/A _n ₊₁(z) with the memory of 1/A_n(z). This is valid in all cases except during an unvoiced to voiced transition, where the memory of the 1/A _n+1(z) filter is set to zero. The coefficients of the 1/A_n(z) and 1/A _n-1(z) synthesis filters are calculated directly from the nth and (n-1)th coding speech frames respectively, when the LPC analysis frame size L is equal to M samples. However, when L≠M (usually L>M) linear interpolation is used on the filter coefficients (defined every L samples) so that the transfer function of the synthesis filter is updated every M samples.
The output signals of these filters, denoted as X_n _-1(i) and X_n(i), are weighted, overlapped and added as schematically illustrated in Figure 5 to yield X̂ _n (i) i.e: ${\hat{X}}_{n} (i) = W_{n} _{-1} (i) X_{n} _{-1} (i)+ W_{n} (i) X_{n} (i)$ where and
X̂ _n(i) is then filtered via a PF(z) post filter and a high pass filter HP(z) to yield the speech segment S'_n(i). PF(z) is the conventional post filter: $PF (z)= \frac{A (z / b)}{A (z / c)} (1-µ z^{-1})$ with b=0.5, c=0.8 and µ=0.5K $\binom{n}{j}$ .K $\binom{n}{j}$ is the first reflection coefficient of the nth coding frame. HP(z) is defined as: $HP (z)= \frac{b^{l} {-c}_{1} z^{-l}}{{l-a}_{1} z^{-l}}$ with b₁=c₁=0.9807 and a₁=0.961481.
In order to ensure that the energy of the recovered S(i) signal is preserved, as compared to that of the X̂(i) sequence, a scaling factor SC is calculated every LPC frame of L samples. where: SC_l is associated with the middle of the Ith LPC frame as illustrated in Figure 6. The filtered samples from the middle of the (l-1)th frame to the middle of the Ith frame are then multiplied by SC₁(i) to yield the final output of the system, S_|(i)=SC_l(i)×S'_l(i) where: ${SC}_{l} (i)= {SC}_{l} W_{l} (i) {+SC}_{l} _{-1} W_{l} _{-1} (i) 0≤i<L$ and
The scaling process introduces an extra half LPC frame delay into the coding-decoding process.
The above described energy scaling procedure operates on an LPC frame basis in contrast to both the decoding and PF(z), HP(z) filtering procedures which operate on the basis of a frame of M samples.
Details of the coding processes represented in Figure 1 will now be described.
Process I derives a voiced/unvoiced (V/UV) classification V_n for the nth input coding frame and also assigns a pitch estimate P_n to the middle sample M_n of this frame. This process is illustrated in Figure 7.
The V/UV and pitch estimation analysis frame is centred at the middle M_n+1 of the (n+1)th coding frame with 237 samples on either side. The signal x(i) in the above analysis frame is low pass filtered with a cut off frequency f_c=1.45KHz and the resulting (-147, 147) samples centred about M_n+1 are used in a pitch estimation algorithm, which yields an estimate P_Mn+1. The pitch estimation algorithm is illustrated in Figure 8, where P represents the output of the pitch estimation process. The 294 input samples are used to calculate a crosscorrelation function CR(d), where d is shown in Figure 9 and 20≤d≤147. Figure 9 shows the two speech segments which participate in the calculation of the crosscorrelation function value at "d" delay. In particular, for a given value of d, the crosscorrelation function ρ^d(j) is calculated for the segments {x_L}^d, (x_R}^d,as: where:

x_L ^d (i)=x(M_n+1-d+j+i), x_R ^d (i)=x(M_n+1+j+i), for 0≤i≤d-j-1, j=0,1,...,f(d) (Figure 10 schematically represents the X $\binom{d}{l}$ _, and X $\binom{d}{R}$ speech segments used in the calculation of the value CR(d) and the non linear relationship between d and f(d) is given in Figure 11 $\bar{x}$ $\binom{d}{L}$ and
$\bar{x}$ $\binom{d}{R}$ represent the mean value of the {x_L}^d and {x_R}^d sequences respectively.

In addition to CR(d), the box in Figure 8 labelled "Calculation of CR function and selection of its peaks", whose detailed diagram is shown in Figure 12, provides also the locations loc(k) of the peaks of the CR(d) function, where k=1,2,...,Np and Np is the number of peaks in a CR(d) function.
Figure 12 is a block diagram of the process involving the calculation of the CR function and the selection of its peaks. As illustrated, given CR(d), a threshold th(d) is determined as: $th (d)= CR (d^{n} _{max}^{+1}) -b- ({d-d}^{n+} _{max}^{1})× a-c$ where c=0.08 when AND(d > 0.875 × P ¹ _n)AND(d < 1.125 × P ¹ _n)
or c=0 elsewhere.
and constants a and b are defined as:
d $\binom{n +1}{max}$ is equal to the value of d for which CR(d) is maximised to CR $\binom{max}{M_{n} _{+1}}$ . Using this threshold the CR(d) function is clipped to CR_L(d). i.e.
CR_L(d)=0 for CR(d)≤th(d)
CR_L(d)=CR(d) otherwise.
CR_L(d) contains segments G_s s=1,2,3...., of positive values separated by G₀ runs of zero values. The algorithm examines the length of the G₀ runs which exist between successive G_s segments (i.e. G_s and G_s+1), and when G₀ < 17, then the G_s segment with the max CR_L(d) value is kept. This procedure yields CR̂ _L(d), which is then examined by the following "peak picking" procedure. In particular those CR̂ _L(d) values are selected for which: $C {\hat{R}}_{L} (d) >C {\hat{R}}_{L} (d -1) and C {\hat{R}}_{L} (d) > C {\hat{R}}_{L} (d + 1)$
However certain peaks can be rejected if: $C {\hat{R}}_{L} (loc (k)) ≤C {\hat{R}}_{L} (loc (k+ 1))×0.9$ This ensures that the final CR̂ _L(loc(k)) k=1,...,Np does not contain spurious low level CR̂ _L(d) peaks. The locations d of the above defined CR̂ _L _.(d) peaks are given by loc(k) k=1,2,...,Np.
CR(d) and loc(k) are used as inputs to the following Modified High Resolution Pitch. Estimation algorithm (MHRPE) shown in Figure 8, whose output is P_Mn+1. The flowchart of this MHRPE procedure is shown in Figure 13, where P is initialised with 0 and, at the end, the estimated P is the requested P_Mn+1. In Figure 13 the main pitch estimation procedure is based on a Least Squares Error (LSE) algorithm which is defined as follows:
For each possible pitch value j in the range from 21 to 147 with an increment of 0.1 x j. i.e. j ∈{21,23,25,27,30,33,36,40,44,48,53,58,64,70,77,84,92,101,111,122,134}. (Thus 21 iterations are performed.)

1) Form the multiplication factor vector:
2) Reject possible pitch j and go back to (l) if
- a) the same element occurs in $\bar{u}$ _j twice.
- b) the elements of $\bar{u}$ _j have as a common factor a prime number.
3) Form the following error quantity $E_{l} = {\bar{loc}}^{T} \bar{loc} -2 p_{l} {\bar{u}}_{i} ^{T} \bar{loc} + p_{l}^{2} {\bar{u}}_{j} ^{T} {\bar{u}}_{j}$ where $p_{j} = \frac{{\bar{loc}}^{T} {\bar{u}}_{j}}{{\bar{u}}_{j} {\bar{u}}_{j} ^{T}}$
4) Select the p_js value for which the associated Error quantity E_js is minimum. (i.e. js:E _jx ≤ E _{j ∀j} ∈ {21,23,...134}). Set P=p_js.

The next two general conditions "Reject Highest Delay" loc(Np) and "Reject Lowest Delay" loc(1) are included in order to reject false pitch, "double" or "half" values and in general to provide constraints in the pitch estimates of the system. The "Reject Highest Delay" condition involves 3 constraits:

i) if P=0 then reject loc(Np).
ii) if loc(Np) >100 then find the local maximum CR(d_lm) in CR(d) at the vicinity of the estimated pitch P (i.e 0.8×P to 1.2×P) and compare this with th(d_lm), which is determined as in Equation 28 Reject loc(Np) when CR(d_lm)<th(d_lm)-0.02.
iii) If the error E_js of the LSE algorithm is larger than 50 and $\bar{u}$ _p (Np=Np with Np>2 then reject loc(Np).
The flowchart of this is given in Figure 14.

The "Reject Lowest Delay" general condition, whose flowchart is given in Figure 15, rejects loc(1) when the following three constraints are simultaneously satisfied:

i) The density of detection of the peaks of the correlation coefficient function is less than or equal to 0.75. i.e. $\frac{Np}{u _{p_{n}} (Np)} ≤ 0.75$
ii) If the location of the first peak is neglected (i.e. loc(1)), then the remaining locations exhibit a common factor.
iii) The value of the correlation coefficient function at the locations of the missing peaks is relatively small compared to adjacent detected peaks. i.e.
If u_Pn ^k-u_Pn(k)>1, for k=1,...Np. then ${for i=u}_{Pn} {(k)+1 : u}_{Pn} (k+1)-1$
- a) find local maximum CR(d_lm) in the range from (i-0.1)×loc(1) to (i+0.1)× loc(1).
- b) if CR(d_lm) <0.97×CR(u_Pn(k)) then Reject Lowest Delay, END. else Continue

This concludes the pitch estimation procedure of Figure 7 whose output is P_Mn+1. As is also illustrated in Figure 7 however, in parallel to the pitch estimation, Process I obtains 160 samples centred at the middle of the M_n+1 coding frame, removes their mean value, and then calculates R0, R1 and the average R_av of the energies of the previous K non-silence coding frames. K is fixed to 50 for the first 50 non-silence coding frames, increases from 50 to 100 with the next 50 non-silence coding frames, and then remains constant at the value of 100. The flowchart of the procedure that calculates R_av, R1, R0 and updates the R_av buffer is shown in Figure 16, where "Count" represents the number of non-silence speech frames, and "++" denotes increase by one. Notice that TH is an adaptive threshold that is representative of a silence (non speech) frame and is defined as in Figure 17. CR in this case is equal to CR $\binom{max}{M_{n} _{+1}}$ .
Given R0, R1, R_av and CR $\binom{max}{M_{n} _{+1}}$ , the V/UV part of Process I calculates the status V_Mn+1of the n+1 frame. The flowchart of this part of the algorithm is shown in Figure 18 where "V" represents the output V/UV flag of this procedure. Setting the "V" flag to 1 or 0 indicates voiced or unvoiced classification respectively. The "CR" parameter denotes the maximum value of the CR function which is calculated in the pitch estimation process. A diagrammatic representation of the voiced/unvoiced procedure is given in Figure 19.
Having the V_Mn+1 value, the P_Mn+1 estimate and the V'_n and P'_n estimates which have been produced from Process I operating on the previous nth coding frame, as illustrated in Figure 7, part b, two further locations M_n+1+d1 and M_n+1+d2 are estimated and the corresponding [-147,147] segments of filtered speech samples are obtained as illustrated in Figure 7, part b. These additional two analysis frames are used as input to the "Pitch Estimation process" of Figure 8 to yield P_Mn+1+d1 and P_Mn+1+d2. The procedure for calculating d1 and d2 is given in the flowchart of Figure 20.
The final step in part (a) of Process I of Figure 7, evolves the previous V/UV classification procedure of Figure 8 with inputs R0, R1, R_av, and to yield a preliminary value V $\binom{pr}{n +1}$ .
In addition, a multipoint pitch estimation algorithm accepts P_Mn+1, P_Mn+1+d1, P_Mn+1+d2, V_n-1, P_n-1, V'_n, P'_n to provide a preliminary pitch value P $\binom{pr}{n +1}$ . The flowchart of this multipoint pitch estimation algorithm is given in Figure 21, where P₁, P₂ and P_o represent the pitch estimates associated with the M_n+1+d1, M_{n+1 +d2} and M_n+1 points respectively, and P denotes the output pitch estimate of the process, that is P_n+1.
Finally part (b) Process I of Figure 7 imposes constraints on the V $\binom{pr}{n +1}$ and P $\binom{pr}{n +1}$ estimates in order to ensure a smooth evolution for the pitch parameter. The flowchart of this section is given in Figure 22. At the start of this process "V" and "P" represent the voicing flag and pitch estimate values before constraints are applied (V $\binom{pr}{n +1}$ and P $\binom{pr}{n +1}$ in Figure 7) whereas at the end of the process "V" and "P" represent the voicing flag and pitch estimate values after the constraints have been applied (V'_n ₊₁ and P'_n ₊₁) The V'_n+1 and P'_n+1 produced from this section are then used in the next pitch past processing section together with V_n-1, V'_n, P_n-1 and P'_n to yield the final voiced/unvoiced and pitch estimate parameters V_n and P_n for the nth coding frame. This pitch post processing stage is defined in the flowchart of Figures 23, 24 and 25, the output A of Figure 23 being the input to Figure 24, and the output B of Figure 24 being the input to Figure 25. At the start of this procedure "P_n" and "V_n" represent the pitch estimate and voicing flag respectively, which correspond to the nth coding frame prior to post processing (i.e. P $\binom{1}{n}$ , V $\binom{1}{n}$ ) whereas at the end of the procedure "P_n" and "V_n" represent the final pitch estimate and voicing flag associated with the nth frame (i.e. P_n, V_n).
The LPC analysis process (Process II of Figure 1) can be performed using the Autocorrelation, Stabilised Covariance or Lattice methods. The Burg algorithm was used, although simple autocorrelation schemes could be employed without a noticeable effect in the decoded speech quality. The LPC coefficients are then transformed to an LSP representation. Typical values for the number of coefficients are 10 to 12 and a 10th order filter has been used. LPC analysis processes are well known and described in the literature, for example "Digital Processing of Speech Signals", L.R. Rabiner and R.W. Schafer, Prentice - Hall Inc., Englewood Cliffs, New Jersey, 1978. Similarly, LSP representations are well known, for example from "Line Spectrum Pair and Speech Data Compression", F Soong and B.H. Juang, Proc. ICASSP-84, pp 1.10.1-1.10.4, 1984. Accordingly these processes and representations will not be described further in this document.
. In process II, ten LSP coefficients are used to represent the data. These 10 coefficients could be quantized using scalar 37 bits with the following bit allocation pattern [3,4,4,4,4,4,4,4,3,3]. This is a relatively simple process, but the resulting bit rate of 1850 bits/second is unnecessarily high. Alternatively the LSP coefficients can be Vector Quantised (VQ) using a Split-VQ technique. In the Split-VQ technique an LSP parameter vector of dimension "p" is split into two or more subvectors of lower dimensions and then, each subvector is Vector Quantised separately (when Vector Quantising the subvectors a direct VQ approach is used). In effect, the LSP transformed coefficient vector, C, which consists of "p" consecutive coefficients (c₁,c₂,...,c_p) is split into "K" vectors, C^k (1≤k≤K), with the corresponding dimensions d_k (1≤d_k≤p). p=d₁+d₂+...+d_k. In particular, when "K" is set to "p" (i.e. when C is partitioned into "p" elements) the Split-VQ becomes equivalent to Scalar Quantisation. On the other hand, when K is set to unity (K=1, d_k=p) the Split-VQ becomes equivalent to Full Search VQ.
The above Split VQ approach leads to an LPC filter bit rate of the order of 1.3 to 1.4Kbits/sec. In order to minimize further the bit rate of the voice coded system described in this document a Split Matrix VQ (SMQ) has been developed in the University of Manchester and reported in "Efficient Coding of LSP Parameters using Split Matrix Quantisation", C.Xydeas and C.Papanastasiou, Proc ICASSP-95, pp 740-743, 1995. This method results in transparent LPC quantisation at 900bits/sec and offers a flexible way to obtain, for a given quantisation accuracy, the required memory/complexity characteristics for Process III. An important feature of SMQ is a new weighted Euclidean distance which is defined in detail as follows. where L'_k(1) represents the kth (k=1,...,K) quantized submatrix and are its elements. m(k) represents the spectral dimension of the kth submatrix and N is the SMQ frame dimension. Note also that: when the N LPC frames consist of both voiced and unvoiced frames
w_t(t) = En(t)^α1 otherwise
where Er(t) is the normalised energy of the prediction error of the (1+t)th frame, En(t) is the RMS value of the (1+t)th speech frame and Aver(En) is the average RMS value of the N LPC frames used in SMQ. The values of the constants α and α1 are set to 0.2 and 0.15 respectively.
Also: $w_{s,k} (s,t)=| {LSP}^{x} _{s}^{+1} _{(} _{k} _{-1)+} _{s} |^{β}$ where P(l _k+s ⁿ⁺¹) is the value of the power envelope spectrum of the (1+t) speech frame at the l_k+s LSP $\binom{s+t}{s (k -1)+ z}$ frequency. β is equal to 0.15
The overall SMQ quantisation process that yields the quantised LSP coefficients vectors l̂ ¹ to Î ^l+N-1 for the 1 to l+N-1 analysis frames is shown in Figure 26. This figure also includes the inverse process, which accepts the above Î ^l+i vectors i=0,..,N-1 and provides the corresponding LPC coefficients vector a ^l to a ^l+N ^-1. The a ^l+i i=0,..,N-1, coefficients vectors are modified, prior to the LPC to LSP transformation, by a 10 Hz bandwidth expansion as indicated in Figure 26. A 5Hz bandwidth expansion is also included in the inverse quantisation process.
Process IV of Figure l will now be described. This process is concerned with the mixed voiced classification of harmonics. When the nth coding frame is classified as voiced, the residual signal Rⁿ(i) of length 160 samples centred at the middle M_n of the nth coding frame and the pitch period P_n for that frame are used to determine the strongly voiced (hv_j=1)/weakly voiced (hv_j=0) classification associated with the jth harmonic ω_j ⁿ. The flowchart of Process IV is given in Figure 27. The R ⁿ array of 160 samples is Hamming windowed and augmented to form a 512 size array, which is then FFT processed. The maximum and minimum values MGR_max, MGR_min of the resulting 256 spectral magnitude values are determined, and a threshold TH0 is calculated. TH0 is then used to clip the magnitude spectrum. The clipped MGR array is searched to define peaks MGR(P) satisfying: $MGR(P)>MGR(P+1) and MGR(P)>MGR(P-1)$
For each peak, MGR(P), "supported" by the MGR(P+1) and MGR(P-1) values a second order polynomial is fitted and the maximum point of this curve is accepted as MGR(P) with a location loc(MGR(P)). Further constraints are then imposed on these magnitude peaks. In particular peaks are rejected :

a) if there are spectral peaks in the neighbourhood of loc(MGR(P)) (i.e in the range (loc(MGR(P))-fo/2 to loc(MGR(P))+fo/2 where fo is the fundamental frequency in Hz), whose value is larger than 80% of MGR(P) or
b) if there are any spectral magnitudes in the same range whose value is larger than MGR(P).
After applying these two constraints the remaining spectral peaks are characterised as "dominant" peaks. The objective of the remaining part of the process is to examine if there is a "dominant" peak near a given harmonic j×ω₀, in which case the harmonic is classified as strongly voiced and hv_j=1, otherwise hv_j=0. In particular, two thresholds are defined as follows: ${TH1=0.15×fo, TH2=(1.5/P}_{n})×fo$ with fo=(1/P_n)×fs and fs is the sampling frequency.

The difference (loc(MGR_d(k))-loc(MGR_d(k-1)) is compared to 1.5×fo+TH2, and if larger a related harmonic is not associated with a "dominant" peak and the corresponding classification hv is zero (weakly voiced). (loc(MGR_d (k)) is the location of the kth dominant peak and k=1,...,D where D is the number of dominant peaks. This procedure is described in detail in Figure 28, in which it should be noted that the harmonic index j does not always correspond to the magnitude spectrum peak index k, and loc(k) is the location of the kth dominant peak, i.e. loc (MGR_d(k)) = loc(K).
In order to minimise the bit rate associated with the transmission of the hv_j information, two schemes have been employed which coarsely represent hv.

Scheme I

The spectrum is divided into bands of 500Hz each and a strongly voiced/weakly voiced flag Bhv is assigned for each band: The first and last 500Hz bands i.e. 0 to 500 and 3500 to 4000Hz are always regarded as strongly voiced (Bhv=1) and weakly voiced (Bhv=0) respectively. When V_n=1 and V_n-1=1 the 500 to 1000 Hz band is classified as voiced i.e. Bhv=1. Furthermore, when V_n=1 and V_n-1=0 the 3000 to 3500 Hz band is classified as weakly voiced i.e. Bhv=0. The Bhv values of the remaining 5 bands are determined using a majority decision rule on the hv_j values of the j harmonics which fall within the band under consideration. When the number of harmonics for a given band is even and no clear majority can be established i.e. the number of harmonics with hv_j=1 is equal to the number of harmonics with hv_j=0, then the value of Bhv for that band is set to the opposite of the value assigned to the immediately preceding band. At the decoding process the hv_j of a specific harmonic j is equal to the Bhv value of the corresponding band. Thus the hv information may be transmitted with 5 bits.

Scheme II

In this case the 680 Hz to 3400 Hz range is represented by only two variable size bands. When V_n=1 and V_n-1=0 the Fc frequency that separates these two bands can be one of the following:

(A) 680, 1360, 2040, 2720.
whereas, when V_n=1 and V_n-1=1, Fc can be one of the following frequencies:
(B) 1360, 2040, 2720, 3400.
Furthermore, the 0 to 680 and 3400 to 4000 Hz bands are always represented with Bhv=1 and Bhv=0 respectively. The Fc frequency is selected by examining the three bands sequentially defined by the frequencies in (A) or (B) and by using again a majority rule on the harmonics which fall within a band. When a band with a mixed voiced classification Bhv=0 is found, i.e. the number of harmonics with hv_j=0 is larger than to the number of harmonics with hv_j=1, then Fc is set to the lower boundary of this band and the remaining spectral region is classified as Bhv=0. In this case only 2 bits are allocated to define Fc. The lower band is strongly voiced with Bhv=1, whereas the higher band is weakly voiced with Bhv=0.

To illustrate the effect of the mixed voice classification on the speech synthesised from the transmitted information, Figures 29 and 30 represent respectively an original speech waveform obtained for the utterance "Industrial shares were mostly a" and frequency tracks obtained for that utterance. The horizontal axis represents time in terms of frames each of 20msec duration. Figure 31 shows to a larger scale a section of Figure 30, and represents frequency tracks by full lines for the case when the voiced frames are all deemed to be strongly voiced (hv=1) and by dashed lines when the strongly/weakly voiced classification is taken into account so as to introduce random perturbations when hv=0.
Figure 32 shows four waveforms A, B, C and D. Waveform A represents the magnitude spectrum of a speech segment and the corresponding LPC spectral envelope (log₁₀ domain). Waveforms B, C and D represent the normalised Short-Term magnitude spectrum of the corresponding residual segment (B), the excitation segment obtained using the binary (voiced/unvoiced) excitation model (C), and the excitation segment obtained using the strongly voiced/weakly voiced/unvoiced hybrid excitation model (D). It will be noted that the hybrid model introduces an appropriate amount of randomness where required in the 3π/4 to π range such that curve D is a much closer approximation to curve B than curve C.
Process V of Figure 1 will now be described. Once the residual signal has been derived, a segment of P_n samples is obtained in the residual signal domain. The magnitude spectrum of the segment, which contains excitation source information, is derived by applying a P_n points DFT. An alternative solution, in order to avoid the computational complexity of the P_n points DFT, is to apply a fix length FFT (128 points) and to find the value of the magnitude spectrum at the desired points, using linear interpolation.
For a real-valued sequence x(i) of P points the DFT may be expressed as: The P_n point DFT will yield a double-side spectrum. Thus, in order to represent the excitation signal as a superposition of sinusoidal signals, the magnitude of all the non DC components must be multiplied by a factor of 2. The total number of single side magnitude spectrum values, which are used in the reconstruction process, is equal to ┌(P _n + 1) / 2┐
Process VI of Figure 1 will now be described. The DFT (Process V) applied on the Pn samples of a pitch segment in the residual domain, yields ┌(P _n + 1) /2┐ spectral magnitudes (MG_j ⁿ, 0<j<┌(P _n + 1)/2┐) and ┌(P _n+1)/2┐ phase values. The phase information is neglected. However, the continuity of the phase between adjacent voiced frames is preserved. Moreover, the contribution of the DC magnitude component is assumed to be negligible and thus, MG₀ ⁿ is set to 0. In this way, the non-DC magnitude spectrum is assumed to contain all the perceptually important information.
Based on the assumption of an "approximately" flat shape magnitude spectrum for the pitch residual segment, various methods could be used to represent the entire magnitude spectrum with a single value. Specifically, a modified single value spectral amplitude representation (MSVSAR) technique is described below.
MSVSAR is based on the observation that some of the speech spectrum resonance and anti-resonance information is also present at the residual magnitude spectrum (G.S. Kang and S.S. Everett, "Improvement of the Excitation Source in the Narrow-Band Linear Prediction Vocoder", IEEE Trans. Acoust., Speech and Signal Proc., Vol. ASSP-33, pp.377-386, 1985). LPC inverse filtering can not produce a residual signal of absolutely flat magnitude spectrum mainly due to: a) the "cascade representation" of formats by the LPC filter 1/A(z), which results in the magnitudes of the resonant peaks to be dependent upon the pole locations of the 1/A(z) all-pole filter and b) the LPC quantisation noise. As a consequence, the LPC residual signal is itself highly intelligible. Based on this observation the MG_j ⁿ magnitudes are obtained by spectral sampling at the harmonic locations, ω_j ⁿ, j=1,..., ∟(P _n + 1)/2┘, of a modified LPC synthesis filter, that is defined as follows: where, â $\binom{n}{p}$ i=1,...,p represent the p quantised LPC coefficients of the nth coding frame and G_R and G_N are defined as follows: and where K_i ⁿ, i=1,...,p are the reflection coefficients of the nth coding frame, x_n ^rm(i) represents a sequence of 2P_n speech samples centred in the middle of the nth coding frame from which the mean value is being calculated and removed, M̂P(ω $\binom{n}{j}$ ) and H(ω $\binom{n}{j}$ ) represent the frequency response of the MP(z) and 1/A(z) filters respectively at the ω_j ⁿ frequency. Notice that the M̂P(ω $\binom{n}{j}$ ) values are calculated assuming G_N=1. The _R parameter represents a constant whose value is set to 0.25.
Equation 32 defines a modified LPC synthesis filter with reduced feedback gain, whose frequency response consists of nearly equalised resonant peaks, the locations of which are very close to the LPC synthesis resonant locations. Furthermore, the value of the feedback gain G_R is controlled by the performance of the LPC model (i.e. it is proportional to the normalised LPC prediction error). In addition Equation 34 ensures that the energy of the reproduced speech signal is equal to the energy of the original speech waveform. Robustness is increased by computing the speech RMS value over two pitch periods.
Two alternative magnitude spectrum representation techniques are described below, which allow for better coding of the magnitude information and lead to a significant improvement in reconstructed speech quality.
The first of the alternative magnitude spectrum representations techniques is referred to below in the "Na amplitude system". The basic principle of this MG $\binom{n}{j}$ quantisation system is to represent accurately those MG $\binom{n}{j}$ values which correspond to the Na largest speech Short Term (ST) spectral envelope values. In particular, given the LPC coefficients of the nth coding frame, the ST magnitude spectrum envelope is calculated (i.e. sampled) at the harmonic frequencies ω $\binom{n}{j}$ and the locations lc(j), j=1,...,Na of the largest Na spectral samples are determined. These locations indicate effectively which of the ┌(P_n + 1)/2┐-1 MG $\binom{n}{j}$ magnitudes are subjectively more important for accurate quantization. The system subsequently selects MG_jn j=lc(1),...,lc(Na) and Vector Quantizes these values. If the minimum pitch value is 17 samples, the number of non-DC MG $\binom{n}{j}$ amplitudes is equal to 8 and for this reason Na≤8. Two variations of the "Na-amplitudes system" were developed with equivalent performance and their block diagrams are depicted in Figure 33 (a) and (b) respectively.

i) Na-amplitudes system with Mean Normalization Factor. In this variation, a pitch segment of P_n residual samples Rⁿ(i), centered about the middle M_n of the nth coding frame is obtained and DFT transformed. The mean value of the spectral magnitudes MG $\binom{n}{p}$ , j=1,..., └(P _n+1)/2┘ is calculated as: m is quantized and then used as the normalization factor of the Na selected amplitudes MG $\binom{n}{j}$ , j=lc(1),...,lc(Na). The resulting Na amplitudes are then vector quantized to MG $\binom{n}{j}$ .
ii) Na-amplitudes system with RMS Normalization Factor. In this variation the RMS value of the pitch segment centered about the middle M_n of the nth coding frame, is calculated as: g is quantized and then used as the normalization factor of the Na selected amplitudes MG $\binom{n}{j}$ , j=lc(1),...,lc(Na). These normalized amplitudes are then Vector Quantised to MG $\binom{n}{j}$ . Notice that the P_n points DFT operation can be avoided in this case, since the magnitude spectrum of the pitch segment is calculated only at the Na selected harmonic frequencies ω $\binom{n}{j}$ , $j=lc(1),...,lc(Na).$

In both cases the quantisation of the m and g factors, used to normalize the MG $\binom{n}{j}$ values, is performed using an adaptive µ-law quantiser with a non-linear characteristic as: $c (A)= A_{max} \frac{{log}_{e} (1+µ| A |/ A_{max})}{{log}_{e} (1+µ)} sgn(A) with µ=255$
This arrangement for the quantization of g or m extends the dynamic range of the coder to not less than 25dBs.
At the receiver end the decoder recovers the MG $\binom{n}{j}$ magnitudes as MG $\binom{n}{j}$ = MG $\binom{n}{j}$ x A, j=lc(1),...,lc(Na). The remaining ┌(P _n+1)/2┐-Na-1 MG $\binom{n}{p}$ values are set to a constant value A. (where A is either "m" or "g"). The block diagram of the adaptive µ-law quantiser is shown in Figure 34.
The second of the alternative magnitude spectrum representation techniques is referred to below as the "Variable Size Spectral Vector Quantisation (VS/SVQ)" system. Coding systems, which employ the general synthesis formula of Equation (1) to recover speech, encounter the problem of coding a variable length, pitch dependant spectral amplitude vector MG. The "Na- amplitudes" MG $\binom{n}{j}$ quantisation schemes described in Figure 33 avoid this problem by Vector Quantising the minimum expected number of spectral amplitudes and by setting the rest of the MG $\binom{n}{j}$ amplitudes to a fixed value. However, such a partially spectrally flat excitation model has limitations in providing high recovered speech quality. Thus, in order to improve the output speech quality, the shape of the entire {MG $\binom{n}{j}$ } magnitude spectrum should be quantised. Various techniques have been proposed for coding {MG $\binom{n}{j}$ }. Originally ADPCM has been used across the MG $\binom{n}{j}$ values associated to a specific coding frame. Also {MG $\binom{n}{j}$ } has been DCT transformed and coded differentially across successive MG $\binom{n}{j}$ magnitude spectra. However, these coding schemes are rather inefficient and operate with relatively high bit rates. The introduction of Vector Quantisation on the {MG $\binom{n}{j}$ } spectral amplitude vectors allowed for the development of Sinusoidal and Prototype Interpolation systems which operate at around 2.4 Kbits/sec. Two known {MG $\binom{n}{j}$ } VQ methods are described below which quantise a variable size (vs_n) input vector with a fixed size (fxs) codevector.

i) The first VQ method involves the transformation of the input vector to a fixed size vector followed by conventional Vector Quantisation. The inverse transformation on the quantised fixed size vector yields the recovered quantised MĜ ⁿ vector. Transformation techniques which have been used include, Linear Interpolation, Band Limited Interpolation, All Pole modelling and Non-Square transformation. However, the overall distortion produced by this approach is the summation of the VQ noise and a component, which is introduced by the transformation process.
ii) The second VQ method achieves the direct quantisation of a variable input vector with a fixed size code vector. This is based in selecting only vs_n elements from each codebook vector, to form a distortion measure between a codebook vector and an input MGⁿ vector. Such a quantisation approach avoids the transformation distortion of the previous techniques mentioned in (i) and results in an overall distortion that is equal to the Vector Quantisation noise.

An improved VQ method will now be described which is referred to below as the Variable Size Spectral Vector Quantisation (VS/SVQ) scheme. This scheme was developed to take advantage of the underlying principle that the actual shape of the { MG $\binom{n}{j}$ } magnitude spectrum is defined by a minimum ┌(P _n + l)/ 2┐ of equally spaced samples. If we consider the maximum expected pitch estimate P_max, then any { MG $\binom{n}{j}$ } spectral shape can be represented adequately by ┌(P _n + 1)/2┐ samples. This suggests that the fixed size fxs of the codebook vectors S ⁱ representing the MG $\binom{n}{j}$ shapes should not be larger than ┌(P _n + 1)/2┐. Of course this also implies that given the ┌(P _n + 1)/2┐ samples of a codebook vector, the complete spectral shape, defined at any frequency, is obtained via an interpolation process.
Figure 35 highlights the VS/SVQ process. The codebook CBS having cbs fixed fxs dimension vectors S' _j, j=1,...,fxs and i=1,...,cbs, where fxs is┌(P _n + 1)/2┐, is used to quantise an input vector MG $\binom{n}{j}$ , j=1,...,vs_n of dimension vs_n. Interpolation (in this case linear) is used on the S ⁱ vectors to yield S'' vectors of dimension vs_n.. The S' to S'' interpolation process is given by: for i=1,...,cbs and j=1,..., vs_n
This process effectively defines S" spectral shapes at the ω $\binom{n}{j}$ frequencies of the MG $\binom{n}{j}$ vector. A distortion measure D( S",MG ⁿ) is then defined between the S ^l and MG ⁿ vectors, and the codebook vector S ^l that yields the minimum distortion is selected and its index I is transmitted. Of course in the receiver, Equation (38) is used to define MĜ ⁿ from S ^l.
If we assume that P_max≈120 then fxs=60. However this value can be reduced to 50 without significant degradation by low pass filtering the signal synthesised from Equation (1). This is achieved by setting to zero all the harmonics MG $\binom{n}{j}$ in the region of 3.4 to 4.0KHz, in which case: and vs_n≤fxs.
Amplitude vectors, obtained from adjacent residual frames, exhibit significant redundancy, which can be removed by means of backward prediction. Prediction is performed on a harmonic basis i.e. the amplitude value of each harmonic MG_j ⁿ is predicted from the amplitude value of the same harmonic in previous frames i.e. MG $\binom{n -1}{j}$ . A fixed linear predictor=b × MĜ ^n-1 may be incorporated in the VS/SVQ system, and the resulting DPCM structure is shown in Figure 36 (differential VS/SVQ, (DVS/SVQ)). In particular, error vectors are formed as the difference between the original spectral amplitudes MG_j ⁿ and their predicted ones $\binom{n}{j}$ , i.e.: where the predicted spectral amplitudes $\binom{n}{j}$ are given as: and
Furthermore the quantised spectral amplitudes MĜ $\binom{n}{j}$ are given as: where Ê $\binom{n}{j}$ denotes the quantised error vector.
The quantisation of the E $\binom{n}{j}$ 1≤j≤vs_n error vector incorporates Mean Removal and Gain Shape Quantisation techniques, using the hierarchical VQ structure of Figure 36.
A weighted Mean Square Error is used in the VS/SVQ stage of the system. The weighting function is defined as the frequency response of the filter: W(z) = 1/A _n(z/γ), where A_n(z) is the short-term linear prediction filter and γ is a constant, defined as γ=0.93. Such a weighting function that is proportional to the short-term envelope spectrum, results in substantially improved decoded speech quality. The weighting function W $\binom{n}{j}$ is normalised so that:
The pdf of the mean value of E ⁿ is very broad and, as a result, the mean value differs widely from one vector to another. This mean value can be regarded as statistically independent of the variation of the shape of the error vector E ⁿ and thus, can be quantised separately without paying a substantial penalty in compression efficiency. The mean value of an error vector is calculated as follows: M is Optimum Scalar Quantised to M̂ and is then removed from the original error vector to form Erm ⁿ = ( E ⁿ - M̂). The overall quantization distortion is attributed to the quantization of the "Mean Removed" error vectors (Erm ⁿ ), which is performed by a Gain-Shape Vector Quantiser.
The objective of the Gain-Shape VQ process is to determine the gain value Ĉ and the shape vector Ŝ so as to minimise the distortion measure:
A gain optimised VQ search method, similar to techniques used in CELP systems, is employed to find the optimum Ĝ and Ŝ. The shape Codebook (CBS) of vectors S ⁱ is searched first to vield an index I, which maximises the ouantitv: where cbs is the number of codevectors in the CBS. The optimum gain value is defined as: and is Optimum Scalar Quantised to Ĝ.
During shape quantisation the principles of VS/SVQ are employed, in the sense that the S' ⁱ, vs_n size vectors are produced using Linear Interpolation on fxs size codevectors S ⁱ. Both trained and randomly generated shape CBS codebooks were investigated. Although Erm ⁿ has noise-like characteristics, systems using randomly generated shape codebooks resulted in unsatisfactory muffled decoded speech and were inferior to systems employing trained shape codebooks.
A closed-loop joint predictor and VQ design process was employed to design the CBS codebook, the optimum scalar quantisers CBM and CBG of the mean M and gain G values respectively, and also to define the prediction coefficient b of Figure 36. In particular, the following steps take place in the design process.

STEP A0 (k=0). Given a training sequence of MG_j ⁿ the predictor b⁰ is calculated in an open loop fashion (i.e. = b × MG $\binom{n -1}{j}$ for 1≤j<┌(P _n+1)/2┐ when V_n-1=1, or= 0 elsewhere). Furthermore, the CBM⁰ mean, CBG⁰ gain and CBS⁰ shape codebooks are designed independently and again in an open loop fashion using unquantized E ⁿ. In particular:
- a) Given a training sequence of error vectors E ⁿ 0, the mean value of each E ⁿ ⁰ is calculated and used in the training process of an Optimum Scalar Quantiser (CBM⁰).
- b) Given a training sequence of error vectors E ⁿ ⁰ and the CBM⁰ mean quantiser, the mean value of each error vector is calculated, quantised using the CBM⁰ quantiser and removed from the original error vectors E ⁿ ⁰ to yield a sequence of "Mean Removed" training vectors Erm ⁿ ⁰.
- c) Given a training sequence of Erm ⁿ ⁰ vectors, each "Mean Removed" training vector is normalised to unit power (i.e. is divided by the factor linear interpolated to fxs points, and then used in the training process of a conventional Vector Quantiser of fxs dimension. (CBS⁰).
- d) Given a training sequence of Erm ^{n 0} vectors and the CBS^D shape codebook, each "Mean Removed" training vector is encoded using Equations 46 and 47 and the value G of Equation 47 is used in the training process of an Optimum Scalar Quantiser (CBG⁰).
  k is set to 1 (k=1).
STEP A1 Given a training sequence of MGj and the mean, gain and shape codebooks of the previous k-1 iterations (i.e. CBM^k-1, CBG^k-1, CBS^k-1), the optimum prediction coefficient b^k is calculated.
STEP A2 Given a training sequence of MG_j, an optimum prediction coefficient b^k and CBM^k-1, CBG^k-1, CBS^k-1, a training sequence of error vectors E ⁿ ^k is formed, which is then used for the design of new mean, gain and shape codebooks (i.e. CBMk, CBG^k, CBS^k).
STEP A3 The performance of the kth iteration quantization system (i.e. b^k, CBM^k, CBG^k, CBS^k) is evaluated and compared against the quantization system of the previous iteration (i.e. b^k-1, CBM^k-1, CBG^k-1, CBS^k-1). If the quantization distortion converges to a minimum, the quantization design process stops. Otherwise, k=k+1 and steps A1, A2 and A3 are repeated.

The performance of each quantizer (i.e. b^k, CBM^k, CBG^k, CBS^k) has been evaluated using subjective tests and a LogSegSNR distortion measure, which was found to reflect the subjective performance of the system.
The design for the Mean-Shape-Gain Quantiser used in STEP A2 is performed using the following two steps :

STEP B1 Given a training sequence of error vectors E ⁿ ^k, the mean value of each E ⁿ ^k is calculated and used in the training process of an Optimum Scalar Quantiser (CBM^k).
STEP B2 Given a training sequence of error vectors E ⁿ ^k and the CBM^k mean quantizer, the mean value of each residual vector is calculated, quantized and removed from the original residual vectors E ⁿ ^k to yield a sequence of "Mean Removed" training vectors Erm ⁿ ^k, which are then used as the training data in the design of an optimum Gain Shape Quantizer (CBG^k and CBS^k). This involves steps C1 - C4 below. (The quantization design process is performed under the assumption of any independent gain shape quantiser structure, i.e. an input error vector Emr ⁿ can be represented by any possible combination of S ⁱ codebook shape vectors and Ĝ gain quantizer levels.)
STEP C1 (v=0). Given a training sequence of vectors Erm ⁿ ^k and an initial CBG^k,0 and CBS^k,0 gain and shape codebooks respectively, compute the overall average distortion distance D_k,0 as in Equation 44. Set v equal to I (v=1).
STEP C2 Given a training sequence of vectors Erm ⁿ ^k and the CBG^k,v-1 gain codebook from the previous iteration, compute the new shape codebook CBS^k,v which minimises the VQ distortion measure. Notice that the optimum CBS^k,v shape codebook is obtained when the distortion measure of Equation (44) is a minimum and this is achieved in M1_k,v iterations.
STEP C3 Given a training sequence of vectors Erm ⁿ ^k and the CBS^k,v shape codebook, compute a new gain quantiser CBG^k,v, which minimise the distortion measure of Equation (44). This optimum CBG^k,v gain quantiser is obtained when the distortion measure of Equation (44) is a minimum and this is achieved in M2_k,v iterations.
STEP C4 Given a training sequence of vectors Erm ⁿ ^k and the shape and gain codebooks CBS^k,v and CBG^k,v, compute the average overall distortion measure. If (D_k,v-1- D_k,v)/D_k,v<ε stop. Otherwise, v=v+1 and go back to STEP C2.

The centroids S $\binom{k, v, m}{i, u}$ ,i=1,...,cbs and u=1,...,fxs of the shape Codebook CBS^k,v,m, are updated during the mth iteration performed in STEP C2 (m=1,...,M1_k,v) as follows: where ${DC}_{j,n,u} = W_{j}^{n} (G_{jn}^{k, v -1} × f_{u,} _{j,n})^{2},$ $ƒ_{u,j,n} = 1-| \frac{ƒ_{xs}}{{vs}_{n}} - u |,$ and
Q_i denotes the cluster of Erm ⁿ ^k error vectors which are quantised to the S $\binom{k, v, m -1}{i,}$ codebook shape vector, cbs represents the total number of shape quantisation levels, J_n represents the CBG^k,v-1 gain codebook index which encodes the Erm ^{n k} error vector and 1≤j≤vs_n.
The gain centroids, G $\binom{k, v, m}{i,}$ , i=1,...,cbg of the CBG^k,v,m gain quantiser, which are computed during the mth iteration in STEP C3 (m=1,...,M2_k,v), are given as: where D_j denotes the cluster of Erm ⁿ ^k error vectors which are quantised to the G $\binom{k, v, m -1}{i,}$ gain quantiser level, cbg represents the total number of gain quantisation levels, I_n represents the CBS^k,v shape codebook index which encodes the Erm ⁿ ^k error vector and 1≤j≤vs_n.
The above employed design process is applied to obtain the optimum shape codebook CBS, optimum gain and mean quantizers, CBG and CBM and the optimum prediction coefficient b which was finally set to b=0.35.
Process VII calculates the energy of the residual signal. The LPC analysis performed in Process II provides the prediction coefficients a; 1≤i≤p and the reflection coefficients k_i 1≤i≤p. On the other hand, the Voiced/Unvoiced classification performed in Process I provides the short term autocorrelation coefficient for zero delay of the speech signal (R0) for the frame under consideration. Hence, the Energy of the residual signal E_n value is given as:
The above expression represents the minimum prediction error as it is obtained from the Linear Prediction process. However, because of quantization distortion the parameters of the LPC filter used in the coding-decoding process are slightly different from the ones that achieve minimum prediction error. Thus, Equation (50) gives a good approximation of the residual signal energy with low computational requirements. The accurate E_n value can be given as:
The resulting $\sqrt{E_{n}}$ is then Scalar Quantised using an adaptive µ-law quantised arrangement similar to the one depicted in Figure 34. In the case where more than one $\sqrt{E_{n}}$ are used in the system i.e. the energy E_n is calculated for a number of subframes then E _n _ξ is given by the general equation: Notice that when Ξ = 1, M_s=M and for Ξ=4, M_s=M/4.

Claims

A speech coding system which operates on a frame by frame basis, and in which information is transmitted which represents each frame as either voiced or unvoiced and, for each voiced frame, represents that frame by a pitch period value, quantized magnitude spectral information with associated strong/weak voiced harmonic classification, and LPC filter coefficients, the received pitch period value and magnitude spectral information being used to generate residual signals at the receiver which are applied to LPC speech synthesis filters the characteristics of which are determined by the transmitted filter coefficients, wherein each residual signal is synthesised according to a sinusoidal mixed excitation synthesis process, and a recovered speech signal is derived from the combination of the outputs of the LPC synthesis filters.
A system according to claim 1, wherein a speech signal is divided into a series of frames, and each frame is converted into a coded signal including a voiced/unvoiced classification and a pitch estimate, wherein a low pass filtered speech segment centred about a reference sample is defined in each frame, a correlation value is calculated for each of a series of candidate pitch estimates as the maximum of multiple crosscorrelation values obtained from variable length speech segments centred about the reference sample, the correlation values are used to form a correlation function defining peaks, and the locations of the peaks are determined and used to define a pitch estimate.
A system according to claim 2, wherein the pitch estimate is defined using an iterative process.
A system according to claim 2 or 3, wherein a single reference sample may be used, centred with respect to the respective frame.
A system according to claim 2 or 3, wherein multiple pitch estimates are derived for each frame using different reference samples, the multiple pitch estimates being combined to define a combined pitch estimate for the frame.
A system according to any of claims 2 to 5, wherein the pitch estimate is modified by reference to a voiced/unvoiced status and/or pitch estimates of adjacent frames to define a final pitch estimate.
A system according to any any of claims 2 to 6, wherein the correlation function is clipped using a threshold value, remaining peaks being rejected if they are adjacent to larger peaks.
A system according to claim 7, wherein peaks are selected which are larger that either adjacent peak and peaks are rejected if they are smaller than a following peak by more than a predetermined factor.
A system according to any of claims 2 to 8, wherein the pitch estimation procedure is based on a least squares error algorithm.
A system according to claim 9, wherein the pitch estimation algorithm defines the pitch valve as a number whose multiples best fit the correlation function peak locations.
A system according to any of claims 2 to 10, wherein possible pitch values are limited to integral numbers which are not consecutive, the increment between two successive numbers being proportional to a constant multiplied by the lower of those two numbers.
A system according to claim 1, wherein a speech signal is divided into a series of frames, and each frame is converted into a coded signal including pitch segment magnitude spectral information, a voiced/unvoiced classification, and a mixed voiced classification which classifies harmonics in the magnitude spectrum of voiced frames as strongly voiced or weakly voiced, wherein a series of samples centred on the middle of the frame are windowed to form a data array which is Fourier transformed to produce a magnitude spectrum, a threshold value is calculated and used to clip the magnitude spectrum, the clipped data is searched to define peaks, the locations of peaks are determined, constraints are applied to define dominant peaks, and harmonics not associated with a dominant peak are classified as weakly voiced.
A system according to claim 12, wherein peaks are located using a second order polynomial
A system according to claim 12 or 13, wherein the samples are Hamming windowed.
A system according to claim 12, 13 or 14, wherein the threshold value is calculated by identifying the maximum and minimum magnitude spectrum values and defining the threshold as a constant multiplied by the difference between the maximum and minimum values.
A system according to any one of claims 12 to 15, wherein peaks are defined as those values which are greater than the two adjacent values, a peak being rejected from consideration if neighbouring peaks are of a similar magnitude or if there are spectral magnitudes in the same range of greater magnitude.
A system according to any one of claims 12 to 16, wherein a harmonic is considered as not being associated with a dominant peak if the difference between two adjacent peaks is greater than a predetermined threshold value.
A system according to any one of claims 12 to 17, wherein the spectrum is divided into bands of fixed width and a strongly/weakly voiced classification is assigned for each band.
A system according to any one of claims 12 to 18, wherein the frequency range is divided into two or more bands of variable width, adjacent bands being separated at a frequency selected by reference to the strongly/weakly voiced classification of harmonics.
A system according to claim 18 or 19, wherein the lowest frequency band is regarded as strongly voiced, whereas the highest frequency band is regarded as weakly voiced.
A system according to claim 20, wherein the event that a current frame is voiced, and the following frame is unvoiced, further bands within the current frame will be automatically classified as weakly voiced.
A system according to claim 20 or 21, wherein the strongly/weakly voiced classification is determined using a majority decision rule on the strongly/weakly voiced classification of those harmonics which fall within the band in question.
A system according to claim 22, wherein, if there is no majority, alternate bands are alternately assigned strongly voiced and weakly voiced classifications.
A speech synthesis system in which a speech signal is divided into a series of frames, each frame is defined as voiced or unvoiced, each frame is converted into a coded signal including a pitch period value, a frame voiced/unvoiced classification and, for each voiced frame, a mixed voiced spectral band classification which classifies harmonics within spectral bands as either strongly or weakly voiced, and the speech signal is reconstructed by generating an excitation signal in respect of each frame and applying the excitation signal to a filter, wherein for each weakly voiced spectral band, an excitation signal is generated which includes a random component in the form of a function which is dependent upon the respective pitch period value.
A system according to claim 24, wherein the spectrum is divided into bands and a strongly/weakly voiced classification is assigned to each band.
A system according to claim 24 or 25, wherein the random component is introduced by reducing the amplitude of harmonic oscillators assigned the weakly voiced classification, disturbing the oscillator frequencies such that the frequency is no longer a multiple of the fundamental frequency, and then adding further random signals.
A system according to claim 26, wherein the phase of the oscillators is randomised.
A system according to claim 1, wherein a speech signal is divided into a series of frames, and each voiced frame is converted into a coded signal including a pitch period value, LPC coefficients and pitch segment spectral magnitude information, wherein the spectral magnitude information is quantized by sampling the LPC short term magnitude spectrum at harmonic frequencies, the locations of the largest spectral samples are determined to identify which of the magnitudes are relatively more important for accurate quantization, and the magnitudes so identified are selected and vector quantized.
A system according to claim 28, wherein a pitch segment of P_n LPC residual samples is obtained, where P_n is the pitch period value of the nth frame, the pitch segment is DFT transformed, the mean value of the resultant spectral magnitudes is calculated, the mean value is quantized and used as a normalisation factor for the selected magnitudes, and the resulting normalised amplitudes are quantized.
A system according to claim 28, wherein the RMS value of the pitch segment is calculated, the RMS value is quantized and used as a normalisation factor for the selected magnitudes, and the resulting normalised amplitudes are quantized.
A system according to any one of claims 28 to 30, wherein , at the receiver, the selected magnitudes are recovered, and each of the other magnitude values is reproduced as a constant value.
A system according to claim 1, wherein a variable size input vector of coefficients to be transmitted to a receiver for the reconstruction of a speech signal is vector quantized using a codebook defined by vectors of fixed size, the codebook vectors of fixed size are obtained from variable sized training vectors and an interpolation technique which is an integral part of the codebook generation process, codebook vectors are compared to the variable sized input vector using the interpolation process, and an index associated with the codebook entry with the smallest difference from the comparison is transmitted, the index being used to address a further codebook at the receiver and thereby derive an associated fixed size codebook vector, and the interpolation process being used to recover from the derived fixed sized codebook vector an approximation of the variable sized input vector.
A system according to claim 32, wherein the interpolation process is linear, and for an input vector of given dimension, the interpolation process is applied to produce from the codebook vectors a set of vectors of that given dimension, a distortion measure is then derived to compare the interpolated set of vectors and the input vector, and the codebook vector is selected which yields the minimum distortion.
A system according to claim 33, wherein the dimension of the vectors is reduced by taking into account only the harmonic amplitudes within an input bandwidth range.
A system according to claim 34, wherein the remaining amplitudes are set to a constant value.
A system according to claim 35, wherein the constant value is equal to the mean value of the quantized amplitudes.
A system according to any one of claims 32 to 36, wherein redundancy between amplitude vectors obtained from adjacent residual frames is removed by means of backward prediction.
A system according to claim 37, wherein the backward prediction is performed on a harmonic basis such that the amplitude value of each harmonic of one frame is predicted from the amplitude value of the same harmonic in the previous frame or frames.
A system according to claim 1, wherein a speech signal is divided into a series of frames, each frame is converted into a coded signal including an estimated pitch period, an estimate of the energy of a speech segment the duration of which is a function of the estimated pitch period, and LPC filter coefficients defining an LPC spectral envelope, and a speech signal of related power to the power of the input speech signal is reconstructed by generating an excitation signal using spectral amplitudes which are defined from a modified LPC spectral envelope sampled at harmonic frequencies defined by the pitch period.
A system according to claim 39, wherein the magnitude values are obtained by spectrally sampling a modified LPC synthesis filter characteristic at the harmonic locations related to the pitch period.
A system according to claim 40, wherein the modified LPC synthesis filter has reduced feed back gain and a frequency response which consists of equalised resonant peaks, the locations of which are close to the LPC synthesis resonant locations.
A system according to claim 41, wherein the value of the feed back gain is controlled by the performance of the LPC model such that it is related to the normalised LPC prediction error.
A system according to any one of claims 39 to 42, wherein the energy of the reproduced speech signal is equal to the energy of the original speech waveform.
A system according to claim 1, wherein a speech signal is divided into a series of frames, each frame is converted into a coded signal including LPC filter coefficients and at least one parameter associated with a pitch segment magnitude, and the speech signal is reconstructed by generating two excitation signals in respect of each frame, each pair of excitation signals comprising a first excitation signal generated on the basis of the pitch segment magnitude parameter or parameters of one frame and a second excitation signal generated on the basis of the pitch segment magnitude parameter or parameters of a second frame which follows and is adjacent to the said one frame, applying the first excitation signal to a first LPC filter the characteristics of which are determined by the LPC filter coefficients of the said one frame and applying the second excitation signal to a second LPC filter the characteristics of which are determined by the LPC filter coefficients of the said second frame, and weighting and combining the outputs of the first and second LPC filters to produce one frame of a synthesised speech signal.
A system according to claim 44, wherein the first and second excitation signals include the same phase function and different phase contributions from the two LPC filters.
A system according to claim 45, wherein the outputs of the first and second LPC filters are weighted by half a window function such that the magnitude of the output of the first filter is decreasing with time and the magnitude of the output of the second filter is increasing with time.