HK1058427B - Method and apparatus for identifying frequency bands to compute linear phase shifts between frame prototypes in a speech coder - Google Patents
Description
Background
I. Field of the Invention
The present invention pertains generally to the field of speech processing, and more specifically to a method and apparatus for identifying frequency bands in a speech coder for which to compute linear phase shifts between frame prototypes.
II. Background of the Invention
Transmission of voice by digital techniques has become widespread, particularly in long-distance and digital radio telephone applications. This, in turn, has created interest in determining the least amount of information that can be sent over a channel while maintaining the perceived quality of the reconstructed speech. If speech is transmitted by simply sampling and digitizing, a data rate on the order of 64 kilobits per second (Kbps) is required to achieve the speech quality of a conventional analog telephone. Through the use of speech analysis, followed by the appropriate coding, transmission, and resynthesis at the receiver, however, a significant reduction in the data rate can be achieved.
Devices for compressing speech find use in many fields of telecommunications. An exemplary field is wireless communications. The field of wireless communications has many applications including, e.g., cordless telephones, paging, wireless local loops, wireless telephony such as cellular and PCS telephone systems, mobile Internet Protocol (IP) telephony, and satellite communication systems. A particularly important application is wireless telephony for mobile subscribers.
Various over-the-air interfaces have been developed for wireless communication systems including, e.g., frequency division multiple access (FDMA), time division multiple access (TDMA), and code division multiple access (CDMA). In connection therewith, various domestic and international standards have been established including, e.g., the Advanced Mobile Phone Service (AMPS), the Global System for Mobile Communications (GSM), and Interim Standard 95 (IS-95). An exemplary wireless telephony communication system is a code division multiple access (CDMA) system. The IS-95 standard and its derivatives, IS-95A, ANSI J-STD-008, IS-95B, the proposed third-generation standards IS-95C and IS-2000, etc. (referred to collectively herein as IS-95), are promulgated by the Telecommunications Industry Association (TIA) and other well-known standards bodies to specify the use of a CDMA over-the-air interface for cellular or PCS telephony communication systems. U.S. Patent Nos. 5,103,459 and 4,901,307, which are assigned to the assignee of the present invention and incorporated herein by reference, describe exemplary wireless communication systems configured substantially in accordance with the IS-95 standard.
Devices that employ techniques to compress speech by extracting parameters that relate to a model of human speech generation are called speech coders. A speech coder divides the incoming speech signal into blocks of time, or analysis frames. A speech coder typically comprises an encoder and a decoder. The encoder analyzes the incoming speech frame to extract certain relevant parameters, and then quantizes the parameters into a binary representation, i.e., a set of bits or a binary data packet. The data packets are transmitted over the communication channel to a receiver and a decoder. The decoder processes the data packets, dequantizes them to produce the parameters, and resynthesizes the speech frames using the dequantized parameters.
The function of the speech coder is to compress the digitized speech signal into a low-bit-rate signal by removing all of the natural redundancies inherent in speech. The digital compression is achieved by representing the input speech frame with a set of parameters and employing quantization to represent the parameters with a set of bits. If the input speech frame has a number of bits Ni and the data packet produced by the speech coder has a number of bits No, the compression factor achieved by the speech coder is Cr = Ni/No. The challenge is to retain high voice quality of the decoded speech while achieving the target compression factor. The performance of a speech coder depends on (1) how well the speech model, or the combination of the analysis and synthesis processes described above, performs, and (2) how well the parameter quantization process is performed at the target bit rate of No bits per frame. The goal of the speech model is thus to capture the essence of the speech signal, or the target voice quality, with a small set of parameters for each frame.
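As a rough worked example of the compression factor Cr = Ni/No (the frame sizes and bit counts below are illustrative assumptions drawn from typical values, not figures stated in this document):

```python
# Illustrative only: Ni and No below are assumed example values.
def compression_factor(n_input_bits: int, n_output_bits: int) -> float:
    """Cr = Ni / No: input frame bits divided by coded packet bits."""
    return n_input_bits / n_output_bits

# A 20 ms frame of 160 samples at 8 bits/sample (64 Kbps PCM): Ni = 1280 bits.
ni = 160 * 8
# A hypothetical full-rate coder at 13.2 Kbps spends No = 264 bits per 20 ms frame.
no = 264
print(compression_factor(ni, no))  # roughly 4.85
```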
Perhaps most important in the design of a speech coder is the search for a good set of parameters (including vectors) with which to describe the speech signal. A good set of parameters requires a low system bandwidth for the reconstruction of a perceptually accurate speech signal. Examples of speech coding parameters are pitch, signal power, spectral envelope (or formants), amplitude spectrum, and phase spectrum.
A speech coder may be implemented as a time-domain coder, which attempts to capture the time-domain speech waveform by employing high time-resolution processing to encode small segments of speech (typically 5 millisecond (ms) subframes) at a time. For each subframe, a high-precision representative is found from a codebook space by means of various search algorithms known in the art. Alternatively, the speech coder may be implemented as a frequency-domain coder, which attempts to capture the short-term speech spectrum of the input speech frame with a set of parameters (analysis) and employs a corresponding synthesis process to recreate the speech waveform from the spectral parameters. The parameter quantizer preserves the parameters by representing them with stored representations of code vectors in accordance with known quantization techniques described in Vector Quantization and Signal Compression (A. Gersho and R. M. Gray, 1992).
A well-known time-domain speech coder is the code excited linear predictive (CELP) coder described in Digital Processing of Speech Signals (L. B. Rabiner and R. W. Schafer, 1978, pages 396-453), which is incorporated herein by reference. In a CELP coder, the short-term correlations, or redundancies, in the speech signal are removed by a linear prediction (LP) analysis, which finds the coefficients of a short-term formant filter. Applying the short-term prediction filter to the incoming speech frame generates an LP residual signal, which is further modeled and quantized with long-term prediction filter parameters and a subsequent stochastic codebook. Thus, CELP coding divides the task of encoding the time-domain speech waveform into the separate tasks of encoding the LP short-term filter coefficients and encoding the LP residue. Time-domain coding can be performed at a fixed rate (i.e., using the same number of bits, No, for each frame) or at a variable rate (in which different bit rates are used for different types of frame contents). Variable-rate coders attempt to use only the number of bits needed to encode the codec parameters to a level adequate to obtain a target quality. An exemplary variable-rate CELP coder is described in U.S. Patent No. 5,414,796, which is assigned to the assignee of the present invention and incorporated herein by reference.
Time-domain coders such as the CELP coder typically rely upon a high number of bits per frame to preserve the accuracy of the time-domain speech waveform. Such coders typically deliver superior voice quality provided the number of bits per frame is relatively large (e.g., 8 Kbps or above). However, at low bit rates (4 Kbps and below), time-domain coders fail to retain high quality and robust performance due to the limited number of available bits. At low bit rates, the limited codebook space constrains the waveform-matching capability of conventional time-domain coders, which are so successfully deployed in higher-rate commercial applications. Hence, despite improvements over time, many CELP coding systems operating at low bit rates suffer from perceptually significant distortion typically characterized as noise.
There is presently a surge of research interest and strong commercial need to develop high-quality speech coders operating at medium to low bit rates (i.e., in the range of 2.4 to 4 Kbps and below). The application areas include wireless telephony, satellite communications, Internet telephony, various multimedia and voice-streaming applications, voice mail, and other voice storage systems. The driving forces are the need for high capacity and the demand for robust performance under packet-loss situations. Various recent speech coding standardization efforts are another direct driving force propelling the research and development of low-rate speech coding algorithms. A low-rate speech coder creates more channels, or users, per allowable application bandwidth. A low-rate speech coder coupled with an additional layer of suitable channel coding can fit the overall bit budget of the coder specifications and deliver robust performance under channel error conditions.
One effective technique to encode speech efficiently at low bit rates is multimode coding. An exemplary multimode coding technique is described in U.S. Patent Application Serial No. 09/217,341, entitled "VARIABLE RATE SPEECH CODING," filed December 21, 1998, assigned to the assignee of the present invention and incorporated herein by reference. Conventional multimode coders apply different modes, or coding algorithms, to different types of input speech frames. Each mode, or coding process, is customized to represent a certain type of speech segment, such as, e.g., voiced speech, unvoiced speech, transition speech (e.g., between voiced and unvoiced), and background noise (nonspeech), in the most efficient manner. An external, open-loop mode decision mechanism examines the input speech frame and makes a decision regarding which mode to apply to the frame. The open-loop mode decision is typically performed by extracting a number of parameters from the input frame, evaluating the parameters as to certain temporal and spectral characteristics, and basing a mode decision upon the evaluation.
Coding systems operating at rates around 2.4Kbps are generally parametric. That is, the coding system operates by transmitting parameters describing the pitch period and spectral envelope (or formants) of a voice signal at regular intervals. An illustrative example of these so-called parametric encoders is the LP vocoder system.
An LP vocoder models a voiced speech signal with a single pulse per pitch period. This basic technique may be augmented to include, among other things, transmission of information about the spectral envelope. Although LP vocoders generally provide reasonable performance, they may introduce perceptually significant distortion, typically characterized as buzz.
In recent years, coders have emerged that are hybrids of both waveform coders and parametric coders. An illustrative example of these so-called hybrid coders is the prototype waveform interpolation (PWI) speech coding system. The PWI coding system may also be known as a prototype pitch period (PPP) speech coder. A PWI coding system provides an efficient method for coding voiced speech. The basic concept of PWI is to extract a representative pitch cycle (the prototype waveform) at fixed intervals, to transmit its description, and to reconstruct the speech signal by interpolating between the prototype waveforms. The PWI method may operate either on the LP residual signal or on the speech signal. An exemplary PWI, or PPP, speech coder is described in U.S. Patent Application Serial No. 09/217,494, entitled "PERIODIC SPEECH CODING," filed December 21, 1998, assigned to the assignee of the present invention and incorporated herein by reference. Other PWI or PPP speech coders are described in U.S. Patent No. 5,884,253 and in "Methods for Waveform Interpolation in Speech Coding" (W. Bastiaan Kleijn and Wolfgang Granzow, 1 Digital Signal Processing 215-230, 1991).
In conventional speech coders, the entire phase information of each pitch prototype within each speech frame is transmitted. In low-bit-rate speech coders, however, it is desirable to conserve bandwidth wherever possible, so it would be advantageous to provide a method of transmitting fewer phase parameters. Thus, there is a need for a speech coder that transmits a reduced amount of phase information per frame.
Disclosure of Invention
The present invention is directed to a speech coder that transmits a reduced amount of phase information per frame. Accordingly, in one aspect of the invention, a method of partitioning the frequency spectrum of a prototype of a frame advantageously includes the steps of: dividing the frequency spectrum into a plurality of segments; allocating a plurality of frequency bands to each segment; and establishing bandwidths for the plurality of frequency bands in each segment.
In another aspect of the invention, a speech coder configured to partition the frequency spectrum of a prototype of a frame advantageously includes: means for dividing the frequency spectrum into a plurality of segments; means for allocating a plurality of frequency bands to each segment; and means for establishing bandwidths for the plurality of frequency bands in each segment.
In yet another aspect of the invention, a speech coder advantageously includes: a prototype extractor configured to extract a prototype from a frame being processed by the speech coder; and a prototype quantizer coupled to the prototype extractor and configured to divide the frequency spectrum of the prototype into a plurality of segments, allocate a plurality of frequency bands to each segment, and establish bandwidths for the plurality of frequency bands in each segment.
Drawings
Fig. 1 is a block diagram of a wireless telephone system.
Fig. 2 is a block diagram of a channel terminated at each end by a speech coder.
Fig. 3 is a block diagram of an encoder.
Fig. 4 is a block diagram of a decoder.
Fig. 5 is a flowchart illustrating a speech coding decision process.
Fig. 6A is a graph of voice signal amplitude versus time.
Fig. 6B is a graph of Linear Prediction (LP) residual amplitude versus time.
FIG. 7 is a block diagram of a Prototype Pitch Period (PPP) speech encoding apparatus.
FIG. 8 is a flowchart illustrating the steps of an algorithm executed by a PPP speech coder, such as the speech coder of FIG. 7, to identify a frequency band in a Discrete Fourier Series (DFS) representation of a prototype pitch period.
Detailed description of the preferred embodiments
The exemplary embodiments described below reside in a wireless telephone communication system configured to utilize a CDMA air interface. However, those skilled in the art will appreciate that the subsampling method and apparatus embodying features of the present invention may be present in any communication system utilizing a variety of techniques known to those skilled in the art.
As illustrated in FIG. 1, a CDMA wireless telephone system generally includes a plurality of mobile subscriber units 10, a plurality of base stations 12, base station controllers (BSCs) 14, and a mobile switching center (MSC) 16. The MSC 16 is configured to interface with a conventional public switched telephone network (PSTN) 18. The MSC 16 is also configured to interface with the BSCs 14. The BSCs 14 are coupled to the base stations 12 via backhaul lines. The backhaul lines may be configured to support any of several known interfaces including, e.g., E1/T1, ATM, IP, PPP, Frame Relay, HDSL, ADSL, or xDSL. It is understood that there may be more than two BSCs 14 in the system. Each base station 12 advantageously includes at least one sector (not shown), each sector comprising an omnidirectional antenna or an antenna pointed in a particular direction radially away from the base station 12. Alternatively, each sector may comprise two antennas for diversity reception. Each base station 12 may advantageously be designed to support a plurality of frequency assignments. The intersection of a sector and a frequency assignment may be referred to as a CDMA channel. The base stations 12 may also be known as base transceiver subsystems (BTSs) 12. Alternatively, "base station" may be used in the industry to refer collectively to a BSC 14 and one or more BTSs 12. The BTSs 12 may also be denoted "cell sites" 12. Alternatively, individual sectors of a given BTS 12 may be referred to as cell sites. The mobile subscriber units 10 are typically cellular or PCS telephones 10. The system is advantageously configured for use in accordance with the IS-95 standard.
During typical operation of the cellular telephone system, the base stations 12 receive sets of reverse-link signals from sets of mobile units 10. The mobile units 10 are conducting telephone calls or other communications. Each reverse-link signal received by a given base station 12 is processed within that base station 12. The resulting data is forwarded to the BSC 14. The BSC 14 provides call resource allocation and mobility management functionality, including the orchestration of soft handoffs between base stations 12. The BSC 14 also routes the received data to the MSC 16, which provides additional routing services for interface with the PSTN 18. Similarly, the PSTN 18 interfaces with the MSC 16, and the MSC 16 interfaces with the BSCs 14, which in turn control the base stations 12 to transmit sets of forward-link signals to sets of mobile units 10.
In FIG. 2, a first encoder 100 receives digitized speech samples s(n) and encodes the samples s(n) for transmission on a transmission medium 102, or communication channel 102, to a first decoder 104. The decoder 104 decodes the encoded speech samples and synthesizes an output speech signal sSYNTH(n). For transmission in the opposite direction, a second encoder 106 encodes digitized speech samples s(n), which are transmitted on a communication channel 108. A second decoder 110 receives and decodes the encoded speech samples, generating a synthesized output speech signal sSYNTH(n).
The speech samples s(n) represent speech signals that have been digitized and quantized in accordance with any of various methods known in the art including, e.g., companded mu-law or A-law pulse code modulation (PCM). As known in the art, the speech samples s(n) are organized into frames of input data wherein each frame comprises a predetermined number of digitized speech samples s(n). In an exemplary embodiment, a sampling rate of 8 kHz is employed, with each 20 ms frame comprising 160 samples. In the embodiments described below, the rate of data transmission may advantageously be varied on a frame-to-frame basis from 13.2 Kbps (full rate) to 6.2 Kbps (half rate) to 2.6 Kbps (quarter rate) to 1 Kbps (eighth rate). Varying the data transmission rate is advantageous because lower bit rates may be selectively employed for frames containing relatively less speech information. As understood by those skilled in the art, other sampling rates, frame sizes, and data transmission rates may be used.
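The rate bookkeeping implied by the figures above can be sketched as follows. The rate values are taken from the text; the helper function and frame length constant are our own illustration:

```python
# Sketch: bits available per 20 ms frame at each of the four rates named
# in the text (full, half, quarter, eighth rate).
FRAME_SECONDS = 0.020  # one 20 ms frame

def bits_per_frame(rate_bps: int) -> int:
    """Number of bits the channel budget allows for a single frame."""
    return round(rate_bps * FRAME_SECONDS)

rates = {"full": 13200, "half": 6200, "quarter": 2600, "eighth": 1000}
budgets = {name: bits_per_frame(bps) for name, bps in rates.items()}
print(budgets)  # {'full': 264, 'half': 124, 'quarter': 52, 'eighth': 20}
```

The eighth-rate budget of only 20 bits per frame illustrates why background-noise frames carry so little information compared with the 264-bit full-rate budget.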
The first encoder 100 and the second decoder 110 together comprise a first speech coder. The speech coder could be used in any communication device for transmitting speech signals, including, e.g., the subscriber units, BTSs, or BSCs described above with reference to FIG. 1. Similarly, the second encoder 106 and the first decoder 104 together comprise a second speech coder. It is understood by those of skill in the art that speech coders may be implemented with a digital signal processor (DSP), an application-specific integrated circuit (ASIC), discrete gate logic, firmware, or any conventional programmable software module and a microprocessor. The software module could reside in RAM memory, flash memory, registers, or any other form of writable storage medium known in the art. Alternatively, any conventional processor, controller, or state machine could be substituted for the microprocessor. Exemplary ASICs designed specifically for speech coding are described in U.S. Patent No. 5,727,123 and U.S. Patent Application Serial No. 08/197,417, entitled "VOCODER ASIC," filed February 16, 1994, both of which are assigned to the assignee of the present invention and incorporated herein by reference.
In FIG. 3, an encoder 200 that may be used in a speech coder includes a mode decision module 202, a pitch estimation module 204, an LP analysis module 206, an LP analysis filter 208, an LP quantization module 210, and a residue quantization module 212. Input speech frames s(n) are provided to the mode decision module 202, the pitch estimation module 204, the LP analysis module 206, and the LP analysis filter 208. The mode decision module 202 produces a mode index IM and a mode M based upon characteristics of each input speech frame s(n) such as periodicity, energy, signal-to-noise ratio (SNR), or zero-crossing rate. Various methods of classifying speech frames according to periodicity are described in U.S. Patent No. 5,911,128, which is assigned to the assignee of the present invention and incorporated herein by reference. Such methods are also incorporated into the Telecommunications Industry Association interim standards TIA/EIA IS-127 and TIA/EIA IS-733. An exemplary mode decision scheme is also described in the aforementioned U.S. Patent Application Serial No. 09/217,341.
The pitch estimation module 204 produces a pitch index IP and a lag value P0 based upon each input speech frame s(n). The LP analysis module 206 performs linear predictive analysis on each input speech frame s(n) to generate LP parameters a. The LP parameters a are provided to the LP quantization module 210. The LP quantization module 210 also receives the mode M, thereby performing the quantization process in a mode-dependent manner. The LP quantization module 210 produces an LP index ILP and quantized LP parameters â. The LP analysis filter 208 receives the quantized LP parameters â in addition to the input speech frame s(n). The LP analysis filter 208 generates an LP residual signal R[n], which represents the error between the input speech frame s(n) and the speech reconstructed from the quantized linear-predicted parameters â. The LP residue R[n], the mode M, and the quantized LP parameters â are provided to the residue quantization module 212. Based upon these values, the residue quantization module 212 produces a residue index IR and a quantized residual signal R̂[n].
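LP analysis of the kind performed by module 206 is commonly implemented with the autocorrelation method and the Levinson-Durbin recursion. The sketch below is a generic textbook version under the sign convention of the inverse filter A(z) described later (prediction s[n] ≈ Σ a_i·s[n−i]); it is not this patent's specific implementation:

```python
def autocorr(x, max_lag):
    """Biased autocorrelation R[0..max_lag] of a (windowed) frame x."""
    n = len(x)
    return [sum(x[i] * x[i - k] for i in range(k, n)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for LP coefficients a[1..order].

    r: autocorrelation sequence, r[0] > 0. Returns (coefficients, final
    prediction-error energy)."""
    a = [0.0] * (order + 1)
    e = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for order m.
        k = (r[m] - sum(a[j] * r[m - j] for j in range(1, m))) / e
        new_a = a[:]
        new_a[m] = k
        for j in range(1, m):
            new_a[j] = a[j] - k * a[m - j]
        a = new_a
        e *= (1.0 - k * k)  # error energy shrinks at each order
    return a[1:], e
```

For example, the autocorrelation sequence [1.0, 0.5, 0.25] of a first-order process yields a1 = 0.5, a2 = 0 at order 2, with residual energy 0.75.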
In FIG. 4, a decoder 300 that may be used in a speech coder includes an LP parameter decoding module 302, a residue decoding module 304, a mode decoding module 306, and an LP synthesis filter 308. The mode decoding module 306 receives and decodes the mode index IM, generating therefrom the mode M. The LP parameter decoding module 302 receives the mode M and the LP index ILP. The LP parameter decoding module 302 decodes the received values to produce the quantized LP parameters â. The residue decoding module 304 receives the residue index IR, the pitch index IP, and the mode index IM. The residue decoding module 304 decodes the received values to generate a quantized residual signal R̂[n]. The quantized residual signal R̂[n] and the quantized LP parameters â are provided to the LP synthesis filter 308, which synthesizes from them a decoded output speech signal ŝ[n].
Operation and implementation of the various modules of the encoder 200 of FIG. 3 and the decoder 300 of FIG. 4 are known in the art and described in the aforementioned U.S. Patent No. 5,414,796 and in Digital Processing of Speech Signals (L. B. Rabiner and R. W. Schafer, 1978, pages 396-453).
As illustrated in the flow chart of FIG. 5, a speech coder in accordance with one embodiment follows a set of steps in processing speech samples for transmission. In step 400 the speech coder receives digital samples of a speech signal in successive frames. Upon receiving a given frame, the speech coder proceeds to step 402. In step 402 the speech coder detects the energy of the frame. The energy is a measure of the speech activity of the frame. Speech detection is performed by summing the squares of the amplitudes of the digitized speech samples and comparing the resultant energy against a threshold value. In one embodiment the threshold value adapts based on the changing level of background noise. An exemplary variable-threshold speech activity detector is described in the aforementioned U.S. Patent No. 5,414,796. Some unvoiced speech sounds can be extremely low-energy samples that may be mistakenly encoded as background noise. To prevent this from occurring, the spectral tilt of low-energy samples may be used to distinguish the unvoiced speech from background noise, as described in the aforementioned U.S. Patent No. 5,414,796.
After detecting the energy of the frame, the speech coder proceeds to step 404. In step 404 the speech coder determines whether the detected frame energy is sufficient to classify the frame as containing speech information. If the detected frame energy falls below a predefined threshold level, the speech coder proceeds to step 406. In step 406 the speech coder encodes the frame as background noise (i.e., nonspeech, or silence). In one embodiment the background noise frame is encoded at 1/8 rate, or 1 Kbps. If in step 404 the detected frame energy meets or exceeds the predefined threshold level, the frame is classified as speech and the speech coder proceeds to step 408.
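The energy test of steps 402-404 can be sketched as follows. The fixed threshold and the sample values are hypothetical; as noted above, a practical detector would adapt the threshold to the background-noise level:

```python
def frame_energy(samples):
    """Sum of squared sample amplitudes: the energy measure of step 402."""
    return sum(s * s for s in samples)

def is_speech(samples, threshold):
    """Step 404: classify the frame as speech if its energy meets the
    threshold. A fixed threshold is a simplification; a real coder
    adapts it to the changing background-noise level."""
    return frame_energy(samples) >= threshold

quiet = [10, -12, 8, -9] * 40        # 160-sample low-energy (noise-like) frame
loud = [900, -1100, 1000, -950] * 40  # 160-sample speech-like frame
print(is_speech(quiet, 1_000_000), is_speech(loud, 1_000_000))  # False True
```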
In step 408 the speech coder determines whether the frame is unvoiced speech, i.e., the speech coder examines the periodicity of the frame. Various known methods of periodicity determination include, e.g., the use of zero crossings and the use of normalized autocorrelation functions (NACFs). In particular, using zero crossings and NACFs to detect periodicity is described in the aforementioned U.S. Patent No. 5,911,128 and U.S. Patent Application Serial No. 09/217,341. In addition, the above methods used to distinguish voiced speech from unvoiced speech are incorporated into the Telecommunications Industry Association interim standards TIA/EIA IS-127 and TIA/EIA IS-733. If the frame is determined to be unvoiced speech in step 408, the speech coder proceeds to step 410. In step 410 the speech coder encodes the frame as unvoiced speech. In one embodiment unvoiced speech frames are encoded at 1/4 rate, or 2.6 Kbps. If in step 408 the frame is not determined to be unvoiced speech, the speech coder proceeds to step 412.
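The two periodicity measures named above can be illustrated with simplified single-lag sketches. These are toy variants (a production coder searches a range of lags and normalizes the NACF over both correlated segments); they are not the formulations of the cited patents or standards:

```python
def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose signs differ.
    High values suggest noise-like (unvoiced) content."""
    crossings = sum(1 for a, b in zip(x, x[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(x) - 1)

def nacf(x, lag):
    """Simplified normalized autocorrelation at one candidate pitch lag.
    Values near 1 indicate strong periodicity (voiced speech)."""
    num = sum(x[i] * x[i - lag] for i in range(lag, len(x)))
    den = sum(v * v for v in x) or 1.0
    return num / den

x = [1, 0, -1, 0] * 20  # toy signal with an exact period of 4 samples
print(nacf(x, 4))       # 0.95 (high: periodic, i.e., voiced-like)
```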
In step 412 the speech coder determines whether the frame is transition speech, using periodicity detection methods that are known in the art, as described in, e.g., the aforementioned U.S. Patent No. 5,911,128. If the frame is determined to be transition speech, the speech coder proceeds to step 414. In step 414 the frame is encoded as transition speech (i.e., a transition from unvoiced speech to voiced speech). In one embodiment the transition speech frame is encoded in accordance with a multipulse coding method described in U.S. Patent Application Serial No. 09/307,294, entitled "MULTIPULSE INTERPOLATIVE CODING OF TRANSITION SPEECH FRAMES," filed May 7, 1999, assigned to the assignee of the present invention and incorporated herein by reference. In another embodiment the transition speech frame is encoded at full rate, or 13.2 Kbps.
If in step 412 the speech coder determines that the frame is not transition speech, it proceeds to step 416. In step 416 the speech coder encodes the frame as voiced speech. In one embodiment voiced speech frames may be encoded at half rate, or 6.2 Kbps. It is also possible to encode voiced speech frames at full rate, or 13.2 Kbps (or at full rate, 8 Kbps, in an 8K CELP coder). Those skilled in the art would appreciate, however, that coding voiced frames at half rate allows the coder to save valuable bandwidth by exploiting the steady-state nature of voiced frames. Further, regardless of the rate used to encode the voiced speech, the voiced speech is advantageously coded using information from past frames, and is hence said to be coded predictively.
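The decision cascade of FIG. 5 (steps 402-416) can be summarized in code. The rates are the ones given in the text; the boolean inputs stand in for the energy, periodicity, and transition detectors described above, and the fixed threshold is hypothetical:

```python
def classify_frame(energy, noise_threshold, unvoiced, transition):
    """Open-loop rate decision mirroring the flow of FIG. 5.

    Returns (frame class, encoding rate in bps). The detector outcomes
    are passed in as booleans for illustration."""
    if energy < noise_threshold:
        return ("background_noise", 1000)  # step 406: 1/8 rate
    if unvoiced:
        return ("unvoiced", 2600)          # step 410: 1/4 rate
    if transition:
        return ("transition", 13200)       # step 414: full rate
    return ("voiced", 6200)                # step 416: half rate

print(classify_frame(5.0, 10.0, False, False))  # ('background_noise', 1000)
print(classify_frame(50.0, 10.0, False, True))  # ('transition', 13200)
```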
Those of skill would appreciate that either the speech signal or the corresponding LP residue may be encoded by following the steps shown in FIG. 5. The waveform characteristics of noise, unvoiced, transition, and voiced speech can be seen as a function of time in the graph of FIG. 6A. The waveform characteristics of the LP residue of noise, unvoiced, transition, and voiced speech can be seen as a function of time in the graph of FIG. 6B.
In one embodiment a prototype pitch period (PPP) speech coder 500 includes an inverse LP filter 502, a prototype extractor 504, a prototype quantizer 506, a prototype dequantizer 508, an interpolation/synthesis module 510, and an LPC synthesis module 512, as illustrated in FIG. 7. The speech coder 500 may advantageously be implemented as part of a DSP, and may reside in, e.g., a subscriber unit or base station in a PCS or cellular telephone system, or in a subscriber unit or gateway in a satellite system.
In the speech coder 500, a digitized speech signal s(n), where n is the frame number, is provided to the inverse LP filter 502. In one embodiment the frame length is 20 ms. The transfer function of the inverse filter A(z) is computed in accordance with the following equation:
A(z) = 1 - a1z^-1 - a2z^-2 - ... - apz^-p,
in which the coefficients a1, ..., ap are filter taps having predefined values chosen in accordance with known methods, as described in the aforementioned U.S. Patent No. 5,414,796 and U.S. Patent Application Serial No. 09/217,494, previously incorporated by reference. The number p indicates the number of previous samples the inverse LP filter 502 uses for prediction purposes. In a particular embodiment, p is set to ten.
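A direct-form rendering of the inverse filter A(z) above, written as a sketch that assumes a zero initial filter state (i.e., samples before the start of the buffer are taken as zero):

```python
def inverse_lp_filter(s, a):
    """Apply A(z) = 1 - a1*z^-1 - ... - ap*z^-p to speech samples s,
    producing the LP residual r(n) = s(n) - sum_i a[i]*s(n-1-i).

    a: filter taps [a1, ..., ap]; samples before index 0 are assumed zero."""
    p = len(a)
    r = []
    for n in range(len(s)):
        pred = sum(a[i] * s[n - 1 - i] for i in range(p) if n - 1 - i >= 0)
        r.append(s[n] - pred)
    return r

print(inverse_lp_filter([1, 2, 3], [0.5]))  # [1, 1.5, 2.0]
```

With p = 10, as in the embodiment described above, `a` would hold the ten quantized LP taps for the frame.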
The inverse LP filter 502 provides an LP residual signal r(n) to the prototype extractor 504. The prototype extractor 504 extracts a prototype from the current frame. The prototype is a portion of the current frame that will be linearly interpolated by the interpolation/synthesis module 510 with prototypes from previous frames that were similarly positioned within the frame, in order to reconstruct the LP residual signal at the decoder.
The prototype extractor 504 provides the prototype to the prototype quantizer 506, which may quantize the prototype in accordance with any quantization technique known in the art. The quantized values, which may be obtained from a lookup table (not shown), are assembled into a packet, which includes lag and other codebook parameters, for transmission over the communication channel. The packet is provided to a transmitter (not shown) and transmitted over the channel to a receiver (also not shown). The inverse LP filter 502, the prototype extractor 504, and the prototype quantizer 506 together perform PPP analysis on the current frame.
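A hypothetical extractor in the spirit of module 504 might simply take the final pitch period of the frame's LP residual as the prototype; real extractors choose the period boundaries more carefully, so the positioning below is our assumption, not the patent's method:

```python
def extract_prototype(residual, pitch_lag):
    """Return the last full pitch period of the frame's LP residual.

    pitch_lag: the lag value (in samples) from the pitch estimator.
    Taking the trailing period is a simplifying assumption for
    illustration only."""
    if not 0 < pitch_lag <= len(residual):
        raise ValueError("pitch lag must fit within the frame")
    return residual[-pitch_lag:]

frame = list(range(160))            # stand-in for a 160-sample LP residual
proto = extract_prototype(frame, 40)  # assumed 40-sample pitch period
print(len(proto), proto[0])           # 40 120
```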
The receiver receives the packet and provides the packet to the prototype dequantizer 508. The prototype dequantizer 508 may dequantize the packet in accordance with any of various known techniques. The prototype dequantizer 508 provides the dequantized prototype to the interpolation/synthesis module 510. The interpolation/synthesis module 510 interpolates the prototype with prototypes from previous frames that were similarly positioned within the frame, in order to reconstruct the LP residual signal for the current frame. The interpolation and frame synthesis are advantageously accomplished in accordance with known methods described in U.S. Patent No. 5,884,254 and in the aforementioned U.S. Patent Application Serial No. 09/217,494.
The interpolation/synthesis module 510 provides the reconstructed LP residual signal r̂(n) to the LPC synthesis module 512. The LPC synthesis module 512 also receives line spectral pair (LSP) values from the transmitted packet, and performs LPC filtering on the reconstructed LP residual signal r̂(n) to produce the reconstructed speech signal ŝ(n) for the current frame. In an alternate embodiment, LPC synthesis of the prototype may be performed first, yielding a speech-signal prototype, prior to the interpolation/synthesis for the current frame. The prototype dequantizer 508, the interpolation/synthesis module 510, and the LPC synthesis module 512 together perform PPP synthesis of the current frame.
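The LPC synthesis filter 1/A(z) applied by module 512 is the exact inverse of the analysis filter described earlier: s(n) = r(n) + a1*s(n-1) + ... + ap*s(n-p). A minimal sketch, again assuming a zero initial filter state:

```python
def lpc_synthesis(residual, a):
    """Synthesis filter 1/A(z): s(n) = r(n) + sum_i a[i]*s(n-1-i).

    Feeding a residual produced by the matching inverse LP filter
    through this function (same taps a) reconstructs the original
    samples exactly, since the two filters cancel."""
    p = len(a)
    s = []
    for n in range(len(residual)):
        pred = sum(a[i] * s[n - 1 - i] for i in range(p) if n - 1 - i >= 0)
        s.append(residual[n] + pred)
    return s

# The residual [1, 1.5, 2.0] with tap a1 = 0.5 resynthesizes [1, 2.0, 3.0].
print(lpc_synthesis([1, 1.5, 2.0], [0.5]))  # [1, 2.0, 3.0]
```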
In one aspect, a PPP speech coder, such as the speech coder 500 of FIG. 7, identifies a number of frequency bands B for which linear phase shifts are to be calculated. It is advantageous to intelligently subsample the phase prior to quantization in accordance with the methods and apparatus described in the related U.S. application entitled "Method and Apparatus for Subsampling Phase Spectrum Information," filed concurrently herewith and assigned to the assignee of the present invention. Advantageously, the speech coder divides the discrete Fourier series (DFS) vector of the prototype of the frame being processed into a small number of frequency bands of variable width, according to the importance of the harmonic amplitudes across the entire DFS vector, thereby proportionally reducing the required quantization. The entire frequency range from 0 Hz to Fm Hz (Fm being the highest frequency of the prototype being processed) is divided into L segments. The number of harmonics present, M, is equal to Fm/Fo, where Fo Hz is the fundamental frequency. Hence the prototype DFS vector, which consists of an amplitude vector and a phase vector, has M elements. The speech coder pre-allocates b1, b2, b3, …, bL bands to the L segments such that b1 + b2 + b3 + … + bL equals the total number of required bands, B. Accordingly, segment 1 has b1 bands, segment 2 has b2 bands, …, and segment L has bL bands, so that the entire frequency range is covered by B bands. In one embodiment, the entire frequency range is from zero to 4000 Hz, the range of human speech.
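The pre-allocation just described can be sketched in Python. The helper name `allocate_bands`, the equal-width segmentation of the 0..Fm Hz range, and the dictionary layout are illustrative assumptions and not part of the patent text:

```python
def allocate_bands(per_segment_bands, fm_hz):
    """Split 0..fm_hz into L segments (equal-width here, for simplicity)
    and record how many bands each segment is pre-allocated; the per-segment
    counts b1..bL must sum to the total band count B."""
    L = len(per_segment_bands)
    seg_width = fm_hz / L
    segments = []
    for i, b_i in enumerate(per_segment_bands):
        segments.append({
            "lo_hz": i * seg_width,        # segment lower edge
            "hi_hz": (i + 1) * seg_width,  # segment upper edge
            "num_bands": b_i,              # bands allocated to this segment
        })
    total_b = sum(per_segment_bands)       # B = b1 + b2 + ... + bL
    return segments, total_b
```

For example, allocating [2, 3, 1] bands over 0–4000 Hz yields three segments and B = 6 total bands.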
In one embodiment, the bi bands are uniformly allocated within the ith of the L segments. This is done by dividing the frequency range of the ith segment into bi equal parts. Accordingly, the 1st segment is divided into b1 equal frequency bands, the 2nd segment into b2 equal frequency bands, …, and the Lth segment into bL equal frequency bands.
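A minimal sketch of the uniform allocation within one segment (the function name and the edge-list return convention are assumptions):

```python
def uniform_band_edges(lo_hz, hi_hz, b_i):
    """Divide one segment's frequency range into b_i equal bands,
    returning the b_i + 1 band-edge frequencies in Hz."""
    width = (hi_hz - lo_hz) / b_i
    return [lo_hz + k * width for k in range(b_i + 1)]
```

Dividing a 0–1000 Hz segment into 4 bands, for instance, yields edges at 0, 250, 500, 750, and 1000 Hz.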
In another embodiment, a fixed set of non-uniformly placed band edges is selected for the bi bands in the ith segment. This may be achieved by selecting an arbitrary set of bi bands over the ith segment, or by examining an average energy histogram of the ith segment. A region of high energy concentration may warrant a narrow frequency band, while a region of low energy concentration may be covered by a wider band. Thus, segment 1 is divided into b1 fixed unequal frequency bands, segment 2 into b2 fixed unequal frequency bands, …, and segment L into bL fixed unequal frequency bands.
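One plausible reading of the histogram-driven placement is to position the edges so that each band holds roughly equal total energy, which automatically makes high-energy regions receive narrower bands. The sketch below is a hypothetical interpretation of that idea, not the patented procedure, and the helper name is assumed:

```python
def energy_weighted_edges(energy_hist, lo_hz, hi_hz, b_i):
    """Place b_i bands over [lo_hz, hi_hz] so each band captures roughly
    1/b_i of the total energy in energy_hist (one histogram bin per entry).
    High-energy regions thus end up with narrower bands."""
    n = len(energy_hist)
    bin_hz = (hi_hz - lo_hz) / n
    total = sum(energy_hist)
    edges = [lo_hz]
    acc, target, k = 0.0, total / b_i, 1
    for j, e in enumerate(energy_hist):
        acc += e
        # emit an interior edge each time another 1/b_i of the energy is seen
        if acc >= k * target and k < b_i:
            edges.append(lo_hz + (j + 1) * bin_hz)
            k += 1
    edges.append(hi_hz)
    return edges
```

With a histogram whose energy is concentrated in the middle, the middle band comes out narrowest, matching the intuition in the text.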
In another embodiment, a variable set of band edges is selected for the bi bands in each segment. This is done by using a target band width equal to a suitably low value Fb Hz as a starting point. The following steps are then performed. A counter n is set to 1. The amplitude vector is then searched to find the frequency Fmb Hz of the maximum amplitude value and the corresponding harmonic number mb (equal to Fmb/Fo). The search excludes all ranges covered by previously set band edges (those of iterations 1 through n-1). The band edges of the nth of the bi bands are then set to the harmonic numbers mb - Fb/Fo/2 and mb + Fb/Fo/2, corresponding to the frequencies Fmb - Fb/2 and Fmb + Fb/2 Hz. The counter n is then incremented, and the searching and edge-setting steps are repeated until the count n exceeds bi. Thus, segment 1 is divided into b1 variable unequal frequency bands, segment 2 into b2 variable unequal frequency bands, …, and segment L into bL variable unequal frequency bands.
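The iterative peak-centered search above can be sketched as follows, working in harmonic numbers rather than Hz. The clamping at the ends of the amplitude vector, the rounding of the half-width, and the tie-breaking behavior of `max` are simplifying assumptions:

```python
def variable_band_edges(amplitudes, fo_hz, fb_hz, b_i):
    """Sketch of the iterative search: repeatedly find the largest harmonic
    amplitude outside already-covered bands and center a band of width
    fb_hz (i.e. fb_hz / fo_hz harmonics) on that harmonic."""
    half = max(1, round(fb_hz / fo_hz / 2))  # half band width, in harmonics
    covered = set()
    bands = []
    for _ in range(b_i):
        # peak search excluding harmonics inside previously set bands
        candidates = [m for m in range(len(amplitudes)) if m not in covered]
        if not candidates:
            break
        mb = max(candidates, key=lambda m: amplitudes[m])
        lo = max(0, mb - half)                     # mb - Fb/Fo/2, clamped
        hi = min(len(amplitudes) - 1, mb + half)   # mb + Fb/Fo/2, clamped
        bands.append((lo, hi))  # band edges expressed as harmonic numbers
        covered.update(range(lo, hi + 1))
    return sorted(bands)
```

With Fo = 100 Hz and Fb = 200 Hz, each band spans one harmonic on either side of its peak.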
In the embodiment just described, the bands are further refined to remove gaps between adjacent band edges. In one embodiment, the right edge of the lower band and the left edge of the nearest higher band are extended to meet each other in the middle of the gap between the two edges (the first band, located to the left of the second band, being lower in frequency than the second band). One way to achieve this is to set both edges to the average of their frequencies (and corresponding harmonic numbers). In another embodiment, either the right edge of the lower band or the left edge of the nearest higher band is set equal in frequency to the other (or to an adjacent harmonic number of the other). Which edge is moved may depend on the energy contained in the band ending at the right edge and the band beginning at the left edge: the edge belonging to the band with the higher energy remains unchanged, while the other edge is moved. Alternatively, the edge belonging to the band whose energy is more highly localized at the band center is moved, while the other edge remains unchanged. In another embodiment, the right edge and the left edge are both shifted, by unequal distances in frequency and harmonic number in the ratio x:y, where x and y are the band energies of the band beginning at the left edge and the band ending at the right edge, respectively. Alternatively, x and y may be the ratio of the center harmonic energy to the total energy of the band ending at the right edge and the ratio of the center harmonic energy to the total energy of the band beginning at the left edge, respectively.
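Two of the gap-closing rules above can be sketched with assumed helper names. Edges are in Hz; `right_edge_hz` is the lower band's right edge and `left_edge_hz` is the higher band's left edge, so the gap lies between them:

```python
def close_gap_midpoint(right_edge_hz, left_edge_hz):
    """Midpoint rule: both edges move to the middle of the gap, i.e. to
    the average of the two original edge frequencies."""
    mid = (right_edge_hz + left_edge_hz) / 2.0
    return mid, mid  # new (right edge of lower band, left edge of higher band)

def close_gap_ratio(right_edge_hz, left_edge_hz, x, y):
    """Ratio rule (one reading of the text): the two edges meet at a point
    that splits the gap in the ratio x:y, where x and y are the energies of
    the band beginning at the left edge and the band ending at the right
    edge, respectively."""
    gap = left_edge_hz - right_edge_hz
    meet = right_edge_hz + gap * x / (x + y)
    return meet, meet
```

For a gap between 900 Hz and 1100 Hz, the midpoint rule joins the edges at 1000 Hz, while a 1:3 energy ratio joins them at 950 Hz.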
In another embodiment, some of the L segments of the DFS vector may use uniformly allocated frequency bands, others may use fixed non-uniform frequency bands, and still others may use variable non-uniform frequency bands.
In one embodiment, a PPP speech coder, such as the speech coder 500 in FIG. 7, performs the algorithm steps illustrated in the flowchart of FIG. 8 to identify frequency bands in a discrete Fourier series (DFS) representation of a prototype pitch period. These frequency bands are identified in order to compute band alignments, or linear phase shifts, relative to a reference prototype DFS.
In step 600, the speech coder begins the process of identifying frequency bands. The apparatus then proceeds to step 602. In step 602, the speech coder computes the DFS of the prototype at the fundamental frequency Fo. The apparatus then proceeds to step 604. In step 604, the speech coder divides the frequency range into L segments. In one embodiment, the frequency range is from zero to 4000 Hz, the range of human speech. The apparatus then proceeds to step 606.
In step 606, the speech coder allocates b1, b2, …, bL bands to the L segments such that b1 + b2 + … + bL equals the total number B of bands for which linear phase shifts are to be calculated. The apparatus then proceeds to step 608. In step 608, the speech coder sets the segment count i equal to 1. The apparatus then proceeds to step 610. In step 610, the speech coder selects an allocation method for allocating frequency bands within the segment. The apparatus then proceeds to step 612.
In step 612, the speech encoding apparatus determines whether the band allocation method of step 610 allocates bands uniformly within the segment. If the band allocation method of step 610 evenly allocates the band within the segment, the apparatus proceeds to step 614. Otherwise, the band allocation method of step 610 does not allocate the band uniformly within the segment, and the device proceeds to step 616.
In step 614, the speech coder divides the ith segment into bi equal bands. The apparatus then proceeds to step 618. In step 618, the speech coder increments the segment count i. The apparatus then proceeds to step 620. In step 620, the speech coder determines whether the segment count i is greater than L. If the segment count i is greater than L, the apparatus proceeds to step 622. Otherwise, if the segment count i is not greater than L, the apparatus returns to step 610 to select the band allocation method for the next segment. In step 622, the speech coder exits the band identification algorithm.
In step 616, the speech coder determines whether the band allocation method of step 610 allocates fixed non-uniform bands within the segment. If the band allocation method of step 610 allocates fixed non-uniform bands within the segment, the apparatus proceeds to step 624. Otherwise, the band allocation method of step 610 does not allocate fixed non-uniform bands within the segment, and the apparatus proceeds to step 626.
In step 624, the speech coder divides the ith segment into bi preset unequal frequency bands. This may be done in the manner described above. The speech coder then proceeds to step 618 to increment the segment count i, and continues allocating frequency bands for each segment until the entire frequency range has been allocated bands.
In step 626, the speech coder sets the band count n equal to 1 and the starting band width equal to Fb Hz. The apparatus then proceeds to step 628. In step 628, the speech coder excludes the amplitudes covered by bands 1 through n-1. The apparatus then proceeds to step 630. In step 630, the speech coder sorts the remaining amplitude values. The apparatus then proceeds to step 632.
In step 632, the speech coder determines the harmonic number mb at which the largest remaining amplitude occurs. The apparatus then proceeds to step 634. In step 634, the speech coder sets the band edges around mb such that the total number of harmonics contained between the band edges equals Fb/Fo. The apparatus then proceeds to step 636.
In step 636, the speech coder moves the band edges of adjacent bands to fill any band gaps. The apparatus then proceeds to step 638. In step 638, the speech coder increments the band count n. The apparatus then proceeds to step 640. In step 640, the speech coder determines whether the band count n is greater than bi. If the band count n is greater than bi, the apparatus proceeds to step 618 to increment the segment count i and allocate bands for each segment until the entire frequency range has been allocated bands. Otherwise, if the band count n is not greater than bi, the apparatus returns to step 628 to establish the width of the next band in the segment.
Thus, a novel band identification method and apparatus for calculating linear phase shifts between frame prototypes in a speech coder has been described. Those of skill in the art will understand that the various illustrative logical blocks and algorithm steps described in connection with the embodiments disclosed herein may be implemented or performed with a digital signal processor (DSP), an application specific integrated circuit (ASIC), discrete gate or transistor logic, discrete hardware components such as registers and FIFOs, a processor executing a set of firmware instructions, or any conventional programmable software module and a processor. The processor is advantageously a microprocessor, but in the alternative, the processor may be any conventional processor, microcontroller, or state machine. The software module could reside in RAM memory, flash memory, registers, or any other form of writable storage medium known in the art. Those of skill would further appreciate that the data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description are advantageously represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Thus, the preferred embodiments of the present invention have been described. However, those of ordinary skill in the art will appreciate that numerous modifications may be made to the embodiments disclosed herein without departing from the spirit and scope of the present invention. Accordingly, the invention is not limited except as by the following claims.
Claims (18)
1. A method for partitioning a frequency spectrum of a prototype of a frame, comprising the steps of:
dividing a frequency spectrum into a plurality of segments;
allocating a plurality of frequency bands to each segment;
establishing a bandwidth group for a plurality of frequency bands for each segment;
wherein the establishing step comprises the step of allocating a variable bandwidth to a plurality of frequency bands within a particular segment;
wherein the step of allocating a variable bandwidth to a plurality of frequency bands within a particular segment comprises the steps of:
setting a target bandwidth;
searching the amplitude vector of the prototype for each band, excluding any search range covered by previously established band edges, to determine the harmonic number of maximum amplitude within the band;
determining band edge locations around the maximum-amplitude harmonic number for each band such that the total number of harmonics located between the band edges is equal to the target bandwidth divided by the fundamental frequency; and
eliminating gaps at the edges of adjacent bands.
2. The method of claim 1, wherein the eliminating step comprises the step of setting adjacent band edges enclosing the gap for each gap to be equal to the average frequency of the original 2 adjacent band edges.
3. The method of claim 1, wherein the eliminating step comprises the step of setting, for each gap, the adjacent band edge corresponding to the band with smaller energy to be equal to the frequency value of the adjacent band edge corresponding to the band with larger energy.
4. The method of claim 1, wherein the eliminating step comprises the step of setting, for each gap, an adjacent band edge corresponding to a band with higher localized energy at the center of the band to be equal to a frequency value of the adjacent band edge corresponding to a band with lower localized energy at the center of the band.
5. The method of claim 1, wherein the eliminating step comprises the steps of: the frequency values of 2 adjacent band edges are adjusted for each gap, i.e. the frequency value of the adjacent band edge corresponding to the band with higher frequency is adjusted by the ratio of x to y relative to the frequency value of the adjacent band edge with lower frequency, wherein x is the band energy of the adjacent band with higher frequency and y is the band energy of the adjacent band with lower frequency.
6. The method of claim 1, wherein the eliminating step comprises the steps of: adjusting the frequency values of 2 adjacent band edges for each gap, i.e. adjusting the frequency value of the adjacent band edge corresponding to the higher frequency band relative to the adjustment of the frequency value of the adjacent band edge having a lower frequency by a ratio of x to y, where x is the ratio of the central harmonic energy of the lower frequency adjacent band to the total energy of the lower frequency adjacent band and y is the ratio of the central harmonic energy of the higher frequency adjacent band to the total energy of the higher frequency adjacent band.
7. A speech coder configured to partition a spectrum of frame prototypes, comprising:
means for dividing the frequency spectrum into a plurality of segments;
means for allocating a plurality of frequency bands to each segment;
means for establishing a bandwidth group of a plurality of frequency bands for each segment;
the establishing means comprises means for allocating a variable bandwidth to a plurality of frequency bands of a particular segment;
the allocating means for allocating a variable bandwidth to a plurality of frequency bands of a specific segment includes:
means for setting a target bandwidth;
finding means for searching the amplitude vector of the prototype for each frequency band, excluding any search range covered by previously established band edges, to determine the harmonic number of maximum amplitude within the band;
positioning means for positioning the band edges around the maximum-amplitude harmonic number such that the total number of harmonics located between the band edges is equal to the target bandwidth divided by the fundamental frequency; and
means for eliminating adjacent band edge gaps.
8. The speech coder of claim 7, wherein the means for eliminating comprises means for setting, for each gap, an adjacent band edge that encloses the gap to be equal to the average of the frequencies of the original 2 adjacent band edges.
9. The speech coder of claim 7, wherein the means for eliminating comprises means for setting, for each gap, an adjacent band edge corresponding to a band with lower energy to a frequency value equal to an adjacent band edge corresponding to a band with higher energy.
10. The speech coder of claim 7, wherein the means for eliminating comprises means for setting, for each gap, an adjacent band edge corresponding to a band with higher localized energy at the center of the band to be equal to a frequency value of an adjacent band edge corresponding to a band with lower localized energy at the center of the band.
11. The speech coder of claim 7, wherein the means for canceling comprises means for adjusting the frequency values of 2 adjacent band edges for each gap by adjusting the frequency values of the adjacent band edges relative to an adjustment of the frequency values of the adjacent band edges at lower frequencies by a ratio of x to y, where x is the band energy of the adjacent band at higher frequencies and y is the band energy of the adjacent band at lower frequencies.
12. The speech coder of claim 7, wherein the means for canceling comprises means for adjusting the frequency values of the 2 adjacent band edges for each gap by adjusting the frequency values of the adjacent band edges relative to the adjustment of the frequency values of the adjacent band edges at lower frequencies by a ratio of x to y, where x is the ratio of the center harmonic energy of the adjacent band at the lower frequency to the total energy of the adjacent band at the lower frequency and y is the ratio of the center harmonic energy of the adjacent band at the higher frequency to the total energy of the adjacent band at the higher frequency.
13. A speech coder, comprising:
a prototype extractor configured to extract a prototype from a frame processed by the voice coding apparatus;
a prototype quantizer coupled to the prototype extractor and configured to divide a frequency spectrum of the prototype into a plurality of segments, allocate a plurality of frequency bands to each segment, and establish a bandwidth group for the plurality of frequency bands for each segment;
the prototype quantizer is further configured to establish the set of bandwidths as variable bandwidths for the plurality of frequency bands within the particular segment;
the prototype quantizer is further configured to set the variable bandwidths by setting a target bandwidth; searching the amplitude vector of the prototype for each band, excluding any search range covered by previously established band edges, to determine the harmonic number of maximum amplitude within the band; determining band edge locations around the maximum-amplitude harmonic number for each band such that the total number of harmonics located between the band edges is equal to the target bandwidth divided by the fundamental frequency; and eliminating gaps at the edges of adjacent bands.
14. The speech coder of claim 13, wherein the prototype quantizer is further configured to eliminate gaps by setting adjacent band edges that encapsulate each gap to be equal to a frequency average of the original 2 adjacent band edges.
15. The speech coder of claim 13, wherein the prototype quantizer is further configured to set, for each gap, the adjacent band edge corresponding to the band with lesser energy to be equal to the frequency value of the adjacent band edge corresponding to the band with greater energy, thereby eliminating the gap.
16. The speech coder of claim 13, wherein the prototype quantizer is further configured to, for each gap, set an adjacent band edge corresponding to a band with higher localized energy at a center of the band to a value equal to a frequency value of the adjacent band edge corresponding to a band with lower localized energy at the center of the band, thereby eliminating the gap.
17. The speech coder of claim 13, wherein the prototype quantizer is further configured to adjust the frequency values of 2 adjacent band edges for each gap by adjusting the frequency values of the adjacent band edges relative to the adjustment of the frequency values of the adjacent band edges at lower frequencies by a ratio of x to y, wherein x is the band energy of the adjacent band at higher frequencies and y is the band energy of the adjacent band at lower frequencies, thereby eliminating the gap.
18. The speech coder of claim 13, wherein the prototype quantizer is further configured to adjust the frequency values of 2 adjacent band edges for each gap by adjusting the frequency values of the corresponding adjacent band edges for higher frequency bands relative to the adjustment of the frequency values of the adjacent band edges for lower frequency bands by a ratio of x to y, where x is the ratio of the center harmonic energy of the lower frequency adjacent band to the total energy of the lower frequency adjacent band and y is the ratio of the center harmonic energy of the higher frequency adjacent band to the total energy of the higher frequency adjacent band, thereby eliminating the gap.
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US09/356,861 US6434519B1 (en) | 1999-07-19 | 1999-07-19 | Method and apparatus for identifying frequency bands to compute linear phase shifts between frame prototypes in a speech coder |
| US09/356,861 | 1999-07-19 | ||
| PCT/US2000/019603 WO2001006494A1 (en) | 1999-07-19 | 2000-07-18 | Method and apparatus for identifying frequency bands to compute linear phase shifts between frame prototypes in a speech coder |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1058427A1 HK1058427A1 (en) | 2004-05-14 |
| HK1058427B true HK1058427B (en) | 2007-02-09 |