HK40022645B - Method and device for processing voice, computer readable storage medium and computer apparatus

Info

Publication number: HK40022645B
Application number: HK42020012869.2A
Authority: HK
Inventors: 梁俊斌
Original assignee: 腾讯科技（深圳）有限公司
Filing date: 2020-07-29
Publication date: 2023-11-17

Description

Speech processing method and apparatus, computer-readable storage medium, and computer device

Technical Field

This application relates to the field of speech processing technology, and in particular to a speech processing method, an apparatus, a computer-readable storage medium, and a computer device.

Background

During a voice call, speech is transmitted from the sending end to the receiving end over a network. When network quality is poor, speech data packets may be lost in transit, so the speech received at the receiving end stutters and becomes discontinuous, which degrades the listening experience.

In traditional packet-loss-resistance schemes, speech data packets are encoded with forward error correction (FEC) to obtain redundant packets, and the speech data packets and the redundant packets are sent to the receiving end together. If packets are lost, the receiving end can use the redundant packets to recover the complete speech at the loss positions, thereby resisting packet loss. The larger the FEC redundancy (that is, the ratio of the number of redundant packets to the number of speech data packets), the stronger the resistance to packet loss, but a large amount of additional bandwidth is consumed. If the FEC redundancy is small, effective error correction cannot be achieved.

Summary of the Invention

Based on this, it is necessary to provide a speech processing method, an apparatus, a computer-readable storage medium, and a computer device that address the technical problem of how to ensure effective error correction of speech data packets while consuming little bandwidth.

A speech processing method, comprising:

performing speech rate detection on acquired speech to obtain a speech rate value;

obtaining a forward error correction redundancy;

adjusting the forward error correction redundancy according to the speech rate value to obtain a target redundancy;

performing speech coding on the speech to obtain speech coding packets;

performing forward error correction coding on the speech coding packets according to the target redundancy to obtain redundant packets; and

sending the redundant packets and the speech coding packets to a receiving end.

A speech processing apparatus, the apparatus comprising:

a detection module, configured to perform speech rate detection on acquired speech to obtain a speech rate value;

an acquisition module, configured to obtain a forward error correction redundancy;

an adjustment module, configured to adjust the forward error correction redundancy according to the speech rate value to obtain a target redundancy;

a first coding module, configured to perform speech coding on the speech to obtain speech coding packets;

a second coding module, configured to perform forward error correction coding on the speech coding packets according to the target redundancy to obtain redundant packets; and

a sending module, configured to send the redundant packets and the speech coding packets to a receiving end.

A computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the steps of the speech processing method.

A computer device, comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the speech processing method.

With the above speech processing method, apparatus, computer-readable storage medium, and computer device, speech rate detection is performed on the speech, and the detected speech rate value is used to adjust the forward error correction redundancy. The adjusted target redundancy is then used to perform forward error correction coding on the speech coding packets to obtain redundant packets. When the speech rate is slow, a speech data packet carries less speech content; when the speech rate is fast, a speech data packet carries more speech content. Dynamically adjusting the forward error correction redundancy according to the speech rate value ensures that lost speech data packets can be effectively recovered, so that effective error correction of speech data packets is achieved while avoiding the consumption of a large amount of additional bandwidth.

Brief Description of the Drawings

Figure 1 is a diagram of the application environment of a speech processing method in one embodiment;

Figure 2 is a flowchart of a speech processing method in one embodiment;

Figure 3 is a schematic diagram of speech framing in one embodiment;

Figure 4 is a schematic diagram of forward error correction coding at the sending end in one embodiment;

Figure 5 is a flowchart of the steps for calculating the speech rate value in one embodiment;

Figure 6 is a flowchart, in one embodiment, of the steps of adjusting the forward error correction redundancy, comparing the resulting target redundancy with a redundancy upper limit and a redundancy lower limit, and performing forward error correction coding according to the comparison result;

Figure 7 is a flowchart, in one embodiment, in which the sending end adjusts the forward error correction redundancy and performs forward error correction coding, and the receiving end performs forward error correction decoding to recover the speech coding packets;

Figure 8 is a structural block diagram of a speech processing apparatus in one embodiment;

Figure 9 is a structural block diagram of a speech processing apparatus in another embodiment;

Figure 10 is a structural block diagram of a computer device in one embodiment.

Detailed Description

To make the objectives, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain this application and not to limit it.

Figure 1 is a diagram of the application environment of the speech processing method in one embodiment. Referring to Figure 1, the speech processing method is applied to a speech processing system. The speech processing system includes a terminal 110, a transmission node 120, and a terminal 130, which are connected through a network. Terminal 110 can serve as the sending end (and can also serve as the receiving end); it may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a laptop computer, and the like. Correspondingly, terminal 130 can serve as the receiving end (and can also serve as the sending end): when terminal 110 is the sending end, terminal 130 is the receiving end. Terminal 130 may likewise be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a laptop computer, and the like. Transmission node 120 may include switches (or routers) in the network and other transmission equipment, such as SDH (Synchronous Digital Hierarchy) or PTN (Packet Transport Network) equipment. Transmission node 120 may also be a communication base station, such as a 3G, 4G, or 5G base station or a base station of a later generation.

As shown in Figure 2, in one embodiment, a speech processing method is provided. This embodiment is described mainly by taking the application of the method to terminal 110 in Figure 1 as an example. Referring to Figure 2, the speech processing method specifically includes the following steps:

S202, perform speech rate detection on the acquired speech to obtain a speech rate value.

The speech may be speech uttered by a user during a voice or video call, or speech uttered during a live audio or video broadcast. The speech rate value indicates how fast the speaker talks, and speech rate values may differ between speakers. The speech rate value may be an average speech rate value or an instantaneous speech rate value.

In one embodiment, during a voice or video call, the terminal collects the speech uttered by the user through a microphone. For example, when the user makes a voice or video call with another person through an instant messaging application, the terminal collects the user's speech through its built-in microphone. The instant messaging application may include social applications and other applications used for instant messaging.

In one embodiment, during a live audio or video broadcast, the terminal collects the speech uttered by the user through a microphone. For example, when the user broadcasts live audio or video through live streaming software, the terminal collects the user's speech through its built-in microphone.

In one embodiment, the terminal performs phoneme detection on the collected speech to obtain a phoneme sequence, calculates the number of phonemes per unit time from the phoneme sequence, and determines the speech rate value from that number. The number of phonemes can be determined from jumps in the pitch period or the pitch frequency. For example, if the pitch period or pitch frequency jumps 20 times within a unit of time, it can be determined that there are 20 phonemes in that unit of time. Phonemes fall into two broad categories, vowels and consonants, and are the smallest units of speech divided according to the natural attributes of speech. They are analyzed according to the articulatory actions within a syllable, with one action constituting one phoneme. For example, the Chinese syllable a has one phoneme, ai has two phonemes, and dai has three phonemes.
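The counting rule in the preceding paragraph can be sketched as follows. This is an illustration only, under stated assumptions: the per-frame pitch values are taken as given, a pitch of 0 marks an unvoiced frame, and the function names and the 20 Hz jump threshold are hypothetical choices rather than values from this application.

```python
def count_pitch_jumps(pitch_hz, min_jump_hz=20.0):
    """Count frame-to-frame pitch changes of at least min_jump_hz.

    A pitch of 0 marks an unvoiced frame (assumption); such frames
    never contribute a jump.
    """
    jumps = 0
    for prev, cur in zip(pitch_hz, pitch_hz[1:]):
        if prev > 0 and cur > 0 and abs(cur - prev) >= min_jump_hz:
            jumps += 1
    return jumps


def speech_rate(pitch_hz, frame_seconds, min_jump_hz=20.0):
    """Return pitch jumps per second, used as a phonemes-per-second proxy."""
    duration = len(pitch_hz) * frame_seconds
    if duration <= 0:
        raise ValueError("need at least one frame of positive duration")
    return count_pitch_jumps(pitch_hz, min_jump_hz) / duration
```

With one pitch value per frame, `speech_rate(pitches, frame_seconds)` gives the number of jumps per second, which plays the role of the number of phonemes per unit time.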

In one embodiment, the terminal performs phoneme detection on the collected speech to obtain a phoneme sequence, converts the phoneme sequence into a corresponding character sequence, calculates the number of characters per unit time from the character sequence, and determines the speech rate value from that number.

In one embodiment, the terminal uses a window function to divide the collected speech into frames. Specifically, the collected speech may be framed by overlapping segmentation so that the transition between frames is smooth. The overlapping portion of one frame and the next is called the frame shift, and the ratio of the length of the frame shift to the frame length of a speech frame is usually in the range of 0 to 0.5. The window function may be a rectangular window, a Hanning window, a Hamming window, a Blackman window, or the like.

For example, let the speech be denoted s(n). Multiplying s(n) by a window function ω(n) yields the windowed speech S_ω(n) = s(n) × ω(n). As shown in Figure 3, the frame length of a speech frame is N and the length of the frame shift is M.
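A minimal sketch of overlapped framing with a Hamming window follows. It assumes, as in the text, that M is the overlap between adjacent frames, so each frame starts N - M samples after the previous one. The function names are hypothetical, and trailing samples that do not fill a whole frame are simply dropped.

```python
import math


def hamming(N):
    """Hamming window w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def split_frames(s, N, M):
    """Split samples s into frames of length N that overlap by M samples.

    Each frame is multiplied by the window: S_w(n) = s(n) * w(n).
    """
    if N < 2 or not 0 <= M < N:
        raise ValueError("require N >= 2 and 0 <= M < N")
    hop = N - M  # distance between the starts of adjacent frames
    w = hamming(N)
    return [
        [s[start + n] * w[n] for n in range(N)]
        for start in range(0, len(s) - N + 1, hop)
    ]
```

Choosing M/N between 0 and 0.5, as the text suggests, keeps at most half of each frame shared with its successor.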

In one embodiment, the terminal checks each speech frame to determine whether it contains speech content, and performs speech rate detection on the frames that do, obtaining a phoneme sequence. The terminal then calculates the number of phonemes per unit time from the phoneme sequence and determines the speech rate value from that number. Alternatively, the terminal converts the phoneme sequence into a corresponding character sequence, calculates the number of characters per unit time from the character sequence, and determines the speech rate value from that number.

S204, obtain the forward error correction redundancy.

The forward error correction redundancy is configured according to the packet loss rate. Forward error correction is an error control scheme in which redundant packets are transmitted along with the speech. When packets are lost or corrupted in transit, the receiving end can reconstruct the lost or corrupted part of the speech from the redundant packets. For example, before the speech is sent into the transmission channel, forward error correction coding is performed on the speech coding packets corresponding to the speech to obtain redundant packets that carry characteristics of the speech itself. The speech coding packets and the redundant packets are then transmitted to the receiving end together, and the receiving end decodes the received speech coding packets and redundant packets to identify and correct speech coding packets that were corrupted or lost in transit. The forward error correction redundancy may refer to the ratio of the number of redundant packets produced during forward error correction coding to the number of speech coding packets. The forward error correction redundancy may be configured according to the loss rate of the speech coding packets.

In one embodiment, when the terminal receives speech coding packets sent by the other party, it determines the packet loss rate from the received speech coding packets and configures the corresponding forward error correction redundancy. The higher the packet loss rate, the larger the configured forward error correction redundancy; the lower the packet loss rate, the smaller the configured forward error correction redundancy.

In one embodiment, the terminal may also predict the packet loss rate from the network quality and configure the corresponding forward error correction redundancy according to the predicted packet loss rate. Alternatively, the terminal may configure the forward error correction redundancy directly according to the network quality.

For example, when network quality is poor, the packet loss rate is usually high, and a larger forward error correction redundancy can be configured. When network quality is good, the packet loss rate is usually low, and a smaller forward error correction redundancy can be configured.
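One way to picture this configuration step is a monotonic mapping from packet loss rate to redundancy. The sketch below is purely illustrative: the text states only that a higher loss rate calls for a larger redundancy, so the linear form, the gain of 2.0, and the floor and ceiling values are hypothetical.

```python
def base_redundancy(loss_rate, gain=2.0, floor=0.0, ceiling=1.0):
    """Map a packet loss rate in [0, 1] to an FEC redundancy.

    The linear mapping and all default values are assumptions made for
    illustration; the only property taken from the text is that the
    result does not decrease as the loss rate grows.
    """
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must lie in [0, 1]")
    return min(ceiling, max(floor, gain * loss_rate))
```

For instance, under these assumed defaults a measured loss rate of 10% yields a redundancy of 0.2, that is, one redundant packet for every five speech coding packets.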

S206, adjust the forward error correction redundancy according to the speech rate value to obtain the target redundancy.

Speakers differ in language and speaking habits, so their speech rate values differ. When a speaker talks fast, the speech collected per unit time carries a large amount of information, that is, it contains many different phonemes within a given time. Losing even a small number of speech coding packets then causes many phonemes to be lost, and the information obtained at the receiving end is incomplete. Similarly, when a speaker talks slowly, the speech collected per unit time carries less information, that is, it contains fewer phonemes within a given time, most of which have similar characteristics. In that case, even if a small number of speech packets are lost, the user at the receiving end can still understand what the sender expressed from the remaining phonemes.

In one embodiment, when the speech rate value is large, the terminal may increase the forward error correction redundancy; when the speech rate value is small, the terminal may decrease the forward error correction redundancy, thereby obtaining the target redundancy.

In one embodiment, when the speech rate value is large, the terminal may obtain a corresponding first adjustment coefficient and use the product of the first adjustment coefficient and the forward error correction redundancy as the target redundancy. When the speech rate value is small, the terminal may obtain a corresponding second adjustment coefficient and use the product of the second adjustment coefficient and the forward error correction redundancy as the target redundancy.
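The coefficient-based adjustment can be sketched as below. The text does not say what counts as a large or small speech rate value or what the coefficients are, so the two thresholds and the coefficient values 1.5 and 0.5 are hypothetical placeholders.

```python
def target_redundancy(base, rate, fast_threshold, slow_threshold,
                      first_coeff=1.5, second_coeff=0.5):
    """Scale the FEC redundancy by a coefficient chosen from the speech rate.

    fast_threshold, slow_threshold, first_coeff, and second_coeff are
    illustrative assumptions, not values from the text.
    """
    if rate >= fast_threshold:
        return base * first_coeff   # fast speech: raise redundancy
    if rate <= slow_threshold:
        return base * second_coeff  # slow speech: lower redundancy
    return base                     # in between: leave unchanged
```

A first coefficient above 1 and a second coefficient below 1 reproduce the behavior described: more protection for fast speech, less bandwidth for slow speech.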

S208, perform speech coding on the speech to obtain speech coding packets.

In one embodiment, the terminal samples the collected speech at a sampling frequency greater than twice the highest frequency of the speech signal. The terminal then quantizes the sampled speech, using either uniform or non-uniform quantization; non-uniform quantization may use the μ-law or A-law companding algorithm. Finally, the terminal encodes the quantized speech and packs the resulting speech coding data into multiple speech coding packets. The coding methods include waveform coding (such as pulse code modulation), parametric coding, and hybrid coding.

When uniform quantization is used, the same quantization interval is applied to both large-amplitude and small-amplitude speech, so that large-amplitude speech is accommodated and quantization precision is ensured. When non-uniform quantization is used, a larger quantization interval is applied to large-amplitude speech and a smaller quantization interval to small-amplitude speech, so that fewer quantization bits can be used while precision is maintained.
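As an illustration of non-uniform quantization, the sketch below applies the standard μ-law companding curve with μ = 255 to samples normalized to [-1, 1] and then quantizes uniformly. The function names and the simple 8-bit rounding step are assumptions made for the example, not the coding scheme of this application.

```python
import math

MU = 255  # companding parameter commonly used with 8-bit mu-law


def mu_compress(x):
    """Compress x in [-1, 1]: sgn(x) * ln(1 + MU*|x|) / ln(1 + MU)."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_expand(y):
    """Invert mu_compress: sgn(y) * ((1 + MU)**|y| - 1) / MU."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize_8bit(y):
    """Uniformly quantize a compressed value in [-1, 1] to -127..127."""
    return round(y * 127)
```

Because compression stretches small amplitudes before the uniform step is applied, quiet samples are spread over many more quantization levels than they would be without companding, which is the effect the paragraph above describes.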

S210, perform forward error correction coding on the speech coding packets according to the target redundancy to obtain redundant packets.

In one embodiment, the terminal performs forward error correction coding on the speech coding packets according to the target redundancy to obtain redundant packets, the number of which is the product of the target redundancy and the number of speech coding packets.

For example, let the number of speech coding packets be k and the word length be w bits, where w may be 8, 16, or 32. The terminal performs forward error correction coding on the k speech coding packets according to the target redundancy and generates m redundant packets corresponding to the speech coding packets.
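The packet-count arithmetic can be sketched as follows. The text gives the number of redundant packets as the product of the target redundancy and k but does not say how a fractional product is handled; rounding up is an assumption made here, and the function name is hypothetical.

```python
import math


def redundant_packet_count(k, target_redundancy):
    """Return m, the number of redundant packets for k speech coding packets.

    Rounding a fractional product up is an assumption. The small epsilon
    keeps floating-point noise (e.g. 10 * 0.3) from adding a packet.
    """
    if k < 0 or target_redundancy < 0:
        raise ValueError("k and target_redundancy must be non-negative")
    return math.ceil(k * target_redundancy - 1e-9)
```

For k = 8 and a target redundancy of 0.375, this gives m = 3 redundant packets.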

S212, send the redundant packets and the speech coding packets to the receiving end.

In one embodiment, the terminal encapsulates the speech coding packets and the redundant packets using the Real-time Transport Protocol to obtain a speech data packet, and then sends that speech data packet to the receiving end.

The Real-time Transport Protocol (RTP) provides end-to-end delivery services with real-time characteristics for speech. RTP supports ordered delivery through sequence numbers: the receiving end can use them to reassemble the sender's packet sequence and to determine the proper position of each packet. The speech data packet is in RTP message format and consists of two parts, a header and a payload, where the payload carries the speech coding packets and the redundant packets.
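The header-plus-payload structure can be illustrated with the 12-byte fixed RTP header defined in RFC 3550. In the sketch below, the function name and the dynamic payload type 96 are assumptions, and the payload is treated as opaque bytes, because the text does not specify how speech coding packets and redundant packets are laid out inside it.

```python
import struct


def rtp_packet(payload, seq, timestamp, ssrc, payload_type=96, marker=False):
    """Prepend a 12-byte fixed RTP header (RFC 3550) to the payload bytes."""
    byte0 = 2 << 6  # version 2, no padding, no extension, zero CSRCs
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack(
        "!BBHII",                # network byte order
        byte0,
        byte1,
        seq & 0xFFFF,            # 16-bit sequence number
        timestamp & 0xFFFFFFFF,  # 32-bit media timestamp
        ssrc & 0xFFFFFFFF,       # 32-bit synchronization source id
    )
    return header + payload
```

The 16-bit sequence number is what lets the receiving end detect which positions in a group were lost and put the surviving packets back in order.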

For example, let the forward error correction redundancy for k speech coding packets be r/k, so that the number of redundant packets calculated from the target redundancy and the speech coding packets is r. Let D = (D_0, D_1, ..., D_{k-1}) denote the k speech coding packets and C = (C_1, C_2, ..., C_r) the r redundant packets. The speech data packet is then represented as

$$Y = (Y_0, Y_1, \ldots, Y_{n-1}), \qquad n = k + r,$$

where $Y_i = D_i$ $(0 \le i \le k-1)$ and $Y_j = C_{j-k+1}$ $(k \le j \le n-1)$. B is the n×k forward error correction generator matrix, composed of the k×k identity matrix I and an r×k matrix G, so the speech data packet can be expressed as:

$$Y = B \cdot D = \begin{bmatrix} I \\ G \end{bmatrix} \cdot D$$

As an example, as shown in Figure 4, the terminal encodes the collected speech to obtain speech coding packets p1-p8. After the forward error correction redundancy has been adjusted to obtain the target redundancy, the terminal performs forward error correction coding on speech coding packets p1-p8 as the target redundancy requires, obtaining redundant packets r1, r2, and r3. RTP encapsulation then yields a speech data packet containing speech coding packets p1-p8 and redundant packets r1-r3, which is sent to the receiving end over the network.

In one embodiment, the terminal may also receive a speech data packet sent from the receiving end, the speech data packet containing speech coding packets and redundant packets. If parsing the speech data packet reveals packet loss, the lost speech coding packets can be reconstructed from the remaining speech coding packets and the redundant packets to obtain the complete set of speech coding packets, which are then decoded to obtain the corresponding speech.

For example, if the receiving end receives any k of the packets in the speech data packet, it can extract the corresponding rows from the forward error correction generator matrix according to the positions of the received packets within the speech data packet, forming a new k×k matrix B'. Denoting the k received packets by Y' and the original k speech coding packets by D, this gives:

$$Y' = B' \cdot D$$

If B' is non-singular, the original speech coding packets are obtained through the following inverse transformation, which completes the recovery:

$$D = B'^{-1} \cdot Y'$$
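The encoding with a generator matrix made of I and G, and the recovery by inverting B', can be sketched end to end as below. Several choices are assumptions made for illustration: each packet is reduced to a single integer symbol, arithmetic is done modulo the prime 257 rather than in the GF(2^w) field that a word length of w bits suggests, G is taken to be a Cauchy matrix, and all function names are hypothetical. The sketch requires k + r <= 257.

```python
P = 257  # prime modulus; stands in for GF(2^w) in this illustration


def cauchy_matrix(k, r):
    """Return the r x k matrix G with G[i][j] = 1 / (x_i - y_j) mod P."""
    # With x_i = k + i and y_j = j all distinct, every square submatrix
    # of G is invertible, so any k rows of B = [I; G] are non-singular.
    return [[pow(k + i - j, -1, P) for j in range(k)] for i in range(r)]


def fec_encode(data, r):
    """Return Y = [I; G] * D: the k data symbols, then r redundant ones."""
    G = cauchy_matrix(len(data), r)
    redundant = [sum(g * d for g, d in zip(row, data)) % P for row in G]
    return list(data) + redundant


def solve_mod_p(A, b):
    """Solve A * x = b modulo P by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(i for i in range(col, n) if M[i][col])
        M[col], M[pivot] = M[pivot], M[col]
        inv = pow(M[col][col], -1, P)
        M[col] = [v * inv % P for v in M[col]]
        for i in range(n):
            if i != col and M[i][col]:
                factor = M[i][col]
                M[i] = [(a - factor * c) % P for a, c in zip(M[i], M[col])]
    return [row[n] for row in M]


def fec_decode(received, k, r):
    """Recover D from a dict {position: symbol} with at least k entries."""
    G = cauchy_matrix(k, r)
    positions = sorted(received)[:k]
    # Row p of B is a unit vector for p < k and row p - k of G otherwise.
    B_sub = [[int(j == p) for j in range(k)] if p < k else G[p - k]
             for p in positions]
    return solve_mod_p(B_sub, [received[p] for p in positions])
```

For the layout of Figure 4, `fec_encode` applied to 8 symbols with r = 3 returns 11 symbols, and `fec_decode` restores the original 8 from any 8 of them. A real implementation would apply the same matrix to every w-bit word of each packet rather than to one symbol per packet.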

In the above embodiments, speech rate detection is performed on the speech, and the detected speech rate value is used to adjust the forward error correction redundancy. The adjusted target redundancy is then used to perform forward error correction coding on the speech coding packets to obtain redundant packets. When the speech rate is slow, a speech data packet carries less speech content; when the speech rate is fast, a speech data packet carries more speech content. Dynamically adjusting the forward error correction redundancy according to the speech rate value ensures that lost speech data packets can be effectively recovered, so that effective error correction of speech data packets is achieved while avoiding the consumption of a large amount of additional bandwidth.

In one embodiment, as shown in Figure 5, S202 may specifically include:

S502, collect speech.

The speech may be speech collected through a microphone in the early stage of a voice or video call, or speech collected through a microphone in the early stage of a live audio or video broadcast.

In one embodiment, during a voice or video call made through an instant messaging application, the terminal collects the speech uttered by the user through a microphone. The instant messaging application may include social applications and other applications used for instant messaging.

In one embodiment, during a live audio or video broadcast made through live streaming software, the terminal collects the speech uttered by the user through a microphone.

In one embodiment, the terminal uses a window function to divide the collected speech into frames. Specifically, the collected speech may be framed by overlapping segmentation so that the transition between frames is smooth. The overlapping portion of one frame and the next is called the frame shift, and the ratio of the length of the frame shift to the frame length of a speech frame is usually in the range of 0 to 0.5.

In one embodiment, the terminal checks each speech frame to determine whether it contains speech content, and performs speech rate detection on the frames that do.

In one embodiment, determining whether each speech frame contains speech content may specifically include: the terminal performs pulse code modulation on each speech frame to obtain PCM speech data, inputs the PCM speech data into a VAD (voice activity detection) function, and obtains a speech flag as output. For example, an output flag of 0 indicates that the frame contains no speech content, and an output flag of 1 indicates that the frame contains speech content.
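The 0/1 flag interface can be illustrated with a simple energy-based detector. This is a stand-in technique, not the VAD function the text refers to, which is not specified; the function name and the energy threshold are hypothetical.

```python
def vad_flag(pcm_frame, energy_threshold=1.0e5):
    """Return 1 if the frame's mean energy reaches the threshold, else 0.

    pcm_frame holds signed PCM sample values. A plain energy test is an
    assumed substitute for the unspecified VAD function in the text.
    """
    if not pcm_frame:
        return 0
    energy = sum(x * x for x in pcm_frame) / len(pcm_frame)
    return 1 if energy >= energy_threshold else 0
```

Frames flagged 1 would be passed on to speech rate detection, and frames flagged 0 would be skipped.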

S504, identify a phoneme sequence from the speech.

Phonemes fall into two broad categories, vowels and consonants, and are the smallest units of speech divided according to the natural attributes of speech. They are analyzed according to the articulatory actions within a syllable, with one action constituting one phoneme. For example, the Chinese syllable a has one phoneme, ai has two phonemes, and dai has three phonemes.

In one embodiment, S504 may specifically include: performing pulse code modulation on the speech to obtain speech coding data; identifying, from the speech coding data, speech segments that contain speech content; and identifying a phoneme sequence from those speech segments.

In one embodiment, the step of performing pulse code modulation on the speech to obtain speech coding data may specifically include the following. The terminal samples the collected speech at a sampling frequency greater than twice the highest frequency of the speech signal. The terminal then quantizes the sampled speech, using either uniform or non-uniform quantization; non-uniform quantization may use the μ-law or A-law companding algorithm. Finally, the terminal encodes the quantized speech and packs the resulting speech coding data into multiple speech coding packets. The coding methods include waveform coding, parametric coding, and hybrid coding.

In another embodiment, S504 may specifically include: the terminal extracts speech features from the speech, decodes the speech features to obtain decoded speech features, and identifies a phoneme sequence from the decoded speech features.

The speech features may be the logarithmic power spectrum or the Mel-frequency cepstral coefficients of the speech.

在一个实施例中，终端对所采集的语音进行傅里叶变换，将时域下的语音转换为频域下的频谱。终端获取频谱对应的幅值，根据该幅值利用功率密度函数计算出功率谱。In one embodiment, the terminal performs a Fourier transform on the collected speech, converting the speech in the time domain into a spectrum in the frequency domain. The terminal obtains the amplitude corresponding to the spectrum and calculates the power spectrum using the power density function based on the amplitude.

For example, let the speech signal be f(t) and let its spectrum, obtained by applying a Fourier transform to f(t), be F_T(w). Substituting the magnitude of the spectrum into the power spectral density function yields the power spectrum of the speech.
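As a discrete illustration of this step, the sketch below computes a periodogram, P[k] = |X[k]|² / N, for one frame of samples and converts it to a log power spectrum. The periodogram is one common estimate and is an assumption here, not necessarily the density function intended by the embodiment; the naive DFT is used only to keep the example self-contained.

```python
import cmath
import math


def power_spectrum(frame):
    """Return the periodogram P[k] = |X[k]|^2 / N for k = 0 .. N//2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        # Naive DFT bin; a real implementation would use an FFT.
        xk = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(xk) ** 2 / n)
    return spectrum


def log_power_spectrum(frame, floor: float = 1e-12):
    """Log power spectrum in decibels; `floor` avoids taking log of zero."""
    return [10.0 * math.log10(max(p, floor)) for p in power_spectrum(frame)]
```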

Specifically, the terminal performs Viterbi decoding on the extracted speech features according to an adaptive acoustic model and identifies the phoneme sequence from the decoded speech features. In addition, the terminal may determine the start time and end time of each phoneme in the phoneme sequence.

S506: determine the speech rate value according to the transition frequency of phonemes in the phoneme sequence.

In one embodiment, S506 may specifically include: detecting, in the phoneme sequence, the number of transitions in the pitch period or pitch frequency of phonemes per unit time; and determining the speech rate value according to the number of transitions per unit time.

In one embodiment, the terminal determines whether the change in pitch period or pitch frequency between adjacent frames exceeds a preset pitch transition threshold. If so, the terminal determines that the pitch of the speech has changed significantly; if not, it determines that the pitch has not changed significantly. The pitch period and the pitch frequency are reciprocals of each other and can be converted into one another.

In one embodiment, the terminal performs pulse code modulation on each speech frame to obtain PCM speech data and feeds the PCM speech data into a pitch estimation function to obtain the pitch frequency of each speech frame. The pitch estimation function may be based on the time-domain autocorrelation function.
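A pitch estimation function based on time-domain autocorrelation could look roughly like the following sketch. The search range of 60 Hz to 400 Hz, the function name, and the convention of returning 0.0 when no peak is found are assumptions made for the example.

```python
import math


def estimate_pitch(frame, sample_rate, fmin: float = 60.0, fmax: float = 400.0):
    """Estimate the pitch frequency (Hz) of one frame by autocorrelation.

    fmin and fmax bound the lag search and are assumed values for speech.
    Returns 0.0 when no positive autocorrelation peak is found.
    """
    n = len(frame)
    lag_min = max(1, int(sample_rate / fmax))
    lag_max = min(n - 1, int(sample_rate / fmin))
    best_lag, best_corr = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag.
        corr = sum(frame[i] * frame[i + lag] for i in range(n - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    # The pitch period is best_lag samples; the frequency is its reciprocal.
    return sample_rate / best_lag if best_lag else 0.0
```

The lag with the largest autocorrelation is taken as the pitch period in samples, which matches the reciprocal relation between pitch period and pitch frequency stated above.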

In the above embodiments, a phoneme sequence is identified from the captured speech and the speech rate value is determined from the transition frequency of phonemes in that sequence, so that the forward error correction redundancy can be adjusted dynamically according to the speech rate value. This ensures that lost voice data packets can be effectively recovered, achieving effective error correction of the voice data packets while avoiding the consumption of a large amount of extra bandwidth.

In one embodiment, as shown in FIG. 6, S206 may specifically include:

S602: when the speech rate value is greater than the speech rate lower limit and less than the speech rate upper limit, calculate an adjustment parameter based on the speech rate value, and adjust the forward error correction redundancy according to the adjustment parameter to obtain the target redundancy.

In one embodiment, when the speech rate value lies between the speech rate lower limit and the speech rate upper limit, the terminal increases the forward error correction redundancy as the speech rate value increases and decreases it as the speech rate value decreases.

In one embodiment, the terminal inputs the speech rate value into a formula for adjusting the forward error correction redundancy; evaluating the formula computes the adjustment parameter and at the same time applies it to the forward error correction redundancy, yielding the target redundancy.

For example, the formula for adjusting the forward error correction redundancy may be defined for V₁ ≤ v ≤ V₂, where r′ is the adjusted target redundancy, r₀ is the forward error correction redundancy, the adjustment parameter is an expression in v and a constant c, v is the speech rate value, and V₁ and V₂ are the speech rate lower limit and the speech rate upper limit, respectively.

S604: compare the target redundancy with the redundancy upper limit and the redundancy lower limit, respectively.

For example, for V₁ ≤ v ≤ V₂, the target redundancy is compared with the redundancy upper limit R_max and the redundancy lower limit R_min, and the final target redundancy is determined from the comparison. When the target redundancy is less than the redundancy upper limit and greater than the redundancy lower limit, the target redundancy itself is taken as the final target redundancy and S606 is executed. When the target redundancy is less than the redundancy lower limit, the redundancy lower limit is taken as the final target redundancy and S608 is executed. When the target redundancy is greater than the redundancy upper limit, the redundancy upper limit is taken as the final target redundancy and S610 is executed.

S606: when the target redundancy is less than the redundancy upper limit and greater than the redundancy lower limit, perform forward error correction encoding on the speech coded packets according to the target redundancy to obtain redundant packets.

S608: when the target redundancy is less than the redundancy lower limit, perform forward error correction encoding on the speech coded packets according to the redundancy lower limit to obtain redundant packets.

In one embodiment, when the target redundancy is less than the redundancy lower limit, the terminal performs forward error correction encoding on the speech coded packets according to the redundancy lower limit to obtain redundant packets, the number of which is the product of the redundancy lower limit and the number of speech coded packets.

S610: when the target redundancy is greater than the redundancy upper limit, perform forward error correction encoding on the speech coded packets according to the redundancy upper limit to obtain redundant packets.

In one embodiment, when the target redundancy is greater than the redundancy upper limit, the terminal performs forward error correction encoding on the speech coded packets according to the redundancy upper limit to obtain redundant packets, the number of which is the product of the redundancy upper limit and the number of speech coded packets.

If the speech rate value is less than the speech rate lower limit, the larger of the forward error correction redundancy and the redundancy lower limit is selected, and forward error correction encoding is performed on the speech coded packets according to that value to obtain redundant packets. If the speech rate value is greater than the speech rate lower limit and less than the speech rate upper limit, the step of performing forward error correction encoding on the speech coded packets according to the target redundancy is executed to obtain redundant packets. If the speech rate value is greater than the speech rate upper limit, the smaller of the forward error correction redundancy and the redundancy upper limit is selected, and forward error correction encoding is performed on the speech coded packets according to that value to obtain redundant packets.
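The three cases, together with the comparison against the redundancy limits in S604 to S610, can be sketched in Python as follows. The adjustment itself is passed in as a caller-supplied function because its exact form is not fixed here; the linear scaling used in the usage example, the constant C, the rounding of the packet count, and all names are hypothetical.

```python
import math
from typing import Callable


def target_redundancy(
    v: float,
    r0: float,
    v_min: float,
    v_max: float,
    r_min: float,
    r_max: float,
    adjust: Callable[[float, float], float],
) -> float:
    """Select the final redundancy r' from speech rate v and base redundancy r0."""
    if v < v_min:
        # Slow speech: keep r0, but never drop below the redundancy lower limit.
        return max(r0, r_min)
    if v > v_max:
        # Very fast speech: keep r0, but never exceed the redundancy upper limit.
        return min(r0, r_max)
    # In range: adjust r0 by the speech rate, then clamp to [r_min, r_max].
    return min(r_max, max(r_min, adjust(r0, v)))


def redundant_packet_count(redundancy: float, num_packets: int) -> int:
    """Redundancy times packet count; rounding up is an assumption."""
    return math.ceil(redundancy * num_packets)


# Usage with a hypothetical adjustment r0 * v / C (illustration only).
C = 4.0
scale = lambda r0, v: r0 * v / C
r = target_redundancy(v=6.0, r0=0.5, v_min=2.0, v_max=8.0,
                      r_min=0.1, r_max=1.0, adjust=scale)
```

Passing the adjustment as a function keeps the case selection and clamping, which the text specifies, separate from the scaling rule, which is only illustrated.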

In the above embodiments, the forward error correction redundancy is adjusted according to the speech rate value to obtain the adjusted target redundancy, so that forward error correction encoding can be performed on the speech coded packets according to the target redundancy to obtain redundant packets. The redundant packets and the speech coded packets are encapsulated into voice data packets and sent to the receiving end. This ensures that voice data packets lost in transmission can be effectively recovered, achieving effective error correction of the voice data packets while avoiding the consumption of a large amount of extra bandwidth.

As an example, this embodiment first performs speech rate detection on the user's speech to obtain an average speech rate value v. Let the forward error correction redundancy obtained from a traditional forward error correction scheme be r₀, and let the target redundancy after adjustment by this embodiment be r′. The calculation of the average speech rate value, the adjustment of the forward error correction redundancy, and the forward error correction encoding with the adjusted target redundancy are described below.

1) Calculating the average speech rate

In an actual call the speech content is unconstrained, so this embodiment measures the speaker's speech rate with a no-reference detection method. The no-reference speech rate calculation is based on VAD (voice activity detection) and on statistics of how fast the pitch period changes. Within a single phoneme the pitch period (or pitch frequency) is continuous from frame to frame, that is, it changes only slightly, whereas between different phonemes the pitch period (or pitch frequency) of adjacent frames changes markedly. The speech rate v is therefore described equivalently by the number of abrupt pitch period (or pitch frequency) changes plus the number of transitions from non-speech frames to speech frames per unit time. The pseudocode is as follows:

    // initial
    Changecnt = 0;      // per unit time: pitch frequency transitions plus non-speech-to-speech transitions
    Totcnt = 0;         // number of frames examined so far
    vad_cur = 0;        // speech flag of the current frame: 0 = non-speech frame, 1 = speech frame
    vad_pre = 0;        // speech flag of the previous frame: 0 = non-speech frame, 1 = speech frame
    pitchfreq_cur = 0;  // pitch frequency of the current frame
    pitchfreq_pre = 0;  // pitch frequency of the previous frame

    // per frame
    pitchfreq_cur = PitchFreqEst(pcmdata);  // pitch estimation function: input is PCM speech data, output is the estimated pitch frequency; may use time-domain autocorrelation or the cepstrum method
    vad_cur = VadDet(pcmdata);              // VAD function: input is PCM speech data, output is 0 for a non-speech frame and 1 for a speech frame

    If Totcnt <= T Then                     // T is the speech rate detection period
        Totcnt++;
        If vad_cur = 1 Then
            If abs(pitchfreq_pre - pitchfreq_cur) > threshold1 Then  // threshold1 is the preset pitch transition threshold; exceeding it means the pitch has changed significantly
                Changecnt++;
            End
            pitchfreq_pre = pitchfreq_cur;
        End
        If vad_cur = 1 and vad_pre = 0 Then
            Changecnt++;
        End
        vad_pre = vad_cur;
    Else
        v = Changecnt;
        Totcnt = 0;
        Changecnt = 0;
    End

Through the above process, the average speech rate value v is obtained for use in the subsequent FEC redundancy calculation.
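The counting procedure can also be written as a small runnable Python sketch. It takes the per-frame VAD flag and pitch frequency as inputs rather than calling `VadDet` and `PitchFreqEst`, which are names from the pseudocode and not library functions. The class name, the method name, and the choice to emit v exactly once every T frames are assumptions of this sketch.

```python
class SpeechRateDetector:
    """Counts pitch jumps and speech onsets over a window of T frames."""

    def __init__(self, period_frames: int, pitch_threshold: float):
        self.period_frames = period_frames      # T in the pseudocode
        self.pitch_threshold = pitch_threshold  # threshold1 in the pseudocode
        self.change_cnt = 0
        self.tot_cnt = 0
        self.vad_pre = 0
        self.pitch_pre = 0.0

    def push(self, vad_cur: int, pitch_cur: float):
        """Feed one frame; return v when a detection period ends, else None."""
        self.tot_cnt += 1
        if vad_cur == 1:
            if abs(self.pitch_pre - pitch_cur) > self.pitch_threshold:
                self.change_cnt += 1  # significant pitch change
            self.pitch_pre = pitch_cur
            if self.vad_pre == 0:
                self.change_cnt += 1  # non-speech to speech transition
        self.vad_pre = vad_cur
        if self.tot_cnt < self.period_frames:
            return None
        v = self.change_cnt
        self.tot_cnt = 0
        self.change_cnt = 0
        return v


# Usage: six frames of (vad flag, pitch in Hz), with T = 6 and threshold1 = 20 Hz.
detector = SpeechRateDetector(period_frames=6, pitch_threshold=20.0)
frames = [(0, 0.0), (1, 100.0), (1, 105.0), (1, 180.0), (0, 0.0), (1, 180.0)]
results = [detector.push(vad, pitch) for vad, pitch in frames]
```

In this run the counter reaches 4 by the sixth frame: one pitch jump and one onset at the second frame, one pitch jump at the fourth, and one onset at the sixth.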

2) FEC redundancy calculation

Given the average speech rate value v obtained above, the final target redundancy r′ is calculated by the following formula:
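One way to write the case analysis of S602 to S610 as a single piecewise expression is sketched below. The symbol r̃ is an assumed notation for the redundancy obtained in S602 by applying the adjustment parameter to r₀; the exact form of that parameter is left open here.

```latex
r' =
\begin{cases}
\max(r_0,\ R_{\min}), & v < V_1 \\[4pt]
\min\bigl(R_{\max},\ \max(R_{\min},\ \tilde{r})\bigr), & V_1 \le v \le V_2 \\[4pt]
\min(r_0,\ R_{\max}), & v > V_2
\end{cases}
```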

The formula uses the following preset constants: the speech rate upper limit V₂ and the speech rate lower limit V₁, the FEC redundancy lower limit R_min and the redundancy upper limit R_max, and the constant c. With these values the formula yields the target redundancy r′.

3) The user's speech is captured and speech-encoded to obtain multiple speech coded packets. Forward error correction encoding is then performed on the speech coded packets according to the target redundancy r′ to obtain the corresponding redundant packets. The redundant packets and the speech coded packets are packed using RTP to obtain RTP voice data packets, which are then sent over the network to the receiving end, as shown in FIG. 7.

As shown in FIG. 7, after receiving the RTP voice data packets, the receiving end on the one hand measures the packet loss rate and on the other hand performs forward error correction decoding to recover the lost speech coded packets. It then decodes all the speech coded packets to obtain the user's original speech.
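To make the recovery step concrete, the sketch below uses a single XOR parity packet per group as a stand-in FEC code. This is a deliberately simple substitute and not the FEC scheme of the embodiment: it yields only one redundant packet per group and can repair at most one loss per group. Packets are assumed to be of equal length, and all function names are hypothetical.

```python
def xor_parity(packets):
    """Sender side: one parity packet, the byte-wise XOR of equal-length packets."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return bytes(parity)


def loss_rate(received):
    """Receiver side: fraction of packets in the group that did not arrive."""
    return sum(p is None for p in received) / len(received)


def recover_missing(received, parity):
    """Receiver side: rebuild the single missing packet (marked None)."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        # Nothing was lost, or more than one loss: not repairable by one parity.
        return list(received)
    rebuilt = bytearray(parity)
    for p in received:
        if p is not None:
            for i, b in enumerate(p):
                rebuilt[i] ^= b
    repaired = list(received)
    repaired[missing[0]] = bytes(rebuilt)
    return repaired


# Usage: three coded packets, the second one lost in transit.
packets = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(packets)
received = [packets[0], None, packets[2]]
repaired = recover_missing(received, parity)
```

A code with adjustable redundancy would be needed to realize the target redundancy r′ in practice; the XOR example only shows why redundant packets let the receiver reconstruct a lost speech coded packet.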

Adjusting the FEC redundancy according to the speaker's detected speech rate protects the transmitted speech content more effectively and improves end-to-end call quality, enabling highly reliable real-time voice data transmission for services such as VoIP (Voice over Internet Protocol), broadcasting, and live audio and video streaming.

FIGS. 2, 5, and 6 are schematic flowcharts of the speech processing method in one embodiment. It should be understood that although the steps in the flowcharts of FIGS. 2, 5, and 6 are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on the order of execution, and the steps may be executed in other orders. Furthermore, at least some of the steps in FIGS. 2, 5, and 6 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different times; nor are they necessarily executed sequentially, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.

As shown in FIG. 8, an embodiment of the present invention provides a speech processing apparatus, which includes: a detection module 802, an acquisition module 804, an adjustment module 806, a first encoding module 808, a second encoding module 810, and a transmission module 812, wherein:

the detection module 802 is configured to perform speech rate detection on acquired speech to obtain a speech rate value;

the acquisition module 804 is configured to obtain a forward error correction redundancy;

the adjustment module 806 is configured to adjust the forward error correction redundancy according to the speech rate value to obtain a target redundancy;

the first encoding module 808 is configured to perform speech encoding on the speech to obtain speech coded packets;

the second encoding module 810 is configured to perform forward error correction encoding on the speech coded packets according to the target redundancy to obtain redundant packets;

the transmission module 812 is configured to send the redundant packets and the speech coded packets to a receiving end.

In one embodiment, as shown in FIG. 9, the apparatus further includes an encapsulation module 814, wherein:

the encapsulation module 814 is configured to encapsulate the speech coded packets and the redundant packets using the Real-time Transport Protocol to obtain encapsulated voice data packets;

the transmission module 812 is further configured to send to the receiving end the voice data packets obtained by encapsulating the speech coded packets and the redundant packets.

In one embodiment, the detection module 802 is further configured to: capture speech; identify a phoneme sequence from the speech; and determine the speech rate value according to the transition frequency of phonemes in the phoneme sequence.

In one embodiment, the detection module 802 is further configured to: perform pulse code modulation on the speech to obtain speech coded data; identify speech segments from the speech coded data; and identify a phoneme sequence from the speech segments of the speech coded data.

In one embodiment, the detection module 802 is further configured to: extract speech features from the speech; decode the speech features to obtain decoded speech features; and identify a phoneme sequence from the decoded speech features.

In one embodiment, the detection module 802 is further configured to: detect, in the phoneme sequence, the number of transitions in the pitch period or pitch frequency of phonemes per unit time; and determine the speech rate value according to the number of transitions per unit time.

In one embodiment, the adjustment module 806 is further configured to: when the speech rate value is greater than the speech rate lower limit and less than the speech rate upper limit, calculate an adjustment parameter based on the speech rate value; and adjust the forward error correction redundancy according to the adjustment parameter to obtain the target redundancy.

In one embodiment, as shown in FIG. 9, the apparatus further includes a comparison module 816, wherein:

the comparison module 816 is configured to compare the target redundancy with the redundancy upper limit and the redundancy lower limit, respectively; when the target redundancy is less than the redundancy upper limit and greater than the redundancy lower limit, the second encoding module 810 performs forward error correction encoding on the speech coded packets according to the target redundancy to obtain redundant packets.

In one embodiment, the second encoding module 810 is further configured to: when the target redundancy is less than the redundancy lower limit, perform forward error correction encoding on the speech coded packets according to the redundancy lower limit to obtain redundant packets; and when the target redundancy is greater than the redundancy upper limit, perform forward error correction encoding on the speech coded packets according to the redundancy upper limit to obtain redundant packets.

In one embodiment, the second encoding module 810 is further configured to:

if the speech rate value is less than the speech rate lower limit, select the larger of the forward error correction redundancy and the redundancy lower limit, and perform forward error correction encoding on the speech coded packets according to that value to obtain redundant packets;

if the speech rate value is greater than the speech rate lower limit and less than the speech rate upper limit, execute the step of performing forward error correction encoding on the speech coded packets according to the target redundancy to obtain redundant packets;

if the speech rate value is greater than the speech rate upper limit, select the smaller of the forward error correction redundancy and the redundancy upper limit, and perform forward error correction encoding on the speech coded packets according to that value to obtain redundant packets.

FIG. 10 shows the internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 110 in FIG. 1. As shown in FIG. 10, the computer device includes a processor, a memory, a network interface, an input device, and a display screen connected via a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the speech processing method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the speech processing method. The display screen may be a liquid crystal display or an electronic ink display. The input device may be a touch layer covering the display screen, a button, trackball, or touchpad on the housing of the computer device, or an external keyboard, touchpad, or mouse.

Those skilled in the art will understand that the structure shown in FIG. 10 is merely a block diagram of the part of the structure relevant to the solution of this application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.

In one embodiment, the speech processing apparatus provided in this application may be implemented as a computer program that can run on the computer device shown in FIG. 10. The memory of the computer device may store the program modules that make up the speech processing apparatus, for example the detection module 802, acquisition module 804, adjustment module 806, first encoding module 808, second encoding module 810, and transmission module 812 shown in FIG. 8. The computer program formed by these program modules causes the processor to execute the steps of the speech processing method of the embodiments of this application described in this specification.

For example, the computer device shown in FIG. 10 may execute S202 through the detection module 802 of the speech processing apparatus shown in FIG. 8, S204 through the acquisition module 804, S206 through the adjustment module 806, S208 through the first encoding module 808, S210 through the second encoding module 810, and S212 through the transmission module 812.

In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program which, when executed by the processor, causes the processor to perform the following steps: performing speech rate detection on acquired speech to obtain a speech rate value; obtaining a forward error correction redundancy; adjusting the forward error correction redundancy according to the speech rate value to obtain a target redundancy; performing speech encoding on the speech to obtain speech coded packets; performing forward error correction encoding on the speech coded packets according to the target redundancy to obtain redundant packets; and sending the redundant packets and the speech coded packets to a receiving end.

In one embodiment, when the computer program is executed by the processor to perform the step of performing speech rate detection on the acquired speech to obtain a speech rate value, the processor specifically performs the following steps: capturing speech; identifying a phoneme sequence from the speech; and determining the speech rate value according to the transition frequency of phonemes in the phoneme sequence.

In one embodiment, when the computer program is executed by the processor to perform the step of identifying a phoneme sequence from the speech, the processor specifically performs the following steps: performing pulse code modulation on the speech to obtain speech coded data; identifying speech segments containing speech content from the speech coded data; and identifying a phoneme sequence from the speech segments of the speech coded data.

In one embodiment, when the computer program is executed by the processor to perform the step of identifying a phoneme sequence from the speech, the processor specifically performs the following steps: extracting speech features from the speech; decoding the speech features to obtain decoded speech features; and identifying a phoneme sequence from the decoded speech features.

In one embodiment, when the computer program is executed by the processor to perform the step of determining the speech rate value according to the transition frequency of phonemes in the phoneme sequence, the processor specifically performs the following steps: detecting, in the phoneme sequence, the number of transitions in the pitch period or pitch frequency of phonemes per unit time; and determining the speech rate value according to the number of transitions per unit time.

In one embodiment, when the computer program is executed by the processor to perform the step of adjusting the forward error correction redundancy according to the speech rate value to obtain the target redundancy, the processor specifically performs the following steps: when the speech rate value is greater than the speech rate lower limit and less than the speech rate upper limit, calculating an adjustment parameter based on the speech rate value; and adjusting the forward error correction redundancy according to the adjustment parameter to obtain the target redundancy.

In one embodiment, when the computer program is executed by the processor, the processor further performs the following steps: comparing the target redundancy with the redundancy upper limit and the redundancy lower limit, respectively; and when the target redundancy is less than the redundancy upper limit and greater than the redundancy lower limit, executing the step of performing forward error correction encoding on the speech coded packets according to the target redundancy to obtain redundant packets.

In one embodiment, when the computer program is executed by the processor, the processor further performs the following steps: when the target redundancy is less than the redundancy lower limit, performing forward error correction encoding on the speech coded packets according to the redundancy lower limit to obtain redundant packets;

when the target redundancy is greater than the redundancy upper limit, performing forward error correction encoding on the speech coded packets according to the redundancy upper limit to obtain redundant packets.

In one embodiment, when the computer program is executed by the processor, the processor further performs the following steps:

In one embodiment, when the computer program is executed by the processor, the processor further performs the following steps: encapsulating the speech coded packets and the redundant packets using the Real-time Transport Protocol to obtain encapsulated voice data packets;

when the computer program is executed by the processor to perform the step of sending the redundant packets and the speech coded packets to the receiving end, the processor specifically performs the following step:

sending to the receiving end the voice data packets obtained by encapsulating the speech coded packets and the redundant packets.

In one embodiment, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, causes the processor to perform the following steps: performing speech rate detection on acquired speech to obtain a speech rate value; obtaining a forward error correction redundancy; adjusting the forward error correction redundancy according to the speech rate value to obtain a target redundancy; performing speech encoding on the speech to obtain a speech encoding packet; performing forward error correction encoding on the speech encoding packet according to the target redundancy to obtain a redundant packet; and sending the redundant packet and the speech encoding packet to a receiving end.
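The adjust-then-encode part of these steps can be sketched as follows, under several loudly stated assumptions. The passage does not give the adjustment formula, so the linear scaling by `speech_rate / reference_rate` is only a placeholder. The FEC code is likewise not named here, so an interleaved XOR parity scheme stands in for it. All function names and numeric values are hypothetical.

```python
import math
from typing import List


def adjust_redundancy(base: float, speech_rate: float,
                      reference_rate: float) -> float:
    """Assumed rule: scale the base redundancy linearly with speech rate."""
    if reference_rate <= 0:
        raise ValueError("reference_rate must be positive")
    return base * (speech_rate / reference_rate)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, zero-padding the shorter one."""
    n = max(len(a), len(b))
    return bytes(x ^ y for x, y in zip(a.ljust(n, b"\x00"),
                                       b.ljust(n, b"\x00")))


def fec_encode(packets: List[bytes], redundancy: float) -> List[bytes]:
    """Build ceil(redundancy * len(packets)) XOR parity packets.

    Packet i contributes to parity packet i % count, so each parity packet
    can restore one lost packet within its own interleaved group.
    """
    if not packets or redundancy <= 0:
        return []
    count = min(len(packets), math.ceil(redundancy * len(packets)))
    parity = [b""] * count
    for i, pkt in enumerate(packets):
        parity[i % count] = xor_bytes(parity[i % count], pkt)
    return parity


# Four equally sized speech encoding packets (dummy payloads).
speech_packets = [b"\x01\x02", b"\x03\x04", b"\x05\x06", b"\x07\x08"]
target = adjust_redundancy(base=0.25, speech_rate=6.0, reference_rate=4.0)
redundant_packets = fec_encode(speech_packets, target)

# If packet 0 is lost, XOR its group's parity with the surviving member.
recovered = xor_bytes(redundant_packets[0], speech_packets[2])
```

Redundancy is treated here as the ratio of redundant packets to speech packets, matching the definition in the background section. Recovery of variable-length packets would also need the original length, which this sketch does not transmit.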

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined in any way. For brevity, not every possible combination of these technical features is described. However, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.

The embodiments described above represent only several implementations of this application. Although they are described specifically and in detail, they should not be construed as limiting the scope of this patent application. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application shall be determined by the appended claims.

Claims

1. A speech processing method, comprising:

The acquired speech is subjected to speech rate detection to obtain the speech rate value;

Obtain the forward error correction redundancy;

When the speech rate value is greater than the lower limit of speech rate and less than the upper limit of speech rate, an adjustment parameter is calculated based on the speech rate value, and the forward error correction redundancy is adjusted according to the adjustment parameter to obtain the target redundancy;

The speech is encoded to obtain a speech encoding packet;

The speech encoding packet is forward error correction encoded according to the target redundancy to obtain a redundant packet;

The redundant packet and the speech encoding packet are sent to the receiving end.

2. The method according to claim 1, characterized in that, the step of performing speech rate detection on the acquired speech to obtain a speech rate value includes:

Collect the speech;

Identify the phoneme sequence from the speech;

The speech rate value is determined based on the frequency of phoneme transitions in the phoneme sequence.

3. The method according to claim 2, wherein identifying the phoneme sequence from the speech comprises:

The speech is subjected to pulse code modulation to obtain speech coded data;

Identify speech segments containing speech content from the speech coding data;

Phoneme sequences are identified from the speech segments of the speech encoded data.

4. The method according to claim 2, wherein identifying the phoneme sequence from the speech comprises:

Extract speech features from the speech;

The speech features are decoded to obtain the decoded speech features;

Phoneme sequences are identified from the decoded speech features.

5. The method according to claim 2, wherein determining the speech rate value based on the frequency of phoneme transitions in the phoneme sequence comprises:

The number of fundamental frequency jumps of a phoneme per unit time is detected in the phoneme sequence.

The speech rate value is determined based on the number of jumps per unit time.

6. The method according to any one of claims 1 to 5, wherein the forward error correction redundancy is configured according to network quality; or,

The forward error correction redundancy is determined based on the packet loss rate, which is obtained based on the network quality prediction.

7. The method according to claim 1, characterized in that the method further comprises:

The target redundancy is compared with the upper limit and lower limit of redundancy, respectively.

When the target redundancy is less than the upper limit of redundancy and greater than the lower limit of redundancy, the step of performing forward error correction encoding on the speech encoding packet according to the target redundancy to obtain a redundant packet is performed.

8. The method according to claim 7, characterized in that the method further comprises:

When the target redundancy is less than the lower limit of redundancy, the speech encoding packet is forward error correction encoded according to the lower limit of redundancy to obtain a redundant packet;

When the target redundancy is greater than the upper limit of redundancy, the speech encoding packet is forward error correction encoded according to the upper limit of redundancy to obtain a redundant packet.

9. The method according to any one of claims 1 to 5, characterized in that the method further comprises:

If the speech rate value is less than the lower limit of speech rate, the maximum value is selected from the forward error correction redundancy and the lower limit of redundancy, and the speech encoding packet is forward error correction encoded according to the maximum value to obtain a redundant packet;

If the speech rate value is greater than the upper limit of speech rate, the minimum value is selected from the forward error correction redundancy and the upper limit of redundancy, and the speech encoding packet is forward error correction encoded according to the minimum value to obtain a redundant packet.

10. The method according to any one of claims 1 to 5, characterized in that, before sending the redundant packet and the speech encoding packet to the receiving end, the method further comprises:

The speech encoding packet and the redundant packet are encapsulated using the Real-time Transport Protocol (RTP) to obtain an encapsulated voice data packet;

Sending the redundant packet and the speech encoding packet to the receiving end includes:

Send the voice data packet, which encapsulates the speech encoding packet and the redundant packet, to the receiving end.

11. A voice processing device, characterized in that the device comprises:

The detection module is used to detect the speech rate of the acquired speech and obtain the speech rate value;

The acquisition module is used to obtain the forward error correction redundancy.

An adjustment module is used to calculate adjustment parameters based on the speech rate value when the speech rate value is greater than the lower limit of speech rate and less than the upper limit of speech rate; and to adjust the forward error correction redundancy according to the adjustment parameters to obtain the target redundancy.

The first encoding module is used to encode the speech to obtain a speech encoding packet;

The second encoding module is used to perform forward error correction encoding on the speech encoding packet according to the target redundancy to obtain a redundant packet;

The sending module is used to send the redundant packet and the speech encoding packet to the receiving end.

12. The apparatus according to claim 11, wherein the detection module is further configured to:

Collect the speech;

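Identify the phoneme sequence from the speech;

Determine the speech rate value based on the frequency of phoneme transitions in the phoneme sequence.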

13. The apparatus according to claim 12, wherein the detection module is further configured to:

The speech is subjected to pulse code modulation to obtain speech coded data;

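Speech segments containing speech content are identified from the speech coded data;

Phoneme sequences are identified from the speech segments of the speech coded data.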

14. The apparatus according to claim 12, wherein the detection module is further configured to:

Extract speech features from the speech;

The speech features are decoded to obtain the decoded speech features;

Phoneme sequences are identified from the decoded speech features.

15. The apparatus according to claim 12, wherein the detection module is further configured to:

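The number of fundamental frequency jumps of a phoneme per unit time is detected in the phoneme sequence;

The speech rate value is determined based on the number of jumps per unit time.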

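16. The apparatus according to any one of claims 11 to 15, wherein the forward error correction redundancy is configured according to network quality; or,

The forward error correction redundancy is determined based on the packet loss rate, which is obtained based on the network quality prediction.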

17. The apparatus according to claim 11, wherein the apparatus further comprises:

The comparison module is used to compare the target redundancy with the upper limit of redundancy and the lower limit of redundancy, respectively.

The second encoding module is further configured to perform forward error correction encoding on the speech encoding packet according to the target redundancy when the target redundancy is less than the upper limit of redundancy and greater than the lower limit of redundancy, so as to obtain a redundant packet.

18. The apparatus according to claim 17, wherein the second encoding module is further configured to: when the target redundancy is less than the lower limit of redundancy, perform forward error correction encoding on the speech encoding packet according to the lower limit of redundancy to obtain a redundant packet; and when the target redundancy is greater than the upper limit of redundancy, perform forward error correction encoding on the speech encoding packet according to the upper limit of redundancy to obtain a redundant packet.

19. The apparatus according to any one of claims 11 to 15, wherein the second encoding module is further configured to: if the speech rate value is less than the lower limit of speech rate, select the maximum value from the forward error correction redundancy and the lower limit of redundancy, and perform forward error correction encoding on the speech encoding packet according to the maximum value to obtain a redundant packet; if the speech rate value is greater than the upper limit of speech rate, select the minimum value from the forward error correction redundancy and the upper limit of redundancy, and perform forward error correction encoding on the speech encoding packet according to the minimum value to obtain a redundant packet.

20. The apparatus according to any one of claims 11 to 15, characterized in that the apparatus further comprises:

An encapsulation module is used to encapsulate the speech encoding packet and the redundant packet using the Real-time Transport Protocol (RTP) to obtain an encapsulated voice data packet;

The sending module is also used to send the voice data packet, which encapsulates the speech encoding packet and the redundant packet, to the receiving end.

21. A computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the steps of the method as claimed in any one of claims 1 to 10.

22. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method as claimed in any one of claims 1 to 10.