CN118803151A

CN118803151A - Voice data processing method, device, equipment, medium and program product

Info

Publication number: CN118803151A
Application number: CN202410143273.5A
Authority: CN
Inventors: 周骏华; 吴庆航; 陈民; 程宝平
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Hangzhou Information Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Hangzhou Information Technology Co Ltd
Priority date: 2024-02-01
Filing date: 2024-02-01
Publication date: 2024-10-18

Abstract

The present invention provides a voice data processing method, device, equipment, medium and program product. The method comprises: obtaining uplink bandwidth data, determining a first candidate encoding format from a plurality of preset encoding formats according to the uplink bandwidth data, and determining a target encoding format based on the first candidate encoding format; sending a notification instruction to a call peer and monitoring for a reply message to the notification instruction from the call peer, wherein the notification instruction indicates an encoding format switch; and, after receiving the reply message, encoding voice data based on the target encoding format to obtain a voice data packet that includes a flag bit reflecting the target encoding format, and sending the voice data packet to the call peer. By adaptively selecting an encoding format according to the actual uplink bandwidth data during a call, and by notifying the call peer of the encoding format change through a flag bit added to the data packet, the present invention can switch encoding formats without additional overhead and provide a stable call experience.

Description

Voice data processing method, device, equipment, medium and program product

Technical Field

The present invention relates to the technical field of the Internet of Things, and in particular to a voice data processing method, device, equipment, medium and program product.

Background Art

NB-IoT (Narrowband Internet of Things) targets the low-power wide-area (LPWA) Internet of Things (IoT) market and is an emerging technology that can be widely applied worldwide. It is characterized by wide coverage, a large number of connections, low data rate, low cost, low power consumption and a good architecture. NB-IoT uses licensed frequency bands and can be deployed in three ways (in-band, guard band or standalone carrier) to coexist with existing networks. A large number of thing-to-thing connections already exist, and most of them are carried by short-range communication technologies such as Bluetooth and Wi-Fi rather than by operator mobile networks. Systems such as home access control and doorbells require, per household, many device connections at low data rate and low power consumption, and they are used infrequently and have relaxed latency requirements. These needs match the characteristics of NB-IoT well, so many existing home smart devices are interconnected over the narrowband Internet of Things.

However, the stability of NB-IoT uplink and downlink rates still falls far short of that of operator mobile networks. Although the theoretical uplink rate is 16.9 kbps, the actual rate is only about 3 kbps under some unstable network conditions, and because of this instability a stable call experience usually cannot be provided in such a network environment.

Summary of the Invention

The present invention provides a voice data processing method, device, equipment, medium and program product, to overcome the defect in the prior art that an unstable narrowband Internet of Things network cannot provide a stable call experience, by dynamically selecting the voice encoding format according to network conditions and thereby improving the stability of the call experience.

The present invention provides a voice data processing method, comprising:

acquiring uplink bandwidth data, determining a first candidate encoding format from a plurality of preset encoding formats according to the uplink bandwidth data, and determining a target encoding format based on the first candidate encoding format;

sending a notification instruction to a call peer, and monitoring for a reply message to the notification instruction sent by the call peer, wherein the notification instruction is used to indicate an encoding format switch;

after receiving the reply message, encoding voice data based on the target encoding format to obtain a voice data packet, wherein the voice data packet includes a flag bit reflecting the target encoding format, and sending the voice data packet to the call peer.

According to a voice data processing method provided by the present invention, the plurality of preset encoding formats include a Lyra encoding format, and determining a first candidate encoding format from the plurality of preset encoding formats according to the uplink bandwidth data comprises:

when the uplink bandwidth data is lower than a first preset threshold, determining the Lyra encoding format as the first candidate encoding format.

According to a voice data processing method provided by the present invention, encoding the voice data based on the target encoding format to obtain a voice data packet comprises:

when the uplink bandwidth data is less than a second preset threshold, performing voice activity detection on the voice data to obtain a plurality of voice data segments and a timestamp corresponding to each voice data segment, wherein the voice data segments contain voice signals;

encoding the plurality of voice data segments based on the target encoding format to obtain a plurality of voice frames and a timestamp corresponding to each voice frame;

generating one voice data packet based on the plurality of voice frames, wherein the voice data packet includes the timestamp corresponding to each voice frame.

According to a voice data processing method provided by the present invention, before encoding the voice data based on the target encoding format, the method comprises:

acquiring collected data and performing feature extraction on the collected data to obtain speech features, wherein the collected data is obtained by a speech collection device through speech collection;

inputting the speech features into a noise reduction model and obtaining processed features output by the noise reduction model;

quantizing the processed features to obtain the voice data.

According to a voice data processing method provided by the present invention, determining a target encoding format based on the first candidate encoding format comprises:

sending the first candidate encoding format to the call peer;

acquiring a first candidate decoding format sent by the call peer, and determining the target encoding format based on the first candidate encoding format and the first candidate decoding format.

A voice data processing method provided by the present invention further comprises:

acquiring device performance data, and determining a second candidate decoding format from a plurality of decoding formats based on the device performance data;

determining a target decoding format based on the second candidate decoding format;

decoding data packets sent by the call peer based on the target decoding format.

The present invention also provides a voice data processing device, comprising:

a network judgment module, configured to acquire uplink bandwidth data, determine a first candidate encoding format from a plurality of preset encoding formats according to the uplink bandwidth data, and determine a target encoding format based on the first candidate encoding format;

a voice coding control module, configured to send a notification instruction to a call peer and monitor for a reply message to the notification instruction sent by the call peer, wherein the notification instruction is used to indicate an encoding format switch;

a voice codec module, configured to, after the reply message is received, encode voice data based on the target encoding format to obtain a voice data packet, wherein the voice data packet includes a flag bit reflecting the target encoding format, and send the voice data packet to the call peer.

The present invention also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements any of the voice data processing methods described above.

The present invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the voice data processing methods described above.

The present invention also provides a computer program product, comprising a computer program which, when executed by a processor, implements any of the voice data processing methods described above.

In the voice data processing method, device, equipment, medium and program product provided by the present invention, uplink bandwidth data is acquired, a first candidate encoding format is determined from a plurality of preset encoding formats according to the uplink bandwidth data, and a target encoding format is determined based on the first candidate encoding format. A notification instruction is then sent to the call peer. After a reply message to the notification instruction is received from the call peer, the voice data is encoded based on the target encoding format to obtain a voice data packet, and a flag bit is inserted into the voice data packet. The flag bit reflects that the current encoding format has been switched to the target encoding format, so the call peer can switch to the decoding format corresponding to the target encoding format. By adaptively selecting the encoding format according to the actual uplink bandwidth data during a call, the present invention can select a voice encoding technology better suited to a low-bandwidth environment when the uplink rate is poor, and one better suited to a high-bandwidth environment when the uplink rate is good. By notifying the call peer of the encoding format change through a flag bit added to the data packet, the encoding format can be switched without additional overhead, providing a stable call experience.

Brief Description of the Drawings

In order to explain the technical solutions of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of the voice data processing method provided by the present invention;

FIG. 2 is a schematic diagram of the module architecture of the local terminal in the voice data processing method provided by the present invention;

FIG. 3 is a schematic diagram of the module architecture of the call peer in the voice data processing method provided by the present invention;

FIG. 4 is a diagram of the heterogeneous control of encoding and decoding formats during a call in the voice data processing method provided by the present invention;

FIG. 5 is a schematic flowchart of switching the encoding format in the voice data processing method provided by the present invention;

FIG. 6 is a schematic diagram of the frame fusion operation in the voice data processing method provided by the present invention;

FIG. 7 is a first schematic diagram of sound pickup enhancement in the voice data processing method provided by the present invention;

FIG. 8 is a second schematic diagram of sound pickup enhancement in the voice data processing method provided by the present invention;

FIG. 9 is a schematic structural diagram of the voice data processing device provided by the present invention;

FIG. 10 is a schematic structural diagram of the electronic device provided by the present invention.

Detailed Description

In order to make the purpose, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

The voice data processing method provided by the present invention is described below with reference to FIG. 1 to FIG. 8. As shown in FIG. 1, the method comprises the steps of:

S110, acquiring uplink bandwidth data, determining a first candidate encoding format from a plurality of preset encoding formats according to the uplink bandwidth data, and determining a target encoding format based on the first candidate encoding format;

S120, sending a notification instruction to the call peer, and monitoring for a reply message to the notification instruction sent by the call peer;

S130, after receiving the reply message, encoding voice data based on the target encoding format to obtain a voice data packet that includes a flag bit reflecting the target encoding format, and sending the voice data packet to the call peer.

The method provided by the present invention may be executed by a narrowband Internet of Things device. As shown in FIG. 2, the device may include a network judgment module, a voice coding control module and a voice codec module. The network judgment module performs bandwidth detection. The voice coding control module switches the voice encoding format, selecting the encoding mode of the voice call under different network conditions. The voice codec module performs the encoding and can be preconfigured with multiple encoding formats, such as Lyra, G.711, AAC, AMR-WB and OPUS.

Determining a first candidate encoding format from a plurality of preset encoding formats according to the uplink bandwidth data comprises:

when the uplink bandwidth data is lower than a first preset threshold, determining the Lyra encoding format as the first candidate encoding format.

Audio communication solutions for traditional low-power devices, such as intercom and fixed-line communication services, generally use G.711, AAC or OPUS voice encoding. Industry evaluation reports and actual tests show that at 3 kbps the intelligibility of speech encoded this way has already dropped markedly and is insufficient for normal voice communication.

Lyra is a voice encoding technology developed by Google for 3 kbps bandwidth environments. At this bandwidth its quality metrics are all higher than those of traditional voice encodings, so it can greatly improve speech intelligibility over such a narrow band and enable normal voice communication.

In some high-bandwidth scenarios, however, the Lyra encoding format performs worse than traditional voice encodings. In the method provided by the present invention, the Lyra encoding format is determined as the first candidate encoding format only when the uplink bandwidth data is lower than the first preset threshold. Thus, when the uplink bandwidth is low, voice is encoded in a format that preserves intelligibility at low bandwidth, which keeps the call experience stable when the network fluctuates. When the uplink bandwidth is not lower than the first preset threshold, a traditional voice encoding that performs better at high bandwidth can be used.

The optional encoding formats include Lyra, G.711, AAC, AMR-WB, OPUS and so on. In one possible implementation, the first preset threshold may be 12 kbps, and the first candidate encoding format may be determined from the uplink bandwidth data by the following formula:

[Formula image not reproduced in the source text]

where Cs denotes the first candidate encoding format, v denotes the uplink bandwidth data, and η denotes the device performance data. The device performance data reflects the performance of the device: the larger the value, the better the device performance.
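The formula itself is not reproduced in the text above, so the following is only a minimal Python sketch of a selection rule that is consistent with the surrounding description. The 12 kbps threshold and the choice of Lyra below it come from the text. The function name, the normalisation of η to the range 0 to 1, the 0.5 cut-off and the choice of codecs above the threshold are assumptions made purely for illustration.

```python
FIRST_THRESHOLD_KBPS = 12.0  # first preset threshold, as given in the text


def select_candidate_encoding(v_kbps: float, eta: float) -> str:
    """Return a first candidate encoding format Cs.

    v_kbps: measured uplink bandwidth in kbps (v in the text).
    eta:    device performance score (assumed here to lie in [0, 1]).
    """
    if v_kbps < FIRST_THRESHOLD_KBPS:
        # From the text: below the first threshold, Lyra is the candidate.
        return "Lyra"
    # Assumption: at or above the threshold, a more capable device picks
    # a more demanding traditional codec. The 0.5 cut-off is hypothetical.
    return "OPUS" if eta >= 0.5 else "G.711"
```

For example, under these assumptions `select_candidate_encoding(3.0, 0.8)` yields `"Lyra"`, matching the unstable 3 kbps case described in the background.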

In one possible implementation, determining the target encoding format based on the first candidate encoding format may consist of directly using the first candidate encoding format as the target encoding format. However, because the encoded data is sent to the call peer, the call peer must decode it with the corresponding decoding format for the call to succeed. As shown in FIG. 2 and FIG. 3, both the terminal executing the method provided by the present invention and the call peer it communicates with can be provided with a media negotiation module, so that the two sides agree on encoding and decoding formats that both support.

Determining the target encoding format based on the first candidate encoding format comprises:

sending the first candidate encoding format to the call peer;

acquiring a first candidate decoding format sent by the call peer, and determining the target encoding format based on the first candidate encoding format and the first candidate decoding format.

The first candidate decoding format sent by the call peer is the format the call peer has selected as applicable, and it may or may not match the first candidate encoding format. When the two match, the target encoding format is the first candidate encoding format. When they do not match, the call peer does not support the first candidate encoding format and has selected the first candidate decoding format as a suitable decoding format based on its own state, so the encoding format corresponding to the first candidate decoding format can be used as the target encoding format.
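The decision rule just described can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation: the `NegotiationResult` type and the function name are hypothetical, and it assumes that an encoding format and its matching decoding format are identified by the same codec name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NegotiationResult:
    target_encoding: str   # format the local terminal will encode with
    peer_accepted: bool    # True if the peer agreed to the local proposal


def resolve_target_encoding(first_candidate_encoding: str,
                            first_candidate_decoding: str) -> NegotiationResult:
    """Combine the local proposal with the peer's answer."""
    if first_candidate_decoding == first_candidate_encoding:
        # Formats match: the peer can decode what was proposed.
        return NegotiationResult(first_candidate_encoding, True)
    # Formats differ: the peer does not support the proposal, so encode
    # in the format the peer says it will decode.
    return NegotiationResult(first_candidate_decoding, False)
```

The `peer_accepted` field is not part of the described method; it is included only so a caller can tell whether the original proposal was kept.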

To further ensure that the call succeeds, the method provided by the present invention negotiates the format with the call peer after determining the first candidate encoding format. This covers the case in which the call peer cannot decode the first candidate encoding format and thereby improves the call success rate.

Decoding the Lyra format is computationally demanding. In tests on low-power devices running at 1 GHz, the decoding delay is on the order of seconds or more, so real-time synchronization cannot be achieved. Devices used for narrowband communication have even less computing power, which makes this problem worse.

As shown in FIG. 4, the method provided by the present invention uses heterogeneous control of encoding and decoding to reduce the required computing power and remain applicable to low-power devices. That is, the encoding format of the uplink data and the decoding format of the downlink data may differ. Specifically, the method includes:

acquiring device performance data, and determining a second candidate decoding format from a plurality of decoding formats based on the device performance data;

decoding data packets sent by the call peer based on the target decoding format.

The second candidate decoding format is a non-Lyra decoding format. Decoding the downlink data with a traditional decoding format rather than the Lyra decoding format avoids the high performance requirements of Lyra decoding, lowers the computing power required for a call, reduces power consumption, and makes the method applicable to devices with low computing power and low power consumption.

In the method provided by the present invention, the Lyra decoding format is not included among the preset decoding formats. Determining the second candidate decoding format from the plurality of decoding formats based on the device performance data can be implemented by the following formula:

Cr = common(η)

where Cr denotes the second candidate decoding format and η denotes the device performance data. The device performance data reflects the performance of the device: the larger the value, the better the device performance.
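The text does not define how common(η) maps a performance value to a decoder, so the following Python sketch shows just one plausible reading: a lookup over performance tiers from which Lyra is deliberately absent. The tier boundaries, the codecs assigned to each tier and the assumption that η is normalised to the range 0 to 1 are all hypothetical.

```python
# Hypothetical tiers: (minimum eta, decoder name), most demanding first.
# Lyra is intentionally not listed, matching the statement in the text that
# the preset decoding formats do not include the Lyra decoding format.
DECODER_TIERS = [
    (0.7, "OPUS"),
    (0.3, "AAC"),
    (0.0, "G.711"),
]


def common(eta: float) -> str:
    """Return a second candidate decoding format Cr for performance eta."""
    for min_eta, decoder in DECODER_TIERS:
        if eta >= min_eta:
            return decoder
    # Values below every tier fall back to the least demanding decoder.
    return DECODER_TIERS[-1][1]
```

Because the decoder is chosen independently of the uplink bandwidth, a terminal can encode its uplink with Lyra while decoding its downlink with a lighter traditional codec, which is the heterogeneous arrangement described above.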

In one possible implementation, the second candidate decoding format can be used directly as the target decoding format. Similarly, to ensure that the call peer can encode in the encoding format corresponding to the locally selected decoding format, in the method provided by the present invention, determining the target decoding format based on the second candidate decoding format, after the second candidate decoding format has been determined, comprises:

sending the second candidate decoding format to the call peer;

acquiring a second candidate encoding format sent by the call peer, and determining the target decoding format based on the second candidate decoding format and the second candidate encoding format.

The process of determining the target decoding format based on the second candidate decoding format can be the same as the process of determining the target encoding format based on the first candidate encoding format, and is not repeated here.

After each negotiation with the call peer, the encoding and decoding formats supported by the call peer can be recorded. When a first candidate encoding format is later determined and the record shows that the call peer supports decoding it, the first candidate encoding format can be used directly as the target encoding format. Once the target encoding format is determined, the terminal switches to it, encodes the voice data and sends the result to the call peer.
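The record of peer-supported formats can be sketched as a small cache that lets later switches skip negotiation. This is a minimal Python sketch under stated assumptions: the class and function names, the `peer_id` key and the `negotiate` callback (standing in for the media negotiation exchange) are hypothetical.

```python
from typing import Callable, Dict, Set


class PeerCodecCache:
    """Remembers which formats each call peer is known to support."""

    def __init__(self) -> None:
        self._supported: Dict[str, Set[str]] = {}

    def record(self, peer_id: str, fmt: str) -> None:
        self._supported.setdefault(peer_id, set()).add(fmt)

    def supports(self, peer_id: str, fmt: str) -> bool:
        return fmt in self._supported.get(peer_id, set())


def choose_target(cache: PeerCodecCache, peer_id: str, first_candidate: str,
                  negotiate: Callable[[str], str]) -> str:
    """Use the cached record when possible, otherwise negotiate and record."""
    if cache.supports(peer_id, first_candidate):
        # Known to be decodable by the peer: no negotiation round trip.
        return first_candidate
    target = negotiate(first_candidate)  # returns the agreed format
    cache.record(peer_id, target)
    return target
```

On a narrowband link each avoided negotiation saves a round trip, which is the motivation for keeping the record.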

If the terminal switches to the target encoding format and sends encoded voice data without negotiating with the call peer, that is, when the first candidate encoding format is used directly as the target encoding format, the call peer will by default keep decoding with the previous decoding format, and decoding will fail. Therefore, in the method provided by the present invention, if the target encoding format differs from the current encoding format once the target encoding format has been determined, a notification instruction is sent to the call peer to indicate that an encoding format switch is taking place.

Specifically, conventional SIP signaling has excessive overhead under narrowband conditions and affects the stability of media transmission and reception. Even with the PB (protobuf) protocol, which provides compression, there is no guarantee that the voice call will not be interrupted. As shown in FIG. 5, in the method provided by the present invention, the notification instruction is carried as a media switching notification in the media communication protocol RTCP, so that the call peer is told in advance to prepare for the stream switch. After the reply message to the notification instruction is received from the call peer, it is certain that the call peer knows a format switch is about to occur. Only then does the terminal switch to the target encoding format, encode the voice data and send the encoded data packets to the call peer.
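The notify-then-wait behaviour can be sketched as a two-state machine in Python. This is an illustrative sketch only: the class, the `send_notification` callback (standing in for sending the RTCP media switching notification) and the assumption that the reply echoes the acknowledged format are all hypothetical, since the text does not specify the message contents.

```python
from enum import Enum, auto
from typing import Callable, Optional


class SwitchState(Enum):
    STABLE = auto()          # encoding with current_format
    AWAITING_REPLY = auto()  # notification sent, reply not yet received


class EncoderSwitcher:
    def __init__(self, current_format: str,
                 send_notification: Callable[[str], None]) -> None:
        self.current_format = current_format
        self.pending_format: Optional[str] = None
        self.state = SwitchState.STABLE
        self._send_notification = send_notification

    def request_switch(self, target_format: str) -> bool:
        """Notify the peer; keep encoding in the old format meanwhile."""
        if target_format == self.current_format:
            return False  # nothing to switch, no notification needed
        self.pending_format = target_format
        self.state = SwitchState.AWAITING_REPLY
        self._send_notification(target_format)
        return True

    def on_reply(self, acknowledged_format: str) -> bool:
        """Switch only once the peer has acknowledged the pending format."""
        if (self.state is SwitchState.AWAITING_REPLY
                and acknowledged_format == self.pending_format):
            self.current_format = acknowledged_format
            self.pending_format = None
            self.state = SwitchState.STABLE
            return True
        return False
```

Until `on_reply` succeeds, `current_format` is unchanged, so packets sent in the meantime remain decodable by the peer.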

Further, in the method provided by the present invention, after the voice data packet is generated, a flag bit reflecting the target encoding format is inserted into it. Taking an RTP packet as an example, a flag can be inserted into the extension of the RTP packet to reflect the target encoding format. The call peer can then confirm which format to use to decode each received packet, which ensures that the encoding format switch proceeds smoothly.
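As an illustration of carrying such a flag, the Python sketch below builds an RTP packet with the X (extension) bit set and a one-word header extension in the general layout defined by RFC 3550 (a 16-bit profile-defined identifier, a 16-bit length in 32-bit words, then the extension data). The codec-to-flag mapping, the profile identifier `0x1000`, the dynamic payload type 96 and the function names are assumptions; the text does not specify them.

```python
import struct
from typing import Tuple

# Hypothetical mapping between codec names and flag values.
CODEC_FLAGS = {"G.711": 0, "OPUS": 1, "Lyra": 2}
FLAG_TO_CODEC = {flag: name for name, flag in CODEC_FLAGS.items()}
EXT_PROFILE_ID = 0x1000  # hypothetical "defined by profile" identifier


def build_rtp_packet(codec: str, payload: bytes, seq: int, timestamp: int,
                     ssrc: int, payload_type: int = 96) -> bytes:
    """Build an RTP packet whose header extension carries the codec flag."""
    first_byte = (2 << 6) | (1 << 4)  # version 2, no padding, X=1, CC=0
    header = struct.pack("!BBHII", first_byte, payload_type & 0x7F,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    ext_data = struct.pack("!B3x", CODEC_FLAGS[codec])  # flag + 3 pad bytes
    extension = struct.pack("!HH", EXT_PROFILE_ID, len(ext_data) // 4)
    return header + extension + ext_data + payload


def read_codec_flag(packet: bytes) -> Tuple[str, bytes]:
    """Return (codec name, payload) from a packet built as above."""
    first_byte = packet[0]
    if not first_byte & 0x10:
        raise ValueError("packet has no header extension")
    offset = 12 + 4 * (first_byte & 0x0F)  # skip fixed header and CSRCs
    _profile, length_words = struct.unpack_from("!HH", packet, offset)
    flag = packet[offset + 4]
    payload = packet[offset + 4 + 4 * length_words:]
    return FLAG_TO_CODEC[flag], payload
```

In this layout the flag costs 8 bytes per packet (the 4-byte extension header plus one padded word), and the receiver can select the decoder from the packet itself without a separate signaling message.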

To further improve call quality under even worse network bandwidth conditions, in the method provided by the present invention, encoding the voice data based on the target encoding format to obtain a voice data packet comprises:

when the uplink bandwidth data is less than a second preset threshold, performing voice activity detection on the voice data to obtain a plurality of voice data segments and a timestamp corresponding to each voice data segment, wherein the voice data segments contain voice signals;

encoding the plurality of voice data segments based on the target encoding format to obtain a plurality of voice frames and a timestamp corresponding to each voice frame;

generating one voice data packet based on the plurality of voice frames, wherein the voice data packet includes the timestamp corresponding to each voice frame.

在传统的语音通话过程中，是对采集到的语音数据进行实时编码的，即使当前采集到的语音数据中存在噪音信号，也会进行编码传输，这虽然可以还原真实语音情况，但是会造成更多的数据传输量，在上行带宽数据小于第二预设阈值(第二预设阈值小于第一预设阈值)，即上行带宽数据非常小时，更多的数据传输量显然会造成语音不流畅。In a traditional voice call, the collected voice data is encoded in real time. Even if there is a noise signal in the currently collected voice data, it will be encoded and transmitted. Although this can restore the real voice situation, it will cause more data transmission. When the uplink bandwidth data is less than the second preset threshold (the second preset threshold is less than the first preset threshold), that is, the uplink bandwidth data is very small, more data transmission will obviously cause the voice to be unsmooth.

As shown in FIG. 6, the second preset threshold may be 2 kbps. In the method provided by the present invention, when the uplink bandwidth data is not less than the second preset threshold, the normal packet generation mode is used: after voice data is collected, it is encoded in real time into voice frames, and each time a voice frame is obtained, an RTP data packet is generated and sent to the call counterpart. When the uplink bandwidth data is less than the second preset threshold, a fused-frame sending mode is used: after voice data is collected and its length reaches a preset length value, voice activity detection (VAD) is performed on it to obtain a plurality of voice data segments and the start time of each segment. The voice data segments are what remains after VAD removes the noise portions of the voice data. The voice data segments are encoded to obtain a plurality of voice frames and a timestamp corresponding to each voice frame, where the timestamp reflects the start time of the speech carried by that frame. One voice data packet is generated from the plurality of voice frames and includes the timestamp of each voice frame. After receiving the voice data packet, the call counterpart aligns the voice frames according to their timestamps, which preserves overall voice synchronization.

By means of VAD, the method provided by the present invention sends only the data that contains speech. Although this introduces some initial delay at the start, removing the noise portions means that only speech data is sent and received, so the amount of data transmitted is smaller. In addition, accumulating voice data up to the preset length value provides tolerance for bandwidth-induced delay, so the subsequent speech can be played smoothly.
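The fused-frame mode can be illustrated with a short sketch. It makes several assumptions that are not in the patent: a toy energy-based VAD stands in for the real detector, timestamps are expressed as sample indices, `toy_encode` is a placeholder for the target-format encoder, and the payload layout (a one-byte frame count followed by timestamp, length and data for each frame) is hypothetical.

```python
import struct
from typing import Callable, List, Tuple

SECOND_THRESHOLD_BPS = 2000  # the 2 kbps example value from the description

Segment = Tuple[int, List[int]]  # (start sample index, samples)


def use_fused_mode(uplink_bps: float) -> bool:
    """Fused-frame sending applies only below the second preset threshold."""
    return uplink_bps < SECOND_THRESHOLD_BPS


def vad_segments(samples: List[int], frame_len: int,
                 energy_threshold: float) -> List[Segment]:
    """Toy energy VAD: group consecutive active frames into segments."""
    segments: List[Segment] = []
    current = None
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= energy_threshold:
            if current is None:
                current = (start, [])    # a new speech segment begins here
            current[1].extend(frame)
        elif current is not None:
            segments.append(current)     # silence closes the open segment
            current = None
    if current is not None:
        segments.append(current)
    return segments


def toy_encode(segment: List[int]) -> bytes:
    """Placeholder encoder: keeps the low byte of each sample."""
    return bytes(s & 0xFF for s in segment)


def pack_fused(segments: List[Segment],
               encode: Callable[[List[int]], bytes]) -> bytes:
    """Bundle all encoded frames and their timestamps into one payload."""
    parts = [struct.pack("!B", len(segments))]
    for start, seg in segments:
        frame = encode(seg)
        parts.append(struct.pack("!IH", start, len(frame)) + frame)
    return b"".join(parts)


def unpack_fused(payload: bytes) -> List[Tuple[int, bytes]]:
    """Recover (timestamp, frame) pairs so the receiver can align playback."""
    (count,) = struct.unpack_from("!B", payload, 0)
    offset, frames = 1, []
    for _ in range(count):
        ts, n = struct.unpack_from("!IH", payload, offset)
        offset += 6
        frames.append((ts, payload[offset:offset + n]))
        offset += n
    return frames
```

Because each frame carries its own start timestamp, the receiver can reinsert the silence that VAD removed instead of playing the speech segments back to back.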

In actual home use, the device needs long-distance sound pickup capability to ensure call quality for the user. However, the Lyra encoding format has a clear weakness in long-distance pickup: once the sound source is beyond a certain distance, the intelligibility of the Lyra-encoded result drops noticeably and the voice sounds heavily mechanical. For low-power narrowband devices, cost is an important consideration, so the device's computing power generally cannot support overly complex AI noise reduction and automatic gain control (AGC) algorithms. Deploying more complex AI noise reduction and AGC algorithms on the device is therefore not a suitable way to improve pickup capability under Lyra encoding. In the method provided by the present invention, before the voice data is encoded based on the target encoding format, the method includes:

acquiring collected data and performing feature extraction on the collected data to obtain speech features, wherein the collected data is obtained by a voice acquisition device through voice collection;

inputting the speech features into a noise reduction model and obtaining the processed features output by the noise reduction model;

quantizing the processed features to obtain the voice data.
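The order of the three steps above can be sketched as follows. This is only an outline under stated assumptions: `extract_features` and `denoise_model` are hypothetical callables standing in for the feature extractor and the noise reduction model, and the uniform scalar quantizer with its `step` and `bits` parameters is an illustrative choice, not the quantizer used by the patent.

```python
from typing import Callable, List, Sequence

Features = List[float]


def quantize(features: Sequence[float], step: float = 0.05,
             bits: int = 8) -> List[int]:
    """Uniform scalar quantization, clipped to a signed `bits`-bit range."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return [max(lo, min(hi, round(x / step))) for x in features]


def preprocess(captured: Sequence[float],
               extract_features: Callable[[Sequence[float]], Features],
               denoise_model: Callable[[Features], Features],
               step: float = 0.05) -> List[int]:
    """Feature extraction first, then denoising on features, then quantization."""
    speech_features = extract_features(captured)   # step 1: features from raw capture
    processed = denoise_model(speech_features)     # step 2: model works on features
    return quantize(processed, step)               # step 3: quantized voice data
```

The point of the ordering is that the model never produces a waveform, so no conversion back to the time domain and no second feature extraction pass is needed before quantization.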

That is, when denoising the raw data collected by the voice acquisition device, the method provided by the present invention does not follow the traditional approach shown in FIG. 7, in which the collected voice (i.e., the collected data) is fed directly into the noise reduction model, the model output is converted to the time domain, features are then extracted to obtain MFCC features, and these are quantized into voice data for subsequent encoding. Instead, as shown in FIG. 8, feature extraction is first performed on the collected data to obtain MFCC speech features, which serve as the input of the noise reduction model. The model outputs processed MFCC features (i.e., the processed features), which are quantized to obtain the voice data used for subsequent encoding. The noise reduction model may be an AI noise reduction model, whose loss function is:

where g_b is the gain estimate output by the model, [symbol missing in the source text] is the real (ground-truth) gain data, and γ is a perceptual parameter that controls how aggressively noise is suppressed.
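The loss formula itself does not survive in this text, so the following is only an assumed form. A loss commonly used for gain-based noise suppressors, built from the same three ingredients (estimated gain, real gain, perceptual exponent γ), is the squared difference between the two gains after each is raised to the power γ. The function name, the summation over all gains and the default γ = 0.5 are assumptions for illustration.

```python
from typing import Sequence


def perceptual_gain_loss(estimated_gains: Sequence[float],
                         true_gains: Sequence[float],
                         gamma: float = 0.5) -> float:
    """Assumed loss: sum over gains of (g_est**gamma - g_true**gamma)**2.

    Gains are expected in [0, 1]. The exponent gamma reshapes the error,
    which changes how strongly under- and over-suppression are penalized.
    """
    if len(estimated_gains) != len(true_gains):
        raise ValueError("gain sequences must have the same length")
    return sum((g ** gamma - t ** gamma) ** 2
               for g, t in zip(estimated_gains, true_gains))
```

With γ = 1 this reduces to the plain squared error on the gains.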

Testing on a 600 MHz low-power device shows that, compared with the traditional approach, the method provided by the present invention, which uses extracted speech features as the input of the noise reduction model, reduces overall overhead by 15% and, with the audio MOS (mean opinion score) unchanged, extends the pickup distance by 1.5 meters. This effectively compensates for the insufficient pickup distance of Lyra encoding without deploying more complex model algorithms, and increases the pickup distance of low-power devices.

The voice data processing apparatus provided by the present invention is described below. The apparatus described below and the voice data processing method described above may be read with reference to each other. As shown in FIG. 9, the voice data processing apparatus provided by the present invention includes:

a network determination module 910, configured to obtain uplink bandwidth data, determine a first candidate encoding format from a plurality of preset encoding formats according to the uplink bandwidth data, and determine a target encoding format based on the first candidate encoding format;

a voice encoding control module 920, configured to send a notification instruction to the call counterpart and listen for a reply message to the notification instruction sent by the call counterpart, wherein the notification instruction indicates an encoding format switch;

a voice encoding and decoding module 930, configured to, after the reply message is received, encode the voice data based on the target encoding format to obtain a voice data packet that includes a flag bit reflecting the target encoding format, and send the voice data packet to the call counterpart.
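How the three modules cooperate can be sketched in a few lines. This is a simplified illustration with several assumptions: the value of `FIRST_THRESHOLD_BPS`, the format name "default", the flag values in `FORMAT_FLAGS`, and the callables `send_notify`, `wait_reply` and `encoders` are all hypothetical; the target format is taken to be the first candidate without further negotiation; and keeping the previous format when no reply arrives is a design choice of this sketch, not something the patent specifies.

```python
from typing import Callable, Dict

FIRST_THRESHOLD_BPS = 8000                 # hypothetical first preset threshold
FORMAT_FLAGS = {"default": 0, "lyra": 1}   # hypothetical flag values


def choose_candidate(uplink_bps: float,
                     first_threshold: float = FIRST_THRESHOLD_BPS) -> str:
    """Network determination: pick Lyra when the uplink bandwidth is low."""
    return "lyra" if uplink_bps < first_threshold else "default"


class CodecSwitcher:
    """Ties together format selection, peer notification and encoding."""

    def __init__(self, send_notify: Callable[[str], None],
                 wait_reply: Callable[[], bool],
                 encoders: Dict[str, Callable[[bytes], bytes]]) -> None:
        self.send_notify = send_notify
        self.wait_reply = wait_reply
        self.encoders = encoders
        self.current = "default"

    def process(self, uplink_bps: float, voice: bytes) -> bytes:
        target = choose_candidate(uplink_bps)
        if target != self.current:
            # Encoding control: announce the switch, then wait for the reply.
            self.send_notify(target)
            if self.wait_reply():
                self.current = target
            # Without a reply, this sketch keeps encoding in the old format.
        frame = self.encoders[self.current](voice)
        # Codec step: prefix the frame with the flag for the format in use.
        return bytes([FORMAT_FLAGS[self.current]]) + frame
```

Calling `process` once per captured chunk means the notification is sent only when the selected format actually changes, while every outgoing packet still carries the flag of the format it was encoded with.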

FIG. 10 is a schematic diagram of the physical structure of an electronic device. As shown in FIG. 10, the electronic device may include a processor 1010, a communications interface 1020, a memory 1030 and a communication bus 1040, wherein the processor 1010, the communications interface 1020 and the memory 1030 communicate with one another through the communication bus 1040. The processor 1010 may call logic instructions in the memory 1030 to execute the voice data processing method, which includes: obtaining uplink bandwidth data, determining a first candidate encoding format from a plurality of preset encoding formats according to the uplink bandwidth data, and determining a target encoding format based on the first candidate encoding format; sending a notification instruction to the call counterpart and listening for a reply message to the notification instruction sent by the call counterpart, wherein the notification instruction indicates an encoding format switch; and, after the reply message is received, encoding the voice data based on the target encoding format to obtain a voice data packet that includes a flag bit reflecting the target encoding format, and sending the voice data packet to the call counterpart.

In addition, when the logic instructions in the memory 1030 are implemented as software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

In another aspect, the present invention further provides a computer program product that includes a computer program, which may be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer is able to execute the voice data processing method provided above, which includes: obtaining uplink bandwidth data, determining a first candidate encoding format from a plurality of preset encoding formats according to the uplink bandwidth data, and determining a target encoding format based on the first candidate encoding format; sending a notification instruction to the call counterpart and listening for a reply message to the notification instruction sent by the call counterpart, wherein the notification instruction indicates an encoding format switch; and, after the reply message is received, encoding the voice data based on the target encoding format to obtain a voice data packet that includes a flag bit reflecting the target encoding format, and sending the voice data packet to the call counterpart.

In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the voice data processing method provided above, which includes: obtaining uplink bandwidth data, determining a first candidate encoding format from a plurality of preset encoding formats according to the uplink bandwidth data, and determining a target encoding format based on the first candidate encoding format; sending a notification instruction to the call counterpart and listening for a reply message to the notification instruction sent by the call counterpart, wherein the notification instruction indicates an encoding format switch; and, after the reply message is received, encoding the voice data based on the target encoding format to obtain a voice data packet that includes a flag bit reflecting the target encoding format, and sending the voice data packet to the call counterpart.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.

From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment may be implemented by software together with a necessary general-purpose hardware platform, or by hardware. Based on this understanding, the above technical solution in essence, or the part that contributes to the prior art, may be embodied as a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A voice data processing method, comprising:

acquiring uplink bandwidth data, determining a first candidate encoding format from a plurality of preset encoding formats according to the uplink bandwidth data, and determining a target encoding format based on the first candidate encoding format;

sending a notification instruction to a call counterpart, and listening for a reply message to the notification instruction sent by the call counterpart, wherein the notification instruction is used for indicating an encoding format switch;

after receiving the reply message, encoding voice data based on the target encoding format to obtain a voice data packet, wherein the voice data packet comprises a flag bit reflecting the target encoding format, and sending the voice data packet to the call counterpart.

2. The voice data processing method according to claim 1, wherein the plurality of preset encoding formats includes a Lyra encoding format, and the determining a first candidate encoding format from the plurality of preset encoding formats according to the uplink bandwidth data comprises:

when the uplink bandwidth data is lower than a first preset threshold, determining the Lyra encoding format as the first candidate encoding format.

3. The voice data processing method according to claim 1, wherein the encoding the voice data based on the target encoding format to obtain a voice data packet comprises:

when the uplink bandwidth data is less than a second preset threshold, performing voice activity detection on the voice data to obtain a plurality of voice data segments and a timestamp corresponding to each voice data segment, wherein the voice data segments comprise voice signals;

encoding the plurality of voice data segments based on the target encoding format to obtain a plurality of voice frames and a timestamp corresponding to each voice frame;

generating one voice data packet based on the plurality of voice frames, wherein the voice data packet comprises the timestamp corresponding to each voice frame.

4. The voice data processing method according to claim 1, wherein, before the encoding the voice data based on the target encoding format, the method comprises:

acquiring collected data, and performing feature extraction on the collected data to obtain speech features, wherein the collected data is obtained by a voice acquisition device through voice collection;

inputting the speech features into a noise reduction model, and obtaining processed features output by the noise reduction model;

quantizing the processed features to obtain the voice data.

5. The voice data processing method according to claim 1, wherein the determining a target encoding format based on the first candidate encoding format comprises:

sending the first candidate encoding format to the call counterpart;

acquiring a first candidate decoding format sent by the call counterpart, and determining the target encoding format based on the first candidate encoding format and the first candidate decoding format.

6. The voice data processing method according to claim 1, wherein the method further comprises:

acquiring device performance data, and determining a second candidate decoding format from a plurality of decoding formats based on the device performance data;

determining a target decoding format based on the second candidate decoding format;

decoding the data packet sent by the call counterpart based on the target decoding format.

7. A voice data processing apparatus, comprising:

a network determination module, configured to acquire uplink bandwidth data, determine a first candidate encoding format from a plurality of preset encoding formats according to the uplink bandwidth data, and determine a target encoding format based on the first candidate encoding format;

a voice encoding control module, configured to send a notification instruction to a call counterpart and listen for a reply message to the notification instruction sent by the call counterpart, wherein the notification instruction is used for indicating an encoding format switch;

a voice encoding and decoding module, configured to, after the reply message is received, encode voice data based on the target encoding format to obtain a voice data packet, wherein the voice data packet comprises a flag bit reflecting the target encoding format, and send the voice data packet to the call counterpart.

8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the voice data processing method according to any one of claims 1 to 6 when executing the program.

9. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the voice data processing method according to any one of claims 1 to 6.

10. A computer program product comprising a computer program which, when executed by a processor, implements the voice data processing method according to any one of claims 1 to 6.