CN114627886A - Conference voice processing method and device - Google Patents

Conference voice processing method and device

Publication number: CN114627886A (granted as CN114627886B)
Application number: CN202210234843.2A
Authority: CN (China); original language: Chinese (zh)
Prior art keywords: conference, audio, characteristic value, conference terminal, mixing
Inventors: 李文, 郑相全, 夏启斌, 都赟赟, 张鸿
Assignee: Institute of Network Engineering, Institute of Systems Engineering, Academy of Military Sciences
Legal status: Active (granted)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 - Changing voice quality, e.g. pitch or formants
    • G10L 21/007 - Changing voice quality, e.g. pitch or formants, characterised by the process used

Abstract

The invention discloses a conference voice processing method and device. The method comprises the following steps: receiving audio data packets from a plurality of conference terminals and performing format adaptation to obtain an adapted audio data set, and scanning the conference terminals to obtain a terminal type information set; analyzing the adapted audio data set by using a preset voice analysis rule to obtain an audio analysis result set; processing the terminal type information by using a preset terminal type rule to obtain a sound mixing rule set; and processing the audio analysis result set by using the sound mixing rules, mixing the audio analysis results that conform to the sound mixing rules. The method and device are therefore well suited to complex conference environments and improve the sound mixing effect of multi-party audio conferences.

Description

Conference voice processing method and device
Technical Field
The present invention relates to the field of speech signal processing technologies, and in particular, to a conference speech processing method and apparatus.
Background
With the development of network technology, audio conferences are used more and more widely. Each receiving end needs to hear the sound emitted by the other terminals but must not hear its own sound, so a sound mixing function is required. In the prior art, noise reduction is performed on the sound emitted by each terminal, and mixing is performed after background noise is removed, so that the voice of a speaking participant is not drowned out by the background noise of other participants. At present, in most conferences, the mixing platform can hardly ensure that all terminals participate in the mixing effectively: for example, the speech of some conference terminals is treated as noise because its volume is low, while the sound of other conference terminals becomes intermittent for the same reason, so the voice quality after mixing cannot meet the requirements of the participants.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a conference voice processing method and device, which can adapt to a complex conference environment and improve the audio mixing effect of a multi-party audio conference.
In order to solve the above technical problem, a first aspect of an embodiment of the present invention discloses a conference voice processing method, where the method includes:
101. receiving audio data packets of a plurality of conference terminals and carrying out format adaptation to obtain an adaptive audio data set, wherein the adaptive audio data set comprises a plurality of adaptive audio data; scanning each conference terminal to obtain a terminal type information set, wherein the terminal type information set comprises a plurality of terminal type information;
102. analyzing the adaptive audio data set by using a preset voice analysis rule to obtain an audio analysis result set; the audio analysis result set comprises a plurality of audio analysis results;
103. processing the terminal type information by using a preset terminal type rule to obtain a sound mixing rule set; the terminal type rule comprises an amplitude characteristic value threshold value and a frequency characteristic value threshold value;
104. and processing the audio analysis result set by using the audio mixing rule, and mixing the audio analysis results conforming to the audio mixing rule.
As an optional implementation manner, in the first aspect of the embodiment of the present invention, the parsing the adapted audio data set by using a preset voice parsing rule includes: for the adaptive audio data of any conference terminal, processing the adaptive audio data of the conference terminal by using a preset first analysis rule to obtain first target audio information corresponding to the conference terminal; processing the adaptive audio data by using a preset first analysis rule, including analyzing the adaptive audio data of the conference terminal, extracting audio information for coding, decoding and converting, and establishing a digital filter for digitally filtering the audio information to obtain the first target audio information; processing the first target audio information by using a preset second analysis rule to obtain second target audio information corresponding to the conference terminal; and processing the first target audio information by using a preset second analysis rule, wherein the processing comprises slicing the first target audio information, writing the sliced first target audio information into an external storage device of a voice channel corresponding to the conference terminal, and regularly reading data of the voice channel and adapting the data to be a TDM data stream to obtain second target audio information.
As an optional implementation manner, in the first aspect of the embodiment of the present invention, the processing, by using the mixing rule, the set of audio parsing results includes: analyzing the adaptive audio data of any conference terminal to obtain a first amplitude characteristic value and a first frequency characteristic value; and if the first amplitude characteristic value is greater than the amplitude activation threshold corresponding to the conference terminal and the first frequency characteristic value is less than the frequency activation threshold corresponding to the conference terminal, judging that the conference terminal participates in sound mixing, otherwise, judging that the conference terminal does not participate in sound mixing.
As an optional implementation manner, in the first aspect of the embodiment of the present invention, processing the audio analysis result set by using the mixing rule includes: analyzing second target audio information of any conference terminal to obtain a second amplitude characteristic value and a second frequency characteristic value; and if the second amplitude characteristic value is greater than the amplitude activation threshold corresponding to the conference terminal and the second frequency characteristic value is less than the frequency activation threshold corresponding to the conference terminal, judging that the conference terminal participates in sound mixing, otherwise, judging that the conference terminal does not participate in sound mixing.
As an optional implementation manner, in the first aspect of the embodiment of the present invention, processing the audio analysis result set by using the mixing rule includes: s1401, analyzing the adaptive audio data of any conference terminal to obtain a first amplitude characteristic value and a first frequency characteristic value; s1402, if the first amplitude characteristic value is larger than the amplitude activation threshold corresponding to the conference terminal and the first frequency characteristic value is smaller than the frequency activation threshold corresponding to the conference terminal, preliminarily judging that the conference terminal participates in sound mixing, and executing S1403; otherwise, the preliminary judgment is that the audio mixing is not involved, and S1405 is executed; s1403, analyzing the second target audio information of the conference terminal to obtain a second amplitude characteristic value and a second frequency characteristic value; s1404, if a second amplitude characteristic value is smaller than the amplitude activation threshold corresponding to the conference terminal, or a second frequency characteristic value is larger than the frequency activation threshold corresponding to the conference terminal, judging whether effective voices exist in the first R pieces of the adaptive audio data, wherein R is a positive integer; if yes, the voice mixing is finally judged to be involved; if not, the final judgment is that the sound mixing is not participated in; s1405, performing the operations from S1401 to S1404 on all the conference terminals to obtain a result of determining whether each conference terminal participates in mixing.
As an optional implementation manner, in the first aspect of the embodiment of the present invention, the adaptive audio data of each conference terminal is mirrored into two paths that are processed simultaneously: the first path executes step 102, analyzing the adaptive audio data set by using the preset voice analysis rule to obtain the audio analysis result set; the second path executes step 104, analyzing the adaptive audio data of each conference terminal to extract a first amplitude characteristic value and a first frequency characteristic value, and analyzing the second target audio information of each conference terminal to extract a second amplitude characteristic value and a second frequency characteristic value.
A second aspect of an embodiment of the present invention discloses a conference voice processing apparatus, including:
the scanning receiving module is used for receiving audio data packets of a plurality of conference terminals and carrying out format adaptation to obtain an adaptive audio data set, wherein the adaptive audio data set comprises a plurality of adaptive audio data; scanning each conference terminal to obtain a terminal type information set, wherein the terminal type information set comprises a plurality of terminal type information;
the first processing module is used for analyzing the adaptive audio data set by using a preset voice analysis rule to obtain an audio analysis result set; the audio analysis result set comprises a plurality of audio analysis results;
the second processing module is used for processing the terminal type information by using a preset terminal type rule to obtain a sound mixing rule set; the terminal type rule comprises an amplitude characteristic value threshold value and a frequency characteristic value threshold value;
and the third processing module is used for processing the audio analysis result set by using the audio mixing rule and mixing the audio analysis results conforming to the audio mixing rule.
The third aspect of the present invention discloses another conference voice processing apparatus, including:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program codes stored in the memory to execute part or all of the steps in the conference voice processing method.
A fourth aspect of the present invention discloses a computer storage medium, where the computer storage medium stores computer instructions, and when the computer instructions are called, the computer instructions are used to execute part or all of the steps in the conference voice processing method disclosed in the first aspect of the embodiments of the present invention.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
in the embodiment of the invention, the conference terminal is scanned to obtain the audio terminal type information, the terminal type rule is utilized to process the audio terminal type information to obtain the audio mixing rule set, the audio mixing rule is utilized to process the audio data analysis result, and the audio data conforming to the audio mixing rule is subjected to audio mixing, so that the problems of wrong mixing (missing mixing, multiple mixing and interruption) during audio mixing are solved, the adaptation to a complex conference environment is facilitated, and the audio mixing effect of a multi-party audio conference is further improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a conference voice processing method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a conference voice processing apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of another conference voice processing apparatus disclosed in the embodiment of the present invention;
fig. 4 is a schematic structural diagram of another conference voice processing apparatus according to the embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," and the like in the description and claims of the present invention and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or apparatus that comprises a list of steps or elements is not limited to those listed but may alternatively include other steps or elements not listed or inherent to such process, method, product, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
RTP: Real-time Transport Protocol. An audio RTP packet is a data packet that carries an audio signal and is transmitted over the RTP protocol.
CPS: Common Part Sublayer.
The invention discloses a conference voice processing method and device, which can obtain audio terminal type information by scanning a conference terminal, process the audio terminal type information by using a terminal type rule to obtain a sound mixing rule set, process an audio data analysis result by using a sound mixing rule, mix audio of audio data according with the sound mixing rule, solve the problem of mismixing (missing mixing, multiple mixing and intermittence) during sound mixing, are beneficial to adapting to a complex conference environment and further improve the sound mixing effect of a multi-party audio conference. The following are detailed below.
Example one
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a conference voice processing method according to an embodiment of the present invention. The conference voice processing method described in fig. 1 is applied to a voice data processing system, such as a platform end or a terminal side for conference voice processing, and the embodiment of the present invention is not limited thereto. As shown in fig. 1, the conference voice processing method may include the following operations:
101. receiving audio data packets of a plurality of conference terminals and carrying out format adaptation to obtain an adaptive audio data set, wherein the adaptive audio data set comprises a plurality of adaptive audio data; scanning each conference terminal to obtain a terminal type information set, wherein the terminal type information set comprises a plurality of terminal type information;
102. analyzing the adaptive audio data set by using a preset voice analysis rule to obtain an audio analysis result set; the audio analysis result set comprises a plurality of audio analysis results;
103. processing the terminal type information by using a preset terminal type rule to obtain a sound mixing rule set; the terminal type rule comprises an amplitude characteristic value threshold value and a frequency characteristic value threshold value;
104. and processing the audio analysis result set by using the audio mixing rule, and mixing the audio analysis results meeting the audio mixing rule.
As an optional implementation manner, the specific manner of scanning each of the conference terminals to obtain the terminal type information is as follows:
carrying out signaling interaction identification by utilizing the switch to which each conference terminal is attached to determine a terminal type information set; the terminal type information set comprises a plurality of terminal type information.
Optionally, the terminal type information includes channel type information.
Optionally, the channel type information includes a wired channel, and/or a short-distance wireless channel, and/or a long-distance wireless channel, and/or other channel types, which is not limited in the embodiments of the present invention.
Optionally, the format of the audio data packet includes an RTP packet format, and/or an Ahelp coding HDLC frame format, and/or a g.729 format, and/or a CVSD format, and/or other audio data formats, which is not limited in the embodiment of the present invention.
In this optional embodiment, as another optional implementation, the specific manner of performing format adaptation on the received audio data packets of the multiple conference terminals is as follows:
and when the audio data packet format sent by the conference terminal is the RTP packet format, converting the audio data packet into the PCM coding format.
When the audio data format sent by the conference terminal is Ahelp, G.729 or CVSD coding format, the audio data packet is adapted and converted into the RTP packet format of PCM coding.
Therefore, the conference voice processing method described in the embodiment of the present invention can perform format adaptation processing on different types of audio data packets, and convert the audio data packets into a PCM encoded RTP packet format, so as to unify the formats of the audio data, realize communication between terminals of different voice types, facilitate adaptation to a complex conference environment, and further improve the audio mixing effect of a multi-party audio conference.
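For illustration, the sketch below shows one way such a format-adaptation step could be organized; the format labels and the to_pcm helper are assumptions, since the embodiment does not prescribe a particular decoder interface.

```python
def to_pcm(codec: str, payload: bytes) -> bytes:
    """Placeholder for the codec-specific conversion to PCM coding (assumed interface)."""
    raise NotImplementedError(f"plug in a real {codec} -> PCM converter here")

def adapt_packet(packet_format: str, payload: bytes) -> bytes:
    """Normalize one incoming audio data packet to a PCM-encoded RTP payload."""
    if packet_format == "RTP":
        # Already RTP-framed: convert the carried audio to PCM coding.
        return to_pcm("rtp-payload", payload)
    if packet_format in ("AHELP_HDLC", "G729", "CVSD"):
        # Decode and repack into the PCM-encoded RTP packet format.
        return to_pcm(packet_format.lower(), payload)
    raise ValueError(f"unsupported audio packet format: {packet_format}")
```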
In an optional embodiment, the specific manner of parsing the adapted audio data set by using the preset voice parsing rule is as follows:
for the adaptive audio data of any conference terminal, processing the adaptive audio data by using a preset first analysis rule to obtain first target audio information corresponding to the conference terminal;
and processing the first target audio information by using a preset second analysis rule to obtain second target audio information corresponding to the conference terminal.
In this optional embodiment, as an optional implementation manner, the specific manner of processing the adapted audio data by using the preset first parsing rule to obtain the first target audio information corresponding to the conference terminal is as follows:
and analyzing the packet header of the adaptive audio data corresponding to the conference terminal, extracting audio information for coding, decoding and converting, and establishing a digital filter for digitally filtering the audio information.
In this optional embodiment, as an optional implementation, the specific way of extracting the audio information for encoding and decoding conversion is as follows:
extracting audio information in the RTP voice packet, shaping the audio information into 8-bit signed number, and converting the audio information from PCM coding into linear code coding data.
The reason for converting the PCM code into the linear code coded data is that the linear code data really represent the amplitude of a signal, and the amplitude characteristic value can be obtained by directly operating the linear code data.
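The description does not name the companding law used for the PCM coding; assuming standard ITU-T G.711 A-law, the expansion to linear code could look like the following sketch.

```python
def alaw_to_linear(sample: int) -> int:
    """Expand one 8-bit G.711 A-law sample to a 16-bit linear value.
    (A-law is an assumption; the description only says 'PCM coding'.)"""
    a = sample ^ 0x55                 # undo the even-bit inversion required by A-law
    t = (a & 0x0F) << 4               # 4-bit mantissa
    seg = (a & 0x70) >> 4             # 3-bit segment (exponent)
    if seg == 0:
        t += 8
    elif seg == 1:
        t += 0x108
    else:
        t = (t + 0x108) << (seg - 1)
    return t if (a & 0x80) else -t    # highest bit carries the sign

# The linear values directly represent signal amplitude, so the amplitude
# characteristic value can be computed on them without further decoding.
linear_frame = [alaw_to_linear(b) for b in b"\x55\xd5\x2a"]  # illustrative bytes
```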
In this optional embodiment, as an optional implementation, the above establishing a digital filter to digitally filter the audio information specifically includes:
and filtering the linear code coded data converted from the PCM code by using a band-pass filter with a passband of 50 Hz-3000 kHz, and converting the linear code coded data into a PCM packet after filtering is finished, namely the first target audio information.
Therefore, the conference voice processing method described in the embodiment of the present invention analyzes the packet header of the RTP voice packet encoded by the PCM corresponding to each conference terminal, extracts the audio information, performs encoding and decoding conversion, and then establishes the digital filter to digitally filter the audio information after the encoding and decoding conversion, so as to filter the data in the non-human voice spectrum range in the voice, which is beneficial to adapting to a complex conference environment, and further improve the sound mixing effect of a multi-party audio conference.
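A minimal sketch of such a digital filter, assuming an 8 kHz narrowband sampling rate and a Butterworth design (the description fixes only the 50 Hz to 3000 Hz passband):

```python
import numpy as np
from scipy.signal import butter, lfilter

def filter_voice_band(linear_samples, fs=8000):
    """Band-pass the linear-coded samples to the 50 Hz - 3000 Hz voice band.
    The 4th-order Butterworth design and the 8 kHz rate are assumptions."""
    b, a = butter(4, [50, 3000], btype="bandpass", fs=fs)
    return lfilter(b, a, np.asarray(linear_samples, dtype=float))
```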
In this optional embodiment, as an optional implementation manner, a specific manner of processing the first target audio information by using a preset second parsing rule is as follows:
and analyzing the packet header of the adapted audio data (RTP voice packet coded by PCM), and analyzing to obtain the voice channel corresponding to the adapted audio data according to the information of destination MAC, destination IP and the like in the packet header.
Optionally, any conference terminal corresponds to a unique voice channel, and a Mac address, an ip address, and the like are allocated according to packet header information of the conference terminal.
The first target audio information is sliced and written into an external storage device of the corresponding voice channel.
Alternatively, the above slice format may be a 10-byte cps slice.
And carrying out timing reading time configuration on the voice channel, and periodically reading the data of the voice channel from the external storage device and adapting the data to be a TDM data stream.
Alternatively, the timing read time is set to read 16 cps pieces, i.e., 160 bytes of audio data, every 20ms to constitute one PCM-encoded RTP packet.
Therefore, the conference voice processing method described in the embodiment of the present invention guarantees the continuity and stability of the sound by performing slice caching and timing reading on the filtered first target audio information, can effectively alleviate the problem of easy sound leakage caused by the unstable RTP packet, and is beneficial to adapting to a complex conference environment, thereby improving the sound mixing effect of a multi-party audio conference.
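As an illustration of the slice caching and timed reading described above, the sketch below buffers 10-byte cps slices per voice channel and drains 16 of them (160 bytes, i.e. 20 ms of 8 kHz, 8-bit audio) on each timer tick; the class and method names are illustrative.

```python
from collections import deque

SLICE_BYTES = 10        # one cps slice
SLICES_PER_FRAME = 16   # 16 slices * 10 bytes = 160 bytes = 20 ms at 8 kHz, 8 bit

class VoiceChannelBuffer:
    """Per-voice-channel cache: written as 10-byte slices, drained every 20 ms."""

    def __init__(self):
        self._slices = deque()

    def write(self, first_target_audio: bytes):
        # Slice the filtered first target audio information and cache it.
        for i in range(0, len(first_target_audio), SLICE_BYTES):
            self._slices.append(first_target_audio[i:i + SLICE_BYTES])

    def read_frame(self) -> bytes:
        # Called by the 20 ms timer: return one 160-byte frame, padded with
        # zero slices if the channel has temporarily fallen behind, ready to
        # be adapted into the TDM data stream.
        frame = bytearray()
        for _ in range(SLICES_PER_FRAME):
            frame += self._slices.popleft() if self._slices else bytes(SLICE_BYTES)
        return bytes(frame)
```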
In another optional embodiment, the processing the terminal type information by using a preset terminal type rule to obtain a mixing rule set specifically includes:
each terminal type corresponds to a unique preset terminal type rule;
the terminal type rule comprises an amplitude characteristic value threshold and a frequency characteristic value threshold corresponding to the conference terminal of the type and is used for judging whether any conference terminal of the type participates in sound mixing;
as an optional implementation manner, the specific manner of determining the terminal type rule is as follows:
For a conference terminal whose channel type is a wired channel, the following parameters are acquired from the switch bearing the conference terminal: the amplitude characteristic value Dm1 and frequency characteristic value Zm1 of microphone silence when background noise is low, and the amplitude characteristic value Dn1 of a person speaking at normal volume.
The amplitude eigenvalue threshold D1 and the frequency eigenvalue threshold Z1 of a conference terminal whose channel type is the wired channel are respectively:
D1=Dm1+A*(Dn1-Dm1)
Z1=Zm1*B
a, B is a configurable parameter, where a is 0.2 and B is 1.1 in general, which is suitable for most conference situations.
Optionally, if the environmental noise greatly changes during a call, the value of A, B needs to be adjusted appropriately, generally without changing, and a default value is used.
The amplitude eigenvalue threshold D2 and frequency eigenvalue threshold Z2 of a conference terminal whose channel type is a short-distance wireless channel are set in the same way as for the wired-channel conference terminal, and the description is not repeated here.
The amplitude eigenvalue threshold D3 and frequency eigenvalue threshold Z3 of a conference terminal whose channel type is a long-distance wireless channel are set in the same way as for the wired-channel conference terminal, and the description is not repeated here.
In this embodiment, default thresholds are adopted for conference terminals of channel types other than the wired channel, short-distance wireless channel and long-distance wireless channel, namely the amplitude eigenvalue threshold D4 and the frequency eigenvalue threshold Z4, calculated as follows:
D4=(D1+D2+D3)/6
Z4=(Z1+Z2+Z3)/2
therefore, by implementing the conference voice processing method described in the embodiment of the present invention, different thresholds are set for different recognized types of conference terminal channels, so that whether to activate the conference terminal can be determined according to different characteristics of different conference terminal connection channels, and voice quality can be improved. The embodiment obtains the activation threshold suitable for the voice characteristics of different conference terminal devices by analyzing the amplitude and frequency characteristics of the different conference terminal devices. For example, analog telephony, because it is a wired channel transmission, has a background noise spectrum that is a relatively lower proportion of the total spectrum than a radio station transmitting over a wireless channel, and has a relatively high energy transmission efficiency, while the frequency characteristic threshold set in the manner described above is relatively small and the amplitude characteristic threshold is relatively large, matching the characteristics of the audio transmitted over the wired channel.
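The threshold derivation above can be summarized in a short sketch; the function names are illustrative, the constants follow the defaults A = 0.2 and B = 1.1 given in the description, and the numeric example values are hypothetical.

```python
def type_thresholds(d_m: float, z_m: float, d_n: float,
                    a: float = 0.2, b: float = 1.1):
    """Amplitude and frequency activation thresholds for one channel type:
    D = Dm + A * (Dn - Dm), Z = Zm * B."""
    return d_m + a * (d_n - d_m), z_m * b

def default_type_thresholds(wired, short_wireless, long_wireless):
    """Default thresholds for other channel types:
    D4 = (D1 + D2 + D3) / 6, Z4 = (Z1 + Z2 + Z3) / 2."""
    d4 = (wired[0] + short_wireless[0] + long_wireless[0]) / 6
    z4 = (wired[1] + short_wireless[1] + long_wireless[1]) / 2
    return d4, z4

# Hypothetical measurements for a wired terminal: Dm1 = 50, Zm1 = 40, Dn1 = 400
d1, z1 = type_thresholds(50, 40, 400)   # -> (120.0, ~44.0)
```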
In yet another optional embodiment, the processing the audio analysis result set by using the mixing rule includes:
judging whether any audio analysis result meets the condition of participating in audio mixing according to the audio mixing rule;
mixing the audio analysis results meeting the audio mixing condition, packaging the audio analysis results into RTP packets, sending the RTP packets to the switches of a plurality of conference terminals participating in the conference, and sending the RTP packets to each conference terminal through the switches;
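The mixing and repackaging step is not spelled out in detail in the description; a common approach, shown below as an assumption, is to sum the linear frames of the participating terminals with saturation before re-encoding them into the PCM RTP payload.

```python
import numpy as np

def mix_participating_frames(frames):
    """Sum the 16-bit linear frames of the terminals judged to participate in
    mixing, saturating to the int16 range (saturating summation is an assumed
    choice; the description does not specify the summation method)."""
    if not frames:
        return np.zeros(160, dtype=np.int16)   # one 20 ms frame of silence
    acc = np.sum(np.stack([np.asarray(f, dtype=np.int32) for f in frames]), axis=0)
    return np.clip(acc, -32768, 32767).astype(np.int16)
```

In practice one such mix would be produced per receiving terminal, excluding that terminal's own audio, consistent with the requirement stated in the Background.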
in this optional embodiment, as a first optional determining manner, determining whether the mixing rule satisfies a mixing participation condition according to the mixing rule includes:
analyzing the adaptive audio data of any conference terminal to obtain a first amplitude characteristic value and a first frequency characteristic value;
wherein the first amplitude characteristic value calculation formula is as follows:
[The formula is shown as an image in the source: it computes p_all from the per-sample magnitudes d_m.]
where p_all is the amplitude characteristic value and d_m is the value of the lower seven bits of the m-th sampling point, i.e. the absolute value of the audio signal amplitude at that sampling point. The number of sampling points can be set as required and is not limited in this embodiment.
The calculation formula of the first frequency characteristic value is as follows:
[The formula is shown as an image in the source: it computes Z from the per-sample sign bits z_n.]
where Z is the frequency characteristic value and z_n is the highest bit of the n-th sampling point, representing whether the sample is positive or negative. The number of sampling points can be set as required and is not limited in this embodiment.
And if the first amplitude characteristic value is greater than the amplitude activation threshold corresponding to the conference terminal and the first frequency characteristic value is less than the frequency activation threshold corresponding to the conference terminal, judging that the conference terminal participates in sound mixing, otherwise, judging that the conference terminal does not participate in sound mixing.
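The exact formulas appear only as equation images in the source; one reading consistent with the surrounding definitions, and only an assumption here, is to average the lower-seven-bit magnitudes for the amplitude characteristic value and to count sign-bit changes for the frequency characteristic value, as sketched below.

```python
def characteristic_values(samples: bytes):
    """Amplitude and frequency characteristic values of one packet of 8-bit
    sign-magnitude samples. Averaging the magnitudes and counting sign-bit
    changes is an assumed reading of the formulas shown as images above."""
    if not samples:
        return 0.0, 0
    magnitudes = [s & 0x7F for s in samples]        # lower seven bits: |amplitude|
    signs = [s >> 7 for s in samples]               # highest bit: positive/negative
    amplitude = sum(magnitudes) / len(magnitudes)
    frequency = sum(a != b for a, b in zip(signs, signs[1:]))  # sign changes
    return amplitude, frequency
```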
In this optional embodiment, as a second optional determining manner, determining whether the mixing rule satisfies a mixing participation condition according to the mixing rule includes:
analyzing second target audio information of any conference terminal to obtain a second amplitude characteristic value and a second frequency characteristic value;
in the embodiment of the present invention, the calculation manner of the second amplitude characteristic value and the second frequency characteristic value may refer to the calculation manner of the first amplitude characteristic value and the first frequency characteristic value, and is not described again.
And if the second amplitude characteristic value is greater than the amplitude activation threshold corresponding to the conference terminal and the second frequency characteristic value is less than the frequency activation threshold corresponding to the conference terminal, judging that the conference terminal participates in sound mixing, otherwise, judging that the conference terminal does not participate in sound mixing.
In this optional embodiment, as a third optional determining manner, determining whether the mixing rule satisfies a mixing participation condition according to the mixing rule includes:
s1401, analyzing the adaptive audio data of any conference terminal to obtain a first amplitude characteristic value and a first frequency characteristic value;
s1402, if the first amplitude characteristic value is larger than the amplitude activation threshold corresponding to the conference terminal and the first frequency characteristic value is smaller than the frequency activation threshold corresponding to the conference terminal, preliminarily judging that the conference terminal participates in sound mixing, and executing S1403; otherwise, the preliminary judgment is that the audio mixing is not involved, and S1405 is executed;
s1403, analyzing second target audio information of the conference terminal to obtain a second amplitude characteristic value and a second frequency characteristic value;
s1404, if the second amplitude characteristic value is smaller than the amplitude activation threshold corresponding to the conference terminal, or the second frequency characteristic value is larger than the frequency activation threshold corresponding to the conference terminal, determining whether there is valid voice in the first R pieces of adapted audio data, where R is a positive integer; if yes, the voice mixing is finally judged to be involved; if not, finally judging that the sound mixing is not participated in;
s1405, performing the operations from S1401 to S1404 on all the conference terminals to obtain a result of determining whether each conference terminal participates in mixing.
Optionally, R is set to 7. If R is too small, the protection against misjudgment is not obvious; if R is too large, the amount of calculation becomes excessive, reducing the efficiency of the effective-voice judgment and affecting the mixing efficiency.
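Putting S1401 to S1404 together for a single terminal, the decision can be sketched as follows; has_valid_speech stands in for the effective-voice test, and R = 7 follows the suggestion above.

```python
R = 7  # look-back window over earlier adapted-audio packets, as suggested above

def participates_in_mixing(first_amp, first_freq, second_amp, second_freq,
                           amp_threshold, freq_threshold, recent_packets,
                           has_valid_speech):
    """Three-stage mixing decision (S1401-S1404) for one conference terminal."""
    # S1402: preliminary judgment on the adapted audio data (RTP side).
    if not (first_amp > amp_threshold and first_freq < freq_threshold):
        return False
    # S1404: cross-check against the second target audio information (TDM side);
    # if it disagrees, look for valid speech in the previous R adapted packets.
    if second_amp < amp_threshold or second_freq > freq_threshold:
        return any(has_valid_speech(p) for p in recent_packets[-R:])
    return True
```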
By adopting the mode, the following advantages are achieved:
(1) Since the RTP packet data (the adapted audio data) is more timely than the TDM stream data (the second target audio information), performing the preliminary judgment on the RTP packet data allows the transition of the voice packets from absent to present to be recognized more quickly.
(2) Since RTP packets are not particularly stable, the preliminary judgment above is more prone to error; voice termination is therefore judged again on the TDM stream data (the second target audio information), and a terminal is finally excluded only when none of R consecutive voice packets contains valid voice, which prevents misjudgment.
(3) Synthesizing the analysis results of RTP packet data (adaptive audio data) and TDM stream data (second target audio information) to automatically activate and control the participants of the voice conference, so that the activation control is more accurate;
under different meeting environment requirements, different sound mixing rule strategies are adopted in the implementation mode, and user requirements can be better matched. For example, when the real-time requirement is very high, the first or second mixing rule strategy can be adopted; when the requirements for noise removal and continuity are very high, a third mixing rule may be employed.
Therefore, the conference voice processing method described in the embodiment of the invention sets different sound mixing rules, can match the complexity of the conference environment and different sound mixing requirements, and is beneficial to adapting to the complex conference environment.
Example two
Another conference voice processing method disclosed in the embodiment of the present invention may include the following operations:
201. receiving audio data packets of a plurality of conference terminals and carrying out format adaptation to obtain an adaptive audio data set, wherein the adaptive audio data set comprises a plurality of adaptive audio data; scanning each conference terminal to obtain a terminal type information set, wherein the terminal type information set comprises a plurality of terminal type information;
202. analyzing the adaptive audio data set by using a preset voice analysis rule to obtain an audio analysis result set; the audio analysis result set comprises a plurality of audio analysis results;
203. processing the terminal type information by using a preset terminal type rule to obtain a sound mixing rule set;
204. and processing the audio analysis result set by using the audio mixing rule, and mixing the audio analysis results conforming to the audio mixing rule.
In the embodiment of the present invention, the processing of the audio analysis result set by using the mixing rule is performed synchronously with step 202. The method specifically comprises the following steps:
and mirroring the adaptive audio data of each conference terminal into two paths to be processed simultaneously: wherein the first path performs step 202; the second path analyzes the adaptive audio data of each conference terminal and extracts a first amplitude characteristic and a first frequency characteristic; and analyzing the second target audio information of each conference terminal, and extracting a second amplitude characteristic value and a second frequency characteristic value.
Optionally, as a fourth optional determining means, determining whether the mixing rule satisfies a mixing participation condition according to the mixing rule, specifically:
s2401, analyzing adaptive audio data of any conference terminal to obtain a first amplitude characteristic value and a first frequency characteristic value;
s2402, if the first amplitude characteristic value is greater than the amplitude activation threshold corresponding to the conference terminal and the first frequency characteristic value is less than the frequency activation threshold corresponding to the conference terminal, preliminarily judging that the conference terminal participates in sound mixing, and executing S2403; otherwise, the preliminary judgment is that the sound mixing is not involved, and S2405 is executed;
s2403, analyzing second target audio information of the conference terminal to obtain a second amplitude characteristic value and a second frequency characteristic value;
s2404, if the second amplitude characteristic value is smaller than an amplitude activation threshold corresponding to the conference terminal, or the second frequency characteristic value is larger than a frequency activation threshold corresponding to the conference terminal, judging whether effective voices exist in the previous R pieces of adaptive audio data, wherein R is a positive integer; if yes, the voice mixing is finally judged to be involved; if not, finally judging that the sound mixing is not participated in;
s2405, performing the operations from S2401 to S2404 on all the conference terminals, and obtaining a result of determining whether each conference terminal participates in audio mixing.
Optionally, R is set to 7. If R is too small, the protection against misjudgment is not obvious; if R is too large, the amount of calculation becomes excessive, reducing the efficiency of the effective-voice judgment and affecting the mixing efficiency.
It can be seen that, in addition to having the advantages of the third mixing rule described in the first embodiment, the conference voice processing method described in this embodiment processes two mirrored paths simultaneously: the first path analyzes the adapted audio data set by using the preset voice analysis rule, and the second path processes the audio analysis result set by using the obtained mixing rules. Finally, the audio analysis results that conform to the mixing rule are mixed.
Therefore, by implementing the conference voice processing method described in the embodiment of the present invention, whether the voice data of the conference terminal participates in the audio mixing is determined by bypassing the mirror image processing, so that the requirements of removing noise and voice continuity can be satisfied while obviously improving the timeliness of the voice in the conference.
In the embodiment of the present invention, for specific technical details and technical noun explanations of step 201 to step 203, reference may be made to the detailed description of step 101 to step 103 in the first embodiment, and details are not repeated in the embodiment of the present invention.
EXAMPLE III
Referring to fig. 3, fig. 3 is a schematic structural diagram of a conference voice processing apparatus according to an embodiment of the present invention. The apparatus described in fig. 3 can be applied to a data processing system, such as a platform side or a terminal side for conference voice processing, and the embodiment of the present invention is not limited thereto. As shown in fig. 3, the apparatus may include:
a scanning receiving module 301, configured to receive audio data packets of multiple conference terminals and perform format adaptation to obtain an adapted audio data set, where the adapted audio data set includes multiple adapted audio data; scanning each conference terminal to obtain a terminal type information set, wherein the terminal type information set comprises a plurality of terminal type information;
the first processing module 302 is configured to analyze the adaptive audio data set by using a preset voice analysis rule to obtain an audio analysis result set; the audio analysis result set comprises a plurality of audio analysis results;
a second processing module 303, configured to process the terminal type information by using a preset terminal type rule to obtain a sound mixing rule set; the terminal type rule comprises an amplitude characteristic value threshold value and a frequency characteristic value threshold value;
the third processing module 304 is configured to process the audio analysis result set by using the audio mixing rule, and mix audio analysis results meeting the audio mixing rule.
It can be seen that, by implementing the conference voice processing apparatus described in fig. 3, it is possible to obtain audio terminal type information by scanning a conference terminal, and process the audio terminal type information by using a terminal type rule to obtain a mixing rule set, and then process an audio data analysis result by using a mixing rule to mix audio data conforming to the mixing rule, thereby solving the problem of mismixing (missing mixing, multiple mixing, and intermittence) occurring during mixing audio, facilitating adaptation to a complex conference environment, and further improving the mixing effect of a multi-party audio conference.
In another alternative embodiment, as shown in fig. 3, the preset voice parsing rules include a first parsing rule and a second parsing rule;
the first processing module 302 includes a first conversion sub-module, a second conversion sub-module, wherein:
the first conversion sub-module 3021 is configured to, for adaptive audio data of any conference terminal, process the adaptive audio data of the conference terminal by using a preset first parsing rule to obtain first target audio information corresponding to the conference terminal;
the processing of the adapted audio data by using the preset first analysis rule includes analyzing the adapted audio data of the conference terminal, extracting audio information for coding, decoding and converting, and establishing a digital filter for digitally filtering the audio information to obtain the first target audio information;
the second conversion submodule 3022, configured to process the first target audio information by using a preset second parsing rule, to obtain second target audio information corresponding to the conference terminal;
and processing the first target audio information by using a preset second analysis rule, wherein the step of slicing the first target audio information is included, the first target audio information is written into an external storage device of a voice channel corresponding to the conference terminal, and the data of the voice channel is read at regular time and is adapted to be a TDM data stream to obtain the second target audio information.
As can be seen, with the conference voice processing apparatus described in fig. 3, by analyzing the packet header of the RTP voice packet encoded by the PCM corresponding to each conference terminal, extracting the audio information and performing encoding, decoding and conversion, and then establishing a digital filter to perform digital filtering on the audio information after encoding, decoding and conversion, data in the non-human voice spectrum range in the voice can be filtered; the continuity and stability of sound are guaranteed by carrying out slice caching and timing reading on the filtered first target audio information, the problem that an RTP packet is not stable enough to cause sound leakage easily can be effectively solved, the method is favorable for adapting to a complex conference environment, and the sound mixing effect of a multi-party audio conference is further improved.
In another optional embodiment, the specific way of processing the terminal type information by using a preset terminal type rule in the second processing module 303 to obtain the mixing rule set is as follows:
each terminal type corresponds to a unique preset terminal type rule;
the terminal type rule comprises an amplitude characteristic value threshold and a frequency characteristic value threshold corresponding to the conference terminal of the type and is used for judging whether any conference terminal of the type participates in sound mixing;
the detailed description of the first embodiment of determining the terminal type rule is already provided, and is not repeated here.
It can be seen that, with the conference speech processing apparatus described in fig. 3, different thresholds are set for different recognized types of conference terminal channels, so that whether the conference terminal is activated can be determined according to different characteristics of different conference terminal connection channels, and the voice quality can be improved.
In another alternative embodiment, the third processing module 304 includes a first determining submodule 3041, a second determining submodule 3042, a third determining submodule 3043:
the first judgment sub-module 3041: the adaptive audio data analysis module is used for analyzing the adaptive audio data of any conference terminal to obtain a first amplitude characteristic value and a first frequency characteristic value; and if the first amplitude characteristic value is greater than the amplitude activation threshold corresponding to the conference terminal and the first frequency characteristic value is less than the frequency activation threshold corresponding to the conference terminal, judging that the conference terminal participates in sound mixing, otherwise, judging that the conference terminal does not participate in sound mixing.
Second determination sub-module 3042: the second target audio information of any conference terminal is analyzed to obtain a second amplitude characteristic value and a second frequency characteristic value; and if the second amplitude characteristic value is greater than the amplitude activation threshold corresponding to the conference terminal and the second frequency characteristic value is less than the frequency activation threshold corresponding to the conference terminal, judging that the conference terminal participates in sound mixing, otherwise, judging that the conference terminal does not participate in sound mixing.
Third determination sub-module 3043: the adaptive audio data analysis device is used for analyzing the adaptive audio data of any conference terminal to obtain a first amplitude characteristic value and a first frequency characteristic value; if the first amplitude characteristic value is larger than the amplitude activation threshold value corresponding to the conference terminal and the first frequency characteristic value is smaller than the frequency activation threshold value corresponding to the conference terminal, preliminarily judging that the conference terminal participates in sound mixing, and executing the next step; otherwise, the preliminary judgment is that the voice mixing is not participated.
Analyzing second target audio information of the conference terminal to obtain a second amplitude characteristic value and a second frequency characteristic value; if the second amplitude characteristic value is smaller than the amplitude activation threshold value corresponding to the conference terminal, or the second frequency characteristic value is larger than the frequency activation threshold value corresponding to the conference terminal, judging whether effective voices exist in the first R pieces of adaptive audio data, wherein R is a positive integer; if yes, the voice mixing is finally judged to be involved; if not, finally judging that the sound mixing is not participated in;
and performing the operation on all the conference terminals to obtain a judgment result of whether each conference terminal participates in sound mixing.
Therefore, the conference voice processing device described in fig. 3 is implemented, different mixing rules are set, the complexity of the conference environment and different mixing requirements can be matched, and the conference voice processing device is beneficial to adapting to the complex conference environment.
In another optional embodiment, the first processing module 302 and the third processing module 304 simultaneously process the adaptive audio data of each conference terminal, specifically:
the first processing module 302 is configured to analyze the adapted audio data set by using a preset voice analysis rule to obtain an audio analysis result set; the third processing module 304 is configured to analyze the adapted audio data of each conference terminal, and extract a first amplitude characteristic and a first frequency characteristic; and analyzing the second target audio information of each conference terminal, and extracting a second amplitude characteristic value and a second frequency characteristic value.
According to the third determining sub-module 3043, the audio analysis result set is processed by using the audio mixing rule, which is not described herein again.
Therefore, by implementing the conference voice processing apparatus described in fig. 3, whether the voice data of the conference terminal participates in sound mixing is determined in a manner of bypass mirroring and simultaneous processing of two paths, so that the requirements of removing noise and voice continuity can be met while obviously improving the timeliness of the voice in a conference.
Example four
Referring to fig. 4, fig. 4 is a schematic structural diagram of another conference voice processing apparatus according to an embodiment of the present invention. The apparatus described in fig. 4 can be applied to a data processing system, such as a local server or a cloud server used for conference voice processing, and the embodiment of the present invention is not limited thereto.
As shown in fig. 4, the apparatus may include:
a memory 401 storing executable program code;
a processor 402 coupled with the memory 401;
the processor 402 calls the executable program code stored in the memory 401 to execute the steps of the conference voice processing method described in the first embodiment or the second embodiment.
EXAMPLE five
The embodiment of the invention discloses a computer-readable storage medium which stores a computer program for electronic data exchange, wherein the computer program causes a computer to execute the steps of the conference voice processing method described in the first embodiment or the second embodiment.
EXAMPLE six
The embodiment of the invention discloses a computer program product, which comprises a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute the steps of the conference voice processing method described in the first embodiment or the second embodiment.
The above-described embodiments of the apparatus are merely illustrative, and the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above detailed description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a necessary general hardware platform, or by hardware. Based on this understanding, the above technical solutions, or the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, the storage medium including a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage, a magnetic disk or other magnetic storage, a tape storage, or any other computer-readable medium that can be used to carry or store data.
Finally, it should be noted that the conference voice processing method and apparatus disclosed in the embodiments of the present invention are only preferred embodiments of the present invention, used solely to illustrate its technical solutions and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A conference voice processing method, comprising:
101, receiving audio data packets of a plurality of conference terminals and performing format adaptation to obtain an adapted audio data set, wherein the adapted audio data set comprises a plurality of adapted audio data; scanning each conference terminal to obtain a terminal type information set, wherein the terminal type information set comprises a plurality of terminal type information;
102, analyzing the adaptive audio data set by using a preset voice analysis rule to obtain an audio analysis result set; the audio analysis result set comprises a plurality of audio analysis results;
103, processing the terminal type information by using a preset terminal type rule to obtain a sound mixing rule set; the terminal type rule comprises an amplitude characteristic value threshold and a frequency characteristic value threshold;
and 104, processing the audio analysis result set by using the sound mixing rule set, and mixing the audio analysis results that conform to the sound mixing rules.
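
For illustration only, the following Python sketch walks through steps 101 to 104 of claim 1 end to end. Everything concrete in it — the helper names, the assumption that every packet carries 16-bit PCM, and the use of a peak level and a zero-crossing count as stand-ins for the amplitude and frequency characteristic values — is an assumption of this sketch, not something defined by the claims.

import struct
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class MixingRule:
    # Per terminal-type thresholds produced in step 103 (values are illustrative).
    amplitude_threshold: float
    frequency_threshold: float

def adapt_format(packet: bytes) -> List[float]:
    # Step 101 stand-in: treat every packet as little-endian 16-bit PCM
    # and normalise the samples to [-1, 1].
    samples = struct.unpack("<%dh" % (len(packet) // 2), packet)
    return [s / 32768.0 for s in samples]

def parse_audio(samples: List[float]) -> Dict[str, float]:
    # Step 102 stand-in: a toy "voice analysis rule" producing two features.
    peak = max((abs(s) for s in samples), default=0.0)
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return {"amplitude": peak, "frequency": float(crossings)}

def process(packets: Dict[str, bytes],
            terminal_types: Dict[str, str],
            type_rules: Dict[str, MixingRule]) -> List[str]:
    adapted = {tid: adapt_format(p) for tid, p in packets.items()}      # step 101
    parsed = {tid: parse_audio(s) for tid, s in adapted.items()}        # step 102
    rules = {tid: type_rules[terminal_types[tid]] for tid in packets}   # step 103
    # Step 104: keep only the terminals whose features satisfy their mixing rule.
    return [tid for tid, feat in parsed.items()
            if feat["amplitude"] > rules[tid].amplitude_threshold
            and feat["frequency"] < rules[tid].frequency_threshold]

if __name__ == "__main__":
    pcm = struct.pack("<4h", 12000, -11000, 9000, -8000)
    rules = {"soft_phone": MixingRule(amplitude_threshold=0.1, frequency_threshold=4.0)}
    print(process({"t1": pcm}, {"t1": "soft_phone"}, rules))  # ['t1']
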
2. The conference voice processing method according to claim 1, wherein analyzing the adaptive audio data set by using the preset voice analysis rule comprises:
for the adaptive audio data of any conference terminal, processing the adaptive audio data of the conference terminal by using a preset first analysis rule to obtain first target audio information corresponding to the conference terminal;
wherein processing the adaptive audio data by using the preset first analysis rule comprises: analyzing the adaptive audio data of the conference terminal, extracting the audio information for encoding, decoding and conversion, and establishing a digital filter to digitally filter the audio information, to obtain the first target audio information;
processing the first target audio information by using a preset second analysis rule to obtain second target audio information corresponding to the conference terminal;
and processing the first target audio information by using the preset second analysis rule comprises: slicing the first target audio information, writing the sliced first target audio information into an external storage device of the voice channel corresponding to the conference terminal, and periodically reading the data of the voice channel and adapting the data into a TDM data stream, to obtain the second target audio information.
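
The second analysis rule of claim 2 can be pictured roughly as below: slice the first target audio into fixed-size frames, queue the frames per voice channel, and periodically read one frame per channel back out as a TDM-style stream. The frame size, the in-memory deques standing in for the external storage device, and the round-robin channel order are all assumptions of this sketch.

from collections import deque
from typing import Deque, Dict, List

FRAME = 160  # samples per slice (e.g. 20 ms at 8 kHz) -- assumed value

def slice_and_store(channel_buffers: Dict[str, Deque[List[float]]],
                    terminal_id: str, first_target_audio: List[float]) -> None:
    # Slice one terminal's first target audio and append the slices to its
    # voice-channel buffer (standing in for the external storage device).
    buf = channel_buffers.setdefault(terminal_id, deque())
    for start in range(0, len(first_target_audio), FRAME):
        frame = first_target_audio[start:start + FRAME]
        if len(frame) == FRAME:          # drop a trailing partial frame
            buf.append(frame)

def read_tdm_frame(channel_buffers: Dict[str, Deque[List[float]]]) -> List[float]:
    # Periodic read: take one slice per channel in a fixed order and
    # concatenate them into one TDM-style frame (the second target audio).
    tdm: List[float] = []
    for terminal_id in sorted(channel_buffers):
        if channel_buffers[terminal_id]:
            tdm.extend(channel_buffers[terminal_id].popleft())
    return tdm

if __name__ == "__main__":
    buffers: Dict[str, Deque[List[float]]] = {}
    slice_and_store(buffers, "t1", [0.1] * 480)   # 3 slices
    slice_and_store(buffers, "t2", [0.2] * 320)   # 2 slices
    print(len(read_tdm_frame(buffers)))           # 320 (one slice per channel)
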
3. The conference voice processing method according to claim 1, wherein processing the audio analysis result set by using the mixing rule includes:
analyzing the adaptive audio data of any conference terminal to obtain a first amplitude characteristic value and a first frequency characteristic value;
and if the first amplitude characteristic value is greater than the amplitude activation threshold corresponding to the conference terminal and the first frequency characteristic value is less than the frequency activation threshold corresponding to the conference terminal, determining that the conference terminal participates in sound mixing; otherwise, determining that the conference terminal does not participate in sound mixing.
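
The activation test in this claim can be illustrated with the sketch below, which treats the first amplitude characteristic value as an RMS level and the first frequency characteristic value as the dominant FFT frequency. Those two concrete feature choices, the sample rate and the example thresholds are assumptions made for this illustration, not definitions taken from the claims.

import numpy as np

def amplitude_feature(samples: np.ndarray) -> float:
    # RMS level as a stand-in for the amplitude characteristic value.
    return float(np.sqrt(np.mean(samples ** 2)))

def frequency_feature(samples: np.ndarray, sample_rate: int) -> float:
    # Dominant FFT frequency as a stand-in for the frequency characteristic value.
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return float(freqs[int(np.argmax(spectrum[1:])) + 1])  # skip the DC bin

def participates_in_mixing(samples: np.ndarray, sample_rate: int,
                           amp_threshold: float, freq_threshold: float) -> bool:
    # A terminal joins the mix only when its amplitude feature exceeds the
    # amplitude activation threshold AND its frequency feature stays below
    # the frequency activation threshold.
    return (amplitude_feature(samples) > amp_threshold
            and frequency_feature(samples, sample_rate) < freq_threshold)

if __name__ == "__main__":
    sr = 8000
    t = np.arange(sr) / sr
    speech_like = 0.5 * np.sin(2 * np.pi * 300 * t)   # strong 300 Hz tone
    hiss_like = 0.01 * np.sin(2 * np.pi * 3500 * t)   # quiet high-frequency tone
    print(participates_in_mixing(speech_like, sr, 0.05, 1000.0))  # True
    print(participates_in_mixing(hiss_like, sr, 0.05, 1000.0))    # False
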
4. The conference voice processing method according to claim 2, wherein processing the audio analysis result set by using the mixing rule includes:
analyzing second target audio information of any conference terminal to obtain a second amplitude characteristic value and a second frequency characteristic value;
and if the second amplitude characteristic value is greater than the amplitude activation threshold corresponding to the conference terminal and the second frequency characteristic value is less than the frequency activation threshold corresponding to the conference terminal, determining that the conference terminal participates in sound mixing; otherwise, determining that the conference terminal does not participate in sound mixing.
5. The conference voice processing method according to claim 2, wherein processing the audio analysis result set by using the mixing rule includes:
S1401, analyzing the adaptive audio data of any conference terminal to obtain a first amplitude characteristic value and a first frequency characteristic value;
S1402, if the first amplitude characteristic value is greater than the amplitude activation threshold corresponding to the conference terminal and the first frequency characteristic value is less than the frequency activation threshold corresponding to the conference terminal, preliminarily determining that the conference terminal participates in sound mixing, and executing S1403; otherwise, preliminarily determining that the conference terminal does not participate in sound mixing, and executing S1405;
S1403, analyzing the second target audio information of the conference terminal to obtain a second amplitude characteristic value and a second frequency characteristic value;
S1404, if the second amplitude characteristic value is less than the amplitude activation threshold corresponding to the conference terminal, or the second frequency characteristic value is greater than the frequency activation threshold corresponding to the conference terminal, determining whether valid voice exists in the first R pieces of the adaptive audio data, where R is a positive integer; if so, finally determining that the conference terminal participates in sound mixing; if not, finally determining that the conference terminal does not participate in sound mixing;
and S1405, performing the operations of S1401 to S1404 on all the conference terminals to obtain a determination result of whether each conference terminal participates in sound mixing.
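
A compact sketch of the staged decision S1401 to S1405 is given below. Each terminal supplies a precomputed (amplitude, frequency) pair for the adapted audio, another pair for the second target audio, and a short history of voice-activity flags used for the R-packet check; these data shapes, and the reading of the R-packet check as a look at recently received packets, are assumptions of this sketch rather than details fixed by the claim.

from typing import Dict, List, Tuple

Features = Tuple[float, float]          # (amplitude value, frequency value)

def decide_mixing(first: Features, second: Features,
                  recent_voice_flags: List[bool],
                  amp_threshold: float, freq_threshold: float,
                  r: int = 5) -> bool:
    amp1, freq1 = first
    # S1401/S1402: preliminary decision on the adapted audio features.
    if not (amp1 > amp_threshold and freq1 < freq_threshold):
        return False                                    # preliminarily out
    # S1403/S1404: confirm on the second target audio features.
    amp2, freq2 = second
    if amp2 > amp_threshold and freq2 < freq_threshold:
        return True                                     # confirmed in
    # S1404 fallback: was there valid voice among the last R packets?
    return any(recent_voice_flags[-r:])

def decide_all(terminals: Dict[str, dict],
               amp_threshold: float, freq_threshold: float) -> Dict[str, bool]:
    # S1405: repeat the decision for every conference terminal.
    return {tid: decide_mixing(term["first"], term["second"], term["history"],
                               amp_threshold, freq_threshold)
            for tid, term in terminals.items()}

if __name__ == "__main__":
    terminals = {
        "t1": {"first": (0.4, 300.0), "second": (0.3, 350.0), "history": []},
        "t2": {"first": (0.4, 280.0), "second": (0.01, 320.0),
               "history": [False, True, False]},
        "t3": {"first": (0.01, 300.0), "second": (0.3, 300.0), "history": []},
    }
    print(decide_all(terminals, amp_threshold=0.05, freq_threshold=1000.0))
    # {'t1': True, 't2': True, 't3': False}
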
6. The conference voice processing method according to claim 2, wherein the adaptive audio data of each conference terminal is mirrored into two paths that are processed simultaneously: the first path executes step 102, analyzing the adaptive audio data set by using the preset voice analysis rule to obtain the audio analysis result set; the second path executes step 104, analyzing the adaptive audio data of each conference terminal and extracting the first amplitude characteristic value and the first frequency characteristic value, and analyzing the second target audio information of each conference terminal and extracting the second amplitude characteristic value and the second frequency characteristic value.
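
Mirroring each terminal's adapted audio into two concurrently processed copies, as claim 6 describes, might look like the sketch below. The thread pool, the toy parsing and feature functions, and the specific feature definitions are assumptions of this sketch.

from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

def parsing_path(samples: List[float]) -> List[float]:
    # First path: stand-in for the preset voice analysis rule of step 102.
    return [min(1.0, max(-1.0, s)) for s in samples]        # toy clamp

def feature_path(samples: List[float]) -> Tuple[float, float]:
    # Second path: extract toy amplitude / frequency-like feature values.
    peak = max((abs(s) for s in samples), default=0.0)
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return peak, float(crossings)

def process_mirrored(adapted: Dict[str, List[float]]):
    parsed, features = {}, {}
    with ThreadPoolExecutor(max_workers=2) as pool:
        for tid, samples in adapted.items():
            copy_a, copy_b = list(samples), list(samples)    # mirror into two paths
            f1 = pool.submit(parsing_path, copy_a)
            f2 = pool.submit(feature_path, copy_b)
            parsed[tid], features[tid] = f1.result(), f2.result()
    return parsed, features

if __name__ == "__main__":
    parsed, features = process_mirrored({"t1": [0.2, -0.4, 1.3, -0.1]})
    print(features["t1"])   # (1.3, 3.0)
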
7. The conference voice processing method according to claim 6, wherein processing the audio analysis result set by using the mixing rule includes:
S2401, analyzing the adaptive audio data of any conference terminal to obtain a first amplitude characteristic value and a first frequency characteristic value;
S2402, if the first amplitude characteristic value is greater than the amplitude activation threshold corresponding to the conference terminal and the first frequency characteristic value is less than the frequency activation threshold corresponding to the conference terminal, preliminarily determining that the conference terminal participates in sound mixing, and executing S2403; otherwise, preliminarily determining that the conference terminal does not participate in sound mixing, and executing S2405;
S2403, analyzing the second target audio information of the conference terminal to obtain a second amplitude characteristic value and a second frequency characteristic value;
S2404, if the second amplitude characteristic value is less than the amplitude activation threshold corresponding to the conference terminal, or the second frequency characteristic value is greater than the frequency activation threshold corresponding to the conference terminal, determining whether valid voice exists in the first R pieces of the adaptive audio data, where R is a positive integer; if so, finally determining that the conference terminal participates in sound mixing; if not, finally determining that the conference terminal does not participate in sound mixing;
and S2405, performing the operations of S2401 to S2404 on all the conference terminals to obtain a determination result of whether each conference terminal participates in sound mixing.
8. A conference voice processing apparatus, characterized in that the apparatus comprises:
the scanning receiving module is used for receiving audio data packets of a plurality of conference terminals and carrying out format adaptation to obtain an adaptive audio data set, wherein the adaptive audio data set comprises a plurality of adaptive audio data; scanning each conference terminal to obtain a terminal type information set, wherein the terminal type information set comprises a plurality of terminal type information;
the first processing module is used for analyzing the adaptive audio data set by using a preset voice analysis rule to obtain an audio analysis result set; the audio analysis result set comprises a plurality of audio analysis results;
the second processing module is used for processing the terminal type information by using a preset terminal type rule to obtain a sound mixing rule set; the terminal type rule comprises an amplitude characteristic value threshold and a frequency characteristic value threshold;
and the third processing module is used for processing the audio analysis result set by using the sound mixing rule set, and mixing the audio analysis results that conform to the sound mixing rules.
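
As a rough picture of how the four modules of claim 8 could fit together, the sketch below wraps them in a single class whose methods hand results to one another; the method names and the trivial toy bodies are assumptions of this sketch.

from typing import Dict, List, Tuple

class ConferenceVoiceProcessor:
    def scan_and_receive(self, packets: Dict[str, bytes]) -> Tuple[Dict, Dict]:
        # Scanning/receiving module: format adaptation plus terminal type scan.
        adapted = {tid: list(p) for tid, p in packets.items()}   # toy adaptation
        terminal_types = {tid: "generic" for tid in packets}     # toy scan result
        return adapted, terminal_types

    def first_processing(self, adapted: Dict[str, List[int]]) -> Dict[str, float]:
        # First processing module: parse adapted audio into analysis results.
        return {tid: max(samples, default=0) / 255.0
                for tid, samples in adapted.items()}

    def second_processing(self, terminal_types: Dict[str, str]) -> Dict[str, float]:
        # Second processing module: derive a mixing threshold per terminal type.
        return {tid: 0.1 for tid in terminal_types}              # toy rule set

    def third_processing(self, analysis: Dict[str, float],
                         rules: Dict[str, float]) -> List[str]:
        # Third processing module: select the terminals that satisfy their rule.
        return [tid for tid, level in analysis.items() if level > rules[tid]]

if __name__ == "__main__":
    proc = ConferenceVoiceProcessor()
    adapted, types = proc.scan_and_receive({"t1": bytes([10, 200, 30])})
    print(proc.third_processing(proc.first_processing(adapted),
                                proc.second_processing(types)))   # ['t1']
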
9. A conference voice processing apparatus, characterized in that the apparatus comprises:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program code stored in the memory to execute the conference voice processing method according to any one of claims 1 to 7.
10. A computer storage medium, wherein the computer storage medium stores computer instructions which, when invoked, are used to perform the conference voice processing method according to any one of claims 1 to 7.
CN202210234843.2A 2022-03-10 2022-03-10 Conference voice processing method and device Active CN114627886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210234843.2A CN114627886B (en) 2022-03-10 2022-03-10 Conference voice processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210234843.2A CN114627886B (en) 2022-03-10 2022-03-10 Conference voice processing method and device

Publications (2)

Publication Number Publication Date
CN114627886A true CN114627886A (en) 2022-06-14
CN114627886B CN114627886B (en) 2024-08-16

Family

ID=81899443

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210234843.2A Active CN114627886B (en) 2022-03-10 2022-03-10 Conference voice processing method and device

Country Status (1)

Country Link
CN (1) CN114627886B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6418125B1 (en) * 1998-06-18 2002-07-09 Cisco Technology, Inc. Unified mixing, speaker selection, and jitter buffer management for multi-speaker packet audio systems
US20020172342A1 (en) * 2001-04-30 2002-11-21 O'malley William Audio conference platform with dynamic speech detection threshold
US20050062843A1 (en) * 2003-09-22 2005-03-24 Bowers Richard D. Client-side audio mixing for conferencing
CN1946029A (en) * 2006-10-30 2007-04-11 北京中星微电子有限公司 Method and its system for treating audio signal
CN106161814A (en) * 2015-03-24 2016-11-23 北京视联动力国际信息技术有限公司 The sound mixing method of a kind of Multi-Party Conference and device
US20200162524A1 (en) * 2017-07-11 2020-05-21 Zte Corporation Control method of multimedia conference terminal and multimedia conference server
CN107276777A (en) * 2017-07-27 2017-10-20 苏州科达科技股份有限公司 The audio-frequency processing method and device of conference system
CN109920445A (en) * 2019-03-04 2019-06-21 北京佳讯飞鸿电气股份有限公司 A kind of sound mixing method, device and equipment
CN110675885A (en) * 2019-10-17 2020-01-10 浙江大华技术股份有限公司 Sound mixing method, device and storage medium
CN111585776A (en) * 2020-05-26 2020-08-25 腾讯科技(深圳)有限公司 Data transmission method, device, equipment and computer readable storage medium
CN111741177A (en) * 2020-06-12 2020-10-02 浙江齐聚科技有限公司 Audio mixing method, device, equipment and medium for online conference
CN112118264A (en) * 2020-09-21 2020-12-22 苏州科达科技股份有限公司 Conference sound mixing method and system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115881131A (en) * 2022-11-17 2023-03-31 广州市保伦电子有限公司 A Method of Speech Transcription under Multi-voice
CN115881131B (en) * 2022-11-17 2023-10-13 广东保伦电子股份有限公司 Voice transcription method under multiple voices
CN117749947A (en) * 2023-12-22 2024-03-22 广东保伦电子股份有限公司 Multi-terminal protocol-based multi-party call processing method and system

Also Published As

Publication number Publication date
CN114627886B (en) 2024-08-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant