CN113572898B - Method and corresponding device for detecting silent abnormality in voice call

Info

Publication number: CN113572898B
Application number: CN202110062809.7A
Authority: CN
Inventors: 陈静聪; 李斌; 罗程
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-01-18
Filing date: 2021-01-18
Publication date: 2023-07-14
Anticipated expiration: 2041-01-18
Also published as: CN113572898A

Abstract

The application provides a method and a corresponding device for detecting a silent abnormality in a voice call. The method comprises the following steps: during a voice call, acquiring at least one opposite-end audio stream, and counting continuous blank frames among the voice frames of each audio stream separately; for each audio stream, if the current voice frame is a target voice frame, acquiring the current continuous blank frame count value; if the continuous blank frame count value is not smaller than a preset value, acquiring the on-off state of the voice acquisition device of the corresponding opposite end; and if the voice acquisition device of the opposite end is in an on state, determining that the corresponding audio stream in the voice call has a silent abnormality. Whether an audio stream has a silent abnormality is thus determined by combining the number of continuous blank frames in the opposite-end audio stream with the on-off state of the voice acquisition device of the opposite end.

Description

Method and corresponding device for detecting silent abnormality in voice call

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and a corresponding device for detecting a silent abnormality in a voice call.

Background

In a voice call, especially a multi-user voice call, voice quality is important to the user experience and is an operating metric that voice call system providers pay close attention to. Because the terminal devices that users use to access the voice call system vary, and the environments users are actually in are complex and changeable, voice quality problems often occur. Common voice quality problems include silence, noise, acoustic echo, and howling, among which silent abnormalities are the most severe.

Existing voice quality detection in voice calls can only judge whether a voice quality problem exists; it cannot identify which kind of voice quality problem it is. That is, the prior art cannot independently identify a silent abnormality in a voice call.

Disclosure of Invention

The purpose of the present application is to at least solve one of the above technical drawbacks, and the technical solutions provided in the embodiments of the present application are as follows:

In a first aspect, an embodiment of the present application provides a method for detecting a silent abnormality in a voice call, including:

during a voice call, acquiring at least one opposite-end audio stream, and counting continuous blank frames among the voice frames of each audio stream separately;

for each audio stream, if the current voice frame is a target voice frame, acquiring the current continuous blank frame count value;

if the continuous blank frame count value is not smaller than a preset value, acquiring the on-off state of the voice acquisition device of the corresponding opposite end;

if the voice acquisition device of the opposite end is in an on state, determining that the corresponding audio stream in the voice call has a silent abnormality.

In an alternative embodiment of the present application, the target voice frame is the first non-blank frame following a blank frame in the corresponding audio stream.

In an optional embodiment of the present application, if the current voice frame is the target voice frame, acquiring the current continuous blank frame count value includes:

if the current voice frame is a target voice frame and the corresponding audio stream contains a preset number of voice frames before the current voice frame, acquiring the current continuous blank frame count value.

In an alternative embodiment of the present application, the method further comprises:

if the continuous blank frame count value is smaller than the preset value, restarting the continuous blank frame count for the voice frames after the current voice frame;

if the continuous blank frame count value is not smaller than the preset value and the voice acquisition device of the opposite end is in an off state, restarting the continuous blank frame count for the voice frames after the current voice frame;

if the continuous blank frame count value is not smaller than the preset value and the voice acquisition device of the opposite end is in an on state, restarting the continuous blank frame count for the voice frames after the current voice frame.

acquiring an on-off state table of the voice acquisition devices, wherein the table stores the correspondence between the identifier of each opposite end and the on-off state of its voice acquisition device;

acquiring the on-off state of the voice acquisition device of the corresponding opposite end includes:

acquiring the identifier of the opposite end based on the audio stream of the opposite end;

and acquiring the on-off state of the voice acquisition device of the opposite end from the on-off state table based on the identifier of the opposite end.
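The table lookup described above could be sketched as follows. This is only a minimal illustration, not the patent's implementation: the names `AudioStream`, `peer_id` and `capture_device_state` are hypothetical, and the on-off state table is assumed to be a plain dictionary that maps each opposite-end identifier to a Boolean (True meaning the voice acquisition device is on).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class AudioStream:
    """Hypothetical stand-in for a received opposite-end audio stream."""
    peer_id: str  # assumption: the stream carries the identifier of its sender


def capture_device_state(table: Dict[str, bool], stream: AudioStream) -> Optional[bool]:
    """Return True (on), False (off), or None if the peer is not in the table."""
    return table.get(stream.peer_id)


state_table = {"client_B": True, "client_C": False}
print(capture_device_state(state_table, AudioStream("client_B")))
print(capture_device_state(state_table, AudioStream("client_C")))
```

Returning `None` for an unknown identifier leaves it to the caller to decide how a missing entry should be treated; the text does not specify this case.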

In an alternative embodiment of the present application, acquiring the on-off state table of the voice acquisition devices includes:

receiving voice acquisition device control signaling of each opposite end sent by a corresponding server, wherein the control signaling is sent to the server by the corresponding opposite end and indicates the on-off state of the corresponding voice acquisition device;

and acquiring the on-off state table of the voice acquisition devices based on the voice acquisition device control signaling of each opposite end.
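One possible way to derive the on-off state table from control signaling is sketched below. The message shape (a pair of opposite-end identifier and Boolean on-off flag) and the name `build_state_table` are assumptions made for illustration; the actual signaling format is not specified in the text.

```python
from typing import Dict, Iterable, Tuple

# Assumed message shape: (opposite-end identifier, True if the device was turned on).
ControlSignal = Tuple[str, bool]


def build_state_table(signals: Iterable[ControlSignal]) -> Dict[str, bool]:
    """Fold control signaling into a table; the latest signal per peer wins."""
    table: Dict[str, bool] = {}
    for peer_id, is_on in signals:
        table[peer_id] = is_on
    return table


# Client B turns its microphone on and later off; client C turns it on.
signals = [("client_B", True), ("client_C", True), ("client_B", False)]
print(build_state_table(signals))
```

Processing the signals in arrival order means a later "off" overrides an earlier "on" for the same opposite end, which is what a near end would need in order to judge the state at the current time.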

after determining that the corresponding audio stream in the voice call has silent abnormality, obtaining corresponding abnormality information;

and reporting the abnormal information to a corresponding server so that the server can analyze the cause of the abnormality based on the abnormal information.

and reporting the abnormal information to an initiator or maintainer of the voice call.

In an alternative embodiment of the present application, the anomaly information comprises at least one of:

packet loss rate of audio stream with silence anomaly, audio buffer size of audio stream with silence anomaly, decoder type of audio stream with silence anomaly, network delay of client corresponding to near end, volume of client audio player corresponding to near end, sampling rate of client audio player corresponding to near end, type and model of client audio device corresponding to near end, CPU occupancy rate of client CPU corresponding to near end, memory occupancy rate of client corresponding to near end, and receiving rate of client audio corresponding to near end.
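The abnormality information listed above could, for instance, be carried in a record whose fields are all optional, since the text only requires "at least one of" them. The following sketch is an assumption for illustration: the class name, field names, units and the JSON encoding are hypothetical and not taken from the source.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SilentAbnormalityInfo:
    """Hypothetical report record; every field is optional."""
    packet_loss_rate: Optional[float] = None
    audio_buffer_size: Optional[int] = None
    decoder_type: Optional[str] = None
    network_delay_ms: Optional[float] = None      # unit is an assumption
    player_volume: Optional[float] = None
    player_sampling_rate_hz: Optional[int] = None  # unit is an assumption
    audio_device_model: Optional[str] = None
    cpu_occupancy: Optional[float] = None
    memory_occupancy: Optional[float] = None
    audio_receiving_rate: Optional[float] = None

    def to_json(self) -> str:
        """Serialize only the fields that were actually collected."""
        collected = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(collected, sort_keys=True)


# Example values are made up for the demonstration.
info = SilentAbnormalityInfo(packet_loss_rate=0.12, decoder_type="example-decoder")
print(info.to_json())
```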

After determining that the corresponding audio stream in the voice call has the silence abnormality, sending silence abnormality prompt information to each opposite terminal.

after determining that the corresponding audio stream in the voice call has silence abnormality, mixing other audio streams except the audio stream with silence abnormality in the voice call.
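Mixing only the audio streams that are not flagged could look like the sketch below. The mixing rule itself (sample-wise sum clipped to the signed 16-bit range) is an assumption chosen for simplicity; the text does not specify a mixing algorithm, and `mix_frames` is a hypothetical name.

```python
from typing import Dict, List, Set

INT16_MIN, INT16_MAX = -32768, 32767


def mix_frames(frames: Dict[str, List[int]], abnormal: Set[str]) -> List[int]:
    """Sum same-length 16-bit PCM frames, skipping streams flagged as abnormal."""
    kept = [samples for peer, samples in frames.items() if peer not in abnormal]
    if not kept:
        return []
    mixed = []
    for column in zip(*kept):  # one tuple of samples per sample position
        total = sum(column)
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))  # clip to int16
    return mixed


frames = {
    "client_B": [100, -200, 30000],
    "client_C": [0, 0, 0],          # stream flagged with a silent abnormality
    "client_D": [50, 50, 5000],
}
print(mix_frames(frames, abnormal={"client_C"}))
```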

In a second aspect, an embodiment of the present application provides a device for detecting a silence abnormality in a voice call, including:

the blank frame counting module is used for acquiring at least one audio stream of the opposite end in the voice call process and respectively counting the continuous blank frames of the voice frames of each audio stream;

the counting value acquisition module is used for acquiring the current continuous blank frame counting value if the current voice frame is a target voice frame for each audio stream;

the on-off state acquisition module is used for acquiring the on-off state of the corresponding voice acquisition equipment at the opposite end if the continuous blank frame count value is not smaller than a preset value;

and the soundless abnormality determination module is used for determining that the corresponding audio stream in the voice call has soundless abnormality if the voice acquisition equipment of the opposite terminal is in an on state.

In an alternative embodiment of the present application, the count value obtaining module is further configured to:

In an alternative embodiment of the present application, the apparatus further includes a blank frame recounting module configured to:

In an optional embodiment of the present application, the apparatus further includes a voice acquisition device open/closed state table acquisition module configured to:

the open/close state acquisition module is specifically configured to:

In an alternative embodiment of the present application, the voice acquisition device open/close state table acquisition module is specifically configured to:

In an optional embodiment of the present application, the apparatus further includes a first abnormal information reporting module, configured to:

In an optional embodiment of the present application, the apparatus further includes a second abnormal information reporting module, configured to:

In an alternative embodiment of the present application, the apparatus further includes a prompting module configured to:

In an optional embodiment of the present application, the apparatus further includes a mixing screening module configured to:

In a third aspect, embodiments of the present application provide an electronic device including a memory and a processor;

a memory having a computer program stored therein;

a processor for executing a computer program to implement the method provided in the first aspect embodiment or any of the alternative embodiments of the first aspect.

In a fourth aspect, embodiments of the present application provide a computer readable storage medium, wherein the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method provided in the embodiment of the first aspect or any of the alternative embodiments of the first aspect.

In a fifth aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from the computer readable storage medium and executes them, so that the computer device implements the method provided in the embodiment of the first aspect or any of the alternative embodiments of the first aspect.

The beneficial effects of the technical solution provided by the present application are:

during a voice call, the number of continuous blank frames in the opposite-end audio stream is combined with the on-off state of the voice acquisition device of the opposite end to determine whether the audio stream has a silent abnormality.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.

FIG. 1 is a schematic diagram of a display interface of a multi-person voice call client in an example of an embodiment of the present application;

FIG. 2 is a schematic diagram of a voice call system in an example of an embodiment of the present application;

FIG. 3a is a schematic diagram of a multi-audio stream mixing process performed by a client in an example of an embodiment of the present application;

FIG. 3b is a schematic diagram of a server performing multi-audio stream mixing processing in an example of an embodiment of the present application;

FIG. 4 is a flowchart of a method for detecting a silent abnormality in a voice call according to an embodiment of the present application;

FIG. 5a is a flowchart of silent abnormality detection for an audio stream during a counting period according to an embodiment of the present application;

FIG. 5b is a schematic diagram of a specific flow of loop silent abnormality detection for an audio stream according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a microphone on-off state table update in one example of an embodiment of the present application;

FIG. 7 is a block diagram of a device for detecting a silent abnormality in a voice call according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of illustrating the present application and are not to be construed as limiting the invention.

As used herein, the singular forms "a", "an", "said" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

The scheme provided by the embodiments of the present application may be applied to a voice call, for example a multi-person voice call (such as a multi-person voice conference), in which silent abnormality detection is performed on the audio stream of each participant. Specifically, the voice call or multi-person voice call may be implemented through a voice call system including a front-end client and a background server, where the front-end client may be an application (APP) installed on a terminal such as a personal computer or a mobile phone. A participant in the voice call may be regarded as a user, the terminal used by that user, or the voice call client installed on that terminal; a given user, the terminal used by the user, and the client installed on that terminal correspond to one another.

For example, FIG. 1 shows a display interface of a multi-person voice call client, where each avatar represents a participant. The interface belongs to the client of user 1, and user 2 and user 3 are also participating in the multi-person voice call through their own clients. As can be seen from the figure, user 1 can also perform other operations related to the voice call through the display interface, such as turning the microphone on or off. As shown in FIG. 2, three users participate in the multi-person voice call through the multi-person voice conference APP installed on their respective mobile phones: client A is installed on the mobile phone of user 1, client B on the mobile phone of user 2, and client C on the mobile phone of user 3. Client A, client B and client C each establish a connection with the corresponding server (the voice conference server), upload the voice information received from their users to the server, and receive the voice information pushed by the server. In other words, each client in the voice call system uploads the voice information of its user to the server and receives the voice information of the other users pushed by the server. For example, during a multi-person conference, client A uploads the voice of user 1 to the server and receives the voices of user 2 and user 3 pushed by the server, which were uploaded by client B and client C. The scheme of the embodiments of the present application is applied to detect silent abnormalities in the voice stream of each user during the multi-person conference.

It should be noted that, among the parties to the voice call, from the perspective of user 1 and client A, client B and client C may be referred to as the opposite ends (that is, client A has two opposite ends), and client A may be referred to as the near end.

Further, during a multi-person voice call, any client in the voice call system needs to acquire the voices of the other clients participating in the call, mix them, and play the result. The mixing may be performed in the client or in the server. For example, if the participants in the voice call system are client A, client B, client C and client D, the opposite ends of client A are client B, client C and client D, and during the voice call the audio streams of these three opposite ends are mixed before client A plays them.

As shown in FIG. 3a, the audio streams of the three opposite ends, client B, client C and client D, are mixed in client A. Client A receives the data stream pushed by the server (network packet reception) through a network thread, that is, it acquires audio stream B, audio stream C and audio stream D of the three opposite ends from the server. Each audio stream is then decoded to obtain the corresponding voice frames, and the decoded continuous voice frames are stored in the corresponding voice frame buffer module. Next, the silent abnormality detection module performs silent abnormality detection on the voice frames in the voice frame buffer module. In the figure, the voice frames in the voice frame buffer module are drawn as rectangles, where a dashed rectangle represents a blank frame and the other rectangles represent non-blank frames; the subsequent silent abnormality detection in effect determines whether a run of continuous blank frames corresponds to a silent abnormality. After the detection is completed, the corresponding voice frames of the audio streams are mixed, and the voice player is called to play the mixed voice frames, so that the user can hear the voices of the opposite-end users.

As shown in FIG. 3b, the audio streams of the three opposite ends, client B, client C and client D, are mixed in the server. The server acquires audio stream B, audio stream C and audio stream D of the three opposite ends through network packet reception. Each audio stream is then decoded to obtain the corresponding voice frames, and the decoded continuous voice frames are stored in the corresponding voice frame buffer module. Next, the silent abnormality detection module performs silent abnormality detection on the voice frames in the voice frame buffer module. After the detection is completed, the voice frames are mixed and the mixed audio stream is sent to client A, which calls the voice player to play the mixed voice frames, so that the user can hear the voices of the opposite-end users.

As can be seen from the above description, whether the audio streams of the respective opposite ends are subjected to the mixing processing at the client end or the mixing processing at the server end, the voice frames of the respective audio streams need to be subjected to the silence anomaly detection in the respective silence anomaly detection modules. For convenience of description, the embodiment of the present application describes the silent abnormality detection scheme in detail by performing the mixing process on the client, but the present application is not limited thereto.

FIG. 4 is a flowchart of a method for detecting a silent abnormality in a voice call according to an embodiment of the present application. As shown in FIG. 4, the method may include:

Step S401, during the voice call, at least one opposite-end audio stream is acquired, and continuous blank frames are counted among the voice frames of each audio stream separately.

A blank frame is a frame obtained by decoding an audio stream that contains no voice information, that is, a silent frame. A small number of continuous blank frames is heard as a brief stutter during playback, while a large number is heard as a silent abnormality. For example, in a multi-person voice conference, if user 1 hears the voice of user 2 stutter, or hears a silent gap in the middle of an otherwise complete utterance, blank frames exist in the audio stream corresponding to user 2. Blank frames in an audio stream can arise in two ways. In the first, no voice information was input for the corresponding frames when the audio stream was generated, that is, the corresponding user did not input voice information; for example, the voice acquisition device of the terminal where the client is located is in an off state, or the voice acquisition device is occupied by another APP, which for the APP of the current voice call is equivalent to an off state. In the second, the voice information of the corresponding frames was lost because of a fault, that is, a silent abnormality exists. It follows that whether an audio stream has a silent abnormality can be determined by combining the number of continuous blank frames in the audio stream with the on-off state of the voice acquisition device of the terminal where the client is located. It should be appreciated that the occurrence of blank frames in an audio stream may also be referred to as "lost frames", and continuous blank frames may be referred to as continuous lost frames.

Specifically, a continuous blank frame counter may be provided for each audio stream to count the voice frames in that stream. For each audio stream, the continuous blank frame counter is set to zero at the end of each counting period, and counting restarts in the next counting period. The counting periods need not have the same duration; the moment at which the continuous blank frame counter is zeroed, and hence the duration of each counting period, can be set according to actual requirements.
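A per-stream counter of this kind could be sketched as follows. The class and method names are hypothetical; the sketch only illustrates the two operations the paragraph describes, incrementing on a blank frame and zeroing at the end of a counting period.

```python
class ContinuousBlankFrameCounter:
    """Counts consecutive blank frames within one counting period."""

    def __init__(self) -> None:
        self.value = 0

    def on_blank_frame(self) -> None:
        """Record one more blank frame in the current counting period."""
        self.value += 1

    def reset(self) -> None:
        """End the current counting period; the next one starts from zero."""
        self.value = 0


# One independent counter per opposite-end audio stream.
counters = {
    "client_B": ContinuousBlankFrameCounter(),
    "client_C": ContinuousBlankFrameCounter(),
}
for _ in range(3):
    counters["client_B"].on_blank_frame()
print(counters["client_B"].value, counters["client_C"].value)
counters["client_B"].reset()
print(counters["client_B"].value)
```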

Step S402, for each audio stream, if the current speech frame is the target speech frame, the current continuous blank frame count value is obtained.

The current voice frame is the voice frame of the audio stream that is to be mixed, at the current time, with the corresponding voice frames of the other audio streams. The target voice frame can be understood as a specific voice frame: it may be the first non-blank frame after a blank frame, or the Nth non-blank frame after a blank frame, where N is a positive integer whose value can be set according to actual requirements.
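The notion of a target voice frame can be illustrated with a short sketch. It assumes that each frame has already been classified as blank or non-blank, and it interprets "the Nth non-blank frame after a blank frame" as the Nth frame of the non-blank run that follows a blank run; both the interpretation and the name `target_frame_indices` are assumptions.

```python
from typing import Iterable, Iterator


def target_frame_indices(is_blank: Iterable[bool], n: int = 1) -> Iterator[int]:
    """Yield indices of frames that are the n-th non-blank frame after a blank run."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    seen_blank = False   # True once at least one blank frame has occurred
    non_blank_run = 0    # length of the current run of non-blank frames
    for index, blank in enumerate(is_blank):
        if blank:
            seen_blank = True
            non_blank_run = 0
        else:
            non_blank_run += 1
            if seen_blank and non_blank_run == n:
                yield index


flags = [False, True, True, False, False, True, False]
print(list(target_frame_indices(flags, n=1)))
print(list(target_frame_indices(flags, n=2)))
```

With N = 1 the target frame marks the exact end of a blank run; a larger N delays the check until the stream has produced N non-blank frames in a row.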

Specifically, for each audio stream, when the current speech frame is mixed, if the current speech frame is determined to be a specific speech frame, the current continuous blank frame count value is obtained from the continuous blank frame counter, that is, the number of blank frames in the current count period is obtained.

It should be noted that, in one or more alternative embodiments, if the current voice frame is the target voice frame, acquiring the current continuous blank frame count value includes: if the current voice frame is a target voice frame and the corresponding audio stream contains a preset number of voice frames before the current voice frame, acquiring the current continuous blank frame count value. In other words, before silent abnormality detection is performed on an audio stream, it is first determined whether the number of voice frames contained in the audio stream has reached a preset number; that is, silent abnormality detection is performed only after the audio stream has lasted for a period of time and contains a sufficiently large number of voice frames. Specifically, if the number of voice frames contained in the audio stream has reached the preset number and the current voice frame is the target voice frame, the current continuous blank frame count value is acquired and silent abnormality detection is then performed.
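The additional condition on the number of preceding voice frames can be expressed as a simple guard. The function name and the example value of 50 frames are hypothetical; the text only states that the preset number exists.

```python
def should_read_count(frames_before_current: int,
                      is_target_frame: bool,
                      min_preceding_frames: int) -> bool:
    """True only if the stream is long enough and the current frame is a target frame."""
    return is_target_frame and frames_before_current >= min_preceding_frames


# Assumed preset number of preceding frames: 50.
print(should_read_count(frames_before_current=10, is_target_frame=True, min_preceding_frames=50))
print(should_read_count(frames_before_current=120, is_target_frame=True, min_preceding_frames=50))
```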

Step S403, if the continuous blank frame count value is not less than the preset value, the open/close state of the corresponding voice acquisition equipment at the opposite end is obtained.

The voice capture device may be a microphone on the terminal.

The preset value may be set according to actual requirements; for example, it may be set to 5 frames. In the embodiments of the present application, if the current continuous blank frame count value in the current counting period is smaller than the preset value, no silent abnormality is considered to exist. If it is greater than or equal to the preset value, a silent abnormality may exist, and to determine whether it does, the on-off state of the voice acquisition device of the opposite end corresponding to the audio stream must also be acquired and taken into account.

Step S404, if the voice acquisition device of the opposite end is in an on state, it is determined that the corresponding audio stream in the voice call has a silent abnormality.

If the voice acquisition device of the opposite end is in an on state, the corresponding client can collect the voice of the user normally, yet many blank frames appear in the corresponding audio stream; these blank frames are therefore regarded as a silent abnormality, that is, the audio stream has a silent abnormality. If the voice acquisition device of the opposite end is in an off state, the corresponding client cannot collect the voice of the user, so many blank frames are to be expected in the corresponding audio stream, and the audio stream is considered to have no silent abnormality.
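The decision made in steps S403 and S404 can be summarized in a small function. The enum and function names are hypothetical, and the default preset value of 5 frames is simply the example value mentioned in the text.

```python
from enum import Enum


class Verdict(Enum):
    NORMAL_PAUSE = "count below preset value"
    DEVICE_OFF = "device off, blank frames expected"
    SILENT_ABNORMALITY = "silent abnormality"


def judge_blank_run(blank_count: int, device_is_on: bool, preset_value: int = 5) -> Verdict:
    """Combine the blank frame count with the opposite end's device state."""
    if blank_count < preset_value:
        return Verdict.NORMAL_PAUSE
    if not device_is_on:
        return Verdict.DEVICE_OFF
    return Verdict.SILENT_ABNORMALITY


print(judge_blank_run(3, device_is_on=True))
print(judge_blank_run(8, device_is_on=False))
print(judge_blank_run(8, device_is_on=True))
```

Only the last case, a long blank run while the opposite end's device is on, is reported as a silent abnormality.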

In the above voice call, before silent abnormality detection is performed on each audio stream, it may first be determined whether the audio stream contains voice frames of speech; if so, silent abnormality detection is performed, and if not, it is unnecessary. The reason is that in a voice call a participant may not speak at all, or may not speak until it is that participant's turn, while the participant's voice acquisition device still picks up sounds other than speech. Performing the silent abnormality judgment in that case would cause misjudgment, and detecting whether the audio stream contains speech avoids it. Specifically, whether to perform silent abnormality detection may be determined according to a preset speaking order: if, according to the speaking order, a certain participant is speaking at the current time, silent abnormality detection is performed on the audio stream corresponding to that participant. Whether to perform silent abnormality detection may also be determined according to a preset algorithm: if the preset algorithm determines that a certain participant is speaking at the current time, silent abnormality detection is performed on the audio stream corresponding to that participant.
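Gating the detection on a preset speaking order might look like the sketch below. The schedule format (start time, end time, participant identifier) and both function names are assumptions; the text does not say how the speaking order is represented.

```python
from typing import List, Optional, Tuple

# Assumed schedule entry: (start_second, end_second, participant identifier).
SpeakingSlot = Tuple[float, float, str]


def scheduled_speaker(schedule: List[SpeakingSlot], now: float) -> Optional[str]:
    """Return the participant scheduled to speak at time `now`, if any."""
    for start, end, participant in schedule:
        if start <= now < end:
            return participant
    return None


def should_detect(schedule: List[SpeakingSlot], participant: str, now: float) -> bool:
    """Run silent abnormality detection only on the scheduled speaker's stream."""
    return scheduled_speaker(schedule, now) == participant


schedule = [(0.0, 60.0, "user_2"), (60.0, 120.0, "user_3")]
print(should_detect(schedule, "user_2", now=30.0))
print(should_detect(schedule, "user_3", now=30.0))
```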

According to the scheme provided by the application, in the voice call process, whether the soundless abnormality exists in the audio stream or not is determined by combining the number of continuous blank frames in the audio stream of the opposite terminal and the opening and closing states of the voice acquisition equipment of the opposite terminal, and the soundless abnormality in the voice call can be detected by the scheme, so that system operation and maintenance personnel can be prompted to monitor possible terminal and background service defects in the voice call system in time.

Specifically, the target voice frame is a non-blank frame and is the first frame after a blank frame in the audio stream. In other words, the occurrence of the target voice frame marks the transition of the voice frames in the audio stream from blank to non-blank, and thus the end of a run of continuous blank frames. If the current voice frame is determined to be the target voice frame, a run of continuous blank frames has just ended; at this point the continuous blank frame count value can be acquired and the subsequent condition judgment can be made, which completes the silent abnormality detection.

It should be noted that the continuous blank frame counter counts according to the counting periods described above and enters the next counting period after each silent abnormality determination. It can therefore be considered that only one run of continuous blank frames occurs in one counting period. Within the current counting period, the occurrence of the target voice frame indicates that the counting of continuous blank frames for that period is complete, so whether the continuous blank frames correspond to a silent abnormality can be determined from the continuous blank frame count value in combination with other information. Specifically, the occurrence of the target voice frame is the trigger condition for acquiring the continuous blank frame count value. After it is determined that the count value is not smaller than the preset value, the on-off state of the voice acquisition device of the terminal corresponding to the audio stream must be acquired and combined with the count value to determine whether the continuous blank frames correspond to a silent abnormality, that is, whether the audio stream has a silent abnormality.

In an alternative embodiment of the present application, the method may further include:

If the target voice frame appears in the current counting period (i.e., the current voice frame is determined to be the target voice frame), the run of continuous blank frames in the current counting period has ended. The continuous blank frame count value is then compared with the preset value, and the subsequent condition judgment is performed according to the comparison result to determine whether the continuous blank frames correspond to a silent abnormality. Specifically, if the continuous blank frame count value is smaller than the preset value, there are few blank frames and the period without voice information input is short, so the blank frames can be regarded as a normal pause rather than a silent abnormality. If the continuous blank frame count value is not less than the preset value, there are many blank frames and the period without voice information input is long, and two situations are possible. In the first, the voice acquisition device of the terminal corresponding to the audio stream is in the off state; the continuous blank frames then correspond to a situation in which no voice is input and are not a silent abnormality. In the second, the client corresponding to the audio stream has lost the voice signal due to a fault, and the continuous blank frames correspond to a silent abnormality. In other words, to finally determine whether a silent abnormality exists in the audio stream in this case, the on-off state of the voice acquisition device of the corresponding terminal needs to be determined.

Specifically, as shown in fig. 5a, for each audio stream, the voice frames are extracted in sequence from the corresponding voice frame buffer module. First, it is determined whether the current voice frame is a blank frame. If it is a blank frame, the count value of the continuous blank frame counter is increased by 1; if it is not a blank frame, the run of continuous blank frames in the counting period has ended, that is, the current voice frame is the target voice frame. Then, if the current voice frame is determined to be the target voice frame, the current continuous blank frame count value is obtained and compared with the preset value. If the count value is smaller than the preset value, the continuous blank frames do not correspond to a silent abnormality, the continuous blank frame counter is set to zero, and the next counting period is entered; if the count value is not less than the preset value, the continuous blank frames may correspond to a silent abnormality.
Next, if the continuous blank frame count value is not smaller than the preset value, the on-off state of the voice acquisition device of the terminal corresponding to the audio stream is obtained. If the voice acquisition device is in the off state, it is determined that the continuous blank frames do not correspond to a silent abnormality, and the continuous blank frame counter is set to zero, that is, the next counting period is entered. If the voice acquisition device is in the on state, it is determined that the continuous blank frames correspond to a silent abnormality, that is, that the audio stream has a silent abnormality; at this point the subsequent operation of reporting abnormality information to the server is performed, and the continuous blank frame counter is likewise set to zero so that the next counting period is entered. The contents of the abnormality information and the manner of reporting it are described in detail later.
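The per-stream flow described above for fig. 5a can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of the application: the class name `SilenceDetector`, the callback `mic_is_on`, and the expression of the preset value as a frame count are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SilenceDetector:
    """Continuous blank frame counter for one audio stream (hypothetical names)."""
    peer_id: str
    threshold: int                     # the "preset value", assumed here to be a frame count
    mic_is_on: Callable[[str], bool]   # hypothetical lookup of the peer's microphone state
    blank_count: int = 0
    anomalies: List[int] = field(default_factory=list)  # lengths of anomalous blank runs

    def on_frame(self, is_blank: bool) -> bool:
        """Feed one voice frame; return True if a silent abnormality is determined."""
        if is_blank:
            self.blank_count += 1
            return False
        # Non-blank frame: the blank run (if any) has ended, so read the count
        # and reset the counter, which starts the next counting period.
        count, self.blank_count = self.blank_count, 0
        if count < self.threshold:
            return False               # normal speech or a normal pause
        if not self.mic_is_on(self.peer_id):
            return False               # microphone off: silence is expected
        self.anomalies.append(count)   # microphone on but a long blank run: abnormality
        return True


mic_state = {"B": True}
detector = SilenceDetector("B", threshold=3, mic_is_on=lambda p: mic_state[p])
frames = [False, True, True, False, True, True, True, True, False]
results = [detector.on_frame(f) for f in frames]
```

In this usage, the first blank run has two frames and is treated as a normal pause, while the second run has four frames with the microphone on, so only the last frame yields a positive determination.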

It can be understood that, during the voice call, the above silent abnormality detection is performed cyclically: in each counting period, the continuous blank frames of the voice frames in that period are counted to determine whether they correspond to a silent abnormality, after which the next counting period is entered in the above manner and the next determination is made. Fig. 5b is a schematic diagram of the cyclic process of the method for detecting a silent abnormality in a voice call provided in the embodiment of the present application.

Before or during the silent abnormality detection, the client may acquire a voice acquisition device on-off state table, which stores the correspondence between the identifier of each opposite end participating in the voice call and the on-off state of its voice acquisition device, that is, for each opposite-end identifier, whether the corresponding voice acquisition device is in the on state or the off state. Specifically, each opposite end has a unique identifier, and the audio stream of each opposite end carries that identifier. When the on-off state of the voice acquisition device of the opposite end corresponding to an audio stream needs to be obtained, it can be looked up in the voice acquisition device on-off state table according to the corresponding identifier.

Specifically, for each audio stream, when it is determined that the current speech frame is the target speech frame and the current continuous blank frame count value is not less than the preset value, the on-off state of the voice acquisition device of the corresponding opposite end needs to be obtained. Because the audio stream carries the identifier of the corresponding opposite end, and the client has acquired the voice acquisition device on-off state table before or during the silent abnormality detection, the on-off state of that opposite end's voice acquisition device is obtained from the table according to the identifier.

It should be noted that the states of the voice acquisition devices corresponding to the identifiers of the different opposite ends in the voice acquisition device on-off state table are kept up to date: when the on-off state of the voice acquisition device of an opposite end changes, the corresponding state in the voice acquisition device on-off state table in the client changes accordingly.
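As a rough illustration of the on-off state table described above, the following sketch keys the table by opposite-end identifier. The class and method names are hypothetical, and returning `None` for an identifier that is not yet in the table is an assumption of this example; the application does not state how an unknown identifier is handled.

```python
from typing import Dict, Optional


class MicStateTable:
    """Maps each opposite-end identifier to its microphone on-off state (hypothetical)."""

    def __init__(self) -> None:
        self._states: Dict[str, bool] = {}

    def update(self, peer_id: str, is_on: bool) -> None:
        # Called whenever the on-off state of a peer's voice acquisition device changes.
        self._states[peer_id] = is_on

    def lookup(self, peer_id: str) -> Optional[bool]:
        # The identifier is the one carried by the peer's audio stream.
        # None means the state is unknown (assumed behaviour, not from the source).
        return self._states.get(peer_id)


table = MicStateTable()
table.update("peer-B", False)
table.update("peer-B", True)   # the peer switches its microphone on later
state_b = table.lookup("peer-B")
state_unknown = table.lookup("peer-Z")
```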

receiving voice acquisition equipment control signaling of each opposite terminal sent by a corresponding server, wherein the voice acquisition equipment control signaling is sent to the server by the corresponding opposite terminal and is used for controlling the opening and closing states of the corresponding voice acquisition equipment;

The voice acquisition equipment control signaling is generally sent after a user triggers a specific button of a corresponding client, the voice acquisition equipment control signaling is sent to a corresponding server by the client after being generated, and the server distributes all the voice acquisition equipment control signaling to other clients for updating the voice acquisition equipment opening and closing state tables in the other clients.

Specifically, as shown in fig. 6, the voice acquisition devices of client A, client B, client C and client D are all microphones. A user can "turn on the microphone" or "turn off the microphone" by clicking the microphone control button in the display interface of the client, and when the user clicks the microphone control button in the corresponding client, that is, when the on-off state of the microphone changes, the client sends voice acquisition device control signaling to the server. For example, when user 2 clicks the microphone control button in the corresponding client B to "turn on the microphone", that is, to switch the microphone from the off state to the on state, the voice acquisition device control signaling (i.e., the microphone control signaling in the figure) sent by client B to the server indicates that the microphone corresponding to client B is in the on state. Before or while client A performs silent abnormality detection, the server, after receiving voice acquisition device control signaling sent by client B, client C or client D, forwards it to client A, and client A updates its voice acquisition device on-off state table (i.e., the microphone on-off state table in the figure) according to the on-off state indicated by the received control signaling.
Specifically, when user 2 clicks the microphone control button in the corresponding client B to "turn on the microphone", the voice acquisition device control signaling sent by client B to the server indicates that the microphone corresponding to client B is in the on state, and the server forwards the control signaling to client A, client C and client D. For client A, the on-off state of the voice acquisition device corresponding to client B in the voice acquisition device on-off state table is updated from the off state to the on state. It can be understood that the voice acquisition device control signaling also carries the identifier of the corresponding opposite end (i.e., the identifier of the corresponding client).
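The signaling path of fig. 6 can be sketched as a small in-memory relay. This is only a sketch under stated assumptions: `MicControlSignal`, `Client` and `Server` are hypothetical names, and real signaling would travel over a network rather than through direct method calls.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MicControlSignal:
    sender_id: str   # identifier of the client whose microphone state changed
    mic_on: bool     # True for "turn on the microphone", False for "turn off"


class Client:
    def __init__(self, client_id: str) -> None:
        self.client_id = client_id
        self.mic_table: Dict[str, bool] = {}   # local on-off state table

    def on_signal(self, signal: MicControlSignal) -> None:
        # Update the local table from signaling forwarded by the server.
        self.mic_table[signal.sender_id] = signal.mic_on


class Server:
    def __init__(self, clients: List[Client]) -> None:
        self.clients = {c.client_id: c for c in clients}

    def relay(self, signal: MicControlSignal) -> None:
        # Forward the signaling to every client except the one that sent it.
        for client_id, client in self.clients.items():
            if client_id != signal.sender_id:
                client.on_signal(signal)


clients = [Client(name) for name in ("A", "B", "C", "D")]
server = Server(clients)
server.relay(MicControlSignal(sender_id="B", mic_on=False))
server.relay(MicControlSignal(sender_id="B", mic_on=True))   # user 2 turns the microphone on
tables = {c.client_id: c.mic_table for c in clients}
```

After the second relay, the tables of clients A, C and D each record client B's microphone as on, while client B's own table is untouched.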

after determining that a corresponding opposite terminal in the voice call has silent abnormality, acquiring corresponding abnormality information;

Specifically, when the silent abnormality detection module of the client determines, before mixing, that a silent abnormality exists in a certain audio stream, abnormality information is acquired and sent to the corresponding server so that the cause of the silent abnormality can be analyzed based on it. A corresponding abnormality analysis module may be preset in the server to analyze the cause of the silent abnormality based on the abnormality information, or the abnormality information may be displayed to a technician, who analyzes the cause of the silent abnormality based on it. Acquiring and feeding back the abnormality information guides fault localization and analysis, allows defects to be repaired in time, and improves the stability of the system.

And reporting the abnormal information to the initiator or maintainer of the voice call.

Specifically, besides reporting the abnormal information to the corresponding server, the abnormal information can also be directly reported to the corresponding voice call initiator or maintainer, so that the voice call initiator or maintainer can perform abnormal analysis according to the reported abnormal information, thereby more timely finding out the abnormal reason and performing corresponding adjustment, and improving the experience of each participant in the voice call.

The anomaly information may be information about an audio stream in which a silent anomaly exists, and information about a client (i.e., a client corresponding to a near end) that mixes the audio.

It can be understood that when the near-end client is used for performing the audio mixing process, the abnormal information may also include other information according to the actual requirement, which is not limited in this application. Meanwhile, when the server is adopted for mixing, the abnormal information can also contain other information according to actual requirements.
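One possible shape for the abnormality information is sketched below. The source only states that it covers information about the abnormal audio stream and information about the near-end client that performs the mixing, so both parts are kept as opaque dictionaries here; the class name, the JSON encoding and the example keys are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class AnomalyReport:
    # Information about the audio stream that has the silent abnormality.
    stream_info: Dict[str, Any]
    # Information about the near-end client that performs the mixing.
    near_end_client_info: Dict[str, Any]

    def to_json(self) -> str:
        # JSON is an assumed wire format; the source does not specify one.
        return json.dumps(asdict(self), sort_keys=True)


report = AnomalyReport(
    stream_info={"peer_id": "B", "blank_frames": 120},   # example keys only
    near_end_client_info={"client_id": "A"},             # example keys only
)
payload = report.to_json()
```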

Specifically, after it is determined that a silent abnormality exists in a certain audio stream in the voice call, silent abnormality prompt information can be sent to the terminals (i.e., the opposite ends) of the participants in the voice call, so that the other participants learn of the situation in time and unnecessary misunderstanding is avoided. In particular, the silent abnormality prompt information may include the identifier of the participant for which the silent abnormality has occurred and the type of abnormality (i.e., silent abnormality). For example, the participants in the voice call include user A (corresponding to terminal A), user B (corresponding to terminal B) and user C (corresponding to terminal C). If a silent abnormality exists for terminal A, the prompt information "a silent abnormality has occurred for user A" is sent to terminal B and terminal C, so that user B and user C learn in time that a silent abnormality has occurred for user A.
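The prompt described above might be assembled as in the following sketch. The function name and the message keys are hypothetical, and excluding the abnormal participant from the recipients follows the user A, B, C example rather than an explicit requirement of the text.

```python
from typing import Dict, List


def build_silence_prompts(participants: List[str], abnormal_id: str) -> Dict[str, dict]:
    """Return one prompt message per recipient terminal (hypothetical format)."""
    message = {
        "abnormal_participant": abnormal_id,   # identifier of the affected participant
        "abnormality_type": "silent",          # type of abnormality
    }
    # Each recipient gets its own copy so later edits do not affect the others.
    return {p: dict(message) for p in participants if p != abnormal_id}


prompts = build_silence_prompts(["A", "B", "C"], abnormal_id="A")
```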

Specifically, as can be seen from the foregoing description, each terminal in the voice call plays the mixed voice. To ensure the quality of the mixed voice, the audio streams having a silent abnormality can be screened out from the audio streams in the voice call, that is, an audio stream having a silent abnormality is not added to the mixing process.
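Screening abnormal streams out of the mix can be illustrated as follows. The mixing method shown (sample-wise summation clipped to the 16-bit PCM range) is an assumption chosen for brevity and is not described in the application; the function and parameter names are likewise hypothetical.

```python
from typing import Dict, List, Set


def mix_excluding(frames: Dict[str, List[int]], abnormal: Set[str]) -> List[int]:
    """Mix one frame per peer, skipping peers flagged with a silent abnormality."""
    kept = [samples for peer_id, samples in frames.items() if peer_id not in abnormal]
    if not kept:
        return []
    length = min(len(samples) for samples in kept)
    mixed = []
    for i in range(length):
        total = sum(samples[i] for samples in kept)
        mixed.append(max(-32768, min(32767, total)))   # clip to the 16-bit PCM range
    return mixed


frames = {
    "A": [0, 0, 0],              # stream flagged as having a silent abnormality
    "B": [100, -200, 30000],
    "C": [50, 50, 10000],
}
mixed = mix_excluding(frames, abnormal={"A"})
```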

Fig. 7 is a block diagram of a device for detecting a silent abnormality in a voice call according to an embodiment of the present application, and as shown in fig. 7, the device 700 may include: a blank frame counting module 701, a count value acquisition module 702, an open-closed state acquisition module 703, and a silent abnormality determination module 704, wherein:

the blank frame counting module 701 is configured to obtain at least one audio stream of the opposite end during a voice call, and respectively count continuous blank frames of voice frames of each audio stream;

the count value obtaining module 702 is configured to obtain, for each audio stream, a current continuous blank frame count value if the current speech frame is the target speech frame;

The on-off state acquisition module 703 is configured to acquire an on-off state of the corresponding voice acquisition device at the opposite end if the continuous blank frame count value is not less than a preset value;

the silence anomaly determination module 704 is configured to determine that a silence anomaly exists in a corresponding audio stream in a voice call if a voice collection device at an opposite terminal is in an on state.

the open/close state acquisition module is specifically configured to:

Referring now to fig. 8, there is illustrated a schematic diagram of an electronic device (e.g., a terminal device or server performing the method illustrated in fig. 4) 800 suitable for use in implementing embodiments of the present disclosure. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 8 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.

An electronic device includes: a memory and a processor, where the processor may be referred to as a processing device 801 described below, the memory may include at least one of a Read Only Memory (ROM) 802, a Random Access Memory (RAM) 803, and a storage device 808 described below, as follows:

As shown in fig. 8, the electronic device 800 may include a processing means (e.g., a central processor, a graphics processor, etc.) 801, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. In the RAM803, various programs and data required for the operation of the electronic device 800 are also stored. The processing device 801, the ROM 802, and the RAM803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

In general, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, etc.; storage 808 including, for example, magnetic tape, hard disk, etc.; communication means 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 shows an electronic device having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communication device 809, or installed from storage device 808, or installed from ROM 802. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 801.

It should be noted that the computer readable medium described above in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.

In some implementations, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.

The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:

in the voice communication process, acquiring at least one opposite-end audio stream, and respectively counting continuous blank frames of voice frames of each audio stream; for each audio stream, if the current voice frame is a target voice frame, acquiring a current continuous blank frame count value; if the continuous blank frame count value is not smaller than the preset value, acquiring the opening and closing states of the corresponding voice acquisition equipment at the opposite end; if the voice acquisition equipment of the opposite terminal is in an on state, determining that the corresponding audio stream in the voice call has soundless abnormality.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules or units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. In some cases, the name of a module or unit does not constitute a limitation on the module or unit itself; for example, the count value acquisition module may also be described as "a module that acquires a count value".

The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions such that the computer device performs:

The foregoing description is only of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to herein is not limited to technical solutions formed by the specific combinations of features described above, but also covers other technical solutions formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by mutually substituting the features described above with technical features having similar functions disclosed in the present disclosure (but not limited thereto).

Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims

1. A silent abnormality detection method in a voice call, comprising:

in the voice call process, determining a speaking participant at the current moment according to a preset speaking sequence or a preset algorithm, acquiring an audio stream of an opposite terminal corresponding to the speaking participant at the current moment, and continuously counting blank frames of voice frames of the audio stream of the opposite terminal corresponding to the speaking participant at the current moment; the blank frame is that the corresponding frame does not input the corresponding voice information when the audio stream is generated or the voice information of the frame corresponding to the audio stream is lost due to the fault;

If the current voice frame is a target voice frame, acquiring a current continuous blank frame count value, and carrying out continuous blank frame count again on the voice frames after the current voice frame, wherein the target voice frame is a first frame non-blank frame after a blank frame or an Nth frame non-blank frame after a blank frame, and N is a positive integer;

if the continuous blank frame count value is not smaller than a preset value, acquiring the opening and closing states of corresponding voice acquisition equipment at the opposite end;

if the voice acquisition equipment of the opposite terminal is in an on state, determining that a corresponding audio stream in the voice call has soundless abnormality;

after determining that the corresponding audio stream in the voice call has silent abnormality, obtaining corresponding abnormality information, wherein the abnormality information comprises related information of the corresponding audio stream in the voice call and related information of a client corresponding to a near end.

2. The method of claim 1, wherein if the current speech frame is the target speech frame, obtaining the current continuous blank frame count value comprises:

if the current voice frame is a target voice frame and the current voice frame in the corresponding audio stream contains a preset number of voice frames before the current voice frame, a current continuous blank frame count value is obtained.

3. The method according to claim 1, wherein the method further comprises:

if the continuous blank frame count value is not smaller than the preset value and the voice acquisition equipment of the opposite terminal is in a closed state, carrying out continuous blank frame count again on the voice frames after the current voice frame;

and if the continuous blank frame count value is not smaller than the preset value and the voice acquisition equipment at the opposite end is in an on state, carrying out continuous blank frame count again on the voice frames after the current voice frame.

4. The method according to claim 1, wherein the method further comprises:

acquiring an on-off state table of voice acquisition equipment, wherein the on-off state table of the voice acquisition equipment stores the corresponding relation between the identifiers of each opposite terminal and the on-off state of the voice acquisition equipment;

the acquiring the opening and closing states of the corresponding voice acquisition equipment at the opposite end comprises the following steps:

acquiring an identifier of the opposite terminal based on the audio stream of the opposite terminal;

5. The method of claim 4, wherein the obtaining the voice acquisition device on-off state table comprises:

6. The method according to claim 1, wherein the method further comprises:

7. The method of claim 6, wherein the method further comprises:

and reporting the abnormal information to an initiator or a maintainer of the voice call.

8. The method according to claim 6 or 7, wherein the anomaly information comprises at least one of:

9. The method according to claim 1, wherein the method further comprises:

and after determining that the corresponding audio stream in the voice call has the silence abnormality, sending silence abnormality prompt information to each opposite terminal.

10. The method according to claim 1, wherein the method further comprises:

11. A silent abnormality detection apparatus in a voice call, comprising:

the blank frame counting module is used for determining the speaking participants at the current moment according to a preset speaking sequence or a preset algorithm in the voice call process, acquiring the audio streams of the opposite ends corresponding to the speaking participants at the current moment, and continuously counting the blank frames of the voice streams of the opposite ends corresponding to the speaking participants at the current moment; the blank frame is that the corresponding frame does not input the corresponding voice information when the audio stream is generated or the voice information of the frame corresponding to the audio stream is lost due to the fault;

the counting value acquisition module is used for acquiring the current continuous blank frame counting value if the current voice frame is a target voice frame, and carrying out continuous blank frame counting on the voice frames after the current voice frame again, wherein the target voice frame is a first frame non-blank frame after the blank frame or an Nth frame non-blank frame after the blank frame, and N is a positive integer;

the soundless anomaly determination module is used for determining that the corresponding audio stream in the voice call has soundless anomaly if the voice acquisition equipment of the opposite terminal is in an on state; after determining that the corresponding audio stream in the voice call has silent abnormality, obtaining corresponding abnormality information, wherein the abnormality information comprises related information of the corresponding audio stream in the voice call and related information of a client corresponding to a near end.

12. An electronic device comprising a memory and a processor;

the memory stores a computer program;

the processor for executing the computer program to implement the method of any one of claims 1 to 10.

13. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method of any of claims 1 to 10.