
WO2023050921A1 - Video and audio data sending method, display method, sending end and receiving end - Google Patents

Video and audio data sending method, display method, sending end and receiving end

Info

Publication number
WO2023050921A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
data
audio data
audio
sending
Prior art date
Application number
PCT/CN2022/100589
Other languages
French (fr)
Chinese (zh)
Inventor
刘志龙
石挺干
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2023050921A1

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00 - Arrangements for monitoring or testing data switching networks
    • H04L 43/08 - Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L 65/00 - Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L 65/60 - Network streaming of media packets
    • H04L 65/80 - Responding to QoS
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/70 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/442 - Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N 21/60 - Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N 21/63 - Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N 21/647 - Control signaling between network components and server or clients; Network processes for video distribution between server and clients, e.g. controlling the quality of the video stream, by dropping packets, protecting content from unauthorised alteration within the network, monitoring of network load, bridging between two different networks, e.g. between IP and wireless

Definitions

  • The embodiments of the present application relate to the field of data transmission, and in particular to a method for sending video and audio data, a display method, a sending end, a receiving end, an electronic device, and a storage medium.
  • An embodiment of the present application provides a method for sending video and audio data, applied to a sending end, including: encoding collected audio data and video data; and, according to a real-time detection result of network quality, sending the video and audio data using the video and audio sending strategy corresponding to the real-time detection result. The real-time detection result includes at least a first quality level and a second quality level, and the network quality of the second quality level is lower than that of the first quality level. The video and audio sending strategy corresponding to the second quality level includes sending the encoded audio data and part of the encoded video data to the receiving end, where the partial video data includes at least some key frames and the motion information of untransmitted video frames.
  • An embodiment of the present application also provides a method for displaying video and audio data, applied to a receiving end, including: receiving encoded data; decoding the received encoded data; when the decoded data includes audio data and partial video data, reconstructing the untransmitted video frames according to their motion information to obtain reconstructed video frames, where the partial video data includes at least some key frames and the motion information of the untransmitted video frames; and rendering and displaying the reconstructed video frames, the key frames in the decoded data, and the audio data.
  • An embodiment of the present application also provides a sending end, including: an encoding module configured to encode collected audio data and video data; and a sending module configured to send the video and audio data, according to a real-time detection result of network quality, using the video and audio sending strategy corresponding to the real-time detection result. The real-time detection result includes at least a first quality level and a second quality level, and the network quality of the second quality level is lower than that of the first quality level. The video and audio sending strategy corresponding to the second quality level includes sending the encoded audio data and part of the encoded video data to the receiving end, where the partial video data includes at least some key frames and the motion information of untransmitted video frames.
  • An embodiment of the present application also provides a receiving end, including: a receiving module configured to receive encoded data; a decoding module configured to decode the received encoded data and, when the decoded data includes audio data and partial video data, reconstruct the untransmitted video frames according to their motion information to obtain reconstructed video frames, where the partial video data includes at least some key frames and the motion information of the untransmitted video frames; and a display module configured to render and display the reconstructed video frames, the key frames in the decoded data, and the audio data when the decoded data includes audio data and partial video data.
  • An embodiment of the present application also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor. The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the above method for sending video and audio data or the above method for displaying video and audio data.
  • An embodiment of the present application also provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the above method for sending video and audio data or the above method for displaying video and audio data is implemented.
  • FIG. 1 is a flowchart of a method for sending video and audio data according to an embodiment of the present application;
  • FIG. 2 is a flowchart of a method for displaying video and audio data according to another embodiment of the present application;
  • FIG. 3 is a schematic structural diagram of a sending end according to another embodiment of the present application;
  • FIG. 4 is a schematic structural diagram of a receiving end according to another embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of an electronic device according to another embodiment of the present application.
  • An embodiment of the present application relates to a method for sending video and audio data, applied to a sending end.
  • The method includes: encoding collected audio data and video data; and, according to a real-time detection result of network quality, sending the video and audio data using the video and audio sending strategy corresponding to the real-time detection result. The real-time detection result includes at least a first quality level and a second quality level, and the network quality of the second quality level is lower than that of the first quality level. The video and audio sending strategy corresponding to the second quality level includes sending the encoded audio data together with the motion information of the key frames and non-key frames in the encoded video data to the receiving end.
  • Application scenarios of the embodiments of the present application may include, but are not limited to, video conferencing, video chat, and intelligent customer service.
  • Step 101: The sending end collects video and audio data and encodes them.
  • The collection devices include, but are not limited to, a camera and a microphone.
  • Step 102: According to the real-time detection result of network quality, judge whether the current network quality is at the first quality level; if so, go to step 103; if not, go to step 104.
  • Before sending the collected video and audio data, the sending end needs to judge which quality level the current network quality is at, and then sends the video and audio data using the video and audio sending strategy corresponding to the real-time detection result.
  • When detecting network quality in real time, the referenced information consists of quality indicators such as packet loss rate, delay, jitter, false alarm rate, round-trip time (RTT) and bandwidth, or any combination of these indicators. When the network is at different quality levels, the corresponding sending strategies differ accordingly.
  • When the network quality is at the first quality level, the current network status is normal and can carry the normal, simultaneous transmission of audio data and video data.
  • When the current network quality is at the first quality level, go to step 103: send the encoded audio data and the encoded video data to the receiving end; the sending is completed and the process ends.
  • That is, the video and audio sending strategy corresponding to the first quality level is adopted, namely, the encoded audio data and the encoded video data are sent to the receiving end. The first quality level means that the current network supports the simultaneous transmission of normally encoded audio data and video data, which is the most ideal network state in this embodiment of the application.
  • Step 104: Judge whether the current network is at the second quality level; if so, go to step 105; if not, go to step 106.
  • When judging the quality level, the referenced information is the same as in step 102 and is not repeated here.
  • When the current network quality is at the second quality level, the video and audio sending strategy corresponding to the second quality level is adopted. The second quality level means that the current network does not support the simultaneous transmission of audio data and complete video data, but can support the normal transmission of audio data and partial video data.
  • In one example, after it is determined that the current network is at the second quality level, the key points and Jacobian matrices of the video frames can be extracted according to a fixed quantization step, and the extracted key points and Jacobian matrices are used as the motion information of the video frames. This is because, during data transmission, the amount of motion data used to characterize the key features of a video frame is much smaller than the amount of data in a conventional video frame; transmitting motion information that represents the key features of the video can therefore greatly reduce the demand on network bandwidth.
  • Step 105: Send the encoded audio data and part of the encoded video data to the receiving end; the sending is completed and the process ends.
  • The partial video data includes at least some key frames and the motion information of untransmitted video frames.
  • The motion information of a video frame is used to reconstruct that video frame, so that the receiving end can render and display according to the reconstructed video frames, the transmitted key frames, and the audio data.
  • In one example, the sending end may select some key frames as the transmitted video frames; that is, the transmitted video data includes the selected key frames together with the motion information of the unselected key frames and of the non-key frames. For example, the first, fifth, tenth, ..., Nth key frames are selected as the transmitted video frames, and the key frames not transmitted in between, as well as the non-key frames, are all reconstructed at the receiving end from their motion information alone.
  • In another example, the sending end may use all key frames as the transmitted video frames; that is, the transmitted video data includes all key frames and the motion information of the non-key frames.
  • The non-key frames are reconstructed at the receiving end according to their motion information. Whether some key frames or all key frames are selected as the transmitted video frames can be decided according to the image quality requirements of the service.
  • A key frame refers to the frame in which a key action of a moving or changing character or object occurs; the number of key frames is not limited in the embodiments of the present application.
  • In one example, while encoding the audio data, the sending end extracts the motion information of the non-key frames of a game picture: it first selects one reference game frame for transmission and then extracts the key points and Jacobian matrices of the non-reference game frames according to a fixed quantization step, which are used to represent the motion information of these non-reference game frames.
  • In the above process, the amount of data in the motion information representing the key features of the video is much smaller than the amount of data in the true frames of the optimized picture in conventional techniques, thereby reducing the requirements on the network.
  • After decoding, when the obtained data includes audio data and partial video data, the receiving end reconstructs the untransmitted video frames according to their motion information to obtain reconstructed video frames, where the partial video data includes at least some key frames and the motion information of the untransmitted video frames, and then renders and displays the reconstructed video frames, the transmitted key frames, and the audio data.
  • When it is determined in step 104 that the current network is not at the second quality level, the current network quality is at the third quality level; go to step 106: send the encoded audio data to the receiving end; the sending is completed and the process ends.
  • When the network quality is at the third quality level, the video and audio sending strategy corresponding to the third quality level is adopted, that is, only the encoded audio data is sent to the receiving end.
  • The third quality level means that the current network cannot carry the simultaneous transmission of audio data and video data and can only carry the normal transmission of audio data.
  • In the method for sending video and audio data proposed in the embodiments of the present application, the sending end encodes the collected audio data and video data and, according to the real-time detection result of network quality, adopts the corresponding video and audio sending strategy to send the video and audio data.
  • When the network quality is weak, key frames and non-key frames are determined in the encoded video data, and the encoded audio data together with the motion information of the key frames and non-key frames in the encoded video data is sent to the receiving end.
  • The real-time detection result includes at least a first quality level and a second quality level, and the video and audio sending strategy corresponding to the second quality level is to send the encoded audio data together with the motion information of the key frames and non-key frames in the encoded video data to the receiving end.
  • Because the amount of data in the motion information characterizing a video frame is much smaller than that of conventional video frame data, the requirement on network bandwidth can be greatly reduced; the above process therefore solves the problem that, in weak-network or network-switching environments, unsatisfactory transmission of video and audio data leads to frozen, blurred, or interrupted pictures and thus to a poor user experience.
  • The amount of data transmitted during transmission is greatly reduced, ensuring that the data required by the receiving end can still be delivered in different network environments.
  • Another embodiment of the present application relates to a method for displaying video and audio data, applied to a receiving end.
  • The implementation details of the video and audio display method of this embodiment are described in detail below with reference to FIG. 2.
  • The following content gives implementation details only for ease of understanding and is not required for implementing the solution.
  • Application scenarios of the embodiments of the present application may include, but are not limited to, video conferencing, video chat, and intelligent customer service.
  • Step 201: Receive encoded data.
  • The encoded data refers to encoded audio data and/or encoded video data.
  • Step 202: After decoding the received encoded data, judge whether the obtained data includes only audio data; if so, go to step 206; if not, go to step 203.
  • Step 203: Further judge whether the decoded data includes audio data and partial video data. If the decoded data includes audio data and partial video data, go to step 204; if the decoded data includes audio data and complete video data, go to step 208.
  • Step 204: Reconstruct the untransmitted video frames according to their motion information to obtain reconstructed video frames.
  • That is, each untransmitted video frame is reconstructed from its motion information for subsequent rendering and display.
  • Step 205: Render and display according to the reconstructed video frames, the key frames in the decoded data, and the audio data, and end the process.
  • In one example, the decoded data includes the audio data together with the key frames and the motion information of non-key frames of the video data; the receiving end reconstructs the non-key frames according to the motion information of the key frames and non-key frames to obtain reconstructed non-key frames, and then renders and displays the reconstructed non-key frames, the key frames, and the audio data.
  • If it is determined in step 202 that the obtained data includes only audio data, go to step 206: drive a virtual human model to generate dynamic video frames of a virtual human whose actions change with the audio data, and go to step 207.
  • When the decoded data includes only audio data, the receiving end drives the virtual human model with the audio data, thereby generating dynamic video frames of a virtual human whose actions change with the audio data.
  • Step 207: Render and display the above virtual human dynamic video frames and the audio data, and end the process.
  • The above virtual human model is a human model of a different role preset in a database; when driven by the audio data, it generates the virtual human dynamic video frames.
  • If there is no preset human model in the database, dynamic video frames are generated from the previous image frame along with the audio data, which avoids frozen pictures when the network environment deteriorates and improves the user experience.
  • If it is determined in step 203 that the decoded data includes audio data and complete video data, go to step 208: render and display the decoded audio data and video data.
  • The method for displaying video and audio data includes: when the decoded data includes only audio data, using the audio data to drive the virtual human model to generate dynamic video frames of a virtual human whose actions change with the audio data, and rendering and displaying the virtual human dynamic video frames and the audio data; when the decoded data includes audio data and partial video data, first reconstructing the untransmitted video frames and then rendering and displaying according to the reconstructed video frames, the transmitted key frames, and the audio data.
  • The above process enables the receiving end, in scenarios such as a weak network environment or network switching, to ensure on the basis of the decoded data that the displayed video and audio pictures are not frozen, blurred, or directly interrupted, which greatly improves the user experience.
  • Another embodiment of the present application also relates to a sending end, as shown in FIG. 3, including an encoding module 301 and a sending module 302.
  • The encoding module 301 is configured to encode the collected audio data and video data; the sending module 302 is configured to send the video and audio data, according to the real-time detection result of network quality, using the video and audio sending strategy corresponding to the real-time detection result.
  • The real-time detection result includes at least a first quality level and a second quality level, and the network quality of the second quality level is lower than that of the first quality level. The video and audio sending strategy corresponding to the second quality level includes sending the encoded audio data and part of the encoded video data to the receiving end, where the partial video data includes at least some key frames and the motion information of untransmitted video frames.
  • The video and audio sending strategy corresponding to the first quality level includes sending the encoded audio data and the encoded video data to the receiving end.
  • The real-time detection result further includes a third quality level, and the video and audio sending strategy corresponding to the third quality level includes sending only the encoded audio data to the receiving end.
  • The encoding module 301 is further configured to extract the key points and Jacobian matrices of video frames according to a fixed quantization step and to use the extracted key points and Jacobian matrices as the motion information of the video frames.
  • The encoding module 301 is further configured to detect quality indicators of the transmission network in real time before the video and audio sending strategy corresponding to the real-time detection result is adopted, where the quality indicators include one or any combination of the following: packet loss rate, delay, jitter, false alarm rate, round-trip time (RTT), and bandwidth.
  • The encoding module 301 mainly performs functions such as video and audio encoding and motion information extraction, and consists of a video and audio encoding sub-module and a feature extraction sub-module. The video and audio encoding sub-module is the same as the encoding module in a conventional video and audio system and is mainly responsible for encoding and compressing the original video and audio data and outputting media data for network transmission.
  • The feature extraction sub-module is mainly responsible for extracting the motion information that characterizes the key features of the video.
  • First, a reference image frame is selected for transmission, and then the key points and Jacobian matrices of the untransmitted video frames are extracted according to a fixed quantization step; these are used to represent the motion information of the untransmitted video frames. Because the amount of data in the motion information representing the key features of the video is much smaller than that of conventional video frame data, the requirement on network bandwidth can be greatly reduced.
  • The sending end in this embodiment adopts different sending strategies under different network environments: it encodes the collected audio data and video data and, according to the real-time detection result of network quality, adopts the corresponding strategy to send the video and audio data, which greatly reduces the amount of data transmitted during transmission and ensures that the data required by the receiving end can still be delivered in scenarios such as a weak network environment or network switching.
  • Another embodiment of the present application also relates to a receiving end, as shown in FIG. 4, including a receiving module 401, a decoding module 402, and a display module 403.
  • The receiving module 401 is configured to receive encoded data. The decoding module 402 is configured to decode the received encoded data and, when the decoded data includes audio data and partial video data, reconstruct the untransmitted video frames according to their motion information to obtain reconstructed video frames, where the partial video data includes at least some key frames and the motion information of the untransmitted video frames. The display module 403 is configured to render and display the reconstructed video frames, the key frames in the decoded data, and the audio data when the decoded data includes audio data and partial video data.
  • The decoding module 402 is further configured to drive a virtual human model with the audio data when the decoded data includes only audio data, generating dynamic video frames of a virtual human whose actions change with the audio data; the display module 403 is further configured to render and display the virtual human dynamic video frames and the audio data when the decoded data includes only audio data.
  • The decoding module 402 mainly performs functions such as decoding video and audio and reconstructing video frames from the motion information that represents the key features of the video. It can consist of a video and audio decoding sub-module, an audio-driven video sub-module, and a video reconstruction sub-module.
  • The video and audio decoding sub-module is the same as the decoding module in a conventional video and audio system and is mainly responsible for decoding the video and audio media data.
  • The audio-driven video sub-module is responsible for driving the virtual human model with the transmitted audio information, so that the virtual human model generates dynamic video pictures of the virtual human that follow the changes in the audio data.
  • The video reconstruction sub-module is mainly responsible for using the motion information that characterizes the key features of the video to drive the previously transmitted video frames and reconstruct them into moving video pictures, thereby realizing the reconstruction and restoration of the video.
  • When the decoded data includes only audio data, the receiving end uses the audio data to drive the virtual human model to generate dynamic video frames of a virtual human whose actions change with the audio data, and renders and displays the virtual human dynamic video frames and the audio data; when the decoded data includes the audio data and the key frames and motion information of non-key frames of the video data, the non-key frames are first reconstructed, and rendering and display are then performed according to the reconstructed non-key frames, the key frames, and the audio data.
  • The above process enables the receiving end, in scenarios such as a weak network environment or network switching, to ensure on the basis of the decoded data that the displayed video and audio pictures are not frozen, blurred, or directly interrupted, which greatly improves the user experience.
  • This embodiment is an apparatus embodiment corresponding to the above method embodiment and can be implemented in cooperation with the above method embodiment. The relevant technical details and technical effects mentioned in the above embodiment remain valid in this embodiment and are not repeated here to reduce repetition; the relevant technical details mentioned in this embodiment can also be applied in the above embodiment.
  • The modules involved in this embodiment and the previous embodiment are logical modules. A logical unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units. Units that are not closely related to solving the technical problem proposed in the present application are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
  • Another embodiment of the present application also relates to an electronic device, as shown in FIG. 5, including: at least one processor 501; and a memory 502 communicatively connected to the at least one processor 501. The memory 502 stores instructions executable by the at least one processor 501, and the instructions are executed by the at least one processor 501 so that the at least one processor 501 can perform the above method for sending video and audio data or the above method for displaying video and audio data.
  • The memory and the processor are connected by a bus. The bus may include any number of interconnected buses and bridges, and the bus connects one or more processors and various circuits of the memory together.
  • The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and therefore are not further described herein.
  • A bus interface provides an interface between the bus and a transceiver.
  • The transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a means for communicating with various other devices over a transmission medium.
  • The data processed by the processor is transmitted on the wireless medium through an antenna; further, the antenna also receives data and transfers the data to the processor.
  • The processor is responsible for managing the bus and general processing, and can also provide functions including timing, peripheral interfacing, voltage regulation, power management, and other control functions, while the memory can be used to store data that the processor uses when performing operations.
  • Another embodiment of the present application also relates to a computer-readable storage medium storing a computer program.
  • When the computer program is executed by a processor, the above method for sending video and audio data or the above method for displaying video and audio data is implemented.
  • The storage medium includes several instructions to cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage media include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Environmental & Geological Engineering (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiments of the present application relate to the field of data transmission. Disclosed is a video and audio sending method, which is applied to a sending end. The method comprises: encoding collected audio data and video data; and according to a real-time detection result of network quality, sending the video and audio data by using a video and audio sending policy corresponding to the real-time detection result, wherein the real-time detection result at least comprises a first quality level and a second quality level, the network quality of the second quality level is lower than that of the first quality level, and a video and audio sending policy corresponding to the second quality level comprises: sending the encoded audio data and part of the encoded video data to a receiving end, wherein the part of video data comprises at least some key frames and motion information of video frames which have not been transmitted.

Description

Video and Audio Data Sending Method, Display Method, Sending End and Receiving End
Cross-Reference
This application is based on the Chinese patent application with application number 202111165982.6, filed on September 30, 2021, and claims priority to that Chinese patent application, the entire content of which is incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of data transmission, and in particular to a method for sending video and audio data, a display method, a sending end, a receiving end, an electronic device, and a storage medium.
Background Art
With the continuous development of video and audio technology and mobile Internet technology, video and audio data transmission systems have been widely used in people's daily lives, and the audience of typical video and audio systems such as video conferencing and video chat keeps growing. At the same time, a problem has emerged: the many weak-network environments in existing networks cannot always meet the demand for simultaneous transmission of video and audio data. For example, in elevators, underground garages, on public transportation, on high-speed trains, in crowded places, and in scenarios such as switching between fifth-generation and fourth-generation mobile communication networks, video and audio pictures often freeze, blur, or are directly interrupted, resulting in a poor user experience.
Summary of the Invention
An embodiment of the present application provides a method for sending video and audio data, applied to a sending end, including: encoding collected audio data and video data; and, according to a real-time detection result of network quality, sending the video and audio data using the video and audio sending strategy corresponding to the real-time detection result. The real-time detection result includes at least a first quality level and a second quality level, and the network quality of the second quality level is lower than that of the first quality level. The video and audio sending strategy corresponding to the second quality level includes sending the encoded audio data and part of the encoded video data to the receiving end, where the partial video data includes at least some key frames and the motion information of untransmitted video frames.
An embodiment of the present application also provides a method for displaying video and audio data, applied to a receiving end, including: receiving encoded data; decoding the received encoded data; when the decoded data includes audio data and partial video data, reconstructing the untransmitted video frames according to their motion information to obtain reconstructed video frames, where the partial video data includes at least some key frames and the motion information of the untransmitted video frames; and rendering and displaying the reconstructed video frames, the key frames in the decoded data, and the audio data.
An embodiment of the present application also provides a sending end, including: an encoding module configured to encode collected audio data and video data; and a sending module configured to send the video and audio data, according to a real-time detection result of network quality, using the video and audio sending strategy corresponding to the real-time detection result. The real-time detection result includes at least a first quality level and a second quality level, and the network quality of the second quality level is lower than that of the first quality level. The video and audio sending strategy corresponding to the second quality level includes sending the encoded audio data and part of the encoded video data to the receiving end, where the partial video data includes at least some key frames and the motion information of untransmitted video frames.
An embodiment of the present application also provides a receiving end, including: a receiving module configured to receive encoded data; a decoding module configured to decode the received encoded data and, when the decoded data includes audio data and partial video data, reconstruct the untransmitted video frames according to their motion information to obtain reconstructed video frames, where the partial video data includes at least some key frames and the motion information of the untransmitted video frames; and a display module configured to render and display the reconstructed video frames, the key frames in the decoded data, and the audio data when the decoded data includes audio data and partial video data.
An embodiment of the present application also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor. The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the above method for sending video and audio data or the above method for displaying video and audio data.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the above method for sending video and audio data or the above method for displaying video and audio data is implemented.
Brief Description of the Drawings
FIG. 1 is a flowchart of a method for sending video and audio data according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for displaying video and audio data according to another embodiment of the present application;
FIG. 3 is a schematic structural diagram of a sending end according to another embodiment of the present application;
FIG. 4 is a schematic structural diagram of a receiving end according to another embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device according to another embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the embodiments of the present application are described in detail below with reference to the accompanying drawings. However, those of ordinary skill in the art can understand that many technical details are provided in the embodiments so that readers can better understand the present application; even without these technical details and the various changes and modifications based on the following embodiments, the technical solutions claimed in the present application can still be realized. The division into the following embodiments is for convenience of description and should not constitute any limitation on the specific implementation of the present application; the embodiments can be combined with, and refer to, one another on the premise that they do not contradict each other.
An embodiment of the present application relates to a method for sending video and audio data, applied to a sending end. The method includes: encoding collected audio data and video data; and, according to a real-time detection result of network quality, sending the video and audio data using the video and audio sending strategy corresponding to the real-time detection result. The real-time detection result includes at least a first quality level and a second quality level, and the network quality of the second quality level is lower than that of the first quality level. The video and audio sending strategy corresponding to the second quality level includes sending the encoded audio data together with the motion information of the key frames and non-key frames in the encoded video data to the receiving end.
The implementation details of the video and audio sending method of this embodiment are described in detail below with reference to FIG. 1. The following content gives implementation details only for ease of understanding and is not required for implementing the solution.
Application scenarios of the embodiments of the present application may include, but are not limited to, video conferencing, video chat, and intelligent customer service.
Step 101: The sending end collects video and audio data and encodes them.
Specifically, after the sending end collects the audio data and video data, it encodes them. The collection devices include, but are not limited to, a camera and a microphone.
Step 102: According to the real-time detection result of network quality, judge whether the current network quality is at the first quality level; if so, go to step 103; if not, go to step 104.
Specifically, before sending the collected video and audio data, the sending end needs to judge which quality level the current network quality is at, and then sends the video and audio data using the video and audio sending strategy corresponding to the real-time detection result. When detecting network quality in real time, the referenced information consists of quality indicators such as packet loss rate, delay, jitter, false alarm rate, round-trip time (RTT) and bandwidth, or any combination of these indicators. When the network is at different quality levels, the corresponding sending strategies differ accordingly.
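For illustration only, the following Python sketch shows one way such a quality-level judgment could be made from a few of the listed indicators; the threshold values, class names, and metric fields are assumptions chosen for demonstration and are not specified by the present application.

```python
from dataclasses import dataclass
from enum import Enum


class QualityLevel(Enum):
    FIRST = 1   # network can carry audio plus fully encoded video
    SECOND = 2  # network can carry audio plus partial video (key frames + motion info)
    THIRD = 3   # network can carry audio only


@dataclass
class NetworkMetrics:
    packet_loss: float   # fraction of packets lost, e.g. 0.02 means 2%
    rtt_ms: float        # round-trip time in milliseconds
    jitter_ms: float     # jitter in milliseconds
    bandwidth_kbps: float


def classify_network(m: NetworkMetrics) -> QualityLevel:
    """Map real-time quality indicators to one of the three quality levels.

    The thresholds below are illustrative assumptions, not values taken from
    the patent; a real deployment would tune them per service.
    """
    if m.packet_loss < 0.02 and m.rtt_ms < 100 and m.bandwidth_kbps > 2000:
        return QualityLevel.FIRST
    if m.packet_loss < 0.10 and m.rtt_ms < 400 and m.bandwidth_kbps > 300:
        return QualityLevel.SECOND
    return QualityLevel.THIRD
```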
When the above network quality level is at the first quality level, the current network status is normal and can carry the normal, simultaneous transmission of audio data and video data.
When the current network quality level is at the first quality level, go to step 103: send the encoded audio data and the encoded video data to the receiving end; the sending is completed and the process ends.
Specifically, when the current network quality level is at the first quality level, the video and audio sending strategy corresponding to the first quality level is adopted, namely, the encoded audio data and the encoded video data are sent to the receiving end. The first quality level means that the current network supports the simultaneous transmission of normally encoded audio data and video data, which is the most ideal network state in this embodiment of the application.
Step 104: Judge whether the current network is at the second quality level; if so, go to step 105; if not, go to step 106.
Specifically, when judging which quality level the network is at, the referenced information is the same as in step 102 and is not repeated here. When the current network quality is at the second quality level, the video and audio sending strategy corresponding to the second quality level is adopted. The second quality level means that the current network does not support the simultaneous transmission of audio data and complete video data, but can support the normal transmission of audio data and partial video data.
In one example, after it is determined that the current network is at the second quality level, the key points and Jacobian matrices of the video frames can be extracted according to a fixed quantization step, and the extracted key points and Jacobian matrices are used as the motion information of the video frames. This is because, during data transmission, the amount of motion data used to characterize the key features of a video frame is much smaller than the amount of data in a conventional video frame; transmitting motion information that represents the key features of the video can therefore greatly reduce the demand on network bandwidth.
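As a hedged sketch of what extracting key points and Jacobian matrices with a fixed quantization step could look like, the snippet below leaves the keypoint detector as a hypothetical callable and spells out only the fixed-step quantization of the extracted values; the step value and data shapes are assumptions for illustration.

```python
import numpy as np

QUANT_STEP = 1.0 / 64.0  # fixed quantization step; the value is an assumption


def quantize(values: np.ndarray, step: float = QUANT_STEP) -> np.ndarray:
    """Quantize floating-point motion data onto a fixed grid of spacing `step`."""
    return np.round(np.asarray(values) / step).astype(np.int16)


def extract_motion_info(frame: np.ndarray, keypoint_model) -> dict:
    """Describe one frame by quantized key points and per-keypoint Jacobians.

    `keypoint_model` is a hypothetical detector (for example, a keypoint-based
    animation network) returning K keypoint coordinates of shape (K, 2) and
    K local Jacobians of shape (K, 2, 2).
    """
    keypoints, jacobians = keypoint_model(frame)
    return {
        "keypoints": quantize(keypoints),
        "jacobians": quantize(jacobians),
    }
```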
Step 105: Send the encoded audio data and part of the encoded video data to the receiving end; the sending is completed and the process ends. The partial video data includes at least some key frames and the motion information of untransmitted video frames. The motion information of a video frame is used to reconstruct that video frame, so that the receiving end can render and display according to the reconstructed video frames, the transmitted key frames, and the audio data.
In one example, the sending end may select some key frames as the transmitted video frames; that is, the transmitted video data includes the selected key frames together with the motion information of the unselected key frames and of the non-key frames. For example, the first, fifth, tenth, ..., Nth key frames are selected as the transmitted video frames, and the key frames not transmitted in between, as well as the non-key frames, are all reconstructed at the receiving end from their motion information alone.
In another example, the sending end may also use all key frames as the transmitted video frames; that is, the transmitted video data includes all key frames and the motion information of the non-key frames. The non-key frames are reconstructed at the receiving end according to their motion information. Whether some key frames or all key frames are selected as the transmitted video frames can be decided according to the image quality requirements of the service.
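A minimal sketch of how a sender might assemble this second-level payload is given below: every Nth key frame is kept as an encoded frame and every other frame is replaced by its motion information. The payload layout, the `(frame, is_key_frame)` container, and the helper callables are assumptions made only for illustration.

```python
def build_partial_video_payload(frames, extract_motion_info, encode_frame,
                                keyframe_stride: int = 5):
    """Assemble the video data sent under the second quality level.

    `frames` is an iterable of (frame, is_key_frame) pairs. Every
    `keyframe_stride`-th key frame is sent fully encoded; all other frames
    (skipped key frames and non-key frames) are sent only as motion
    information, to be reconstructed at the receiving end.
    """
    payload = []
    key_frame_index = 0
    for frame, is_key_frame in frames:
        if is_key_frame and key_frame_index % keyframe_stride == 0:
            payload.append({"type": "key_frame", "data": encode_frame(frame)})
        else:
            payload.append({"type": "motion", "data": extract_motion_info(frame)})
        if is_key_frame:
            key_frame_index += 1
    return payload
```

Setting `keyframe_stride` to 1 corresponds to the variant in which all key frames are transmitted and only non-key frames are replaced by motion information.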
The key frame refers to the frame in which a key action of a moving or changing character or object occurs; the number of key frames is not limited in the embodiments of the present application.
In one example, while encoding the audio data, the sending end extracts the motion information of the non-key frames of a game picture: it first selects one reference game frame for transmission and then extracts the key points and Jacobian matrices of the non-reference game frames according to a fixed quantization step, which are used to represent the motion information of these non-reference game frames. In the above process, the amount of data in the motion information representing the key features of the video is much smaller than the amount of data in the true frames of the optimized picture in conventional techniques, thereby reducing the requirements on the network. After decoding, when the obtained data includes audio data and partial video data, the receiving end reconstructs the untransmitted video frames according to their motion information to obtain reconstructed video frames, where the partial video data includes at least some key frames and the motion information of the untransmitted video frames, and then renders and displays the reconstructed video frames, the transmitted key frames, and the audio data.
When it is determined in step 104 that the current network is not at the second quality level, the current network quality is at the third quality level; go to step 106: send the encoded audio data to the receiving end; the sending is completed and the process ends.
When the network quality is at the third quality level, the video and audio sending strategy corresponding to the third quality level is adopted, that is, only the encoded audio data is sent to the receiving end. The third quality level means that the current network cannot carry the simultaneous transmission of audio data and video data and can only carry the normal transmission of audio data.
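Tying the three quality levels together, the following sketch dispatches to the sending strategy matching the detected level; it is illustrative only, the channel and encoder interfaces are hypothetical, and the helpers are passed in as parameters so that the logic stands on its own.

```python
def send_media(level: int, audio_frames, video_frames, channel,
               encode_audio, encode_frame, extract_motion_info,
               build_partial_video_payload):
    """Send audio and video per quality level (1, 2 or 3, as in steps 103/105/106)."""
    # Audio is sent under every quality level.
    channel.send({"audio": encode_audio(audio_frames)})

    if level == 1:
        # First quality level: audio plus fully encoded video.
        channel.send({"video": [encode_frame(f) for f, _ in video_frames]})
    elif level == 2:
        # Second quality level: audio plus key frames and motion information.
        channel.send({"video": build_partial_video_payload(
            video_frames, extract_motion_info, encode_frame)})
    # Third quality level: only the audio payload is sent, so no video branch.
```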
In the method for sending video and audio data proposed in this embodiment of the present application, the sending end encodes the collected audio data and video data and, according to the real-time detection result of network quality, adopts the corresponding video and audio sending strategy to send the video and audio data. When the network quality is weak, key frames and non-key frames are determined in the encoded video data, and the encoded audio data together with the motion information of the key frames and non-key frames in the encoded video data is sent to the receiving end. The real-time detection result includes at least a first quality level and a second quality level, and the video and audio sending strategy corresponding to the second quality level is to send the encoded audio data together with the motion information of the key frames and non-key frames in the encoded video data to the receiving end. Because the amount of data in the motion information characterizing a video frame is much smaller than that of conventional video frame data, the requirement on network bandwidth can be greatly reduced; the above process therefore solves the problem that, in weak-network or network-switching environments, unsatisfactory transmission of video and audio data leads to frozen, blurred, or interrupted pictures and thus to a poor user experience. The amount of data transmitted during transmission is greatly reduced, ensuring that the data required by the receiving end can still be delivered in different network environments.
Another embodiment of the present application relates to a method for displaying video and audio data, applied to a receiving end. The implementation details of the video and audio display method of this embodiment are described in detail below with reference to FIG. 2. The following content gives implementation details only for ease of understanding and is not required for implementing the solution.
Application scenarios of the embodiments of the present application may include, but are not limited to, video conferencing, video chat, and intelligent customer service.
Step 201: Receive encoded data.
Specifically, the encoded data refers to encoded audio data and/or encoded video data.
Step 202: after decoding the received encoded data, determine whether the obtained data includes only audio data; if so, proceed to step 206; otherwise, proceed to step 203.
After the received encoded data is decoded, it is necessary to determine whether the obtained data includes only audio data or also includes video data in addition to the audio data; depending on the result, different processing is applied to the relevant data before it is rendered and displayed.
Step 203: further determine whether the decoded data includes audio data and partial video data. If the decoded data includes audio data and partial video data, proceed to step 204; if it includes audio data and complete video data, proceed to step 208.
Step 204: reconstruct the untransmitted video frames according to the motion information of the untransmitted video frames to obtain reconstructed video frames.
Specifically, if the decoded data includes, in addition to the audio data, motion information from the video data, the video frames are reconstructed according to the motion information of the video frames for subsequent rendering and display.
Step 205: render and display according to the reconstructed video frames, the key frames in the decoded data and the audio data, and end the process.
Specifically, the decoded data includes the audio data as well as the key frames and the motion information of the non-key frames from the video data.
In one example, the decoded data includes the audio data as well as the key frames and the motion information of the non-key frames from the video data. The receiving end reconstructs the non-key frames according to the key frames and the motion information of the non-key frames to obtain reconstructed non-key frames, and then renders and displays the reconstructed non-key frames, the key frames and the audio data.
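The application does not prescribe a particular reconstruction algorithm. The following sketch shows one plausible way a receiving end could warp a transmitted key frame towards an untransmitted frame using per-key-point Jacobians (in the spirit of first-order motion models); the nearest-key-point assignment, the array shapes and the local affine formula are assumptions made here for illustration.

```python
import numpy as np

def reconstruct_frame(key_frame, kp_key, jac_key, kp_cur, jac_cur):
    """Warp a transmitted key frame towards an untransmitted frame.

    key_frame : (h, w, 3) image that was actually transmitted
    kp_*      : (K, 2) key-point coordinates (x, y)
    jac_*     : (K, 2, 2) local Jacobian matrices around each key point
    Returns a crude reconstruction; a real decoder would use a learned
    dense-motion network instead of this nearest-key-point warp.
    """
    h, w, _ = key_frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([xs, ys], axis=-1).astype(np.float32)            # (h, w, 2)

    # Assign每 pixel to the closest key point of the *current* (untransmitted) frame.
    dists = np.linalg.norm(grid[:, :, None, :] - kp_cur[None, None, :, :], axis=-1)
    nearest = np.argmin(dists, axis=-1)                              # (h, w)

    # Local affine motion: x_key ~ kp_key + J_key @ inv(J_cur) @ (x_cur - kp_cur)
    out = np.zeros_like(key_frame)
    for k in range(kp_cur.shape[0]):
        mask = nearest == k
        A = jac_key[k] @ np.linalg.inv(jac_cur[k])
        src = (grid[mask] - kp_cur[k]) @ A.T + kp_key[k]
        src_x = np.clip(src[:, 0], 0, w - 1).astype(int)
        src_y = np.clip(src[:, 1], 0, h - 1).astype(int)
        out[mask] = key_frame[src_y, src_x]
    return out
```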
If it is determined in step 202 that the obtained data includes only audio data, proceed to step 206: drive the virtual human model to generate dynamic virtual human video frames whose actions change with the audio data, and then proceed to step 207.
When the decoded data includes only audio data, the receiving end drives the virtual human model according to the audio data, thereby generating dynamic virtual human video frames whose actions change with the audio data.
Step 207: render and display the above dynamic virtual human video frames and the audio data, and end the process.
Specifically, the above virtual human model is one of the human models of different roles preset in a database; when driven by the audio data, it generates dynamic virtual human video frames. If there is no preset human model in the database, dynamic video frames are generated from the previous frame image along with the audio data. This avoids a frozen picture when the network environment deteriorates and improves the user experience.
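As a rough illustration of the fallback described in this paragraph, the snippet below shows one way the audio-only branch could be organised. The AvatarModel interface, the role-keyed database and the frame representation are placeholders invented for this sketch; the application does not define how the virtual human model is driven internally.

```python
from typing import Dict, List, Optional

class AvatarModel:
    """Placeholder for a preset virtual-human model (hypothetical interface)."""
    def animate(self, audio_chunk: bytes) -> List[bytes]:
        # A real model would map audio features (e.g. phonemes) to mouth/head motion.
        return [b"avatar-frame"]

def frames_for_audio_only(audio_chunk: bytes,
                          avatar_db: Dict[str, AvatarModel],
                          last_frame: Optional[bytes],
                          role: str = "default") -> List[bytes]:
    """Fallback when only audio was decoded (all names are illustrative).

    Prefer a preset avatar model for the given role; if none exists,
    keep showing the last decoded frame so playback never freezes to black.
    """
    model = avatar_db.get(role)
    if model is not None:
        return model.animate(audio_chunk)
    if last_frame is not None:
        return [last_frame]   # repeat the previous image alongside the audio
    return []
```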
If it is determined in step 203 that the decoded data includes audio data and complete video data, proceed to step 208: render and display the decoded audio data and video data.
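Putting steps 202 to 208 together, the receiver-side decision can be summarised by a small dispatcher. The field names used to probe the decoded payload are hypothetical; the sketch only mirrors the branching logic described above.

```python
from enum import Enum

class Branch(Enum):
    AUDIO_ONLY = "audio only -> drive the avatar model (steps 206-207)"
    PARTIAL_VIDEO = "audio + key frames + motion info -> reconstruct (steps 204-205)"
    FULL_VIDEO = "audio + complete video -> render directly (step 208)"

def classify_decoded(decoded: dict) -> Branch:
    """Mirror the step 202/203 decision on hypothetical payload field names."""
    if decoded.get("video"):                                   # complete video present
        return Branch.FULL_VIDEO
    if decoded.get("motion_info") or decoded.get("key_frames"):  # partial video present
        return Branch.PARTIAL_VIDEO
    return Branch.AUDIO_ONLY
```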
The method for displaying video and audio data provided in this embodiment includes: when the decoded data includes only audio data, driving the virtual human model with the audio data to generate dynamic virtual human video frames whose actions change with the audio data, and rendering and displaying these virtual human video frames together with the audio data; when the decoded data includes audio data and partial video data, first reconstructing the untransmitted video frames and then rendering and displaying according to the reconstructed video frames, the transmitted key frames and the audio data. This process enables the receiving end, even in a weak-network environment or during network switching, to guarantee from the decoded data that the displayed video and audio picture does not stutter, break up or simply cut out, which greatly improves the user experience.
Another embodiment of the present application relates to a sending end which, as shown in FIG. 3, includes an encoding module 301 and a sending module 302.
Specifically, the encoding module 301 is configured to encode the collected audio data and video data; the sending module 302 is configured to send the video and audio data according to the real-time detection result of network quality, using the video and audio sending strategy corresponding to the real-time detection result. The real-time detection result includes at least a first quality level and a second quality level, and the network quality of the second quality level is lower than that of the first quality level. The video and audio sending strategy corresponding to the second quality level includes sending the encoded audio data and encoded partial video data to the receiving end, where the partial video data includes at least some key frames and the motion information of the untransmitted video frames.
In one example, the video and audio sending strategy corresponding to the first quality level includes sending the encoded audio data and the encoded video data to the receiving end.
In one example, the real-time detection result further includes a third quality level, and the video and audio sending strategy corresponding to the third quality level includes sending only the encoded audio data to the receiving end.
In one example, the encoding module 301 is further configured to extract the key points and Jacobian matrices of the video frames according to a fixed quantization step, and to use the extracted key points and Jacobian matrices as the motion information of the video frames.
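The application names the fixed quantization step but not the extraction network that produces the key points and Jacobian matrices. Assuming such a network already exists, a minimal sketch of the fixed-step quantization of the resulting motion information might look as follows; the step value, array shapes and int16 packing are choices made only for this illustration.

```python
import numpy as np

def quantize_motion(key_points: np.ndarray,
                    jacobians: np.ndarray,
                    step: float = 1.0 / 512) -> bytes:
    """Quantize key points (K, 2) and Jacobians (K, 2, 2) with a fixed step.

    Only the fixed-step uniform quantizer is shown here; per frame this yields
    roughly 12*K bytes, far less than an encoded video frame.
    """
    flat = np.concatenate([key_points.ravel(), jacobians.ravel()])
    q = np.round(flat / step).astype(np.int16)   # fixed quantization step
    return q.tobytes()

def dequantize_motion(payload: bytes, num_kp: int, step: float = 1.0 / 512):
    """Inverse operation used by the receiving end before reconstruction."""
    q = np.frombuffer(payload, dtype=np.int16).astype(np.float32) * step
    key_points = q[: num_kp * 2].reshape(num_kp, 2)
    jacobians = q[num_kp * 2 :].reshape(num_kp, 2, 2)
    return key_points, jacobians
```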
In one example, the encoding module 301 is further configured to detect the quality indicators of the transmission network in real time before the video and audio sending strategy corresponding to the real-time detection result is adopted, where the quality indicators include one of the following or any combination thereof: packet loss rate, delay, jitter, false report rate, round-trip time (RTT) and bandwidth.
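To make the relationship between the listed indicators and the quality levels concrete, here is an illustrative classifier. The thresholds are invented for this sketch; the application does not specify how the indicators are combined into a level.

```python
from dataclasses import dataclass

@dataclass
class NetworkStats:
    loss_rate: float        # packet loss rate, 0..1
    rtt_ms: float           # round-trip time in milliseconds
    jitter_ms: float
    bandwidth_kbps: float

def classify_quality(stats: NetworkStats) -> int:
    """Map measured indicators to the three quality levels (thresholds are illustrative)."""
    if stats.loss_rate < 0.02 and stats.bandwidth_kbps > 1500 and stats.rtt_ms < 150:
        return 1   # first level: full audio + video
    if stats.loss_rate < 0.10 and stats.bandwidth_kbps > 200:
        return 2   # second level: audio + key frames + motion information
    return 3       # third level: audio only
```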
In one example, the encoding module 301 mainly performs video and audio encoding, motion information extraction and related functions, and consists of a video and audio encoding sub-module and a feature extraction sub-module. The video and audio encoding sub-module is the same as the encoding module in a traditional video and audio system; it is mainly responsible for encoding and compressing the raw video and audio data and outputting media data suitable for network transmission. The feature extraction sub-module is mainly responsible for extracting the motion information that characterizes the key features of the video: it first selects one reference frame for transmission and then extracts, with a fixed quantization step, the key points and Jacobian matrices of the untransmitted video frames, which are used to represent the motion information of those frames. Because the amount of data in the motion information that characterizes the key features of the video is far smaller than that of conventional video frame data, the requirement on network bandwidth can be greatly reduced.
The sending end in this embodiment applies different sending strategies in different network environments: it encodes the collected audio data and video data and, according to the real-time detection result of network quality, adopts the corresponding strategy to send the video and audio data. This greatly reduces the amount of data transmitted and ensures that the data needed by the receiving end can still be delivered in scenarios such as a weak-network environment or network switching.
Another embodiment of the present application relates to a receiving end which, as shown in FIG. 4, includes a receiving module 401, a decoding module 402 and a display module 403.
Specifically, the receiving module 401 is configured to receive encoded data; the decoding module 402 is configured to decode the received encoded data and, when the decoded data includes audio data and partial video data, reconstruct the untransmitted video frames according to the motion information of the untransmitted video frames to obtain reconstructed video frames, where the partial video data includes at least some key frames and the motion information of the untransmitted video frames; the display module 403 is configured to render and display the reconstructed video frames, the key frames in the decoded data and the audio data when the decoded data includes audio data and partial video data.
In one example, the decoding module 402 is further configured to, when the decoded data includes only audio data, drive the virtual human model according to the audio data to generate dynamic virtual human video frames whose actions change with the audio data; the display module 403 is further configured to render and display the dynamic virtual human video frames and the audio data in that case.
In one example, the decoding module 402 mainly performs video and audio decoding and the reconstruction of video frames from the motion information that characterizes the key features of the video. It may consist of a video and audio decoding sub-module, an audio-driven video sub-module and a video reconstruction sub-module. The video and audio decoding sub-module is the same as the decoding module in a traditional video and audio system and is mainly responsible for decoding the video and audio media data. The audio-driven video sub-module is responsible for driving the virtual human model with the received audio information so that the model produces dynamic virtual human video images that follow the changes in the audio data. The video reconstruction sub-module is mainly responsible for using the motion information that characterizes the key features of the video to drive the previously transmitted video frames and reconstruct them into a moving video picture, thereby restoring the video.
When the decoded data includes only audio data, the receiving end in this embodiment drives the virtual human model with the audio data to generate dynamic virtual human video frames whose actions change with the audio data, and renders and displays these video frames together with the audio data. When the decoded data includes the audio data as well as the key frames and the motion information of the non-key frames from the video data, it first reconstructs the non-key frames and then renders and displays according to the reconstructed non-key frames, the key frames and the audio data. This process enables the receiving end, even in a weak-network environment or during network switching, to guarantee from the decoded data that the displayed video and audio picture does not stutter, break up or simply cut out, which greatly improves the user experience.
It is easy to see that this embodiment is an apparatus embodiment corresponding to the above method embodiment, and that this embodiment can be implemented in cooperation with the above method embodiment. The relevant technical details and technical effects mentioned in the above embodiment remain valid in this embodiment and are not repeated here to avoid repetition. Correspondingly, the relevant technical details mentioned in this embodiment can also be applied in the above embodiment.
It is worth mentioning that the modules involved in this embodiment and the previous embodiment are all logical modules. In practical applications, a logical unit may be a physical unit, part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present application, units that are not closely related to solving the technical problem proposed by the present application are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
Another embodiment of the present application relates to an electronic device which, as shown in FIG. 5, includes at least one processor 501 and a memory 502 communicatively connected to the at least one processor 501. The memory 502 stores instructions executable by the at least one processor 501, and the instructions are executed by the at least one processor 501 so that the at least one processor 501 can perform the above method for sending video and audio data or the above method for displaying video and audio data.
The memory and the processor are connected by a bus. The bus may include any number of interconnected buses and bridges and links the various circuits of the one or more processors and the memory. The bus may also link various other circuits such as peripherals, voltage regulators and power management circuits, which are well known in the art and are therefore not described further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor is transmitted over a wireless medium via an antenna; the antenna also receives data and passes it to the processor.
The processor is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management and other control functions. The memory may be used to store data used by the processor when performing operations.
Another embodiment of the present application relates to a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the above method for sending video and audio data or the above method for displaying video and audio data is implemented.
That is, those skilled in the art can understand that all or some of the steps of the methods in the above embodiments can be completed by instructing the relevant hardware through a program. The program is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip or the like) or a processor to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
Those of ordinary skill in the art can understand that the above embodiments are specific examples for implementing the present application, and that in practical applications various changes in form and detail may be made to them without departing from the spirit and scope of the present application.

Claims (13)

  1. A method for sending video and audio data, applied to a sending end, comprising:
    encoding collected audio data and video data;
    sending video and audio data according to a real-time detection result of network quality, using a video and audio sending strategy corresponding to the real-time detection result;
    wherein the real-time detection result comprises at least a first quality level and a second quality level, and the network quality of the second quality level is lower than that of the first quality level;
    the video and audio sending strategy corresponding to the second quality level comprises: sending encoded audio data and encoded partial video data to the receiving end, the partial video data comprising at least some key frames and motion information of untransmitted video frames.
  2. The method for sending video and audio data according to claim 1, wherein the real-time detection result further comprises a third quality level, and the network quality of the third quality level is lower than that of the second quality level;
    the video and audio sending strategy corresponding to the third quality level comprises: sending only the encoded audio data to the receiving end.
  3. The method for sending video and audio data according to claim 1 or 2, wherein before the sending video and audio data according to a real-time detection result of network quality, using a video and audio sending strategy corresponding to the real-time detection result, the method further comprises:
    extracting key points and Jacobian matrices of the video frames according to a fixed quantization step, and using the extracted key points and Jacobian matrices as the motion information of the video frames.
  4. The method for sending video and audio data according to any one of claims 1 to 3, wherein
    the video and audio sending strategy corresponding to the first quality level comprises: sending the encoded audio data and the encoded video data to the receiving end.
  5. The method for sending video and audio data according to any one of claims 1 to 4, wherein before the sending video and audio data according to a real-time detection result of network quality, using a video and audio sending strategy corresponding to the real-time detection result, the method further comprises:
    detecting quality indicators of the transmission network in real time, wherein the quality indicators comprise one of the following or any combination thereof:
    packet loss rate, delay, jitter, false report rate, round-trip time (RTT), and bandwidth.
  6. A method for displaying video and audio data, applied to a receiving end, comprising:
    receiving encoded data;
    decoding the received encoded data;
    in a case where the decoded data comprises audio data and partial video data, reconstructing untransmitted video frames according to motion information of the untransmitted video frames to obtain reconstructed video frames, wherein the partial video data comprises at least some key frames and the motion information of the untransmitted video frames;
    rendering and displaying the reconstructed video frames, key frames in the decoded data, and the audio data.
  7. The method for displaying video and audio data according to claim 6, wherein after the decoding the received encoded data, the method further comprises:
    in a case where the decoded data comprises only audio data, driving a virtual human model according to the audio data to generate dynamic virtual human video frames whose actions change with the audio data;
    rendering and displaying the dynamic virtual human video frames and the audio data.
  8. A sending end, comprising:
    an encoding module, configured to encode collected audio data and video data;
    a sending module, configured to send video and audio data according to a real-time detection result of network quality, using a video and audio sending strategy corresponding to the real-time detection result;
    wherein the real-time detection result comprises at least a first quality level and a second quality level, and the network quality of the second quality level is lower than that of the first quality level;
    the video and audio sending strategy corresponding to the second quality level comprises: sending encoded audio data and encoded partial video data to the receiving end, the partial video data comprising at least some key frames and motion information of untransmitted video frames.
  9. The sending end according to claim 8, wherein the real-time detection result further comprises a third quality level, and the network quality of the third quality level is lower than that of the second quality level;
    the video and audio sending strategy corresponding to the third quality level comprises: sending only the encoded audio data to the receiving end.
  10. A receiving end, comprising:
    a receiving module, configured to receive encoded data;
    a decoding module, configured to decode the received encoded data and, in a case where the decoded data comprises audio data and partial video data, reconstruct untransmitted video frames according to motion information of the untransmitted video frames to obtain reconstructed video frames, wherein the partial video data comprises at least some key frames and the motion information of the untransmitted video frames;
    a display module, configured to render and display the reconstructed video frames, key frames in the decoded data, and the audio data in the case where the decoded data comprises audio data and partial video data.
  11. The receiving end according to claim 10, wherein
    the decoding module is further configured to, in a case where the decoded data comprises only audio data, drive a virtual human model according to the audio data to generate dynamic virtual human video frames whose actions change with the audio data;
    the display module is further configured to render and display the dynamic virtual human video frames and the audio data in the case where the decoded data comprises only audio data.
  12. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method for sending video and audio data according to any one of claims 1 to 5, or to perform the method for displaying video and audio data according to claim 6 or 7.
  13. A computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, the method for sending video and audio data according to any one of claims 1 to 5 is implemented, or the method for displaying video and audio data according to claim 6 or 7 is implemented.
PCT/CN2022/100589 2021-09-30 2022-06-22 Video and audio data sending method, display method, sending end and receiving end WO2023050921A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111165982.6A CN115914653A (en) 2021-09-30 2021-09-30 Video and audio data sending method, display method, sending end and receiving end
CN202111165982.6 2021-09-30

Publications (1)

Publication Number Publication Date
WO2023050921A1 (en) 2023-04-06

Family

ID=85733973

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100589 WO2023050921A1 (en) 2021-09-30 2022-06-22 Video and audio data sending method, display method, sending end and receiving end

Country Status (2)

Country Link
CN (1) CN115914653A (en)
WO (1) WO2023050921A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024160031A1 (en) * 2023-01-31 2024-08-08 华为技术有限公司 Digital human communication method and apparatus

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117278709B (en) * 2023-09-27 2025-09-12 北京京东电解智科技有限公司 Audio and video call adjustment method, adjustment device, AR device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101404769A (en) * 2008-09-26 2009-04-08 北大方正集团有限公司 Video encoding/decoding method, apparatus and system
CN105847182A (en) * 2016-04-18 2016-08-10 武汉烽火众智数字技术有限责任公司 Method and system thereof for preferentially transmitting audio in audio and video system
US20180160077A1 (en) * 2016-04-08 2018-06-07 Maxx Media Group, LLC System, Method and Software for Producing Virtual Three Dimensional Avatars that Actively Respond to Audio Signals While Appearing to Project Forward of or Above an Electronic Display
CN110177308A (en) * 2019-04-15 2019-08-27 广州虎牙信息科技有限公司 Mobile terminal and its audio-video frame losing method in record screen, computer storage medium
CN110225347A (en) * 2019-06-24 2019-09-10 北京大米科技有限公司 Method of transmitting video data, device, electronic equipment and storage medium
CN110248256A (en) * 2019-06-25 2019-09-17 腾讯科技(深圳)有限公司 Processing method and processing device, storage medium and the electronic device of data
CN113192162A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Method, system, device and storage medium for driving image by voice
CN113299312A (en) * 2021-05-21 2021-08-24 北京市商汤科技开发有限公司 Image generation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN115914653A (en) 2023-04-04

Legal Events

121  Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22874318; Country of ref document: EP; Kind code of ref document: A1)
NENP  Non-entry into the national phase (Ref country code: DE)
122  Ep: pct application non-entry in european phase (Ref document number: 22874318; Country of ref document: EP; Kind code of ref document: A1)