
CN119883006A - Virtual human interaction method, device, related equipment and computer program product

Info

Publication number: CN119883006A
Authority: CN (China)
Prior art keywords: virtual, virtual person, role, roles, video
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number: CN202510380799.XA
Other languages: Chinese (zh)
Inventors: 殷兵, 刘坤, 黄元举, 李珍松, 龙明康, 朱陈军, 朱晓庆
Current Assignee: iFlytek Co Ltd
Original Assignee: iFlytek Co Ltd
Application filed by iFlytek Co Ltd
Priority to CN202510380799.XA
Publication of CN119883006A


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a virtual human interaction method, device, related equipment and computer program product, relating to the technical field of artificial intelligence. The application supports the creation of two or more virtual person roles: in response to a virtual person role creation request, the corresponding roles are created according to the role identifiers carried in the request, and a video stream is generated for each role by its own virtual person engine. Two or more virtual person roles can thus be controlled simultaneously; the corresponding video frames in their video streams are merged to generate a target video stream, which is pushed to the client for playing. By combining two or more virtual person roles into one video stream, the application realizes on-screen interaction of multiple virtual persons, increases the number of interactable roles, enriches the diversity and expressiveness of virtual person presentation, makes virtual persons more vivid in complex scenes, and meets the personalized demands of users.

Description

Virtual human interaction method, device, related equipment and computer program product
Technical Field
The present application relates to the field of artificial intelligence, and more particularly, to a virtual human interaction method, apparatus, related device, and computer program product.
Background
An interactive virtual person is a digital character image created with computer graphics, artificial intelligence and other technologies, possessing highly realistic appearance and strong interactive capability. Interactive virtual persons are widely used in entertainment, games, business marketing, e-commerce, finance, education, medical care and other fields.
Existing interactive virtual person schemes can generally control only one virtual person role at a time. This limits diversity and flexibility in some scenes, and in particular the expressiveness in complex scenes, so the personalized requirements of different users and different scenes cannot be met.
Disclosure of Invention
In view of the above problems, the present application provides a virtual person interaction method, apparatus, related device and computer program product, so as to implement on-screen interaction of multiple virtual person roles, improve the diversity of virtual person interaction, and meet the personalized requirements of different users and different scenes. The specific scheme is as follows:
in a first aspect, a virtual human interaction method is provided, including:
acquiring a virtual person role creation request, wherein the request includes role identifiers of two or more virtual persons to be created;
creating the corresponding two or more virtual person roles according to the role identifiers, and generating a video stream for each virtual person role through the virtual person engine corresponding to that role;
merging corresponding video frames in the video streams of the two or more virtual person roles, the merged video frames forming a target video stream;
and pushing the target video stream to a client for playing.
In a possible design, in another implementation manner of the first aspect of the embodiment of the present application, the video stream of each virtual person role generated by its virtual person engine is stored to a streaming media service module;
the process of merging corresponding video frames in the video streams of the two or more virtual person roles, with the merged video frames forming a target video stream, then includes:
pulling the video streams of the two or more virtual person roles from the streaming media service module;
and merging corresponding video frames in the pulled video streams, taking the merged video frames as the target video stream, and storing the target video stream into the streaming media service module so that the client pulls the target video stream from the streaming media service module.
In one possible design, in another implementation manner of the first aspect of the embodiment of the present application, the process of generating, by the virtual person engine corresponding to each virtual person role, the video stream of that role includes:
determining the dialogue text of each virtual person role;
synthesizing the dialogue text of each virtual person role into speech, and driving the virtual person engine based on the speech to generate the video stream of the role.
In another implementation manner of the first aspect of the embodiment of the present application, the determining the dialogue text of each virtual person role includes:
acquiring script information input by a client, wherein the script information includes dialogue text configured for each virtual person role;
and extracting the dialogue text of each virtual person role from the script information.
In another implementation manner of the first aspect of the embodiment of the present application, the determining the dialogue text of each virtual person role includes:
acquiring interactive voice input by a client, and recognizing the interactive voice as interactive text;
and invoking a natural language processing model, determining, based on the interactive text, a target virtual person role that needs to respond, and generating response information of the target virtual person role as its dialogue text.
In another implementation manner of the first aspect of the embodiment of the present application, the determining the dialogue text of each virtual person role includes:
acquiring interactive text input by a client;
and invoking a natural language processing model, determining, based on the interactive text, a target virtual person role that needs to respond, and generating response information of the target virtual person role as its dialogue text.
In another implementation manner of the first aspect of the embodiment of the present application, the determining the dialogue text of each virtual person role includes:
acquiring multimodal interaction information input by a client, wherein the multimodal interaction information includes information of two or more modalities among voice, facial expression and gesture;
and invoking a multimodal processing model, determining, based on the multimodal interaction information, a target virtual person role that needs to respond, and generating response information of the target virtual person role as its dialogue text.
In another implementation manner of the first aspect of the embodiment of the present application, before generating the video stream of each virtual person role through the corresponding virtual person engine, the method further includes:
receiving video control information input by a client, wherein the video control information includes at least one of subtitle control information, virtual person position control information, background/foreground control information, interaction control information and rendering control information;
the process of generating the video stream of the virtual person role through the virtual person engine corresponding to each virtual person role then includes:
generating, through the virtual person engine corresponding to each virtual person role, the video stream of that role according to the video control information.
In another implementation manner of the first aspect of the embodiment of the present application, the process of creating the corresponding two or more virtual person roles according to the role identifiers includes:
searching, among the configured virtual person role resources, for the resource corresponding to each role identifier, and creating the corresponding two or more virtual person roles using the found resources;
or,
determining, according to the role identifiers, respective parameter information of the two or more virtual person roles to be created;
and invoking a virtual person generation engine to create the corresponding roles using the parameter information, obtaining the two or more created virtual person roles.
In a second aspect, a virtual human interaction device is provided, including:
a request acquisition unit, configured to acquire a virtual person role creation request, where the request includes role identifiers of two or more virtual persons to be created;
a virtual person creation unit, configured to create the corresponding two or more virtual person roles according to the role identifiers;
a virtual person video generation unit, configured to generate a video stream for each virtual person role through the virtual person engine corresponding to that role;
a virtual person merging processing unit, configured to merge corresponding video frames in the video streams of the two or more virtual person roles, the merged video frames forming a target video stream;
And the video pushing unit is used for pushing the target video stream to the client for playing.
In a third aspect, an electronic device is provided that includes a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the virtual human interaction method described in any one of the foregoing first aspects of the present application.
In a fourth aspect, there is provided a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the virtual human interaction method described in any of the preceding aspects of the application.
In a fifth aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the virtual human interaction method described in any of the preceding first aspects of the application.
By means of the above technical scheme, the application supports the creation of two or more virtual person roles: in response to a virtual person role creation request, the corresponding roles are created according to the role identifiers, and a video stream is generated for each role by its own virtual person engine. Two or more virtual person roles can be controlled simultaneously; the corresponding video frames in their video streams are merged to generate a target video stream, which is pushed to the client for playing. By combining two or more virtual person roles into one video stream, the scheme realizes on-screen interaction of multiple virtual persons, increases the number of interactable roles, enriches the diversity and expressiveness of virtual person presentation, makes virtual persons more vivid in complex scenes, and meets the personalized requirements of different users and different scenes.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a schematic diagram of a system architecture for implementing a virtual human interaction method according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a virtual human interaction method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of another virtual human interaction method according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of another virtual human interaction method according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of another virtual human interaction method according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of another virtual human interaction method according to an embodiment of the present application;
FIG. 7 is a schematic flow chart of another virtual human interaction method according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a virtual human interaction device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The virtual human interaction scheme of the application relates to the following technologies:
1. Real-time rendering techniques:
Three-dimensional modeling and texture mapping: the appearance and form of the virtual person model are created using three-dimensional modeling software, and realistic details such as skin, clothing and hair are added through texture mapping to improve the visual effect.
Skeletal animation and skinning: a skeletal system is created for the virtual person model and the motion relations between bones are defined; the model is bound to the skeleton using skinning, so that the model deforms as the skeleton moves.
Real-time rendering engine: a high-performance real-time rendering engine renders the virtual person efficiently in a real-time environment; lighting, shadow, reflection and other effects provided by the engine enhance the realism and immersion of the virtual person.
Performance optimization: measures such as model simplification and texture compression address the performance requirements of real-time rendering, and the profiling tools provided by the engine are used to locate bottlenecks for targeted optimization.
2. Interaction techniques:
Speech recognition and synthesis: integrated speech recognition lets the virtual person understand the user's voice commands, and speech synthesis converts the virtual person's responses into natural, fluent voice output.
Natural language processing: natural language processing technology enables the virtual person to understand and generate natural language, and a dialogue management system generates appropriate responses based on the user's intent and context information.
Facial expression capture and synthesis: facial expression changes of a real person are captured in real time and converted into the virtual person's facial expressions, enabling finer emotional expression.
Gesture recognition: integrated gesture recognition lets the virtual person recognize the user's gesture commands; through gesture interaction, the user can interact with the virtual person more intuitively, e.g. controlling the virtual person's actions or selecting menu items.
Multimodal interaction: combining voice, facial expression, gesture and other interaction modes yields a richer and more natural interaction experience; through multimodal interaction, the virtual person can understand the user's intent and emotional state more accurately, improving interaction accuracy and efficiency.
The virtual human interaction scheme of the application can be applied to a plurality of different fields such as entertainment, games, business marketing, electronic commerce, finance, education, medical treatment and the like. Some possible application scenarios are for example:
In an entertainment scene, interaction between multiple virtual idols, and between virtual idols and users, can be provided.
In an online education scene, a virtual lecturer and a virtual teaching assistant can be presented on the same screen and interact in real time to conduct online teaching, improving the online learning experience and sense of presence for students.
In a live-broadcast scene, two or more anchor virtual person roles can be provided at the broadcasting end, and the different anchor roles can interact with each other, e.g. in interactive games or joint product promotion.
Of course, the above only exemplifies some possible application scenarios of the virtual human interaction scheme of the present application, and in addition, the virtual human interaction scheme of the present application can be applied to other various scenarios.
The application provides a virtual human interaction method, which can be applied to a system architecture shown in fig. 1, wherein the system can comprise a client 100 and a server 200. The server 200 may include one or more servers.
Either the client 100 or the server 200 may independently execute the virtual human interaction method provided in the embodiment of the present application, or the client 100 and the server 200 may cooperate to execute it.
The client 100 in the embodiment of the present application may be a mobile phone, a tablet computer, a large teaching screen, a wearable device, a vehicle-mounted device, a conference terminal, an augmented reality (augmented reality, AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), or the like, which is not limited in any way.
The client 100 is provided with a display unit for displaying information input by the user or provided to the user, the various menus of the client 100, interactive interfaces, files, and/or playback of multimedia files. In the embodiment of the application, the display unit can display the interaction interfaces of the virtual human interaction method, the synthesized target video containing two or more virtual person roles, and so on.
The embodiment of the application provides a virtual human interaction method, described here as applied to a computer device, which may specifically be the server 200 in fig. 1. Referring to fig. 2, the virtual human interaction method specifically includes the following steps:
Step S100, acquiring a virtual person role creation request, wherein the request includes role identifiers of two or more virtual persons to be created.
Specifically, a user may initiate a virtual person role creation request through a client, requesting the creation of two or more virtual person roles. The request contains the role identifiers of the virtual persons to be created; a role identifier uniquely represents one virtual person role.
Step S110, creating the corresponding two or more virtual person roles according to the role identifiers, and generating the video stream of each virtual person role through its corresponding virtual person engine.
Specifically, after receiving the creation request, the server may create the corresponding virtual person roles according to the role identifiers carried in the request.
In an alternative embodiment, the server is configured with virtual person resource information corresponding to each role identifier. On receiving a creation request, the server queries the virtual person resources corresponding to the requested role identifiers and creates the corresponding roles from the found resources. When the request contains two or more role identifiers, the server creates two or more virtual person roles.
In another alternative embodiment, on receiving the creation request, the server reads each role identifier from it, determines the respective parameter information of the two or more virtual person roles to be created (such as appearance parameters, body parameters and action parameters), and then invokes a virtual person generation engine to create the corresponding roles from the parameter information, obtaining the finally created two or more virtual person roles.
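As an illustration of the two creation paths just described, the following Python sketch contrasts resource lookup with parameter-driven generation. All names here (AvatarRole, AVATAR_RESOURCES, derive_parameters, the engine interface) are assumptions for illustration; the patent does not specify an implementation.

```python
from dataclasses import dataclass

@dataclass
class AvatarRole:
    role_id: str
    resources: dict  # appearance/body/action assets or parameters

# Path 1: look up pre-configured resources for each requested role identifier.
AVATAR_RESOURCES = {
    "role_a": {"model": "a.glb", "voice": "female_01"},
    "role_b": {"model": "b.glb", "voice": "male_02"},
}

def create_from_resources(role_ids):
    # One role per identifier; two or more identifiers yield two or more roles.
    return [AvatarRole(rid, AVATAR_RESOURCES[rid]) for rid in role_ids]

# Path 2: derive parameter information from each role identifier and ask a
# generation engine to build the role.
def derive_parameters(role_id):
    # Placeholder parameter lookup keyed off the identifier (assumed).
    return {"appearance": "default", "body": "default", "actions": "default"}

def create_from_engine(role_ids, engine):
    return [engine.create(rid, derive_parameters(rid)) for rid in role_ids]
```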
After the two or more requested virtual person roles are created, a virtual person engine may be configured for each role, and the video stream of the role is generated by that engine.
The virtual person engine can drive the virtual person role based on speech, namely the speech to be broadcast by the role. The speech drives the role's spoken output, mouth shape, body movement and so on, yielding the generated video stream of the role.
The speech driving a virtual person role may come from user input in text or voice form, or, in a scene where the virtual person converses with the user, it may be the role's reply speech determined through natural language understanding technology.
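A minimal sketch of this speech-driven generation step, assuming a text-to-speech service with a synthesize() method and a per-role engine with a drive() method; both interfaces are illustrative, not the disclosed API.

```python
def generate_role_stream(dialog_text, voice_id, tts, engine):
    """Synthesize the role's dialogue text into speech, then let the role's
    engine animate mouth shape and body motion from that audio."""
    audio = tts.synthesize(dialog_text, voice=voice_id)  # speech to be broadcast
    return engine.drive(audio)  # assumed to yield timestamped video frames + audio
```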
Step S120, merging corresponding video frames in the video streams of the two or more virtual person roles, the merged video frames forming a target video stream.
Specifically, the previous step generates a separate video stream for each virtual person role, each containing the image of a single role. In this step, the video streams of the roles are merged so that two or more virtual persons can be played on the same screen. Video frames with the same timestamp in the two or more streams are combined; through image synthesis, each merged frame contains all of the roles at once. The merging of frames includes, but is not limited to, means such as layer superposition for the images and mixing for the audio.
Further, the merged video frames form the target video stream, i.e., a video stream containing two or more virtual person roles.
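A toy sketch of merging two same-timestamp frames and mixing their audio, using side-by-side stacking as one possible image-synthesis choice (layer superposition over a shared background is equally valid per the description above); frame shapes and sample formats are assumptions.

```python
import numpy as np

def merge_frames(frame_a: np.ndarray, frame_b: np.ndarray) -> np.ndarray:
    # Place the two roles side by side; crop to the smaller height so the
    # arrays stack cleanly.
    h = min(frame_a.shape[0], frame_b.shape[0])
    return np.hstack([frame_a[:h], frame_b[:h]])

def mix_audio(chunk_a: np.ndarray, chunk_b: np.ndarray) -> np.ndarray:
    # Naive mix: average the two tracks to avoid clipping.
    n = min(len(chunk_a), len(chunk_b))
    return 0.5 * chunk_a[:n] + 0.5 * chunk_b[:n]
```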
Step S130, pushing the target video stream to a client for playing.
Specifically, the server may actively push the target video stream to the client for playing, or may inform the client of the storage address of the target video stream so that the client actively pulls it. The user then sees, at the client, a video stream containing two or more virtual person roles.
The method provided by this embodiment supports the creation of two or more virtual person roles: in response to a creation request, the corresponding roles are created according to the role identifiers, and a video stream is generated for each role by its own virtual person engine. Two or more roles can be controlled simultaneously; the corresponding video frames in their streams are merged into a target video stream that is pushed to the client for playing. By combining two or more virtual person roles into one video stream, the scheme realizes on-screen interaction of multiple virtual persons, increases the number of interactable roles, enriches the diversity and expressiveness of virtual person presentation, makes virtual persons more vivid in complex scenes, and meets the personalized requirements of different users and different scenes.
In a possible implementation, WebRTC, WebSocket and similar protocols may be adopted for data transmission between the client and the server to realize real-time transmission, improving the timeliness of data transfer and thereby the interaction effect; a signalling sketch follows.
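The sketch below uses the Python websockets library for session signalling; the endpoint URL and message schema are assumptions, and in practice the media itself would typically travel over WebRTC rather than this channel.

```python
import asyncio
import json

import websockets  # pip install websockets

async def request_session(role_ids):
    # Ask the server to create the roles and return where to pull the
    # merged target stream (message format is hypothetical).
    async with websockets.connect("ws://example-server/session") as ws:
        await ws.send(json.dumps({"type": "create_roles", "role_ids": role_ids}))
        return json.loads(await ws.recv())  # e.g. {"stream_url": "..."}

if __name__ == "__main__":
    print(asyncio.run(request_session(["role_a", "role_b"])))
```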
Referring to fig. 3, another implementation flow of virtual human interaction is provided that supports the creation of two or more virtual person roles; fig. 3 illustrates the case of exactly 2 roles.
The client may initiate a multi-role session request containing the role identifiers of the 2 virtual persons to be started. On receiving the request, the server starts role 1 and role 2 (corresponding to the 2 role identifiers respectively). The virtual person engine corresponding to role 1 then generates the role-1 video stream, and the engine corresponding to role 2 generates the role-2 video stream.
In this embodiment, the video streams of the virtual person roles generated by the virtual person engines may be stored in a streaming media service module. The streaming media service module can be integrated in the server or be a separate device.
The generated role-1 and role-2 video streams are both stored in the streaming media service module. The server pulls the two (or more) virtual person video streams from the module (i.e., the role-1 and role-2 streams), merges the corresponding video frames in the pulled streams, and takes the merged frames as the target video stream (i.e., video stream 3), realizing multi-role merging. The generated target video stream is stored back into the streaming media service module.
In an optional example, the server may perform the pulling and merging steps through a configured virtual person post-processing engine to obtain the target video stream. The storage addresses of the role-1 and role-2 video streams in the streaming media service module may be pre-designated, so the post-processing engine can pull the two streams based on the 2 input pull addresses (i.e., those storage addresses).
The client can pull the target video stream for playing based on its storage address in the streaming media service module.
In this embodiment, the streaming media service module temporarily stores the per-role video streams for the post-processing engine to pull and merge, and stores the merged target video stream; on this basis the client can pull the target video stream and play it while downloading, without waiting for the whole video to finish. Meanwhile, the server performs different tasks through different processing engines: the per-role video streams are generated by different virtual person engines, and the merging is realized by the virtual person post-processing engine, which keeps the server's processing flow simple and clear.
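One concrete way to realize this pull, merge and push step is an ffmpeg child process that pulls the two per-role streams from the streaming media service module, stacks the video, mixes the audio, and pushes the target stream back; ffmpeg and the RTMP addresses here are assumptions, not the patent's stated tooling.

```python
import subprocess

def merge_role_streams(pull_url_1: str, pull_url_2: str, push_url: str):
    # hstack composes the two role videos side by side; amix mixes the audio.
    cmd = [
        "ffmpeg",
        "-i", pull_url_1,
        "-i", pull_url_2,
        "-filter_complex",
        "[0:v][1:v]hstack=inputs=2[v];[0:a][1:a]amix=inputs=2[a]",
        "-map", "[v]", "-map", "[a]",
        "-f", "flv", push_url,
    ]
    return subprocess.Popen(cmd)

proc = merge_role_streams(
    "rtmp://media/live/role1",   # storage address of the role-1 stream (assumed)
    "rtmp://media/live/role2",   # storage address of the role-2 stream (assumed)
    "rtmp://media/live/target",  # address from which the client pulls (assumed)
)
```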
Of course, fig. 3 above is only illustrated with 2 virtual person roles; the flow can be extended to scenes with more roles, which this embodiment does not repeat.
In some embodiments of the present application, the process in step S110 of generating the video stream of each virtual person role through its corresponding virtual person engine is further described.
The virtual person engine may generate the role's video stream driven by speech, where the speech corresponds to the dialogue text of the role, i.e., the text the role is to broadcast.
In an alternative example, the dialogue text of each virtual person role is determined first; the dialogue text is then synthesized into speech, and the virtual person engine is driven by the speech to generate the role's video stream.
The dialogue text of each virtual person role can be obtained in several ways, depending on the usage scenario.
In one example scenario, a user may control the dialogue text of the different virtual person roles in script form through the client. For example, in a virtual person live-broadcast scene, the user can send, via the client and in script form, the dialogue text to be broadcast by the different roles.
Referring to fig. 4, the process of determining the dialogue text of each virtual person role may include:
The server acquires script information input by the client, the script information including the dialogue text configured for each virtual person role. Virtual person role identifiers can be attached in the script information so that the dialogue texts of different roles are distinguished by their identifiers.
Further, the different dialogue texts in the script information can be marked with a broadcast order, which helps the server schedule the speech synthesis of the texts and the driving of the roles.
The server then extracts the dialogue text of each virtual person role from the script information according to the role identifiers, converts it into speech, and drives the different roles with that speech. The subsequent processing follows the foregoing description and is not repeated here.
The scheme of this embodiment thus supports the user configuring the dialogue texts of different virtual person roles in script form and controlling multiple roles to interact on the same screen; an illustrative script format is sketched below.
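The concrete schema below (field names, ordering key) is an assumption, since the patent only requires role identifiers, dialogue text and an optional broadcast order.

```python
# Hypothetical script information as it might arrive from the client.
script_info = [
    {"role_id": "lecturer", "order": 1, "text": "Welcome to today's lesson."},
    {"role_id": "assistant", "order": 2, "text": "Please open your textbooks."},
]

def extract_dialogs(script):
    """Group dialogue texts by role identifier, respecting broadcast order."""
    by_role = {}
    for item in sorted(script, key=lambda it: it["order"]):
        by_role.setdefault(item["role_id"], []).append(item["text"])
    return by_role

print(extract_dialogs(script_info))
# {'lecturer': ["Welcome to today's lesson."], 'assistant': ['Please open your textbooks.']}
```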
In another example scenario, a user may converse with multiple different virtual person roles through the client. For example, in a virtual person group-chat scene, the user can input dialogue content in voice or text form through the client; reply content is generated for the relevant virtual person roles and synthesized into the target video stream for playing.
Referring to fig. 5, the flow of a user chatting with multiple different virtual person roles in voice form is illustrated. The process of determining the dialogue text of each virtual person role may include:
The server acquires the interactive voice input by the client and recognizes it as interactive text.
A natural language processing model is then invoked to determine, based on the interactive text, the target virtual person role that needs to respond, and to generate the response information of the target role as its dialogue text.
The user may input interactive voice through the client; the interactive voice is dialogue speech initiated by the user toward one or more of the virtual person roles. Take a chat scene between a user and two virtual person roles A and B as an example. The user's interactive voice might be "The weather is great today. Where shall we go to play?", which may trigger both role A and role B to reply. Alternatively, the interactive voice might be "The weather is great today. A, where do you want to go to play?", which would trigger role A to reply alone.
After receiving the user's interactive voice, the server first recognizes it as interactive text, then invokes the natural language processing model configured on the server side, such as a large model, to determine the target virtual person role to respond in this round. The target role is any one or more of the available virtual person roles.
The server may further invoke the natural language processing model to generate the response information of the target role as its dialogue text.
In an optional example, the server may invoke a large model to determine, based on the dialogue history and the user's interactive text, the target role that needs to respond in this round together with its response information.
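A sketch of routing one user utterance to the role that should reply, assuming a generic chat-completion callable llm(prompt) -> str; the prompt wording and the JSON reply contract are assumptions rather than the disclosed model interface.

```python
import json

def route_and_reply(interactive_text, roles, history, llm):
    prompt = (
        "Available roles: " + ", ".join(roles) + "\n"
        "Dialogue history:\n" + "\n".join(history) + "\n"
        "User says: " + interactive_text + "\n"
        'Answer as JSON: {"target_role": "<role>", "response": "<reply text>"}'
    )
    result = json.loads(llm(prompt))
    # The response becomes the target role's dialogue text for speech
    # synthesis and engine driving.
    return result["target_role"], result["response"]
```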
Referring to fig. 6, the flow of a user chatting with multiple different virtual person roles in text form is illustrated. The process of determining the dialogue text of each virtual person role may include:
acquiring the interactive text input by the client;
and invoking a natural language processing model to determine, based on the interactive text, the target virtual person role that needs to respond, and to generate the response information of the target role as its dialogue text.
Compared with voice input, in this embodiment the user enters the interactive text directly in text form, so the speech-to-text step can be omitted; the natural language processing model is invoked to determine, directly from the input text, the target role that needs to respond and its response information, which serves as the target role's dialogue text.
The above embodiments cover the user entering dialogue content in voice or text form: the server determines the target virtual person role and its response through speech recognition and natural language understanding technology, synthesizes the response into speech through speech synthesis technology, and drives the target role with that speech to generate the video stream, allowing the user to choose the interaction form flexibly.
Referring to fig. 7, the flow of a user chatting with multiple different virtual person roles through multimodal input is illustrated. The process of determining the dialogue text of each virtual person role may include:
acquiring multimodal interaction information input by the client, the information covering two or more modalities among voice, facial expression and gesture;
and invoking a multimodal processing model to determine, based on the multimodal interaction information, the target virtual person role that needs to respond, and to generate the response information of the target role as its dialogue text.
Compared with the foregoing embodiments, this embodiment supports multimodal interaction, i.e., the user can enter input through two or more modalities such as voice, facial expression and gesture. For example, the client may capture both the user's voice and facial expression. The server supports processing this multimodal information: it can invoke a multimodal processing model to process the various input modalities and determine the target role that needs to respond and its response information. Supporting multimodal interaction yields a richer and more natural experience; the virtual person roles can understand the user's intent and emotional state more accurately, improving interaction accuracy and efficiency.
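A minimal sketch of fusing two or more modalities into one query for a multimodal processing model; the field names and the mm_model interface are assumptions.

```python
def route_multimodal(inputs: dict, roles: list, mm_model):
    query = {
        "roles": roles,
        "text": inputs.get("speech_text"),       # from speech recognition
        "expression": inputs.get("expression"),  # e.g. "smiling"
        "gesture": inputs.get("gesture"),        # e.g. "waving"
    }
    # Assumed to return the target role and its reply, which becomes the
    # target role's dialogue text.
    return mm_model.respond(query)
```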
In some embodiments of the present application, another implementation flow of the virtual human interaction method is provided, in which the user can control, through the client, the style of the generated virtual person video streams. Specifically, before step S110 generates the video stream of each virtual person role through its corresponding virtual person engine, the method may further include:
receiving video control information input by the client.
On this basis, the process of generating the video stream of each virtual person role through its corresponding virtual person engine includes:
generating, through the virtual person engine corresponding to each role, the video stream of that role according to the video control information.
The video control information may include at least one of subtitle control information, virtual person position control information, background/foreground control information, interaction control information and rendering control information. It describes and manages how the video content is presented. Specifically (a data-model sketch follows this list):
Subtitle control information: manages the display content, position, font, color, size and appearance time of subtitles. Examples: subtitle text content (e.g. dialogue text), subtitle style (font, color, background transparency), subtitle timeline (start time, end time).
Virtual person position control information: controls the position, size, rotation angle and actions of the virtual person in the video frame. Examples: coordinate position (X, Y, Z axes), action instructions (e.g. waving, nodding), expression changes (e.g. smiling, frowning).
Background control information: manages the content, color and dynamic effects of the video background. Examples: background image or video material, background dynamic effects (e.g. fade, blur, particle effects), background switching time points.
Foreground control information: manages how foreground elements (such as virtual persons, props and special effects) are displayed. Examples: transparency, size and position of foreground elements, foreground special effects (e.g. halo, shadow), interaction effects between foreground and background (e.g. occlusion relations).
Timeline control information: manages the appearance time, duration and synchronization of each element in the video. Examples: synchronization points between subtitles and virtual person actions, trigger times of background switches and foreground effects.
Interaction control information: manages the interaction behavior of the virtual person with the user or other elements. Examples: feedback actions after the user clicks the virtual person, dynamic adjustment of dialogue content based on user input.
Rendering control information: manages how the video is rendered, e.g. resolution, frame rate and lighting. Examples: video resolution (e.g. 1080p, 4K), lighting parameters (light source position, intensity), rendering engine configuration parameters.
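One possible data model for the video control information enumerated above; the field names and defaults are assumptions, since the patent lists only the categories.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SubtitleControl:
    text: str                 # e.g. the dialogue text
    font: str = "sans"
    color: str = "#FFFFFF"
    start_ms: int = 0
    end_ms: int = 0

@dataclass
class VideoControlInfo:
    subtitles: list = field(default_factory=list)  # SubtitleControl items
    avatar_position: Optional[dict] = None  # x/y/z coordinates, scale, rotation
    background: Optional[str] = None        # image or video asset identifier
    foreground: Optional[dict] = None       # transparency, effects, occlusion
    timeline: Optional[dict] = None         # sync points, trigger times
    interaction: Optional[dict] = None      # click feedback rules
    rendering: Optional[dict] = None        # resolution, frame rate, lighting
```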
In the method provided by this embodiment, the user can control, through the client, the synthesis control information of the virtual person video stream, such as subtitles, virtual person positions, and foreground/background information, improving the presentation of the synthesized stream.
In other possible implementations, after the dialogue text of the virtual person roles is determined, the server may automatically determine the relevant control information of the synthesized video stream based on a natural language processing model, and then generate, through the virtual person engine corresponding to each role, the role's video stream according to the determined video control information.
For example, the server may invoke the capabilities of a large model to generate matching video control information based on the dialogue text of the virtual person roles.
Example scenario:
Assume that a user converses with one of multiple virtual person roles, role A. The dialogue text is as follows:
User: "Can you take me to see a beautiful seaside sunset?"
Virtual person role A: "Of course, let's enjoy the beautiful sunset by the seaside!"
The server invokes a large model to analyze the dialogue text and identify key information such as "seaside" and "sunset". According to the identified keywords, video clips related to "seaside sunset" are retrieved from a preset video resource library; these clips may contain different seaside sunset scenes, and the one or more clips best matching the user's description are selected.
Foreground and background matching:
Within the selected video clips, the large model further distinguishes foreground from background information. The foreground may be dynamic elements such as pedestrians on the beach or animals on the waves; the background may be static or slowly changing elements such as the sky and the sea. The large model automatically adjusts the presentation of foreground and background according to the context of the dialogue text and the user's wishes.
Once the video control information such as the foreground and background information is determined, the server can invoke the virtual person engine to generate the role's video stream according to that control information.
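A hedged sketch of deriving background control information automatically from dialogue text with a large model; the prompt, tag format and asset library are all assumptions.

```python
def auto_background(dialog_text, llm, asset_library):
    # Ask the model for scene keywords, e.g. "seaside, sunset" (assumed prompt).
    keywords = llm("Extract scene keywords from: " + dialog_text).split(", ")
    # Pick the first library asset whose tags match any extracted keyword.
    for asset in asset_library:
        if any(k in asset["tags"] for k in keywords):
            return asset
    return None
```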
Example results:
Finally, a video containing the virtual-person-to-user dialogue scene is generated. In the video, the virtual person stands by the sea against a beautiful sunset; as the dialogue progresses, the virtual person's actions and expressions stay synchronized with the speech, giving the user an immersive feeling.
In the method provided by this embodiment, the user does not need to manually set the parameters of the video stream synthesis process; the server can invoke capabilities such as a large model to automatically match suitable video control information based on the dialogue text, realizing more intelligent virtual human interaction.
The virtual human interaction device provided by the embodiment of the application is described below, and the virtual human interaction device described below and the virtual human interaction method described above can be referred to correspondingly.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a virtual human interaction device according to an embodiment of the present application.
As shown in fig. 8, the apparatus may include:
a request acquisition unit 11, configured to acquire a virtual person role creation request, where the request includes role identifiers of two or more virtual persons to be created;
a virtual person creation unit 12, configured to create the corresponding two or more virtual person roles according to the role identifiers;
a virtual person video generation unit 13, configured to generate the video stream of each virtual person role through its corresponding virtual person engine;
a virtual person merging processing unit 14, configured to merge corresponding video frames in the video streams of the two or more virtual person roles, the merged video frames forming a target video stream;
and a video pushing unit 15, configured to push the target video stream to the client for playing.
In a possible implementation, the virtual person video generation unit stores the video streams generated by the virtual person engines to a streaming media service module, and the merging process of the virtual person merging processing unit includes:
pulling the video streams of the two or more virtual person roles from the streaming media service module;
and merging corresponding video frames in the pulled streams, taking the merged frames as the target video stream, and storing the target video stream into the streaming media service module so that the client pulls it from there.
In a possible implementation, the process by which the virtual person video generation unit generates the video stream of each virtual person role through its corresponding virtual person engine includes:
determining the dialogue text of each virtual person role;
synthesizing the dialogue text of each role into speech, and driving the virtual person engine based on the speech to generate the role's video stream.
In a possible implementation, the process by which the virtual person video generation unit determines the dialogue text of each virtual person role includes:
acquiring script information input by a client, the script information including the dialogue text configured for each virtual person role;
and extracting the dialogue text of each role from the script information.
In another possible implementation, the process by which the virtual person video generation unit determines the dialogue text of each virtual person role includes:
acquiring interactive voice input by a client, and recognizing the interactive voice as interactive text;
and invoking a natural language processing model, determining, based on the interactive text, the target virtual person role that needs to respond, and generating the response information of the target role as its dialogue text.
In still another possible implementation, the process by which the virtual person video generation unit determines the dialogue text of each virtual person role includes:
acquiring interactive text input by a client;
and invoking a natural language processing model, determining, based on the interactive text, the target virtual person role that needs to respond, and generating the response information of the target role as its dialogue text.
In still another possible implementation, the process by which the virtual person video generation unit determines the dialogue text of each virtual person role includes:
acquiring multimodal interaction information input by a client, the information covering two or more modalities among voice, facial expression and gesture;
and invoking a multimodal processing model, determining, based on the multimodal interaction information, the target virtual person role that needs to respond, and generating the response information of the target role as its dialogue text.
In a possible implementation, the apparatus of the present application may further include:
a video control information receiving unit, configured to receive, before the video stream of each virtual person role is generated through its corresponding virtual person engine, video control information input by a client, the video control information including at least one of subtitle control information, virtual person position control information, background/foreground control information, interaction control information and rendering control information. The virtual person video generation unit then generates, through the virtual person engine corresponding to each role, the role's video stream according to the video control information.
In a possible implementation, the process by which the virtual person creation unit creates the corresponding two or more virtual person roles according to the role identifiers includes:
searching, among the configured virtual person role resources, for the resource corresponding to each role identifier, and creating the corresponding two or more virtual person roles using the found resources;
or,
determining, according to the role identifiers, respective parameter information of the two or more virtual person roles to be created;
and invoking a virtual person generation engine to create the corresponding roles using the parameter information, obtaining the two or more created virtual person roles.
The embodiment of the application also provides electronic equipment. Referring to fig. 9, a schematic diagram of an electronic device suitable for use in implementing embodiments of the present application is shown. The electronic device in the embodiment of the application can include, but is not limited to, a fixed terminal such as a mobile phone, a tablet computer, a vehicle-mounted terminal, a large teaching screen, a wearable device and the like. The electronic device shown in fig. 9 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in fig. 9, the electronic device may include a processing means (e.g., a central processing unit, a graphic processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603, to implement the virtual human interaction method of the foregoing embodiment of the present application. In the state where the electronic device is powered on, various programs and data necessary for the operation of the electronic device are also stored in the RAM 603. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606, including for example touch screens, touch pads, keyboards, mice, cameras, microphones, accelerometers and gyroscopes; output devices 607, including for example liquid crystal displays (LCDs), speakers and vibrators; storage devices 608, including for example memory cards and hard disks; and communication devices 609. The communication devices 609 may allow the electronic device to communicate wirelessly or by wire with other devices to exchange data. While fig. 9 shows an electronic device having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided; more or fewer devices may be implemented or provided instead.
The embodiment of the present application further provides a computer program product, which includes computer-readable instructions that, when run on an electronic device, cause the electronic device to implement any of the virtual human interaction methods provided by the embodiments of the present application.
The embodiment of the present application further provides a computer-readable storage medium, which carries one or more computer programs that, when executed by an electronic device, cause the electronic device to implement any of the virtual human interaction methods provided by the embodiments of the present application.
It should be further noted that the above-described apparatus embodiments are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present application, the connection relationship between modules indicates that they have communication connections, which may be specifically implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software plus necessary general-purpose hardware, or, of course, by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components, and the like. Generally, any function performed by a computer program can also be implemented by corresponding hardware, and the specific hardware structure implementing the same function may vary: an analog circuit, a digital circuit, a dedicated circuit, and so on. In most cases, however, a software implementation is the preferred embodiment of the present application. Based on such an understanding, the technical solution of the present application, or the part contributing over the prior art, may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a training device, a network device, or the like) to perform the methods of the embodiments of the present application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a training device or a data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), etc.
In the present specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the embodiments may be combined as needed; for identical or similar parts, reference may be made between the embodiments.

Claims (13)

1. A virtual human interaction method, comprising:
acquiring a virtual person role creation request, wherein the request includes role identifiers of two or more virtual persons to be created;
creating two or more corresponding virtual person roles according to the role identifiers, and generating a video stream of each virtual person role through a virtual person engine corresponding to that virtual person role;
merging corresponding video frames in the video streams of the two or more virtual person roles, and taking the merged video frames as a target video stream;
and pushing the target video stream to a client for playing.
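Purely as a reader's aid (not claim language), the merging step of claim 1 can be pictured as compositing the i-th frames of all role streams into the i-th target frame; the side-by-side layout below is an assumption, since the claim does not fix a layout:

    import numpy as np

    def merge_corresponding_frames(streams):
        # streams: one frame sequence per virtual person role; frames with
        # the same index are "corresponding" and are merged into a single
        # target frame (here: simple horizontal concatenation).
        return [np.hstack(frames) for frames in zip(*streams)]

    # Example: two roles, three 2x2 RGB frames each.
    a = [np.zeros((2, 2, 3), np.uint8) for _ in range(3)]
    b = [np.full((2, 2, 3), 255, np.uint8) for _ in range(3)]
    target = merge_corresponding_frames([a, b])
    assert target[0].shape == (2, 4, 3)  # two 2x2 frames side by side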
2. The method of claim 1, wherein the video stream of each virtual person role generated by the corresponding virtual person engine is stored to a streaming media service module;
the process of merging corresponding video frames in the video streams of the two or more virtual person roles and taking the merged video frames as the target video stream includes:
pulling the video streams of the two or more virtual person roles from the streaming media service module;
and merging corresponding video frames in the pulled video streams of the two or more virtual person roles, taking the merged video frames as the target video stream, and storing the target video stream to the streaming media service module, so that the client pulls the target video stream from the streaming media service module.
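A toy, in-memory stand-in for the streaming media service module of claim 2, illustrating the store / pull / merge / store-again flow; a real deployment would use an actual media server (RTMP, WebRTC, etc.), which is an assumption beyond the claim text:

    class StreamingMediaService:
        # In-memory stand-in: maps a stream key to a list of frames.
        def __init__(self):
            self._streams = {}

        def push(self, key, frames):
            self._streams[key] = list(frames)

        def pull(self, key):
            return self._streams[key]

    def build_target_stream(service, role_keys, target_key, merge):
        # Pull each role's stream, merge corresponding frames, and store
        # the target stream so that the client can pull it.
        streams = [service.pull(key) for key in role_keys]
        merged = [merge(frames) for frames in zip(*streams)]
        service.push(target_key, merged)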
3. The method of claim 1, wherein generating the video stream of each virtual person role through the corresponding virtual person engine includes:
determining a dialogue text of each virtual person role;
and synthesizing the dialogue text of each virtual person role into speech, and driving the virtual person engine based on the speech to generate the video stream of that virtual person role.
4. The method according to claim 3, wherein determining the dialogue text of each virtual person role includes:
acquiring script information input by the client, wherein the script information includes a dialogue text configured for each virtual person role;
and extracting the dialogue text of each virtual person role from the script information.
5. The method according to claim 3, wherein determining the dialogue text of each virtual person role includes:
acquiring interactive voice input by the client, and recognizing the interactive voice as an interactive text;
and calling a natural language processing model, determining, based on the interactive text, a target virtual person role that is to respond, and generating response information of the target virtual person role as the dialogue text of the target virtual person role.
6. The method according to claim 3, wherein determining the dialogue text of each virtual person role includes:
acquiring an interactive text input by the client;
and calling a natural language processing model, determining, based on the interactive text, a target virtual person role that is to respond, and generating response information of the target virtual person role as the dialogue text of the target virtual person role.
7. The method according to claim 3, wherein determining the dialogue text of each virtual person role includes:
acquiring multi-modal interaction information input by the client, wherein the multi-modal interaction information includes information of two or more modalities among voice, facial expression, and gesture;
and calling a multi-modal processing model, determining, based on the multi-modal interaction information, a target virtual person role that is to respond, and generating response information of the target virtual person role as the dialogue text of the target virtual person role.
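Claims 5 to 7 share one routing pattern: a language (or multi-modal) model selects which virtual person role should respond and drafts that role's reply. A minimal sketch, assuming a chat-style llm() callable that returns JSON and a fixed role list (both assumptions, not claim language):

    import json

    ROLES = ["host", "expert"]

    def route_and_respond(interaction_text, llm):
        # Ask the model to pick the target role and draft its reply.
        prompt = (
            "Roles: " + ", ".join(ROLES) + "\n"
            "User input: " + interaction_text + "\n"
            'Answer as JSON: {"target_role": "...", "response": "..."}'
        )
        result = json.loads(llm(prompt))
        # The response text becomes the target role's dialogue text,
        # which in turn drives that role's engine (see claim 3).
        return result["target_role"], result["response"]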
8. The method of claim 1, further comprising, before generating the video stream of each virtual person role through the corresponding virtual person engine:
receiving video control information input by the client, wherein the video control information includes at least one of subtitle control information, virtual person position control information, background/foreground control information, interaction control information, and rendering control information;
wherein the process of generating the video stream of each virtual person role through the corresponding virtual person engine includes:
generating, through the virtual person engine corresponding to each virtual person role, the video stream of that virtual person role according to the video control information.
9. The method according to any one of claims 1 to 8, wherein the process of creating the two or more corresponding virtual person roles according to the role identifiers includes:
searching the configured virtual person role resources for the resource corresponding to each role identifier, and creating the two or more corresponding virtual person roles from the found resources;
or, alternatively,
determining, according to the role identifiers, the respective parameter information of the two or more virtual person roles to be created;
and calling a virtual person generation engine to create the corresponding virtual person roles from the parameter information, obtaining the two or more created virtual person roles.
10. A virtual human interaction device, comprising:
a request acquisition unit, configured to acquire a virtual person role creation request, wherein the request includes role identifiers of two or more virtual persons to be created;
a virtual person creation unit, configured to create two or more corresponding virtual person roles according to the role identifiers;
a virtual person video generating unit, configured to generate a video stream of each virtual person role through a virtual person engine corresponding to that virtual person role;
a virtual person merging processing unit, configured to merge corresponding video frames in the video streams of the two or more virtual person roles, and take the merged video frames as a target video stream;
and a video pushing unit, configured to push the target video stream to a client for playing.
11. An electronic device, characterized by comprising a memory and a processor;
the memory is used for storing programs;
The processor is configured to execute the program to implement the steps of the virtual human interaction method according to any one of claims 1 to 9.
12. A readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the virtual human interaction method according to any one of claims 1 to 9.
13. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the virtual human interaction method of any of claims 1 to 9.
CN202510380799.XA 2025-03-28 2025-03-28 Virtual human interaction method, device, related equipment and computer program product Pending CN119883006A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510380799.XA CN119883006A (en) 2025-03-28 2025-03-28 Virtual human interaction method, device, related equipment and computer program product

Publications (1)

Publication Number Publication Date
CN119883006A true CN119883006A (en) 2025-04-25

Family

ID=95430096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510380799.XA Pending CN119883006A (en) 2025-03-28 2025-03-28 Virtual human interaction method, device, related equipment and computer program product

Country Status (1)

Country Link
CN (1) CN119883006A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114168044A (en) * 2021-11-30 2022-03-11 完美世界(北京)软件科技发展有限公司 Interaction method and device for virtual scene, storage medium and electronic device
CN118154738A (en) * 2024-03-11 2024-06-07 北京字跳网络技术有限公司 Video generation method, device and computer readable storage medium
CN118312044A (en) * 2024-04-25 2024-07-09 科大讯飞股份有限公司 Interactive teaching method, device, related equipment and computer program product

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120567835A (en) * 2025-08-01 2025-08-29 联通沃音乐文化有限公司 Digital human interaction method and system in primary call process
CN120567835B (en) * 2025-08-01 2025-10-10 联通沃音乐文化有限公司 Digital human interaction method and system in primary call process

Similar Documents

Publication Publication Date Title
US12347012B2 (en) Sentiment-based interactive avatar system for sign language
CN113099298B (en) Method and device for changing virtual image and terminal equipment
JP2021192222A (en) Video image interactive method and apparatus, electronic device, computer readable storage medium, and computer program
CN112199016B (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
US20160110922A1 (en) Method and system for enhancing communication by using augmented reality
US20230215296A1 (en) Method, computing device, and non-transitory computer-readable recording medium to translate audio of video into sign language through avatar
CN113923462A (en) Video generation method, live broadcast processing method, video generation device, live broadcast processing device and readable medium
US20250299408A1 (en) Animation generation method and apparatus for avatar, electronic device, computer program product, and computer-readable storage medium
CN108846886B (en) AR expression generation method, client, terminal and storage medium
CN111080759A (en) Method and device for realizing split mirror effect and related product
CN113822972B (en) Video-based processing method, device and readable medium
CN112732152B (en) Live broadcast processing method and device, electronic equipment and storage medium
CN110418095A (en) Processing method, device, electronic device and storage medium of virtual scene
CN112035046B (en) List information display method, device, electronic equipment and storage medium
CN114979682B (en) Method and device for virtual live broadcasting of multicast
CN109600559B (en) Video special effect adding method and device, terminal equipment and storage medium
CN113824982A (en) Live broadcast method and device, computer equipment and storage medium
CN119883006A (en) Virtual human interaction method, device, related equipment and computer program product
CN115842936B (en) Multi-anchor live broadcast method and device
CN106203289A (en) A method, device and mobile terminal for acquiring target augmented reality content
CN119211468A (en) Playing method and device thereof
CN114425162B (en) A video processing method and related device
CN110413109A (en) Method, device, system, electronic device and storage medium for generating virtual content
CN116431001A (en) Method for realizing AI interaction in virtual space
CN116309970A (en) Method and device for generating virtual digital image for vehicle, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination